Add a repeatable performance benchmark harness for tracing hot paths #134

@AbhiPrasad

Description

Summary

We should add a durable benchmarking setup for the Python SDK before landing tracing performance changes like the ones proposed in PR #101.

Today we have strong functional coverage, but no formal performance harness on main. That makes it hard to:

  • validate that a performance-oriented refactor helps the intended hot path
  • separate large wins from noise
  • compare branch results against main
  • catch accidental regressions in shared helpers like merge_dicts()
  • evaluate behavior with and without optional perf dependencies like orjson

This issue proposes a benchmarking design that fits the repo's current nox and packaging setup.

Goals

  • Add repeatable microbenchmarks for tracing hot-path functions
  • Add one realistic end-to-end tracing benchmark
  • Make it easy to compare main vs a feature branch locally
  • Cover both minimal installs and the performance extra
  • Keep performance checks separate from normal functional test runs

Non-goals

  • Do not block CI on strict timing thresholds initially
  • Do not turn benchmark timings into flaky pytest assertions
  • Do not try to benchmark every SDK area at once

Proposed tooling

Use pyperf as the primary benchmark runner.

Rationale:

  • much better statistical discipline than ad hoc time.perf_counter() loops
  • supports warmups, calibration, repetition, and JSON output
  • gives us a clean compare_to workflow for branch vs branch
  • better fit for stable microbenchmarks than plain pytest timing

Keep a separate scenario-style benchmark for end-to-end tracing flows. That can still be implemented in Python, but it should be structured as a benchmark module rather than a one-off exploratory script.
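To make the pyperf choice concrete, here is a minimal sketch of what a case module could look like. The payload shape and the `copy.deepcopy` call are stand-ins for the SDK's real fixtures and `bt_safe_deep_copy()`; only the pyperf usage pattern is the point, and the benchmark name is illustrative.

```python
# Sketch of a pyperf case module (e.g. benchmarks/cases/bench_bt_json.py).
# pyperf handles warmup, calibration, repetition, and --output JSON itself.
import copy

try:
    import pyperf  # benchmark-only dependency; guarded so this sketch imports without it
except ImportError:
    pyperf = None

# Fixed, version-controlled payload shape so runs stay comparable across branches.
PAYLOAD = {"spans": [{"id": i, "meta": {"k": "v", "n": i}} for i in range(100)]}

def deep_copy_payload():
    # Stand-in for bt_safe_deep_copy() on a medium nested payload.
    return copy.deepcopy(PAYLOAD)

if __name__ == "__main__" and pyperf is not None:
    runner = pyperf.Runner()
    runner.bench_func("bt_json/deep_copy_medium", deep_copy_payload)
```

Running the module directly (with pyperf installed) produces the JSON results that `pyperf compare_to` consumes.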

Proposed layout

Under py/:

benchmarks/
  README.md
  conftest.py
  cases/
    bench_bt_json.py
    bench_logger.py
    bench_tracing_e2e.py
  fixtures.py

Notes:

  • bench_bt_json.py should focus on serialization and deep-copy hot paths
  • bench_logger.py should focus on span creation, split/sanitize paths, and internal-only logging
  • bench_tracing_e2e.py should model a realistic root-span + child-span + logging workload
  • fixtures.py should centralize representative payload builders so cases stay consistent
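A sketch of what `fixtures.py` could centralize, assuming the payload shapes and sizes below are illustrative placeholders to be tuned during implementation:

```python
# Sketch of benchmarks/fixtures.py: shared payload builders so every case
# benchmarks the same shapes. Depths, widths, and field names are illustrative.
def nested_payload(depth: int, width: int) -> dict:
    """Build a nested dict/list payload of a fixed, repeatable shape."""
    if depth == 0:
        return {"leaf": "x" * 16}
    return {f"key_{i}": [nested_payload(depth - 1, width)] for i in range(width)}

def circular_payload() -> dict:
    """Payload containing a circular reference, for bt_safe_deep_copy() cases."""
    d = {"name": "root"}
    d["self"] = d
    return d

def non_string_key_payload() -> dict:
    """Payload with non-string dict keys (ints, tuples)."""
    return {1: "one", (2, 3): "pair", "s": "str"}

# Small/medium/large variants referenced by the case modules.
SMALL = nested_payload(depth=2, width=2)
MEDIUM = nested_payload(depth=3, width=4)
LARGE = nested_payload(depth=4, width=6)
```

Keeping these in one module means a change to payload shape invalidates old baselines visibly, in one diff.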

What to benchmark

1. bt_json microbenchmarks

Cover:

  • _to_bt_safe() on primitives
  • _to_bt_safe() on str/int subclasses and enum-backed strings
  • _to_bt_safe() on dataclasses
  • _to_bt_safe() on pydantic-like objects
  • bt_safe_deep_copy() on small, medium, and large nested dict/list payloads
  • bt_safe_deep_copy() on payloads with circular references
  • bt_safe_deep_copy() on payloads with non-string dict keys

Why:

  • these are the core hot paths in the PR under discussion
  • they are easy to measure deterministically in isolation
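The `_to_bt_safe()` variants above each need representative inputs. A sketch of plausible input builders (these are illustrative shapes only; `_to_bt_safe()` itself lives in the SDK):

```python
# Representative inputs for the _to_bt_safe() variants listed above.
import enum
from dataclasses import dataclass

class SpanKind(str, enum.Enum):  # enum-backed string
    LLM = "llm"
    TOOL = "tool"

class TaggedStr(str):  # str subclass
    pass

@dataclass
class Usage:  # dataclass input
    prompt_tokens: int
    completion_tokens: int

PRIMITIVES = [None, True, 42, 3.14, "plain"]
SUBCLASSES = [TaggedStr("tagged"), SpanKind.LLM]
DATACLASSES = [Usage(prompt_tokens=512, completion_tokens=128)]
```

Each list would back one `bench_func` registration, so a dispatch change shows up per input class rather than as one blended number.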

2. logger microbenchmarks

Cover:

  • _validate_and_sanitize_experiment_log_partial_args() on empty vs populated events
  • _strip_nones() on shallow and nested dicts, with and without None
  • split_logging_data() for:
    • event only
    • internal_data only
    • both present
    • BraintrustStream present
  • SpanImpl creation with explicit name
  • SpanImpl creation without name
  • log_internal() with user event payloads
  • log_internal() with internal-only payloads like end() / set_attributes()

Why:

  • this isolates the logger-local optimizations from broader utility changes
  • it lets us quantify small changes like lazy caller-location lookup separately from bigger serialization wins
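For orientation, here is a plausible reference implementation of the `_strip_nones()` behavior described above, together with the shallow and nested inputs a case would use. The SDK's actual implementation may differ; this is only to make the benchmark inputs concrete.

```python
# Plausible reference for _strip_nones(): drop None-valued dict entries
# recursively (the SDK's real code may behave differently at the edges).
def strip_nones(value):
    if isinstance(value, dict):
        return {k: strip_nones(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nones(v) for v in value]
    return value

# Benchmark inputs: shallow vs nested, with and without None.
SHALLOW = {"a": 1, "b": None, "c": "x"}
NESTED = {"a": {"b": None, "c": {"d": None, "e": 2}}, "f": None}
```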

3. End-to-end tracing benchmark

Cover one realistic scenario:

  • create root span
  • add representative input / metadata
  • create child span
  • log output / metadata / metrics
  • end child and root spans
  • exercise background logger record construction without external I/O

Parameters:

  • run enough iterations to reduce noise
  • keep payload shapes fixed and version-controlled
  • report separate scenarios for:
    • medium payload
    • large payload
    • internal-only updates

Why:

  • microbenchmarks show where a win comes from
  • the e2e benchmark shows whether the win is real in the path users care about
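The scenario steps above could be skeletoned as follows. The span objects here are stubs that mirror the shape of the workload; the real case would drive the SDK's span API, and method names like `start_span`, `log`, and `end` are assumptions for illustration.

```python
# Skeleton of the e2e tracing scenario with stand-in span objects.
import contextlib

RECORDS = []  # stands in for background logger record construction (no external I/O)

class StubSpan:
    def __init__(self, name):
        self.name = name

    def log(self, **fields):
        RECORDS.append((self.name, fields))

    @contextlib.contextmanager
    def start_span(self, name):
        child = StubSpan(name)
        try:
            yield child
        finally:
            child.end()

    def end(self):
        RECORDS.append((self.name, {"__end": True}))

def tracing_scenario(payload):
    # root span -> representative input/metadata -> child span -> output/metrics -> end both
    root = StubSpan("root")
    root.log(input=payload["input"], metadata=payload["metadata"])
    with root.start_span("child") as child:
        child.log(output=payload["output"], metrics={"tokens": 128})
    root.end()
```

The benchmark would call `tracing_scenario` once per iteration with the fixed medium, large, and internal-only payloads.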

Benchmark environments

Run each benchmark suite in at least two environments:

A. Minimal environment

  • install the package plus benchmark deps
  • do not install optional perf extras

Purpose:

  • captures baseline behavior for default installs

B. Performance environment

  • install .[performance]

Purpose:

  • measures the effect of optional orjson
  • ensures future performance work does not accidentally optimize only one environment

If needed later, add a second Python minor version to confirm interpreter-sensitive behavior, but that should not be required for the initial rollout.
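To keep minimal and performance runs from being mixed up, each case could record which environment it ran in. A small sketch (`environment_metadata` is a hypothetical helper; `find_spec` avoids importing orjson just to detect it):

```python
# Sketch: environment metadata to attach to benchmark results so minimal vs
# .[performance] runs stay distinguishable.
import importlib.util
import sys

def environment_metadata() -> dict:
    return {
        "python": "%d.%d.%d" % sys.version_info[:3],
        "orjson_installed": importlib.util.find_spec("orjson") is not None,
    }
```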

Nox integration

Add dedicated nox sessions instead of folding this into test_core.

Suggested sessions:

  • perf_bt_json
  • perf_logger
  • perf_e2e
  • optional umbrella session: perf

Behavior:

  • install pyperf
  • install the package from source
  • optionally install .[performance] in dedicated sessions or via a flag
  • run benchmark modules directly
  • emit JSON results to a temp or ignored output path
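A hypothetical sketch of the noxfile additions, assuming the session and case-file names proposed in this issue; exact install steps (including how `.[performance]` is selected) would be settled during implementation. The `nox` import is guarded only so the sketch can be read without nox installed.

```python
try:
    import nox
except ImportError:
    nox = None

# Case scripts per session (paths from the proposed layout above).
PERF_CASES = {
    "perf_bt_json": "benchmarks/cases/bench_bt_json.py",
    "perf_logger": "benchmarks/cases/bench_logger.py",
    "perf_e2e": "benchmarks/cases/bench_tracing_e2e.py",
}

if nox is not None:
    @nox.session
    def perf_bt_json(session):
        session.install("pyperf")
        session.install(".")  # ".[performance]" could be selected via a flag
        # Forwarding posargs lets callers pass --output for pyperf JSON.
        session.run("python", PERF_CASES["perf_bt_json"], *session.posargs)

    @nox.session
    def perf_logger(session):
        session.install("pyperf")
        session.install(".")
        session.run("python", PERF_CASES["perf_logger"], *session.posargs)

    @nox.session
    def perf_e2e(session):
        session.install("pyperf")
        session.install(".")
        session.run("python", PERF_CASES["perf_e2e"], *session.posargs)

    @nox.session
    def perf(session):
        # Umbrella session: queue the three perf sessions.
        for name in PERF_CASES:
            session.notify(name)
```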

Example local workflows:

cd py
nox -s perf_bt_json
nox -s perf_logger
nox -s perf_e2e

Comparison workflow:

cd py
nox -s perf_bt_json -- --output /tmp/main-bt-json.json
nox -s perf_bt_json -- --output /tmp/branch-bt-json.json
pyperf compare_to /tmp/main-bt-json.json /tmp/branch-bt-json.json

We can refine exact command shape during implementation, but the key point is to make branch-vs-branch comparison first-class.

Local-first workflow

The initial implementation should optimize for local developer use only.

That means:

  • no CI integration yet
  • no nightly jobs
  • no hard performance thresholds

Instead, the benchmark harness should make it easy to:

  • establish a baseline on main
  • run the same benchmark on a feature branch
  • compare pyperf JSON outputs locally

The command surface should stay very small and obvious: the three perf_* nox sessions listed under Nox integration, plus pyperf compare_to on their JSON outputs, are the entire local workflow. Establishing a baseline, re-running on a branch, and comparing should never require more than the commands shown above.

Reporting expectations for performance PRs

Any PR that claims performance improvement should include:

  • benchmark command(s) used
  • environment details:
    • Python version
    • whether .[performance] was installed
  • before/after results for the affected cases
  • note on variance if results are noisy

Prefer reporting per-benchmark improvements rather than a single blended headline number.

Rollout plan

Phase 1

  • add benchmark directory structure
  • add shared benchmark fixtures
  • add bt_json microbenchmarks

Phase 2

  • add logger microbenchmarks
  • add end-to-end tracing benchmark

Phase 3

  • add dedicated nox sessions
  • document local benchmark workflow in py/README.md or benchmark README

Acceptance criteria

  • There is a documented benchmark workflow on main
  • We can benchmark bt_json hot paths in isolation
  • We can benchmark logger hot paths in isolation
  • We can run one realistic tracing end-to-end benchmark
  • We can compare JSON results across branches with pyperf compare_to
  • Normal functional CI remains separate from benchmark execution

Why this should happen before PR #101-style perf work

PR #101 combines several categories of optimizations:

  • serialization dispatch changes
  • deep-copy specialization
  • logger fast paths
  • shared helper changes

Without a benchmark harness, it is too easy to:

  • merge broad refactors based on one aggregate number
  • overestimate wins from noisy runs
  • miss regressions in shared helpers
  • lose the ability to review each incremental optimization on its own merit

This harness should make those changes measurable and allow them to land more incrementally, while staying easy to run locally before we decide whether any CI integration is worthwhile.
