Add a repeatable performance benchmark harness for tracing hot paths
Summary
We should add a durable benchmarking setup for the Python SDK before landing tracing performance changes like the ones proposed in PR #101.
Today we have strong functional coverage, but no formal performance harness on main. That makes it hard to:
- validate that a performance-oriented refactor helps the intended hot path
- separate large wins from noise
- compare branch results against `main`
- catch accidental regressions in shared helpers like `merge_dicts()`
- evaluate behavior with and without optional perf dependencies like `orjson`
This issue proposes a benchmarking design that fits the repo's current nox and packaging setup.
Goals
- Add repeatable microbenchmarks for tracing hot-path functions
- Add one realistic end-to-end tracing benchmark
- Make it easy to compare `main` vs a feature branch locally
- Cover both minimal installs and the `performance` extra
- Keep performance checks separate from normal functional test runs
Non-goals
- Do not block CI on strict timing thresholds initially
- Do not turn benchmark timings into flaky `pytest` assertions
- Do not try to benchmark every SDK area at once
Proposed tooling
Use pyperf as the primary benchmark runner.
Rationale:
- much better statistical discipline than ad hoc `time.perf_counter()` loops
- supports warmups, calibration, repetition, and JSON output
- gives us a clean `compare_to` workflow for branch-vs-branch comparisons
- better fit for stable microbenchmarks than plain `pytest` timing
Keep a separate scenario-style benchmark for end-to-end tracing flows. That can still be implemented in Python, but it should be structured as a benchmark module rather than a one-off exploratory script.
Proposed layout
Under py/:
```
benchmarks/
  README.md
  conftest.py
  cases/
    bench_bt_json.py
    bench_logger.py
    bench_tracing_e2e.py
  fixtures.py
```
Notes:
- `bench_bt_json.py` should focus on serialization and deep-copy hot paths
- `bench_logger.py` should focus on span creation, split/sanitize paths, and internal-only logging
- `bench_tracing_e2e.py` should model a realistic root-span + child-span + logging workload
- `fixtures.py` should centralize representative payload builders so cases stay consistent
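A sketch of what `fixtures.py` could provide; the payload shapes and builder names here are illustrative, not a committed API:

```python
# Sketch of benchmarks/fixtures.py: versioned, repeatable payload builders.
def small_payload():
    return {"input": "hello", "metadata": {"model": "test-model"}}


def nested_payload(depth=5, width=4):
    # Build a dict/list tree with a fixed, repeatable shape so timings are
    # comparable across runs and branches.
    node = {"leaf": list(range(width))}
    for level in range(depth):
        node = {"level": level, "items": [dict(node) for _ in range(width)]}
    return node


def circular_payload():
    # Deliberate cycle, for the bt_safe_deep_copy circular-reference cases.
    payload = {"name": "root"}
    payload["self"] = payload
    return payload
```

Centralizing builders like these keeps every case module measuring the same shapes, which is what makes before/after numbers meaningful.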
What to benchmark
1. bt_json microbenchmarks
Cover:
- `_to_bt_safe()` on primitives
- `_to_bt_safe()` on `str`/`int` subclasses and enum-backed strings
- `_to_bt_safe()` on dataclasses
- `_to_bt_safe()` on pydantic-like objects
- `bt_safe_deep_copy()` on small, medium, and large nested dict/list payloads
- `bt_safe_deep_copy()` on payloads with circular references
- `bt_safe_deep_copy()` on payloads with non-string dict keys
Why:
- these are the core hot paths in the PR under discussion
- they are easy to measure deterministically in isolation
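To make the circular-reference and non-string-key cases concrete, here is a hedged stand-in for `bt_safe_deep_copy()`; the real helper's semantics are defined by the SDK and may differ, this just shows the kind of behavior the benchmark exercises:

```python
# Hypothetical stand-in for bt_safe_deep_copy: a memo table keeps cycles from
# recursing forever, and dict keys are coerced to strings.
def bt_safe_deep_copy(value, _memo=None):
    if _memo is None:
        _memo = {}
    obj_id = id(value)
    if obj_id in _memo:
        return _memo[obj_id]
    if isinstance(value, dict):
        copy = {}
        _memo[obj_id] = copy  # register before recursing so cycles resolve
        for k, v in value.items():
            copy[str(k)] = bt_safe_deep_copy(v, _memo)
        return copy
    if isinstance(value, list):
        copy = []
        _memo[obj_id] = copy
        for v in value:
            copy.append(bt_safe_deep_copy(v, _memo))
        return copy
    return value
```

Benchmarking each payload family separately (small/medium/large, cyclic, non-string keys) keeps regressions in one branch of this dispatch from hiding behind wins in another.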
2. logger microbenchmarks
Cover:
- `_validate_and_sanitize_experiment_log_partial_args()` on empty vs populated events
- `_strip_nones()` on shallow and nested dicts, with and without `None`
- `split_logging_data()` for:
  - `event` only
  - `internal_data` only
  - both present
  - `BraintrustStream` present
- `SpanImpl` creation with explicit `name`
- `SpanImpl` creation without `name`
- `log_internal()` with user event payloads
- `log_internal()` with internal-only payloads like `end()`/`set_attributes()`
Why:
- this isolates the logger-local optimizations from broader utility changes
- it lets us quantify small changes like lazy caller-location lookup separately from bigger serialization wins
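As one concrete example, a `_strip_nones()`-style helper is small enough to benchmark in isolation; this sketch assumes recursive dict/list traversal, which may not match the SDK's exact semantics:

```python
# Hedged sketch of a _strip_nones-style helper as a benchmark target:
# drop None-valued dict entries recursively, leave everything else alone.
def strip_nones(value):
    if isinstance(value, dict):
        return {k: strip_nones(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nones(v) for v in value]
    return value
```

Timing it separately on shallow vs nested inputs, with and without `None` entries, is what lets a PR claim "the sanitize path got faster" rather than "something got faster".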
3. End-to-end tracing benchmark
Cover one realistic scenario:
- create root span
- add representative `input`/`metadata`
- create child span
- log output / metadata / metrics
- end child and root spans
- exercise background logger record construction without external I/O
Parameters:
- run enough iterations to reduce noise
- keep payload shapes fixed and version-controlled
- report separate scenarios for:
- medium payload
- large payload
- internal-only updates
Why:
- microbenchmarks show where a win comes from
- the e2e benchmark shows whether the win is real in the path users care about
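The scenario body could be structured roughly as follows; the span API here is a deliberately fake stand-in (the real SDK surface differs), used only to show how the benchmark exercises record construction without external I/O:

```python
# Hedged sketch of the e2e scenario shape. FakeSpan is NOT the SDK span API;
# it just models "build a record per operation, no network".
class FakeSpan:
    def __init__(self, name, sink):
        self.name = name
        self.sink = sink

    def log(self, **event):
        # Mimic background-logger record construction without external I/O.
        self.sink.append({"span": self.name, **event})

    def end(self):
        self.sink.append({"span": self.name, "end": True})


def run_scenario(payload):
    records = []
    root = FakeSpan("root", records)
    root.log(input=payload, metadata={"model": "test-model"})
    child = FakeSpan("child", records)
    child.log(output=payload, metrics={"tokens": 42})
    child.end()
    root.end()
    return records
```

The real benchmark would drive the SDK's actual span objects through the same sequence, with the payload builders from `fixtures.py` supplying the medium, large, and internal-only variants.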
Benchmark environments
Run each benchmark suite in at least two environments:
A. Minimal environment
- install the package plus benchmark deps
- do not install optional perf extras
Purpose:
- captures baseline behavior for default installs
B. Performance environment
- install `.[performance]`
Purpose:
- measures the effect of optional `orjson`
- ensures future performance work does not accidentally optimize only one environment
If needed later, add a second Python minor version to confirm interpreter-sensitive behavior, but that should not be required for the initial rollout.
Nox integration
Add dedicated nox sessions instead of folding this into `test_core`.
Suggested sessions:
- `perf_bt_json`
- `perf_logger`
- `perf_e2e`
- optional umbrella session: `perf`
Behavior:
- install `pyperf`
- install the package from source
- optionally install `.[performance]` in dedicated sessions or via a flag
- run benchmark modules directly
- emit JSON results to a temp or ignored output path
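A noxfile sketch of one such session, under the assumption that benchmark modules are plain pyperf scripts under `benchmarks/cases/` (session and path names are proposals, not existing code):

```python
# Hypothetical py/noxfile.py fragment for the perf sessions.
import nox


@nox.session
def perf_bt_json(session):
    session.install("pyperf")
    session.install(".")  # package from source, minimal environment
    # Forward posargs so callers can pass e.g. `-- --output /tmp/out.json`.
    session.run("python", "benchmarks/cases/bench_bt_json.py", *session.posargs)


@nox.session
def perf_bt_json_extra(session):
    session.install("pyperf")
    session.install(".[performance]")  # environment B: with orjson et al.
    session.run("python", "benchmarks/cases/bench_bt_json.py", *session.posargs)
```

Whether the `performance` extra gets its own sessions or a flag on the existing ones is an implementation detail to settle during Phase 3.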
Example local workflows:
```
cd py
nox -s perf_bt_json
nox -s perf_logger
nox -s perf_e2e
```
Comparison workflow:
```
cd py
# on main
nox -s perf_bt_json -- --output /tmp/main-bt-json.json
# on the feature branch
nox -s perf_bt_json -- --output /tmp/branch-bt-json.json
pyperf compare_to /tmp/main-bt-json.json /tmp/branch-bt-json.json
```
We can refine the exact command shape during implementation, but the key point is to make branch-vs-branch comparison first-class.
Local-first workflow
The initial implementation should optimize for local developer use only.
That means:
- no CI integration yet
- no nightly jobs
- no hard performance thresholds
Instead, the benchmark harness should make it easy to:
- establish a baseline on `main`
- run the same benchmark on a feature branch
- compare `pyperf` JSON outputs locally
The command surface should stay very small and obvious:
```
cd py
nox -s perf_bt_json
nox -s perf_logger
nox -s perf_e2e
```
And branch comparison should be a first-class local workflow:
```
cd py
nox -s perf_bt_json -- --output /tmp/main.json
nox -s perf_bt_json -- --output /tmp/branch.json
pyperf compare_to /tmp/main.json /tmp/branch.json
```
Reporting expectations for performance PRs
Any PR that claims performance improvement should include:
- benchmark command(s) used
- environment details:
- Python version
- whether `.[performance]` was installed
- before/after results for the affected cases
- note on variance if results are noisy
Prefer reporting per-benchmark improvements rather than a single blended headline number.
Rollout plan
Phase 1
- add benchmark directory structure
- add shared benchmark fixtures
- add `bt_json` microbenchmarks
Phase 2
- add `logger` microbenchmarks
- add end-to-end tracing benchmark
Phase 3
- add dedicated nox sessions
- document local benchmark workflow in `py/README.md` or the benchmark README
Acceptance criteria
- There is a documented benchmark workflow on `main`
- We can benchmark `bt_json` hot paths in isolation
- We can benchmark logger hot paths in isolation
- We can run one realistic tracing end-to-end benchmark
- We can compare JSON results across branches with `pyperf compare_to`
- Normal functional CI remains separate from benchmark execution
Why this should happen before PR #101-style perf work
PR #101 combines several categories of optimizations:
- serialization dispatch changes
- deep-copy specialization
- logger fast paths
- shared helper changes
Without a benchmark harness, it is too easy to:
- merge broad refactors based on one aggregate number
- overestimate wins from noisy runs
- miss regressions in shared helpers
- lose the ability to review each incremental optimization on its own merit
This harness should make those changes measurable and allow them to land more incrementally, while staying easy to run locally before we decide whether any CI integration is worthwhile.