Add a repeatable performance benchmark harness for tracing hot paths #134

@AbhiPrasad

Description

Summary

We should add a durable benchmarking setup for the Python SDK before landing tracing performance changes like the ones proposed in PR #101.

Today we have strong functional coverage, but no formal performance harness on main. That makes it hard to:

  • validate that a performance-oriented refactor helps the intended hot path
  • separate large wins from noise
  • compare branch results against main
  • catch accidental regressions in shared helpers like merge_dicts()
  • evaluate behavior with and without optional perf dependencies like orjson

This issue proposes a benchmarking design that fits the repo's current nox and packaging setup.

Goals

  • Add repeatable microbenchmarks for tracing hot-path functions
  • Add one realistic end-to-end tracing benchmark
  • Make it easy to compare main vs a feature branch locally
  • Cover both minimal installs and the performance extra
  • Keep performance checks separate from normal functional test runs

Non-goals

  • Do not block CI on strict timing thresholds initially
  • Do not turn benchmark timings into flaky pytest assertions
  • Do not try to benchmark every SDK area at once

Proposed tooling

Use pyperf as the primary benchmark runner.

Rationale:

  • much better statistical discipline than ad hoc time.perf_counter() loops
  • supports warmups, calibration, repetition, and JSON output
  • gives us a clean compare_to workflow for branch vs branch
  • better fit for stable microbenchmarks than plain pytest timing

Keep a separate scenario-style benchmark for end-to-end tracing flows. That can still be implemented in Python, but it should be structured as a benchmark module rather than a one-off exploratory script.
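To make the pyperf choice concrete, here is a minimal sketch of what a case module could look like. The payload shape and the `copy.deepcopy` call are stand-ins for the SDK's real fixtures and `bt_safe_deep_copy()`; only the pyperf usage pattern is the point, and the benchmark name is illustrative.

```python
# Sketch of a pyperf case module (e.g. benchmarks/cases/bench_bt_json.py).
# pyperf handles warmup, calibration, repetition, and --output JSON itself.
import copy

try:
    import pyperf  # benchmark-only dependency; guarded so this sketch imports without it
except ImportError:
    pyperf = None

# Fixed, version-controlled payload shape so runs stay comparable across branches.
PAYLOAD = {"spans": [{"id": i, "meta": {"k": "v", "n": i}} for i in range(100)]}

def deep_copy_payload():
    # Stand-in for bt_safe_deep_copy() on a medium nested payload.
    return copy.deepcopy(PAYLOAD)

if __name__ == "__main__" and pyperf is not None:
    runner = pyperf.Runner()
    runner.bench_func("bt_json/deep_copy_medium", deep_copy_payload)
```

Running the module directly (with pyperf installed) produces the JSON results that `pyperf compare_to` consumes.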

Proposed layout

Under py/:

benchmarks/
  README.md
  conftest.py
  cases/
    bench_bt_json.py
    bench_logger.py
    bench_tracing_e2e.py
  fixtures.py

Notes:

  • bench_bt_json.py should focus on serialization and deep-copy hot paths
  • bench_logger.py should focus on span creation, split/sanitize paths, and internal-only logging
  • bench_tracing_e2e.py should model a realistic root-span + child-span + logging workload
  • fixtures.py should centralize representative payload builders so cases stay consistent
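A sketch of what `fixtures.py` could centralize, assuming the payload shapes and sizes below are illustrative placeholders to be tuned during implementation:

```python
# Sketch of benchmarks/fixtures.py: shared payload builders so every case
# benchmarks the same shapes. Depths, widths, and field names are illustrative.
def nested_payload(depth: int, width: int) -> dict:
    """Build a nested dict/list payload of a fixed, repeatable shape."""
    if depth == 0:
        return {"leaf": "x" * 16}
    return {f"key_{i}": [nested_payload(depth - 1, width)] for i in range(width)}

def circular_payload() -> dict:
    """Payload containing a circular reference, for bt_safe_deep_copy() cases."""
    d = {"name": "root"}
    d["self"] = d
    return d

def non_string_key_payload() -> dict:
    """Payload with non-string dict keys (ints, tuples)."""
    return {1: "one", (2, 3): "pair", "s": "str"}

# Small/medium/large variants referenced by the case modules.
SMALL = nested_payload(depth=2, width=2)
MEDIUM = nested_payload(depth=3, width=4)
LARGE = nested_payload(depth=4, width=6)
```

Keeping these in one module means a change to payload shape invalidates old baselines visibly, in one diff.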

What to benchmark

1. bt_json microbenchmarks

Cover:

  • _to_bt_safe() on primitives
  • _to_bt_safe() on str/int subclasses and enum-backed strings
  • _to_bt_safe() on dataclasses
  • _to_bt_safe() on pydantic-like objects
  • bt_safe_deep_copy() on small, medium, and large nested dict/list payloads
  • bt_safe_deep_copy() on payloads with circular references
  • bt_safe_deep_copy() on payloads with non-string dict keys

Why:

  • these are the core hot paths in the PR under discussion
  • they are easy to measure deterministically in isolation
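The `_to_bt_safe()` variants above each need representative inputs. A sketch of plausible input builders (these are illustrative shapes only; `_to_bt_safe()` itself lives in the SDK):

```python
# Representative inputs for the _to_bt_safe() variants listed above.
import enum
from dataclasses import dataclass

class SpanKind(str, enum.Enum):  # enum-backed string
    LLM = "llm"
    TOOL = "tool"

class TaggedStr(str):  # str subclass
    pass

@dataclass
class Usage:  # dataclass input
    prompt_tokens: int
    completion_tokens: int

PRIMITIVES = [None, True, 42, 3.14, "plain"]
SUBCLASSES = [TaggedStr("tagged"), SpanKind.LLM]
DATACLASSES = [Usage(prompt_tokens=512, completion_tokens=128)]
```

Each list would back one `bench_func` registration, so a dispatch change shows up per input class rather than as one blended number.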

2. logger microbenchmarks

Cover:

  • _validate_and_sanitize_experiment_log_partial_args() on empty vs populated events
  • _strip_nones() on shallow and nested dicts, with and without None
  • split_logging_data() for:
    • event only
    • internal_data only
    • both present
    • BraintrustStream present
  • SpanImpl creation with explicit name
  • SpanImpl creation without name
  • log_internal() with user event payloads
  • log_internal() with internal-only payloads like end() / set_attributes()

Why:

  • this isolates the logger-local optimizations from broader utility changes
  • it lets us quantify small changes like lazy caller-location lookup separately from bigger serialization wins
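For orientation, here is a plausible reference implementation of the `_strip_nones()` behavior described above, together with the shallow and nested inputs a case would use. The SDK's actual implementation may differ; this is only to make the benchmark inputs concrete.

```python
# Plausible reference for _strip_nones(): drop None-valued dict entries
# recursively (the SDK's real code may behave differently at the edges).
def strip_nones(value):
    if isinstance(value, dict):
        return {k: strip_nones(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nones(v) for v in value]
    return value

# Benchmark inputs: shallow vs nested, with and without None.
SHALLOW = {"a": 1, "b": None, "c": "x"}
NESTED = {"a": {"b": None, "c": {"d": None, "e": 2}}, "f": None}
```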

3. End-to-end tracing benchmark

Cover one realistic scenario:

  • create root span
  • add representative input / metadata
  • create child span
  • log output / metadata / metrics
  • end child and root spans
  • exercise background logger record construction without external I/O

Parameters:

  • run enough iterations to reduce noise
  • keep payload shapes fixed and version-controlled
  • report separate scenarios for:
    • medium payload
    • large payload
    • internal-only updates

Why:

  • microbenchmarks show where a win comes from
  • the e2e benchmark shows whether the win is real in the path users care about
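The scenario steps above could be skeletoned as follows. The span objects here are stubs that mirror the shape of the workload; the real case would drive the SDK's span API, and method names like `start_span`, `log`, and `end` are assumptions for illustration.

```python
# Skeleton of the e2e tracing scenario with stand-in span objects.
import contextlib

RECORDS = []  # stands in for background logger record construction (no external I/O)

class StubSpan:
    def __init__(self, name):
        self.name = name

    def log(self, **fields):
        RECORDS.append((self.name, fields))

    @contextlib.contextmanager
    def start_span(self, name):
        child = StubSpan(name)
        try:
            yield child
        finally:
            child.end()

    def end(self):
        RECORDS.append((self.name, {"__end": True}))

def tracing_scenario(payload):
    # root span -> representative input/metadata -> child span -> output/metrics -> end both
    root = StubSpan("root")
    root.log(input=payload["input"], metadata=payload["metadata"])
    with root.start_span("child") as child:
        child.log(output=payload["output"], metrics={"tokens": 128})
    root.end()
```

The benchmark would call `tracing_scenario` once per iteration with the fixed medium, large, and internal-only payloads.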

Benchmark environments

Run each benchmark suite in at least two environments:

A. Minimal environment

  • install the package plus benchmark deps
  • do not install optional perf extras

Purpose:

  • captures baseline behavior for default installs

B. Performance environment

  • install .[performance]

Purpose:

  • measures the effect of optional orjson
  • ensures future performance work does not accidentally optimize only one environment

If needed later, add a second Python minor version to confirm interpreter-sensitive behavior, but that should not be required for the initial rollout.
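To keep minimal and performance runs from being mixed up, each case could record which environment it ran in. A small sketch (`environment_metadata` is a hypothetical helper; `find_spec` avoids importing orjson just to detect it):

```python
# Sketch: environment metadata to attach to benchmark results so minimal vs
# .[performance] runs stay distinguishable.
import importlib.util
import sys

def environment_metadata() -> dict:
    return {
        "python": "%d.%d.%d" % sys.version_info[:3],
        "orjson_installed": importlib.util.find_spec("orjson") is not None,
    }
```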

Nox integration

Add dedicated nox sessions instead of folding this into test_core.

Suggested sessions:

  • perf_bt_json
  • perf_logger
  • perf_e2e
  • optional umbrella session: perf

Behavior:

  • install pyperf
  • install the package from source
  • optionally install .[performance] in dedicated sessions or via a flag
  • run benchmark modules directly
  • emit JSON results to a temp or ignored output path
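A hypothetical sketch of the noxfile additions, assuming the session and case-file names proposed in this issue; exact install steps (including how `.[performance]` is selected) would be settled during implementation. The `nox` import is guarded only so the sketch can be read without nox installed.

```python
try:
    import nox
except ImportError:
    nox = None

# Case scripts per session (paths from the proposed layout above).
PERF_CASES = {
    "perf_bt_json": "benchmarks/cases/bench_bt_json.py",
    "perf_logger": "benchmarks/cases/bench_logger.py",
    "perf_e2e": "benchmarks/cases/bench_tracing_e2e.py",
}

if nox is not None:
    @nox.session
    def perf_bt_json(session):
        session.install("pyperf")
        session.install(".")  # ".[performance]" could be selected via a flag
        # Forwarding posargs lets callers pass --output for pyperf JSON.
        session.run("python", PERF_CASES["perf_bt_json"], *session.posargs)

    @nox.session
    def perf_logger(session):
        session.install("pyperf")
        session.install(".")
        session.run("python", PERF_CASES["perf_logger"], *session.posargs)

    @nox.session
    def perf_e2e(session):
        session.install("pyperf")
        session.install(".")
        session.run("python", PERF_CASES["perf_e2e"], *session.posargs)

    @nox.session
    def perf(session):
        # Umbrella session: queue the three perf sessions.
        for name in PERF_CASES:
            session.notify(name)
```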

Example local workflows:

cd py
nox -s perf_bt_json
nox -s perf_logger
nox -s perf_e2e

Comparison workflow:

cd py
nox -s perf_bt_json -- --output /tmp/main-bt-json.json
nox -s perf_bt_json -- --output /tmp/branch-bt-json.json
pyperf compare_to /tmp/main-bt-json.json /tmp/branch-bt-json.json

We can refine exact command shape during implementation, but the key point is to make branch-vs-branch comparison first-class.

Local-first workflow

The initial implementation should optimize for local developer use only.

That means:

  • no CI integration yet
  • no nightly jobs
  • no hard performance thresholds

Instead, the benchmark harness should make it easy to:

  • establish a baseline on main
  • run the same benchmark on a feature branch
  • compare pyperf JSON outputs locally

The command surface should stay very small and obvious: the three perf_* nox sessions listed under Nox integration, plus pyperf compare_to on their JSON outputs, are the entire local workflow. Establishing a baseline, re-running on a branch, and comparing should never require more than the commands shown above.

Reporting expectations for performance PRs

Any PR that claims performance improvement should include:

  • benchmark command(s) used
  • environment details:
    • Python version
    • whether .[performance] was installed
  • before/after results for the affected cases
  • note on variance if results are noisy

Prefer reporting per-benchmark improvements rather than a single blended headline number.

Rollout plan

Phase 1

  • add benchmark directory structure
  • add shared benchmark fixtures
  • add bt_json microbenchmarks

Phase 2

  • add logger microbenchmarks
  • add end-to-end tracing benchmark

Phase 3

  • add dedicated nox sessions
  • document local benchmark workflow in py/README.md or benchmark README

Acceptance criteria

  • There is a documented benchmark workflow on main
  • We can benchmark bt_json hot paths in isolation
  • We can benchmark logger hot paths in isolation
  • We can run one realistic tracing end-to-end benchmark
  • We can compare JSON results across branches with pyperf compare_to
  • Normal functional CI remains separate from benchmark execution

Why this should happen before PR #101-style perf work

PR #101 combines several categories of optimizations:

  • serialization dispatch changes
  • deep-copy specialization
  • logger fast paths
  • shared helper changes

Without a benchmark harness, it is too easy to:

  • merge broad refactors based on one aggregate number
  • overestimate wins from noisy runs
  • miss regressions in shared helpers
  • lose the ability to review each incremental optimization on its own merit

This harness should make those changes measurable and allow them to land more incrementally, while staying easy to run locally before we decide whether any CI integration is worthwhile.
