Benchmarking suite for OpenAI-compatible inference servers. Measures throughput, latency, and steady-state performance.
```shell
uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install -r requirements.txt
```

Sends chat completion requests to any OpenAI-compatible server and records per-request and system-wide metrics.
```shell
# Burst mode — fires all requests at once
uv run completion_benchmark.py \
  --dataset-config examples/dataset_configs/aime_simple.json \
  --scenario "N(100,50)/(50,0)" \
  --model your-model \
  --batch-sizes 1,2,4,8
```

```shell
# Sustained mode — maintains constant concurrency via semaphore
uv run completion_benchmark.py \
  --dataset-config examples/dataset_configs/aime_simple.json \
  --scenario "N(100,50)/(50,0)" \
  --model your-model \
  --max-concurrency 1,2,4,8 \
  --num-prompts 128
```

The two modes are mutually exclusive. Burst mode is best for measuring peak throughput; sustained mode gives more realistic production numbers.
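The semaphore-based throttling behind sustained mode can be sketched in a few lines of asyncio. This is an illustrative sketch, not the benchmark's actual implementation; `fake_request` is a stand-in for the real HTTP call.

```python
import asyncio
import time

async def fake_request(i: int) -> float:
    # Stand-in for a chat-completion call; sleeps instead of hitting a server.
    await asyncio.sleep(0.01)
    return time.monotonic()

async def run_sustained(num_prompts: int, max_concurrency: int) -> list[float]:
    # A semaphore caps the number of in-flight requests; as each request
    # finishes, the next one acquires a slot, keeping concurrency constant.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(i: int) -> float:
        async with sem:
            return await fake_request(i)

    return await asyncio.gather(*(bounded(i) for i in range(num_prompts)))

results = asyncio.run(run_sustained(num_prompts=16, max_concurrency=4))
print(len(results))  # 16
```

Burst mode is the degenerate case where the semaphore is absent (or its limit equals the number of prompts), so every request is fired immediately.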
| Pattern | Example | Description |
|---|---|---|
| D(in,out) | D(100,50) | Fixed token counts |
| N(mu,sigma)/(mu,sigma) | N(100,50)/(50,0) | Normal distribution |
| U(min,max)/(min,max) | U(50,150)/(20,80) | Uniform distribution |
| I(w,h) | I(512,512) | Image input |
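A scenario string resolves to input- and output-length samplers. The sketch below shows one plausible way to parse the D/N/U forms; the real parsing logic lives in completion_benchmark.py and may differ in details.

```python
import re
import random

def parse_scenario(spec: str):
    """Hypothetical parser: returns (sample_input_len, sample_output_len)."""
    if m := re.fullmatch(r"D\((\d+),(\d+)\)", spec):
        in_tok, out_tok = int(m.group(1)), int(m.group(2))
        return lambda: in_tok, lambda: out_tok
    if m := re.fullmatch(r"N\(([\d.]+),([\d.]+)\)/\(([\d.]+),([\d.]+)\)", spec):
        mu_i, sd_i, mu_o, sd_o = map(float, m.groups())
        # Clamp to at least 1 token so degenerate samples stay valid.
        return (lambda: max(1, round(random.gauss(mu_i, sd_i))),
                lambda: max(1, round(random.gauss(mu_o, sd_o))))
    if m := re.fullmatch(r"U\((\d+),(\d+)\)/\((\d+),(\d+)\)", spec):
        lo_i, hi_i, lo_o, hi_o = map(int, m.groups())
        return (lambda: random.randint(lo_i, hi_i),
                lambda: random.randint(lo_o, hi_o))
    raise ValueError(f"unrecognized scenario: {spec}")

sample_in, sample_out = parse_scenario("D(100,50)")
print(sample_in(), sample_out())  # 100 50
```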
| Flag | Description |
|---|---|
| --dataset-config | Path to JSON dataset config (see examples/dataset_configs/) |
| --scenario | Traffic pattern |
| --model | Model name |
| --api-base | Server URL (default: http://localhost:8000/v1) |
| --batch-sizes | Burst mode sweep points |
| --max-concurrency | Sustained mode sweep points |
| --num-prompts | Prompts per sweep point in sustained mode |
| --num-runs | Runs per sweep point (default: 3) |
| --results-file | Output JSON path |
| --lora-strategy | LoRA distribution: single, uniform, zipf, mixed, all-unique |
| --lora-names | Comma-separated LoRA adapter names |
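To illustrate what a skewed LoRA strategy means in practice, here is a hypothetical sketch of zipf-style adapter assignment, where adapter i is chosen with probability proportional to 1/(i+1) so a few adapters receive most of the traffic. The actual weighting used by completion_benchmark.py may differ.

```python
import random

def assign_loras(names: list[str], num_requests: int, seed: int = 0) -> list[str]:
    # Zipf-like weights: first adapter is most popular, tail adapters rare.
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) for i in range(len(names))]
    return rng.choices(names, weights=weights, k=num_requests)

picks = assign_loras(["adapter-a", "adapter-b", "adapter-c"], 1000)
# With weights 1 : 1/2 : 1/3, adapter-a dominates the sampled traffic.
print(picks.count("adapter-a") > picks.count("adapter-c"))
```

A uniform strategy would use equal weights, and all-unique would give each request its own adapter.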
Per-request:
- TTFT — time to first token (prefill latency)
- Decode throughput — output tokens/s per request
- TPOT — time per output token
System-wide:
- End-to-end output throughput — total_output_tokens / wall_time; includes ramp-up and drain
- Steady-state output throughput — median tok/s across time buckets where the batch is >= 80% full, isolating true decode performance
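The steady-state metric can be sketched as: bucket token completions by wall-clock second, keep only buckets where concurrency reached at least 80% of the target batch, and take the median tokens/s. This is an illustrative sketch with an assumed event format; the benchmark's real bucketing may differ.

```python
import statistics

def steady_state_tok_s(events, target_batch, bucket_s=1.0, threshold=0.8):
    # events: list of (timestamp, tokens_emitted, concurrent_requests)
    buckets = {}
    for t, toks, conc in events:
        b = int(t // bucket_s)
        tok_sum, conc_max = buckets.get(b, (0, 0))
        buckets[b] = (tok_sum + toks, max(conc_max, conc))
    # Keep only buckets where the batch was at least `threshold` full.
    full = [toks / bucket_s for toks, conc in buckets.values()
            if conc >= threshold * target_batch]
    return statistics.median(full) if full else 0.0

# Ramp-up (bucket 0) and drain (bucket 3) are excluded; only the two
# full-batch buckets (100 and 120 tok/s) contribute to the median.
events = [(0.2, 50, 2), (1.1, 100, 8), (2.3, 120, 8), (3.4, 40, 3)]
print(steady_state_tok_s(events, target_batch=8))  # 110.0
```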
```shell
# Single benchmark
uv run plot_completion_benchmark.py results.json plot.png

# Compare multiple benchmarks
uv run plot_completion_benchmark.py comparison.png results1.json results2.json
```

Produces a 6-panel dashboard:
| | Left | Right |
|---|---|---|
| Row 1 | TTFT | Decode throughput per request |
| Row 2 | End-to-end output throughput | Latency breakdown (prefill vs decode) |
| Row 3 | Steady-state output throughput | Time-series token buckets |
Tests concurrent embedding throughput.

```shell
uv run embedding_benchmark.py \
  --model Qwen/Qwen3-Embedding-4B \
  --sequence_lengths "200" \
  --batch_sizes "1,8,16,32,64,128,256,512" \
  --num_runs 3 \
  --results_file embedding_results.json
```

```shell
uv run plot_embedding_benchmark.py embedding_results.json embedding_plot.png
```
