Benchmarking suite for OpenAI-compatible inference servers. Measures throughput, latency, and steady-state performance.
```shell
uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install -r requirements.txt
```

Sends chat completion requests to any OpenAI-compatible server and records per-request and system-wide metrics.
```shell
# Burst mode — fires all requests at once
uv run completion_benchmark.py \
  --dataset-config examples/dataset_configs/aime_simple.json \
  --scenario "N(100,50)/(50,0)" \
  --model your-model \
  --batch-sizes 1,2,4,8
```

```shell
# Sustained mode — maintains constant concurrency via semaphore
uv run completion_benchmark.py \
  --dataset-config examples/dataset_configs/aime_simple.json \
  --scenario "N(100,50)/(50,0)" \
  --model your-model \
  --max-concurrency 1,2,4,8 \
  --num-prompts 128
```

The two modes are mutually exclusive. Burst mode is best for measuring peak throughput; sustained mode gives more realistic production numbers.
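The semaphore-based throttling behind sustained mode can be sketched in a few lines of asyncio. This is an illustrative sketch, not the benchmark's actual implementation; `fake_request` is a stand-in for the real HTTP call.

```python
import asyncio
import time

async def fake_request(i: int) -> float:
    # Stand-in for a chat-completion call; sleeps instead of hitting a server.
    await asyncio.sleep(0.01)
    return time.monotonic()

async def run_sustained(num_prompts: int, max_concurrency: int) -> list[float]:
    # A semaphore caps the number of in-flight requests; as each request
    # finishes, the next one acquires a slot, keeping concurrency constant.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(i: int) -> float:
        async with sem:
            return await fake_request(i)

    return await asyncio.gather(*(bounded(i) for i in range(num_prompts)))

results = asyncio.run(run_sustained(num_prompts=16, max_concurrency=4))
print(len(results))  # 16
```

Burst mode is the degenerate case where the semaphore is absent (or its limit equals the number of prompts), so every request is fired immediately.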
| Pattern | Example | Description |
|---|---|---|
| D(in,out) | D(100,50) | Fixed token counts |
| N(mu,sigma)/(mu,sigma) | N(100,50)/(50,0) | Normal distribution |
| U(min,max)/(min,max) | U(50,150)/(20,80) | Uniform distribution |
| I(w,h) | I(512,512) | Image input |
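A scenario string resolves to input- and output-length samplers. The sketch below shows one plausible way to parse the D/N/U forms; the real parsing logic lives in completion_benchmark.py and may differ in details.

```python
import re
import random

def parse_scenario(spec: str):
    """Hypothetical parser: returns (sample_input_len, sample_output_len)."""
    if m := re.fullmatch(r"D\((\d+),(\d+)\)", spec):
        in_tok, out_tok = int(m.group(1)), int(m.group(2))
        return lambda: in_tok, lambda: out_tok
    if m := re.fullmatch(r"N\(([\d.]+),([\d.]+)\)/\(([\d.]+),([\d.]+)\)", spec):
        mu_i, sd_i, mu_o, sd_o = map(float, m.groups())
        # Clamp to at least 1 token so degenerate samples stay valid.
        return (lambda: max(1, round(random.gauss(mu_i, sd_i))),
                lambda: max(1, round(random.gauss(mu_o, sd_o))))
    if m := re.fullmatch(r"U\((\d+),(\d+)\)/\((\d+),(\d+)\)", spec):
        lo_i, hi_i, lo_o, hi_o = map(int, m.groups())
        return (lambda: random.randint(lo_i, hi_i),
                lambda: random.randint(lo_o, hi_o))
    raise ValueError(f"unrecognized scenario: {spec}")

sample_in, sample_out = parse_scenario("D(100,50)")
print(sample_in(), sample_out())  # 100 50
```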
| Flag | Description |
|---|---|
| --dataset-config | Path to JSON dataset config (see examples/dataset_configs/) |
| --scenario | Traffic pattern |
| --model | Model name |
| --api-base | Server URL (default: http://localhost:8000/v1) |
| --batch-sizes | Burst mode sweep points |
| --max-concurrency | Sustained mode sweep points |
| --num-prompts | Prompts per sweep point in sustained mode |
| --num-runs | Runs per sweep point (default: 3) |
| --results-file | Output JSON path |
| --lora-strategy | LoRA distribution: single, uniform, zipf, mixed, all-unique |
| --lora-names | Comma-separated LoRA adapter names |
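To illustrate what a skewed LoRA strategy means in practice, here is a hypothetical sketch of zipf-style adapter assignment, where adapter i is chosen with probability proportional to 1/(i+1) so a few adapters receive most of the traffic. The actual weighting used by completion_benchmark.py may differ.

```python
import random

def assign_loras(names: list[str], num_requests: int, seed: int = 0) -> list[str]:
    # Zipf-like weights: first adapter is most popular, tail adapters rare.
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) for i in range(len(names))]
    return rng.choices(names, weights=weights, k=num_requests)

picks = assign_loras(["adapter-a", "adapter-b", "adapter-c"], 1000)
# With weights 1 : 1/2 : 1/3, adapter-a dominates the sampled traffic.
print(picks.count("adapter-a") > picks.count("adapter-c"))
```

A uniform strategy would use equal weights, and all-unique would give each request its own adapter.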
Per-request:
- TTFT — time to first token (prefill latency)
- Decode throughput — output tokens/s per request
- TPOT — time per output token
System-wide:
- End-to-end output throughput — total_output_tokens / wall_time; includes ramp-up and drain
- Steady-state output throughput — median tok/s across time buckets where the batch is >= 80% full, isolating true decode performance
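The steady-state metric can be sketched as: bucket token completions by wall-clock second, keep only buckets where concurrency reached at least 80% of the target batch, and take the median tokens/s. This is an illustrative sketch with an assumed event format; the benchmark's real bucketing may differ.

```python
import statistics

def steady_state_tok_s(events, target_batch, bucket_s=1.0, threshold=0.8):
    # events: list of (timestamp, tokens_emitted, concurrent_requests)
    buckets = {}
    for t, toks, conc in events:
        b = int(t // bucket_s)
        tok_sum, conc_max = buckets.get(b, (0, 0))
        buckets[b] = (tok_sum + toks, max(conc_max, conc))
    # Keep only buckets where the batch was at least `threshold` full.
    full = [toks / bucket_s for toks, conc in buckets.values()
            if conc >= threshold * target_batch]
    return statistics.median(full) if full else 0.0

# Ramp-up (bucket 0) and drain (bucket 3) are excluded; only the two
# full-batch buckets (100 and 120 tok/s) contribute to the median.
events = [(0.2, 50, 2), (1.1, 100, 8), (2.3, 120, 8), (3.4, 40, 3)]
print(steady_state_tok_s(events, target_batch=8))  # 110.0
```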
```shell
# Single benchmark
uv run plot_completion_benchmark.py results.json plot.png

# Compare multiple benchmarks
uv run plot_completion_benchmark.py comparison.png results1.json results2.json
```

Produces a 6-panel dashboard:
| | Left | Right |
|---|---|---|
| Row 1 | TTFT | Decode throughput per request |
| Row 2 | End-to-end output throughput | Latency breakdown (prefill vs decode) |
| Row 3 | Steady-state output throughput | Time-series token buckets |
Tests concurrent embedding throughput.

```shell
uv run embedding_benchmark.py \
  --model Qwen/Qwen3-Embedding-4B \
  --sequence_lengths "200" \
  --batch_sizes "1,8,16,32,64,128,256,512" \
  --num_runs 3 \
  --results_file embedding_results.json
```

```shell
uv run plot_embedding_benchmark.py embedding_results.json embedding_plot.png
```
