tugot17/tokenomics

Tokenomics

Benchmarking suite for OpenAI-compatible inference servers. Measures throughput, latency, and steady-state performance.

[Figure: Example benchmark]

Install

uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install -r requirements.txt

Completion Benchmark

Sends chat completion requests to any OpenAI-compatible server and records per-request and system-wide metrics.

Usage

# Burst mode — fires all requests at once
uv run completion_benchmark.py \
  --dataset-config examples/dataset_configs/aime_simple.json \
  --scenario "N(100,50)/(50,0)" \
  --model your-model \
  --batch-sizes 1,2,4,8

# Sustained mode — maintains constant concurrency via semaphore
uv run completion_benchmark.py \
  --dataset-config examples/dataset_configs/aime_simple.json \
  --scenario "N(100,50)/(50,0)" \
  --model your-model \
  --max-concurrency 1,2,4,8 \
  --num-prompts 128

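The two modes are mutually exclusive: pass either --batch-sizes (burst) or --max-concurrency (sustained), not both. Burst mode measures peak throughput; sustained mode holds concurrency constant and gives numbers closer to production load.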
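To make the difference between the two modes concrete, here is a minimal sketch using asyncio. It is not the code in completion_benchmark.py: `fake_request`, `run_burst`, and `run_sustained` are hypothetical names, and the request is simulated with a short sleep so the peak number of in-flight requests can be observed.

```python
import asyncio


async def fake_request(tracker, delay=0.01):
    # Hypothetical stand-in for one chat completion call.
    # Tracks how many requests are in flight at the same time.
    tracker["in_flight"] += 1
    tracker["peak"] = max(tracker["peak"], tracker["in_flight"])
    await asyncio.sleep(delay)
    tracker["in_flight"] -= 1


async def run_burst(batch_size):
    # Burst: launch every request at once and wait for all of them.
    tracker = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(fake_request(tracker) for _ in range(batch_size)))
    return tracker["peak"]


async def run_sustained(num_prompts, max_concurrency):
    # Sustained: a semaphore caps in-flight requests, and a waiting
    # request starts as soon as a running one finishes.
    tracker = {"in_flight": 0, "peak": 0}
    sem = asyncio.Semaphore(max_concurrency)

    async def limited():
        async with sem:
            await fake_request(tracker)

    await asyncio.gather(*(limited() for _ in range(num_prompts)))
    return tracker["peak"]


burst_peak = asyncio.run(run_burst(8))
sustained_peak = asyncio.run(run_sustained(num_prompts=32, max_concurrency=4))
print(burst_peak, sustained_peak)
```

In this sketch the burst run reaches a peak equal to the batch size and then drains, while the sustained run stays at the concurrency cap until the prompts run out.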

Traffic Scenarios

| Pattern | Example | Description |
|---|---|---|
| D(in,out) | D(100,50) | Fixed token counts |
| N(mu,sigma)/(mu,sigma) | N(100,50)/(50,0) | Normal distribution |
| U(min,max)/(min,max) | U(50,150)/(20,80) | Uniform distribution |
| I(w,h) | I(512,512) | Image input |
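As an illustration of how the text patterns above could be read, the sketch below parses a scenario string and draws one (input, output) token-count pair. It assumes the first parenthesized pair describes input tokens and the second describes output tokens, in line with D(in,out). The names `parse_scenario` and `sample_token_counts` are hypothetical, clamping normal draws to at least 1 token is an assumption, and image scenarios are not handled; the tool's own parser may differ.

```python
import random
import re


def parse_scenario(scenario):
    # Split a string such as "N(100,50)/(50,0)" into its kind letter
    # and the list of numeric parameters.
    kind = scenario[0]
    numbers = [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", scenario)]
    expected = {"D": 2, "N": 4, "U": 4}
    if kind not in expected or len(numbers) != expected[kind]:
        raise ValueError(f"unsupported scenario: {scenario}")
    return kind, numbers


def sample_token_counts(scenario, rng):
    # Draw one (input_tokens, output_tokens) pair for a text scenario.
    kind, p = parse_scenario(scenario)
    if kind == "D":
        return int(p[0]), int(p[1])
    if kind == "N":
        # Assumption: clamp to at least 1 token so a draw is never empty.
        def draw(mu, sigma):
            return max(1, round(rng.gauss(mu, sigma)))
        return draw(p[0], p[1]), draw(p[2], p[3])
    # kind == "U": inclusive integer ranges
    return (rng.randint(int(p[0]), int(p[1])),
            rng.randint(int(p[2]), int(p[3])))


rng = random.Random(0)
print(sample_token_counts("D(100,50)", rng))
print(sample_token_counts("N(100,50)/(50,0)", rng))
print(sample_token_counts("U(50,150)/(20,80)", rng))
```

With sigma set to 0, as in the output half of N(100,50)/(50,0), every draw equals the mean, so output length is fixed while input length varies.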

Key Options

| Flag | Description |
|---|---|
| --dataset-config | Path to JSON dataset config (see examples/dataset_configs/) |
| --scenario | Traffic pattern |
| --model | Model name |
| --api-base | Server URL (default: http://localhost:8000/v1) |
| --batch-sizes | Burst mode sweep points |
| --max-concurrency | Sustained mode sweep points |
| --num-prompts | Prompts per sweep point in sustained mode |
| --num-runs | Runs per sweep point (default: 3) |
| --results-file | Output JSON path |
| --lora-strategy | LoRA distribution: single, uniform, zipf, mixed, all-unique |
| --lora-names | Comma-separated LoRA adapter names |
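The README does not spell out how each LoRA strategy assigns adapters to requests. As a rough sketch of what three of the names suggest, the hypothetical `pick_adapters` below treats single as "always the first adapter", uniform as "equal probability", and zipf as "probability proportional to 1/rank". These readings, the `zipf_s` exponent, and the function itself are assumptions; mixed and all-unique are left out because their meaning cannot be inferred from the name alone.

```python
import random


def pick_adapters(names, strategy, num_requests, rng, zipf_s=1.0):
    # Return one adapter name per request (hypothetical semantics).
    if strategy == "single":
        # Assumption: every request uses the first adapter.
        return [names[0]] * num_requests
    if strategy == "uniform":
        return [rng.choice(names) for _ in range(num_requests)]
    if strategy == "zipf":
        # Weight the adapter at rank r by 1 / r**zipf_s, so a few
        # adapters receive most of the traffic.
        weights = [1.0 / (rank ** zipf_s) for rank in range(1, len(names) + 1)]
        return rng.choices(names, weights=weights, k=num_requests)
    raise ValueError(f"strategy not covered by this sketch: {strategy}")


rng = random.Random(0)
adapters = "adapter-a,adapter-b,adapter-c".split(",")  # hypothetical names
picks = pick_adapters(adapters, "zipf", 1000, rng)
print({name: picks.count(name) for name in adapters})
```

A skewed distribution like zipf is useful for checking how a server behaves when most requests hit a small set of hot adapters.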

Metrics

Per-request:

  • TTFT — time to first token (prefill latency)
  • Decode throughput — output tokens/s per request
  • TPOT — time per output token
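The per-request metrics can be derived from the time a request was sent and the arrival times of its streamed tokens. The sketch below uses one common convention, in which TPOT and decode throughput exclude the first token because that token is attributed to prefill. The function name and this convention are assumptions, not taken from the benchmark's source.

```python
def per_request_metrics(send_time, token_times):
    # send_time: when the request was sent (seconds).
    # token_times: arrival time of each output token, in order.
    ttft = token_times[0] - send_time
    decode_tokens = len(token_times) - 1
    decode_time = token_times[-1] - token_times[0]
    if decode_tokens > 0 and decode_time > 0:
        tpot = decode_time / decode_tokens        # seconds per token
        decode_tps = decode_tokens / decode_time  # tokens per second
    else:
        # A single-token response has no decode phase to measure.
        tpot = 0.0
        decode_tps = 0.0
    return {"ttft": ttft, "tpot": tpot, "decode_tps": decode_tps}


metrics = per_request_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
print(metrics)
```

Under this convention TPOT and decode throughput are reciprocals of each other, so a request reports the same information in two units.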

System-wide:

  • End-to-end output throughput: total_output_tokens / wall_time, includes ramp-up and drain
  • Steady-state output throughput — median tok/s across time buckets where the batch is >= 80% full, isolating true decode performance
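The two system-wide numbers can differ a lot on short runs, because ramp-up and drain pull the end-to-end average down. The sketch below shows one way to compute both. It assumes each time bucket records the tokens produced and the number of active requests, and that "80% full" means active requests are at least 80% of the target concurrency. The function names, the bucket layout, and the one-second bucket width are assumptions.

```python
from statistics import median


def end_to_end_throughput(total_output_tokens, wall_time):
    # Average over the whole run, including ramp-up and drain.
    return total_output_tokens / wall_time


def steady_state_throughput(buckets, target_concurrency,
                            bucket_seconds=1.0, fill=0.8):
    # buckets: list of (tokens_in_bucket, active_requests), one per bucket.
    # Keep only buckets where the batch is at least `fill` full.
    full = [tokens / bucket_seconds
            for tokens, active in buckets
            if active >= fill * target_concurrency]
    return median(full) if full else None


# Five one-second buckets: ramp-up, three near-full buckets, drain.
buckets = [(20, 2), (90, 7), (100, 8), (110, 8), (30, 3)]
total_tokens = sum(tokens for tokens, _ in buckets)
e2e = end_to_end_throughput(total_tokens, wall_time=5.0)
steady = steady_state_throughput(buckets, target_concurrency=8)
print(e2e, steady)
```

In this made-up example the end-to-end figure is 70 tok/s while the steady-state figure is 100 tok/s, which is the gap that the steady-state metric is meant to expose.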

Plotting

# Single benchmark
uv run plot_completion_benchmark.py results.json plot.png

# Compare multiple benchmarks
uv run plot_completion_benchmark.py comparison.png results1.json results2.json

Produces a 6-panel dashboard:

| | Left | Right |
|---|---|---|
| Row 1 | TTFT | Decode throughput per request |
| Row 2 | End-to-end output throughput | Latency breakdown (prefill vs decode) |
| Row 3 | Steady-state output throughput | Time-series token buckets |

Embedding Benchmark

Tests concurrent embedding throughput.

uv run embedding_benchmark.py \
  --model Qwen/Qwen3-Embedding-4B \
  --sequence_lengths "200" \
  --batch_sizes "1,8,16,32,64,128,256,512" \
  --num_runs 3 \
  --results_file embedding_results.json

uv run plot_embedding_benchmark.py embedding_results.json embedding_plot.png

[Figure: Embedding performance]
