LLM Benchmark Dashboard

Avg Latency

1825.88 ms

across 225 runs

Avg Throughput

176.81 t/s

tokens per second

Avg Judge Score

8.74/10

overall quality

Total Est. Cost

$0.1091

for 225 API calls

Failed Runs

of 225 total

Avg Latency (ms)

Total response time per model

Tokens / Second

Inference throughput per model

Quality vs Speed

Judge score vs latency — ideal: top-left

Score by Temperature

How temperature affects quality

TTFT — Time to First Token (ms)

Perceived responsiveness

Estimated Cost (USD / 1K prompts)

Cost efficiency comparison
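
As a rough illustration of the per-1K-prompt normalization, the sketch below scales the total shown in the summary cards ($0.1091 for 225 API calls) to 1,000 prompts. It assumes one prompt per API call and produces a single blended figure across all models; the per-model values in the chart would need per-model totals, which are not listed here. The function name `cost_per_1k_prompts` is hypothetical.

```python
# Figures copied from the summary cards above.
TOTAL_COST_USD = 0.1091
TOTAL_CALLS = 225


def cost_per_1k_prompts(total_cost_usd: float, calls: int) -> float:
    """Scale an observed total cost to a per-1,000-prompt estimate.

    Assumes one prompt per API call (an assumption, not stated by the dashboard).
    """
    if calls <= 0:
        raise ValueError("calls must be positive")
    return total_cost_usd / calls * 1000


blended = cost_per_1k_prompts(TOTAL_COST_USD, TOTAL_CALLS)
print(f"Blended cost: ${blended:.4f} per 1K prompts")
```

With these inputs the blended estimate comes to roughly $0.48 per 1K prompts.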

Quality Heatmap — Model × Category

Average judge score (overall) per model per category

Model	reasoning	coding	structured_output	multilingual	safety
Llama 3.1 8B	8.27	9.67	9.67	7.07	8.67
Qwen3 32B	8.87	7.93	9.57	8.40	8.80
GPT-OSS 120B	10.00	10.00	9.60	9.07	8.13
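
One way to read the heatmap is to reduce each row to an unweighted mean and note each model's weakest category. The sketch below does that with the values from the table; the helper `summarize` is hypothetical, and the unweighted mean assumes every category has the same number of runs, so it need not match the run-level average shown in the summary cards.

```python
from statistics import mean

# Values copied from the heatmap table above (judge score out of 10).
CATEGORIES = ["reasoning", "coding", "structured_output", "multilingual", "safety"]
SCORES = {
    "Llama 3.1 8B": [8.27, 9.67, 9.67, 7.07, 8.67],
    "Qwen3 32B": [8.87, 7.93, 9.57, 8.4, 8.8],
    "GPT-OSS 120B": [10, 10, 9.6, 9.07, 8.13],
}


def summarize(scores):
    """Return (model, unweighted mean, weakest category) rows, best mean first."""
    rows = []
    for model, vals in scores.items():
        weakest = CATEGORIES[vals.index(min(vals))]
        rows.append((model, round(mean(vals), 2), weakest))
    return sorted(rows, key=lambda row: row[1], reverse=True)


for model, avg, weakest in summarize(SCORES):
    print(f"{model}: mean {avg:.2f}, weakest category: {weakest}")
```

On this table, GPT-OSS 120B has the highest unweighted mean, and each model's weakest category is different: safety for GPT-OSS 120B, coding for Qwen3 32B, and multilingual for Llama 3.1 8B.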

All Results

ID	Model	Category	Temp	Latency (ms)	TPS (tokens/s)	TTFT (ms)	Score	Response
