Avg Latency (ms): total response time per model
Tokens / Second: inference throughput per model
Quality vs Speed: judge score vs latency (ideal: top-left)
Score by Temperature: how temperature affects quality
TTFT, Time to First Token (ms): perceived responsiveness
Estimated Cost (USD / 1K prompts): cost efficiency comparison
Quality Heatmap — Model × Category
Average judge score (overall) per model per category
| Model | reasoning | coding | structured_output | multilingual | safety |
|---|---|---|---|---|---|
| Llama 3.1 8B | 8.27 | 9.67 | 9.67 | 7.07 | 8.67 |
| Qwen3 32B | 8.87 | 7.93 | 9.57 | 8.4 | 8.8 |
| GPT-OSS 120B | 10 | 10 | 9.6 | 9.07 | 8.13 |
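
The heatmap values can be summarized per model and per category with a few lines of Python. The sketch below copies the scores from the table above; the unweighted mean across the five categories is an assumption made here for illustration, not necessarily how the dashboard computes its "overall" score, and the names `SCORES`, `overall_means`, and `best_per_category` are hypothetical.

```python
from statistics import mean

# Category order matches the table columns.
CATEGORIES = ["reasoning", "coding", "structured_output", "multilingual", "safety"]

# Judge scores copied from the heatmap table (one list per model).
SCORES = {
    "Llama 3.1 8B": [8.27, 9.67, 9.67, 7.07, 8.67],
    "Qwen3 32B": [8.87, 7.93, 9.57, 8.4, 8.8],
    "GPT-OSS 120B": [10, 10, 9.6, 9.07, 8.13],
}


def overall_means(scores):
    """Unweighted mean across categories (an assumption, see lead-in)."""
    return {model: round(mean(values), 2) for model, values in scores.items()}


def best_per_category(scores, categories):
    """Name of the highest-scoring model for each category."""
    return {
        category: max(scores, key=lambda model: scores[model][index])
        for index, category in enumerate(categories)
    }


if __name__ == "__main__":
    for model, value in overall_means(SCORES).items():
        print(f"{model}: {value}")
    for category, model in best_per_category(SCORES, CATEGORIES).items():
        print(f"{category}: {model}")
```

Weighting categories differently (for example, giving safety more weight) would change the ranking, so the unweighted mean should be read only as a rough summary.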