// LLM Evaluation Framework

Benchmark Dashboard

225 runs · 3 models · 5 categories · 3 temperatures
Models: 3
Prompts: 25
Total Runs: 225
Judge: Llama 3.3 70B
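
The run count follows from the grid: 3 models × 25 prompts × 3 temperatures = 225. A minimal sketch of that enumeration is below. The temperature values and the even split of 5 prompts per category are assumptions for illustration; the page states only the counts.

```python
from itertools import product

models = ["Llama 3.1 8B", "Qwen3 32B", "GPT-OSS 120B"]
categories = ["reasoning", "coding", "structured_output", "multilingual", "safety"]
temperatures = [0.0, 0.5, 1.0]   # hypothetical values; only the count (3) is given
prompts_per_category = 5         # assumed even split of the 25 prompts

# One (category, index) pair per prompt.
prompts = [(cat, i) for cat in categories for i in range(prompts_per_category)]

# Every model answers every prompt at every temperature.
runs = list(product(models, prompts, temperatures))

print(len(prompts), len(runs))  # 25 225
```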

Results source: outputs/all_results.json

Filters: Model · Category · Temp
Avg Latency: 1825.88 ms (across 225 runs)
Avg Throughput: 176.81 t/s (tokens per second)
Avg Judge Score: 8.74/10 (overall quality)
Total Est. Cost: $0.1091 (for 225 API calls)
Failed Runs: 0 (of 225 total)
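
Headline figures like these can be derived from the per-run records in a few lines. The sketch below is illustrative only: the field names (`latency_ms`, `tps`, `score`, `cost_usd`, `error`) are hypothetical, and excluding failed runs from the averages is an assumption, not something the dashboard states.

```python
from statistics import mean

def summarize(records):
    """Aggregate per-run records into dashboard-style headline numbers.

    Field names are hypothetical. Failed runs (those with an "error" key)
    are left out of the averages. Raises StatisticsError if every run failed.
    """
    ok = [r for r in records if not r.get("error")]
    return {
        "avg_latency_ms": round(mean(r["latency_ms"] for r in ok), 2),
        "avg_tps": round(mean(r["tps"] for r in ok), 2),
        "avg_score": round(mean(r["score"] for r in ok), 2),
        "total_cost_usd": round(sum(r["cost_usd"] for r in ok), 4),
        "failed_runs": len(records) - len(ok),
        "total_runs": len(records),
    }

def cost_per_1k(total_cost_usd, calls):
    """Normalize a total cost to USD per 1,000 calls."""
    return total_cost_usd / calls * 1000

# Synthetic records, not taken from the benchmark.
sample = [
    {"latency_ms": 1000.0, "tps": 200.0, "score": 9, "cost_usd": 0.0004},
    {"latency_ms": 2000.0, "tps": 100.0, "score": 8, "cost_usd": 0.0006},
    {"error": "timeout"},
]
summary = summarize(sample)
```

Applied to the totals above, `cost_per_1k(0.1091, 225)` comes to roughly $0.485 per 1,000 calls, averaged across all three models.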
Charts:
Avg Latency (ms): total response time per model
Tokens / Second: inference throughput per model
Quality vs Speed: judge score vs latency (ideal: top-left)
Score by Temperature: how temperature affects quality
TTFT, Time to First Token (ms): perceived responsiveness
Estimated Cost (USD / 1K prompts): cost efficiency comparison
Quality Heatmap: Model × Category
Average judge score (overall) per model per category

Model          reasoning  coding  structured_output  multilingual  safety
Llama 3.1 8B   8.27       9.67    9.67               7.07          8.67
Qwen3 32B      8.87       7.93    9.57               8.4           8.8
GPT-OSS 120B   10         10      9.6                9.07          8.13
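
Each heatmap cell is an average of judge scores over the runs sharing a model and a category. A minimal sketch of that grouping follows, assuming hypothetical record fields `model`, `category`, and `score`. If prompts are split evenly, each cell would average 15 runs (5 prompts × 3 temperatures), which would be consistent with values such as 8.27 (124/15).

```python
from collections import defaultdict

def score_heatmap(records):
    """Return {(model, category): mean score}, rounded to 2 decimals."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [score total, run count]
    for r in records:
        cell = sums[(r["model"], r["category"])]
        cell[0] += r["score"]
        cell[1] += 1
    return {key: round(total / n, 2) for key, (total, n) in sums.items()}

# Synthetic records, not taken from the benchmark.
sample = [
    {"model": "Llama 3.1 8B", "category": "coding", "score": 10},
    {"model": "Llama 3.1 8B", "category": "coding", "score": 9},
    {"model": "Qwen3 32B", "category": "safety", "score": 8},
]
heatmap = score_heatmap(sample)
```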
All Results
ID | Model | Category | Temp | Latency | TPS | TTFT | Score | Response