| eval-run:arc-challenge.claude-sonnet-4-5.2025-09 | eval-run:arc-challenge.claude-sonnet-4-5.2025-09 | benchmarks |
| eval-run:bfcl.claude-sonnet-4-5.2025-09 | eval-run:bfcl.claude-sonnet-4-5.2025-09 | benchmarks |
| eval-run:gpqa-diamond.claude-opus-4-5.2025-09 | eval-run:gpqa-diamond.claude-opus-4-5.2025-09 | benchmarks |
| eval-run:gpqa.claude-sonnet-4-5.2025-09 | eval-run:gpqa.claude-sonnet-4-5.2025-09 | benchmarks |
| eval-run:gsm8k.claude-sonnet-4-5.2025-09 | eval-run:gsm8k.claude-sonnet-4-5.2025-09 | benchmarks |
| eval-run:harmbench.claude-opus-4-5.2025-09 | eval-run:harmbench.claude-opus-4-5.2025-09 | benchmarks |
| eval-run:hellaswag.claude-opus-4-5.2025-09 | eval-run:hellaswag.claude-opus-4-5.2025-09 | benchmarks |
| eval-run:human-eval-plus.claude-sonnet-4-5.2025-09 | eval-run:human-eval-plus.claude-sonnet-4-5.2025-09 | benchmarks |
| eval-run:os-world.claude-sonnet-4-5.2025-09 | eval-run:os-world.claude-sonnet-4-5.2025-09 | benchmarks |
| eval-run:swe-bench-verified.claude-opus-4-5.2025-09 | eval-run:swe-bench-verified.claude-opus-4-5.2025-09 | benchmarks |
| eval-run:swe-bench-verified.claude-sonnet-4-5.2025-09 | eval-run:swe-bench-verified.claude-sonnet-4-5.2025-09 | benchmarks |
| eval-run:terminal-bench.claude-sonnet-4-5.2025-09 | eval-run:terminal-bench.claude-sonnet-4-5.2025-09 | benchmarks |
| eval-run:truthful-qa.claude-opus-4-5.2025-09 | eval-run:truthful-qa.claude-opus-4-5.2025-09 | benchmarks |