| test-set:bfcl-v3 | Berkeley Function Calling Leaderboard v3 | benchmarks |
| test-set:bigcode-evalplus | BigCode EvalPlus | benchmarks |
| test-set:flores-200-devtest | FLORES-200 devtest | benchmarks |
| test-set:gaia-validation | GAIA validation split | benchmarks |
| test-set:gpqa-diamond | GPQA Diamond | benchmarks |
| test-set:gpqa-diamond-2024 | GPQA Diamond — 2024 release | benchmarks |
| test-set:gsm8k-test | GSM8K test split | benchmarks |
| test-set:hellaswag-validation | HellaSwag validation | benchmarks |
| test-set:livecodebench-2024-12 | LiveCodeBench 2024-12 cut | benchmarks |
| test-set:math-test | MATH test split | benchmarks |
| test-set:swe-bench-verified-2024-12 | SWE-bench Verified 2024-12 | benchmarks |
| test-set:terminal-bench-v1 | Terminal-Bench v1 | benchmarks |
| test-set:truthful-qa-mc | TruthfulQA — multiple-choice | benchmarks |