| id | displayName | cluster |
|---|---|---|
| benchmark:advbench | AdvBench | benchmarks |
| benchmark:apps | APPS | benchmarks |
| benchmark:arc-challenge | ARC-Challenge | benchmarks |
| benchmark:bbh | BIG-Bench Hard (BBH) | benchmarks |
| benchmark:berkeley-function-calling | Berkeley Function Calling Leaderboard (BFCL) | benchmarks |
| benchmark:bias-bench | BBQ (Bias Benchmark for QA) | benchmarks |
| benchmark:bigcode-evalplus | EvalPlus | benchmarks |
| benchmark:bigcodebench | BigCodeBench | benchmarks |
| benchmark:ds1000 | DS-1000 | benchmarks |
| benchmark:fin-bench | FinBench | benchmarks |
| benchmark:flores-200 | FLORES-200 | benchmarks |
| benchmark:frontier-math | FrontierMath | benchmarks |
| benchmark:gpqa | GPQA | benchmarks |
| benchmark:gsm-symbolic | GSM-Symbolic | benchmarks |
| benchmark:gsm8k | GSM8K | benchmarks |
| benchmark:harmbench | HarmBench | benchmarks |
| benchmark:hellaswag | HellaSwag | benchmarks |
| benchmark:hle | Humanity's Last Exam (HLE) | benchmarks |
| benchmark:human-eval | HumanEval | benchmarks |
| benchmark:jailbreakbench | JailbreakBench | benchmarks |
| benchmark:legal-bench | LegalBench | benchmarks |
| benchmark:livecodebench | LiveCodeBench | benchmarks |
| benchmark:lmsys-arena | Chatbot Arena (LMSYS) | benchmarks |
| benchmark:m-mmlu | Multilingual MMLU (mMMLU) | benchmarks |
| benchmark:math | MATH | benchmarks |
| benchmark:mbpp | MBPP | benchmarks |
| benchmark:mbpp-plus | MBPP+ | benchmarks |
| benchmark:medqa | MedQA | benchmarks |
| benchmark:mgsm | MGSM | benchmarks |
| benchmark:mmlu | MMLU | benchmarks |
| benchmark:mt-bench | MT-Bench | benchmarks |
| benchmark:multipl-e | MultiPL-E | benchmarks |
| benchmark:olympiad-bench | OlympiadBench | benchmarks |
| benchmark:promptbench | PromptBench | benchmarks |
| benchmark:repobench | RepoBench | benchmarks |
| benchmark:truthful-qa | TruthfulQA | benchmarks |
| benchmark:xnli | XNLI | benchmarks |
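
The `id` column follows a `cluster:slug` convention, with every entry in this section belonging to the `benchmarks` cluster. As a minimal sketch of how the table might be consumed in code, the following assumes an in-memory dict keyed by id; the `Entry` type, `REGISTRY` name, and `lookup` helper are hypothetical illustrations, not part of any published API, and only a few rows are transcribed.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One row of the registry table: display name plus cluster."""
    display_name: str
    cluster: str


# Hypothetical in-memory form of the table above (abridged to three rows).
# Keys follow the "cluster:slug" id convention used throughout the section.
REGISTRY: dict[str, Entry] = {
    "benchmark:gsm8k": Entry("GSM8K", "benchmarks"),
    "benchmark:mmlu": Entry("MMLU", "benchmarks"),
    "benchmark:human-eval": Entry("HumanEval", "benchmarks"),
}


def lookup(entry_id: str) -> Entry:
    """Resolve an id like 'benchmark:mmlu' to its registry entry,
    validating the 'cluster:slug' shape before the dict lookup."""
    cluster, sep, slug = entry_id.partition(":")
    if not sep or not cluster or not slug:
        raise ValueError(f"malformed id, expected 'cluster:slug': {entry_id!r}")
    return REGISTRY[entry_id]


print(lookup("benchmark:mmlu").display_name)  # -> MMLU
```

Keeping the id (rather than the display name) as the lookup key means display names can be corrected or localized without breaking references elsewhere, which is why entries like `benchmark:lmsys-arena` → "Chatbot Arena (LMSYS)" diverge from their slugs.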