| id | displayName | cluster |
|---|---|---|
| benchmark:advbench | AdvBench | benchmarks |
| benchmark:arc-challenge | ARC-Challenge | benchmarks |
| benchmark:berkeley-function-calling | Berkeley Function-Calling Leaderboard (BFCL) | benchmarks |
| benchmark:bias-bench | BBQ (Bias Benchmark for QA) | benchmarks |
| benchmark:flores-200 | FLORES-200 | benchmarks |
| benchmark:gpqa | GPQA | benchmarks |
| benchmark:harmbench | HarmBench | benchmarks |
| benchmark:jailbreakbench | JailbreakBench | benchmarks |
| benchmark:m-mmlu | Multilingual MMLU (mMMLU) | benchmarks |
| benchmark:mgsm | MGSM | benchmarks |
| benchmark:olympiad-bench | OlympiadBench | benchmarks |
| benchmark:promptbench | PromptBench | benchmarks |
| benchmark:truthful-qa | TruthfulQA | benchmarks |
| benchmark:xnli | XNLI | benchmarks |