subjectId
inScope
Multi-domain agent leaderboard (web, embodied, tool-use, game) with per-task success metrics across heterogeneous environments.
outOfScope
Language-only benchmarks (MMLU, HellaSwag), code-only suites (HumanEval, MBPP), and any benchmark without an interactive environment harness.
outOfScopeReasonIds