subjectId
inScope
Holistic agent benchmark spanning operating system (OS), database (DB), web shopping, web browsing, knowledge graph, card game, lateral-thinking, and house-holding tasks.
outOfScope
Single-turn, language-only evaluation; repository-scale software-engineering tasks (use SWE-bench); and pure code-generation suites.
outOfScopeReasonIds