subjectId
inScope
Sentence-completion commonsense-reasoning benchmark with 10k multiple-choice items over everyday-activity narratives. Scored by accuracy.
outOfScope
Code generation, mathematical reasoning, agentic tool use, and free-form generation quality.
outOfScopeReasonIds