subjectId
inScope
Terminal-task benchmark: the agent operates a real shell to complete sysadmin and dev-tooling tasks; scored by per-task success and step budget.
outOfScope
GUI-based agents, browser-only tasks (use WebArena), and benchmarks without a real-shell harness.
outOfScopeReasonIds