subjectId
inScope
Conversational tool-use benchmark: the agent must complete user-facing tasks (airline and retail domains) over multi-turn dialogue while invoking domain tools and respecting policy.
outOfScope
Single-turn evaluations, code-generation tasks, and benchmarks without a tool-use harness.
outOfScopeReasonIds