II. TestSet overview
Reference · live · test-set:terminal-bench-v1
Terminal-Bench v1 overview
Canonical Terminal-Bench v1 set referenced in the original paper and the public leaderboard.
Attributes
displayName: Terminal-Bench v1
benchmarkId:
caseCount: 80
releasedAt: 2024-10-01
composition: The v1 release of Terminal-Bench from Stanford NLP / Princeton. Each task is a multi-step shell scenario evaluated end-to-end in a Docker sandbox; success requires the agent to reach a target file state via real shell commands.
homepageUrl:
description: Canonical Terminal-Bench v1 set referenced in the original paper and the public leaderboard.
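The composition attribute says a task succeeds when the agent reaches a target file state. A minimal sketch of what such a check could look like follows. It is not the Terminal-Bench harness: the function names, the digest-based comparison, and the example path out/result.txt are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def reached_target_state(workdir: Path, expected: dict[str, str]) -> bool:
    """Return True only if every expected file exists with the expected digest.

    `expected` maps paths relative to `workdir` to SHA-256 hex digests.
    Hypothetical helper; the real benchmark may judge success differently.
    """
    for rel_path, digest in expected.items():
        candidate = workdir / rel_path
        if not candidate.is_file() or file_digest(candidate) != digest:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical target: one file with known contents.
        expected = {"out/result.txt": hashlib.sha256(b"done\n").hexdigest()}
        print(reached_target_state(root, expected))  # file not created yet
        (root / "out").mkdir()
        (root / "out" / "result.txt").write_bytes(b"done\n")
        print(reached_target_state(root, expected))  # file now matches
```

Comparing digests rather than raw contents keeps the expected state compact. In a sandboxed setup, a check like this would run against the container's filesystem after the agent finishes issuing commands.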
Outgoing edges
belongs_to_benchmark (1)
- benchmark:terminal-bench · Benchmark · Terminal-Bench
Incoming edges
None.
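The record above can be pictured as a node with attributes and labeled edges. The sketch below is one possible in-memory shape, not the schema of the system that serves this page. The class and field names are hypothetical, the identifier is read as test-set:terminal-bench-v1 on the assumption that "live" in the header is a separate status label, and attributes shown without a value are kept as None rather than guessed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Edge:
    label: str   # edge type, e.g. "belongs_to_benchmark"
    target: str  # id of the node the edge points to


@dataclass
class TestSetRecord:
    id: str
    display_name: str
    case_count: int
    released_at: str                     # ISO date string as shown in the record
    benchmark_id: Optional[str] = None   # no value shown in the record
    homepage_url: Optional[str] = None   # no value shown in the record
    outgoing: list[Edge] = field(default_factory=list)
    incoming: list[Edge] = field(default_factory=list)

    def targets(self, label: str) -> list[str]:
        """Return the ids reached by outgoing edges with the given label."""
        return [e.target for e in self.outgoing if e.label == label]


record = TestSetRecord(
    id="test-set:terminal-bench-v1",  # assumed reading of the page header
    display_name="Terminal-Bench v1",
    case_count=80,
    released_at="2024-10-01",
    outgoing=[Edge("belongs_to_benchmark", "benchmark:terminal-bench")],
)

if __name__ == "__main__":
    print(record.targets("belongs_to_benchmark"))
```

Keeping edges as a list of labeled pairs mirrors the "Outgoing edges" and "Incoming edges" sections, where each label groups the targets it points to.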