III.
Node kind ledger
Benchmark
Benchmark records
Browse all Benchmark records in the current atlas snapshot.
description: Massive Multitask Language Understanding — 57-subject knowledge benchmark.
homepageUrl: https://github.com/THUDM/AgentBench
kind: research-engineering
Filters & facets: 3 active · 4 groups
homepageUrl
https://www.swebench.com/ · 2
https://github.com/THUDM/AgentBench · 1
https://github.com/hendrycks/apps · 1
https://os-world.github.io/ · 1
https://google-research.github.io/android_world/ · 1
https://metr.org/AI_R_D_Evaluation_Report.pdf · 1
https://appworld.dev/ · 1
https://assistantbench.github.io/ · 1
https://the-agent-company.com/ · 1
https://agentclinic.github.io/ · 1
https://osu-nlp-group.github.io/TravelPlanner/ · 1
https://openai.com/index/browsecomp/ · 1
kind
description
General AI Assistants benchmark — real-world agent reasoning tasks. · 1
Hand-written programming problems for evaluating code generation. · 1
MBPP+ from EvalPlus — augmented MBPP with substantially expanded test suites. · 1
Machine learning engineering tasks drawn from Kaggle competitions. · 1
Massive Multitask Language Understanding — 57-subject knowledge benchmark. · 1
Real-world software engineering issues from open-source Python repos. · 1
targetsKind
| id | displayName | cluster |
|---|---|---|
| No records match the current filters. | | |