Agentic AI Atlas by a5c.ai

III.

Node kind ledger

Benchmark

Benchmark records

Browse all Benchmark records in the current atlas snapshot.

Cluster · benchmarks
Total · 65
Visible · 0

Active filters:
description: Hand-written programming problems for evaluating code generation.
kind: knowledge

Filters & facets: 2 active · 4 groups

homepageUrl

https://www.swebench.com/ · 2
https://github.com/THUDM/AgentBench · 1
https://github.com/hendrycks/apps · 1
https://os-world.github.io/ · 1
https://google-research.github.io/android_world/ · 1
https://metr.org/AI_R_D_Evaluation_Report.pdf · 1
https://appworld.dev/ · 1
https://assistantbench.github.io/ · 1
https://the-agent-company.com/ · 1
https://agentclinic.github.io/ · 1
https://osu-nlp-group.github.io/TravelPlanner/ · 1
https://openai.com/index/browsecomp/ · 1

kind

model-only · 14
code-generation · 7
full-stack · 7
web-agent · 7
reasoning · 5
math · 4
tool-use · 3
domain-specific · 2
agent-leaderboard · 2
knowledge · 2
research-engineering · 1
planning · 1

description

General AI Assistants benchmark: real-world agent reasoning tasks. · 1
Hand-written programming problems for evaluating code generation. · 1
MBPP+ from EvalPlus: augmented MBPP with substantially expanded test suites. · 1
Machine learning engineering tasks drawn from Kaggle competitions. · 1
Massive Multitask Language Understanding: 57-subject knowledge benchmark. · 1
Real-world software engineering issues from open-source Python repos. · 1

targetsKind

ModelVersion · 37
AgentVersion · 28

id	displayName	cluster
No records match the current filters.