Agentic AI Atlas by a5c.ai

Agentic AI Atlas · Benchmark

14 records

III.

Node kind ledger

Benchmark

Benchmark records

Browse all Benchmark records in the current atlas snapshot.

Cluster: benchmarks · Total: 65 · Visible: 14

Active filter · kind: model-only

Filters & facets · 1 active · 4 groups

homepageUrl

https://www.swebench.com/ · 2
https://github.com/THUDM/AgentBench · 1
https://github.com/hendrycks/apps · 1
https://os-world.github.io/ · 1
https://google-research.github.io/android_world/ · 1
https://metr.org/AI_R_D_Evaluation_Report.pdf · 1
https://appworld.dev/ · 1
https://assistantbench.github.io/ · 1
https://the-agent-company.com/ · 1
https://agentclinic.github.io/ · 1
https://osu-nlp-group.github.io/TravelPlanner/ · 1
https://openai.com/index/browsecomp/ · 1

kind

model-only · 14
code-generation · 7
full-stack · 7
web-agent · 7
reasoning · 5
math · 4
tool-use · 3
domain-specific · 2
agent-leaderboard · 2
knowledge · 2
research-engineering · 1
planning · 1

description

General AI Assistants benchmark: real-world agent reasoning tasks. · 1
Hand-written programming problems for evaluating code generation. · 1
MBPP+ from EvalPlus: augmented MBPP with substantially expanded test suites. · 1
Machine learning engineering tasks drawn from Kaggle competitions. · 1
Massive Multitask Language Understanding: 57-subject knowledge benchmark. · 1
Real-world software engineering issues from open-source Python repos. · 1

targetsKind

ModelVersion · 37
AgentVersion · 28

id	displayName	cluster
benchmark:advbench	AdvBench	benchmarks
benchmark:arc-challenge	ARC-Challenge	benchmarks
benchmark:berkeley-function-calling	Berkeley Function Calling Leaderboard (BFCL)	benchmarks
benchmark:bias-bench	BBQ (Bias Benchmark for QA)	benchmarks
benchmark:flores-200	FLORES-200	benchmarks
benchmark:gpqa	GPQA	benchmarks
benchmark:harmbench	HarmBench	benchmarks
benchmark:jailbreakbench	JailbreakBench	benchmarks
benchmark:m-mmlu	Multilingual MMLU (mMMLU)	benchmarks
benchmark:mgsm	MGSM	benchmarks
benchmark:olympiad-bench	OlympiadBench	benchmarks
benchmark:promptbench	PromptBench	benchmarks
benchmark:truthful-qa	TruthfulQA	benchmarks
benchmark:xnli	XNLI	benchmarks