Agentic AI Atlas by a5c.ai
Agentic AI Atlas · EvalRun

Node kind ledger

EvalRun

Page 1 of 2

EvalRun records

Browse all EvalRun records in the current atlas snapshot.

Cluster · benchmarks
Total · 70
Visible · 70

Filters & facets · 7 groups

configHash

sha256:placeholder-gpt-5-evalplus · 2
sha256:placeholder-qwen-2-5-72b-mmlu · 1
sha256:placeholder-qwen-2-5-72b-humaneval · 1
sha256:placeholder-qwen-2-5-coder-32b-humaneval · 1
sha256:placeholder-qwen-2-5-coder-32b-lcb · 1
sha256:placeholder-qwen-2-5-coder-32b-mbpp · 1
sha256:placeholder-claude-haiku-4-5-swe-bench-verified · 1
sha256:placeholder-claude-haiku-4-5-gpqa · 1
sha256:placeholder-claude-sonnet-4-6-human-eval · 1
sha256:placeholder-claude-sonnet-4-6-mmlu · 1
sha256:placeholder-claude-sonnet-4-5-bfcl-v3 · 1
sha256:placeholder-claude-opus-4-5-gpqa-diamond · 1

target

model:gpt-5@current · 9
model:claude-sonnet-4-5@current · 8
model:gemini-2-5-pro@current · 6
model:claude-opus-4-5@current · 5
model:qwen-2-5-coder-32b@current · 3
model:deepseek-v3@current · 3
model:deepseek-r1@current · 3
model:llama-4-405b-instruct@current · 3
model:llama-3-1-405b-instruct@current · 3
model:qwen-2-5-72b-instruct@current · 2
model:claude-haiku-4-5@current · 2
model:claude-sonnet-4-6@current · 2

targetId

model:gpt-5@current · 9
model:claude-sonnet-4-5@current · 8
model:gemini-2-5-pro@current · 6
model:claude-opus-4-5@current · 5
model:qwen-2-5-coder-32b@current · 3
model:deepseek-v3@current · 3
model:deepseek-r1@current · 3
model:llama-4-405b-instruct@current · 3
model:llama-3-1-405b-instruct@current · 3
model:qwen-2-5-72b-instruct@current · 2
model:claude-haiku-4-5@current · 2
model:claude-sonnet-4-6@current · 2

runAt

2025-09-29T00:00:00Z · 13
2025-08-07T00:00:00Z · 9
2025-06-17T00:00:00Z · 7
2024-07-23T00:00:00Z · 6
2024-11-12T00:00:00Z · 3
2024-12-26T00:00:00Z · 3
2025-01-20T00:00:00Z · 3
2024-09-19T00:00:00Z · 2
2025-10-15T00:00:00Z · 2
2025-11-15T00:00:00Z · 2
2024-12-06T00:00:00Z · 2
2024-07-24T00:00:00Z · 2

benchmarkId

benchmark:mmlu · 12
benchmark:swe-bench-verified · 12
benchmark:gpqa · 12
benchmark:human-eval · 10
benchmark:livecodebench · 3
benchmark:bigcode-evalplus · 3
benchmark:math · 3
benchmark:berkeley-function-calling · 2
benchmark:gsm8k · 2
benchmark:mbpp · 1
benchmark:os-world · 1
benchmark:truthful-qa · 1

runBy

anthropic · 16
openai · 11
google-deepmind · 9
meta · 7
deepseek · 6
qwen-team · 5
mistral · 4
evalplus-leaderboard · 3
berkeley-gorilla · 2
google · 2
@a5c-ai/team · 2
artificial-analysis · 1

testSetId

test-set:swe-bench-verified-2024-12 · 26
test-set:gpqa-diamond-2024 · 12
test-set:bfcl-v3 · 2
test-set:truthful-qa-mc · 1
test-set:gaia-validation · 1
id | displayName | cluster
eval-run:android-world.gemini-2-5-pro.2025-06 | eval-run:android-world.gemini-2-5-pro.2025-06 | benchmarks
eval-run:arc-challenge.claude-sonnet-4-5.2025-09 | eval-run:arc-challenge.claude-sonnet-4-5.2025-09 | benchmarks
eval-run:bfcl.claude-sonnet-4-5.2025-09 | eval-run:bfcl.claude-sonnet-4-5.2025-09 | benchmarks
eval-run:bfcl.gpt-5.2025-08 | eval-run:bfcl.gpt-5.2025-08 | benchmarks
eval-run:evalplus.gpt-5.2025-08 | eval-run:evalplus.gpt-5.2025-08 | benchmarks
eval-run:gaia.claude-code.2025 | eval-run:gaia.claude-code.2025 | benchmarks
eval-run:gpqa-diamond.claude-opus-4-5.2025-09 | eval-run:gpqa-diamond.claude-opus-4-5.2025-09 | benchmarks
eval-run:gpqa-diamond.gemini-2-5-pro.2025-06 | eval-run:gpqa-diamond.gemini-2-5-pro.2025-06 | benchmarks
eval-run:gpqa-diamond.gemini-3-1-pro.2026-02-19 | eval-run:gpqa-diamond.gemini-3-1-pro.2026-02-19 | benchmarks
eval-run:gpqa-diamond.gemini-3-pro.2025-11-18 | eval-run:gpqa-diamond.gemini-3-pro.2025-11-18 | benchmarks
eval-run:gpqa-diamond.gpt-5-4-mini.2026-03-17 | eval-run:gpqa-diamond.gpt-5-4-mini.2026-03-17 | benchmarks
eval-run:gpqa-diamond.gpt-5-4.2026-03-17 | eval-run:gpqa-diamond.gpt-5-4.2026-03-17 | benchmarks
eval-run:gpqa-diamond.gpt-5.2025-08 | eval-run:gpqa-diamond.gpt-5.2025-08 | benchmarks
eval-run:gpqa.claude-haiku-4-5.2025-10 | eval-run:gpqa.claude-haiku-4-5.2025-10 | benchmarks
eval-run:gpqa.claude-sonnet-4-5.2025-09 | eval-run:gpqa.claude-sonnet-4-5.2025-09 | benchmarks
eval-run:gpqa.deepseek-r1.2025-01 | eval-run:gpqa.deepseek-r1.2025-01 | benchmarks
eval-run:gpqa.gemini-2-5-pro.2025-06 | eval-run:gpqa.gemini-2-5-pro.2025-06 | benchmarks
eval-run:gpqa.gpt-5.2025-08 | eval-run:gpqa.gpt-5.2025-08 | benchmarks
eval-run:gsm8k.claude-sonnet-4-5.2025-09 | eval-run:gsm8k.claude-sonnet-4-5.2025-09 | benchmarks
eval-run:gsm8k.gemma-2-27b.2024-06 | eval-run:gsm8k.gemma-2-27b.2024-06 | benchmarks
eval-run:harmbench.claude-opus-4-5.2025-09 | eval-run:harmbench.claude-opus-4-5.2025-09 | benchmarks
eval-run:hellaswag.claude-opus-4-5.2025-09 | eval-run:hellaswag.claude-opus-4-5.2025-09 | benchmarks
eval-run:human-eval-plus.claude-sonnet-4-5.2025-09 | eval-run:human-eval-plus.claude-sonnet-4-5.2025-09 | benchmarks
eval-run:human-eval-plus.gpt-5.2025-08 | eval-run:human-eval-plus.gpt-5.2025-08 | benchmarks
eval-run:human-eval.claude-sonnet-4-6.2025-11 | eval-run:human-eval.claude-sonnet-4-6.2025-11 | benchmarks
eval-run:human-eval.codestral-25-01.2025-01 | eval-run:human-eval.codestral-25-01.2025-01 | benchmarks
eval-run:human-eval.deepseek-v3.2024-12 | eval-run:human-eval.deepseek-v3.2024-12 | benchmarks
eval-run:human-eval.gpt-5.2025-08 | eval-run:human-eval.gpt-5.2025-08 | benchmarks
eval-run:human-eval.llama-3-1-405b.2024-07 | eval-run:human-eval.llama-3-1-405b.2024-07 | benchmarks
eval-run:human-eval.llama-3-3-70b.2024-12 | eval-run:human-eval.llama-3-3-70b.2024-12 | benchmarks
eval-run:human-eval.llama-4-405b.2024-07 | eval-run:human-eval.llama-4-405b.2024-07 | benchmarks
eval-run:human-eval.mistral-large-2.2024-07 | eval-run:human-eval.mistral-large-2.2024-07 | benchmarks
eval-run:human-eval.qwen-2-5-72b.2024-09 | eval-run:human-eval.qwen-2-5-72b.2024-09 | benchmarks
eval-run:human-eval.qwen-2-5-coder-32b.2024-11 | eval-run:human-eval.qwen-2-5-coder-32b.2024-11 | benchmarks
eval-run:livecodebench.gemini-2-5-pro.2025-06 | eval-run:livecodebench.gemini-2-5-pro.2025-06 | benchmarks
eval-run:livecodebench.gpt-5.2025-08 | eval-run:livecodebench.gpt-5.2025-08 | benchmarks
eval-run:livecodebench.qwen-2-5-coder-32b.2024-11 | eval-run:livecodebench.qwen-2-5-coder-32b.2024-11 | benchmarks
eval-run:math.deepseek-r1.2025-01 | eval-run:math.deepseek-r1.2025-01 | benchmarks
eval-run:math.gpt-5.2025-08 | eval-run:math.gpt-5.2025-08 | benchmarks
eval-run:math.o3.2025-04 | eval-run:math.o3.2025-04 | benchmarks
eval-run:mbpp.qwen-2-5-coder-32b.2024-11 | eval-run:mbpp.qwen-2-5-coder-32b.2024-11 | benchmarks
eval-run:mgsm.gemini-2-5-pro.2025-06 | eval-run:mgsm.gemini-2-5-pro.2025-06 | benchmarks
eval-run:mmlu.claude-sonnet-4-6.2025-11 | eval-run:mmlu.claude-sonnet-4-6.2025-11 | benchmarks
eval-run:mmlu.command-r-plus.2024-08 | eval-run:mmlu.command-r-plus.2024-08 | benchmarks
eval-run:mmlu.deepseek-r1.2025-01 | eval-run:mmlu.deepseek-r1.2025-01 | benchmarks
eval-run:mmlu.deepseek-v3.2024-12 | eval-run:mmlu.deepseek-v3.2024-12 | benchmarks
eval-run:mmlu.gemma-2-27b.2024-06 | eval-run:mmlu.gemma-2-27b.2024-06 | benchmarks
eval-run:mmlu.llama-3-1-405b.2024-07 | eval-run:mmlu.llama-3-1-405b.2024-07 | benchmarks
eval-run:mmlu.llama-3-3-70b.2024-12 | eval-run:mmlu.llama-3-3-70b.2024-12 | benchmarks
eval-run:mmlu.llama-4-405b.2024-07 | eval-run:mmlu.llama-4-405b.2024-07 | benchmarks
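
The ids listed above appear to share the shape eval-run:<benchmark>.<model>.<period>, where the period is a year, a year-month, or a full date. The page does not state this convention, so the following Python sketch treats it as an assumption inferred from the visible rows; the names EvalRunId, parse_eval_run_id, and group_by_benchmark are hypothetical helpers and not part of any Atlas API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple

PREFIX = "eval-run:"  # prefix seen on every id in the listing


class EvalRunId(NamedTuple):
    """Components of an id, assuming the inferred three-part shape."""
    benchmark: str
    model: str
    period: str  # "2025", "2025-06", or "2026-02-19" in the visible rows


def parse_eval_run_id(raw: str) -> EvalRunId:
    """Split an id into benchmark, model, and period.

    Assumes the parts are separated by dots and that none of the
    parts contains a dot, which holds for the rows shown on this page.
    """
    if not raw.startswith(PREFIX):
        raise ValueError(f"not an eval-run id: {raw!r}")
    parts = raw[len(PREFIX):].split(".")
    if len(parts) != 3:
        raise ValueError(f"expected 3 dot-separated parts, got {len(parts)}: {raw!r}")
    return EvalRunId(*parts)


def group_by_benchmark(ids: Iterable[str]) -> Dict[str, List[EvalRunId]]:
    """Group parsed ids by their benchmark component."""
    groups: Dict[str, List[EvalRunId]] = defaultdict(list)
    for raw in ids:
        parsed = parse_eval_run_id(raw)
        groups[parsed.benchmark].append(parsed)
    return dict(groups)


# A few ids copied from the listing above.
sample_ids = [
    "eval-run:gaia.claude-code.2025",
    "eval-run:gpqa-diamond.gemini-3-1-pro.2026-02-19",
    "eval-run:mmlu.deepseek-r1.2025-01",
    "eval-run:mmlu.deepseek-v3.2024-12",
]
groups = group_by_benchmark(sample_ids)
```

The benchmark component of an id is not necessarily the same string as the benchmarkId facet value (for example, the id prefix bfcl versus the facet value benchmark:berkeley-function-calling), so grouping by id is only a rough proxy for the facet counts.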

Active filters

No active facet filters.
