Agentic AI Atlas by a5c.ai
Agentic AI Atlas · EvalRun
70 records · a5c.ai

Node kind ledger

EvalRun

Page 2 of 2

EvalRun records

Browse all EvalRun records in the current atlas snapshot.

Cluster · benchmarks
Total · 70
Visible · 70

Filters & facets (7 groups)

configHash

sha256:placeholder-gpt-5-evalplus · 2
sha256:placeholder-qwen-2-5-72b-mmlu · 1
sha256:placeholder-qwen-2-5-72b-humaneval · 1
sha256:placeholder-qwen-2-5-coder-32b-humaneval · 1
sha256:placeholder-qwen-2-5-coder-32b-lcb · 1
sha256:placeholder-qwen-2-5-coder-32b-mbpp · 1
sha256:placeholder-claude-haiku-4-5-swe-bench-verified · 1
sha256:placeholder-claude-haiku-4-5-gpqa · 1
sha256:placeholder-claude-sonnet-4-6-human-eval · 1
sha256:placeholder-claude-sonnet-4-6-mmlu · 1
sha256:placeholder-claude-sonnet-4-5-bfcl-v3 · 1
sha256:placeholder-claude-opus-4-5-gpqa-diamond · 1

target

model:gpt-5@current · 9
model:claude-sonnet-4-5@current · 8
model:gemini-2-5-pro@current · 6
model:claude-opus-4-5@current · 5
model:qwen-2-5-coder-32b@current · 3
model:deepseek-v3@current · 3
model:deepseek-r1@current · 3
model:llama-4-405b-instruct@current · 3
model:llama-3-1-405b-instruct@current · 3
model:qwen-2-5-72b-instruct@current · 2
model:claude-haiku-4-5@current · 2
model:claude-sonnet-4-6@current · 2

targetId

model:gpt-5@current · 9
model:claude-sonnet-4-5@current · 8
model:gemini-2-5-pro@current · 6
model:claude-opus-4-5@current · 5
model:qwen-2-5-coder-32b@current · 3
model:deepseek-v3@current · 3
model:deepseek-r1@current · 3
model:llama-4-405b-instruct@current · 3
model:llama-3-1-405b-instruct@current · 3
model:qwen-2-5-72b-instruct@current · 2
model:claude-haiku-4-5@current · 2
model:claude-sonnet-4-6@current · 2

runAt

2025-09-29T00:00:00Z · 13
2025-08-07T00:00:00Z · 9
2025-06-17T00:00:00Z · 7
2024-07-23T00:00:00Z · 6
2024-11-12T00:00:00Z · 3
2024-12-26T00:00:00Z · 3
2025-01-20T00:00:00Z · 3
2024-09-19T00:00:00Z · 2
2025-10-15T00:00:00Z · 2
2025-11-15T00:00:00Z · 2
2024-12-06T00:00:00Z · 2
2024-07-24T00:00:00Z · 2

benchmarkId

benchmark:mmlu · 12
benchmark:swe-bench-verified · 12
benchmark:gpqa · 12
benchmark:human-eval · 10
benchmark:livecodebench · 3
benchmark:bigcode-evalplus · 3
benchmark:math · 3
benchmark:berkeley-function-calling · 2
benchmark:gsm8k · 2
benchmark:mbpp · 1
benchmark:os-world · 1
benchmark:truthful-qa · 1

runBy

anthropic · 16
openai · 11
google-deepmind · 9
meta · 7
deepseek · 6
qwen-team · 5
mistral · 4
evalplus-leaderboard · 3
berkeley-gorilla · 2
google · 2
@a5c-ai/team · 2
artificial-analysis · 1

testSetId

test-set:swe-bench-verified-2024-12 · 26
test-set:gpqa-diamond-2024 · 12
test-set:bfcl-v3 · 2
test-set:truthful-qa-mc · 1
test-set:gaia-validation · 1
id | displayName | cluster
eval-run:mmlu.mistral-large-2.2024-07 | eval-run:mmlu.mistral-large-2.2024-07 | benchmarks
eval-run:mmlu.o1.2024-12 | eval-run:mmlu.o1.2024-12 | benchmarks
eval-run:mmlu.phi-3-medium.2024-05 | eval-run:mmlu.phi-3-medium.2024-05 | benchmarks
eval-run:mmlu.qwen-2-5-72b.2024-09 | eval-run:mmlu.qwen-2-5-72b.2024-09 | benchmarks
eval-run:multipl-e.codestral-25-01.2025-01 | eval-run:multipl-e.codestral-25-01.2025-01 | benchmarks
eval-run:os-world.claude-sonnet-4-5.2025-09 | eval-run:os-world.claude-sonnet-4-5.2025-09 | benchmarks
eval-run:swe-bench-verified.claude-haiku-4-5.2025-10 | eval-run:swe-bench-verified.claude-haiku-4-5.2025-10 | benchmarks
eval-run:swe-bench-verified.claude-opus-4-5.2025-09 | eval-run:swe-bench-verified.claude-opus-4-5.2025-09 | benchmarks
eval-run:swe-bench-verified.claude-opus-4-7.2026-01 | eval-run:swe-bench-verified.claude-opus-4-7.2026-01 | benchmarks
eval-run:swe-bench-verified.claude-sonnet-4-5.2025-09 | eval-run:swe-bench-verified.claude-sonnet-4-5.2025-09 | benchmarks
eval-run:swe-bench-verified.gemini-2-5-flash.2025-06 | eval-run:swe-bench-verified.gemini-2-5-flash.2025-06 | benchmarks
eval-run:swe-bench-verified.gemini-2-5-pro.2025-06 | eval-run:swe-bench-verified.gemini-2-5-pro.2025-06 | benchmarks
eval-run:swe-bench-verified.gpt-5.2025-08 | eval-run:swe-bench-verified.gpt-5.2025-08 | benchmarks
eval-run:swe-bench-verified.llama-4-405b.2024-07 | eval-run:swe-bench-verified.llama-4-405b.2024-07 | benchmarks
eval-run:swe-bench-verified.o3.2025-04 | eval-run:swe-bench-verified.o3.2025-04 | benchmarks
eval-run:swe-bench.claude-code@1.x.2025-04-29 | eval-run:swe-bench.claude-code@1.x.2025-04-29 | benchmarks
eval-run:swe-bench.deepseek-v3.2024-12 | eval-run:swe-bench.deepseek-v3.2024-12 | benchmarks
eval-run:swe-bench.llama-3-1-405b.2024-07 | eval-run:swe-bench.llama-3-1-405b.2024-07 | benchmarks
eval-run:terminal-bench.claude-sonnet-4-5.2025-09 | eval-run:terminal-bench.claude-sonnet-4-5.2025-09 | benchmarks
eval-run:truthful-qa.claude-opus-4-5.2025-09 | eval-run:truthful-qa.claude-opus-4-5.2025-09 | benchmarks
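
The record ids listed above appear to follow the pattern eval-run:<benchmark>.<model>.<date>. That pattern is inferred from the listing only; the atlas does not document it here. The following is a minimal Python sketch of splitting an id into those three parts under that assumption. The names EvalRunId and parse_eval_run_id are hypothetical and are not part of any atlas API. Because a model segment can itself contain a dot (for example claude-code@1.x), the sketch takes the benchmark from before the first dot and the date from after the last dot.

```python
from typing import NamedTuple


class EvalRunId(NamedTuple):
    """Hypothetical container for the three inferred id segments."""
    benchmark: str
    model: str
    period: str  # e.g. "2024-12" or "2025-04-29", kept as a raw string


def parse_eval_run_id(record_id: str) -> EvalRunId:
    """Split an id of the assumed form eval-run:<benchmark>.<model>.<date>."""
    prefix = "eval-run:"
    if not record_id.startswith(prefix):
        raise ValueError(f"not an eval-run id: {record_id!r}")
    body = record_id[len(prefix):]
    # Benchmark is everything before the first dot.
    benchmark, first_sep, rest = body.partition(".")
    # Date is everything after the last dot; the model keeps any inner dots.
    model, last_sep, period = rest.rpartition(".")
    if not (first_sep and last_sep and benchmark and model and period):
        raise ValueError(f"unexpected id shape: {record_id!r}")
    return EvalRunId(benchmark, model, period)


if __name__ == "__main__":
    print(parse_eval_run_id("eval-run:swe-bench.claude-code@1.x.2025-04-29"))
```

The date segment is left as a string because the listing mixes month-level (2024-12) and day-level (2025-04-29) values.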

Active filters

No active facet filters.

Sort

id-asc
id-desc
name-asc
name-desc