Agentic AI Atlas by a5c.ai
Agentic AI Atlas · EvalResult
73 records · a5c.ai

EvalResult

Page 1 of 2

EvalResult records

Browse all EvalResult records in the current atlas snapshot.

Cluster · benchmarks
Total · 73
Visible · 73

Filters & facets (4 groups)

evalRunId

eval-run:swe-bench-verified.gpt-5.2025-08 · 3
eval-run:swe-bench-verified.claude-sonnet-4-5.2025-09 · 2
eval-run:mmlu.qwen-2-5-72b.2024-09 · 1
eval-run:human-eval.qwen-2-5-72b.2024-09 · 1
eval-run:human-eval.qwen-2-5-coder-32b.2024-11 · 1
eval-run:livecodebench.qwen-2-5-coder-32b.2024-11 · 1
eval-run:mbpp.qwen-2-5-coder-32b.2024-11 · 1
eval-run:swe-bench-verified.claude-haiku-4-5.2025-10 · 1
eval-run:gpqa.claude-haiku-4-5.2025-10 · 1
eval-run:human-eval.claude-sonnet-4-6.2025-11 · 1
eval-run:mmlu.claude-sonnet-4-6.2025-11 · 1
eval-run:bfcl.claude-sonnet-4-5.2025-09 · 1

reportedAt

2025-09-29T00:00:00Z · 14
2025-08-07T00:00:00Z · 11
2025-06-17T00:00:00Z · 7
2024-07-23T00:00:00Z · 6
2024-11-12T00:00:00Z · 3
2024-12-26T00:00:00Z · 3
2025-01-20T00:00:00Z · 3
2024-09-19T00:00:00Z · 2
2025-10-15T00:00:00Z · 2
2025-11-15T00:00:00Z · 2
2026-05-04T00:00:00Z · 2
2024-12-06T00:00:00Z · 2

metricName

accuracy · 34
pass_rate · 18
pass@1 · 13
success_rate · 2
resolved_rate · 2
mc2 · 1
attack_success_rate · 1
pass_rate_high_compute · 1
pass_rate_headline · 1

unit

fraction · 72
pct · 1
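Because the unit facet mixes `fraction` (72 records) with `pct` (1 record), scores may need to be put on a common scale before comparison. The following is a minimal sketch, not part of the atlas itself: the function name `to_fraction` is hypothetical, and it assumes that `pct` denotes a 0 to 100 percentage while `fraction` denotes a 0 to 1 value.

```python
def to_fraction(value: float, unit: str) -> float:
    """Return the score on a 0-1 scale.

    Assumption (not confirmed by the atlas): "pct" is a 0-100 percentage
    and "fraction" is already on a 0-1 scale.
    """
    if unit == "fraction":
        return value
    if unit == "pct":
        return value / 100.0
    # Fail loudly on any unit label not seen in the facet list.
    raise ValueError(f"unknown unit: {unit!r}")


if __name__ == "__main__":
    print(to_fraction(0.5, "fraction"))
    print(to_fraction(75, "pct"))
```

Raising on unknown labels, rather than passing values through, keeps a newly introduced unit from being compared silently on the wrong scale.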
id | displayName | cluster
eval-result:android-world.gemini-2-5-pro.001 | eval-result:android-world.gemini-2-5-pro.001 | benchmarks
eval-result:arc-challenge.claude-sonnet-4-5.001 | eval-result:arc-challenge.claude-sonnet-4-5.001 | benchmarks
eval-result:bfcl.claude-sonnet-4-5.001 | eval-result:bfcl.claude-sonnet-4-5.001 | benchmarks
eval-result:bfcl.gpt-5.001 | eval-result:bfcl.gpt-5.001 | benchmarks
eval-result:evalplus.gpt-5.001 | eval-result:evalplus.gpt-5.001 | benchmarks
eval-result:gaia.claude-code.001 | eval-result:gaia.claude-code.001 | benchmarks
eval-result:gpqa-diamond.claude-opus-4-5.001 | eval-result:gpqa-diamond.claude-opus-4-5.001 | benchmarks
eval-result:gpqa-diamond.gemini-2-5-pro.001 | eval-result:gpqa-diamond.gemini-2-5-pro.001 | benchmarks
eval-result:gpqa-diamond.gemini-3-1-pro.2026-02-19.accuracy | eval-result:gpqa-diamond.gemini-3-1-pro.2026-02-19.accuracy | benchmarks
eval-result:gpqa-diamond.gemini-3-pro.2025-11-18.accuracy | eval-result:gpqa-diamond.gemini-3-pro.2025-11-18.accuracy | benchmarks
eval-result:gpqa-diamond.gpt-5-4-mini.2026-03-17.accuracy | eval-result:gpqa-diamond.gpt-5-4-mini.2026-03-17.accuracy | benchmarks
eval-result:gpqa-diamond.gpt-5-4.2026-03-17.accuracy | eval-result:gpqa-diamond.gpt-5-4.2026-03-17.accuracy | benchmarks
eval-result:gpqa-diamond.gpt-5.001 | eval-result:gpqa-diamond.gpt-5.001 | benchmarks
eval-result:gpqa.claude-haiku-4-5.001 | eval-result:gpqa.claude-haiku-4-5.001 | benchmarks
eval-result:gpqa.claude-sonnet-4-5.001 | eval-result:gpqa.claude-sonnet-4-5.001 | benchmarks
eval-result:gpqa.deepseek-r1.001 | eval-result:gpqa.deepseek-r1.001 | benchmarks
eval-result:gpqa.gemini-2-5-pro.001 | eval-result:gpqa.gemini-2-5-pro.001 | benchmarks
eval-result:gpqa.gpt-5.001 | eval-result:gpqa.gpt-5.001 | benchmarks
eval-result:gsm8k.claude-sonnet-4-5.001 | eval-result:gsm8k.claude-sonnet-4-5.001 | benchmarks
eval-result:gsm8k.gemma-2-27b.001 | eval-result:gsm8k.gemma-2-27b.001 | benchmarks
eval-result:harmbench.claude-opus-4-5.001 | eval-result:harmbench.claude-opus-4-5.001 | benchmarks
eval-result:hellaswag.claude-opus-4-5.001 | eval-result:hellaswag.claude-opus-4-5.001 | benchmarks
eval-result:human-eval-plus.claude-sonnet-4-5.001 | eval-result:human-eval-plus.claude-sonnet-4-5.001 | benchmarks
eval-result:human-eval-plus.gpt-5.001 | eval-result:human-eval-plus.gpt-5.001 | benchmarks
eval-result:human-eval.claude-sonnet-4-6.001 | eval-result:human-eval.claude-sonnet-4-6.001 | benchmarks
eval-result:human-eval.codestral-25-01.001 | eval-result:human-eval.codestral-25-01.001 | benchmarks
eval-result:human-eval.deepseek-v3.001 | eval-result:human-eval.deepseek-v3.001 | benchmarks
eval-result:human-eval.gpt-5.001 | eval-result:human-eval.gpt-5.001 | benchmarks
eval-result:human-eval.llama-3-1-405b.001 | eval-result:human-eval.llama-3-1-405b.001 | benchmarks
eval-result:human-eval.llama-3-3-70b.001 | eval-result:human-eval.llama-3-3-70b.001 | benchmarks
eval-result:human-eval.llama-4-405b.001 | eval-result:human-eval.llama-4-405b.001 | benchmarks
eval-result:human-eval.mistral-large-2.001 | eval-result:human-eval.mistral-large-2.001 | benchmarks
eval-result:human-eval.qwen-2-5-72b.001 | eval-result:human-eval.qwen-2-5-72b.001 | benchmarks
eval-result:human-eval.qwen-2-5-coder-32b.001 | eval-result:human-eval.qwen-2-5-coder-32b.001 | benchmarks
eval-result:livecodebench.gemini-2-5-pro.001 | eval-result:livecodebench.gemini-2-5-pro.001 | benchmarks
eval-result:livecodebench.gpt-5.001 | eval-result:livecodebench.gpt-5.001 | benchmarks
eval-result:livecodebench.qwen-2-5-coder-32b.001 | eval-result:livecodebench.qwen-2-5-coder-32b.001 | benchmarks
eval-result:math.deepseek-r1.001 | eval-result:math.deepseek-r1.001 | benchmarks
eval-result:math.gpt-5.001 | eval-result:math.gpt-5.001 | benchmarks
eval-result:math.o3.001 | eval-result:math.o3.001 | benchmarks
eval-result:mbpp.qwen-2-5-coder-32b.001 | eval-result:mbpp.qwen-2-5-coder-32b.001 | benchmarks
eval-result:mgsm.gemini-2-5-pro.001 | eval-result:mgsm.gemini-2-5-pro.001 | benchmarks
eval-result:mmlu.claude-sonnet-4-6.001 | eval-result:mmlu.claude-sonnet-4-6.001 | benchmarks
eval-result:mmlu.command-r-plus.001 | eval-result:mmlu.command-r-plus.001 | benchmarks
eval-result:mmlu.deepseek-r1.001 | eval-result:mmlu.deepseek-r1.001 | benchmarks
eval-result:mmlu.deepseek-v3.001 | eval-result:mmlu.deepseek-v3.001 | benchmarks
eval-result:mmlu.gemma-2-27b.001 | eval-result:mmlu.gemma-2-27b.001 | benchmarks
eval-result:mmlu.llama-3-1-405b.001 | eval-result:mmlu.llama-3-1-405b.001 | benchmarks
eval-result:mmlu.llama-3-3-70b.001 | eval-result:mmlu.llama-3-3-70b.001 | benchmarks
eval-result:mmlu.llama-4-405b.001 | eval-result:mmlu.llama-4-405b.001 | benchmarks
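
The ids listed above appear to follow the shape `eval-result:<benchmark>.<model>.<suffix>`, where the suffix is either a sequence number such as `001` or a date followed by a metric name. This pattern is inferred from the visible rows only and is not a documented schema, so the sketch below is illustrative; the names `EvalResultId` and `parse_eval_result_id` are hypothetical.

```python
from typing import NamedTuple, Tuple


class EvalResultId(NamedTuple):
    benchmark: str
    model: str
    suffix: Tuple[str, ...]  # e.g. ("001",) or ("2026-03-17", "accuracy")


def parse_eval_result_id(raw: str) -> EvalResultId:
    """Split an id on the assumed 'eval-result:<benchmark>.<model>.<suffix>' shape."""
    prefix, sep, rest = raw.partition(":")
    if prefix != "eval-result" or not sep:
        raise ValueError(f"not an eval-result id: {raw!r}")
    parts = rest.split(".")
    if len(parts) < 3:
        raise ValueError(f"expected at least three dot-separated parts: {raw!r}")
    # Benchmark and model slugs use hyphens, so the first two dots delimit them.
    return EvalResultId(parts[0], parts[1], tuple(parts[2:]))


if __name__ == "__main__":
    print(parse_eval_result_id("eval-result:bfcl.gpt-5.001"))
    print(parse_eval_result_id("eval-result:gpqa-diamond.gpt-5-4.2026-03-17.accuracy"))
```

Keeping the suffix as an uninterpreted tuple avoids guessing at the meaning of `001` versus the dated form.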
Page 1 of 2 · Next

Active filters

No active facet filters.

Sort

id-asc
id-desc
name-asc
name-desc