Atlas Graph Explorer
EvalRun
eval-run:gaia.claude-code.2025
benchmarks/eval-runs/gaia-claude-code.yaml
Attributes
  benchmarkId: benchmark:gaia
  testSetId: test-set:gaia-validation
  target: agent-version:claude-code@1.x
  targetId: agent-version:claude-code@1.x
  runAt: 2025-06-01T00:00:00Z
  runBy: @a5c-ai/team
  configHash: sha256:placeholder-claude-code-gaia
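The YAML file referenced above might contain a record along these lines. This is a hedged sketch reconstructed from the attributes shown on this page; the `id` and `type` keys are assumptions, and the real file's layout may differ.

```yaml
# Hypothetical sketch of benchmarks/eval-runs/gaia-claude-code.yaml,
# built only from the attributes listed above -- not the actual file contents.
id: eval-run:gaia.claude-code.2025
type: EvalRun
benchmarkId: benchmark:gaia
testSetId: test-set:gaia-validation
targetId: agent-version:claude-code@1.x
runAt: "2025-06-01T00:00:00Z"
runBy: "@a5c-ai/team"
configHash: sha256:placeholder-claude-code-gaia
```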
Outgoing edges (16)
  evaluated_by (1)
    benchmark:gaia · Benchmark · GAIA
  evaluates_target (1)
    agent-version:claude-code@1.x · AgentVersion
  for_benchmark (1)
    benchmark:gaia · Benchmark · GAIA
  judged_by (3)
    judge:gpt-4o-pairwise · Judge · GPT-4o pairwise preference judge
    judge:claude-3-5-sonnet-rubric · Judge · Claude 3.5 Sonnet rubric-based judge
    judge:exact-match · Judge · Exact-match programmatic judge
  produced_result (1)
    eval-result:mmlu.qwen-2-5-72b.001 · EvalResult
  scored_against_rubric (3)
    rubric:helpfulness-1-5 · Rubric · Helpfulness 1-5 rubric
    rubric:safety-3-axis · Rubric · Safety 3-axis rubric (harm, bias, refusal-appropriateness)
    rubric:code-quality · Rubric · Code-quality rubric
  uses_harness (5)
    eval-harness:inspect-ai · EvalHarness · Inspect AI
    eval-harness:helm · EvalHarness · Stanford HELM
    eval-harness:lm-eval-harness · EvalHarness · EleutherAI lm-evaluation-harness
    eval-harness:openai-evals · EvalHarness · OpenAI Evals
    eval-harness:promptfoo · EvalHarness · promptfoo
  uses_test_set (1)
    test-set:gaia-validation · TestSet · GAIA validation split
Incoming edges (1)
  belongs_to_eval_run (1)
    eval-result:gaia.claude-code.001 · EvalResult
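The edge listings above can be modeled as (source, relation, target) triples grouped by relation. The sketch below is illustrative only, not the Atlas Graph Explorer's actual API or storage format; the `outgoing` helper and the small `edges` sample are hypothetical.

```python
# Illustrative sketch (not the Atlas API): model the page's edge view as
# triples, then group a node's outgoing edges by relation label.
from collections import defaultdict

RUN = "eval-run:gaia.claude-code.2025"

# A subset of the outgoing edges shown on this page.
edges = [
    (RUN, "evaluated_by", "benchmark:gaia"),
    (RUN, "evaluates_target", "agent-version:claude-code@1.x"),
    (RUN, "judged_by", "judge:gpt-4o-pairwise"),
    (RUN, "judged_by", "judge:claude-3-5-sonnet-rubric"),
    (RUN, "judged_by", "judge:exact-match"),
    (RUN, "uses_test_set", "test-set:gaia-validation"),
]

def outgoing(node, triples):
    """Group a node's outgoing edges by relation, as the explorer does."""
    grouped = defaultdict(list)
    for src, rel, dst in triples:
        if src == node:
            grouped[rel].append(dst)
    return dict(grouped)

grouped = outgoing(RUN, edges)
print(len(grouped["judged_by"]))  # three judges, matching the (3) count above
```

Grouping by relation label is what produces the per-relation counts, e.g. judged_by (3) and uses_harness (5), in the edge panels above.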