eval-run:gaia.claude-code.…   EvalRun
agent-version:claude-code@…   AgentVersion
test-set:gaia-validation      TestSet
eval-harness:inspect-ai       EvalHarness
eval-harness:helm             EvalHarness
eval-harness:lm-eval-harness  EvalHarness
eval-harness:openai-evals     EvalHarness
eval-harness:promptfoo        EvalHarness
judge:gpt-4o-pairwise         Judge
judge:claude-3-5-sonnet-ru…   Judge
rubric:helpfulness-1-5        Rubric
rubric:safety-3-axis          Rubric
rubric:code-quality           Rubric
eval-result:mmlu.qwen-2-5-…   EvalResult
eval-result:gaia.claude-co…   EvalResult