GAP-L1-P3-benchmarks-stale reference
| Field | Value |
|---|---|
| id | gap:benchmarks-stale |
| title | Benchmark NodeKind has no current SWE-bench Verified, Aider Polyglot, ARC-AGI 2 examples |
| level | 1 |
| priority | P3 |
| discoveredAt | 2026-04-28T00:00:00Z |
| source | schema/examples/benchmarks/ |
| status | open |
| owner | tbd |
Current state
The schema/examples/benchmarks/ directory is absent or sparse. The coverage-checklist OpenQuestion "Benchmark-run primitives at SDK layer" is unresolved. The following major 2025/2026 benchmarks are not represented:
- SWE-bench Verified (current de-facto coding agent benchmark)
- Aider Polyglot
- ARC-AGI 2
- Terminal-Bench
- HumanEval/MBPP (older, but still cited)
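A coverage check for the list above could be sketched as follows. This is a minimal sketch, not an existing tool: the file stems in `EXPECTED_BENCHMARKS` and the function `missing_benchmarks` are hypothetical, since this page does not specify a naming convention for files under schema/examples/benchmarks/.

```python
from pathlib import Path

# Hypothetical file stems, one per benchmark named in the list above.
# The real naming convention under schema/examples/benchmarks/ is an assumption.
EXPECTED_BENCHMARKS = {
    "swe-bench-verified": "SWE-bench Verified",
    "aider-polyglot": "Aider Polyglot",
    "arc-agi-2": "ARC-AGI 2",
    "terminal-bench": "Terminal-Bench",
    "humaneval-mbpp": "HumanEval/MBPP",
}


def missing_benchmarks(examples_dir: Path) -> list[str]:
    """Return display names of benchmarks with no example file in examples_dir."""
    if not examples_dir.is_dir():
        # An absent directory means every benchmark is missing.
        return list(EXPECTED_BENCHMARKS.values())
    # Compare on file stems so the check is independent of the file extension.
    present = {p.stem for p in examples_dir.iterdir() if p.is_file()}
    return [name for stem, name in EXPECTED_BENCHMARKS.items() if stem not in present]


if __name__ == "__main__":
    for name in missing_benchmarks(Path("schema/examples/benchmarks")):
        print(f"missing Benchmark example: {name}")
```

Matching on stems rather than full file names keeps the check usable whichever format the example files end up in.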
Desired state
Add five Benchmark example files, one for each benchmark listed under Current state. Add EvalRun examples for at least Claude Opus 4.7 and gpt-5-codex on SWE-bench Verified to demonstrate the eval graph.
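The shape of the eval graph those examples would demonstrate might look roughly like the sketch below. Everything beyond the names Benchmark, EvalRun, and the two model names is an assumption made for illustration: the `ModelVersion` node kind, the edge kinds `evaluatesModel` and `onBenchmark`, the `kind:slug` id format, and the `score` attribute are hypothetical and should be replaced with whatever the schema actually declares. Scores are left as `None` on purpose, because no results are recorded on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str  # "Benchmark", "ModelVersion" (assumed), or "EvalRun"
    id: str
    attrs: dict = field(default_factory=dict)


@dataclass
class Edge:
    kind: str  # hypothetical edge kinds: "evaluatesModel", "onBenchmark"
    src: str
    dst: str


def slug(name: str) -> str:
    """Lowercase a display name and replace spaces with hyphens."""
    return name.lower().replace(" ", "-")


def build_eval_graph(benchmark: str, models: list[str]) -> tuple[list[Node], list[Edge]]:
    """Build one Benchmark node plus one ModelVersion and one EvalRun per model."""
    bench_id = f"benchmark:{slug(benchmark)}"
    nodes = [Node("Benchmark", bench_id, {"name": benchmark})]
    edges: list[Edge] = []
    for model in models:
        model_id = f"model:{slug(model)}"
        run_id = f"evalrun:{slug(model)}:{slug(benchmark)}"
        nodes.append(Node("ModelVersion", model_id, {"name": model}))
        # score stays None: a placeholder, not a reported result.
        nodes.append(Node("EvalRun", run_id, {"score": None}))
        edges.append(Edge("evaluatesModel", run_id, model_id))
        edges.append(Edge("onBenchmark", run_id, bench_id))
    return nodes, edges


nodes, edges = build_eval_graph("SWE-bench Verified", ["Claude Opus 4.7", "gpt-5-codex"])
```

Each EvalRun links to exactly one model and one benchmark, so two runs on a shared benchmark are enough to show how results for different models connect through the same Benchmark node.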
Evidence
- swebench.com
- arcprize.org
- aider.chat/docs/leaderboards/
Propagation status
- Level 1: open
- Level 2: not-started
Propagation chain
- Level 1: 5 Benchmark example files + 2 EvalRun example files.
Notes
P3: important for the usefulness of the schema examples, but not for schema correctness.