docs/testing
Testing Strategy guide
This directory defines the replacement testing strategy after the legacy Docker and Docker-E2E workflows were removed. The current CI implementation lives primarily in .github/workflows/publish.yml, with GitHub Actions owning the live-stack scenario and OS matrix. The new plan starts from repository-native package boundaries, Babysitter harness setup commands, the babysitter-agent runtime surface, and explicit model/no-model lanes instead of reusing the retired Docker image and e2e-tests/docker suite.
Pages in this section
Start with the section hub, then move sideways into adjacent pages when you need more detail.
This strategy covers runtime paths after setup is already satisfied. It separates agent-mux sessions, transport carriers, agent-core programmatic sessions, and @a5c-ai/babysitter-agent orchestration. Harness/plugin install coverage lives in [Harness And Plugin E2E](./harness-e2e.md), not in babysitter-agent runtime E2E.
wiki/docs/testing/agent-mux-and-runtime-e2e.md
Coverage reporting should make the repository-wide test story visible without turning every test into one slow monolithic gate.
wiki/docs/testing/coverage-and-reporting.md
Status: Current. This inventory implements roadmap slice 0, "Inventory and naming". It maps existing CI-relevant test-like package scripts to package or surface, lane, scope, owner, artifact name, and pipeline placement. Proposed future bundles remain in [Pipeline Integration](./pipeline-integration.md#proposed-command-bundles) and are not treated as current commands here.
wiki/docs/testing/current-test-command-inventory.md
This document covers harness setup and plugin-enabled sessions. It intentionally separates two different integration types: SDK harness/plugin setup and agent-mux plugin/session E2E.
wiki/docs/testing/harness-e2e.md
This roadmap turns the strategy into implementation slices. Each slice must land with docs, package scripts or workflow wiring, and proof artifacts before the next slice depends on it. Status reflects the current unified Publish workflow, where live-stack scenario selection is owned by GitHub Actions rather than test code.
wiki/docs/testing/implementation-roadmap.md
No-model tests are only valuable if their mocks describe the same contracts live providers must satisfy. This document defines fixture expectations for Codex, Claude Code, agent-core, agent-mux, transport-mux, hooks muxes, and babysitter-agent.
wiki/docs/testing/mock-and-fixture-contracts.md
The pipeline should add new testing lanes in stages. No-model tests protect every pull request. Model-backed tests protect promotion and release confidence without making ordinary PRs depend on provider availability.
wiki/docs/testing/pipeline-integration.md
This document maps the main flows that the rebuilt E2E strategy should prove. It is intentionally data-path oriented: every flow names the caller, command/API boundary, state that must be created, hook/session artifacts that should exist, and the identifiers that let a test join evidence across packages.
wiki/docs/testing/primary-flow-data-paths.md
Quality Gates
These gates define what must be true before a new test lane, workflow, or model-backed scenario is treated as release evidence.
wiki/docs/testing/quality-gates.md
The test strategy must treat the stack as modular. A valid E2E does not need every layer, and some layer combinations are invalid even if the names sound related.
wiki/docs/testing/stack-permutations.md
Test Lanes
The replacement strategy has two top-level lanes. Every new test must declare which lane it belongs to before it is added to CI.
wiki/docs/testing/test-lanes.md
Use this document as the evidence checklist for tests described in [Primary Flow Data Paths](./primary-flow-data-paths.md). A scenario should not be marked E2E unless it records the identifiers needed to join the agent session, hook events, Babysitter run state, and transport trace.
wiki/docs/testing/trace-identifiers-and-evidence.md
Documents
- Test Lanes defines the two top-level lanes: no-model deterministic tests and model-backed tests that require real provider credentials.
- Harness And Plugin E2E separates SDK harness/plugin setup from agent-mux plugin/session E2E.
- Agent Mux And Runtime E2E defines runtime coverage for agent-mux, transport-mux, agent-core, and @a5c-ai/babysitter-agent flows after setup preconditions are satisfied.
- Pipeline Integration defines where each lane belongs in CI, staging, release, scheduled, and manual workflows.
- Coverage And Reporting defines repo-wide coverage reporting, artifacts, logs, and pass/fail evidence.
- Implementation Roadmap defines rollout slices, exit criteria, and stop conditions.
- Current Test Command Inventory maps existing package test-like commands to lane, scope, owner, artifact name, and pipeline placement for roadmap slice 0.
- Mock And Fixture Contracts defines deterministic fixture families and live/mock compatibility rules.
- Quality Gates defines release-evidence gates and adversarial review criteria.
- Stack Permutations defines valid and invalid layer combinations across the modular stack.
- Primary Flow Data Paths maps the full data path for the main agent-mux, babysitter-agent, SDK run, hooks-mux, and transport-mux flows.
- Trace Identifiers And Evidence defines the IDs, logs, files, and artifact bundles required to correlate those flows.
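The evidence rule behind Trace Identifiers And Evidence (a scenario is not E2E unless it records the identifiers that join the agent session, hook events, Babysitter run state, and transport trace) could be checked mechanically. The following is a minimal sketch only: the `ScenarioEvidence` shape and its field names are hypothetical and are not taken from the repository.

```typescript
// Hypothetical evidence record. Field names are illustrative assumptions,
// not identifiers defined by the repository.
interface ScenarioEvidence {
  agentSessionId?: string;
  hookEventIds?: string[];
  babysitterRunId?: string;
  transportTraceId?: string;
}

// Returns the names of the evidence fields that are missing or empty.
function missingEvidence(e: ScenarioEvidence): string[] {
  const missing: string[] = [];
  if (!e.agentSessionId) missing.push("agentSessionId");
  if (!e.hookEventIds || e.hookEventIds.length === 0) missing.push("hookEventIds");
  if (!e.babysitterRunId) missing.push("babysitterRunId");
  if (!e.transportTraceId) missing.push("transportTraceId");
  return missing;
}

// A scenario qualifies as E2E only when no identifier is missing.
function qualifiesAsE2E(e: ScenarioEvidence): boolean {
  return missingEvidence(e).length === 0;
}
```

Returning the list of missing fields, rather than a bare boolean, lets a report say why a scenario was not counted as E2E.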
Principles
- Separate tests that need model credentials from tests that can run with mocks, fixtures, or local fakes.
- Make setup explicit and repeatable, but do not conflate setup with runtime: SDK harness/plugin setup, agent-mux plugin/session E2E, and babysitter-agent runtime E2E are separate paths.
- Test mux boundaries at multiple scopes: protocol contracts, adapter translation, transport behavior, gateway/session behavior, UI behavior, and full runtime orchestration.
- Prefer package-local tests for fast feedback, then compose them into broader lanes only when the integration surface matters.
- Treat live model runs as release evidence, not as the first line of feedback for every pull request.
- Promote tests through explicit gates: manual, scheduled, staging preflight, then release preflight.
- Require each model-backed claim to have a no-model fixture or contract counterpart unless the behavior is inherently provider-only.
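Several of these principles are mechanical enough to express as a check. The sketch below is one possible encoding, not the repository's implementation: the `TestDeclaration` shape, the string spellings of lanes and gates, and the function names are all assumptions made for illustration.

```typescript
// Lane and gate names mirror the prose; the exact spellings are assumptions.
type Lane = "no-model" | "model-backed";
type Gate = "manual" | "scheduled" | "staging-preflight" | "release-preflight";

// Promotion order from the principles above.
const GATE_ORDER: readonly Gate[] = [
  "manual",
  "scheduled",
  "staging-preflight",
  "release-preflight",
];

// Hypothetical declaration a test would carry before it is added to CI.
interface TestDeclaration {
  name: string;
  lane: Lane;
  gate: Gate;
  // Name of the no-model fixture or contract counterpart, if any.
  noModelCounterpart?: string;
  // True when the behavior is inherently provider-only.
  providerOnly?: boolean;
}

// Returns the next gate in the promotion order, or undefined at the last gate.
function nextGate(current: Gate): Gate | undefined {
  const i = GATE_ORDER.indexOf(current);
  return GATE_ORDER[i + 1];
}

// Collects violations of the counterpart principle for one declaration.
function validate(d: TestDeclaration): string[] {
  const problems: string[] = [];
  if (d.lane === "model-backed" && !d.noModelCounterpart && !d.providerOnly) {
    problems.push(`${d.name}: model-backed test needs a no-model counterpart or providerOnly`);
  }
  return problems;
}
```

Under these assumptions, a model-backed declaration with neither `noModelCounterpart` nor `providerOnly` set yields one problem, and `nextGate("manual")` returns `"scheduled"`.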
Status Legend
| Status | Meaning |
|---|---|
| Current | Command, workflow, or package test exists today and can be validated now. |
| Proposed | Contract name or workflow shape this strategy recommends for a future implementation slice; not the current source of truth unless a current workflow or package script is named. |
| Promotion target | A test exists or is planned in a lower lane and should move only after meeting quality gates. |
Unless a document explicitly says Current, command bundles and workflow names are proposed implementation targets.
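The default in the sentence above can be written as a small helper. This is an illustrative sketch: `DocEntry`, `effectiveStatus`, and `validatableNow` are hypothetical names, and only the three status values come from the legend.

```typescript
// Status values taken from the legend.
type Status = "Current" | "Proposed" | "Promotion target";

// Hypothetical entry for a command bundle or workflow name mentioned in a doc.
interface DocEntry {
  name: string;
  status?: Status;
}

// Entries without an explicit status are treated as Proposed.
function effectiveStatus(entry: DocEntry): Status {
  return entry.status ?? "Proposed";
}

// Only Current entries exist today and can be validated now.
function validatableNow(entries: DocEntry[]): string[] {
  return entries
    .filter((e) => effectiveStatus(e) === "Current")
    .map((e) => e.name);
}
```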
Current State
The repository already has Vitest, Playwright, package-local test scripts, release verification scripts, docs QA, metadata checks, architecture gates, and staging/release workflows. This strategy names how to organize the next E2E generation around those surfaces rather than around the removed Docker workflows.
Requested Scope Traceability
| Requested scope | Primary docs | Lane | First implementation surface |
|---|---|---|---|
| Codex E2E | Harness And Plugin E2E, Stack Permutations | No-model setup/session first, then capability-gated model-backed | Harness setup smoke, Codex adapter protocol fixture, plugin E2E only after capability proof; babysitter-agent runtime is separate |
| Claude Code E2E | Harness And Plugin E2E, Stack Permutations | No-model setup/session first, then model-backed | Harness setup smoke, agent-mux session, plugin-manager where supported, /babysitter:call plugin smoke, Claude hook/tool-call fixture |
| harness:install and plugin setup | Harness And Plugin E2E, Stack Permutations | Setup only | Dry-run install JSON, plugin discovery JSON, idempotency checks; no babysitter-agent runtime claim |
| Agent-mux functionality requiring credentials | Agent Mux And Runtime E2E, Pipeline Integration | Model-backed | Live adapter matrix for Codex and Claude Code |
| Babysitter-agent whole-system flow | Agent Mux And Runtime E2E, Stack Permutations | Both | Mock planner/executor first, bounded live process after staging promotion, no installer commands inside runtime E2E |
| Muxes and transport-mux | Agent Mux And Runtime E2E, Mock And Fixture Contracts, Primary Flow Data Paths | Both | Shared event fixtures, transport roundtrip, live transport smoke with trace identifiers |
| Hooks muxes | Agent Mux And Runtime E2E, Mock And Fixture Contracts, Trace Identifiers And Evidence | Both | Normalized hook fixtures, live hook replay after redaction with session/run correlation |
| Pipeline integration | Pipeline Integration, Implementation Roadmap | Both | New workflow contracts and staged required checks |
| Coverage reporting | Coverage And Reporting | Both | Package coverage baselines plus scenario coverage summaries |