Agentic AI Atlas

Testing Strategy guide

This directory defines the replacement testing strategy after the legacy Docker and Docker-E2E workflows were removed. The current CI implementation lives primarily in .github/workflows/publish.yml, with GitHub Actions owning the live-stack scenario and OS matrix. The new plan starts from repository-native package boundaries, Babysitter harness setup commands, the babysitter-agent runtime surface, and explicit model/no-model lanes instead of reusing the retired Docker image and e2e-tests/docker suite.

Pages in this section

Start with the section hub, then move sideways into adjacent pages when you need more detail.

Agent Mux And Runtime E2E

This strategy covers runtime paths after setup is already satisfied. It separates agent-mux sessions, transport carriers, agent-core programmatic sessions, and @a5c-ai/babysitter-agent orchestration. Harness/plugin install coverage lives in [Harness And Plugin E2E](./harness-e2e.md), not in babysitter-agent runtime E2E.

wiki/docs/testing/agent-mux-and-runtime-e2e.md

Coverage And Reporting

Coverage reporting should make the repository-wide test story visible without turning every test into one slow monolithic gate.

wiki/docs/testing/coverage-and-reporting.md

Current Test Command Inventory

Status: Current. This inventory implements roadmap slice 0, "Inventory and naming". It maps existing CI-relevant test-like package scripts to package or surface, lane, scope, owner, artifact name, and pipeline placement. Proposed future bundles remain in [Pipeline Integration](./pipeline-integration.md#proposed-command-bundles) and are not treated as current commands here.

wiki/docs/testing/current-test-command-inventory.md

Harness And Plugin E2E

This document covers harness setup and plugin-enabled sessions. It intentionally separates two different integration types: SDK harness/plugin setup and agent-mux plugin/session E2E.

wiki/docs/testing/harness-e2e.md

Implementation Roadmap

This roadmap turns the strategy into implementation slices. Each slice must land with docs, package scripts or workflow wiring, and proof artifacts before the next slice depends on it. Status reflects the current unified Publish workflow, where live-stack scenario selection is owned by GitHub Actions rather than test code.

wiki/docs/testing/implementation-roadmap.md

Mock And Fixture Contracts

No-model tests are only valuable if their mocks describe the same contracts live providers must satisfy. This document defines fixture expectations for Codex, Claude Code, agent-core, agent-mux, transport-mux, hooks muxes, and babysitter-agent.

wiki/docs/testing/mock-and-fixture-contracts.md

Pipeline Integration

The pipeline should add new testing lanes in stages. No-model tests protect every pull request. Model-backed tests protect promotion and release confidence without making ordinary PRs depend on provider availability.

wiki/docs/testing/pipeline-integration.md

Primary Flow Data Paths

This document maps the main flows that the rebuilt E2E strategy should prove. It is intentionally data-path oriented: every flow names the caller, command/API boundary, state that must be created, hook/session artifacts that should exist, and the identifiers that let a test join evidence across packages.

wiki/docs/testing/primary-flow-data-paths.md

Quality Gates

These gates define what must be true before a new test lane, workflow, or model-backed scenario is treated as release evidence.

wiki/docs/testing/quality-gates.md

Stack Permutations

The test strategy must treat the stack as modular. A valid E2E does not need every layer, and some layer combinations are invalid even if the names sound related.

wiki/docs/testing/stack-permutations.md

Test Lanes

The replacement strategy has two top-level lanes. Every new test must declare which lane it belongs to before it is added to CI.

wiki/docs/testing/test-lanes.md

Trace Identifiers And Evidence

Use this document as the evidence checklist for tests described in [Primary Flow Data Paths](./primary-flow-data-paths.md). A scenario should not be marked E2E unless it records the identifiers needed to join the agent session, hook events, Babysitter run state, and transport trace.

wiki/docs/testing/trace-identifiers-and-evidence.md

Testing Strategy

Documents

Test Lanes defines the two top-level lanes: no-model deterministic tests and model-backed tests that require real provider credentials.
Harness And Plugin E2E separates SDK harness/plugin setup from agent-mux plugin/session E2E.
Agent Mux And Runtime E2E defines runtime coverage for agent-mux, transport-mux, agent-core, and @a5c-ai/babysitter-agent flows after setup preconditions are satisfied.
Pipeline Integration defines where each lane belongs in CI, staging, release, scheduled, and manual workflows.
Coverage And Reporting defines repo-wide coverage reporting, artifacts, logs, and pass/fail evidence.
Implementation Roadmap defines rollout slices, exit criteria, and stop conditions.
Current Test Command Inventory maps existing test-like package commands to lane, scope, owner, artifact name, and pipeline placement for roadmap slice 0.
Mock And Fixture Contracts defines deterministic fixture families and live/mock compatibility rules.
Quality Gates defines release-evidence gates and adversarial review criteria.
Stack Permutations defines valid and invalid layer combinations across the modular stack.
Primary Flow Data Paths maps the full data path for the main agent-mux, babysitter-agent, SDK run, hooks-mux, and transport-mux flows.
Trace Identifiers And Evidence defines the IDs, logs, files, and artifact bundles required to correlate those flows.
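
The evidence rule behind Trace Identifiers And Evidence (a scenario counts as E2E only when it records the identifiers that join the agent session, hook events, Babysitter run state, and transport trace) can be sketched as a small checklist. This is a minimal illustration, not the project's schema: the `ScenarioEvidence` shape and every field name in it are assumptions made for this example, and it treats all four categories as required, which may be stricter than a scenario that legitimately omits a layer.

```typescript
// Hypothetical evidence record. Field names are illustrative assumptions,
// not identifiers defined by the strategy documents.
interface ScenarioEvidence {
  agentSessionId?: string;
  hookEventIds?: string[];
  babysitterRunId?: string;
  transportTraceId?: string;
}

// Returns the evidence categories that have not been recorded yet.
function missingEvidence(e: ScenarioEvidence): string[] {
  const missing: string[] = [];
  if (!e.agentSessionId) missing.push("agent session");
  if (!e.hookEventIds || e.hookEventIds.length === 0) missing.push("hook events");
  if (!e.babysitterRunId) missing.push("Babysitter run state");
  if (!e.transportTraceId) missing.push("transport trace");
  return missing;
}

// In this sketch, a scenario may be labeled E2E only when nothing is missing.
function canMarkE2E(e: ScenarioEvidence): boolean {
  return missingEvidence(e).length === 0;
}
```

Returning the list of missing categories, rather than a bare boolean, lets a report say why a scenario was not accepted as E2E evidence.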

Principles

Separate tests that need model credentials from tests that can run with mocks, fixtures, or local fakes.
Make setup explicit and repeatable, but do not conflate setup with runtime: SDK harness/plugin setup, agent-mux plugin/session E2E, and babysitter-agent runtime E2E are separate paths.
Test mux boundaries at multiple scopes: protocol contracts, adapter translation, transport behavior, gateway/session behavior, UI behavior, and full runtime orchestration.
Prefer package-local tests for fast feedback, then compose them into broader lanes only when the integration surface matters.
Treat live model runs as release evidence, not as the first line of feedback for every pull request.
Promote tests through explicit gates: manual, scheduled, staging preflight, then release preflight.
Require each model-backed claim to have a no-model fixture or contract counterpart unless the behavior is inherently provider-only.
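
Several of the principles above are mechanical enough to sketch in code: every test declares a lane, promotion follows a fixed gate order, and a model-backed claim needs a no-model counterpart unless it is inherently provider-only. The example below is an assumption-laden illustration. The `TestDeclaration` shape, the gate spellings, and the function names are hypothetical and do not come from the repository.

```typescript
type Lane = "no-model" | "model-backed";

// Gate order follows the principle above; the string spellings are assumptions.
const GATES = ["manual", "scheduled", "staging-preflight", "release-preflight"] as const;
type Gate = (typeof GATES)[number];

// Hypothetical declaration that a test would carry before being added to CI.
interface TestDeclaration {
  name: string;
  lane: Lane;
  noModelCounterpart?: string; // name of the fixture or contract test, if any
  providerOnly?: boolean;      // true when the behavior cannot be mocked
}

// Returns the gate that follows `current`, or null at the end of the chain.
function nextGate(current: Gate): Gate | null {
  const next = GATES[GATES.indexOf(current) + 1];
  return next ?? null;
}

// Flags model-backed tests that lack both a counterpart and a provider-only reason.
function declarationProblems(t: TestDeclaration): string[] {
  const problems: string[] = [];
  if (t.lane === "model-backed" && !t.noModelCounterpart && !t.providerOnly) {
    problems.push(`${t.name}: needs a no-model counterpart or a provider-only justification`);
  }
  return problems;
}

// Only no-model tests are treated as first-line pull request feedback here.
function runsOnPullRequest(t: TestDeclaration): boolean {
  return t.lane === "no-model";
}
```

Keeping the gate order in one constant means a test can only move to the immediately following gate, which mirrors the staged promotion the principles describe.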

Status Legend

| Status | Meaning |
| --- | --- |
| Current | Command, workflow, or package test exists today and can be validated now. |
| Proposed | Contract name or workflow shape this strategy recommends for a future implementation slice; not the current source of truth unless a current workflow or package script is named. |
| Promotion target | A test exists or is planned in a lower lane and should move only after meeting quality gates. |

Unless a document explicitly says Current, command bundles and workflow names are proposed implementation targets.
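
The default stated above can be expressed as a one-function sketch, which may help tooling that reads status labels out of these documents. The helper name `effectiveStatus` and the idea of passing the declared label as an optional string are assumptions for illustration only.

```typescript
type Status = "Current" | "Proposed" | "Promotion target";

// Hypothetical helper: `declared` is the status label a document states for an
// item, if it states one at all. Anything not explicitly marked falls back to
// "Proposed", matching the rule in the legend.
function effectiveStatus(declared?: string): Status {
  if (declared === "Current") return "Current";
  if (declared === "Promotion target") return "Promotion target";
  return "Proposed";
}
```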

Current State

The repository already has Vitest, Playwright, package-local test scripts, release verification scripts, docs QA, metadata checks, architecture gates, and staging/release workflows. This strategy describes how to organize the next generation of E2E tests around those surfaces rather than around the removed Docker workflows.

Requested Scope Traceability

| Requested scope | Primary docs | Lane | First implementation surface |
| --- | --- | --- | --- |
| Codex E2E | Harness And Plugin E2E, Stack Permutations | No-model setup/session first, then capability-gated model-backed | Harness setup smoke, Codex adapter protocol fixture, plugin E2E only after capability proof; babysitter-agent runtime is separate |
| Claude Code E2E | Harness And Plugin E2E, Stack Permutations | No-model setup/session first, then model-backed | Harness setup smoke, agent-mux session, plugin-manager where supported, `/babysitter:call` plugin smoke, Claude hook/tool-call fixture |
| `harness:install` and plugin setup | Harness And Plugin E2E, Stack Permutations | Setup only | Dry-run install JSON, plugin discovery JSON, idempotency checks; no babysitter-agent runtime claim |
| Agent-mux functionality requiring credentials | Agent Mux And Runtime E2E, Pipeline Integration | Model-backed | Live adapter matrix for Codex and Claude Code |
| Babysitter-agent whole-system flow | Agent Mux And Runtime E2E, Stack Permutations | Both | Mock planner/executor first, bounded live process after staging promotion, no installer commands inside runtime E2E |
| Muxes and transport-mux | Agent Mux And Runtime E2E, Mock And Fixture Contracts, Primary Flow Data Paths | Both | Shared event fixtures, transport roundtrip, live transport smoke with trace identifiers |
| Hooks muxes | Agent Mux And Runtime E2E, Mock And Fixture Contracts, Trace Identifiers And Evidence | Both | Normalized hook fixtures, live hook replay after redaction with session/run correlation |
| Pipeline integration | Pipeline Integration, Implementation Roadmap | Both | New workflow contracts and staged required checks |
| Coverage reporting | Coverage And Reporting | Both | Package coverage baselines plus scenario coverage summaries |
