Agentic AI Atlas

Quality Gates

These gates define what must be true before a new test lane, workflow, or model-backed scenario is treated as release evidence.

Gate Matrix

Gate	Applies to	Required checks	Failure action
Determinism	No-model tests	No provider secrets, fixed fixtures, repeatable locally, stable timeout budget	Block PR until deterministic
Credential guard	Model-backed tests	Explicit secret detection before setup, clear skip reason, no fallback to fake success	Block staging/release if selected job cannot prove setup
Artifact redaction	All E2E tests	Secret scan over logs/artifacts, redacted paths, no raw token files	Fail job and suppress unsafe upload
Protocol compatibility	Adapter tests	Mock and live event streams satisfy the same schema/version	Open compatibility issue before promotion
Transport-adapter seam evidence	Transport-adapter tests	Route matrix, runtime env injection, proxy auth, launch proxy decision, stream transcript, metrics/cache artifact, and invalid-combination boundaries are explicit	Block transport-adapter coverage promotion until the missing seam has a direct artifact
Runtime completeness	Babysitter-agent E2E	Run creation, session binding, effect emission, task post, terminal state	Block runtime release gate
Cost and flake budget	Model-backed tests	Retry policy, duration budget, provider rate-limit classification	Keep scheduled/manual until stable
Documentation parity	All lanes	Docs name command, owner, trigger, artifacts, skip/failure semantics	Block workflow merge
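As one illustration of the credential guard row, a model-backed test could check for its provider secrets before any setup runs and return an explicit skip reason when one is missing. This is a minimal sketch under stated assumptions, not the project's implementation: `GuardResult`, `check_credentials`, and the variable name `EXAMPLE_PROVIDER_API_KEY` are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class GuardResult:
    """Outcome of the credential check (hypothetical type for illustration)."""
    ready: bool
    skip_reason: Optional[str] = None


def check_credentials(env: Mapping[str, str], required: Sequence[str]) -> GuardResult:
    """Detect required secrets before any setup runs.

    Reports the names (never the values) of missing variables as an
    explicit skip reason instead of falling back to a fake success.
    """
    missing = [name for name in required if not env.get(name, "").strip()]
    if missing:
        return GuardResult(ready=False, skip_reason="missing secrets: " + ", ".join(missing))
    return GuardResult(ready=True)
```

A job might call `check_credentials(os.environ, ["EXAMPLE_PROVIDER_API_KEY"])` as its first step and record `skip_reason` in the job summary, so a skipped run cannot be mistaken for a passing one.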

Adversarial Review Checklist

Every implementation phase should answer these questions before it is accepted:

What would make this pass without testing the promised behavior?
Which secret or credential path could leak into logs?
Which mock assumption could diverge from live Codex or Claude Code behavior?
Which package boundary is only tested indirectly?
Did transport-adapter traffic actually use proxy routes and injected env, or did the harness call the provider directly?
Is this test claiming to prove plugin install, harness install, hooks, or Babysitter journal behavior while offering only transport-adapter evidence?
Which failure would be misclassified as provider flake instead of product regression?
Which CI trigger would run too often, too late, or not at all?
Which artifact proves the claim to a reviewer who did not watch the run?
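The question about secrets leaking into logs can be partly automated with a scan over artifact text. The sketch below is illustrative only: the two regular expressions are assumed examples and are far from exhaustive, and `redact` is a hypothetical helper rather than an existing tool in this repository.

```python
import re
from typing import Tuple

# Illustrative patterns only; a real lane would need patterns for its own providers.
SECRET_PATTERNS = [
    # key=value or key: value pairs whose key looks like a credential name
    re.compile(r"(?i)(api[_-]?key|token|secret)\s*[=:]\s*\S+"),
    # HTTP bearer tokens
    re.compile(r"(?i)\bbearer\s+[a-z0-9._\-]+"),
]


def redact(text: str) -> Tuple[str, int]:
    """Return the text with matches replaced, plus the number of matches found."""
    total = 0
    for pattern in SECRET_PATTERNS:
        text, count = pattern.subn("[REDACTED]", text)
        total += count
    return text, total
```

Run over artifacts that are supposed to be redacted already, a nonzero count is a signal to fail the job and skip the upload.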

Promotion Criteria

A test can move from manual to scheduled when it has three consecutive successful runs or one documented provider-side skip with no product failures.

A test can move from scheduled to staging preflight when:

it has stable credential gating,
it emits redacted artifacts,
when it claims proxy coverage, its transport-adapter bridge tests include launch-plan JSON, a redacted proxy config/env diff, a route or stream transcript, a metrics/cache snapshot, and provider/harness version metadata,
it adds unique evidence not already covered by no-model tests,
it has an owner for failures,
it has a bounded runtime and retry policy.

A test can move from staging preflight to release preflight only when it protects a production publish risk that cannot be caught earlier.
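The scheduled-to-staging criteria above can be sketched as a checklist function that reports what still blocks promotion. The `LaneStatus` record and its field names are hypothetical, and how each flag gets populated is left open here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LaneStatus:
    """Hypothetical record of what a scheduled lane can currently demonstrate."""
    stable_credential_gating: bool
    emits_redacted_artifacts: bool
    claims_proxy_coverage: bool
    has_proxy_evidence_bundle: bool  # launch plan, config/env diff, transcript, metrics, versions
    adds_unique_evidence: bool
    has_failure_owner: bool
    has_bounded_runtime_and_retries: bool


def unmet_staging_criteria(lane: LaneStatus) -> List[str]:
    """List the criteria that still block promotion to staging preflight."""
    checks = [
        (lane.stable_credential_gating, "stable credential gating"),
        (lane.emits_redacted_artifacts, "redacted artifacts"),
        # Proxy evidence is only required when the lane claims proxy coverage.
        (not lane.claims_proxy_coverage or lane.has_proxy_evidence_bundle,
         "proxy evidence bundle"),
        (lane.adds_unique_evidence, "unique evidence beyond no-model tests"),
        (lane.has_failure_owner, "owner for failures"),
        (lane.has_bounded_runtime_and_retries, "bounded runtime and retry policy"),
    ]
    return [label for ok, label in checks if not ok]
```

Returning the list of unmet items, rather than a single boolean, gives a reviewer the reason for a blocked promotion without rerunning anything.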

Quarantine And Demotion

Model-backed tests are allowed to start outside required branch protection. They must be demoted or quarantined when reliability falls below release-gate quality.

Condition	Action
Two provider-infra failures in seven days	Keep scheduled, remove from required staging checks until root cause is classified
One product regression in staging preflight	Keep required and block publish until fixed or explicitly waived
Secret redaction failure	Disable artifact upload for that lane and block promotion until redaction test exists
Runtime exceeds hard timeout twice	Move to manual diagnostics until scope or timeout budget is redesigned
Mock/live schema drift	Block promotion and open a compatibility issue naming the event family

A quarantined test can return to required status after three consecutive clean scheduled runs and one clean manual rerun by the owning maintainer.
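The return rule can be sketched as a check over the runs recorded since the test was quarantined. The `Run` type is hypothetical, and the sketch assumes one reading of the rule: the three most recent scheduled runs must all be clean, and at least one clean manual rerun by the owning maintainer must exist somewhere in the same history.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Run:
    """One run recorded since quarantine began (hypothetical record)."""
    kind: str               # "scheduled" or "manual"
    clean: bool
    by_owner: bool = False  # only meaningful for manual runs


def can_return_to_required(history: Sequence[Run]) -> bool:
    """Check the return rule over runs listed in chronological order."""
    scheduled = [r for r in history if r.kind == "scheduled"]
    last_three = scheduled[-3:]
    three_clean = len(last_three) == 3 and all(r.clean for r in last_three)
    owner_rerun = any(r.kind == "manual" and r.clean and r.by_owner for r in history)
    return three_clean and owner_rerun
```

Whether the manual rerun must come after the three scheduled runs is not stated in the rule; a stricter variant could require that ordering.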
