Memorandum: Routing Scenario Test Infrastructure Gap
Date: 2026-06-18
Concerns: complex_shell_prompts scenario class; oracle coverage of conversational tool-gathering
Status: Active gap — CI gives false green on integration behaviour
Summary
The routing scenario oracle (_oracle_runtime.py) does not observe, assert on,
or control the conversational tool-gathering path (gather_tool_evidence →
run_tool_calling_loop). Every complex_shell_prompts scenario passes in CI
even when zero integrations are queried and the response is entirely hallucinated
text. The test infrastructure provides confidence that does not exist.
1. The Two Execution Paths — Only One Is Tested
When a REPL turn routes to handle_message_with_agent, two independent paths
can fire:
| Path | What it does | Oracle coverage |
|---|---|---|
| Planner → REGISTRY.dispatch | LLM proposes a terminal action (slash, investigation, shell, etc.); the oracle intercepts every call through patch_execution_boundary | Fully observed and asserted |
| gather_tool_evidence → run_tool_calling_loop | A bounded ReAct loop queries registered tools (Sentry, GitHub, PostHog, etc.) to ground a conversational answer | Completely unobserved |
patch_execution_boundary patches only REGISTRY.dispatch. It does not patch gather_tool_evidence,
run_tool_calling_loop, _run_parallel, or _resolve_session_integrations.
Tool calls made during the gather pass are invisible to the test.
2. configured_integrations in Fixtures Does Not Isolate the Store
fresh_session applies the fixture’s configured_integrations list to
session.configured_integrations. This field controls the LLM system-prompt
copy and the REPL status bar. It does not control which integrations are
actually loaded for the gather loop.
_resolve_session_integrations ignores session.configured_integrations
entirely. It calls resolve_integrations({}), which reads the developer’s live ~/.opensre/integrations.json
and environment variables. This produces three distinct, silent behaviours across
environments:
| Environment | What resolve_integrations returns | Tool-gathering outcome |
|---|---|---|
| CI (no store, no env keys) | {} — no tools available | Gathering is a no-op; text-only answer |
| Developer machine, no keys | {} | Same no-op |
| Developer machine with real integrations | Real configs (PostHog, GitHub, Sentry, …) | Real tool calls fire with the developer’s live credentials |
A scenario that declares configured_integrations: [sentry, github, posthog]
in CI runs exactly the same code path as one that declares
configured_integrations: []. The field is decoration, not isolation.
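The decoupling described in this section can be illustrated with a toy model. This is a minimal sketch, not the project's code: `Session`, the `store` argument, and both function bodies are simplified stand-ins, and only the names `configured_integrations` and `resolve_integrations` come from this memo.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    # Mirrors the fixture field: drives prompt copy and the status bar only.
    configured_integrations: list[str] = field(default_factory=list)


def resolve_integrations(store: dict[str, dict]) -> dict[str, dict]:
    # Stand-in for the real resolver, which reads the on-disk store and env vars.
    return dict(store)


def resolve_session_integrations(
    session: Session, store: dict[str, dict]
) -> dict[str, dict]:
    # The session is accepted but never consulted, as described above.
    return resolve_integrations(store)


ci_store: dict[str, dict] = {}  # CI: no store file, no env keys
declared = Session(configured_integrations=["sentry", "github", "posthog"])
empty = Session(configured_integrations=[])

# Both sessions resolve to the same empty config in CI.
same_in_ci = (
    resolve_session_integrations(declared, ci_store)
    == resolve_session_integrations(empty, ci_store)
    == {}
)
```

Under these assumptions the declared list has no effect on what the gather loop receives; only the contents of the store do.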
3. Response Contracts Are Satisfiable by Hallucination Alone
Because the gather loop is a no-op in CI, the response evaluated against the contract is produced by the LLM from its training data, not from any live integration. The contracts for the two complex_shell_prompts scenarios are:
313 (configured_integrations: [])
314 (configured_integrations: [sentry, github, posthog])
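To see why a keyword-style text contract cannot separate grounded answers from ungrounded ones, consider the sketch below. The contract shape (a `must_mention_any` keyword list), the helper name, and the sample response are assumptions made for illustration; the real contract fields may differ.

```python
def satisfies_text_contract(response: str, must_mention_any: list[str]) -> bool:
    """Return True when the response mentions at least one required keyword."""
    lowered = response.lower()
    return any(keyword.lower() in lowered for keyword in must_mention_any)


# Hypothetical contract keywords; the real scenario contracts may differ.
keywords = ["sentry", "github", "posthog"]

# A response written with no tool calls and no integration data behind it.
ungrounded = "I would check Sentry for crash events and GitHub for recent deploys."
tool_calls_made: list[str] = []

# The check only inspects the text, so it cannot see that nothing was queried.
contract_passes = satisfies_text_contract(ungrounded, keywords)
```

The check passes even though `tool_calls_made` is empty, which is the situation this section describes for CI.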
4. The Behaviour Proven by Current Tests
Across all 54 routing scenarios, what passes in CI is:
- Routing correctness — route_input returns the right route_kind. This is genuine and valuable.
- Deterministic command dispatch — slash commands and aliases resolve correctly. This is genuine and valuable.
- Planned terminal action shape — when a planner action fires (slash, shell, investigation), the oracle records and asserts it correctly. This is genuine and valuable.
- Hallucination-satisfiable text contracts — for all conversational turns, the contract is met by the LLM generating plausible text that mentions the right words. This is not a meaningful signal.
What no scenario currently verifies:
- Whether the gather loop fires at all.
- Whether any specific tool was called.
- Whether any integration returned data.
- Whether the response is grounded in integration data vs. generated from training knowledge.
- Whether a broken integration (validation error, auth failure, timeout) prevents the response from being useful.
5. Why This Is a Large Correctness Risk
The complex_shell_prompts class exists specifically to cover the integration
data-gathering surface — the tests are named and described as covering exactly
what they do not cover. This creates three concrete risks:
Risk 1: Broken tool extraction goes undetected. If
_posthog_mcp_extract_params starts returning bad config fields (as
happened: live PostHog calls received posthog_mode="mcp" from the LLM), the
scenario passes. The broken extraction is only discovered when a user exercises
the feature interactively.
Risk 2: Integration registration silently drops. If a tool’s
is_available check starts returning False for all sessions, or
if the tool is accidentally deregistered, every complex_shell_prompts scenario
still passes. A regression that stops the agent from ever querying GitHub or
PostHog cannot be caught by the current test suite.
Risk 3: The no-mocks policy blocks the obvious fix. AGENTS.md and test_routing_fixture_integrity.py enforce a hard no-mocks
rule on the routing oracle:
“Do not use unittest.mock, patch, MagicMock, or equivalent mocking primitives in routing tests.”
The intent of this rule is correct — it prevents tests from faking the LLM and making routing assertions against synthetic planner output. But it accidentally also blocks injecting a controlled integration config into the gather loop, which does not involve the LLM at all. The rule currently prevents the fix.
6. Root Cause: Architectural Seam Is Missing
The docstring in scenario_loader.py acknowledges the gap explicitly.
tests/cli/interactive_shell/runtime/test_answer_with_tools.py patches both gather_tool_evidence and
answer_cli_agent entirely, so it tests the wiring between them (gather output
flows to answer), not whether the gather loop calls the right tools with the
right config. The gap noted in the docstring has never been closed.
7. Proposed Remediation
The following changes are required, in dependency order.
7.1 — Add a stable test seam for integration injection
Add resolved_integrations_override support to fresh_session in the oracle.
When set, _resolve_session_integrations returns the override instead of
hitting the real store. This does not mock the LLM, does not mock any tool, and
does not violate the spirit of the no-mocks rule — it controls the integration
config the tool is called with, which is fixture input, not LLM output.
run_oracle_once reads this from case.scenario.session.resolved_integrations
when present, and uses {} (no-op gather) otherwise. CI remains fast because
no fixture currently sets this field.
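The seam described in 7.1 could look roughly like the following. This is a minimal sketch under stated assumptions: `SessionFixture` and `make_session_resolver` are hypothetical names, and how the resolver is actually installed in place of `_resolve_session_integrations` is left open.

```python
from dataclasses import dataclass
from typing import Callable, Optional

IntegrationConfigs = dict[str, dict]


@dataclass
class SessionFixture:
    # Field name taken from the proposal; None means "not injected".
    resolved_integrations: Optional[IntegrationConfigs] = None


def make_session_resolver(
    fixture: SessionFixture,
) -> Callable[[], IntegrationConfigs]:
    """Build the resolver the oracle would install for one scenario run."""

    def resolve() -> IntegrationConfigs:
        if fixture.resolved_integrations is not None:
            # Fixture input: return a copy so a run cannot mutate the fixture.
            return dict(fixture.resolved_integrations)
        # No override: keep the current CI behaviour, a no-op gather.
        return {}

    return resolve


# Usage: one scenario injects a config, another leaves the field unset.
injected = make_session_resolver(
    SessionFixture({"posthog": {"api_key": "test-key"}})
)()
default = make_session_resolver(SessionFixture())()
```

Returning `{}` by default is what keeps existing scenarios on their current path; only fixtures that opt in see a non-empty config.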
7.2 — Track gather-loop tool calls in the oracle
Wrap run_tool_calling_loop (or _run_parallel) with a thin recorder inside
run_oracle_once so tool calls made during gathering are captured alongside
planned terminal actions. This does not mock the tools themselves; it records
which ones fired.
The recorder appends each tool name to gathered_tool_calls: list[str] and the OracleRunResult
exposes this for contract assertions.
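A recorder of the kind described in 7.2 can be a plain wrapper that notes the tool name and then delegates. The sketch below assumes a per-tool callable taking the tool name as its first argument; the real signatures of `run_tool_calling_loop` and `_run_parallel` are not shown in this memo, and `run_tool` and the tool name `sentry_search_issues` are hypothetical.

```python
from functools import wraps
from typing import Any, Callable


def record_tool_calls(
    run_tool: Callable[..., Any], sink: list[str]
) -> Callable[..., Any]:
    """Wrap a tool runner so each invoked tool name is appended to sink."""

    @wraps(run_tool)
    def wrapper(tool_name: str, *args: Any, **kwargs: Any) -> Any:
        sink.append(tool_name)  # record first, so failing tools are captured too
        return run_tool(tool_name, *args, **kwargs)  # the real tool still runs

    return wrapper


# Hypothetical stand-in for whatever executes a single tool in the gather loop.
def run_tool(tool_name: str, params: dict) -> dict:
    return {"tool": tool_name, "params": params}


gathered_tool_calls: list[str] = []
recorded_run_tool = record_tool_calls(run_tool, gathered_tool_calls)
result = recorded_run_tool("sentry_search_issues", {"query": "crash"})
```

Because the wrapper delegates unchanged, the tool's own behaviour is not replaced; only the fact that it fired is captured.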
7.3 — Add gathered_tools_contract to the scenario schema
Extend the YAML schema with an optional gathered_tools_contract section that the scenario loader
validates and the oracle asserts.
Once 314-windows-crash-multisource-query.yml declares such a contract, a broken is_available check or bad extract_params fails the
test immediately.
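The oracle-side assertion for such a contract could be sketched as below. Only the `must_call_any` key appears in this memo; the function name, the violation-list return shape, and the tool names are assumptions for illustration.

```python
def check_gathered_tools_contract(
    contract: dict[str, list[str]], gathered_tool_calls: list[str]
) -> list[str]:
    """Return a list of violations; an empty list means the contract holds."""
    violations: list[str] = []
    must_call_any = contract.get("must_call_any", [])
    # The contract holds when at least one named tool actually fired.
    if must_call_any and not set(must_call_any) & set(gathered_tool_calls):
        violations.append(
            f"expected at least one of {sorted(must_call_any)}, "
            f"got {sorted(set(gathered_tool_calls))}"
        )
    return violations


# Hypothetical tool names; real registered names may differ.
contract = {"must_call_any": ["sentry_search_issues", "github_search_code"]}
ok = check_gathered_tools_contract(contract, ["sentry_search_issues"])
failed = check_gathered_tools_contract(contract, [])
```

Returning violations instead of raising lets the oracle report contract failures alongside its existing assertions; whether that matches the oracle's current reporting style is an open design choice.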
7.4 — Update the no-mocks rule scope
Amend the “no mocks” policy in AGENTS.md and test_routing_fixture_integrity.py
to distinguish between two separate things:
- Mocking the LLM — prohibited. Routing oracle must exercise the real LLM.
- Injecting fixture integration configs — permitted. This is equivalent to providing test credentials and does not involve the LLM.
The integrity test should permit monkeypatch.setattr on
tool_gathering._resolve_session_integrations and
tool_loop.run_tool_calling_loop while continuing to prohibit patch,
MagicMock, and LLM client stubs.
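One way an integrity check could draw this line is to scan test sources with Python's `ast` module. The sketch below is an assumption about how such a check might work, not a description of test_routing_fixture_integrity.py. It is also simplified: it permits any `monkeypatch.setattr` call instead of restricting the target to the two functions named above.

```python
import ast

PROHIBITED_MODULES = {"unittest.mock", "mock"}
PROHIBITED_NAMES = {"patch", "MagicMock"}


def find_mock_violations(source: str) -> list[str]:
    """Return prohibited mocking constructs found in a test module's source."""
    violations: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name in PROHIBITED_MODULES:
                    violations.append(f"import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module in PROHIBITED_MODULES:
                violations.append(f"from {node.module} import ...")
        elif isinstance(node, ast.Name) and node.id in PROHIBITED_NAMES:
            violations.append(f"use of {node.id}")
    return violations


# monkeypatch.setattr is an attribute access on the pytest fixture, so it passes.
allowed = "def test_x(monkeypatch):\n    monkeypatch.setattr('pkg.fn', lambda: {})\n"
blocked = "from unittest.mock import MagicMock\nm = MagicMock()\n"

allowed_violations = find_mock_violations(allowed)
blocked_violations = find_mock_violations(blocked)
```

A stricter version would also inspect the first argument of each `monkeypatch.setattr` call against an allowlist of targets.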
7.5 — Rename or reclassify misleading existing scenarios
313 (configured_integrations: []) does not test complex shell prompts
with live data — it tests the text-only fallback when no integration is
configured. It should either be moved to docs_no_execute or its notes updated
to reflect that it covers the no-integration fallback path only, not the
data-gathering path.
8. Migration Path and Priority
| Step | Effort | Risk | Priority |
|---|---|---|---|
| 7.1 — resolved_integrations_override seam | Small (20 lines) | Low | P0 — unblocks everything else |
| 7.2 — gather-loop tool call recorder | Small (30 lines) | Low | P0 — required for assertions |
| 7.3 — gathered_tools_contract schema + assertions | Medium (100 lines) | Low | P1 — makes contracts meaningful |
| 7.4 — Update no-mocks rule scope | Trivial | None | P1 — prevents the fix from being reverted |
| 7.5 — Reclassify 313 | Trivial | None | P2 — clarity, not correctness |
| Write new complex_shell_prompts scenarios with fixture configs | Medium | Low | P1 — actual test coverage |
9. What Does Not Change
- The no-mocks policy on the LLM path. The planner, classifier, and conversational assistant all continue to hit the real LLM in routing tests.
- The routing oracle structure (route_input → execute_routed_turn).
- Any existing passing scenario. The resolved_integrations_override is opt-in; existing scenarios without it keep the current no-op gather behaviour and continue to pass.
- CI runtime budget. Fixture-injected integration configs do not make live network calls (tools check is_available against the resolved dict, not a live endpoint), so test time stays flat.
10. Acceptance Criteria for “Fixed”
- A scenario in complex_shell_prompts with resolved_integrations injected and gathered_tools_contract defined fails when the named tools do not fire.
- The same scenario passes when the tools fire and return data.
- Introducing a bug in _posthog_mcp_extract_params (e.g. passing mode="mcp") causes the affected scenario to fail in CI.
- A tool whose is_available is patched to always return False causes the scenario to fail if it is in gathered_tools_contract.must_call_any.
- No existing scenario changes its pass/fail status.
- CI runtime increases by less than 10 seconds per shard.