Memorandum: Routing Scenario Test Infrastructure Gap

Date: 2026-06-18
Concerns: complex_shell_prompts scenario class; oracle coverage of conversational tool-gathering
Status: Active gap — CI gives false green on integration behaviour

Summary

The routing scenario oracle (_oracle_runtime.py) does not observe, assert on, or control the conversational tool-gathering path (gather_tool_evidence → run_tool_calling_loop). Every complex_shell_prompts scenario passes in CI even when zero integrations are queried and the response is entirely hallucinated text. The test infrastructure provides confidence that does not exist.

1. The Two Execution Paths — Only One Is Tested

When a REPL turn routes to handle_message_with_agent, two independent paths can fire:

Path	What it does	Oracle coverage
Planner → `REGISTRY.dispatch`	LLM proposes a terminal action (slash, investigation, shell, etc.); the oracle intercepts every call through `patch_execution_boundary`	Fully observed and asserted
`gather_tool_evidence` → `run_tool_calling_loop`	A bounded ReAct loop queries registered tools (Sentry, GitHub, PostHog, etc.) to ground a conversational answer	Completely unobserved

The oracle patches REGISTRY.dispatch. It does not patch gather_tool_evidence, run_tool_calling_loop, _run_parallel, or _resolve_session_integrations. Tool calls made during the gather pass are invisible to the test.

2. `configured_integrations` in Fixtures Does Not Isolate the Store

fresh_session applies the fixture’s configured_integrations list to session.configured_integrations. This field controls the LLM system-prompt copy and the REPL status bar. It does not control which integrations are actually loaded for the gather loop. _resolve_session_integrations ignores session.configured_integrations entirely:

# app/cli/interactive_shell/chat/tool_gathering.py
def _resolve_session_integrations(session: ReplSession) -> dict[str, Any]:
    if session.resolved_integrations_cache is not None:
        return session.resolved_integrations_cache
    resolved = resolve_integrations({})  # hits the real env and ~/.opensre/integrations.json
    session.resolved_integrations_cache = resolved
    return resolved

resolve_integrations({}) reads the developer’s live ~/.opensre/integrations.json and environment variables. This produces three distinct, silent behaviours across environments:

Environment	What `resolve_integrations` returns	Tool-gathering outcome
CI (no store, no env keys)	`{}` — no tools available	Gathering is a no-op; text-only answer
Developer machine, no keys	`{}`	Same no-op
Developer machine with real integrations	Real configs (PostHog, GitHub, Sentry, …)	Real tool calls fire with the developer’s live credentials

A scenario that declares configured_integrations: [sentry, github, posthog] in CI runs exactly the same code path as one that declares configured_integrations: []. The field is decoration, not isolation.

3. Response Contracts Are Satisfiable by Hallucination Alone

Because the gather loop is a no-op in CI, the response evaluated against the contract is produced by the LLM from its training data, not from any live integration. The contracts for the two complex_shell_prompts scenarios are: 313 (configured_integrations: [])

must_contain_any: [GitHub, issues, Windows, crash]

Any response that mentions “GitHub” passes. The LLM always mentions GitHub when asked about GitHub issues. 314 (configured_integrations: [sentry, github, posthog])

must_contain_all: [Sentry, GitHub, PostHog]
must_contain_any: [Windows, crash]

Any response that mentions all three names passes — including a response that says “I cannot access Sentry, GitHub, or PostHog right now.” The scenario explicitly notes the agent “must commit to checking the connected sources,” but the contract cannot verify this because it cannot observe whether any source was actually checked.

4. The Behaviour Proven by Current Tests

Across all 54 routing scenarios, what passes in CI is:

Routing correctness — route_input returns the right route_kind. This is genuine and valuable.
Deterministic command dispatch — slash commands and aliases resolve correctly. This is genuine and valuable.
Planned terminal action shape — when a planner action fires (slash, shell, investigation), the oracle records and asserts it correctly. This is genuine and valuable.
Hallucination-satisfiable text contracts — for all conversational turns, the contract is met by the LLM generating plausible text that mentions the right words. This is not a meaningful signal.

What is not proven:

Whether the gather loop fires at all.
Whether any specific tool was called.
Whether any integration returned data.
Whether the response is grounded in integration data vs. generated from training knowledge.
Whether a broken integration (validation error, auth failure, timeout) prevents the response from being useful.

5. Why This Is a Large Correctness Risk

The complex_shell_prompts class exists specifically to cover the integration data-gathering surface — the tests are named and described as covering exactly what they do not cover. This creates three concrete risks: Risk 1: Broken tool extraction goes undetected.
If _posthog_mcp_extract_params starts returning bad config fields (as happened: live PostHog calls received posthog_mode="mcp" from the LLM), the scenario passes. The broken extraction is only discovered when a user exercises the feature interactively. Risk 2: Integration registration silently drops.
If a tool’s is_available check starts returning False for all sessions, or if the tool is accidentally deregistered, every complex_shell_prompts scenario still passes. A regression that stops the agent from ever querying GitHub or PostHog cannot be caught by the current test suite. Risk 3: The no-mocks policy blocks the obvious fix.
AGENTS.md and test_routing_fixture_integrity.py enforce a hard no-mocks rule on the routing oracle:

“Do not use unittest.mock, patch, MagicMock, or equivalent mocking primitives in routing tests.”

The intent of this rule is correct — it prevents tests from faking the LLM and making routing assertions against synthetic planner output. But it accidentally also blocks injecting a controlled integration config into the gather loop, which does not involve the LLM at all. The rule currently prevents the fix.

6. Root Cause: Architectural Seam Is Missing

The docstring in scenario_loader.py acknowledges the gap explicitly:

# Answer docstring, path 2:
# "Deeper 'did it actually query the integration?' assertions belong in
# execution-layer tests, not these routing fixtures."

That execution-layer test does not exist. tests/cli/interactive_shell/runtime/ test_answer_with_tools.py patches both gather_tool_evidence and answer_cli_agent entirely, so it tests the wiring between them (gather output flows to answer), not whether the gather loop calls the right tools with the right config. The gap noted in the docstring has never been closed.

7. Proposed Remediation

Three changes are required, in dependency order.

7.1 — Add a stable test seam for integration injection

Add resolved_integrations_override support to fresh_session in the oracle. When set, _resolve_session_integrations returns the override instead of hitting the real store. This does not mock the LLM, does not mock any tool, and does not violate the spirit of the no-mocks rule — it controls the integration config the tool is called with, which is fixture input, not LLM output.

# _oracle_runtime.py
def fresh_session(
    *,
    with_prior_state: bool,
    configured_integrations: tuple[str, ...] = (),
    available_capabilities: dict[str, tuple[str, ...]] | None = None,
    resolved_integrations_override: dict[str, Any] | None = None,
) -> ReplSession:
    session = ReplSession()
    ...
    if resolved_integrations_override is not None:
        session.resolved_integrations_cache = resolved_integrations_override
    return session

run_oracle_once reads this from case.scenario.session.resolved_integrations when present, and uses {} (no-op gather) otherwise. CI remains fast because no fixture currently sets this field.

7.2 — Track gather-loop tool calls in the oracle

Wrap run_tool_calling_loop (or _run_parallel) with a thin recorder inside run_oracle_once so tool calls made during gathering are captured alongside planned terminal actions. This does not mock the tools themselves; it records which ones fired.

# _oracle_runtime.py
gathered_calls: list[str] = []

original_loop = tool_loop.run_tool_calling_loop

def _recording_loop(*args, **kwargs):
    result = original_loop(*args, **kwargs)
    for tc, _ in result.executed:
        gathered_calls.append(tc.name)
    return result

monkeypatch.setattr(tool_loop, "run_tool_calling_loop", _recording_loop)

The oracle result gains gathered_tool_calls: list[str] and the OracleRunResult exposes this for contract assertions.

7.3 — Add `gathered_tools_contract` to the scenario schema

Extend the YAML schema with an optional section that the scenario loader validates and the oracle asserts:

gathered_tools_contract:
  must_call_any:         # at least one of these tool names must appear
  - list_github_issues
  - search_github_issues
  must_not_call:         # none of these must appear
  - run_investigation
  - execute_shell_command

Updated 314-windows-crash-multisource-query.yml:

session:
  configured_integrations: [sentry, github, posthog_mcp]
  resolved_integrations:   # injected into session cache; tool calls run for real
    sentry:
      connection_verified: true
      auth_token: "test-token"
      ...
    github:
      connection_verified: true
      ...
    posthog_mcp:
      connection_verified: true
      mode: streamable-http
      ...

gathered_tools_contract:
  must_call_any:
  - search_sentry_issues
  - list_sentry_issues
  - search_github_issues
  - list_github_issues
  - list_posthog_tools

With the override in the session cache, the tools run with the fixture config (no live credentials needed). With the gather recorder active, the contract is asserted. A broken is_available check or bad extract_params now fails the test immediately.

7.4 — Update the no-mocks rule scope

Amend the “no mocks” policy in AGENTS.md and test_routing_fixture_integrity.py to distinguish between two separate things:

Mocking the LLM — prohibited. Routing oracle must exercise the real LLM.
Injecting fixture integration configs — permitted. This is equivalent to providing test credentials and does not involve the LLM.

Add an AST check that specifically permits monkeypatch.setattr on tool_gathering._resolve_session_integrations and tool_loop.run_tool_calling_loop while continuing to prohibit patch, MagicMock, and LLM client stubs.

7.5 — Rename or reclassify misleading existing scenarios

313 (configured_integrations: []) does not test complex shell prompts with live data — it tests the text-only fallback when no integration is configured. It should either be moved to docs_no_execute or its notes updated to reflect that it covers the no-integration fallback path only, not the data-gathering path.

8. Migration Path and Priority

Step	Effort	Risk	Priority
7.1 — `resolved_integrations_override` seam	Small (20 lines)	Low	P0 — unblocks everything else
7.2 — gather-loop tool call recorder	Small (30 lines)	Low	P0 — required for assertions
7.3 — `gathered_tools_contract` schema + assertions	Medium (100 lines)	Low	P1 — makes contracts meaningful
7.4 — Update no-mocks rule scope	Trivial	None	P1 — prevents the fix from being reverted
7.5 — Reclassify 313	Trivial	None	P2 — clarity, not correctness
Write new `complex_shell_prompts` scenarios with fixture configs	Medium	Low	P1 — actual test coverage

Items 7.1 and 7.2 can land in one PR. Items 7.3 and 7.4 land together. New scenario fixtures follow.

9. What Does Not Change

The no-mocks policy on the LLM path. The planner, classifier, and conversational assistant all continue to hit the real LLM in routing tests.
The routing oracle structure (route_input → execute_routed_turn).
Any existing passing scenario. The resolved_integrations_override is opt-in; existing scenarios without it keep the current no-op gather behaviour and continue to pass.
CI runtime budget. Fixture-injected integration configs do not make live network calls (tools check is_available against the resolved dict, not a live endpoint), so test time stays flat.

10. Acceptance Criteria for “Fixed”

A scenario in complex_shell_prompts with resolved_integrations injected and gathered_tools_contract defined fails when the named tools do not fire.
The same scenario passes when the tools fire and return data.
Introducing a bug in _posthog_mcp_extract_params (e.g. passing mode="mcp") causes the affected scenario to fail in CI.
A tool whose is_available is patched to always return False causes the scenario to fail if it is in gathered_tools_contract.must_call_any.
No existing scenario changes its pass/fail status.
CI runtime increases by less than 10 seconds per shard.

​Memorandum: Routing Scenario Test Infrastructure Gap

​Summary

​1. The Two Execution Paths — Only One Is Tested

​2. configured_integrations in Fixtures Does Not Isolate the Store

​3. Response Contracts Are Satisfiable by Hallucination Alone

​4. The Behaviour Proven by Current Tests

​5. Why This Is a Large Correctness Risk

​6. Root Cause: Architectural Seam Is Missing

​7. Proposed Remediation

​7.1 — Add a stable test seam for integration injection

​7.2 — Track gather-loop tool calls in the oracle

​7.3 — Add gathered_tools_contract to the scenario schema

​7.4 — Update the no-mocks rule scope

​7.5 — Rename or reclassify misleading existing scenarios

​8. Migration Path and Priority

​9. What Does Not Change

​10. Acceptance Criteria for “Fixed”