Memorandum: Turn Scenario Test Infrastructure Gap
Date: 2026-06-18
Concerns: complex_shell_prompts scenario class; oracle coverage of conversational tool-gathering
Status: Partially addressed (2026-06-26). Gather recording, tool_actions, fixture resolved_integrations, and @live fail-closed CI are in place; many handoff scenarios still rely on text-only contracts.
Update (2026-06-26): Natural-language investigation dispatch is re-enabled (INTERACTIVE_SHELL_INVESTIGATION_ENABLED = True). Scenarios 314, 338, 339, and 315 assert gather dispatch via tool_actions with fixture integrations; 333–335 and 337 use @live for canonical per-integration gather. Handoff-only 313 lives under chat_handoff/. Remaining gap: scenarios without tool_actions gather entries still pass on hallucination-satisfiable text contracts only.
Update (2026-06-19): The scenario schema has since been trimmed and the oracle’s capability defaults realigned with production. available_capabilities is now a three-state knob (omit = enabled/production default, [] = disabled, non-empty = allowlist) instead of disabling slash/cli/synthetic by default, and the dead risk_level, tier, remote_connected, and surface fields were removed. See the “Scenario schema and available_capabilities semantics” section of interactive_shell/harness/AGENTS.md for the canonical contract.
Summary
The turn scenario oracle (_oracle_runtime.py) does not observe, assert on,
or control the conversational tool-gathering path (gather_tool_evidence →
run_tool_calling_loop). Every complex_shell_prompts scenario passes in CI
even when zero integrations are queried and the response is entirely hallucinated
text. The test infrastructure provides confidence that does not exist.
1. The Two Execution Paths — Only One Is Tested
When a REPL turn enters handle_message_with_agent, two independent paths
can fire:
| Path | What it does | Oracle coverage |
|---|---|---|
| Action agent → AgentTool execution | LLM proposes shell action tool calls (slash, investigation, shell, etc.); the oracle observes the terminal side effects recorded by the action tools | Fully observed and asserted |
| gather_tool_evidence → shared runtime loop | A bounded ReAct loop queries registered tools (Sentry, GitHub, PostHog, etc.) to ground a conversational answer | Completely unobserved |
The oracle does not observe or control gather_tool_evidence, the shared
tool-gathering harness, or _resolve_session_integrations.
Tool calls made during the gather pass are invisible to the test.
2. configured_integrations in Fixtures Does Not Isolate the Store
fresh_session applies the fixture’s configured_integrations list to
session.configured_integrations. This field controls the LLM system-prompt
copy and the REPL status bar. It does not control which integrations are
actually loaded for the gather loop.
_resolve_session_integrations ignores session.configured_integrations
entirely. It calls resolve_integrations({}), which reads the developer’s live
~/.opensre/integrations.json and environment variables. This produces three
distinct, silent behaviours across environments:
| Environment | What resolve_integrations returns | Tool-gathering outcome |
|---|---|---|
| CI (no store, no env keys) | {} — no tools available | Gathering is a no-op; text-only answer |
| Developer machine, no keys | {} | Same no-op |
| Developer machine with real integrations | Real configs (PostHog, GitHub, Sentry, …) | Real tool calls fire with the developer’s live credentials |
A scenario that declares configured_integrations: [sentry, github, posthog]
in CI runs exactly the same code path as one that declares
configured_integrations: []. The field is decoration, not isolation.
3. Response Contracts Are Satisfiable by Hallucination Alone
Because the gather loop is a no-op in CI, the response evaluated against the contract is produced by the LLM from its training data, not from any live integration. The contracts for the two complex_shell_prompts scenarios are:
- 313 (configured_integrations: [])
- The second scenario (configured_integrations: [sentry, github, posthog])
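To make "satisfiable by hallucination" concrete, here is a minimal sketch of a keyword-style text contract. The function name, the must_mention parameter, and the sample response are all hypothetical; the real contract format is not reproduced in this memo. The point is only that a check on wording cannot tell grounded text from invented text.

```python
def satisfies_text_contract(response: str, must_mention: list) -> bool:
    """Return True when every required keyword appears in the response."""
    lowered = response.lower()
    return all(word.lower() in lowered for word in must_mention)


# Hypothetical response produced with zero tool calls.
ungrounded = "Sentry shows a spike in crashes and GitHub has a related recent commit."
gathered_tool_calls: list = []  # nothing was queried during the turn

passes = satisfies_text_contract(ungrounded, ["sentry", "github"])
```

The contract passes even though the list of gathered tool calls is empty, which is why an assertion on the calls themselves is needed alongside any text check.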
4. The Behaviour Proven by Current Tests
Across all 54 turn scenarios, what passes in CI is:
- Turn-entry correctness: every turn is handed to the agent entrypoint. This is intentionally static; the valuable behavior is downstream dispatch and planning.
- Deterministic command-text detection: slash commands and aliases resolve correctly for UI policy decisions. This is genuine and valuable.
- Planned terminal action shape: when a planner action fires (slash, shell, investigation), the oracle records and asserts it correctly. This is genuine and valuable.
- Hallucination-satisfiable text contracts: for all conversational turns, the contract is met by the LLM generating plausible text that mentions the right words. This is not a meaningful signal.

What is not proven:
- Whether the gather loop fires at all.
- Whether any specific tool was called.
- Whether any integration returned data.
- Whether the response is grounded in integration data vs. generated from training knowledge.
- Whether a broken integration (validation error, auth failure, timeout) prevents the response from being useful.
5. Why This Is a Large Correctness Risk
The complex_shell_prompts class exists specifically to cover the integration
data-gathering surface, yet the tests are named and described as covering exactly
what they do not cover. This creates three concrete risks:
Risk 1: Broken tool extraction goes undetected. If
_posthog_mcp_extract_params starts returning bad config fields (as
happened: live PostHog calls received posthog_mode="mcp" from the LLM), the
scenario passes. The broken extraction is only discovered when a user exercises
the feature interactively.
Risk 2: Integration registration silently drops. If a tool’s
is_available check starts returning False for all sessions, or
if the tool is accidentally deregistered, every complex_shell_prompts scenario
still passes. A regression that stops the agent from ever querying GitHub or
PostHog cannot be caught by the current test suite.
Risk 3: The no-mocks policy blocks the obvious fix. AGENTS.md and test_turn_fixture_integrity.py enforce a hard no-mocks
rule on the turn oracle:
“Do not use unittest.mock, patch, MagicMock, or equivalent mocking primitives in turn tests.”
The intent of this rule is correct: it prevents tests from faking the LLM and making action-planning assertions against synthetic planner output. But it accidentally also blocks injecting a controlled integration config into the gather loop, which does not involve the LLM at all. The rule currently prevents the fix.
6. Root Cause: Architectural Seam Is Missing
The docstring in scenario_loader.py acknowledges the gap explicitly.
tests/interactive_shell/runtime/ test_answer_with_tools.py patches both gather_tool_evidence and
answer_cli_agent entirely, so it tests the wiring between them (gather output
flows to answer), not whether the gather loop calls the right tools with the
right config. The gap noted in the docstring has never been closed.
7. Proposed Remediation
Three changes are required, in dependency order.
7.1 — Add a stable test seam for integration injection
Add resolved_integrations_override support to fresh_session in the oracle.
When set, _resolve_session_integrations returns the override instead of
hitting the real store. This does not mock the LLM, does not mock any tool, and
does not violate the spirit of the no-mocks rule: it controls the integration
config the tool is called with, which is fixture input, not LLM output.
run_oracle_once reads this from case.scenario.session.resolved_integrations
when present, and uses {} (no-op gather) otherwise. CI remains fast because
no fixture currently sets this field.
7.2 — Track gather-loop tool calls in the oracle
Wrap run_tool_calling_loop (or _run_parallel) with a thin recorder inside
run_oracle_once so tool calls made during gathering are captured alongside
planned terminal actions. This does not mock the tools themselves; it records
which ones fired.
The recorder accumulates tool names in gathered_tool_calls: list[str], and
OracleRunResult exposes this for contract assertions.
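The recording idea can be sketched as follows. The signature of run_tool_calling_loop is not shown in this memo, so this sketch wraps a plain dict of callables instead; that shape, the record_tool_calls helper, and the reduced OracleRunResult are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class OracleRunResult:
    # Names of tools invoked during the gather pass, in call order.
    gathered_tool_calls: List[str] = field(default_factory=list)


def record_tool_calls(
    tools: Dict[str, Callable[..., Any]], result: OracleRunResult
) -> Dict[str, Callable[..., Any]]:
    """Wrap each tool so its name is recorded before the real call runs."""

    def wrap(name: str, tool: Callable[..., Any]) -> Callable[..., Any]:
        def recorded(*args: Any, **kwargs: Any) -> Any:
            result.gathered_tool_calls.append(name)
            return tool(*args, **kwargs)  # the tool itself is not replaced

        return recorded

    return {name: wrap(name, tool) for name, tool in tools.items()}
```

Because the wrapper delegates to the original callable, tool behaviour is unchanged; the oracle only gains a record of which tools fired.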
7.3 — Add tool_actions gather entries to the scenario schema
Fixtures now use a unified tool_actions list with surface: gather and
expect modes instead of a separate gathered_tools_contract block.
Example: 314-windows-crash-multisource-query.yml.
A broken is_available check or bad extract_params now fails the
test immediately.
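One way the oracle could evaluate a gather entry against the recorded gathered_tool_calls is sketched below. The expect values call_any and not_called and the surface: gather key appear in this memo; the tools key, the function name, and the failure-message format are hypothetical.

```python
def check_gather_action(action: dict, gathered_tool_calls: list) -> list:
    """Return failure messages for one tool_actions entry (empty list = pass)."""
    if action.get("surface") != "gather":
        return []  # other surfaces are asserted elsewhere
    expect = action["expect"]
    named = set(action.get("tools", []))  # hypothetical key listing tool names
    called = set(gathered_tool_calls)
    if expect == "not_called":
        # With no names listed, any gather call at all is a failure.
        hit = sorted(called & named) if named else sorted(called)
        return ["gather tools fired unexpectedly: %s" % hit] if hit else []
    if expect == "call_any":
        if called & named:
            return []
        return ["none of %s fired; saw %s" % (sorted(named), sorted(called))]
    return ["unknown expect mode: %r" % (expect,)]
```

Returning messages rather than a boolean keeps the failure output specific enough to tell "nothing fired" from "the wrong tool fired".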
7.4 — Update the no-mocks rule scope
Amend the “no mocks” policy in AGENTS.md and test_turn_fixture_integrity.py
to distinguish between two separate things:
- Mocking the LLM — prohibited. Turn oracle must exercise the real LLM.
- Injecting fixture integration configs — permitted. This is equivalent to providing test credentials and does not involve the LLM.
Concretely, the integrity check should permit monkeypatch.setattr on
tool_gathering._resolve_session_integrations and
tool_loop.run_tool_calling_loop while continuing to prohibit patch,
MagicMock, and LLM client stubs.
7.5 — Rename or reclassify misleading existing scenarios
313 (configured_integrations: []) is now under chat_handoff/ with
tool_actions gather not_called assertions. It covers the no-integration
handoff path, not live data gathering. 338 and 339 assert gather
call_any with fixture resolved_integrations.
8. Migration Path and Priority
| Step | Effort | Risk | Priority |
|---|---|---|---|
| 7.1 — resolved_integrations_override seam | Small (20 lines) | Low | P0 — unblocks everything else |
| 7.2 — gather-loop tool call recorder | Small (30 lines) | Low | P0 — required for assertions |
| 7.3 — gathered_tools_contract schema + assertions | Medium (100 lines) | Low | P1 — makes contracts meaningful |
| 7.4 — Update no-mocks rule scope | Trivial | None | P1 — prevents the fix from being reverted |
| 7.5 — Reclassify 313 | Trivial | None | P2 — clarity, not correctness |
| Write new complex_shell_prompts scenarios with fixture configs | Medium | Low | P1 — actual test coverage |
9. What Does Not Change
- The no-mocks policy on the LLM path. The planner, classifier, and conversational assistant all continue to hit the real LLM in turn tests.
- The turn-execution oracle still invokes handle_message_with_agent directly.
- Any existing passing scenario. The resolved_integrations_override is opt-in; existing scenarios without it keep the current no-op gather behaviour and continue to pass.
- CI runtime budget. Fixture-injected integration configs do not make live network calls (tools check is_available against the resolved dict, not a live endpoint), so test time stays flat.
10. Acceptance Criteria for “Fixed”
- A scenario in complex_shell_prompts with resolved_integrations injected and gathered_tools_contract defined fails when the named tools do not fire.
- The same scenario passes when the tools fire and return data.
- Introducing a bug in _posthog_mcp_extract_params (e.g. passing mode="mcp") causes the affected scenario to fail in CI.
- A tool whose is_available is patched to always return False causes the scenario to fail if it is in gathered_tools_contract.must_call_any.
- No existing scenario changes its pass/fail status.
- CI runtime increases by less than 10 seconds per shard.