Overview
CloudOpsBench is a 452-scenario Kubernetes root-cause-analysis benchmark published by Wang et al. (arXiv:2603.00468v1, Feb 2026). The paper uses a State Snapshot paradigm: each fault case is a frozen JSON repository served via mocked kubectl-style tool calls, so
every evaluation is bit-for-bit reproducible and needs no live cluster.
OpenSRE wraps this corpus through a small reusable benchmark framework
that adds cost tracking, integrity guards (pre-registration, per-stratum
reporting, negative results, COI disclosure), per-LLM dispatch with
version pinning, and self-contained markdown + HTML reports. The goal
is to publish the opensre+LLM column against the paper’s LLM-alone
baselines on the same scenarios.
What you need to run it
CloudOpsBench needs no live infrastructure. The frozen snapshots are the environment.
Quick start
List adapters
Validate a config
Validation checks the config for problems such as runs_per_case < 3, a missing
pre_registration_path, oversized grids, and a system-path output_dir.
Validation returns non-zero on any failure.
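The checks named above can be sketched roughly as follows. This is a minimal illustration, not the framework's validator: the function name `validate_config`, the `models` and `cases` keys, the grid-size limit, and the list of system paths are all assumptions made for the example.

```python
from pathlib import PurePosixPath

# Hypothetical thresholds; the framework's real limits may differ.
MIN_RUNS_PER_CASE = 3
MAX_GRID_CELLS = 10_000
SYSTEM_PATHS = ("/etc", "/usr", "/bin", "/var")


def validate_config(cfg: dict) -> list:
    """Return a list of problems; an empty list means the config passed."""
    problems = []
    runs = cfg.get("runs_per_case", 0)
    if runs < MIN_RUNS_PER_CASE:
        problems.append(f"runs_per_case={runs} is below {MIN_RUNS_PER_CASE}")
    if not cfg.get("pre_registration_path"):
        problems.append("pre_registration_path is missing")
    # Assumed grid size: models x cases x repeats.
    cells = len(cfg.get("models", [])) * len(cfg.get("cases", [])) * runs
    if cells > MAX_GRID_CELLS:
        problems.append(f"grid has {cells} cells, limit is {MAX_GRID_CELLS}")
    out = PurePosixPath(cfg.get("output_dir", "."))
    if any(out.is_relative_to(p) for p in SYSTEM_PATHS):
        problems.append(f"output_dir {out} is under a system path")
    return problems


def exit_code(problems: list) -> int:
    """Non-zero on any failure, matching the behavior described above."""
    return 1 if problems else 0
```

Returning every problem at once, rather than stopping at the first, lets a CI log show all config mistakes in one pass.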
Dev-mode run
--dev skips the integrity gates so you can smoke-test the wiring
without writing a pre-registration file. The run ID gets a dev-
prefix so dev results can’t be silently promoted.
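The run-ID prefixing can be pictured with a small sketch. The helper names and the ID format (a short random hex string) are hypothetical; only the dev- prefix comes from the description above.

```python
import uuid

DEV_PREFIX = "dev-"


def make_run_id(dev: bool) -> str:
    """Build a run ID; dev runs carry a visible prefix."""
    base = uuid.uuid4().hex[:12]  # assumed ID format, for illustration only
    return f"{DEV_PREFIX}{base}" if dev else base


def is_dev_run(run_id: str) -> bool:
    """A publishing step could use this to refuse dev results."""
    return run_id.startswith(DEV_PREFIX)
```

Because the hex base can never contain the characters "v" or "-", a production ID cannot collide with the dev prefix in this sketch.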
Production run
A production run requires:
- A pre-registration YAML at pre_registration_path listing per-model expected deltas, committed to git before the run starts (integrity Mechanism 1)
- seed: set in config (Mechanism 6)
- Adapter declaration of data_contamination_checked = True (Mechanism 7)
- At least one validity metric declared by the adapter (Mechanism 3)
Each run writes report.json (machine-readable),
report.md (human-readable summary), report.html (self-contained, no
external CSS/JS), and cases/*.json (per-cell artifacts).
Re-render an existing report
Config reference
Env-var overrides for CI
These let CI override knobs without editing the YAML:
| Variable | Purpose |
|---|---|
| OPENSRE_BENCH_WORKERS | Override workers: |
| OPENSRE_BENCH_COST_BUDGET_USD | Override cost_budget_usd: |
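One way such overrides could be layered on top of a loaded config is sketched below. The function name `apply_env_overrides` and the dict-based config are assumptions; only the two variable names and the keys they override come from the table.

```python
import os


def apply_env_overrides(cfg: dict, env=os.environ) -> dict:
    """Return a copy of cfg with CI overrides applied where they are set."""
    merged = dict(cfg)
    workers = env.get("OPENSRE_BENCH_WORKERS")
    if workers:
        merged["workers"] = int(workers)
    budget = env.get("OPENSRE_BENCH_COST_BUDGET_USD")
    if budget:
        merged["cost_budget_usd"] = float(budget)
    return merged
```

Passing `env` as a parameter keeps the function testable without mutating the real process environment.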
Integrity guarantees
The framework defines 11 honest-results mechanisms and enforces most of them at the code level. Outside --dev mode, there is no bypass short of editing the framework itself.
Pre-flight (before any case runs)
IntegrityGuard.pre_flight raises IntegrityViolation if any of these
hold:
- M1 (Pre-registration): pre_registration_path unset, missing, or empty. Forces the engineer to commit expected deltas before seeing results.
- M3 (Validity metrics): adapter declares no validity metric (no Streetlight Effect).
- M6 (Seeded selection): seed: is None (no cherry-picking).
- M7 (Contamination check): adapter has not declared data_contamination_checked = True.
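The four pre-flight gates can be sketched as a single function. This is an illustration under assumptions, not the framework's IntegrityGuard: the `AdapterInfo` dataclass, the function signature, and the choice to report all violations in one exception are hypothetical.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


class IntegrityViolation(Exception):
    """Raised when one or more pre-flight gates fail."""


@dataclass
class AdapterInfo:  # hypothetical stand-in for what an adapter declares
    validity_metrics: list = field(default_factory=list)
    data_contamination_checked: bool = False


def pre_flight(pre_registration_path: Optional[str],
               seed: Optional[int],
               adapter: AdapterInfo) -> None:
    violations = []
    # M1: the pre-registration file must be set, exist, and be non-empty.
    path = Path(pre_registration_path) if pre_registration_path else None
    if path is None or not path.is_file() or path.stat().st_size == 0:
        violations.append("M1: pre-registration file unset, missing, or empty")
    # M3: at least one validity metric must be declared.
    if not adapter.validity_metrics:
        violations.append("M3: adapter declares no validity metric")
    # M6: case selection must be seeded (0 is a valid seed, None is not).
    if seed is None:
        violations.append("M6: seed is None")
    # M7: the adapter must explicitly declare the contamination check.
    if adapter.data_contamination_checked is not True:
        violations.append("M7: data_contamination_checked is not True")
    if violations:
        raise IntegrityViolation("; ".join(violations))
```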
Report-validation (before the report is emitted)
IntegrityGuard.report_validation refuses to publish a report if:
- M3: Not every adapter-declared metric is in the report
- M4: Per-stratum breakdown missing or contains only "all" (no aggregate-only reporting)
- M5: Raw per-case artifacts directory missing
- M9: negative_results is empty
- M10: coi_disclosure is empty
- M1: Pre-registration path not carried into the report
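The report-level gates lend themselves to a similar sketch. The report layout used here (a dict with `metrics`, `strata`, `negative_results`, `coi_disclosure`, and `pre_registration_path` keys) and the function names are assumptions for illustration; the real report schema may differ.

```python
from pathlib import Path


class IntegrityViolation(Exception):
    """Raised when a report fails an integrity gate."""


def find_report_violations(report: dict, declared_metrics: set,
                           cases_dir: str) -> list:
    violations = []
    missing = declared_metrics - set(report.get("metrics", {}))
    if missing:
        violations.append(f"M3: metrics missing from report: {sorted(missing)}")
    strata = report.get("strata", {})
    if not strata or set(strata) == {"all"}:
        violations.append("M4: no per-stratum breakdown beyond 'all'")
    if not Path(cases_dir).is_dir():
        violations.append("M5: raw per-case artifacts directory missing")
    if not report.get("negative_results"):
        violations.append("M9: negative_results is empty")
    if not report.get("coi_disclosure"):
        violations.append("M10: coi_disclosure is empty")
    if not report.get("pre_registration_path"):
        violations.append("M1: pre-registration path not carried into report")
    return violations


def report_validation(report: dict, declared_metrics: set,
                      cases_dir: str) -> None:
    """Refuse to publish by raising if any gate fails."""
    violations = find_report_violations(report, declared_metrics, cases_dir)
    if violations:
        raise IntegrityViolation("; ".join(violations))
```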
Two more mechanisms are operational, not code-enforced
- M8 — External replication of ≥1 cell by a third party before public claim
- M11 — Blinded LLM-as-judge calibration (BDIL Phase B; tracked separately)
Cost tracking
The framework registers a usage hook on app/services/llm_client.py’s
LLMClient, OpenAILLMClient, and BedrockLLMClient. Every successful
LLM call feeds (model, tokens_in, tokens_out) into a CostTracker.
The tracker enforces the configured cost_budget_usd as a hard cap —
the next call that would exceed budget raises CostBudgetExceeded and
the runner halts cleanly with a partial-completion report.
Per-cell tokens_in / tokens_out / cost_usd are currently 0 (aggregate
cost is correct; per-cell delta capture is a follow-up). Total run cost
in report.json is accurate.
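The hard-cap behavior can be illustrated with a minimal tracker. This is a sketch under assumptions: the `record` method name, the price table, and the per-million-token pricing are hypothetical, and the prices are placeholders rather than real rates. Only the (model, tokens_in, tokens_out) inputs and the CostBudgetExceeded name come from the description above.

```python
class CostBudgetExceeded(Exception):
    """Raised when recording a call would push spend past the budget."""


# Placeholder USD prices per 1M tokens (input, output); not real pricing.
PRICES = {"example-model": (1.00, 2.00)}


class CostTracker:
    def __init__(self, budget_usd: float, prices: dict = PRICES):
        self.budget_usd = budget_usd
        self.prices = prices
        self.total_usd = 0.0

    def record(self, model: str, tokens_in: int, tokens_out: int) -> float:
        """Add one call's cost to the running total, or raise at the cap."""
        price_in, price_out = self.prices[model]
        cost = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
        if self.total_usd + cost > self.budget_usd:
            raise CostBudgetExceeded(
                f"spend {self.total_usd + cost:.4f} USD would exceed "
                f"budget {self.budget_usd:.4f} USD"
            )
        self.total_usd += cost
        return cost
```

A runner built on this could catch CostBudgetExceeded at the top of its loop, stop scheduling new cells, and write the partial-completion report from whatever finished.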
Metrics
The paper’s 13 deterministic metrics plus 3 framework-added validity metrics:
| Family | Metric | Source |
|---|---|---|
| Outcome | a1, a3, tcr, exact, in_order, any_order | Paper § 4.2.1 |
| Process — alignment | rel, cov | Paper § 4.2.2 |
| Process — efficiency | steps, mtti | Paper § 4.2.2 |
| Process — robustness | iac, rar, ztdr | Paper § 4.2.2 |
| Validity | citation_grounding_rate, entity_existence_rate, kubectl_actionability_rate | Framework (regex + universe check) |
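To give a feel for what a "regex + universe check" validity metric might look like, here is a hedged sketch in the spirit of entity_existence_rate. The regex, the kind/name reference format, and the choice to return 0.0 when no entity is mentioned are all assumptions; the framework's actual definition may differ.

```python
import re

# Assumed pattern: Kubernetes-style references such as "pod/cart-1".
ENTITY_RE = re.compile(
    r"\b(pod|deployment|service|node)/([a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"
)


def entity_existence_rate(answer: str, universe: set) -> float:
    """Fraction of entities named in the answer that exist in the snapshot.

    `universe` is the set of "kind/name" strings present in the frozen
    snapshot. Returns 0.0 when the answer names no entity (an assumption).
    """
    mentioned = {f"{kind}/{name}" for kind, name in ENTITY_RE.findall(answer)}
    if not mentioned:
        return 0.0
    return len(mentioned & universe) / len(mentioned)
```

For example, an answer naming pod/cart-1 and deployment/ghost against a snapshot that contains only the former would score 0.5 under this sketch.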
Existing production entry points
make test-cloudopsbench and opensre tests cloudopsbench route through
tests/benchmarks/cloudopsbench/run_suite.py, which is the legacy
imperative-CLI surface. The framework runner is the new YAML-config
surface and coexists with it during the transition. Both call into the
same adapter, scoring code, and replay backend.
Reference
- Paper: Wang et al., Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems, arXiv:2603.00468v1, 28 Feb 2026 (GitHub)
- HF dataset: tracer-cloud/cloud-ops-bench-dataset
- Framework source: tests/benchmarks/_framework/
- Adapter source: tests/benchmarks/cloudopsbench/