CloudOpsBench benchmark - OpenSRE Documentation

Overview

CloudOpsBench is a 452-scenario Kubernetes root-cause-analysis benchmark published by Wang et al. (arXiv:2603.00468v1, Feb 2026). The paper uses a State Snapshot paradigm: each fault case is a frozen JSON repository served through mocked kubectl-style tool calls, so the environment every evaluation sees is bit-for-bit reproducible and no live cluster is needed. OpenSRE wraps this corpus in a small reusable benchmark framework that adds cost tracking, integrity guards (pre-registration, per-stratum reporting, negative results, COI disclosure), per-LLM dispatch with version pinning, and self-contained Markdown and HTML reports. The goal is to publish an opensre+LLM column alongside the paper’s LLM-alone baselines on the same scenarios.

Paper baseline                  opensre+LLM (this benchmark)
──────────────                  ────────────────────────────
DeepSeek-V3.2    0.73   A@1  →  target 0.78+
GPT-5            0.67   A@1  →  target 0.78+
GPT-4o           0.49   A@1  →  target 0.65+
Claude-4-Sonnet  0.50   A@1  →  target 0.65+
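
To make the State Snapshot idea concrete, here is a minimal sketch of replaying recorded tool output instead of querying a cluster. The snapshot layout (a flat mapping from command string to recorded output), the `SnapshotReplay` class, and the sample commands are all assumptions for illustration; the real corpus schema and replay backend may look quite different.

```python
import json

# Hypothetical snapshot layout: command string -> recorded output.
# The real corpus schema is not described here and may differ.
SNAPSHOT_JSON = """
{
  "kubectl get pods -n boutique": "NAME         READY  STATUS\\ncartservice  0/1    CrashLoopBackOff",
  "kubectl logs cartservice -n boutique": "panic: redis connection refused"
}
"""


class SnapshotReplay:
    """Serve recorded outputs for kubectl-style calls from a frozen snapshot."""

    def __init__(self, snapshot: dict[str, str]) -> None:
        self._snapshot = snapshot

    def call(self, command: str) -> str:
        # The same command always yields the same output; no cluster is contacted.
        return self._snapshot.get(command, f"error: no recorded output for {command!r}")


replay = SnapshotReplay(json.loads(SNAPSHOT_JSON))
first = replay.call("kubectl logs cartservice -n boutique")
second = replay.call("kubectl logs cartservice -n boutique")
```

Because the lookup is a pure function of the frozen data, two runs that issue the same tool calls observe identical evidence, which is what makes results comparable across models.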

What you need to run it

CloudOpsBench needs no live infrastructure. The frozen snapshots are the environment.

# 1. Python 3.12+ (project standard)

# 2. Benchmark-dedicated LLM API keys — keep separate from production opensre keys
export ANTHROPIC_API_KEY=...        # Claude-4-Sonnet via Anthropic direct
export OPENAI_API_KEY=...           # GPT-5, GPT-4o
export DEEPSEEK_API_KEY=...         # DeepSeek-V3.2

# 3. Pull the corpus (one-time, a few hundred MB)
make download-cloudopsbench-hf

You do not need: AWS credentials, an EKS cluster, kind/minikube, Bedrock, GPU, Grafana, Datadog, or Prometheus.

Quick start

List adapters

uv run python -m tests.benchmarks._framework.cli list

Validate a config

uv run python -m tests.benchmarks._framework.cli validate \
    tests/benchmarks/configs/cloudopsbench_smoke.yml

The config lint catches anti-patterns (runs_per_case < 3, missing pre_registration_path, oversized grids, system-path output_dir). The validate command exits with a non-zero status on any failure.
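
The lint rules listed above could be sketched roughly as follows. This is not the framework's code: `lint_config`, `exit_code`, the `MAX_GRID_CELLS` threshold, and the `SYSTEM_DIRS` deny-list are hypothetical, and the real checks may use different limits and messages.

```python
from pathlib import PurePosixPath

MAX_GRID_CELLS = 50_000                         # hypothetical "oversized grid" threshold
SYSTEM_DIRS = ("/etc", "/usr", "/bin", "/var")  # hypothetical deny-list


def lint_config(config: dict, n_cases: int) -> list[str]:
    """Return anti-pattern messages; an empty list means the config passes."""
    problems: list[str] = []
    runs = config.get("runs_per_case", 0)
    if runs < 3:
        problems.append("runs_per_case < 3: too few replicates for a variance estimate")
    if not config.get("pre_registration_path"):
        problems.append("missing pre_registration_path")
    # Grid size = modes x llms x cases x replicates.
    cells = len(config.get("modes", [])) * len(config.get("llms", [])) * n_cases * runs
    if cells > MAX_GRID_CELLS:
        problems.append(f"oversized grid: {cells} cells")
    out = PurePosixPath(config.get("output_dir", ""))
    if any(out == PurePosixPath(d) or PurePosixPath(d) in out.parents for d in SYSTEM_DIRS):
        problems.append(f"output_dir {out} is under a system path")
    return problems


def exit_code(problems: list[str]) -> int:
    """Map lint results to a process exit status (non-zero on any failure)."""
    return 1 if problems else 0
```

Collecting every message before deciding the exit status lets one validation pass report all problems at once.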

Dev-mode run

--dev skips the integrity gates so you can smoke-test the wiring without writing a pre-registration file. The run ID gets a dev- prefix so dev results can’t be silently promoted.

uv run python -m tests.benchmarks._framework.cli run \
    tests/benchmarks/configs/cloudopsbench_smoke.yml --dev

Production run

A production run requires:

A pre-registration YAML at pre_registration_path that lists per-model expected deltas and is committed to git before the run starts (integrity Mechanism 1)
A seed: value set in the config (Mechanism 6)
An adapter that declares data_contamination_checked = True (Mechanism 7)
At least one validity metric declared by the adapter (Mechanism 3)

uv run python -m tests.benchmarks._framework.cli run \
    tests/benchmarks/configs/cloudopsbench_v1.yml

On completion, the run directory contains report.json (machine-readable), report.md (human-readable summary), report.html (self-contained, no external CSS/JS), and cases/*.json (per-cell artifacts).

Re-render an existing report

uv run python -m tests.benchmarks._framework.cli report \
    .bench-results/example/<run-dir>/

Config reference

benchmark: cloudopsbench

modes:
  - opensre+llm          # opensre wrapping the LLM
  # - llm_alone          # paper provides LLM-alone numbers; rerun only if not trusting them

llms:
  - claude-4-sonnet
  - deepseek-v3.2
  - gpt-5
  - gpt-4o

model_versions:          # pinned to exact provider snapshots
  claude-4-sonnet: claude-sonnet-4-5-20250929
  deepseek-v3.2:   deepseek-chat-v3.2
  gpt-5:           gpt-5-2025-08-07
  gpt-4o:          gpt-4o-2024-11-20

runs_per_case: 3         # replication for variance estimate (Box-Hunter-Hunter Ch 3.4)
workers: 4               # serial across LLMs, parallel within
cost_budget_usd: 1000    # hard cap; run aborts cleanly when exceeded
seed: 42                 # required for reproducible case selection (M6)

filters:                 # optional case subsetting
  systems: [boutique]
  difficulty: [hard, medium]

output_dir: .bench-results/cloudopsbench-v1/
report_formats: [json, markdown, html]
pre_registration_path: tests/benchmarks/configs/preregistrations/v1.yml
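
As an illustration of how the seed and filters keys can work together, the sketch below filters a case list and then orders it with a seeded generator. The `select_cases` function, the `FILTER_FIELDS` mapping, and the per-case field names (`id`, `system`, `difficulty`) are assumptions; the framework's actual selection logic is not shown in this page.

```python
import random
from typing import Optional

# Hypothetical mapping from config filter keys to per-case fields.
FILTER_FIELDS = {"systems": "system", "difficulty": "difficulty"}


def select_cases(cases: list[dict], filters: dict, seed: int,
                 limit: Optional[int] = None) -> list[dict]:
    """Keep cases matching every filter, then order them reproducibly."""
    kept = [
        case for case in cases
        if all(case.get(FILTER_FIELDS[key]) in allowed for key, allowed in filters.items())
    ]
    kept.sort(key=lambda case: case["id"])  # canonical order, independent of input order
    random.Random(seed).shuffle(kept)       # the seed alone determines the permutation
    return kept if limit is None else kept[:limit]
```

Sorting before shuffling matters: without it, the same seed could select different cases whenever the corpus happens to be loaded in a different order.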

Env-var overrides for CI

These let CI override knobs without editing the YAML:

Variable	Purpose
`OPENSRE_BENCH_WORKERS`	Override `workers:`
`OPENSRE_BENCH_COST_BUDGET_USD`	Override `cost_budget_usd:`
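
A minimal sketch of how such overrides might be applied on top of a loaded config follows. Only the two variable names come from the table above; `apply_env_overrides` and its type coercions (integer workers, float budget) are assumptions.

```python
import os
from collections.abc import Mapping


def apply_env_overrides(config: dict, environ: Mapping[str, str] = os.environ) -> dict:
    """Return a copy of config with CI overrides applied; the input is not mutated."""
    merged = dict(config)
    if "OPENSRE_BENCH_WORKERS" in environ:
        merged["workers"] = int(environ["OPENSRE_BENCH_WORKERS"])
    if "OPENSRE_BENCH_COST_BUDGET_USD" in environ:
        merged["cost_budget_usd"] = float(environ["OPENSRE_BENCH_COST_BUDGET_USD"])
    return merged


base = {"workers": 4, "cost_budget_usd": 1000}
ci = apply_env_overrides(base, {"OPENSRE_BENCH_WORKERS": "8"})
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.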

Integrity guarantees

The framework defines 11 honest-results mechanisms. Most are enforced at the code level, and outside --dev mode there is no bypass short of editing the framework itself; two are operational rather than code-enforced.

Pre-flight (before any case runs)

IntegrityGuard.pre_flight raises IntegrityViolation if any of these hold:

M1 — Pre-registration: pre_registration_path unset, missing, or empty. Forces the engineer to commit expected deltas before seeing results.
M3 — Validity metrics: adapter declares no validity metric (no Streetlight Effect).
M6 — Seeded selection: seed: is None (no cherry-picking).
M7 — Contamination check: adapter has not declared data_contamination_checked = True.

All violations surface in a single exception, so the engineer can fix everything in one pass instead of fixing one problem, rerunning, and discovering the next.
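
The collect-then-raise pattern described above can be sketched as follows. The `IntegrityViolation` class here is a local stand-in, and `pre_flight_sketch` plus the `validity_metrics` adapter key are hypothetical; the real `IntegrityGuard.pre_flight` may be structured differently.

```python
from pathlib import Path


class IntegrityViolation(Exception):
    """Stand-in exception carrying every failed check, not just the first."""

    def __init__(self, violations: list[str]) -> None:
        self.violations = violations
        super().__init__("; ".join(violations))


def pre_flight_sketch(config: dict, adapter: dict) -> None:
    violations: list[str] = []
    prereg = config.get("pre_registration_path")
    # M1: the path must be set, exist, and be non-empty.
    if not prereg or not Path(prereg).is_file() or Path(prereg).stat().st_size == 0:
        violations.append("M1: pre-registration unset, missing, or empty")
    if not adapter.get("validity_metrics"):
        violations.append("M3: adapter declares no validity metric")
    if config.get("seed") is None:
        violations.append("M6: seed is None")
    if adapter.get("data_contamination_checked") is not True:
        violations.append("M7: contamination check not declared")
    if violations:
        raise IntegrityViolation(violations)  # one exception lists everything to fix
```

Each check appends to a list instead of raising immediately, which is what allows a single failed run to report all four mechanisms.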

Report-validation (before the report is emitted)

IntegrityGuard.report_validation refuses to publish a report if:

M3 — Not every adapter-declared metric is in the report
M4 — Per-stratum breakdown is missing or contains only the "all" stratum (no aggregate-only reporting)
M5 — Raw per-case artifacts directory missing
M9 — negative_results is empty
M10 — coi_disclosure is empty
M1 — Pre-registration path not carried into the report
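
A rough sketch of these report-level checks is shown below. The field names `negative_results`, `coi_disclosure`, and `pre_registration_path` come from this page; the `metrics` and `strata` keys, the `report_problems` function, and the message wording are assumptions about a report layout that is not documented here.

```python
from pathlib import Path


def report_problems(report: dict, declared_metrics: set[str], cases_dir: Path) -> list[str]:
    """Return the reasons a report should not be published; empty means it may be."""
    problems: list[str] = []
    missing = declared_metrics - set(report.get("metrics", {}))
    if missing:
        problems.append(f"M3: declared metrics missing from report: {sorted(missing)}")
    # M4: at least one stratum besides the aggregate "all" must be present.
    if not set(report.get("strata", {})) - {"all"}:
        problems.append("M4: per-stratum breakdown missing or aggregate-only")
    if not cases_dir.is_dir():
        problems.append("M5: raw per-case artifacts directory missing")
    if not report.get("negative_results"):
        problems.append("M9: negative_results is empty")
    if not report.get("coi_disclosure"):
        problems.append("M10: coi_disclosure is empty")
    if not report.get("pre_registration_path"):
        problems.append("M1: pre-registration path not carried into the report")
    return problems
```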

Two more mechanisms are operational, not code-enforced

M8 — External replication of ≥1 cell by a third party before public claim
M11 — Blinded LLM-as-judge calibration (BDIL Phase B; tracked separately)

Cost tracking

The framework registers a usage hook on the LLMClient, OpenAILLMClient, and BedrockLLMClient classes in core/llm/llm_client.py. Every successful LLM call feeds (model, tokens_in, tokens_out) into a CostTracker. The tracker enforces the configured cost_budget_usd as a hard cap: the first call that would push the total over budget raises CostBudgetExceeded, and the runner halts cleanly with a partial-completion report. Per-cell tokens_in, tokens_out, and cost_usd are currently reported as 0; per-cell delta capture is a follow-up. The aggregate run cost in report.json is accurate.
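
The hard-cap behaviour can be sketched as a small accumulator. Everything below is illustrative: the classes are stand-ins that reuse the names from the paragraph above, the `record` method and the per-million-token price table are assumptions, and the prices are invented numbers, not real provider pricing.

```python
class CostBudgetExceeded(RuntimeError):
    """Stand-in for the framework's budget exception."""


# Hypothetical USD prices per million tokens as (input, output).
PRICES = {"model-a": (1.0, 2.0)}


class CostTracker:
    def __init__(self, budget_usd: float, prices: dict[str, tuple[float, float]]) -> None:
        self.budget_usd = budget_usd
        self.prices = prices
        self.total_usd = 0.0

    def record(self, model: str, tokens_in: int, tokens_out: int) -> float:
        """Add one completed call's cost to the total; raise once the cap is crossed."""
        price_in, price_out = self.prices[model]
        cost = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
        # The call already happened, so its cost is counted before the cap check.
        self.total_usd += cost
        if self.total_usd > self.budget_usd:
            raise CostBudgetExceeded(
                f"spent {self.total_usd:.4f} USD, budget {self.budget_usd:.4f} USD"
            )
        return cost
```

Counting the cost before raising is a design assumption of this sketch: it keeps the aggregate total truthful even for the call that trips the cap.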

Metrics

Paper’s 13 deterministic metrics plus 3 framework-added validity metrics:

Family	Metric	Source
Outcome	`a1, a3, tcr, exact, in_order, any_order`	Paper § 4.2.1
Process — alignment	`rel, cov`	Paper § 4.2.2
Process — efficiency	`steps, mtti`	Paper § 4.2.2
Process — robustness	`iac, rar, ztdr`	Paper § 4.2.2
Validity	`citation_grounding_rate, entity_existence_rate, kubectl_actionability_rate`	Framework (regex + universe check)

All 16 metrics are deterministic (string / set comparison) — no LLM-as-judge at evaluation time.
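
As an example of what a deterministic, comparison-only metric looks like, the sketch below computes top-k accuracy, assuming a1 and a3 mean "the true root cause appears among the first 1 or 3 ranked candidates". That reading, the `accuracy_at_k` function, and the sample labels are assumptions; the paper's exact definitions in § 4.2.1 take precedence.

```python
def accuracy_at_k(ranked_predictions: list[list[str]], truths: list[str], k: int) -> float:
    """Fraction of cases whose true root cause is among the top-k candidates."""
    if not truths:
        return 0.0
    hits = sum(
        1 for candidates, truth in zip(ranked_predictions, truths)
        if truth in candidates[:k]  # plain string membership, no model in the loop
    )
    return hits / len(truths)


# Hypothetical ranked root-cause candidates for four cases.
predictions = [
    ["redis-down", "oom-kill", "dns-failure"],
    ["oom-kill", "bad-image", "dns-failure"],
    ["dns-failure", "oom-kill", "bad-image"],
    ["bad-image", "redis-down", "node-pressure"],
]
truths = ["redis-down", "bad-image", "quota-exceeded", "node-pressure"]

a1 = accuracy_at_k(predictions, truths, k=1)  # 0.25
a3 = accuracy_at_k(predictions, truths, k=3)  # 0.75
```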

Existing production entry points

make test-cloudopsbench and opensre tests cloudopsbench route through tests/benchmarks/cloudopsbench/run_suite.py, which is the legacy imperative-CLI surface. The framework runner is the new YAML-config surface and coexists with it during the transition. Both call into the same adapter, scoring code, and replay backend.

Reference

Paper: Wang et al., Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems, arXiv:2603.00468v1, 28 Feb 2026 (GitHub)
HF dataset: tracer-cloud/cloud-ops-bench-dataset
Framework source: tests/benchmarks/_framework/
Adapter source: tests/benchmarks/cloudopsbench/
