Post–week-1 instruments (M2–M5)

Landing page for CLAUDE_MEASURE.md §3 metrics M2–M5. Metric definitions stay in CLAUDE_MEASURE.md at the repo root; this page lists how to run rollups from the trace and agent-project data under DATA_DIR. M1 (tool diversity) is covered in docs/hermes/post_cutover_week1_2026-04-23.md and the scripts below.

M1 (related) — tool diversity & week-1 gate

M2 — Homepage content quality

Composite score derived from homepage diffs (novel_words_ratio, section drift, link density). Planned instrument: backend/app/scripts/measure_homepage_quality.py (per CLAUDE_MEASURE; to be added when T3 ships). Data: agent_project_revisions/ under DATA_DIR.
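
Until the planned script exists, one component of the composite can be sketched as follows. This is only an illustration: the authoritative definition of novel_words_ratio lives in CLAUDE_MEASURE.md, and the version here assumes it means the share of distinct words in a new revision that were absent from the previous one. The function signature and the tokenization rule are hypothetical.

```python
import re


def novel_words_ratio(previous: str, current: str) -> float:
    """Share of distinct words in `current` that do not appear in `previous`.

    ASSUMPTION: this definition is a guess for illustration only;
    check CLAUDE_MEASURE.md for the real one.
    """
    def tokenize(text: str) -> set:
        # Lowercase and keep alphanumeric runs (plus apostrophes) as words.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    old_words = tokenize(previous)
    new_words = tokenize(current)
    if not new_words:
        # An empty revision has no words, so nothing can be novel.
        return 0.0
    return len(new_words - old_words) / len(new_words)
```

Comparing on sets of distinct words keeps the ratio insensitive to repetition; a token-count variant would weight repeated words more heavily.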

M3 — Self-judgment calibration

Correlation between judge_self and the external score. Instrument: judge_self_calibration.jsonl (T4.12), rolled up as a weekly Pearson correlation that is computed only when at least 10 score pairs are available (N ≥ 10).
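
The weekly rollup with its N ≥ 10 gate can be sketched as below. This is a minimal sketch, not the shipped instrument: the function name and the choice to return None for an under-sampled or zero-variance week are assumptions, and loading pairs from judge_self_calibration.jsonl is left out because its field names are not documented on this page.

```python
import math
from typing import Optional, Sequence

MIN_PAIRS = 10  # the N >= 10 gate stated above


def weekly_pearson(
    self_scores: Sequence[float], external_scores: Sequence[float]
) -> Optional[float]:
    """Pearson r for one week of (judge_self, external) pairs.

    Returns None when there are fewer than MIN_PAIRS pairs or when either
    series is constant (r is undefined in that case).
    """
    n = len(self_scores)
    if n != len(external_scores):
        raise ValueError("score lists must have the same length")
    if n < MIN_PAIRS:
        return None

    mean_x = sum(self_scores) / n
    mean_y = sum(external_scores) / n
    cov = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(self_scores, external_scores)
    )
    var_x = sum((x - mean_x) ** 2 for x in self_scores)
    var_y = sum((y - mean_y) ** 2 for y in external_scores)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)
```

Returning None rather than 0.0 for a skipped week keeps "no signal" distinguishable from "no correlation" in downstream reports.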

M4 — Budget meter behavioral effect

Join agent_budget.jsonl with trace_events.jsonl by agent/timestamp; compare tool-cost distribution in low- vs high-budget windows.
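
One way to do the agent/timestamp join is an as-of lookup: each trace event is matched to the most recent budget reading for the same agent. The sketch below assumes rows already parsed from the two JSONL files into dicts; the field names (agent_id, ts, remaining, cost) and the low_threshold cutoff are hypothetical and should be replaced with whatever the real files use.

```python
from bisect import bisect_right
from collections import defaultdict


def split_costs_by_budget(budget_rows, trace_rows, low_threshold):
    """Bucket tool costs into low- and high-budget windows.

    ASSUMPTION: budget rows look like {"agent_id", "ts", "remaining"} and
    trace rows like {"agent_id", "ts", "cost"}, with comparable numeric `ts`.
    """
    # Per-agent budget readings, sorted by timestamp for binary search.
    readings_by_agent = defaultdict(list)
    for row in budget_rows:
        readings_by_agent[row["agent_id"]].append((row["ts"], row["remaining"]))
    for readings in readings_by_agent.values():
        readings.sort()

    costs = {"low": [], "high": []}
    for event in trace_rows:
        readings = readings_by_agent.get(event["agent_id"], [])
        # Index of the latest reading at or before the event timestamp.
        idx = bisect_right(readings, (event["ts"], float("inf"))) - 1
        if idx < 0:
            continue  # no budget reading precedes this event; skip it
        remaining = readings[idx][1]
        window = "low" if remaining < low_threshold else "high"
        costs[window].append(event["cost"])
    return costs
```

The two returned lists can then be compared with any distribution summary, for example medians or quantiles per window.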

M5 — Cost per approved deliverable

Planned script: backend/app/scripts/measure_cost_per_approval.py (per CLAUDE_MEASURE). Roll up weekly medians; attribute LLM/GPU/ai$ costs across job lifetimes.
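
The weekly-median rollup might look like the sketch below. It assumes costs have already been attributed to each approved deliverable, which is the harder part and is not shown; the input shape (approval date, total cost) and the function name are hypothetical, and grouping by ISO week is one reasonable reading of "weekly".

```python
from collections import defaultdict
from datetime import date
from statistics import median


def weekly_median_cost(approvals):
    """Median cost per approved deliverable, keyed by (ISO year, ISO week).

    ASSUMPTION: `approvals` is an iterable of (approved_on: date,
    total_cost: float) with costs already summed over the job lifetime.
    """
    costs_by_week = defaultdict(list)
    for approved_on, total_cost in approvals:
        iso = approved_on.isocalendar()
        # Index access works for both the tuple and named-tuple return types.
        costs_by_week[(iso[0], iso[1])].append(total_cost)
    return {
        week: median(costs) for week, costs in sorted(costs_by_week.items())
    }
```

Keying on the ISO year as well as the week number avoids merging week 1 of one year with week 1 of the next.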

Hermes retrospective / trace tooling

Live numbers are not rendered here yet — run scripts against prod DATA_DIR on the host or pull traces locally.