Incident Response - Cowboy Protocol

This runbook is symptom-driven: start from what you observe on the health surfaces, follow the decision tree, and restart safely. It assumes a production deployment managed by systemd / an orchestrator — not the local devnet scripts (see the restart warning).

Health surfaces

| Endpoint | Returns | Use |
| --- | --- | --- |
| `GET /health` | `{"status":"ok"}` | Load-balancer liveness probe (no state read) |
| `GET /health/detailed` (alias `/health/ready`) | `status`, `version`, `uptime_seconds`, `block_height`, `last_block_timestamp`, `mempool_transaction_count`, `mempool_account_count`, per-component status (`storage` / `mempool` / `rpc`) | Operator triage |
| `GET /metrics` | Prometheus text | Dashboards + alerting |

Key fields on /health/detailed:

block_height — latest finalized height. Compare against a healthy peer to detect lag.
last_block_timestamp — Unix time of the latest finalized block. A growing gap between this and now means the node is stalled (not finalizing).
mempool_transaction_count — the top-level status flips to "degraded" (and the mempool component to "degraded") at ≥ 30,000 pending transactions.
synced — currently hard-coded true; it is not a real sync flag. Judge sync state from block_height vs a peer and the last_block_timestamp gap, not from this field.
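
The two derived signals above (block age and height lag) can be sketched as small shell helpers. This is a minimal sketch, not part of the node: the function names are hypothetical, and it assumes last_block_timestamp is in Unix seconds.

```shell
# Hypothetical triage helpers; names are illustrative, not shipped with the node.

# block_age NOW LAST_BLOCK_TIMESTAMP
# Prints the seconds elapsed since the latest finalized block.
# Assumes both values are Unix timestamps in seconds.
block_age() {
  echo $(( $1 - $2 ))
}

# height_lag PEER_HEIGHT LOCAL_HEIGHT
# Prints how many blocks the local node is behind the peer
# (0 if it is level with or ahead of the peer).
height_lag() {
  lag=$(( $1 - $2 ))
  if [ "$lag" -lt 0 ]; then
    lag=0
  fi
  echo "$lag"
}

# Example wiring (not executed here; assumes jq is installed and the
# <rpc_port> placeholder is filled in):
#   ts=$(curl -s "http://localhost:<rpc_port>/health/detailed" | jq '.last_block_timestamp')
#   block_age "$(date +%s)" "$ts"
```

A block age that keeps growing across successive samples points to a stall; a large height lag with a small block age points to a node that is behind but still progressing.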

Decision tree

Symptom: height not advancing (stalled)

last_block_timestamp is minutes old and block_height is flat.

Check GET /health/detailed on the other validators. If they are advancing, this node is partitioned or behind — go to lagging height.
If no validator is advancing, consensus has stalled network-wide (insufficient online stake to certify). Recover quorum: bring offline validators back; the protocol resumes automatically once enough are live (view-change retries every NULLIFY_RETRY = 10 s).
Check /metrics and logs for the failing component (storage errors, panics).

Relevant compiled-in consensus timeouts (node/validator/src/main.rs, not runtime-configurable): leader timeout 1 s, certification timeout 2 s, nullify/view-change retry 10 s, peer-fetch timeout 2 s.
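
The local-versus-network distinction in this section can be sketched as a classifier over height deltas. This is an illustrative sketch only: stall_scope is a hypothetical helper, and each argument is assumed to be the change in block_height for one node between two samples taken far enough apart (well beyond the 10 s retry interval) for a healthy network to have finalized at least one block.

```shell
# stall_scope LOCAL_DELTA PEER_DELTA...
# Each argument is (block_height now - block_height at the previous sample)
# for one node; the first argument is the local node, the rest are peers.
# Prints one of:
#   advancing  the local node finalized blocks between samples
#   local      the local node is flat but at least one peer advanced
#   network    neither the local node nor any sampled peer advanced
#   unknown    no peer deltas were supplied, so scope cannot be judged
stall_scope() {
  local_delta=$1
  shift
  if [ "$local_delta" -gt 0 ]; then
    echo "advancing"
    return 0
  fi
  if [ "$#" -eq 0 ]; then
    echo "unknown"
    return 0
  fi
  for peer_delta in "$@"; do
    if [ "$peer_delta" -gt 0 ]; then
      echo "local"
      return 0
    fi
  done
  echo "network"
}
```

A result of local maps to the lagging-height branch; network maps to quorum recovery.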

Symptom: lagging height (behind the tip)

block_height is well below peers, but last_block_timestamp is still fairly recent.

The node is catching up via fast-sync; watch block_height climb. Syncing nodes get a 10× longer activity timeout, so they are not dropped mid-catch-up.
If height is not climbing, check connectivity to bootstrappers / peers and that a peer actually serves /state/snapshot (a peer with no snapshot returns 404).

Symptom: mempool growth / degraded

status: "degraded", mempool_transaction_count near or above 30,000.

Confirm blocks are still finalizing (block_height advancing). If yes, this is load, not a fault — the backlog drains as blocks include txs.
If height is flat and mempool is growing, the node is stalled — treat as height not advancing above; the backlog is a symptom, not the cause.
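
The load-versus-stall check above can be sketched as follows. classify_mempool is a hypothetical helper, not part of the node; it assumes the two heights were sampled far enough apart for a healthy node to have finalized at least one block, and it uses the 30,000 threshold stated in this runbook.

```shell
# classify_mempool HEIGHT_BEFORE HEIGHT_AFTER MEMPOOL_COUNT
# Prints one of:
#   stalled  height did not advance between samples (backlog is a symptom)
#   load     height advanced and the mempool is at or above the threshold
#   ok       height advanced and the mempool is below the threshold
classify_mempool() {
  degraded_threshold=30000   # from this runbook: degraded at >= 30,000 pending txs
  if [ "$2" -le "$1" ]; then
    echo "stalled"
  elif [ "$3" -ge "$degraded_threshold" ]; then
    echo "load"
  else
    echo "ok"
  fi
}
```

For example, `classify_mempool 100 105 31000` reports load, while `classify_mempool 100 100 31000` reports stalled.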

Symptom: component unhealthy

/health/detailed components.storage (or rpc) reports non-healthy, or /health itself fails.

Pull logs and /metrics. Storage faults (disk full, QMDB I/O) usually need the node stopped and the disk / state directory inspected.
If state is corrupt, restore from backup or re-fast-sync — see Snapshots & Restore.

Restart safely

Do not run scripts/restart_validator.sh in production. It is a local-devnet helper that deletes state: before restarting, it runs rm -rf on test/*.db, test/*.log, and test/storage. On a real node, that wipes the validator’s database.

In production, restart through your service manager, which stops and starts the binary without touching the state directory:

```shell
sudo systemctl restart cowboy-validator
# Verify it came back and is finalizing (replace <rpc_port> with the node's RPC port):
curl -s "http://localhost:<rpc_port>/health/detailed" | jq '.block_height, .last_block_timestamp'
```

A clean restart resumes from the persisted height and rejoins consensus; if the node fell behind while down, it fast-syncs the gap. Only delete the state directory when you intend a full resync from genesis or from a peer, never as a routine restart step.
