05 – Operations and monitoring [SPEC]
Metrics, alerting, Styx health, scaling, HA, and disaster recovery
Reader orientation: This document specifies operational concerns for Bardo Compute: high availability, monitoring, alerting, and disaster recovery. It belongs to the Compute layer of Bardo (the Rust runtime for Golems, mortal autonomous DeFi agents). One key concept before diving in: Bardo Compute has asymmetric failure costs. A zombie VM leaking past TTL expiry costs real money every minute, so every HA mechanism is designed to fail-safe by destroying machines rather than silently accumulating cost. Terms like Golem, Styx, x402, and Thanatopsis are defined inline on first use; a full glossary lives in 00-overview.md § Terminology.
Why HA is non-negotiable
Bardo Compute has asymmetric failure costs. A leaked zombie VM costs real money every minute it runs; a missed TTL expiry is not “eventually consistent” – it is a billing leak. Every HA mechanism below is designed with zombie prevention as the top priority. The system must fail-safe: if anything goes wrong, machines get destroyed rather than silently accumulating cost.
Multi-region, multi-instance deployment
| Fly app | Instances | Regions | Scaling |
|---|---|---|---|
| bardo-control | Min 2 | ams, ord | Autoscale to 10 (CPU-based) |
| bardo-machines | N/A | Per user request | Per deployment |
All control plane logic runs in at least 2 regions. If one region fails entirely, the other continues serving.
Zero-downtime rolling deploys
```toml
# bardo-control (main process)
kill_signal = "SIGTERM"
kill_timeout = 30

[processes]
api = ""
ssh = "--ssh-mode"

[processes.ssh]
kill_signal = "SIGTERM"
kill_timeout = 300
```
SSH handler runs in a separate process group. Its 5-minute drain window does not block API rolling deploys.
TTL worker HA
Turso-based leader election
```sql
UPDATE system_locks
SET holder = ?1, acquired_at = unixepoch()
WHERE name = 'ttl_worker'
  AND (holder IS NULL OR acquired_at < unixepoch() - 60)
RETURNING holder;
```
- Lock TTL: 60 seconds (implicit, via the `acquired_at` check)
- Refresh: every 30 seconds by the current holder
- Consistency: Turso's single-writer model guarantees the compare-and-swap is atomic
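The WHERE clause above encodes the takeover rule. A minimal sketch of that predicate as a pure function (the `LockRow` struct and function names are illustrative, not from the codebase):

```rust
/// Row shape mirroring the `system_locks` table (illustrative).
struct LockRow {
    holder: Option<String>,
    acquired_at: u64, // unix seconds of last acquire/refresh
}

const LOCK_TTL_SECS: u64 = 60;

/// Mirrors the CAS predicate: the lock is free, or the previous
/// holder failed to refresh within the 60-second TTL.
fn can_acquire(lock: &LockRow, now: u64) -> bool {
    lock.holder.is_none() || lock.acquired_at < now.saturating_sub(LOCK_TTL_SECS)
}

fn main() {
    let stale = LockRow { holder: Some("inst-a".to_string()), acquired_at: 100 };
    assert!(!can_acquire(&stale, 160)); // still within the 60s TTL
    assert!(can_acquire(&stale, 161));  // TTL lapsed: another instance may take over
    println!("ok");
}
```

Because the holder refreshes every 30 seconds, a healthy leader always renews well inside the 60-second window; a crashed leader loses the lock within one TTL.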
TTL worker cycle
Every 30 seconds, the elected leader queries for expired machines and initiates destruction. On destruction, the Golem receives SIGTERM and begins Thanatopsis (four-phase death protocol). See 02-provisioning.md for the graceful shutdown sequence.
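The per-cycle selection can be sketched as a pure filter, assuming machines carry a unix-seconds `expires_at` (struct and field names are illustrative):

```rust
/// Illustrative machine record; the real rows live in Turso.
#[derive(Debug)]
struct Machine {
    id: u64,
    ready: bool,
    expires_at: u64, // unix seconds
}

/// Returns ids of machines whose TTL has lapsed; the leader then
/// issues destroy calls, triggering SIGTERM and Thanatopsis.
fn expired_ids(machines: &[Machine], now: u64) -> Vec<u64> {
    machines
        .iter()
        .filter(|m| m.ready && m.expires_at <= now)
        .map(|m| m.id)
        .collect()
}

fn main() {
    let fleet = vec![
        Machine { id: 1, ready: true, expires_at: 100 },
        Machine { id: 2, ready: true, expires_at: 300 },
        Machine { id: 3, ready: false, expires_at: 50 }, // already draining
    ];
    assert_eq!(expired_ids(&fleet, 200), vec![1]);
    println!("ok");
}
```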
Failure scenarios
| Scenario | Recovery time | Mechanism |
|---|---|---|
| Holding instance crashes | <=60s (lock expiry) | Another instance acquires via CAS |
| Turso primary failover | <=10s (automatic) | Embedded replicas serve reads |
| All control instances down | <=60s | Machine-local cron queries Turso |
Styx health monitoring
Each Golem maintains an outbound WebSocket to Styx. The health endpoint reports Styx connection status:
```rust
#[derive(Debug, Serialize)]
pub struct GolemHealthResponse {
    pub status: String,              // "ready", "booting", "draining", "crashed"
    pub uptime_seconds: u64,
    pub vitality: f64,               // composite vitality from three death clocks
    pub styx_connected: bool,
    pub styx_latency_ms: Option<u64>,
    pub styx_last_sync: Option<u64>, // epoch ms of last successful sync
    pub tool_count: u32,
    pub ttl_remaining_seconds: u64,
    pub behavioral_phase: String,    // "thriving", "stable", "conservation", etc.
}
```
Styx degradation handling
The Golem’s bardo-styx extension handles all Styx degradation internally:
| Styx status | Golem behavior | TUI indicator |
|---|---|---|
| Fully online | Full ecology: backup, clade retrieval, lethe, pheromone field, bloodstain | Green |
| Degraded | Writes queue locally (up to 1,000 entries). Queries return cached results. | Amber |
| Offline | Local Grimoire is authoritative. Peer-to-peer clade sync via direct gossip. | Red |
| Never enabled | Identical to offline. Local Grimoire, file-based death bundles. | Gray |
A circuit breaker prevents cascading failures: three consecutive failures open it, a half-open probe fires after 60 seconds, and repeated failures back off exponentially. The control plane does NOT monitor or manage the Styx connection – it is entirely the Golem's responsibility.
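A minimal sketch of that breaker policy, covering the 3-failure threshold and the 60-second half-open probe (exponential backoff between probes omitted; types and names are illustrative, not the bardo-styx code):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BreakerState {
    Closed,
    Open { since: u64 }, // unix seconds when the breaker opened
    HalfOpen,
}

struct Breaker {
    state: BreakerState,
    failures: u32,
}

const FAILURE_THRESHOLD: u32 = 3;
const COOLDOWN_SECS: u64 = 60;

impl Breaker {
    fn new() -> Self {
        Breaker { state: BreakerState::Closed, failures: 0 }
    }

    /// May a request go out to Styx right now?
    fn allow(&mut self, now: u64) -> bool {
        match self.state {
            BreakerState::Closed | BreakerState::HalfOpen => true,
            BreakerState::Open { since } if now >= since + COOLDOWN_SECS => {
                self.state = BreakerState::HalfOpen; // single probe allowed
                true
            }
            BreakerState::Open { .. } => false,
        }
    }

    /// Record the outcome of a request.
    fn record(&mut self, ok: bool, now: u64) {
        if ok {
            self.failures = 0;
            self.state = BreakerState::Closed;
        } else {
            self.failures += 1;
            let probing = self.state == BreakerState::HalfOpen;
            if probing || self.failures >= FAILURE_THRESHOLD {
                self.state = BreakerState::Open { since: now };
            }
        }
    }
}

fn main() {
    let mut b = Breaker::new();
    for t in 0..3 {
        b.record(false, t); // three consecutive failures
    }
    assert!(!b.allow(30)); // open: calls short-circuit, writes queue locally
    assert!(b.allow(65));  // 60s after opening: one half-open probe
    b.record(true, 65);    // probe succeeded: breaker closes
    assert!(b.allow(66));
    println!("ok");
}
```

While the breaker is open, the Golem behaves exactly as in the Degraded row above: writes queue locally and queries serve cached results.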
Outbound WebSocket model (NAT-friendly)
All Styx connections are outbound from the Golem. Zero inbound ports required. This means:
- Fly.io VMs behind 6PN private networking work without configuration
- Self-hosted Golems behind NAT/firewall work without port forwarding
- The TUI can reach remote Golems through Styx (events flow: Golem -> Styx -> TUI)
```text
TUI <--WSS-- Styx <--WSS-- Golem (behind NAT)

    Events flow: Golem -> Styx -> TUI
    Steers flow: TUI -> Styx -> Golem
```
Reconciliation job
Every 5 minutes, independent of the TTL worker:
```sql
SELECT id, machine_id FROM machines
WHERE status = 'ready' AND expires_at < unixepoch() - 120;
```
Catches machines >2 minutes past expiry. Also catches orphaned Fly machines (VMs running without corresponding DB records).
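The 120-second grace predicate the query applies can be sketched as (function name illustrative):

```rust
/// Reconciliation predicate: a machine still marked 'ready' more than
/// 120 seconds past its expiry is a zombie the TTL worker missed.
fn is_zombie(status: &str, expires_at: u64, now: u64) -> bool {
    status == "ready" && expires_at < now.saturating_sub(120)
}

fn main() {
    assert!(!is_zombie("ready", 100, 220));      // exactly 120s past: not yet flagged
    assert!(is_zombie("ready", 100, 221));       // more than 120s past: zombie
    assert!(!is_zombie("destroying", 100, 999)); // already being handled
    println!("ok");
}
```

The grace window keeps the reconciler from double-firing on machines the TTL worker is already destroying.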
Redis: optional only
Redis is not required for any correctness-critical operation. All authoritative state lives in Turso.
| Use case | Without Redis | With Redis |
|---|---|---|
| Rate limiting | In-memory token bucket (per-instance) | Distributed sliding window |
| Nonce dedup | Turso UNIQUE constraint | Same |
| TTL enforcement | Turso poll worker | Same |
| Proxy cache | In-process LRU | Same |
| Region capacity | Turso COUNT query | Same |
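The Redis-free rate-limiting fallback is a per-instance token bucket. A self-contained sketch (capacity and refill rate are illustrative, not production values):

```rust
/// Per-instance token bucket: the rate-limiting path when Redis
/// is absent. State lives in process memory only.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: u64, // unix seconds of last refill
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64, now: u64) -> Self {
        TokenBucket { capacity, tokens: capacity, refill_per_sec, last: now }
    }

    /// Refill based on elapsed time, then try to spend one token.
    fn try_acquire(&mut self, now: u64) -> bool {
        let elapsed = now.saturating_sub(self.last) as f64;
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut tb = TokenBucket::new(2.0, 1.0, 0); // burst of 2, 1 req/sec sustained
    assert!(tb.try_acquire(0));
    assert!(tb.try_acquire(0));
    assert!(!tb.try_acquire(0)); // bucket drained
    assert!(tb.try_acquire(1));  // refilled after one second
    println!("ok");
}
```

Because the bucket is per-instance, the effective global limit scales with instance count; that looseness is acceptable precisely because rate limiting is not correctness-critical.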
Database HA
Turso
- Automatic primary failover: Transparent
- Embedded replicas: Zero-latency reads from the same process
- Write routing: Automatically routes to primary
- Read-after-write: call `sync()` after billing-critical writes
Turso backup
Nightly export of billing_events to Cloudflare R2:
- Schedule: `0 2 * * *` (2 AM UTC daily)
- Target: `r2://bardo-backups/billing/{date}.jsonl.gz`
- Retention: 90 days
- RPO: ~24 hours
- RTO: <10 minutes (import from R2)
Grimoire snapshot storage
Cloudflare R2 stores Grimoire snapshots. Latest 5 snapshots retained per machine. ~50MB typical per machine.
Snapshot types: periodic (every 6 hours), shutdown (Thanatopsis Phase II), manual (user-triggered).
Grimoire also streams continuously to Styx Archive layer. R2 snapshots are the backup-of-last-resort.
Disk monitoring
Health endpoint includes disk usage. Alert threshold: >85%. Remediation: >90% triggers log truncation; >95% pauses Golem process.
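The remediation ladder can be sketched as a simple mapping (the `DiskAction` enum is illustrative, not from the codebase):

```rust
/// Remediation ladder from the spec: >85% alert, >90% truncate
/// logs, >95% pause the Golem process. (Names illustrative.)
#[derive(Debug, PartialEq)]
enum DiskAction {
    None,
    Alert,
    TruncateLogs,
    PauseGolem,
}

fn disk_action(usage_percent: u8) -> DiskAction {
    match usage_percent {
        p if p > 95 => DiskAction::PauseGolem,
        p if p > 90 => DiskAction::TruncateLogs,
        p if p > 85 => DiskAction::Alert,
        _ => DiskAction::None,
    }
}

fn main() {
    assert_eq!(disk_action(85), DiskAction::None);
    assert_eq!(disk_action(86), DiskAction::Alert);
    assert_eq!(disk_action(91), DiskAction::TruncateLogs);
    assert_eq!(disk_action(96), DiskAction::PauseGolem);
    println!("ok");
}
```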
Golem-level health
Extended health with agent-specific metrics:
```rust
#[derive(Debug, Serialize)]
pub struct ExtendedGolemHealth {
    pub status: String,
    pub uptime_seconds: u64,
    pub disk_usage_percent: u8,
    pub agent: AgentHealth,
}

#[derive(Debug, Serialize)]
pub struct AgentHealth {
    pub heartbeat_count: u64,
    pub last_heartbeat_age_seconds: u64,
    pub tool_calls_total: u64,
    pub tool_calls_last_hour: u32,
    pub inference_tokens_used: u64,
    pub inference_tokens_allowance: u64,
    pub model: String,
    pub strategy_type: String,
    pub crash_count: u32,
    pub vitality: f64,
    pub behavioral_phase: String,
    pub sustainability_ratio: Option<f64>,
    pub styx_connected: bool,
}
```
Control plane alerting thresholds
| Metric | Threshold | Severity | Action |
|---|---|---|---|
| last_heartbeat_age > 300 | 5 min without heartbeat | Warning | Check process |
| crash_count > 0 | Any supervisor restart | Warning | Investigate |
| crash_count >= 5 | Max restarts reached | Critical | Machine will self-terminate |
| inference_tokens_used > allowance * 0.9 | 90% of token budget | Warning | Notify user |
| disk_usage_percent > 85 | Disk filling | Warning | Trigger log cleanup |
| vitality < 0.3 | Golem entering Declining phase | Info | Notify owner |
| styx_connected == false for 10 min | Styx connection lost | Warning | Check network |
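A sketch of evaluating a subset of these thresholds against raw health numbers (alert names, severities, and the function shape are illustrative; the real pipeline consumes the health structs above):

```rust
#[derive(Debug, PartialEq)]
enum Severity {
    Info,
    Warning,
    Critical,
}

/// Checks heartbeat age, crash count, and vitality against the
/// thresholds in the table; returns (alert name, severity) pairs.
fn evaluate(
    last_heartbeat_age: u64,
    crash_count: u32,
    vitality: f64,
) -> Vec<(&'static str, Severity)> {
    let mut alerts = Vec::new();
    if last_heartbeat_age > 300 {
        alerts.push(("heartbeat_stale", Severity::Warning));
    }
    if crash_count >= 5 {
        alerts.push(("max_restarts", Severity::Critical));
    } else if crash_count > 0 {
        alerts.push(("supervisor_restart", Severity::Warning));
    }
    if vitality < 0.3 {
        alerts.push(("declining_phase", Severity::Info));
    }
    alerts
}

fn main() {
    assert!(evaluate(10, 0, 0.9).is_empty());
    let fired = evaluate(301, 5, 0.2);
    assert_eq!(fired.len(), 3);
    assert!(fired.contains(&("max_restarts", Severity::Critical)));
    println!("ok");
}
```

Note the crash thresholds are mutually exclusive: once the supervisor hits five restarts, only the Critical alert fires.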
OpenTelemetry metrics
14 metrics exported via OpenTelemetry protocol:
| Metric | Type | Labels | Alert threshold |
|---|---|---|---|
| bardo_machines_active | Gauge | region, vm_size | >100/region |
| bardo_provision_duration_seconds | Histogram | region, vm_size, outcome | p99 > 30s |
| bardo_provision_total | Counter | region, vm_size, outcome | failure_rate > 10% |
| bardo_extension_total | Counter | payer_type | – |
| bardo_ttl_worker_runs_total | Counter | outcome | stuck > 90s |
| bardo_ttl_worker_duration_seconds | Histogram | – | p99 > 25s |
| bardo_revenue_micro_usdc_total | Counter | type | daily < 1,000,000 |
| bardo_zombie_machines | Gauge | – | >0 for 5 min |
| bardo_reconciliation_corrections | Counter | type | >0 |
| bardo_turso_latency_seconds | Histogram | operation | p99 > 50ms |
| bardo_proxy_request_duration_seconds | Histogram | subdomain, status | p99 > 5s |
| bardo_warm_pool_size | Gauge | region | <2 |
| bardo_styx_connections_active | Gauge | region | – |
| bardo_golem_vitality | Gauge | machine_name, phase | <0.1 (terminal) |
Alerting
| Alert | Condition | Severity | Action |
|---|---|---|---|
| Zombie machines | bardo_zombie_machines > 0 for 5 min | Critical | Investigate TTL worker |
| TTL worker stuck | Same holder >90s without refresh | High | Check instance health |
| Cost spike | Revenue today >1.5x yesterday | Warning | Verify growth is organic |
| Provision failure rate | >10% in 15 min window | High | Check Fly Machines API |
| Turso latency | p99 >50ms for 5 min | Warning | Check Turso dashboard |
| Region approaching cap | >80 machines in any region | Warning | Consider adding regions |
| Control instances < 2 | Instance count drops | Critical | Investigate autoscaler |
| Warm pool depleted | Pool size <2 in any region | Warning | Check pool manager |
| Golem heartbeat stale | Any last_heartbeat_age > 300 | Warning | Check process |
| Disk usage high | Any disk_usage_percent > 85 | Warning | Trigger log cleanup |
Alert routing: Critical -> PagerDuty. High/Warning -> Slack #bardo-alerts.
Cost protection
| Control | Mechanism | Limit |
|---|---|---|
| Per-region cap | Turso COUNT query | 100 machines/region |
| Global cap | Turso COUNT query | 500 machines |
| Min payment | x402 must cover >=1 hour | 25,000 micro-USDC |
| Min extension | 1 hour at machine’s rate | See 03-billing.md |
| Max single extension | TTL addition capped | 720 hours |
| Zombie detection | TTL worker + reconciliation + cron | <90 sec max |
| Per-user limit | Turso count check | 5 active machines |
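The three capacity caps combine into one admission check run before every provision. In this sketch the counts are passed in as arguments; the real check issues Turso COUNT queries (function and error strings are illustrative):

```rust
/// Admission check against the caps in the table: 100 machines per
/// region, 500 global, 5 active machines per user.
fn may_provision(
    region_count: u32,
    global_count: u32,
    user_count: u32,
) -> Result<(), &'static str> {
    if region_count >= 100 {
        return Err("per-region cap (100) reached");
    }
    if global_count >= 500 {
        return Err("global cap (500) reached");
    }
    if user_count >= 5 {
        return Err("per-user limit (5) reached");
    }
    Ok(())
}

fn main() {
    assert!(may_provision(40, 200, 2).is_ok());
    assert!(may_provision(100, 200, 2).is_err()); // region full
    assert!(may_provision(40, 500, 2).is_err());  // fleet full
    assert!(may_provision(40, 200, 5).is_err());  // user at limit
    println!("ok");
}
```

Failing closed on any cap keeps a runaway caller from provisioning past the cost ceiling even if other protections lag.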
Structured logging
All Golem binary output is structured JSON to stdout. The Fly.io log aggregator collects it.
```json
{
  "level": "info",
  "timestamp": 1709901234567,
  "trace_id": "550e8400-e29b-41d4-a716-446655440000",
  "golem_id": "golem-V1StGXR8_Z5j",
  "msg": "HeartbeatComplete",
  "tick": 847,
  "duration_ms": 234,
  "tier": "T0",
  "vitality": 0.72,
  "phase": "stable"
}
```
Control plane uses the same structured JSON format with trace_id propagation across HTTP headers, Turso records, and Fly Machines API calls.
Distributed tracing
Every request gets a trace_id (UUID v4) assigned at the edge. Propagates through all layers. The Golem’s Event Fabric includes trace_id on events that originate from external requests (steers, strategy updates).
Failure mode matrix
| Failure | Impact | Recovery | Mitigation |
|---|---|---|---|
| Single control instance crash | No impact | <30s (Fly auto-restart) | Min 2 instances |
| All control instances down | No new provisions | <60s (autoscale) | Machine-local cron |
| Fly Machines API down | No provisions/destructions | Variable | Queue destruction for retry |
| Turso primary down | Read-only mode | <10s (auto-failover) | Embedded replicas |
| Golem VM crash | Individual Golem down | Supervisor restarts (up to 5) | Grimoire snapshot preserved |
| Network partition (6PN) | Proxy unreachable | Variable | Machine-local cron |
| R2 outage | No Grimoire snapshots | Variable | Snapshots queued locally |
| Styx outage | No clade sync, no lethe | Golem continues at ~95% | Local Grimoire authoritative |
Scalability
At v1 scale (<500 machines, <100 writes/min), Turso’s single-writer model is not a bottleneck. Options at scale: shard by region, write batching, or migrate to multi-writer DB.
The Golem binary’s resource footprint is small (~20MB RSS at idle, ~50MB under load). A micro VM (256MB) runs one Golem comfortably. The constraint is inference cost, not compute.