05 – Operations and monitoring [SPEC]

Metrics, alerting, Styx health, scaling, HA, and disaster recovery

Reader orientation: This document specifies operational concerns for Bardo Compute: high availability, monitoring, alerting, and disaster recovery. It belongs to the Compute layer of Bardo (the Rust runtime for Golems, mortal autonomous DeFi agents). The key concept before diving in: Bardo Compute has asymmetric failure costs where a zombie VM leaking past TTL expiry costs real money every minute, so every HA mechanism is designed to fail-safe by destroying machines rather than silently accumulating cost. Terms like Golem, Styx, x402, and Thanatopsis are defined inline on first use; a full glossary lives in 00-overview.md § Terminology.


Why HA is non-negotiable

Bardo Compute has asymmetric failure costs. A leaked zombie VM costs real money every minute it runs; a missed TTL expiry is not “eventually consistent” – it is a billing leak. Every HA mechanism below is designed with zombie prevention as the top priority. The system must fail-safe: if anything goes wrong, machines get destroyed rather than silently accumulating cost.


Multi-region, multi-instance deployment

| Fly app | Instances | Regions | Scaling |
|---|---|---|---|
| bardo-control | Min 2 | ams, ord | Autoscale to 10 (CPU-based) |
| bardo-machines | N/A | Per user request | Per deployment |

All control plane logic runs in at least 2 regions. If one region fails entirely, the other continues serving.


Zero-downtime rolling deploys

# bardo-control (main process)
kill_signal = "SIGTERM"
kill_timeout = 30

[processes]
  api = ""
  ssh = "--ssh-mode"

[processes.ssh]
  kill_signal = "SIGTERM"
  kill_timeout = 300

SSH handler runs in a separate process group. Its 5-minute drain window does not block API rolling deploys.


TTL worker HA

Turso-based leader election

UPDATE system_locks
SET holder = ?1, acquired_at = unixepoch()
WHERE name = 'ttl_worker'
  AND (holder IS NULL OR acquired_at < unixepoch() - 60)
RETURNING holder;

Lock TTL: 60 seconds (implicit). Refresh: Every 30s. Consistency: Turso single-writer guarantees CAS atomicity.
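The acquire-or-steal semantics of that query can be sketched against an in-memory lock record (names here are illustrative; the real implementation issues the SQL above through Turso, which serializes the compare-and-swap):

```rust
/// Mirror of one row in `system_locks` (sketch only; authoritative
/// state lives in Turso).
#[derive(Debug, Default)]
struct LockRow {
    holder: Option<String>,
    acquired_at: u64, // unix seconds
}

const LOCK_TTL_SECS: u64 = 60;

/// Same predicate as the WHERE clause: the lock is free, or its
/// holder has not refreshed within the 60s TTL and may be stolen.
fn try_acquire(row: &mut LockRow, candidate: &str, now: u64) -> bool {
    if row.holder.is_none() || row.acquired_at < now.saturating_sub(LOCK_TTL_SECS) {
        row.holder = Some(candidate.to_string());
        row.acquired_at = now;
        true
    } else {
        false
    }
}
```

The 30s refresh interval gives the holder two chances to renew before the lock becomes stealable.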

TTL worker cycle

Every 30 seconds, the elected leader queries for expired machines and initiates destruction. On destruction, the Golem receives SIGTERM and begins Thanatopsis (four-phase death protocol). See 02-provisioning.md for the graceful shutdown sequence.
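The selection half of this cycle reduces to a pure filter (illustrative types; the real worker runs the equivalent query against Turso and then initiates destruction per machine id via the Fly Machines API):

```rust
/// Minimal stand-in for a row in the `machines` table (sketch).
struct Machine {
    id: &'static str,
    status: &'static str,
    expires_at: u64, // unix seconds
}

/// One selection pass of the TTL worker: ready machines whose TTL has
/// lapsed. Destruction (SIGTERM -> Thanatopsis) then follows per id.
fn expired(machines: &[Machine], now: u64) -> Vec<&'static str> {
    machines
        .iter()
        .filter(|m| m.status == "ready" && m.expires_at < now)
        .map(|m| m.id)
        .collect()
}
```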

Failure scenarios

| Scenario | Recovery time | Mechanism |
|---|---|---|
| Holding instance crashes | <=60s (lock expiry) | Another instance acquires via CAS |
| Turso primary failover | <=10s (automatic) | Embedded replicas serve reads |
| All control instances down | <=60s | Machine-local cron queries Turso |

Styx health monitoring

Each Golem maintains an outbound WebSocket to Styx. The health endpoint reports Styx connection status:

#[derive(Debug, Serialize)]
pub struct GolemHealthResponse {
    pub status: String,       // "ready", "booting", "draining", "crashed"
    pub uptime_seconds: u64,
    pub vitality: f64,        // composite vitality from three death clocks
    pub styx_connected: bool,
    pub styx_latency_ms: Option<u64>,
    pub styx_last_sync: Option<u64>,  // epoch ms of last successful sync
    pub tool_count: u32,
    pub ttl_remaining_seconds: u64,
    pub behavioral_phase: String,     // "thriving", "stable", "conservation", etc.
}

Styx degradation handling

The Golem’s bardo-styx extension handles all Styx degradation internally:

| Styx status | Golem behavior | TUI indicator |
|---|---|---|
| Fully online | Full ecology: backup, clade retrieval, lethe, pheromone field, bloodstain | Green |
| Degraded | Writes queue locally (up to 1,000 entries). Queries return cached results. | Amber |
| Offline | Local Grimoire is authoritative. Peer-to-peer clade sync via direct gossip. | Red |
| Never enabled | Identical to offline. Local Grimoire, file-based death bundles. | Gray |

A circuit breaker (3 consecutive failures open the circuit; a half-open probe after 60s, backing off exponentially on repeated failures) prevents cascading failures. The control plane does NOT monitor or manage the Styx connection – it is entirely the Golem’s responsibility.
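The breaker's state machine, under the policy stated above (3 consecutive failures open the circuit, one half-open probe after 60s), can be sketched as follows; the struct and method names are illustrative, not the actual bardo-styx API:

```rust
#[derive(Debug, PartialEq)]
enum Circuit {
    Closed,
    Open,
}

/// Minimal circuit breaker for the Styx connection (sketch).
struct Breaker {
    state: Circuit,
    failures: u32,
    opened_at: u64, // unix seconds when the circuit opened
}

const FAILURE_THRESHOLD: u32 = 3;
const PROBE_AFTER_SECS: u64 = 60;

impl Breaker {
    fn new() -> Self {
        Breaker { state: Circuit::Closed, failures: 0, opened_at: 0 }
    }

    /// May a Styx call be attempted right now?
    fn allow(&self, now: u64) -> bool {
        match self.state {
            Circuit::Closed => true,
            // Half-open: a single probe is allowed once 60s have passed.
            Circuit::Open => now >= self.opened_at + PROBE_AFTER_SECS,
        }
    }

    fn on_success(&mut self) {
        self.failures = 0;
        self.state = Circuit::Closed;
    }

    fn on_failure(&mut self, now: u64) {
        self.failures += 1;
        if self.failures >= FAILURE_THRESHOLD {
            self.state = Circuit::Open;
            self.opened_at = now;
        }
    }
}
```

While the circuit is open, writes fall through to the local queue described in the degradation table.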

Outbound WebSocket model (NAT-friendly)

All Styx connections are outbound from the Golem. Zero inbound ports required. This means:

  • Fly.io VMs behind 6PN private networking work without configuration
  • Self-hosted Golems behind NAT/firewall work without port forwarding
  • The TUI can reach remote Golems through Styx (events flow: Golem -> Styx -> TUI)
TUI <-- WSS ---- Styx <-- WSS ---- Golem (behind NAT)
         |                              |
         |  Events flow: Golem -> Styx -> TUI
         |  Steers flow: TUI -> Styx -> Golem

Reconciliation job

Every 5 minutes, independent of the TTL worker:

SELECT id, machine_id FROM machines
WHERE status = 'ready' AND expires_at < unixepoch() - 120

Catches machines >2 minutes past expiry. Also catches orphaned Fly machines (VMs running without corresponding DB records).
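Orphan detection is a set difference in both directions: VMs Fly reports as running but with no DB record get destroyed, and DB rows with no VM behind them get marked destroyed. A sketch with illustrative names (the real job compares Fly Machines API output against Turso rows):

```rust
use std::collections::HashSet;

/// Reconciliation sketch: returns (orphans to destroy, ghost rows to close).
fn reconcile(fly_running: &[&str], db_records: &[&str]) -> (Vec<String>, Vec<String>) {
    let fly: HashSet<&str> = fly_running.iter().copied().collect();
    let db: HashSet<&str> = db_records.iter().copied().collect();
    // Orphans: running on Fly with no DB record -> destroy (billing leak).
    let mut orphans: Vec<String> = fly.difference(&db).map(|s| s.to_string()).collect();
    // Ghosts: DB rows with no VM behind them -> mark destroyed.
    let mut ghosts: Vec<String> = db.difference(&fly).map(|s| s.to_string()).collect();
    orphans.sort();
    ghosts.sort();
    (orphans, ghosts)
}
```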


Redis: optional only

Redis is not required for any correctness-critical operation. All authoritative state lives in Turso.

| Use case | Without Redis | With Redis |
|---|---|---|
| Rate limiting | In-memory token bucket (per-instance) | Distributed sliding window |
| Nonce dedup | Turso UNIQUE constraint | Same |
| TTL enforcement | Turso poll worker | Same |
| Proxy cache | In-process LRU | Same |
| Region capacity | Turso COUNT query | Same |
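The Redis-free rate-limiting path is an ordinary per-instance token bucket; one possible shape (capacity and refill rate are illustrative, not the deployed values):

```rust
/// Per-instance token bucket used when Redis is absent (sketch).
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: u64, // unix seconds of last refill
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64, now: u64) -> Self {
        TokenBucket { capacity, tokens: capacity, refill_per_sec, last: now }
    }

    /// Refill lazily based on elapsed time, then spend one token if available.
    fn try_take(&mut self, now: u64) -> bool {
        let elapsed = now.saturating_sub(self.last) as f64;
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

Because the bucket is per-instance, the effective global limit scales with instance count; that is the trade accepted when Redis's distributed sliding window is unavailable.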

Database HA

Turso

  • Automatic primary failover: Transparent
  • Embedded replicas: Zero-latency reads from the same process
  • Write routing: Automatically routes to primary
  • Read-after-write: Use sync() after billing-critical writes

Turso backup

Nightly export of billing_events to Cloudflare R2:

Schedule:  0 2 * * * (2 AM UTC daily)
Target:    r2://bardo-backups/billing/{date}.jsonl.gz
Retention: 90 days
RPO:       ~24 hours
RTO:       <10 minutes (import from R2)

Grimoire snapshot storage

Cloudflare R2 stores Grimoire snapshots. Latest 5 snapshots retained per machine. ~50MB typical per machine.

Snapshot types: periodic (every 6 hours), shutdown (Thanatopsis Phase II), manual (user-triggered).

Grimoire also streams continuously to Styx Archive layer. R2 snapshots are the backup-of-last-resort.


Disk monitoring

Health endpoint includes disk usage. Alert threshold: >85%. Remediation: >90% triggers log truncation; >95% pauses Golem process.
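The escalation ladder above maps directly to a threshold function; a sketch with illustrative names:

```rust
/// Disk remediation policy from this section: alert at >85%,
/// truncate logs at >90%, pause the Golem process at >95%.
#[derive(Debug, PartialEq)]
enum DiskAction {
    None,
    Alert,
    TruncateLogs,
    PauseGolem,
}

fn disk_action(usage_percent: u8) -> DiskAction {
    match usage_percent {
        p if p > 95 => DiskAction::PauseGolem,
        p if p > 90 => DiskAction::TruncateLogs,
        p if p > 85 => DiskAction::Alert,
        _ => DiskAction::None,
    }
}
```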


Golem-level health

Extended health with agent-specific metrics:

#[derive(Debug, Serialize)]
pub struct ExtendedGolemHealth {
    pub status: String,
    pub uptime_seconds: u64,
    pub disk_usage_percent: u8,
    pub agent: AgentHealth,
}

#[derive(Debug, Serialize)]
pub struct AgentHealth {
    pub heartbeat_count: u64,
    pub last_heartbeat_age_seconds: u64,
    pub tool_calls_total: u64,
    pub tool_calls_last_hour: u32,
    pub inference_tokens_used: u64,
    pub inference_tokens_allowance: u64,
    pub model: String,
    pub strategy_type: String,
    pub crash_count: u32,
    pub vitality: f64,
    pub behavioral_phase: String,
    pub sustainability_ratio: Option<f64>,
    pub styx_connected: bool,
}

Control plane alerting thresholds

| Metric | Threshold | Severity | Action |
|---|---|---|---|
| last_heartbeat_age > 300 | 5 min without heartbeat | Warning | Check process |
| crash_count > 0 | Any supervisor restart | Warning | Investigate |
| crash_count >= 5 | Max restarts reached | Critical | Machine will self-terminate |
| inference_tokens_used > allowance * 0.9 | 90% of token budget | Warning | Notify user |
| disk_usage_percent > 85 | Disk filling | Warning | Trigger log cleanup |
| vitality < 0.3 | Golem entering Declining phase | Info | Notify owner |
| styx_connected == false for 10 min | Styx connection lost | Warning | Check network |

OpenTelemetry metrics

14 metrics are exported via the OpenTelemetry protocol:

| Metric | Type | Labels | Alert threshold |
|---|---|---|---|
| bardo_machines_active | Gauge | region, vm_size | >100/region |
| bardo_provision_duration_seconds | Histogram | region, vm_size, outcome | p99 > 30s |
| bardo_provision_total | Counter | region, vm_size, outcome | failure_rate > 10% |
| bardo_extension_total | Counter | payer_type | |
| bardo_ttl_worker_runs_total | Counter | outcome | stuck > 90s |
| bardo_ttl_worker_duration_seconds | Histogram | | p99 > 25s |
| bardo_revenue_micro_usdc_total | Counter | type | daily < 1,000,000 |
| bardo_zombie_machines | Gauge | | >0 for 5 min |
| bardo_reconciliation_corrections | Counter | type | >0 |
| bardo_turso_latency_seconds | Histogram | operation | p99 > 50ms |
| bardo_proxy_request_duration_seconds | Histogram | subdomain, status | p99 > 5s |
| bardo_warm_pool_size | Gauge | region | <2 |
| bardo_styx_connections_active | Gauge | region | |
| bardo_golem_vitality | Gauge | machine_name, phase | <0.1 (terminal) |

Alerting

| Alert | Condition | Severity | Action |
|---|---|---|---|
| Zombie machines | bardo_zombie_machines > 0 for 5 min | Critical | Investigate TTL worker |
| TTL worker stuck | Same holder >90s without refresh | High | Check instance health |
| Cost spike | Revenue today >1.5x yesterday | Warning | Verify growth is organic |
| Provision failure rate | >10% in 15 min window | High | Check Fly Machines API |
| Turso latency | p99 >50ms for 5 min | Warning | Check Turso dashboard |
| Region approaching cap | >80 machines in any region | Warning | Consider adding regions |
| Control instances < 2 | Instance count drops | Critical | Investigate autoscaler |
| Warm pool depleted | Pool size <2 in any region | Warning | Check pool manager |
| Golem heartbeat stale | Any last_heartbeat_age > 300 | Warning | Check process |
| Disk usage high | Any disk_usage_percent > 85 | Warning | Trigger log cleanup |

Alert routing: Critical -> PagerDuty. High/Warning -> Slack #bardo-alerts.


Cost protection

| Control | Mechanism | Limit |
|---|---|---|
| Per-region cap | Turso COUNT query | 100 machines/region |
| Global cap | Turso COUNT query | 500 machines |
| Min payment | x402 must cover >=1 hour | 25,000 micro-USDC |
| Min extension | 1 hour at machine’s rate | See 03-billing.md |
| Max single extension | TTL addition capped | 720 hours |
| Zombie detection | TTL worker + reconciliation + cron | <90 sec max |
| Per-user limit | Turso count check | 5 active machines |
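The payment-side controls compose into a single validation step. A sketch using the 25,000 micro-USDC floor and 720-hour ceiling from the table; the per-hour rate is whatever the machine is billed at (see 03-billing.md), and the function name is illustrative:

```rust
const MIN_PAYMENT_MICRO_USDC: u64 = 25_000;
const MAX_EXTENSION_HOURS: u64 = 720;

/// Validate an x402 extension payment and return the TTL hours it buys.
fn validate_extension(
    payment_micro_usdc: u64,
    rate_micro_usdc_per_hour: u64,
) -> Result<u64, &'static str> {
    if rate_micro_usdc_per_hour == 0 {
        return Err("invalid rate");
    }
    if payment_micro_usdc < MIN_PAYMENT_MICRO_USDC {
        return Err("below minimum payment");
    }
    let hours = payment_micro_usdc / rate_micro_usdc_per_hour;
    if hours < 1 {
        return Err("payment does not cover one hour");
    }
    // A single extension never adds more than 720 hours of TTL.
    Ok(hours.min(MAX_EXTENSION_HOURS))
}
```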

Structured logging

All Golem binary output is structured JSON to stdout. The Fly.io log aggregator collects it.

{
  "level": "info",
  "timestamp": 1709901234567,
  "trace_id": "550e8400-e29b-41d4-a716-446655440000",
  "golem_id": "golem-V1StGXR8_Z5j",
  "msg": "HeartbeatComplete",
  "tick": 847,
  "duration_ms": 234,
  "tier": "T0",
  "vitality": 0.72,
  "phase": "stable"
}

Control plane uses the same structured JSON format with trace_id propagation across HTTP headers, Turso records, and Fly Machines API calls.


Distributed tracing

Every request gets a trace_id (UUID v4) assigned at the edge. Propagates through all layers. The Golem’s Event Fabric includes trace_id on events that originate from external requests (steers, strategy updates).


Failure mode matrix

| Failure | Impact | Recovery | Mitigation |
|---|---|---|---|
| Single control instance crash | No impact | <30s (Fly auto-restart) | Min 2 instances |
| All control instances down | No new provisions | <60s (autoscale) | Machine-local cron |
| Fly Machines API down | No provisions/destructions | Variable | Queue destruction for retry |
| Turso primary down | Read-only mode | <10s (auto-failover) | Embedded replicas |
| Golem VM crash | Individual Golem down | Supervisor restarts (up to 5) | Grimoire snapshot preserved |
| Network partition (6PN) | Proxy unreachable | Variable | Machine-local cron |
| R2 outage | No Grimoire snapshots | Variable | Snapshots queued locally |
| Styx outage | No clade sync, no lethe | Golem continues at ~95% | Local Grimoire authoritative |

Scalability

At v1 scale (<500 machines, <100 writes/min), Turso’s single-writer model is not a bottleneck. Options at scale: shard by region, write batching, or migrate to multi-writer DB.

The Golem binary’s resource footprint is small (~20MB RSS at idle, ~50MB under load). A micro VM (256MB) runs one Golem comfortably. The constraint is inference cost, not compute.