
08 – Observability and analytics [SPEC]

Per-agent cost attribution, OpenTelemetry traces, Event Fabric integration, cache metrics, performance targets

Related: 07-safety.md (PII detection, prompt injection defense, and audit logging), 03-economics.md (x402 spread revenue model and per-tenant cost attribution), 09-api.md (API reference with 33 endpoints including analytics), 11-privacy-trust.md (cryptographic audit trail with Merkle anchoring and provenance records)


Reader orientation: This document specifies the observability and analytics layer of Bardo Inference (the LLM inference gateway for mortal autonomous DeFi agents called Golems). It belongs to the inference plane and covers per-agent cost attribution, OpenTelemetry trace integration, Event Fabric event streaming, and cache performance metrics. The key concept is that every inference call produces structured telemetry that enables operators to understand cost, latency, and cache efficiency per agent and per subsystem. For term definitions, see prd2/shared/glossary.md.

Per-tenant cost attribution API

Every user (API key or wallet address) can query their inference spend with full breakdowns by model, optimization type, and time period. Golems (mortal autonomous DeFi agents managed by the Bardo runtime) additionally receive per-strategy attribution.

GET /v1/analytics/spend

```rust
// crates/bardo-telemetry/src/spend.rs
use serde::Serialize;

#[derive(Debug, Clone, Serialize)]
pub struct SpendAnalytics {
    pub tenant_id: String,               // API key hash or wallet address
    pub period_start: u64,
    pub period_end: u64,
    pub total_spend_usdc: f64,           // paid by user (including spread)
    pub total_blockrun_cost_usdc: f64,   // paid to BlockRun
    pub total_spread_usdc: f64,          // earned by operator
    pub estimated_naive_cost_usdc: f64,  // without context engineering
    pub total_requests: u64,
    pub prompt_cache_hit_rate: f64,
    pub semantic_cache_hit_rate: f64,
    pub tokens_saved_by_routing: u64,
    pub tokens_saved_by_compression: u64,
    pub tokens_saved_by_tool_pruning: u64,
    pub cost_saved_total_usdc: f64,
    pub compactions_triggered: u64,
    pub handoffs_triggered: u64,
    pub savings_semantic_cache: f64,     // USDC saved by optimization type
    pub savings_prefix_cache: f64,
    pub savings_tool_pruning: f64,
    pub savings_history_compression: f64,
    pub by_model: Vec<ModelSpend>,       // model_id, provider, requests, spend/cost, purpose
}
```
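To illustrate how the per-optimization savings fields compose, here is a minimal std-only sketch; the `SavingsBreakdown` struct and helper function are hypothetical stand-ins mirroring the field names above, not part of the crate:

```rust
// Hypothetical helper: fold the four per-optimization savings fields
// into the aggregate cost_saved_total_usdc value reported to tenants.
struct SavingsBreakdown {
    savings_semantic_cache: f64,
    savings_prefix_cache: f64,
    savings_tool_pruning: f64,
    savings_history_compression: f64,
}

fn cost_saved_total_usdc(s: &SavingsBreakdown) -> f64 {
    s.savings_semantic_cache
        + s.savings_prefix_cache
        + s.savings_tool_pruning
        + s.savings_history_compression
}

fn main() {
    let s = SavingsBreakdown {
        savings_semantic_cache: 1.25,
        savings_prefix_cache: 0.80,
        savings_tool_pruning: 0.40,
        savings_history_compression: 0.55,
    };
    println!("{:.2}", cost_saved_total_usdc(&s)); // 3.00
}
```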

OTEL metrics and tracing setup

A single MetricsCollector registers every instrument from the tables below. Pipeline layers take a shared reference instead of touching global state.

```rust
// crates/bardo-telemetry/src/metrics.rs
use opentelemetry::trace::TracerProvider as _;
use opentelemetry::{global, metrics::{Counter, Gauge, Histogram, Meter}};
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::metrics::SdkMeterProvider;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

pub struct MetricsCollector {
    pub cache_hit_rate: Gauge<f64>,
    pub cache_entries: Gauge<u64>,
    pub cache_savings_usd: Counter<f64>,
    pub cache_stale_served: Counter<u64>,
    pub prefix_cache_hit_rate: Gauge<f64>,
    pub cache_invalidations: Counter<u64>,
    pub spread_earned_total: Counter<f64>,
    pub blockrun_cost_total: Counter<f64>,
    pub user_savings_total: Counter<f64>,
    pub spread_pct_effective: Gauge<f64>,
    pub fallback_requests: Counter<u64>,
    pub pipeline_duration_ms: [Histogram<f64>; 8], // one per layer
    pub pipeline_total_duration_ms: Histogram<f64>,
    pub pipeline_profile: Counter<u64>,
    pub batch_api_savings_usd: Counter<f64>,
    pub diem_credits_consumed: Counter<u64>,
}

impl MetricsCollector {
    pub fn new(m: &Meter) -> Self {
        Self {
            cache_hit_rate:        m.f64_gauge("bardo_cache_hit_rate").build(),
            cache_entries:         m.u64_gauge("bardo_cache_entries").build(),
            cache_savings_usd:     m.f64_counter("bardo_cache_savings_usd").build(),
            cache_stale_served:    m.u64_counter("bardo_cache_stale_served").build(),
            prefix_cache_hit_rate: m.f64_gauge("bardo_prefix_cache_hit_rate").build(),
            cache_invalidations:   m.u64_counter("bardo_cache_invalidations").build(),
            spread_earned_total:   m.f64_counter("bardo_spread_earned_total").build(),
            blockrun_cost_total:   m.f64_counter("bardo_blockrun_cost_total").build(),
            user_savings_total:    m.f64_counter("bardo_user_savings_total").build(),
            spread_pct_effective:  m.f64_gauge("bardo_spread_pct_effective").build(),
            fallback_requests:     m.u64_counter("bardo_fallback_requests").build(),
            pipeline_duration_ms: std::array::from_fn(|i| {
                m.f64_histogram(format!("bardo_pipeline_l{}_duration_ms", i + 1)).build()
            }),
            pipeline_total_duration_ms: m.f64_histogram("bardo_pipeline_total_duration_ms").build(),
            pipeline_profile:      m.u64_counter("bardo_pipeline_profile").build(),
            batch_api_savings_usd: m.f64_counter("bardo_batch_api_savings_usd").build(),
            diem_credits_consumed: m.u64_counter("bardo_diem_credits_consumed").build(),
        }
    }
}

/// Wire tracing_subscriber + OTEL exporters. Call once at startup.
/// (Builder APIs follow recent opentelemetry-rust releases; earlier
/// versions name these types differently.)
pub fn init_telemetry(endpoint: &str) -> SdkMeterProvider {
    let provider = SdkMeterProvider::builder()
        .with_periodic_exporter(
            opentelemetry_otlp::MetricExporter::builder()
                .with_tonic().with_endpoint(endpoint).build()
                .expect("OTLP metric exporter init failed"),
        ).build();
    global::set_meter_provider(provider.clone());

    // A span exporter is not itself a tracer: wrap it in a tracer
    // provider, then hand a named tracer to tracing-opentelemetry.
    let span_exporter = opentelemetry_otlp::SpanExporter::builder()
        .with_tonic().with_endpoint(endpoint).build()
        .expect("OTLP span exporter init failed");
    let tracer_provider = opentelemetry_sdk::trace::SdkTracerProvider::builder()
        .with_batch_exporter(span_exporter)
        .build();
    let tracer = tracer_provider.tracer("bardo-gateway");

    tracing_subscriber::registry()
        .with(EnvFilter::from_default_env())
        .with(tracing_subscriber::fmt::layer())
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .init();
    provider
}
```

Extended: OTEL trace tree, quality evaluation pipeline, arena model evaluation (Phase 2+), cost dashboard visualizations – see ../../prd2-extended/12-inference/08-observability-extended.md


Event Fabric integration

The inference gateway emits GolemEvent variants to the Event Fabric (see 01b-runtime-infrastructure.md section 3 for the tokio::broadcast ring buffer). Every inference state transition becomes a typed, serializable event consumed by surfaces (TUI, web, Telegram, Discord) without cross-referencing other events.

Inference events emitted

| Event | Trigger | Key fields |
|---|---|---|
| InferenceStart | Request enters the pipeline | model, tier, estimated_input_tokens, pipeline_profile |
| InferenceToken | Streaming chunk received | model, tier, token_count (cumulative) |
| InferenceEnd | LLM call completes | input_tokens, output_tokens, cache_read_tokens, cost_usd, latency_ms |
| CacheHit | Semantic or hash cache hit | cache_layer (L2_semantic, L3_hash), similarity, savings_usd |
| ProviderFallback | Primary provider fails | failed_provider, fallback_provider, error |

These events carry CamelCase variant names and snake_case fields in the wire format, tagged with #[serde(tag = "type")]. Each is self-contained: an InferenceEnd event includes model, tier, token counts, cost, and latency in a single payload.
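To make the tagged wire shape concrete, here is a std-only sketch that hand-rolls the JSON an InferenceEnd event serializes to; the helper function and field values are illustrative, and the real implementation derives Serialize instead:

```rust
// Hand-rolled JSON matching the tagged layout: the CamelCase variant
// name lands in a "type" field alongside the event's snake_case fields.
// (Illustrative only; the real enum uses #[serde(tag = "type")].)
fn inference_end_json(input_tokens: u64, output_tokens: u64,
                      cache_read_tokens: u64, cost_usd: f64,
                      latency_ms: u64) -> String {
    format!(
        "{{\"type\":\"InferenceEnd\",\"input_tokens\":{input_tokens},\"output_tokens\":{output_tokens},\"cache_read_tokens\":{cache_read_tokens},\"cost_usd\":{cost_usd},\"latency_ms\":{latency_ms}}}"
    )
}

fn main() {
    println!("{}", inference_end_json(1200, 350, 900, 0.0042, 840));
}
```

Because the payload is self-contained, a TUI or Telegram consumer can render the line without joining it against any earlier event.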

Event emission pattern

```rust
// crates/bardo-telemetry/src/events.rs

/// Emit an inference event to the Event Fabric.
/// Near-zero cost when no subscriber exists (subscriber count is
/// checked before any serialization work).
pub fn emit_inference_event(
    fabric: &EventFabric,
    golem_id: &str,
    tick: u64,
    event: GolemEvent,
) {
    // golem_id and tick identify the emitting agent and runtime tick;
    // the full implementation stamps them onto the event envelope.
    // Only serialize when someone is listening.
    if fabric.subscriber_count() > 0 {
        fabric.emit(event);
    }
}
```

Events feed directly into the OTEL metrics collector: every InferenceEnd event updates bardo_pipeline_total_duration_ms, bardo_cache_hit_rate, and cost counters.
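A simplified, std-only sketch of that bridge follows; plain accumulators stand in for the OTEL instruments, and the struct and method names are hypothetical:

```rust
// Plain accumulators standing in for the OTEL instruments; a real
// bridge would call .record()/.add() on MetricsCollector instead.
#[derive(Default)]
struct MetricsBridge {
    total_requests: u64,
    cache_hits: u64,     // requests with cache_read_tokens > 0
    cost_total_usd: f64, // feeds the cost counters
    latency_sum_ms: f64, // feeds bardo_pipeline_total_duration_ms
}

impl MetricsBridge {
    fn on_inference_end(&mut self, cache_read_tokens: u64,
                        cost_usd: f64, latency_ms: f64) {
        self.total_requests += 1;
        if cache_read_tokens > 0 {
            self.cache_hits += 1;
        }
        self.cost_total_usd += cost_usd;
        self.latency_sum_ms += latency_ms;
    }

    fn cache_hit_rate(&self) -> f64 {
        if self.total_requests == 0 { return 0.0; }
        self.cache_hits as f64 / self.total_requests as f64
    }
}

fn main() {
    let mut bridge = MetricsBridge::default();
    bridge.on_inference_end(900, 0.0042, 840.0); // cache-assisted call
    bridge.on_inference_end(0, 0.0100, 1200.0);  // cold call
    println!("hit rate {}", bridge.cache_hit_rate()); // 0.5
}
```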


Cache metrics

Prometheus-compatible, with alert thresholds:

| Metric | Type | Alert threshold |
|---|---|---|
| bardo_cache_hit_rate | Gauge | <30% for 15 min |
| bardo_cache_entries | Gauge | >maxEntries * 0.95 |
| bardo_cache_savings_usd | Counter | |
| bardo_cache_stale_served | Counter | >0 |
| bardo_prefix_cache_hit_rate | Gauge | <50% for 15 min |
| bardo_cache_invalidations | Counter | Spike detection |

Anthropic’s Claude Code team declares internal SEV incidents when prompt cache hit rates drop. The gateway applies the same discipline – bardo_prefix_cache_hit_rate below 50% triggers alerts because prompt assembly has broken cache alignment.
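Expressed as a Prometheus alerting rule, the prefix-cache threshold could look like the following; rule, group, and label names are illustrative, and only the metric name and threshold come from the table above:

```yaml
groups:
  - name: bardo-cache
    rules:
      - alert: PrefixCacheHitRateLow
        # Fires when the prefix cache hit rate stays under 50% for 15 minutes.
        expr: bardo_prefix_cache_hit_rate < 0.5
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "Prefix cache hit rate below 50%: prompt assembly has likely broken cache alignment"
```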


Revenue metrics

Moved to ../revenue-model.md Section 3.2 (Inference Gateway, Revenue Metrics).


Pipeline profiling metrics

Per-layer latency tracking for the 8-layer pipeline. Targets assume a Rust gateway with 4+ cores:

| Metric | Type | Alert threshold | Performance target |
|---|---|---|---|
| bardo_pipeline_l1_duration_ms | Histogram | >5ms | <1ms |
| bardo_pipeline_l2_duration_ms | Histogram | >20ms | <5ms (fastembed-rs + HNSW) |
| bardo_pipeline_l3_duration_ms | Histogram | >1ms | <0.1ms (DashMap lookup) |
| bardo_pipeline_l4_duration_ms | Histogram | >5ms | <1ms |
| bardo_pipeline_l5_duration_ms | Histogram | >3000ms | 200-2000ms (conditional) |
| bardo_pipeline_l6_duration_ms | Histogram | >5ms | <1ms |
| bardo_pipeline_l7_duration_ms | Histogram | >5ms | <1ms (compiled regex) |
| bardo_pipeline_l8_duration_ms | Histogram | >15ms | <8ms (DeBERTa INT8) |
| bardo_pipeline_total_duration_ms | Histogram | >50ms | <12ms (no L5, parallel) |
| bardo_pipeline_profile | Counter | | |
| bardo_batch_api_savings_usd | Counter | | |
| bardo_diem_credits_consumed | Counter | | |
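Per-layer recording can be as simple as timing each layer closure and feeding the elapsed milliseconds into the matching histogram; a std-only sketch, with a hypothetical helper name:

```rust
use std::time::Instant;

// Hypothetical helper: run one pipeline layer, returning its result and
// the elapsed milliseconds to record into pipeline_duration_ms[layer].
fn timed_layer<T>(layer: impl FnOnce() -> T) -> (T, f64) {
    let start = Instant::now();
    let out = layer();
    let elapsed_ms = start.elapsed().as_secs_f64() * 1000.0;
    (out, elapsed_ms)
}

fn main() {
    let (sum, ms) = timed_layer(|| (0..1000u64).sum::<u64>());
    println!("layer result {sum} in {ms:.3} ms");
}
```

In the gateway this would pair with the collector, e.g. `metrics.pipeline_duration_ms[2].record(ms, &[])` for an L3 lookup.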

Performance targets by profile

| Profile | P95 target | Layers active | When used |
|---|---|---|---|
| Minimal | <1ms | L3 only | T0 heartbeat ticks |
| Fast | <15ms | L1, L3, L6 | Golem internal calls |
| Standard | <40ms | L1-L6 | User-facing requests |
| Full | <65ms | L1-L8 | Private/high-security requests |
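The profile-to-layer mapping can be encoded directly so routing decisions and dashboards stay in sync; a std-only sketch, where the enum and method names are hypothetical:

```rust
// Hypothetical encoding of the profile table: which of the eight
// pipeline layers each profile activates (1-indexed layer numbers).
#[derive(Debug, Clone, Copy)]
enum PipelineProfile {
    Minimal,  // T0 heartbeat ticks
    Fast,     // Golem internal calls
    Standard, // user-facing requests
    Full,     // private/high-security requests
}

impl PipelineProfile {
    fn active_layers(self) -> Vec<u8> {
        match self {
            PipelineProfile::Minimal => vec![3],
            PipelineProfile::Fast => vec![1, 3, 6],
            PipelineProfile::Standard => (1..=6).collect(),
            PipelineProfile::Full => (1..=8).collect(),
        }
    }
}

fn main() {
    println!("{:?}", PipelineProfile::Fast.active_layers()); // [1, 3, 6]
}
```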

Concurrency targets

| Concurrent requests | P95 target |
|---|---|
| 1 | 12ms |
| 10 | 13ms |
| 50 | 15ms |
| 100 | 18ms |
| 500 | 25ms |

Tokio’s work-stealing scheduler + rayon for data parallelism keep latency flat under load. Unlike TypeScript gateways (which degrade above ~50 concurrent ONNX calls), the Rust gateway batches concurrent embedding and DeBERTa inference across all cores.
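The batching idea can be sketched with scoped std threads (rayon replaces this in the gateway; the function is hypothetical and the per-item work is a stand-in, not a real embedding):

```rust
// Split pending embedding inputs into one chunk per core and process
// the chunks in parallel, so N concurrent requests cost roughly one
// batched pass instead of N serial ONNX calls.
fn embed_batch(inputs: &[String], cores: usize) -> Vec<f32> {
    let chunk = inputs.len().div_ceil(cores).max(1);
    let mut out = vec![0.0f32; inputs.len()];
    std::thread::scope(|s| {
        for (slot, texts) in out.chunks_mut(chunk).zip(inputs.chunks(chunk)) {
            s.spawn(move || {
                for (o, t) in slot.iter_mut().zip(texts) {
                    // Stand-in for the embedding model: score by length.
                    *o = t.len() as f32;
                }
            });
        }
    });
    out
}

fn main() {
    let inputs: Vec<String> = vec!["ab".into(), "abcd".into()];
    println!("{:?}", embed_batch(&inputs, 2)); // [2.0, 4.0]
}
```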


Cross-references

| Topic | Document | What it covers |
|---|---|---|
| Audit log schema | 07-safety.md | PII detection, prompt injection defense, and the audit log entry structure for every inference request |
| Revenue model | 03-economics.md | x402 spread revenue model, per-tenant cost attribution formulas, and infrastructure cost projections |
| Analytics endpoints | 09-api.md | API reference with 33 endpoints including GET /v1/analytics/spend and spend breakdown queries |
| Cache architecture | 02-caching.md | Three-layer cache stack (hash, semantic, prefix) with regime-aware invalidation and hit rate metrics |
| Privacy and trust | 11-privacy-trust.md | Cryptographic audit trail with hash chains and Merkle tree anchoring for provenance verification |
| Event Fabric | ../01-golem/01b-runtime-infrastructure.md | The runtime’s event streaming infrastructure that carries GolemEvents from subsystems to TUI and telemetry consumers |
| Rust implementation | 14-rust-implementation.md | 10-crate Rust workspace including bardo-telemetry crate for tracing and OpenTelemetry metrics |