# 08 – Observability and analytics [SPEC]
Per-agent cost attribution, OpenTelemetry traces, Event Fabric integration, cache metrics, performance targets
Related: 07-safety.md (PII detection, prompt injection defense, and audit logging), 03-economics.md (x402 spread revenue model and per-tenant cost attribution), 09-api.md (API reference with 33 endpoints including analytics), 11-privacy-trust.md (cryptographic audit trail with Merkle anchoring and provenance records)
Reader orientation: This document specifies the observability and analytics layer of Bardo Inference (the LLM inference gateway for mortal autonomous DeFi agents called Golems). It belongs to the inference plane and covers per-agent cost attribution, OpenTelemetry trace integration, Event Fabric event streaming, and cache performance metrics. The key concept is that every inference call produces structured telemetry that enables operators to understand cost, latency, and cache efficiency per agent and per subsystem. For term definitions, see prd2/shared/glossary.md.
## Per-tenant cost attribution API
Every tenant (identified by API key or wallet address) can query its inference spend with full breakdowns by model, optimization type, and time period. Golem agents (mortal autonomous DeFi agents managed by the Bardo runtime) additionally get per-strategy attribution.
### `GET /v1/analytics/spend`
```rust
// crates/bardo-telemetry/src/spend.rs
use serde::Serialize;

#[derive(Debug, Clone, Serialize)]
pub struct SpendAnalytics {
    pub tenant_id: String,                 // API key hash or wallet address
    pub period_start: u64,
    pub period_end: u64,
    pub total_spend_usdc: f64,             // paid by user (including spread)
    pub total_blockrun_cost_usdc: f64,     // paid to BlockRun
    pub total_spread_usdc: f64,            // earned by operator
    pub estimated_naive_cost_usdc: f64,    // without context engineering
    pub total_requests: u64,
    pub prompt_cache_hit_rate: f64,
    pub semantic_cache_hit_rate: f64,
    pub tokens_saved_by_routing: u64,
    pub tokens_saved_by_compression: u64,
    pub tokens_saved_by_tool_pruning: u64,
    pub cost_saved_total_usdc: f64,
    pub compactions_triggered: u64,
    pub handoffs_triggered: u64,
    pub savings_semantic_cache: f64,       // USDC saved, by optimization type
    pub savings_prefix_cache: f64,
    pub savings_tool_pruning: f64,
    pub savings_history_compression: f64,
    pub by_model: Vec<ModelSpend>,         // model_id, provider, requests, spend/cost, purpose
}
```
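The monetary fields are related by two invariants the analytics endpoint relies on: what the user pays splits into upstream BlockRun cost plus operator spread, and realized savings are the gap between the estimated naive cost and actual spend. A minimal sketch of that arithmetic (the helper name is hypothetical, not part of the crate):

```rust
// Hypothetical helper showing how the derived monetary fields of
// SpendAnalytics relate: spread = user spend - upstream cost, and
// savings = estimated naive cost (no context engineering) - actual spend.
fn spread_and_savings(
    total_spend_usdc: f64,
    blockrun_cost_usdc: f64,
    naive_cost_usdc: f64,
) -> (f64, f64) {
    let spread = total_spend_usdc - blockrun_cost_usdc; // earned by operator
    let savings = naive_cost_usdc - total_spend_usdc;   // saved vs. naive prompting
    (spread, savings)
}

fn main() {
    // A tenant paid 12 USDC; upstream cost was 10; naive prompting would
    // have cost an estimated 30.
    let (spread, savings) = spread_and_savings(12.0, 10.0, 30.0);
    assert!((spread - 2.0).abs() < 1e-9);
    assert!((savings - 18.0).abs() < 1e-9);
}
```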
## OTEL metrics and tracing setup
A single `MetricsCollector` registers every instrument from the tables below. Pipeline layers take a shared reference instead of touching global state.
```rust
// crates/bardo-telemetry/src/metrics.rs
use opentelemetry::{global, metrics::{Counter, Gauge, Histogram, Meter}};
use opentelemetry::trace::TracerProvider as _; // brings .tracer() into scope
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::metrics::SdkMeterProvider;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

pub struct MetricsCollector {
    pub cache_hit_rate: Gauge<f64>,
    pub cache_entries: Gauge<u64>,
    pub cache_savings_usd: Counter<f64>,
    pub cache_stale_served: Counter<u64>,
    pub prefix_cache_hit_rate: Gauge<f64>,
    pub cache_invalidations: Counter<u64>,
    pub spread_earned_total: Counter<f64>,
    pub blockrun_cost_total: Counter<f64>,
    pub user_savings_total: Counter<f64>,
    pub spread_pct_effective: Gauge<f64>,
    pub fallback_requests: Counter<u64>,
    pub pipeline_duration_ms: [Histogram<f64>; 8], // one per layer
    pub pipeline_total_duration_ms: Histogram<f64>,
    pub pipeline_profile: Counter<u64>,
    pub batch_api_savings_usd: Counter<f64>,
    pub diem_credits_consumed: Counter<u64>,
}

impl MetricsCollector {
    pub fn new(m: &Meter) -> Self {
        Self {
            cache_hit_rate: m.f64_gauge("bardo_cache_hit_rate").build(),
            cache_entries: m.u64_gauge("bardo_cache_entries").build(),
            cache_savings_usd: m.f64_counter("bardo_cache_savings_usd").build(),
            cache_stale_served: m.u64_counter("bardo_cache_stale_served").build(),
            prefix_cache_hit_rate: m.f64_gauge("bardo_prefix_cache_hit_rate").build(),
            cache_invalidations: m.u64_counter("bardo_cache_invalidations").build(),
            spread_earned_total: m.f64_counter("bardo_spread_earned_total").build(),
            blockrun_cost_total: m.f64_counter("bardo_blockrun_cost_total").build(),
            user_savings_total: m.f64_counter("bardo_user_savings_total").build(),
            spread_pct_effective: m.f64_gauge("bardo_spread_pct_effective").build(),
            fallback_requests: m.u64_counter("bardo_fallback_requests").build(),
            pipeline_duration_ms: std::array::from_fn(|i| {
                m.f64_histogram(format!("bardo_pipeline_l{}_duration_ms", i + 1)).build()
            }),
            pipeline_total_duration_ms: m.f64_histogram("bardo_pipeline_total_duration_ms").build(),
            pipeline_profile: m.u64_counter("bardo_pipeline_profile").build(),
            batch_api_savings_usd: m.f64_counter("bardo_batch_api_savings_usd").build(),
            diem_credits_consumed: m.u64_counter("bardo_diem_credits_consumed").build(),
        }
    }
}

/// Wire tracing_subscriber + OTEL exporters. Call once at startup.
pub fn init_telemetry(endpoint: &str) -> SdkMeterProvider {
    let provider = SdkMeterProvider::builder()
        .with_periodic_exporter(
            opentelemetry_otlp::MetricExporter::builder()
                .with_tonic().with_endpoint(endpoint).build()
                .expect("OTLP metric exporter init failed"),
        )
        .build();
    global::set_meter_provider(provider.clone());

    // Spans: the tracing-opentelemetry layer needs a Tracer, not a raw
    // exporter, so wrap the OTLP span exporter in a tracer provider first
    // (opentelemetry-rust 0.28+ API).
    let span_exporter = opentelemetry_otlp::SpanExporter::builder()
        .with_tonic().with_endpoint(endpoint).build()
        .expect("OTLP span exporter init failed");
    let tracer_provider = opentelemetry_sdk::trace::SdkTracerProvider::builder()
        .with_batch_exporter(span_exporter)
        .build();
    let tracer = tracer_provider.tracer("bardo");

    tracing_subscriber::registry()
        .with(EnvFilter::from_default_env())
        .with(tracing_subscriber::fmt::layer())
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .init();
    provider
}
```
Extended: OTEL trace tree, quality evaluation pipeline, arena model evaluation (Phase 2+), cost dashboard visualizations – see ../../prd2-extended/12-inference/08-observability-extended.md
## Event Fabric integration
The inference gateway emits `GolemEvent` variants to the Event Fabric (see 01b-runtime-infrastructure.md, section 3, for the `tokio::broadcast` ring buffer). Every inference state transition becomes a typed, serializable event consumed by surfaces (TUI, web, Telegram, Discord) without cross-referencing other events.
### Inference events emitted
| Event | Trigger | Key fields |
|---|---|---|
| `InferenceStart` | Request enters the pipeline | `model`, `tier`, `estimated_input_tokens`, `pipeline_profile` |
| `InferenceToken` | Streaming chunk received | `model`, `tier`, `token_count` (cumulative) |
| `InferenceEnd` | LLM call completes | `input_tokens`, `output_tokens`, `cache_read_tokens`, `cost_usd`, `latency_ms` |
| `CacheHit` | Semantic or hash cache hit | `cache_layer` (`L2_semantic`, `L3_hash`), `similarity`, `savings_usd` |
| `ProviderFallback` | Primary provider fails | `failed_provider`, `fallback_provider`, `error` |
On the wire, these events carry CamelCase variant names and snake_case fields, tagged with `#[serde(tag = "type")]`. Each is self-contained: an `InferenceEnd` event includes model, tier, token counts, cost, and latency in a single payload.
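To make the wire shape concrete, here is a dependency-free sketch of the JSON an `InferenceEnd` event would serialize to under `#[serde(tag = "type")]` (hand-rolled with `format!` for illustration; the gateway itself would use serde):

```rust
// Sketch of the internally-tagged wire format: the variant name becomes a
// "type" field and the payload fields stay snake_case in the same object.
fn inference_end_json(
    input_tokens: u64,
    output_tokens: u64,
    cost_usd: f64,
    latency_ms: u64,
) -> String {
    format!(
        r#"{{"type":"InferenceEnd","input_tokens":{},"output_tokens":{},"cost_usd":{},"latency_ms":{}}}"#,
        input_tokens, output_tokens, cost_usd, latency_ms
    )
}

fn main() {
    let j = inference_end_json(1200, 340, 0.0042, 910);
    // The variant name rides along as the "type" discriminant.
    assert!(j.contains(r#""type":"InferenceEnd""#));
    assert!(j.contains(r#""input_tokens":1200"#));
    println!("{j}");
}
```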
### Event emission pattern
```rust
// crates/bardo-telemetry/src/events.rs

/// Emit an inference event to the Event Fabric.
/// Near-zero cost when no subscriber exists: the subscriber-count check
/// short-circuits before any serialization happens.
pub fn emit_inference_event(
    fabric: &EventFabric,
    golem_id: &str,
    tick: u64,
    event: GolemEvent,
) {
    // golem_id and tick identify the emitting agent and runtime tick so the
    // fabric can route and order the event.
    // Only serialize when someone is listening.
    if fabric.subscriber_count() > 0 {
        fabric.emit(event);
    }
}
```
Events feed directly into the OTEL metrics collector: every `InferenceEnd` event updates `bardo_pipeline_total_duration_ms`, `bardo_cache_hit_rate`, and the cost counters.
## Cache metrics
Prometheus-compatible, with alert thresholds:
| Metric | Type | Alert threshold |
|---|---|---|
| `bardo_cache_hit_rate` | Gauge | <30% for 15 min |
| `bardo_cache_entries` | Gauge | >`maxEntries` * 0.95 |
| `bardo_cache_savings_usd` | Counter | – |
| `bardo_cache_stale_served` | Counter | >0 |
| `bardo_prefix_cache_hit_rate` | Gauge | <50% for 15 min |
| `bardo_cache_invalidations` | Counter | Spike detection |
Anthropic’s Claude Code team declares internal SEV incidents when prompt cache hit rates drop. The gateway applies the same discipline – bardo_prefix_cache_hit_rate below 50% triggers alerts because prompt assembly has broken cache alignment.
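The "<30% for 15 min" style thresholds are sustained-window conditions, not single-sample triggers. A std-only sketch of one way to evaluate them (the `SustainedLowAlert` type is illustrative, not part of the codebase):

```rust
use std::collections::VecDeque;

/// Hypothetical sustained-threshold alert: fires only when the hit-rate gauge
/// has stayed below `threshold` for the whole `window_secs` span, mirroring
/// the "<30% for 15 min" rule for bardo_cache_hit_rate.
struct SustainedLowAlert {
    threshold: f64,
    window_secs: u64,
    samples: VecDeque<(u64, f64)>, // (unix_ts, hit_rate)
}

impl SustainedLowAlert {
    fn new(threshold: f64, window_secs: u64) -> Self {
        Self { threshold, window_secs, samples: VecDeque::new() }
    }

    /// Record a sample; returns true when the alert should fire.
    fn observe(&mut self, ts: u64, hit_rate: f64) -> bool {
        self.samples.push_back((ts, hit_rate));
        // Drop samples older than the window.
        while let Some(&(t, _)) = self.samples.front() {
            if ts - t > self.window_secs { self.samples.pop_front(); } else { break; }
        }
        // Fire only if the window is fully covered and every sample is low.
        let covered = self.samples.front().map_or(false, |&(t, _)| ts - t >= self.window_secs);
        covered && self.samples.iter().all(|&(_, r)| r < self.threshold)
    }
}

fn main() {
    let mut alert = SustainedLowAlert::new(0.30, 900);
    // 16 minutes of 20% hit rate, sampled each minute: the alert fires.
    let mut fired = false;
    for i in 0..=16 {
        fired = alert.observe(i * 60, 0.20);
    }
    assert!(fired);
    // A single healthy sample clears the condition.
    assert!(!alert.observe(17 * 60, 0.80));
}
```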
## Revenue metrics
Moved to ../revenue-model.md Section 3.2 (Inference Gateway, Revenue Metrics).
## Pipeline profiling metrics
Per-layer latency tracking for the 8-layer pipeline. Targets assume a Rust gateway with 4+ cores:
| Metric | Type | Alert threshold | Performance target |
|---|---|---|---|
| `bardo_pipeline_l1_duration_ms` | Histogram | >5ms | <1ms |
| `bardo_pipeline_l2_duration_ms` | Histogram | >20ms | <5ms (fastembed-rs + HNSW) |
| `bardo_pipeline_l3_duration_ms` | Histogram | >1ms | <0.1ms (DashMap lookup) |
| `bardo_pipeline_l4_duration_ms` | Histogram | >5ms | <1ms |
| `bardo_pipeline_l5_duration_ms` | Histogram | >3000ms | 200–2000ms (conditional) |
| `bardo_pipeline_l6_duration_ms` | Histogram | >5ms | <1ms |
| `bardo_pipeline_l7_duration_ms` | Histogram | >5ms | <1ms (compiled regex) |
| `bardo_pipeline_l8_duration_ms` | Histogram | >15ms | <8ms (DeBERTa INT8) |
| `bardo_pipeline_total_duration_ms` | Histogram | >50ms | <12ms (no L5, parallel) |
| `bardo_pipeline_profile` | Counter | – | – |
| `bardo_batch_api_savings_usd` | Counter | – | – |
| `bardo_diem_credits_consumed` | Counter | – | – |
## Performance targets by profile
| Profile | P95 target | Layers active | When used |
|---|---|---|---|
| Minimal | <1ms | L3 only | T0 heartbeat ticks |
| Fast | <15ms | L1, L3, L6 | Golem internal calls |
| Standard | <40ms | L1-L6 | User-facing requests |
| Full | <65ms | L1-L8 | Private/high-security requests |
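P95 here is the usual nearest-rank percentile over recent request latencies. A minimal sketch of how a target check could be computed (helper name hypothetical):

```rust
/// Hypothetical nearest-rank P95 over a latency sample buffer:
/// sort ascending, take the value at rank ceil(0.95 * n).
fn p95_ms(samples: &mut Vec<f64>) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((samples.len() as f64) * 0.95).ceil() as usize;
    samples[rank.saturating_sub(1)] // rank is 1-based; index is 0-based
}

fn main() {
    // 100 evenly spread latencies: nearest-rank P95 is the 95th value.
    let mut xs: Vec<f64> = (1..=100).map(|i| i as f64).collect();
    assert_eq!(p95_ms(&mut xs), 95.0);
}
```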
## Concurrency targets
| Concurrent requests | P95 target |
|---|---|
| 1 | 12ms |
| 10 | 13ms |
| 50 | 15ms |
| 100 | 18ms |
| 500 | 25ms |
Tokio’s work-stealing scheduler + rayon for data parallelism keep latency flat under load. Unlike TypeScript gateways (which degrade above ~50 concurrent ONNX calls), the Rust gateway batches concurrent embedding and DeBERTa inference across all cores.
## Cross-references
| Topic | Document | What it covers |
|---|---|---|
| Audit log schema | 07-safety.md | PII detection, prompt injection defense, and the audit log entry structure for every inference request |
| Revenue model | 03-economics.md | x402 spread revenue model, per-tenant cost attribution formulas, and infrastructure cost projections |
| Analytics endpoints | 09-api.md | API reference with 33 endpoints including GET /v1/analytics/spend and spend breakdown queries |
| Cache architecture | 02-caching.md | Three-layer cache stack (hash, semantic, prefix) with regime-aware invalidation and hit rate metrics |
| Privacy and trust | 11-privacy-trust.md | Cryptographic audit trail with hash chains and Merkle tree anchoring for provenance verification |
| Event Fabric | ../01-golem/01b-runtime-infrastructure.md | The runtime’s event streaming infrastructure that carries GolemEvents from subsystems to TUI and telemetry consumers |
| Rust implementation | 14-rust-implementation.md | 10-crate Rust workspace including bardo-telemetry crate for tracing and OpenTelemetry metrics |