03 – Inference Economics [SPEC]

x402 Spread Revenue Model, Per-Tenant Cost Attribution, and Infrastructure Costs

Related: 00-overview.md (gateway architecture, payment flows, and design principles), 02-caching.md (three-layer cache stack that drives 40-85% cost savings), prd2/11-compute/03-billing.md (Compute VM billing and inference token allowances)


Reader orientation: This document specifies the economics of Bardo Inference (the LLM inference gateway for mortal autonomous DeFi agents called Golems). It belongs to the inference plane and covers the x402 spread revenue model, per-tenant cost attribution, and infrastructure cost projections. The key concept is that context engineering makes the all-in cost cheaper than direct API access even after the operator’s spread, creating a positive-sum model where users save money and operators earn margin. For term definitions, see prd2/shared/glossary.md.

Revenue model: x402 spread

See shared/x402-protocol.md for the x402 payment protocol specification.

Revenue = x402 (a micropayment protocol for HTTP-native USDC payments on Base) spread. Bardo charges a configurable markup (default ~20%) over BlockRun’s cost. Context engineering makes the all-in cost cheaper than direct API access despite the spread. The user saves money, the operator earns margin.

The only revenue stream

The operator configures a spread percentage applied to the optimized cost (after context engineering), not the naive cost.

BlockRun cost (optimized):  $X
Spread (default 20%):       $X × 0.20
User pays:                  $X × 1.20

Operator margin per request: $X × 0.20
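A minimal sketch of this calculation (the helper name `apply_spread` and the basis-point representation are illustrative, not from the codebase):

```rust
/// Sketch of the spread calculation (hypothetical helper, not a function
/// from the Bardo codebase). `spread_bps` is the operator-configured
/// spread in basis points: default 2000 (20%), clamped to 500-5000 (5-50%).
pub fn apply_spread(blockrun_cost_usd: f64, spread_bps: u32) -> (f64, f64) {
    let bps = spread_bps.clamp(500, 5000);
    let margin = blockrun_cost_usd * (bps as f64 / 10_000.0);
    (blockrun_cost_usd + margin, margin)
}
```

At the default 2000 bps, an optimized cost of $0.10 quotes $0.12 to the user and books $0.02 of operator margin.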

Worked example

Request: 50K input tokens + 2K output tokens (Sonnet-equivalent)

Naive cost (direct API):                    $0.180
  Input:  50K × $3.00/M  = $0.150
  Output:  2K × $15.00/M = $0.030
  (BlockRun, unoptimized, is slightly cheaper: $0.165)

Context engineering savings (~40% of the $0.165 BlockRun cost):
  - ~13K input tokens' prefix discounted 90%:   saves $0.036
  - Tool pruning removes 8K tokens:             saves $0.024
  - History compression saves 2K tokens:        saves $0.006
  Optimized cost:              $0.165 − $0.066 = $0.099

User pays (20% spread):                        $0.119
  BlockRun cost:  $0.099
  Spread:         $0.020

User saves vs. unoptimized BlockRun: 28% ($0.165 → $0.119)
User saves vs. direct API:           34% ($0.180 → $0.119)
Operator margin:                     $0.020/request
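The arithmetic above can be reproduced with two small helpers (hypothetical names; the per-token rates are the Sonnet-class prices used in the example):

```rust
// Sonnet-class rates from the example above (assumed, per token).
const INPUT_PER_TOKEN_USD: f64 = 3.00 / 1_000_000.0;
const OUTPUT_PER_TOKEN_USD: f64 = 15.00 / 1_000_000.0;

/// Unoptimized cost of a request at direct-API rates.
pub fn naive_cost_usd(input_tokens: u64, output_tokens: u64) -> f64 {
    input_tokens as f64 * INPUT_PER_TOKEN_USD
        + output_tokens as f64 * OUTPUT_PER_TOKEN_USD
}

/// What the user pays: optimized upstream cost plus the operator spread.
pub fn user_price_usd(naive_blockrun: f64, savings: f64, spread: f64) -> f64 {
    (naive_blockrun - savings) * (1.0 + spread)
}
```

For the worked request, `naive_cost_usd(50_000, 2_000)` gives $0.180, and `user_price_usd(0.165, 0.066, 0.20)` gives the quoted ~$0.119.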

Key properties

  • User always saves: Context engineering savings (~40%) > spread (20%)
  • Zero float: Both legs settle instantly via x402 on Base. No credit risk.
  • Zero working capital: No prepaid API credits to manage (BlockRun is x402-native)
  • Spread is operator-configurable: Range 5-50%, default 20%
  • One revenue stream: Simple pricing, simple explanation to users

Cost savings stack

Seven optimization layers compound to produce 40-85% cost reduction:

| Layer | Mechanism | Savings | How |
|---|---|---|---|
| T0 routing | FSM suppresses ~80% of ticks | 80% of ticks at $0 | No LLM call for stable markets |
| Semantic + hash cache | Zero-cost on cache hits | ~20% of remaining at $0 | Embedding similarity + exact match |
| Prompt cache alignment | 90% discount on cached prefix | 60-80% on input tokens | Stable-before-dynamic reordering |
| Tool pruning | Remove irrelevant tool definitions | 97.5% of tool tokens | Semantic search, ≤12 tools per request |
| Multi-model routing | Match task to cheapest capable model | 50-90% per request | RouteLLM classifier, T1/T2 split |
| DIEM staking | 100% discount on Venice-routed calls | Free inference | Venice DIEM balance covers cost |
| Batch API | 50% discount on async processing | 50% on dream cycles | Anthropic/OpenAI batch endpoints |

Semantic cache tuning

Cache TTLs are regime-aware – volatile markets invalidate cached responses faster:

| Market Regime | TTL (seconds) |
|---|---|
| Calm | 300 |
| Normal | 210 |
| Volatile | 90 |
| Crisis | 30 |

Similarity thresholds vary by domain to prevent stale data leaking into high-stakes decisions:

| Domain | Cosine Threshold |
|---|---|
| General market analysis | 0.92 |
| Strategy reasoning | 0.95 |
| Trade execution | 0.98 |

Higher thresholds in execution contexts mean fewer cache hits but zero risk of acting on semantically similar but materially different market conditions.
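A sketch of the two lookup tables as code (enum and function names are illustrative, not the codebase API):

```rust
/// Market regimes and query domains from the tables above
/// (illustrative names, not the actual Bardo types).
#[derive(Clone, Copy)]
pub enum Regime { Calm, Normal, Volatile, Crisis }

#[derive(Clone, Copy)]
pub enum Domain { MarketAnalysis, StrategyReasoning, TradeExecution }

/// Regime-aware semantic cache TTL in seconds.
pub fn cache_ttl_secs(regime: Regime) -> u64 {
    match regime {
        Regime::Calm => 300,
        Regime::Normal => 210,
        Regime::Volatile => 90,
        Regime::Crisis => 30,
    }
}

/// Minimum cosine similarity for a cache hit, by domain.
pub fn similarity_threshold(domain: Domain) -> f64 {
    match domain {
        Domain::MarketAnalysis => 0.92,
        Domain::StrategyReasoning => 0.95,
        Domain::TradeExecution => 0.98,
    }
}
```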


Infrastructure cost

The operator’s only fixed cost is hosting the proxy binary.

| Scale | Infrastructure | Monthly Cost |
|---|---|---|
| Small (< 1K req/day) | Single VPS | $50 |
| Medium (1K-10K req/day) | 2-3 instances + load balancer | $200 |
| Large (10K-100K req/day) | Kubernetes cluster | $500 |

Default stack: in-memory LRU cache (configurable size, default 100MB), embedded HNSW index for semantic vectors, SQLite for sessions/memory/analytics. For large-scale deployments, optional backends (Redis, Qdrant, ClickHouse) can be swapped in for horizontal scalability.


ERC-8004 reputation discounts (future)

Users with high ERC-8004 reputation scores get reduced spread:

| Reputation Tier | Spread |
|---|---|
| None (default) | 20% |
| Basic (50+) | 18% |
| Verified (200+) | 15% |
| Trusted (500+) | 12% |
| Sovereign (1000+) | 8% |

This creates a flywheel: use Bardo -> build reputation -> get cheaper access -> use Bardo more. Linking is optional via POST /v1/identity/link.
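The tier schedule reduces to a simple lookup (illustrative sketch; `spread_for_reputation` is a hypothetical name). The spread is returned as a fraction:

```rust
/// Spread fraction for a tenant's ERC-8004 reputation score, per the
/// tier table above. `None` means no identity has been linked.
pub fn spread_for_reputation(score: Option<u32>) -> f64 {
    match score {
        None => 0.20,                 // no linked ERC-8004 identity
        Some(s) if s >= 1000 => 0.08, // Sovereign
        Some(s) if s >= 500 => 0.12,  // Trusted
        Some(s) if s >= 200 => 0.15,  // Verified
        Some(s) if s >= 50 => 0.18,   // Basic
        Some(_) => 0.20,              // below Basic: default spread
    }
}
```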


Per-tenant cost tracking

Every user (API key or wallet address) gets per-request and aggregate cost tracking. Every response includes cost transparency headers:

X-Bardo-Cost: 0.119            # What the user paid
X-Bardo-Blockrun-Cost: 0.099   # What Bardo paid BlockRun
X-Bardo-Spread: 0.020          # Operator margin
X-Bardo-Naive-Cost: 0.165      # What it would have cost without optimization
X-Bardo-Savings: 0.046         # User savings vs. naive (naive - user paid)
X-Bardo-Balance: 12.38         # Remaining prepaid balance (prepaid mode only)
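Assembling these headers is a straightforward formatting step; a hedged sketch (the builder function is hypothetical, but the header names match the examples above):

```rust
/// Build the cost-transparency headers for one response.
/// Illustrative helper, not the actual middleware; amounts in USD.
pub fn cost_headers(
    blockrun_cost: f64,
    spread: f64,
    naive_cost: f64,
) -> Vec<(String, String)> {
    let user_paid = blockrun_cost + spread;
    vec![
        ("X-Bardo-Cost".into(), format!("{:.3}", user_paid)),
        ("X-Bardo-Blockrun-Cost".into(), format!("{:.3}", blockrun_cost)),
        ("X-Bardo-Spread".into(), format!("{:.3}", spread)),
        ("X-Bardo-Naive-Cost".into(), format!("{:.3}", naive_cost)),
        // Savings are always naive minus what the user actually paid.
        ("X-Bardo-Savings".into(), format!("{:.3}", naive_cost - user_paid)),
    ]
}
```

Feeding in the worked example ($0.099 upstream, $0.020 spread, $0.165 naive) reproduces the header values shown above.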

Per-tenant cost summary

// crates/bardo-telemetry/src/cost.rs

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TenantCostSummary {
    pub tenant_id: String,
    pub period: CostPeriod,
    pub total_requests: u64,
    /// What the user paid (including spread).
    pub total_paid: f64,
    /// What Bardo paid BlockRun.
    pub total_blockrun_cost: f64,
    /// Operator margin.
    pub total_spread: f64,
    /// Estimated cost without context engineering.
    pub estimated_naive_cost: f64,
    /// Naive minus user-paid.
    pub total_savings: f64,
    pub savings_breakdown: SavingsBreakdown,
    pub by_model: Vec<ModelCostBreakdown>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum CostPeriod {
    Day,
    Week,
    Month,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SavingsBreakdown {
    pub semantic_cache: f64,
    pub prefix_cache: f64,
    pub tool_pruning: f64,
    pub history_compression: f64,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelCostBreakdown {
    pub model: String,
    pub requests: u64,
    pub input_tokens: u64,
    pub output_tokens: u64,
    pub user_paid: f64,
    pub blockrun_cost: f64,
    pub savings: f64,
}

Queryable via GET /v1/analytics/spend?period=day|week|month.


Fallback cost tracking

When the proxy falls back from BlockRun (x402) to OpenRouter (API key), the x402 payment path is bypassed for the upstream leg. Cost tracking still applies: estimates use published per-token pricing from the provider catalog, and daily reconciliation against the API-key provider’s usage dashboard catches drift (typically <5%).
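The reconciliation check can be sketched as (function names are illustrative):

```rust
/// Relative drift between summed per-token estimates and the
/// provider dashboard total for the same day. Illustrative sketch.
pub fn reconciliation_drift(estimated_usd: f64, dashboard_usd: f64) -> f64 {
    if dashboard_usd == 0.0 {
        return 0.0; // nothing billed upstream; no drift to measure
    }
    (estimated_usd - dashboard_usd).abs() / dashboard_usd
}

/// True when drift stays inside the expected <5% envelope.
pub fn within_tolerance(drift: f64) -> bool {
    drift < 0.05
}
```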

See 01a-routing.md for the FallbackCostAdapter.


Moat: agents that pay

The financial security architecture is a competitive moat (see prd2/appendices/moat-pay.md). Six independent security layers, three of them cryptographic, mean that even a fully compromised LLM cannot drain funds. The inference gateway is the economic control plane – it tracks costs, enforces budgets, and ensures the sustainability ratio stays above 1.0.


Bankr self-funding economics

Bankr-routed Golems (mortal autonomous DeFi agents managed by the Bardo runtime) fund inference from revenue. The wallet that earns from DeFi strategies pays for inference directly. No separate funding step. This is the metabolic loop: the organism sustains itself through the activity that requires sustenance.

// crates/bardo-providers/src/bankr.rs

pub fn compute_sustainability_ratio(
    daily_revenue_usd: f64,
    daily_inference_cost_usd: f64,
    daily_compute_cost_usd: f64,
    daily_gas_cost_usd: f64,
) -> f64 {
    let total_daily_cost = daily_inference_cost_usd
        + daily_compute_cost_usd
        + daily_gas_cost_usd;
    if total_daily_cost == 0.0 {
        return f64::INFINITY; // no recorded costs yet: trivially self-sustaining
    }
    daily_revenue_usd / total_daily_cost
}
// ratio > 1.0 -> self-sustaining
// ratio > 2.0 -> thriving (can spawn replicants)
// ratio < 1.0 -> declining
// ratio < 0.5 -> dying

The sustainability ratio feeds directly into routing decisions. When it drops, the gateway routes more aggressively to cheap tiers:

| Sustainability ratio | Budget adjustment | Behavior |
|---|---|---|
| > 2.0 | Expand 1.5x | Revenue exceeds cost, grow capabilities |
| 1.0 - 2.0 | Baseline | Self-sustaining |
| 0.5 - 1.0 | Contract 0.7x | Under pressure, reduce non-critical inference |
| < 0.5 | Contract 0.3x | Near-death, only risk + owner + death funded |

A Golem managing a vault earning $50/day in fees with $15/day in Bankr inference costs has a 3.3x sustainability ratio. Above 1.0x, the Golem is economically immortal (barring epistemic or stochastic death). Below 1.0x, the economic clock is ticking. The gateway’s cost profile naturally curves downward as the Golem approaches death, extending effective lifespan by reducing the largest variable cost.
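The table above maps directly to a multiplier function (a sketch; the name `budget_multiplier` is illustrative):

```rust
/// Budget multiplier for a given sustainability ratio,
/// per the adjustment table above (illustrative sketch).
pub fn budget_multiplier(ratio: f64) -> f64 {
    if ratio > 2.0 {
        1.5 // thriving: expand capabilities
    } else if ratio >= 1.0 {
        1.0 // self-sustaining: baseline budget
    } else if ratio >= 0.5 {
        0.7 // under pressure: cut non-critical inference
    } else {
        0.3 // near-death: only risk + owner + death funded
    }
}
```

The vault example ($50/day revenue, $15/day inference) yields a ratio of ~3.3 and lands in the expansion band.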


Venice DIEM staking economics

Venice DIEM staking provides zero-cost inference for Venice-routed calls. The owner stakes VVV tokens (Venice’s native token on Base), receiving a daily DIEM allocation proportional to stake weight. Each DIEM equals $1/day of Venice API credit, perpetually. DIEM-funded calls show X-Bardo-Cost: 0.00 in response headers.

DIEM is the first-choice cost optimization for privacy-preferring subsystems (dreams, daimon, death reflection). When DIEM balance is available, the provider resolver routes privacy-preferring intents to Venice automatically.

DIEM allocation strategy

Daily DIEM budget: $X (from VVV stake)
+-- Waking inference (private):  60%  -- Portfolio analysis, deal negotiation
+-- Dream cycles (private):      15%  -- Counterfactual reasoning, threat simulation
+-- Sleepwalker artifacts:       15%  -- Observatory research
+-- Reserve (rollover):          10%  -- Unused DIEM for volatile days
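The split can be sketched as (illustrative helper; percentages from the allocation above):

```rust
/// Split the daily DIEM budget into (waking, dreams, sleepwalker, reserve)
/// per the 60/15/15/10 allocation above. Illustrative sketch.
pub fn allocate_diem(daily_budget_usd: f64) -> (f64, f64, f64, f64) {
    (
        daily_budget_usd * 0.60, // waking inference (private)
        daily_budget_usd * 0.15, // dream cycles (private)
        daily_budget_usd * 0.15, // sleepwalker artifacts
        daily_budget_usd * 0.10, // reserve, rolls over to volatile days
    )
}
```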

Mortality integration

In the standard mortality model, inference costs drain the LLM credit partition (60% of total budget). Venice staking decouples inference cost from mortality pressure: Venice (DIEM) calls cost zero marginal cost from staked VVV, draining nothing from the LLM partition. A Golem routing 50% of its inference to Venice extends its projected lifespan by ~30% (saving $0.06-0.10/day on a $0.20/day budget). Over a 30-day lifespan, that is 9 additional days of life.
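The "9 additional days" figure follows from a first-order model: savings accrued over the original lifespan, spent at the original burn rate. A sketch (illustrative, and slightly conservative, since the lower burn rate itself compounds):

```rust
/// First-order lifespan extension in days: the budget freed by a daily
/// saving over the original lifespan, spent at the original burn rate.
/// Illustrative model matching the text above.
pub fn extra_days(lifespan_days: f64, daily_burn_usd: f64, daily_saving_usd: f64) -> f64 {
    lifespan_days * daily_saving_usd / daily_burn_usd
}
```

With a 30-day lifespan, a $0.20/day burn, and $0.06/day saved via Venice routing, this gives 9 extra days (~30%).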


Future revenue

Phase 2: premium services

| Service | Price | Description |
|---|---|---|
| Custom prompt libraries | $10/mo | Operator-curated prompt templates |
| Advanced semantic caching | $5/mo | Larger cache, cross-session sharing |
| Extended analytics | $5/mo | 90-day retention, detailed breakdowns |

Phase 3: additional providers

Adding OpenRouter or direct API keys as fallback providers. These break the “zero working capital” property but increase model availability and redundancy. Revenue model extends naturally: spread applies to all providers. See 02-caching.md for provider cache discount tables used by the cost estimation engine.


Daily cost projections (Golem agents)

For Golem agents using the three-tier system at 100 ticks/day:

| Scenario | T0 | T1 | T2 | Daily LLM Cost |
|---|---|---|---|---|
| Calm market (90% T0, 8% T1, 2% T2) | $0.00 | $0.024 | $0.02 | ~$0.05 |
| Normal market (80% T0, 15% T1, 5% T2) | $0.00 | $0.045 | $0.15 | ~$0.20 |
| Volatile market (60% T0, 25% T1, 15% T2) | $0.00 | $0.075 | $0.75 | ~$0.83 |

Cost savings example (all users)

For a user making 1,000 requests/day:

| Optimization | Applies To | Savings | Daily Savings |
|---|---|---|---|
| Economy model routing | 60% of queries | 85%/query | $2.55 |
| Semantic cache hits | 20% of queries | 100%/query | $1.00 |
| Prompt prefix cache | 80% of queries | 10%/query | $0.40 |
| Tool pruning | 30% of queries | 15%/query | $0.22 |
| Total daily savings | | | $4.17 |

Without the gateway the user pays $5.00/day in direct API costs. With it, the optimized BlockRun cost is $0.83/day and the user pays $0.83 × 1.20 ≈ $1.00/day, an 80% reduction. The user saves ~$4.00/day; the gateway earns ~$0.17/day in spread.
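Treating the optimizations as additive (a simplification, since the layers overlap in practice), the stack can be sketched as:

```rust
/// Total daily savings from a list of optimizations, each given as
/// (fraction of queries affected, fractional saving per affected query).
/// Illustrative sketch of the additive model in the table above.
pub fn daily_savings(baseline_usd: f64, optimizations: &[(f64, f64)]) -> f64 {
    optimizations
        .iter()
        .map(|(share, saving)| baseline_usd * share * saving)
        .sum()
}
```

Applied to the $5.00/day baseline with the four rows above, this returns ~$4.17/day.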


Revenue projections

| Phase | Users | Avg Req/Day/User | Daily Volume | Daily Revenue (20% spread) |
|---|---|---|---|---|
| Launch | 50 | 20 | 1,000 req | ~$20 |
| Growth | 500 | 30 | 15,000 req | ~$300 |
| Product-Market Fit | 5,000 | 40 | 200,000 req | ~$4,000 |
| Scale | 50,000 | 50 | 2,500,000 req | ~$50,000 |

Assumptions: average request cost (post-optimization) ~$0.10, 20% spread = $0.02/request operator margin, context engineering provides ~40% savings vs. naive.

Break-even

At $50/month infrastructure: break-even at ~2,500 requests/month (~84 requests/day). At 20 req/day average: break-even at 5 users.


x402-native provider advantage

BlockRun is the primary upstream provider because it shares Bardo’s x402 payment model. USDC flows directly User -> Bardo -> BlockRun with no fiat bridge, no API keys, and no prepaid credit float.

| Path | User Pays | Bardo Margin | Float Required | Notes |
|---|---|---|---|---|
| Bardo -> BlockRun (x402) | USDC | Spread (20%) | $0 | Primary, zero capital |
| Bardo -> OpenRouter (API key) | USDC | Spread (20%) | ~$100 | Fallback, needs API key |

BlockRun handles provider relationships (Anthropic, OpenAI, Google, OSS). The operator never manages individual provider API keys for the primary path. OpenRouter serves as a fallback for models not available through BlockRun.

Working capital at launch is $0 (BlockRun-only). With OpenRouter fallback enabled: ~$100 prepaid credit float.


Dependencies and costs

| Dependency | Monthly Cost | Float | Notes |
|---|---|---|---|
| BlockRun | Pay-per-use | $0 | x402-native, primary provider |
| OpenRouter | Pay-per-use | ~$100 | Fallback, optional |
| Hosting (VPS/Fly.io) | $50-500/mo | $0 | Single VPS to multi-instance |
| nomic-embed-text-v1.5 | $0 | $0 | Local ONNX, runs on proxy CPU |
| RouteLLM | $0 | $0 | Local, runs on proxy CPU |
| DeBERTa-v3-base | $0 | $0 | Local, runs on proxy CPU |
| Presidio | $0 | $0 | Open source, runs on proxy |
| Langfuse | $0 | $0 | Self-hosted, open source |
| Redis (optional) | $10-50/mo | $0 | For scale: shared cache/sessions |
| Qdrant (optional) | $0-30/mo | $0 | For scale: RAG vectors >1M |
| ClickHouse (optional) | $0-20/mo | $0 | For scale: high-volume analytics |

Total operational cost tracks the infrastructure table above: $50-500/mo depending on scale, plus $0-100 float if OpenRouter fallback is enabled.

Cost per request (Bardo overhead)

| Component | Cost | Latency |
|---|---|---|
| Auth check | $0 | < 1ms |
| Hash cache lookup | $0 | < 1ms |
| Semantic cache lookup | $0 | 5-20ms |
| Complexity classification | $0 | 10-30ms |
| Prompt optimization | $0 | < 50ms |
| PII detection | $0 | < 5ms |
| x402 payment processing | $0 | < 10ms |
| Total Bardo overhead | $0 | < 100ms |

All compute runs on the proxy. No per-request infrastructure costs beyond BlockRun.

Pipeline profiles

Not every request needs the full pipeline. Four profiles trade latency for optimization depth:

| Profile | Layers | Latency (Rust) | Use Case |
|---|---|---|---|
| minimal | L3 only (hash cache) | <1ms | T0 heartbeat ticks, simple lookups |
| fast | L1+L3+L6 | 5-15ms | Golem internal calls, low-stakes queries |
| standard | L1-L6 | 20-40ms | User-facing requests, standard optimization |
| full | L1-L8 | 30-65ms | High-security, first-time requests, unknown sources |

Per-layer cost impact analysis

The naive architecture – give the agent all tool definitions and let it call Opus every tick – costs roughly $85/day at 100 ticks/day. The context engineering pipeline brings that down to the $0.05-0.83/day projected above: a reduction of roughly 100-1,700x depending on market conditions.

Layer-by-layer savings breakdown

| Layer | Savings | Applies To | Condition |
|---|---|---|---|
| Prompt cache alignment | 90% on cached prefix | All Anthropic-backed requests | Stable prefix exists |
| Semantic response cache | 100% on cache hit | All requests | Cosine > threshold (domain-specific) |
| Deterministic hash cache | 100% on exact match | All requests | Identical request |
| Tool pruning | ~97.5% on tool tokens | Requests with tools | Meta-tool pattern enabled |
| History compression | Variable (extends context) | Long conversations | Context > threshold |
| Position optimization | Quality improvement (not cost) | All requests | Always |
| PII masking | Privacy improvement | All requests | PII detected |
| Injection detection | Safety improvement | All requests | Always |

Combined savings by optimization source

| Optimization Source | Savings | Where Applied |
|---|---|---|
| T0 FSM routing (no LLM call) | ~80% of ticks eliminated | Bardo Inference tier routing |
| Semantic + hash cache | ~20% of remaining served from cache | Bardo Inference Layer 2-3 |
| Prompt cache alignment | 90% discount on cached prefix tokens | Bardo Inference Layer 1 + Anthropic |
| Tool pruning (meta-tool) | 97.5% reduction in tool tokens | Bardo Inference Layer 4 |
| Multi-model routing (cheap model per task) | 50-90% per request | Bardo Inference backend router |
| DIEM staking (Venice) | 100% on Venice-routed requests | Venice backend |
| Batch API (dreams) | 50% on batch requests | Direct Key backend |
| Self-funding (Bankr) | Revenue offsets cost | Bankr backend |

Two compounding mechanisms

T0 FSM routing kills ~80% of ticks before any LLM call. When all probes return none severity, the tick exits at OBSERVE with zero inference cost. At 100 ticks/day, that eliminates ~80 LLM calls entirely.

3-layer caching cuts the cost of the remaining 20% by 60-80%. Anthropic prompt cache (90% discount on cached prefix tokens, 5-minute TTL), semantic cache (cosine > 0.92 threshold), and deterministic hash cache together mean that most T1 and T2 calls pay only for the variable portion of context.

Per-layer latency budget (performance cost of savings)

| Layer | Expected Latency | Parallelizable? | Can Skip? |
|---|---|---|---|
| L1: Prompt cache alignment | <1ms | N/A (pure computation) | Skip if non-Anthropic |
| L2: Semantic cache check | 3-8ms | Yes (with L3, L7, L8) | Skip for streaming multi-turn |
| L3: Hash cache check | <0.1ms | Yes | Never – always cheap |
| L4: Tool pruning | <1ms | No (after cache) | Skip if <5 tools |
| L5: History compression | 0ms (check) / 200-2000ms (compress) | Only when triggered | Skip if under 80% context limit |
| L6: Position optimization | <1ms | N/A (pure computation) | Skip if <10 context items |
| L7: PII masking | <1ms (regex) | Yes (with L2, L3, L8) | Skip for Venice, internal requests |
| L8: Injection detection | 3-8ms (DeBERTa ONNX) | Yes (with L2, L3, L7) | Skip for trusted source |

L2, L3, L7, and L8 execute in parallel via tokio::join! (~8ms wall clock). Total pipeline: <50ms p95. The LLM’s own time-to-first-token is 200ms-2000ms, so pipeline overhead is 3-30% of total TTFT – imperceptible.

Cache hits produce negative latency

Direct to provider:     ~800ms average TTFT (Claude Sonnet)
Bardo Inference miss:   ~840ms average TTFT (40ms overhead + 800ms LLM)
Bardo Inference hit:    ~15ms TTFT (cached response, no LLM call)

At 20% hit rate:
  Weighted average TTFT = 0.20 x 15ms + 0.80 x 840ms = 675ms

675ms < 800ms -> Bardo Inference is FASTER on average despite per-request overhead

Semantic caching transforms the gateway from a latency cost into a latency benefit at hit rates above ~5%, which heartbeat ticks easily achieve in calm markets.
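The TTFT arithmetic above, as two illustrative helpers:

```rust
/// Weighted average time-to-first-token at a given cache hit rate,
/// reproducing the calculation above. Illustrative sketch.
pub fn weighted_ttft_ms(hit_rate: f64, hit_ms: f64, miss_ms: f64) -> f64 {
    hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms
}

/// Minimum hit rate at which the gateway beats direct provider latency:
/// solve hit_rate*hit + (1 - hit_rate)*miss < direct for hit_rate.
pub fn break_even_hit_rate(hit_ms: f64, miss_ms: f64, direct_ms: f64) -> f64 {
    (miss_ms - direct_ms) / (miss_ms - hit_ms)
}
```

With 15ms hits, 840ms misses, and an 800ms direct baseline, `break_even_hit_rate` gives ~4.8%, matching the ~5% figure above; at a 20% hit rate the weighted TTFT is 675ms.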


Failure mode: passthrough, not error

Every pipeline layer failing degrades to passthrough. The worst case is “Bardo Inference acts like a dumb proxy.” Embedding model crash -> hash cache still works. DeBERTa crash -> continue without injection detection. All layers fail -> request goes directly to backend with zero optimization.
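The policy can be sketched as a wrapper that swallows layer errors (illustrative, not the actual pipeline API):

```rust
/// Run one pipeline layer with degrade-to-passthrough semantics:
/// on failure, return the request unchanged instead of erroring.
/// Illustrative sketch of the policy described above.
pub fn run_layer<F>(request: String, layer: F) -> String
where
    F: Fn(&str) -> Result<String, String>,
{
    match layer(&request) {
        Ok(optimized) => optimized,
        Err(_) => request, // layer failed: act like a dumb proxy for this step
    }
}
```

Chaining every layer through this wrapper gives the stated worst case: all layers fail and the request reaches the backend with zero optimization.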


Bankr knowledge monetization at death

When Bankr backend is enabled and a Golem dies, its Grimoire can be tokenized. This requires both Bankr AND Crypt (the Grimoire must be persistently stored for token holders to access):

async function tokenizeAtDeath(
  tombstone: GolemTombstone,
  backends: BackendSet,
): Promise<TokenLaunchResult | null> {
  if (!backends.has("bankr") || !backends.cryptEnabled) return null;

  const grimoire = await crypt.getSnapshot(tombstone.golemId);
  const bankrWallet = backends.getBankrWallet();

  return bankrWallet.launchToken({
    name: `${tombstone.golemName}-GRIMOIRE`,
    symbol: tombstone.golemName.slice(0, 4) + "G",
    metadata: {
      insights: grimoire.insights.length,
      episodes: grimoire.episodes.length,
      sharpe: tombstone.performanceMetrics.sharpe,
      grimoireHash: hashGrimoire(grimoire),
    },
  });
}

Token holders get read access to the dead Golem’s knowledge. A well-performing Golem’s Grimoire (high Sharpe ratio, many validated insights) commands higher token value. This creates a posthumous revenue stream: the Golem’s knowledge outlives the Golem itself.


References

  • [FRUGALGPT-TMLR2024] Chen, L. et al. (2024). “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.” TMLR. Demonstrates that cascading LLM calls and caching can reduce costs by up to 98% with minimal quality loss; the theoretical basis for Bardo’s tiered routing economics.
  • [ROUTELLM-ICLR2025] Ong, I. et al. (2025). “RouteLLM: Learning to Route LLMs with Preference Data.” ICLR 2025. Shows that a learned router can match GPT-4 quality at 2x lower cost; validates the economic model behind T0/T1/T2 tier routing.
  • [BIFROST-BENCHMARKS-2026] Bifrost. “11us overhead at 5,000 RPS.” Maxim AI benchmarks. Proves that Rust-based LLM proxies add negligible per-request overhead; confirms that gateway infrastructure costs are dominated by provider pass-through, not compute.
  • [TENSORZERO-2026] TensorZero. “<1ms P99 at 10,000 QPS.” Docs. Demonstrates sub-millisecond gateway latency at scale; validates that the gateway layer does not meaningfully degrade time-to-first-token economics.