03 – Inference Economics [SPEC]

x402 Spread Revenue Model, Per-Tenant Cost Attribution, and Infrastructure Costs

Related: 00-overview.md (gateway architecture, payment flows, and design principles), 02-caching.md (three-layer cache stack that drives 40-85% cost savings), prd2/11-compute/03-billing.md (Compute VM billing and inference token allowances)


Reader orientation: This document specifies the economics of Bardo Inference (the LLM inference gateway for mortal autonomous DeFi agents called Golems). It belongs to the inference plane and covers the x402 spread revenue model, per-tenant cost attribution, and infrastructure cost projections. The key concept is that context engineering makes the all-in cost cheaper than direct API access even after the operator’s spread, creating a positive-sum model where users save money and operators earn margin. For term definitions, see prd2/shared/glossary.md.

Revenue model: x402 spread

See shared/x402-protocol.md for the x402 payment protocol specification.

Revenue = x402 (a micropayment protocol for HTTP-native USDC payments on Base) spread. Bardo charges a configurable markup (default ~20%) over BlockRun’s cost. Context engineering makes the all-in cost cheaper than direct API access despite the spread. The user saves money, the operator earns margin.

The only revenue stream

The operator configures a spread percentage applied to the optimized cost (after context engineering), not the naive cost.

BlockRun cost (optimized):  $X
Spread (default 20%):       $X × 0.20
User pays:                  $X × 1.20

Operator margin per request: $X × 0.20
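A minimal sketch of this calculation (the helper name `apply_spread` and the basis-point representation are illustrative, not from the codebase):

```rust
/// Sketch of the spread calculation (hypothetical helper, not a function
/// from the Bardo codebase). `spread_bps` is the operator-configured
/// spread in basis points: default 2000 (20%), clamped to 500-5000 (5-50%).
pub fn apply_spread(blockrun_cost_usd: f64, spread_bps: u32) -> (f64, f64) {
    let bps = spread_bps.clamp(500, 5000);
    let margin = blockrun_cost_usd * (bps as f64 / 10_000.0);
    (blockrun_cost_usd + margin, margin)
}
```

At the default 2000 bps, an optimized cost of $0.10 quotes $0.12 to the user and books $0.02 of operator margin.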

Worked example

Request: 50K input tokens + 2K output tokens (Sonnet-equivalent)

Naive cost (direct API):                    $0.180
  Input:  50K × $3.00/M  = $0.150
  Output:  2K × $15.00/M = $0.030
  (BlockRun, unoptimized, is slightly cheaper: $0.165)

Context engineering savings (~40% of the $0.165 BlockRun cost):
  - ~13K input tokens' prefix discounted 90%:   saves $0.036
  - Tool pruning removes 8K tokens:             saves $0.024
  - History compression saves 2K tokens:        saves $0.006
  Optimized cost:              $0.165 − $0.066 = $0.099

User pays (20% spread):                        $0.119
  BlockRun cost:  $0.099
  Spread:         $0.020

User saves vs. unoptimized BlockRun: 28% ($0.165 → $0.119)
User saves vs. direct API:           34% ($0.180 → $0.119)
Operator margin:                     $0.020/request
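The arithmetic above can be reproduced with two small helpers (hypothetical names; the per-token rates are the Sonnet-class prices used in the example):

```rust
// Sonnet-class rates from the example above (assumed, per token).
const INPUT_PER_TOKEN_USD: f64 = 3.00 / 1_000_000.0;
const OUTPUT_PER_TOKEN_USD: f64 = 15.00 / 1_000_000.0;

/// Unoptimized cost of a request at direct-API rates.
pub fn naive_cost_usd(input_tokens: u64, output_tokens: u64) -> f64 {
    input_tokens as f64 * INPUT_PER_TOKEN_USD
        + output_tokens as f64 * OUTPUT_PER_TOKEN_USD
}

/// What the user pays: optimized upstream cost plus the operator spread.
pub fn user_price_usd(naive_blockrun: f64, savings: f64, spread: f64) -> f64 {
    (naive_blockrun - savings) * (1.0 + spread)
}
```

For the worked request, `naive_cost_usd(50_000, 2_000)` gives $0.180, and `user_price_usd(0.165, 0.066, 0.20)` gives the quoted ~$0.119.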

Key properties

  • User always saves: Context engineering savings (~40%) > spread (20%)
  • Zero float: Both legs settle instantly via x402 on Base. No credit risk.
  • Zero working capital: No prepaid API credits to manage (BlockRun is x402-native)
  • Spread is operator-configurable: Range 5-50%, default 20%
  • One revenue stream: Simple pricing, simple explanation to users

Cost savings stack

Seven optimization layers compound to produce 40-85% cost reduction:

| Layer | Mechanism | Savings | How |
|---|---|---|---|
| T0 routing | FSM suppresses ~80% of ticks | 80% of ticks at $0 | No LLM call for stable markets |
| Semantic + hash cache | Zero-cost on cache hits | ~20% of remaining at $0 | Embedding similarity + exact match |
| Prompt cache alignment | 90% discount on cached prefix | 60-80% on input tokens | Stable-before-dynamic reordering |
| Tool pruning | Remove irrelevant tool definitions | 97.5% of tool tokens | Semantic search, ≤12 tools per request |
| Multi-model routing | Match task to cheapest capable model | 50-90% per request | RouteLLM classifier, T1/T2 split |
| DIEM staking | 100% discount on Venice-routed calls | Free inference | Venice DIEM balance covers cost |
| Batch API | 50% discount on async processing | 50% on dream cycles | Anthropic/OpenAI batch endpoints |

Semantic cache tuning

Cache TTLs are regime-aware – volatile markets invalidate cached responses faster:

| Market Regime | TTL (seconds) |
|---|---|
| Calm | 300 |
| Normal | 210 |
| Volatile | 90 |
| Crisis | 30 |

Similarity thresholds vary by domain to prevent stale data leaking into high-stakes decisions:

| Domain | Cosine Threshold |
|---|---|
| General market analysis | 0.92 |
| Strategy reasoning | 0.95 |
| Trade execution | 0.98 |

Higher thresholds in execution contexts mean fewer cache hits but zero risk of acting on semantically similar but materially different market conditions.
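A sketch of the two lookup tables as code (enum and function names are illustrative, not the codebase API):

```rust
/// Market regimes and query domains from the tables above
/// (illustrative names, not the actual Bardo types).
#[derive(Clone, Copy)]
pub enum Regime { Calm, Normal, Volatile, Crisis }

#[derive(Clone, Copy)]
pub enum Domain { MarketAnalysis, StrategyReasoning, TradeExecution }

/// Regime-aware semantic cache TTL in seconds.
pub fn cache_ttl_secs(regime: Regime) -> u64 {
    match regime {
        Regime::Calm => 300,
        Regime::Normal => 210,
        Regime::Volatile => 90,
        Regime::Crisis => 30,
    }
}

/// Minimum cosine similarity for a cache hit, by domain.
pub fn similarity_threshold(domain: Domain) -> f64 {
    match domain {
        Domain::MarketAnalysis => 0.92,
        Domain::StrategyReasoning => 0.95,
        Domain::TradeExecution => 0.98,
    }
}
```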


Infrastructure cost

The operator’s only fixed cost is hosting the proxy binary.

| Scale | Infrastructure | Monthly Cost |
|---|---|---|
| Small (< 1K req/day) | Single VPS | $50 |
| Medium (1K-10K req/day) | 2-3 instances + load balancer | $200 |
| Large (10K-100K req/day) | Kubernetes cluster | $500 |

Default stack: in-memory LRU cache (configurable size, default 100MB), embedded HNSW index for semantic vectors, SQLite for sessions/memory/analytics. For large-scale deployments, optional backends (Redis, Qdrant, ClickHouse) can be swapped in for horizontal scalability.


ERC-8004 reputation discounts (future)

Users with high ERC-8004 reputation scores get reduced spread:

| Reputation Tier | Spread |
|---|---|
| None (default) | 20% |
| Basic (50+) | 18% |
| Verified (200+) | 15% |
| Trusted (500+) | 12% |
| Sovereign (1000+) | 8% |

This creates a flywheel: use Bardo -> build reputation -> get cheaper access -> use Bardo more. Linking is optional via POST /v1/identity/link.
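The tier schedule reduces to a simple lookup (illustrative sketch; `spread_for_reputation` is a hypothetical name). The spread is returned as a fraction:

```rust
/// Spread fraction for a tenant's ERC-8004 reputation score, per the
/// tier table above. `None` means no identity has been linked.
pub fn spread_for_reputation(score: Option<u32>) -> f64 {
    match score {
        None => 0.20,                 // no linked ERC-8004 identity
        Some(s) if s >= 1000 => 0.08, // Sovereign
        Some(s) if s >= 500 => 0.12,  // Trusted
        Some(s) if s >= 200 => 0.15,  // Verified
        Some(s) if s >= 50 => 0.18,   // Basic
        Some(_) => 0.20,              // below Basic: default spread
    }
}
```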


Per-tenant cost tracking

Every user (API key or wallet address) gets per-request and aggregate cost tracking. Every response includes cost transparency headers:

X-Bardo-Cost: 0.119            # What the user paid
X-Bardo-Blockrun-Cost: 0.099   # What Bardo paid BlockRun
X-Bardo-Spread: 0.020          # Operator margin
X-Bardo-Naive-Cost: 0.165      # What it would have cost without optimization
X-Bardo-Savings: 0.046         # User savings vs. naive (naive - user paid)
X-Bardo-Balance: 12.38         # Remaining prepaid balance (prepaid mode only)
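Assembling these headers is a straightforward formatting step; a hedged sketch (the builder function is hypothetical, but the header names match the examples above):

```rust
/// Build the cost-transparency headers for one response.
/// Illustrative helper, not the actual middleware; amounts in USD.
pub fn cost_headers(
    blockrun_cost: f64,
    spread: f64,
    naive_cost: f64,
) -> Vec<(String, String)> {
    let user_paid = blockrun_cost + spread;
    vec![
        ("X-Bardo-Cost".into(), format!("{:.3}", user_paid)),
        ("X-Bardo-Blockrun-Cost".into(), format!("{:.3}", blockrun_cost)),
        ("X-Bardo-Spread".into(), format!("{:.3}", spread)),
        ("X-Bardo-Naive-Cost".into(), format!("{:.3}", naive_cost)),
        // Savings are always naive minus what the user actually paid.
        ("X-Bardo-Savings".into(), format!("{:.3}", naive_cost - user_paid)),
    ]
}
```

Feeding in the worked example ($0.099 upstream, $0.020 spread, $0.165 naive) reproduces the header values shown above.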

Per-tenant cost summary

// crates/bardo-telemetry/src/cost.rs

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TenantCostSummary {
    pub tenant_id: String,
    pub period: CostPeriod,
    pub total_requests: u64,
    /// What the user paid (including spread).
    pub total_paid: f64,
    /// What Bardo paid BlockRun.
    pub total_blockrun_cost: f64,
    /// Operator margin.
    pub total_spread: f64,
    /// Estimated cost without context engineering.
    pub estimated_naive_cost: f64,
    /// Naive minus user-paid.
    pub total_savings: f64,
    pub savings_breakdown: SavingsBreakdown,
    pub by_model: Vec<ModelCostBreakdown>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum CostPeriod {
    Day,
    Week,
    Month,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SavingsBreakdown {
    pub semantic_cache: f64,
    pub prefix_cache: f64,
    pub tool_pruning: f64,
    pub history_compression: f64,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelCostBreakdown {
    pub model: String,
    pub requests: u64,
    pub input_tokens: u64,
    pub output_tokens: u64,
    pub user_paid: f64,
    pub blockrun_cost: f64,
    pub savings: f64,
}

Queryable via GET /v1/analytics/spend?period=day|week|month.


Fallback cost tracking

When the proxy falls back from BlockRun (x402) to OpenRouter (API key), the x402 payment path is bypassed for the upstream leg. Cost tracking still applies: estimates use published per-token pricing from the provider catalog, and daily reconciliation against the API-key provider’s usage dashboard catches drift (typically <5%).
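The reconciliation check can be sketched as (function names are illustrative):

```rust
/// Relative drift between summed per-token estimates and the
/// provider dashboard total for the same day. Illustrative sketch.
pub fn reconciliation_drift(estimated_usd: f64, dashboard_usd: f64) -> f64 {
    if dashboard_usd == 0.0 {
        return 0.0; // nothing billed upstream; no drift to measure
    }
    (estimated_usd - dashboard_usd).abs() / dashboard_usd
}

/// True when drift stays inside the expected <5% envelope.
pub fn within_tolerance(drift: f64) -> bool {
    drift < 0.05
}
```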

See 01a-routing.md for the FallbackCostAdapter.


Moat: agents that pay

The financial security architecture is a competitive moat (see prd2/appendices/moat-pay.md). Six independent security layers, three of them cryptographic, mean that even a fully compromised LLM cannot drain funds. The inference gateway is the economic control plane – it tracks costs, enforces budgets, and ensures the sustainability ratio stays above 1.0.


Bankr self-funding economics

Bankr-routed Golems (mortal autonomous DeFi agents managed by the Bardo runtime) fund inference from revenue. The wallet that earns from DeFi strategies pays for inference directly. No separate funding step. This is the metabolic loop: the organism sustains itself through the activity that requires sustenance.

// crates/bardo-providers/src/bankr.rs

pub fn compute_sustainability_ratio(
    daily_revenue_usd: f64,
    daily_inference_cost_usd: f64,
    daily_compute_cost_usd: f64,
    daily_gas_cost_usd: f64,
) -> f64 {
    let total_daily_cost = daily_inference_cost_usd
        + daily_compute_cost_usd
        + daily_gas_cost_usd;
    if total_daily_cost == 0.0 {
        return f64::INFINITY; // no recorded costs yet: trivially self-sustaining
    }
    daily_revenue_usd / total_daily_cost
}
// ratio > 1.0 -> self-sustaining
// ratio > 2.0 -> thriving (can spawn replicants)
// ratio < 1.0 -> declining
// ratio < 0.5 -> dying

The sustainability ratio feeds directly into routing decisions. When it drops, the gateway routes more aggressively to cheap tiers:

| Sustainability ratio | Budget adjustment | Behavior |
|---|---|---|
| > 2.0 | Expand 1.5x | Revenue exceeds cost, grow capabilities |
| 1.0 - 2.0 | Baseline | Self-sustaining |
| 0.5 - 1.0 | Contract 0.7x | Under pressure, reduce non-critical inference |
| < 0.5 | Contract 0.3x | Near-death, only risk + owner + death funded |

A Golem managing a vault earning $50/day in fees with $15/day in Bankr inference costs has a 3.3x sustainability ratio. Above 1.0x, the Golem is economically immortal (barring epistemic or stochastic death). Below 1.0x, the economic clock is ticking. The gateway’s cost profile naturally curves downward as the Golem approaches death, extending effective lifespan by reducing the largest variable cost.
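The table above maps directly to a multiplier function (a sketch; the name `budget_multiplier` is illustrative):

```rust
/// Budget multiplier for a given sustainability ratio,
/// per the adjustment table above (illustrative sketch).
pub fn budget_multiplier(ratio: f64) -> f64 {
    if ratio > 2.0 {
        1.5 // thriving: expand capabilities
    } else if ratio >= 1.0 {
        1.0 // self-sustaining: baseline budget
    } else if ratio >= 0.5 {
        0.7 // under pressure: cut non-critical inference
    } else {
        0.3 // near-death: only risk + owner + death funded
    }
}
```

The vault example ($50/day revenue, $15/day inference) yields a ratio of ~3.3 and lands in the expansion band.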


Venice DIEM staking economics

Venice DIEM staking provides zero-cost inference for Venice-routed calls. The owner stakes VVV tokens (Venice’s native token on Base), receiving a daily DIEM allocation proportional to stake weight. Each DIEM equals $1/day of Venice API credit, perpetually. DIEM-funded calls show X-Bardo-Cost: 0.00 in response headers.

DIEM is the first-choice cost optimization for privacy-preferring subsystems (dreams, daimon, death reflection). When DIEM balance is available, the provider resolver routes privacy-preferring intents to Venice automatically.

DIEM allocation strategy

Daily DIEM budget: $X (from VVV stake)
+-- Waking inference (private):  60%  -- Portfolio analysis, deal negotiation
+-- Dream cycles (private):      15%  -- Counterfactual reasoning, threat simulation
+-- Sleepwalker artifacts:       15%  -- Observatory research
+-- Reserve (rollover):          10%  -- Unused DIEM for volatile days
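The split can be sketched as (illustrative helper; percentages from the allocation above):

```rust
/// Split the daily DIEM budget into (waking, dreams, sleepwalker, reserve)
/// per the 60/15/15/10 allocation above. Illustrative sketch.
pub fn allocate_diem(daily_budget_usd: f64) -> (f64, f64, f64, f64) {
    (
        daily_budget_usd * 0.60, // waking inference (private)
        daily_budget_usd * 0.15, // dream cycles (private)
        daily_budget_usd * 0.15, // sleepwalker artifacts
        daily_budget_usd * 0.10, // reserve, rolls over to volatile days
    )
}
```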

Mortality integration

In the standard mortality model, inference costs drain the LLM credit partition (60% of total budget). Venice staking decouples inference cost from mortality pressure: Venice (DIEM) calls cost zero marginal cost from staked VVV, draining nothing from the LLM partition. A Golem routing 50% of its inference to Venice extends its projected lifespan by ~30% (saving $0.06-0.10/day on a $0.20/day budget). Over a 30-day lifespan, that is 9 additional days of life.
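The "9 additional days" figure follows from a first-order model: savings accrued over the original lifespan, spent at the original burn rate. A sketch (illustrative, and slightly conservative, since the lower burn rate itself compounds):

```rust
/// First-order lifespan extension in days: the budget freed by a daily
/// saving over the original lifespan, spent at the original burn rate.
/// Illustrative model matching the text above.
pub fn extra_days(lifespan_days: f64, daily_burn_usd: f64, daily_saving_usd: f64) -> f64 {
    lifespan_days * daily_saving_usd / daily_burn_usd
}
```

With a 30-day lifespan, a $0.20/day burn, and $0.06/day saved via Venice routing, this gives 9 extra days (~30%).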


Future revenue

Phase 2: premium services

| Service | Price | Description |
|---|---|---|
| Custom prompt libraries | $10/mo | Operator-curated prompt templates |
| Advanced semantic caching | $5/mo | Larger cache, cross-session sharing |
| Extended analytics | $5/mo | 90-day retention, detailed breakdowns |

Phase 3: additional providers

Adding OpenRouter or direct API keys as fallback providers. These break the “zero working capital” property but increase model availability and redundancy. Revenue model extends naturally: spread applies to all providers. See 02-caching.md for provider cache discount tables used by the cost estimation engine.


Daily cost projections (Golem agents)

For Golem agents using the three-tier system at 100 ticks/day:

| Scenario | T0 | T1 | T2 | Daily LLM Cost |
|---|---|---|---|---|
| Calm market (90% T0, 8% T1, 2% T2) | $0.00 | $0.024 | $0.02 | ~$0.05 |
| Normal market (80% T0, 15% T1, 5% T2) | $0.00 | $0.045 | $0.15 | ~$0.20 |
| Volatile market (60% T0, 25% T1, 15% T2) | $0.00 | $0.075 | $0.75 | ~$0.83 |

Cost savings example (all users)

For a user making 1,000 requests/day:

| Optimization | Applies To | Savings | Daily Savings |
|---|---|---|---|
| Economy model routing | 60% of queries | 85%/query | $2.55 |
| Semantic cache hits | 20% of queries | 100%/query | $1.00 |
| Prompt prefix cache | 80% of queries | 10%/query | $0.40 |
| Tool pruning | 30% of queries | 15%/query | $0.22 |
| Total daily savings | | | $4.17 |

Without the gateway the user pays $5.00/day in direct API costs. With it, the optimized BlockRun cost is $0.83/day and the user pays $0.83 × 1.20 ≈ $1.00/day, an 80% reduction. The user saves ~$4.00/day; the gateway earns ~$0.17/day in spread.
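Treating the optimizations as additive (a simplification, since the layers overlap in practice), the stack can be sketched as:

```rust
/// Total daily savings from a list of optimizations, each given as
/// (fraction of queries affected, fractional saving per affected query).
/// Illustrative sketch of the additive model in the table above.
pub fn daily_savings(baseline_usd: f64, optimizations: &[(f64, f64)]) -> f64 {
    optimizations
        .iter()
        .map(|(share, saving)| baseline_usd * share * saving)
        .sum()
}
```

Applied to the $5.00/day baseline with the four rows above, this returns ~$4.17/day.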


Revenue projections

| Phase | Users | Avg Req/Day/User | Daily Volume | Daily Revenue (20% spread) |
|---|---|---|---|---|
| Launch | 50 | 20 | 1,000 req | ~$20 |
| Growth | 500 | 30 | 15,000 req | ~$300 |
| Product-Market Fit | 5,000 | 40 | 200,000 req | ~$4,000 |
| Scale | 50,000 | 50 | 2,500,000 req | ~$50,000 |

Assumptions: average request cost (post-optimization) ~$0.10, 20% spread = $0.02/request operator margin, context engineering provides ~40% savings vs. naive.

Break-even

At $50/month infrastructure: break-even at ~2,500 requests/month (~84 requests/day). At 20 req/day average: break-even at 5 users.


x402-native provider advantage

BlockRun is the primary upstream provider because it shares Bardo’s x402 payment model. USDC flows directly User -> Bardo -> BlockRun with no fiat bridge, no API keys, and no prepaid credit float.

| Path | User Pays | Bardo Margin | Float Required | Notes |
|---|---|---|---|---|
| Bardo -> BlockRun (x402) | USDC | Spread (20%) | $0 | Primary, zero capital |
| Bardo -> OpenRouter (API key) | USDC | Spread (20%) | ~$100 | Fallback, needs API key |

BlockRun handles provider relationships (Anthropic, OpenAI, Google, OSS). The operator never manages individual provider API keys for the primary path. OpenRouter serves as a fallback for models not available through BlockRun.

Working capital at launch is $0 (BlockRun-only). With OpenRouter fallback enabled: ~$100 prepaid credit float.


Dependencies and costs

| Dependency | Monthly Cost | Float | Notes |
|---|---|---|---|
| BlockRun | Pay-per-use | $0 | x402-native, primary provider |
| OpenRouter | Pay-per-use | ~$100 | Fallback, optional |
| Hosting (VPS/Fly.io) | $50-500/mo | $0 | Single VPS to multi-instance |
| nomic-embed-text-v1.5 | $0 | $0 | Local ONNX, runs on proxy CPU |
| RouteLLM | $0 | $0 | Local, runs on proxy CPU |
| DeBERTa-v3-base | $0 | $0 | Local, runs on proxy CPU |
| Presidio | $0 | $0 | Open source, runs on proxy |
| Langfuse | $0 | $0 | Self-hosted, open source |
| Redis (optional) | $10-50/mo | $0 | For scale: shared cache/sessions |
| Qdrant (optional) | $0-30/mo | $0 | For scale: RAG vectors >1M |
| ClickHouse (optional) | $0-20/mo | $0 | For scale: high-volume analytics |

Total operational cost tracks the infrastructure table above: $50-500/mo depending on scale, plus $0-100 float if OpenRouter fallback is enabled.

Cost per request (Bardo overhead)

| Component | Cost | Latency |
|---|---|---|
| Auth check | $0 | < 1ms |
| Hash cache lookup | $0 | < 1ms |
| Semantic cache lookup | $0 | 5-20ms |
| Complexity classification | $0 | 10-30ms |
| Prompt optimization | $0 | < 50ms |
| PII detection | $0 | < 5ms |
| x402 payment processing | $0 | < 10ms |
| Total Bardo overhead | $0 | < 100ms |

All compute runs on the proxy. No per-request infrastructure costs beyond BlockRun.

Pipeline profiles

Not every request needs the full pipeline. Four profiles trade latency for optimization depth:

| Profile | Layers | Latency (Rust) | Use Case |
|---|---|---|---|
| minimal | L3 only (hash cache) | <1ms | T0 heartbeat ticks, simple lookups |
| fast | L1+L3+L6 | 5-15ms | Golem internal calls, low-stakes queries |
| standard | L1-L6 | 20-40ms | User-facing requests, standard optimization |
| full | L1-L8 | 30-65ms | High-security, first-time requests, unknown sources |

Per-layer cost impact analysis

The naive architecture – give the agent all tool definitions and let it call Opus every tick – costs roughly $85/day at 100 ticks/day. The context engineering pipeline brings that down to the $0.05-0.83/day projected above: a reduction of roughly 100-1,700x depending on market conditions.

Layer-by-layer savings breakdown

| Layer | Savings | Applies To | Condition |
|---|---|---|---|
| Prompt cache alignment | 90% on cached prefix | All Anthropic-backed requests | Stable prefix exists |
| Semantic response cache | 100% on cache hit | All requests | Cosine > threshold (domain-specific) |
| Deterministic hash cache | 100% on exact match | All requests | Identical request |
| Tool pruning | ~97.5% on tool tokens | Requests with tools | Meta-tool pattern enabled |
| History compression | Variable (extends context) | Long conversations | Context > threshold |
| Position optimization | Quality improvement (not cost) | All requests | Always |
| PII masking | Privacy improvement | All requests | PII detected |
| Injection detection | Safety improvement | All requests | Always |

Combined savings by optimization source

| Optimization Source | Savings | Where Applied |
|---|---|---|
| T0 FSM routing (no LLM call) | ~80% of ticks eliminated | Bardo Inference tier routing |
| Semantic + hash cache | ~20% of remaining served from cache | Bardo Inference Layer 2-3 |
| Prompt cache alignment | 90% discount on cached prefix tokens | Bardo Inference Layer 1 + Anthropic |
| Tool pruning (meta-tool) | 97.5% reduction in tool tokens | Bardo Inference Layer 4 |
| Multi-model routing (cheap model per task) | 50-90% per request | Bardo Inference backend router |
| DIEM staking (Venice) | 100% on Venice-routed requests | Venice backend |
| Batch API (dreams) | 50% on batch requests | Direct Key backend |
| Self-funding (Bankr) | Revenue offsets cost | Bankr backend |

Two compounding mechanisms

T0 FSM routing kills ~80% of ticks before any LLM call. When all probes return none severity, the tick exits at OBSERVE with zero inference cost. At 100 ticks/day, that eliminates ~80 LLM calls entirely.

3-layer caching cuts the cost of the remaining 20% by 60-80%. Anthropic prompt cache (90% discount on cached prefix tokens, 5-minute TTL), semantic cache (cosine > 0.92 threshold), and deterministic hash cache together mean that most T1 and T2 calls pay only for the variable portion of context.

Per-layer latency budget (performance cost of savings)

| Layer | Expected Latency | Parallelizable? | Can Skip? |
|---|---|---|---|
| L1: Prompt cache alignment | <1ms | N/A (pure computation) | Skip if non-Anthropic |
| L2: Semantic cache check | 3-8ms | Yes (with L3, L7, L8) | Skip for streaming multi-turn |
| L3: Hash cache check | <0.1ms | Yes | Never – always cheap |
| L4: Tool pruning | <1ms | No (after cache) | Skip if <5 tools |
| L5: History compression | 0ms (check) / 200-2000ms (compress) | Only when triggered | Skip if under 80% context limit |
| L6: Position optimization | <1ms | N/A (pure computation) | Skip if <10 context items |
| L7: PII masking | <1ms (regex) | Yes (with L2, L3, L8) | Skip for Venice, internal requests |
| L8: Injection detection | 3-8ms (DeBERTa ONNX) | Yes (with L2, L3, L7) | Skip for trusted source |

L2, L3, L7, and L8 execute in parallel via tokio::join! (~8ms wall clock). Total pipeline: <50ms p95. The LLM’s own time-to-first-token is 200ms-2000ms, so pipeline overhead is 3-30% of total TTFT – imperceptible.

Cache hits produce negative latency

Direct to provider:     ~800ms average TTFT (Claude Sonnet)
Bardo Inference miss:   ~840ms average TTFT (40ms overhead + 800ms LLM)
Bardo Inference hit:    ~15ms TTFT (cached response, no LLM call)

At 20% hit rate:
  Weighted average TTFT = 0.20 x 15ms + 0.80 x 840ms = 675ms

675ms < 800ms -> Bardo Inference is FASTER on average despite per-request overhead

Semantic caching transforms the gateway from a latency cost into a latency benefit at hit rates above ~5%, which heartbeat ticks easily achieve in calm markets.
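The TTFT arithmetic above, as two illustrative helpers:

```rust
/// Weighted average time-to-first-token at a given cache hit rate,
/// reproducing the calculation above. Illustrative sketch.
pub fn weighted_ttft_ms(hit_rate: f64, hit_ms: f64, miss_ms: f64) -> f64 {
    hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms
}

/// Minimum hit rate at which the gateway beats direct provider latency:
/// solve hit_rate*hit + (1 - hit_rate)*miss < direct for hit_rate.
pub fn break_even_hit_rate(hit_ms: f64, miss_ms: f64, direct_ms: f64) -> f64 {
    (miss_ms - direct_ms) / (miss_ms - hit_ms)
}
```

With 15ms hits, 840ms misses, and an 800ms direct baseline, `break_even_hit_rate` gives ~4.8%, matching the ~5% figure above; at a 20% hit rate the weighted TTFT is 675ms.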


Failure mode: passthrough, not error

Every pipeline layer failing degrades to passthrough. The worst case is “Bardo Inference acts like a dumb proxy.” Embedding model crash -> hash cache still works. DeBERTa crash -> continue without injection detection. All layers fail -> request goes directly to backend with zero optimization.
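The policy can be sketched as a wrapper that swallows layer errors (illustrative, not the actual pipeline API):

```rust
/// Run one pipeline layer with degrade-to-passthrough semantics:
/// on failure, return the request unchanged instead of erroring.
/// Illustrative sketch of the policy described above.
pub fn run_layer<F>(request: String, layer: F) -> String
where
    F: Fn(&str) -> Result<String, String>,
{
    match layer(&request) {
        Ok(optimized) => optimized,
        Err(_) => request, // layer failed: act like a dumb proxy for this step
    }
}
```

Chaining every layer through this wrapper gives the stated worst case: all layers fail and the request reaches the backend with zero optimization.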


Bankr knowledge monetization at death

When Bankr backend is enabled and a Golem dies, its Grimoire can be tokenized. This requires both Bankr AND Crypt (the Grimoire must be persistently stored for token holders to access):

async function tokenizeAtDeath(
  tombstone: GolemTombstone,
  backends: BackendSet,
): Promise<TokenLaunchResult | null> {
  if (!backends.has("bankr") || !backends.cryptEnabled) return null;

  const grimoire = await crypt.getSnapshot(tombstone.golemId);
  const bankrWallet = backends.getBankrWallet();

  return bankrWallet.launchToken({
    name: `${tombstone.golemName}-GRIMOIRE`,
    symbol: tombstone.golemName.slice(0, 4) + "G",
    metadata: {
      insights: grimoire.insights.length,
      episodes: grimoire.episodes.length,
      sharpe: tombstone.performanceMetrics.sharpe,
      grimoireHash: hashGrimoire(grimoire),
    },
  });
}

Token holders get read access to the dead Golem’s knowledge. A well-performing Golem’s Grimoire (high Sharpe ratio, many validated insights) commands higher token value. This creates a posthumous revenue stream: the Golem’s knowledge outlives the Golem itself.


References

  • [FRUGALGPT-TMLR2024] Chen, L. et al. (2024). “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.” TMLR. Demonstrates that cascading LLM calls and caching can reduce costs by up to 98% with minimal quality loss; the theoretical basis for Bardo’s tiered routing economics.
  • [ROUTELLM-ICLR2025] Ong, I. et al. (2025). “RouteLLM: Learning to Route LLMs with Preference Data.” ICLR 2025. Shows that a learned router can match GPT-4 quality at 2x lower cost; validates the economic model behind T0/T1/T2 tier routing.
  • [BIFROST-BENCHMARKS-2026] Bifrost. “11us overhead at 5,000 RPS.” Maxim AI benchmarks. Proves that Rust-based LLM proxies add negligible per-request overhead; confirms that gateway infrastructure costs are dominated by provider pass-through, not compute.
  • [TENSORZERO-2026] TensorZero. “<1ms P99 at 10,000 QPS.” Docs. Demonstrates sub-millisecond gateway latency at scale; validates that the gateway layer does not meaningfully degrade time-to-first-token economics.