
01 – Model routing [SPEC]

Self-describing providers, declarative intents, and mortality-aware resolution


Reader orientation: This document specifies the model routing system for Bardo Inference (the LLM inference gateway for mortal autonomous DeFi agents). It belongs to the inference plane and describes how the gateway resolves declarative intents to concrete model + provider pairs using self-describing providers. The key concept is that each Golem subsystem declares what it needs (quality, latency, features), and the resolver walks an ordered provider list to find the best match, with cost sensitivity that increases as the agent approaches death. For term definitions, see prd2/shared/glossary.md.

Three-tier model routing

Model routing is a survival decision. Choosing Opus over Haiku costs $0.25 versus $0.003 – an 83x difference. The LLM partition receives 60% of the credit budget (see prd2/02-mortality/01-architecture.md), so inference spending directly determines lifespan. An unnecessary Opus call at $0.25 burns the same budget as 83 Haiku calls or 1.25 days of life at $0.20/day.

| Tier | Handler | Model | Cost/call | Trigger |
|------|---------|-------|-----------|---------|
| T0 | FSM + rules | None | $0.00 | No significant state change |
| T1 | Haiku via x402 (a micropayment protocol for HTTP-native USDC payments on Base) | claude-haiku-4-5 | ~$0.001-0.003 | Moderate anomaly, routine analysis |
| T2 | Sonnet or Opus | claude-sonnet-4 / claude-opus-4-6 | ~$0.01-0.25 | Novel situation, high-stakes decision, conflicting signals |

Expected distribution: ~80% T0, ~15% T1, ~4% Sonnet, ~1% Opus.

Tiers gate WHEN the LLM fires. Intents determine WHICH model and provider handle the call. Each subsystem declares a static intent – what features and quality it needs – and the resolver matches intents against the ordered provider list. See 12-providers.md for all intent declarations and the provider resolution algorithm.
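The tier gate itself can be pictured as a small pure function over the heartbeat's tick assessment. The thresholds and field names below are illustrative assumptions, not the normative FSM rules:

```rust
// Hypothetical sketch of tier gating. The 0.3 threshold and the
// TickAssessment fields are assumptions for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CognitiveTier { T0, T1, T2 }

pub struct TickAssessment {
    /// 0.0 = nothing changed, 1.0 = maximal anomaly.
    pub anomaly_score: f64,
    /// Independent probes disagree with each other.
    pub conflicting_signals: bool,
    /// No FSM rule matches the current situation.
    pub novel: bool,
}

pub fn gate_tier(a: &TickAssessment) -> CognitiveTier {
    if a.novel || a.conflicting_signals {
        CognitiveTier::T2 // high-stakes or unfamiliar: deep reasoning
    } else if a.anomaly_score > 0.3 {
        CognitiveTier::T1 // moderate anomaly: cheap Haiku pass
    } else {
        CognitiveTier::T0 // no significant change: FSM only, no LLM
    }
}
```

The gate only decides escalation; the resolved intent then decides which model and provider serve the escalated call.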


Three primitives replace CapabilityMap

The old CapabilityMap pattern – a centralized function full of if-statements that manually enumerated every model and feature for every provider – had three problems: centralized fragility, double-sided if-chains, and overengineering for a problem that’s actually O(n) on a short list.

Three primitives replace it:

+-------------------------------------------------------------+
|  PROVIDER                                                    |
|  A self-describing module that knows its own models,         |
|  features, and constraints. Answers "can you handle this?"   |
|  with a yes/no + cost estimate.                              |
+-------------------------------------------------------------+
|  INTENT                                                      |
|  A lightweight object that a subsystem attaches to a         |
|  request: what model family, what features, what             |
|  constraints. Pure data, no logic.                           |
+-------------------------------------------------------------+
|  RESOLVER                                                    |
|  Walks the owner's provider list in order. For each          |
|  provider, asks "can you satisfy this intent?" First         |
|  yes wins. No map, no graph, no central registry.            |
+-------------------------------------------------------------+

Provider trait

Each provider knows its own capabilities. The router doesn’t maintain a compatibility matrix.

#![allow(unused)]
fn main() {
// crates/bardo-providers/src/trait.rs

/// A provider that knows its own capabilities.
#[async_trait]
pub trait Provider: Send + Sync {
    /// Unique identifier (e.g., "blockrun", "openrouter", "venice").
    fn id(&self) -> &str;
    /// Human-readable name.
    fn name(&self) -> &str;
    /// Resolve an intent to a concrete model + provider pair.
    /// Returns None if this provider cannot handle the request.
    fn resolve(&self, intent: &Intent) -> Option<Resolution>;
    /// Format the request for this provider's API.
    fn format_request(
        &self,
        request: &ChatCompletionRequest,
        model: &str,
    ) -> Result<ProviderRequest>;
    /// Parse the provider's SSE stream into normalized chunks.
    /// Boxed streams keep this trait object-safe for `Box<dyn Provider>`.
    fn parse_response(
        &self,
        stream: BoxStream<'static, Result<Bytes>>,
    ) -> BoxStream<'static, Result<CompletionChunk>>;
    /// Provider-specific traits (privacy, payment mode, etc.).
    fn traits(&self) -> &ProviderTraits;
}

#[derive(Debug, Clone)]
pub struct ProviderTraits {
    /// Inference logs are not stored.
    pub private: bool,
    /// Revenue from engagement funds inference.
    pub self_funding: bool,
    /// Context engineering applies to this provider's requests.
    pub context_engineering: bool,
    /// How this provider is paid.
    pub payment: PaymentMode,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PaymentMode {
    /// USDC on Base via x402 protocol.
    X402,
    /// Prepaid API credits.
    Prepaid,
    /// Owner's own API key (passthrough).
    ApiKey,
    /// Venice DIEM staking (zero-cost inference).
    Diem,
    /// Agent wallet pays directly from earned revenue.
    Wallet,
}
}

The key insight: resolve() is a pure function inside each provider module. The provider decides whether it can handle the request. When Venice adds a new model, only the Venice module changes. When Bankr adds cross-model verification, only the Bankr module changes. No central registry to update.
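As an illustration of that locality, a Venice-like module's resolve() might look like the sketch below. The local Intent/Resolution copies are trimmed to the relevant fields, and the feature set and default model are assumptions:

```rust
// Illustrative sketch of one provider's resolve(). Trimmed local copies
// of Intent/Resolution; the FEATURES list and default model are assumed.
#[derive(Debug, Clone, Default)]
pub struct Intent {
    pub model: Option<String>,
    pub require: Vec<String>, // hard requirements: all or nothing
    pub prefer: Vec<String>,  // soft preferences: missing ones degrade
}

#[derive(Debug, Clone)]
pub struct Resolution {
    pub model: String,
    pub provider: String,
    pub degraded: Vec<String>,
}

const FEATURES: &[&str] = &["privacy", "visible_thinking"]; // assumed

pub fn venice_resolve(intent: &Intent) -> Option<Resolution> {
    // Hard requirements: every `require` entry must be present, else None.
    if !intent.require.iter().all(|f| FEATURES.contains(&f.as_str())) {
        return None;
    }
    // Soft preferences we lack are reported, not rejected.
    let degraded = intent
        .prefer
        .iter()
        .filter(|f| !FEATURES.contains(&f.as_str()))
        .cloned()
        .collect();
    Some(Resolution {
        model: intent.model.clone().unwrap_or_else(|| "deepseek-r1".into()),
        provider: "venice".into(),
        degraded,
    })
}
```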

Intent struct

A subsystem doesn’t query a map. It builds a lightweight intent that describes what it needs. No logic, no conditionals – data.

#![allow(unused)]
fn main() {
// crates/bardo-router/src/intent.rs

#[derive(Debug, Clone)]
pub struct Intent {
    /// Specific model requested, or None for "best available."
    pub model: Option<String>,
    /// Hard requirements. Provider must satisfy ALL or return None.
    pub require: Vec<String>,
    /// Soft preferences. Missing ones appear in Resolution.degraded.
    pub prefer: Vec<String>,
    /// Quality level. Affects model selection when model is None.
    pub quality: Quality,
    /// Maximum acceptable latency in ms.
    pub max_latency_ms: u64,
    /// Cost sensitivity (0 = don't care, 1 = extremely sensitive).
    pub cost_sensitivity: f64,
    /// DIEM balance available for Venice-routed calls.
    pub diem_available: bool,
    /// The subsystem making the request.
    pub subsystem: String,
}

#[derive(Debug, Clone)]
pub struct Resolution {
    pub model: String,
    pub provider: String,
    pub estimated_cost_usd: f64,
    pub features: Vec<String>,
    /// What the intent wanted but this resolution can't provide.
    pub degraded: Vec<String>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Quality { Minimum, Low, Medium, High, Maximum }
}

The resolver: 20 lines

The resolver walks the provider list in order. First match wins. That’s it.

#![allow(unused)]
fn main() {
// crates/bardo-router/src/resolve.rs

pub fn resolve(
    providers: &[Box<dyn Provider>],
    intent: &Intent,
) -> Option<Resolution> {
    // Pass 1: strict matching
    for provider in providers {
        if let Some(resolution) = provider.resolve(intent) {
            return Some(resolution);
        }
    }

    // Pass 2: relax hard requirements to preferences
    let relaxed = Intent {
        require: vec![],
        prefer: [intent.prefer.clone(), intent.require.clone()].concat(),
        max_latency_ms: intent.max_latency_ms * 2,
        ..intent.clone()
    };
    for provider in providers {
        if let Some(mut resolution) = provider.resolve(&relaxed) {
            resolution.degraded.extend(intent.require.iter().cloned());
            return Some(resolution);
        }
    }

    None
}
}

Why this is better than scoring: predictable (the owner knows provider #1 is always tried first), debuggable (“move Venice above BlockRun in your config”), correct by default (the owner placed providers in their preferred order for a reason).


Subsystem intent declarations

Each subsystem has a static intent. These are constant objects – no functions, no conditionals. Adding a new subsystem means adding one more entry.

| Subsystem | Quality | Key preferences | Cost sensitivity | Typical resolution |
|---|---|---|---|---|
| heartbeat_t0 | Minimum | – | 1.0 | No LLM call (FSM) |
| heartbeat_t1 | Low | low_effort | 0.8 | BlockRun -> Haiku 4.5 |
| heartbeat_t2 | High | interleaved_thinking, citations | 0.3 | BlockRun -> Claude Opus |
| risk | Maximum | interleaved_thinking, citations | 0.0 (never reduced) | BlockRun -> Claude Opus |
| dream | High | visible_thinking, privacy | 0.5 | Venice -> DeepSeek R1 |
| daimon | Low | privacy | 0.9 | Venice -> Llama 3.3 |
| daimon_complex | High | visible_thinking, privacy | 0.5 | Venice -> DeepSeek R1 |
| curator | Medium | structured_outputs, citations | 0.5 | BlockRun -> Claude Sonnet |
| playbook | Medium | predicted_outputs | 0.6 | Direct OpenAI -> GPT-5.x |
| operator | Maximum | interleaved_thinking, citations | 0.0 (never reduced) | BlockRun -> Claude Opus |
| death | Maximum | visible_thinking (required) | 0.0 | Venice -> DeepSeek R1 |
| session_compact | Medium | compaction | 0.5 | BlockRun -> Anthropic (compaction API) |

Death is the only subsystem with a hard requirement (require: ["visible_thinking"]). All others use soft preferences. If no provider matches strictly, the resolver drops requirements to preferences on its second pass – Resolution.degraded lists what was lost.
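As a concrete instance, the death intent could be declared as below. This is a sketch using local copies of the Intent/Quality types defined earlier; the latency budget and helper name are assumptions, not the normative declarations in 12-providers.md:

```rust
// Local copies of the Intent/Quality types defined earlier in this spec.
#[derive(Debug, Clone)]
pub struct Intent {
    pub model: Option<String>,
    pub require: Vec<String>,
    pub prefer: Vec<String>,
    pub quality: Quality,
    pub max_latency_ms: u64,
    pub cost_sensitivity: f64,
    pub diem_available: bool,
    pub subsystem: String,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Quality { Minimum, Low, Medium, High, Maximum }

/// The death subsystem's static intent (helper name is hypothetical).
pub fn death_intent() -> Intent {
    Intent {
        model: None,                              // "best available"
        require: vec!["visible_thinking".into()], // the only hard requirement
        prefer: vec!["privacy".into()],
        quality: Quality::Maximum,
        max_latency_ms: 60_000,                   // assumed budget
        cost_sensitivity: 0.0,                    // never reduced
        diem_available: false,
        subsystem: "death".into(),
    }
}
```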

Mortality pressure modification

Dying Golems (mortal autonomous DeFi agents managed by the Bardo runtime) become more cost-sensitive. This is a simple transformation on the intent, not a conditional chain:

#![allow(unused)]
fn main() {
// crates/bardo-router/src/mortality.rs

/// Modify intent based on Vitality score (the Golem's remaining lifespan as a 0.0-1.0 value).
/// Exempt subsystems: risk, death, operator.
pub fn apply_mortality_pressure(intent: &mut Intent, vitality: f64) {
    let exempt = ["risk", "death", "operator"];
    if exempt.contains(&intent.subsystem.as_str()) { return; }

    let pressure = 1.0 - vitality; // 0 = healthy, 1 = dying
    intent.cost_sensitivity = (intent.cost_sensitivity + pressure * 0.3).min(1.0);

    // Under extreme pressure, downgrade quality for non-critical subsystems
    if pressure > 0.7 {
        intent.quality = match intent.quality {
            Quality::Maximum => Quality::High,
            Quality::High => Quality::Medium,
            Quality::Medium => Quality::Low,
            other => other,
        };
    }
}
}
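A worked example of the transformation, using a condensed local copy of the function above: at vitality 0.2 the pressure is 0.8, so a dream intent's cost sensitivity rises from 0.5 to 0.74 and its quality drops one step, while an exempt subsystem like risk is untouched:

```rust
// Condensed local copy of apply_mortality_pressure for illustration,
// with Intent trimmed to the fields the function touches.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Quality { Minimum, Low, Medium, High, Maximum }

pub struct Intent {
    pub quality: Quality,
    pub cost_sensitivity: f64,
    pub subsystem: String,
}

pub fn apply_mortality_pressure(intent: &mut Intent, vitality: f64) {
    let exempt = ["risk", "death", "operator"];
    if exempt.contains(&intent.subsystem.as_str()) { return; }
    let pressure = 1.0 - vitality; // 0 = healthy, 1 = dying
    intent.cost_sensitivity = (intent.cost_sensitivity + pressure * 0.3).min(1.0);
    if pressure > 0.7 {
        intent.quality = match intent.quality {
            Quality::Maximum => Quality::High,
            Quality::High => Quality::Medium,
            Quality::Medium => Quality::Low,
            other => other,
        };
    }
}
```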

ModelRouter extension

The ModelRouter runs in the Golem runtime (Rust) as a runtime extension:

#![allow(unused)]
fn main() {
// crates/golem-inference/src/routing.rs

pub struct ModelRouter;

#[async_trait]
impl Extension for ModelRouter {
    fn name(&self) -> &str { "model-router" }
    fn layer(&self) -> u8 { 3 }

    async fn on_before_agent_start(&self, ctx: &mut AgentStartCtx) -> Result<()> {
        let state = ctx.golem_state();
        let tier = state.heartbeat.cognitive_tier;

        if tier == CognitiveTier::T0 {
            return Ok(()); // No LLM call -- FSM handles it
        }

        // Look up the subsystem intent
        let subsystem = ctx.current_subsystem();
        let mut intent = subsystem_intent(subsystem);

        // Apply mortality pressure (exempt: risk, death, operator)
        apply_mortality_pressure(&mut intent, state.mortality.vitality);

        // Resolve against ordered provider list -- first match wins
        let resolution = resolve(&state.providers, &intent)
            .ok_or_else(|| anyhow!("No provider for intent: {}", intent.subsystem))?;

        // Set model on the session
        ctx.set_model(&resolution.model, &resolution.provider);

        // Emit degradation as GolemEvent for owner visibility
        if !resolution.degraded.is_empty() {
            ctx.emit(GolemEvent::InferenceStart {
                // ... fields ...
            });
            ctx.emit_warning(format!(
                "{} routed to {}/{} (unavailable: {})",
                intent.subsystem, resolution.provider, resolution.model,
                resolution.degraded.join(", ")
            ));
        }

        Ok(())
    }
}
}

Six-layer routing pipeline

The gateway implements six decision stages, each adding intelligence. T0 ticks never reach the gateway at all – the FSM rules in the heartbeat OBSERVE phase exit before any LLM call fires. The pipeline below handles only the ~20% of ticks that escape T0 suppression.

Request arrives (T1 or T2 only -- T0 exits before reaching gateway)
    |
    v
+-------------------------+
|  Layer 0: T0 FSM rules   |  No LLM call at all. ~80% of heartbeat ticks
|  (in golem heartbeat)    |  exit here. Zero gateway traffic.
+-------------------------+
    | (non-suppressed ticks only)
    v
+-------------------------+
|  Layer 1: Pre-filter     |  Rule-based: token limits, model availability,
|  (0ms, free)             |  agent tier restrictions, behavioral phase caps
+-------------------------+
    |
    v
+-------------------------+
|  Layer 2: Semantic Cache |  Embed last user message with nomic-embed-text
|  (3-8ms, free)           |  -v1.5 (local ONNX). Cosine > 0.92 = hit.
+-------------------------+
    | (cache miss)
    v
+-------------------------+
|  Layer 3: Classify       |  Local DeBERTa-v3-base: complexity, domain,
|  (3-8ms, ~free)          |  safety, intent. domain="defi" triggers
|                          |  DeFi enrichment in Layer 5.
+-------------------------+
    |
    v
+-------------------------+
|  Layer 4: Route          |  Subsystem intent resolved against ordered
|  (<1ms, ~free)           |  provider list. Mortality pressure applied.
|                          |  First match wins.
+-------------------------+
    |
    v
+-------------------------+
|  Layer 5: Context Engine |  7-step optimization: reorder -> prune tools
|  (0-100ms, varies)       |  -> compress history -> dedup -> relevance
|                          |  -> constraints -> format. Tool pruning here.
+-------------------------+
    |
    v
+-------------------------+
|  Layer 6: KV-Cache Route |  Session-affinity routing to provider/pod
|  (<1ms, free)            |  with warm KV-cache prefix. Up to 87.4% cache
|                          |  hit rate [IBM-KVFlow]. Affinity decays 5 min.
+-------------------------+
    |
    v
    Provider (BlockRun -> OpenRouter -> Venice -> Bankr -> Direct)
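Layer 2's hit test reduces to a cosine comparison between the new message embedding and cached ones, with a hit iff similarity exceeds 0.92. A minimal sketch (the embedding step with nomic-embed-text-v1.5 is out of scope here; function names are illustrative):

```rust
/// Cosine similarity between two embedding vectors.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Layer 2 hit test: similarity above the configured threshold (0.92).
pub fn is_cache_hit(query: &[f32], cached: &[f32], threshold: f32) -> bool {
    cosine(query, cached) > threshold
}
```

The 0.92 threshold corresponds to BARDO_INFERENCE_CACHE_THRESHOLD in the configuration section below.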

Dynamic catalog refresh

The provider registry is populated from BlockRun’s catalog (GET https://api.blockrun.ai/v1/models, cached hourly) and merged with operator config. When BlockRun adds a model, the gateway discovers it automatically at the next refresh. Venice, Bankr, and Direct Key providers declare their models statically in config.
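The merge step can be sketched as a pure function keyed by model id. Types are simplified here, and the rule that operator config wins on conflict is an assumption:

```rust
use std::collections::BTreeMap;

// Simplified stand-in for the registry's ModelProvider entries.
#[derive(Debug, Clone, PartialEq)]
pub struct CatalogEntry {
    pub id: String, // e.g. "blockrun/claude-sonnet-4"
    pub family: String,
}

/// Merge the hourly BlockRun catalog with statically configured entries,
/// keyed by id. Operator config overrides catalog data (assumed rule).
pub fn merge_catalog(
    dynamic: Vec<CatalogEntry>,
    static_cfg: Vec<CatalogEntry>,
) -> Vec<CatalogEntry> {
    let mut by_id: BTreeMap<String, CatalogEntry> = BTreeMap::new();
    for e in dynamic {
        by_id.insert(e.id.clone(), e);
    }
    for e in static_cfg {
        // Later insert wins: operator config shadows the catalog.
        by_id.insert(e.id.clone(), e);
    }
    by_id.into_values().collect()
}
```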

Provider registry types

#![allow(unused)]
fn main() {
// crates/bardo-router/src/registry.rs

/// A model available through one or more provider backends.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelProvider {
    pub id: String,       // e.g. "blockrun/claude-sonnet-4"
    pub name: String,
    pub family: String,   // "claude", "gpt", "gemini", "hermes", "qwen"
    pub access: ModelAccess,
    pub capabilities: ModelCapabilities,
    pub pricing: ModelPricing,
    pub health: Option<ModelHealth>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelCapabilities {
    pub hybrid_reasoning: bool,
    pub tool_calling: bool,
    pub structured_output: bool,
    pub max_output_tokens: u32,
    pub predicted_outputs: bool,
    pub explicit_caching: bool,
    pub visible_thinking: bool,
    pub citations_support: bool,
    pub compaction_api: bool,
    pub adaptive_thinking: bool,
    pub strengths: ModelStrengths,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelStrengths {
    pub reasoning: f32,
    pub code_generation: f32,
    pub tool_call_accuracy: f32,
    pub schema_adherence: f32,
    pub defi_knowledge: f32,
    pub instruction_following: f32,
}

/// Token pricing in USD. Refreshed hourly from BlockRun catalog.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelPricing {
    pub input_per_million: f64,
    pub output_per_million: f64,
    pub cached_input_per_million: Option<f64>,
    pub source: PricingSource,
    pub last_updated: Option<String>,
}

/// Live health from 30-second pings. >5% error rate over 5 min
/// removes a provider from the pool temporarily.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelHealth {
    pub status: HealthStatus,
    pub avg_latency_ms: Option<f64>,
    pub p95_latency_ms: Option<f64>,
    pub error_rate: Option<f64>,
    pub consecutive_failures: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum HealthStatus { Healthy, Degraded, Down, Unknown }
}

Tool pruning (Layer 5 detail)

Tool definitions consume ~15,000 tokens per request. The task classifier in Layer 5 prunes the set to at most 12 tools, keeping definition tokens under ~1,500.

#![allow(unused)]
fn main() {
// crates/bardo-pipeline/src/tools.rs

/// Prune to the subset relevant for this tick. Hard cap of 12 tools
/// keeps definition tokens under ~1,500. Stable-ID sort preserves
/// prefix-cache alignment.
pub fn classify_and_prune(
    tick_type: TickType,
    // Not yet consulted here; reserved for regime/phase-aware caps.
    _regime: &MarketRegime,
    _phase: &BehavioralPhase, // One of five survival phases: Thriving, Stable, Conservation, Desperate, Terminal
    all_tools: &[ToolDefinition],
) -> Vec<ToolDefinition> {
    let allowed: &HashSet<&str> = &TASK_TOOL_MAP[&tick_type];
    let mut pruned: Vec<ToolDefinition> = all_tools
        .iter()
        .filter(|t| allowed.contains(t.name.as_str()))
        .cloned()
        .collect();
    pruned.sort_by(|a, b| a.id.cmp(&b.id));
    pruned.truncate(12);
    pruned
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum TickType {
    MarketAnalysis,
    LpManagement,
    VaultRebalance,
    RiskCheck,
    PortfolioReview,
    TradeExecution,
    StrategyUpdate,
}
}

For very large tool catalogs (100+), the gateway exposes three meta-tools instead of pre-loading definitions: search_tools, get_tool_schema, execute_tool. This achieves 97.5% token reduction (the “Speakeasy” pattern).
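A minimal sketch of the search_tools meta-tool under this pattern. The substring-matching logic is an illustrative assumption (a real implementation might rank candidates with the Layer 2 embeddings instead):

```rust
// Simplified stand-in for the gateway's tool definitions.
pub struct ToolDefinition {
    pub name: String,
    pub description: String,
}

/// search_tools meta-tool: find candidate tools by keyword instead of
/// pre-loading every schema. get_tool_schema / execute_tool would then
/// operate on the returned names.
pub fn search_tools<'a>(
    catalog: &'a [ToolDefinition],
    query: &str,
) -> Vec<&'a str> {
    let q = query.to_lowercase();
    catalog
        .iter()
        .filter(|t| {
            t.name.to_lowercase().contains(&q)
                || t.description.to_lowercase().contains(&q)
        })
        .map(|t| t.name.as_str())
        .collect()
}
```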


Degradation is visible, not silent

When a provider can satisfy an intent but not all preferences, the degraded field says what’s missing:

#![allow(unused)]
fn main() {
let resolution = resolve(&providers, &INTENTS.dream);
// resolution = Resolution {
//   model: "claude-opus-4-6",
//   provider: "blockrun",
//   features: ["adaptive_thinking"],
//   degraded: ["visible_thinking", "privacy"],
//   // Dream wanted visible thinking + privacy,
//   // but only BlockRun was available.
// }
}

The Golem emits this to the owner: “Dream cycle used Claude (visible thinking and privacy unavailable – configure Venice for better dream quality).” This is actionable. The owner knows exactly what to add to their config to fix it.


Tool format adapters

BlockRun serves diverse models that emit tool calls in different formats. The gateway normalizes all formats to a standard ToolInvocation struct.

| Model family | Raw format | Adapter |
|---|---|---|
| Anthropic | tool_use content blocks | AnthropicToolAdapter |
| OpenAI | function_call / tool_calls in msg | OpenAIToolAdapter |
| Hermes | <tool_call> XML blocks in text | HermesToolAdapter |
| Qwen | <tool_call> blocks | QwenToolAdapter |
| Generic | Raw JSON in text response | JsonToolAdapter |

#![allow(unused)]
fn main() {
// crates/bardo-router/src/tools.rs

pub trait ToolAdapter: Send + Sync {
    fn format_tools(&self, tools: &[ToolDefinition]) -> serde_json::Value;
    fn parse_tool_calls(&self, response: &ProviderResponse) -> Vec<ToolInvocation>;
    fn format_tool_results(&self, results: &[ToolResult]) -> serde_json::Value;
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ToolInvocation {
    pub id: String,
    pub name: String,
    pub arguments: serde_json::Map<String, serde_json::Value>,
}
}

Daily cost projections

At 100 ticks/day with expected distribution:

| Scenario | T0 | T1 | T2 | Daily LLM cost |
|---|---|---|---|---|
| Calm market (90% T0, 8% T1, 2% T2) | $0.00 | $0.024 | $0.02 | ~$0.05 |
| Normal market (80% T0, 15% T1, 5% T2) | $0.00 | $0.045 | $0.15 | ~$0.20 |
| Volatile market (60% T0, 25% T1, 15% T2) | $0.00 | $0.075 | $0.75 | ~$0.83 |

Target daily cost per Golem: $1.00-$2.00 total (LLM + compute + gas + data).
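The normal-market row can be reproduced with back-derived unit costs: T1 ≈ $0.003 per Haiku call and a blended T2 cost of ≈ $0.03. These unit costs are assumptions read off the table, not normative figures:

```rust
/// Daily LLM spend for a given tick count and tier distribution.
/// Unit costs are assumptions back-derived from the projection table.
pub fn daily_llm_cost(ticks: u32, t1_share: f64, t2_share: f64) -> f64 {
    const T1_COST: f64 = 0.003; // Haiku per-call (assumed)
    const T2_COST: f64 = 0.03;  // blended Sonnet/Opus per-call (assumed)
    let t1_calls = ticks as f64 * t1_share;
    let t2_calls = ticks as f64 * t2_share;
    t1_calls * T1_COST + t2_calls * T2_COST
}
```

The blend shifts with conditions: in the volatile scenario the T2 rows imply a heavier Opus mix (≈ $0.05 per T2 call), so the same function would need a scenario-specific T2 cost.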


System 1 / System 2 escalation

The three-tier model mirrors dual-process theory from cognitive science [WESTON-S2S1-2024, DPT-AGENT-2025]:

  • System 1 (T0 + T1): Fast, cheap, handles 95% of decisions. Deterministic probes at T0, quick Haiku analysis at T1. The Golem’s “intuition.”
  • System 2 (T2): Slow, expensive, deployed only when System 1 flags uncertainty or novel situations. Sonnet/Opus for deep reasoning. The Golem’s “deliberation.”

Tiers set the escalation boundary. Intents determine what happens after escalation. A T2 escalation for risk routes to Opus with interleaved thinking and citations. A T2 escalation for dream routes to DeepSeek R1 on Venice with visible thinking and privacy. Same tier, different intents, different providers.

Non-heartbeat subsystems (risk, dream, daimon, curator, playbook, operator, death) bypass tier gating entirely and use their own intents. See 01-cognition.md S2 for the full subsystem intent table.


Configuration

# Required
BARDO_INFERENCE_URL=https://bardo.example.com
BARDO_BLOCKRUN_ENDPOINT=https://api.blockrun.ai

# Optional fallback
BARDO_OPENROUTER_KEY=sk-or-...

# Tier overrides (default: auto-assigned from BlockRun catalog)
BARDO_T1_MODEL=blockrun/claude-haiku-4-5
BARDO_T2_MODEL=blockrun/claude-sonnet-4

# Tuning
BARDO_INFERENCE_CACHE_THRESHOLD=0.92
BARDO_INFERENCE_CACHE_TTL=300
BARDO_INFERENCE_MAX_RETRIES=3
BARDO_INFERENCE_SPREAD_PCT=20

InferenceProfile: per-call parameter specification

The routing system above decides WHICH model and provider handle a call. The InferenceProfile decides HOW the model reasons – temperature, sampling, reasoning depth, output format, caching hints, and provider-specific features. Every subsystem attaches a profile to its Intent; the gateway applies it after provider resolution, before the request reaches the backend.

#![allow(unused)]
fn main() {
/// Complete inference parameter specification for a single call.
/// Attached to the Intent by the subsystem, applied by the gateway.
///
/// All fields are Option<T>: None means "use provider default."
/// The gateway merges the profile with provider-specific defaults
/// and capabilities before sending the request.
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
pub struct InferenceProfile {
    // ── Sampling ──────────────────────────────────────────────
    /// Temperature: controls randomness.
    /// 0.0 = deterministic, 2.0 = maximum randomness.
    /// None = provider default (typically 1.0).
    pub temperature: Option<f32>,

    /// Top-p (nucleus sampling): cumulative probability threshold.
    /// 0.1 = very focused, 1.0 = no filtering.
    /// Mutually exclusive with top_k in practice.
    pub top_p: Option<f32>,

    /// Top-k: number of highest-probability tokens to consider.
    /// 1 = greedy, 100+ = broad. Not all providers support this.
    pub top_k: Option<u32>,

    /// Min-p: dynamic probability floor relative to top token.
    /// 0.1 = tokens with <10% of top token's probability are excluded.
    /// More principled than fixed top-p. Venice + open models support this.
    pub min_p: Option<f32>,

    /// Frequency penalty: penalizes tokens that appear frequently.
    /// 0.0 = no penalty, 2.0 = strong penalty.
    pub frequency_penalty: Option<f32>,

    /// Presence penalty: penalizes tokens that have appeared at all.
    /// 0.0 = no penalty, 2.0 = strong penalty.
    pub presence_penalty: Option<f32>,

    // ── Reasoning ─────────────────────────────────────────────
    /// Reasoning effort: controls depth of chain-of-thought.
    /// Maps to Venice's reasoning_effort, Anthropic's extended thinking,
    /// OpenAI's reasoning effort parameter.
    /// None = provider default. Some models always reason.
    pub reasoning_effort: Option<ReasoningEffort>,

    /// Whether to request visible thinking/reasoning traces.
    /// Venice: reasoning_content field. Anthropic: thinking blocks.
    /// None = provider default.
    pub visible_thinking: Option<bool>,

    // ── Output format ─────────────────────────────────────────
    /// Structured output schema (JSON Schema).
    /// When set, the provider enforces this schema on the response.
    /// Falls back gracefully to free text + parsing if unsupported.
    pub response_schema: Option<ResponseSchema>,

    /// Maximum output tokens.
    pub max_tokens: Option<u32>,

    /// Stop sequences.
    pub stop_sequences: Option<Vec<String>>,

    // ── Caching ───────────────────────────────────────────────
    /// Prompt cache key for session affinity.
    /// Venice/Anthropic: routes to same server for cache hits.
    pub prompt_cache_key: Option<String>,

    /// Explicit cache control markers for Anthropic models.
    /// When true, the gateway auto-adds cache_control to system prompts
    /// and long static content blocks.
    pub cache_control: Option<bool>,

    // ── Provider-specific ─────────────────────────────────────
    /// Venice-specific: enable web search for this call.
    pub web_search: Option<bool>,

    /// Venice-specific: TEE (Trusted Execution Environment) mode.
    /// Ensures inference runs in an encrypted enclave.
    pub tee_mode: Option<bool>,

    /// OpenAI-specific: predicted output for diffing (PLAYBOOK edits).
    pub predicted_output: Option<String>,

    /// Seed for reproducibility. Not all providers support this.
    pub seed: Option<u64>,
}

/// Structured output schema specification.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ResponseSchema {
    /// Schema name (for provider registration).
    pub name: String,
    /// JSON Schema definition.
    pub schema: serde_json::Value,
    /// Whether strict mode is required.
    /// Venice/OpenAI: strict=true. Falls back to prompt-guided if unsupported.
    pub strict: bool,
}
}

ReasoningEffort enum

Normalized across providers. The gateway maps these to provider-specific values.

#![allow(unused)]
fn main() {
/// Reasoning effort levels, normalized across providers.
/// The gateway maps these to provider-specific values.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum ReasoningEffort {
    /// No reasoning. Fast, cheap. Use for simple classification.
    None,
    /// Minimal reasoning. Quick chain-of-thought.
    Low,
    /// Balanced reasoning. Default for most tasks.
    Medium,
    /// Deep reasoning. For complex analysis and decisions.
    High,
    /// Maximum reasoning. For critical decisions (risk, death).
    Max,
}
}

Provider normalization:

| Level | Venice | Anthropic | OpenAI | Bankr |
|---|---|---|---|---|
| None | "none" | budget_tokens: 0 | "none" | passthrough to underlying |
| Low | "low" | budget_tokens: 1024 | "low" | passthrough to underlying |
| Medium | "medium" | budget_tokens: 4096 | "medium" | passthrough to underlying |
| High | "high" | budget_tokens: 16384 | "high" | passthrough to underlying |
| Max | "max" (Opus 4.6 only, else "high") | budget_tokens: 65536 | "xhigh" | passthrough to underlying |

#![allow(unused)]
fn main() {
/// Map normalized ReasoningEffort to provider-specific parameters.
pub fn map_reasoning_effort(
    effort: ReasoningEffort,
    provider: &str,
    model: &str,
) -> ReasoningParams {
    match provider {
        "venice" => {
            let level = match effort {
                ReasoningEffort::None => "none",
                ReasoningEffort::Low => "low",
                ReasoningEffort::Medium => "medium",
                ReasoningEffort::High => "high",
                ReasoningEffort::Max => {
                    if model.contains("opus-4-6") { "max" } else { "high" }
                }
            };
            ReasoningParams::Venice { effort: level.into() }
        }
        "anthropic" | "blockrun" => {
            let budget = match effort {
                ReasoningEffort::None => 0,
                ReasoningEffort::Low => 1024,
                ReasoningEffort::Medium => 4096,
                ReasoningEffort::High => 16384,
                ReasoningEffort::Max => 65536,
            };
            ReasoningParams::Anthropic { budget_tokens: budget }
        }
        "openai" | "direct_openai" => {
            let level = match effort {
                ReasoningEffort::None => "none",
                ReasoningEffort::Low => "low",
                ReasoningEffort::Medium => "medium",
                ReasoningEffort::High => "high",
                ReasoningEffort::Max => "xhigh",
            };
            ReasoningParams::OpenAI { effort: level.into() }
        }
        "bankr" => {
            if model.starts_with("claude") {
                map_reasoning_effort(effort, "anthropic", model)
            } else if model.starts_with("gpt") {
                map_reasoning_effort(effort, "openai", model)
            } else {
                let level = match effort {
                    ReasoningEffort::None => "none",
                    ReasoningEffort::Low => "low",
                    ReasoningEffort::Medium => "medium",
                    ReasoningEffort::High | ReasoningEffort::Max => "high",
                };
                ReasoningParams::Generic { effort: level.into() }
            }
        }
        _ => ReasoningParams::Unsupported,
    }
}
}

Provider parameter mapping

The gateway translates InferenceProfile fields to provider-specific API parameters. Not all providers support all fields. The gateway applies the best available approximation and records what was degraded.

| Profile Field | Venice | Anthropic (Direct/BlockRun) | OpenAI (Direct/BlockRun) | Bankr | OpenRouter |
|---|---|---|---|---|---|
| temperature | temperature | temperature | temperature | temperature | temperature |
| top_p | top_p | top_p | top_p | top_p | top_p |
| top_k | top_k | top_k | – (ignored) | – (passthrough) | top_k |
| min_p | min_p | – (use top_p approx) | – (use top_p approx) | – | model-dependent |
| frequency_penalty | frequency_penalty | – (ignored) | frequency_penalty | passthrough | passthrough |
| presence_penalty | presence_penalty | – (ignored) | presence_penalty | passthrough | passthrough |
| reasoning_effort | reasoning.effort | thinking.budget_tokens | reasoning_effort | passthrough | model-dependent |
| visible_thinking | reasoning_content field | thinking blocks | reasoning field | passthrough | model-dependent |
| response_schema | response_format.json_schema | tool_use workaround | response_format.json_schema | passthrough | model-dependent |
| prompt_cache_key | prompt_cache_key | – (auto by prefix) | – (auto by prefix) | – | – |
| cache_control | cache_control on blocks | cache_control on blocks | – (auto) | – | – |
| web_search | venice_parameters.web_search | – | – | – | – |
| tee_mode | model suffix -tee | – | – | – | – |
| predicted_output | – | – | prediction.content | – | – |
| seed | seed | – | seed | – | seed |

Graceful degradation rules:

  1. temperature, top_p, max_tokens: Always supported. No degradation.
  2. top_k, min_p: If unsupported, approximate via top_p. Log degradation.
  3. reasoning_effort: If unsupported, map to temperature/prompt adjustments. High -> add “Think step by step” to system prompt. None -> add “Answer directly without explanation.”
  4. response_schema: If unsupported, fall back to prompt-guided JSON + post-parse validation. See 13-reasoning.md for structured output details.
  5. web_search, tee_mode, predicted_output: Provider-exclusive. If the provider doesn’t support the feature, the field is dropped and the loss is recorded in Resolution.degraded.
  6. prompt_cache_key, cache_control: Provider-exclusive caching. If unsupported, ignored. No functional degradation – just higher cost.
```rust
/// Apply InferenceProfile to a provider request, handling degradation.
pub fn apply_profile(
    request: &mut ProviderRequest,
    profile: &InferenceProfile,
    provider: &dyn Provider,
) -> Vec<String> {
    let mut degraded = Vec::new();
    let caps = provider.capabilities();

    if let Some(t) = profile.temperature {
        request.set_temperature(t);
    }

    if let Some(effort) = profile.reasoning_effort {
        if caps.supports_reasoning_effort {
            request.set_reasoning_effort(effort);
        } else if caps.supports_reasoning {
            let budget = match effort {
                ReasoningEffort::None => 0,
                ReasoningEffort::Low => 1024,
                ReasoningEffort::Medium => 4096,
                ReasoningEffort::High => 16384,
                ReasoningEffort::Max => 65536,
            };
            request.set_thinking_budget(budget);
        } else {
            match effort {
                ReasoningEffort::None => {
                    request.prepend_system("Answer directly. Do not explain your reasoning.");
                }
                ReasoningEffort::High | ReasoningEffort::Max => {
                    request.prepend_system(
                        "Think through this step by step. Show your reasoning."
                    );
                }
                _ => {}
            }
            degraded.push(format!("reasoning_effort:{:?} -> prompt fallback", effort));
        }
    }

    if let Some(ref schema) = profile.response_schema {
        if caps.supports_response_schema {
            request.set_response_format(schema);
        } else {
            request.append_system(&format!(
                "\n\nRespond ONLY with valid JSON matching this schema:\n{}",
                serde_json::to_string_pretty(&schema.schema).unwrap()
            ));
            degraded.push("response_schema -> prompt-guided JSON".into());
        }
    }

    if let Some(mp) = profile.min_p {
        if caps.supports_min_p {
            request.set_min_p(mp);
        } else {
            let approx_top_p = 1.0 - mp;
            request.set_top_p(approx_top_p);
            degraded.push(format!("min_p:{} -> top_p:{}", mp, approx_top_p));
        }
    }

    if let Some(ref key) = profile.prompt_cache_key {
        if caps.supports_prompt_cache_key {
            request.set_prompt_cache_key(key);
        }
    }

    if profile.cache_control.unwrap_or(false) && caps.supports_cache_control {
        request.add_cache_control_markers();
    }

    if profile.web_search.unwrap_or(false) {
        if caps.supports_web_search {
            request.set_web_search(true);
        } else {
            degraded.push("web_search -> unsupported".into());
        }
    }

    if let Some(ref predicted) = profile.predicted_output {
        if caps.supports_predicted_output {
            request.set_predicted_output(predicted);
        } else {
            degraded.push("predicted_output -> unsupported".into());
        }
    }

    degraded
}
```

Per-subsystem parameter table

The master table. Every subsystem, every parameter, with rationale.

Waking subsystems

| Subsystem | Temperature | Sampling | Reasoning Effort | Structured Output | Cache | Rationale |
|---|---|---|---|---|---|---|
| heartbeat_t1 | 0.3 | top_p=0.9 | Low | HeartbeatDecision schema | cache_control | Fast, cheap, focused. Low creativity needed. Structured output extracts action/severity cleanly. |
| heartbeat_t2 | 0.5 | top_p=0.95 | High | HeartbeatDecision schema | cache_control | Novel situation needs deeper reasoning. Higher temperature allows consideration of less obvious options. |
| risk | 0.1 | top_p=0.85, top_k=40 | Max | RiskAssessment schema | cache_control | Maximum precision. Near-deterministic. Structured output ensures all five risk layers are evaluated. Never degraded by mortality pressure. |
| daimon (the Golem’s internal personality and emotional regulation subsystem) | 0.4 | top_p=0.9 | Low | DaimonAppraisal schema | – | Emotional appraisal needs consistency but not rigidity. PAD vector (Pleasure-Arousal-Dominance emotional state) extraction via structured output. Privacy preferred (Venice). |
| daimon_complex | 0.6 | top_p=0.95 | High | DaimonAppraisal schema | – | Complex emotional situations need deeper processing. Visible thinking captures the reasoning chain. Privacy required (Venice). |
| curator | 0.3 | top_p=0.9 | Medium | CuratorEvaluation schema | cache_control | Systematic evaluation. Structured output extracts quality scores, retention decisions, cross-references. |
| playbook | 0.4 | top_p=0.9 | Medium | None (free text) | predicted_output | PLAYBOOK.md (the Golem’s self-authored strategy document that evolves over its lifetime) edits are free-text diffs. OpenAI’s predicted_output saves tokens by diffing against the current PLAYBOOK. |
| operator | 0.7 | top_p=0.95 | High | None (free text) | cache_control | Owner chat. Natural language, higher creativity for explanations. Never degraded by mortality pressure. |
| mind_wandering | 0.8 | min_p=0.1 | None | None (free text) | – | Brief reverie during waking. Loosened constraints. Cheap (T0/T1). No reasoning overhead needed. |

Dream subsystems

| Subsystem | Temperature | Sampling | Reasoning Effort | Structured Output | Cache | Special |
|---|---|---|---|---|---|---|
| dream_nrem (replay) | 0.4 | top_p=0.9 | Medium | ReplayAnalysis schema | prompt_cache_key per cycle | Systematic replay. Structured output extracts lessons, surprise scores, counterfactual markers. |
| dream_rem (imagination) | 0.9 | min_p=0.1 | High | None (free text) | prompt_cache_key | Creative scenario generation. High temperature + min-p for principled diversity. Web search enabled (Venice). |
| dream_rem_creative | 1.2 | min_p=0.08 | Medium | None (free text) | – | Boden-mode creative recombination. Highest temperature in the waking/dream cycle. |
| dream_integration | 0.3 | top_p=0.85 | High | DreamIntegration schema | tee_mode | Consolidation. Analytical. Structured output extracts promoted/staged/discarded decisions with rationale. TEE for attestation. |
| dream_threat | 0.5 | top_p=0.9 | High | ThreatAssessment schema | prompt_cache_key | Threat rehearsal. Balanced creativity (to imagine novel attacks) with analytical depth. |

Hypnagogic subsystems

| Subsystem | Temperature | Sampling | Reasoning Effort | Structured Output | Cache | Special |
|---|---|---|---|---|---|---|
| hypnagogic_induction | 1.0–1.2 | min_p=0.1 | None | None (free text) | – | Initial associative scan. Executive loosening. No reasoning – raw association. Temperature ramps within session. |
| hypnagogic_dali | 1.2–1.5 | min_p=0.08 | None | None (free text) | – | Peak creative range. Dali interrupt: 50-100 token partials at max temperature. Highest temperature in the entire system. |
| hypnagogic_observer | 0.3 | top_p=0.85 | None | FragmentEvaluation schema | – | HomuncularObserver. Analytical evaluation of fragments. Structured output for novelty/relevance/coherence scores. Cheapest tier (T0). |
| hypnagogic_capture | 0.5 | top_p=0.9 | Low | CaptureResult schema | – | Lucid capture. Moderate analytical. Structured output for promote/stage/discard decisions. |
| hypnopompic_return | 0.6 | top_p=0.9 | Low | None (free text) | – | Gradual re-engagement. Slightly creative to allow dream insights to surface before full analytical reassertion. |

Terminal subsystems

| Subsystem | Temperature | Sampling | Reasoning Effort | Structured Output | Cache | Special |
|---|---|---|---|---|---|---|
| death_reflect | 0.5 | top_p=0.95 | Max | None (free text) | – | Death Protocol Phase II. Maximum reasoning for honest self-assessment. Free text for narrative quality. Visible thinking required. Venice (privacy + visible thinking). |
| death_testament | 0.4 | top_p=0.9 | High | DeathTestament schema (partial) | tee_mode | Death Protocol Phase III. Structured for machine-parseable sections (metrics, heuristics, warnings). Free text for reflection narrative. TEE for sealed attestation. |

Temperature scheduling within sessions

Some subsystems use temperature annealing – the temperature changes within a single inference session or across a sequence of related calls.

Hypnagogic cosine annealing

Each Dali cycle within hypnagogic onset follows a cosine schedule:

Cycle start:  T = T_high (1.2-1.5)   <- Peak creative range
Mid-cycle:    T = T_mid  (0.8-1.0)   <- Transitional
Cycle end:    T = T_low  (0.3-0.5)   <- Evaluation (HomuncularObserver)
Reanneal:     T = T_high * 0.8        <- Next cycle starts slightly cooler
```rust
/// Cosine temperature annealing for Dali cycles.
pub fn dali_temperature(
    step: usize,
    total_steps: usize,
    t_high: f32,
    t_low: f32,
) -> f32 {
    let progress = step as f32 / total_steps as f32;
    let cosine = (1.0 + (progress * std::f32::consts::PI).cos()) / 2.0;
    t_low + (t_high - t_low) * cosine
}

/// Reanneal after fragment capture.
/// Each successive cycle starts slightly cooler, modeling
/// natural descent toward sleep.
pub fn reanneal(cycle: u8, base_high: f32) -> f32 {
    let decay = 0.95_f32.powi(cycle as i32);
    base_high * decay
}
```
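To make the schedule's endpoints concrete, here is a small self-contained check; the two functions are reproduced from the block above, and the assertions follow directly from the cosine formula (cos(0) = 1 at cycle start, cos(π) = -1 at cycle end):

```rust
// Endpoint check for the cosine schedule: the cycle starts at t_high,
// ends at t_low, passes through the midpoint halfway, and each reanneal
// lowers the next cycle's start by 5% (per the reanneal decay above).
fn dali_temperature(step: usize, total_steps: usize, t_high: f32, t_low: f32) -> f32 {
    let progress = step as f32 / total_steps as f32;
    let cosine = (1.0 + (progress * std::f32::consts::PI).cos()) / 2.0;
    t_low + (t_high - t_low) * cosine
}

fn reanneal(cycle: u8, base_high: f32) -> f32 {
    base_high * 0.95_f32.powi(cycle as i32)
}

fn main() {
    let (t_high, t_low) = (1.5, 0.3);
    // Cycle start: cos(0) = 1, so temperature is exactly t_high.
    assert!((dali_temperature(0, 10, t_high, t_low) - t_high).abs() < 1e-6);
    // Cycle end: cos(pi) = -1, so temperature reaches t_low.
    assert!((dali_temperature(10, 10, t_high, t_low) - t_low).abs() < 1e-5);
    // Mid-cycle: cos(pi/2) = 0, landing at the midpoint (1.5 + 0.3) / 2 = 0.9.
    assert!((dali_temperature(5, 10, t_high, t_low) - 0.9).abs() < 1e-3);
    // Cycle 1 starts 5% cooler than cycle 0.
    assert!((reanneal(1, t_high) - 1.425).abs() < 1e-6);
}
```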

Dream phase transitions

The dream cycle transitions between temperatures across phases:

NREM (replay):      T = 0.4 (analytical, systematic)
    | gradual increase
REM (imagination):  T = 0.9-1.2 (creative, exploratory)
    | sharp decrease
Integration:        T = 0.3 (analytical, consolidating)

The temperature transition is not instantaneous – the first REM call starts at 0.7 and ramps to 0.9 over 3-4 calls. This prevents jarring cognitive mode shifts.
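One way to realize that ramp is a linear interpolation over the first few REM calls. A minimal sketch; `rem_ramp_temperature` and its parameter names are illustrative, not part of the spec:

```rust
/// Sketch: linear ramp from the NREM-exit temperature toward the REM target
/// over the first `ramp_calls` calls. With start 0.7, target 0.9, and
/// ramp_calls = 4, this matches the "starts at 0.7, ramps over 3-4 calls"
/// guidance above.
fn rem_ramp_temperature(call_index: usize, ramp_calls: usize, t_start: f32, t_target: f32) -> f32 {
    if call_index >= ramp_calls {
        return t_target; // Ramp complete: hold the REM target temperature.
    }
    let progress = call_index as f32 / ramp_calls as f32;
    t_start + (t_target - t_start) * progress
}

fn main() {
    // First REM call starts at the NREM-adjacent temperature.
    assert!((rem_ramp_temperature(0, 4, 0.7, 0.9) - 0.7).abs() < 1e-6);
    // Halfway through the ramp.
    assert!((rem_ramp_temperature(2, 4, 0.7, 0.9) - 0.8).abs() < 1e-6);
    // After the ramp, the target holds for the rest of the REM phase.
    assert!((rem_ramp_temperature(7, 4, 0.7, 0.9) - 0.9).abs() < 1e-6);
}
```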

Mortality-aware temperature adjustment

A dying Golem’s temperature is compressed toward the analytical end for all non-exempt subsystems:

```rust
/// Compress temperature range based on mortality pressure.
/// As vitality drops, temperature moves toward analytical (lower).
/// Exempt: hypnagogia (creativity is the point), death (max effort),
/// operator (owner communication is never degraded).
pub fn apply_mortality_temperature(
    base_temp: f32,
    vitality: f64,
    subsystem: &str,
) -> f32 {
    let exempt = ["hypnagogic_induction", "hypnagogic_dali",
                  "death_reflect", "death_testament", "operator"];
    if exempt.contains(&subsystem) { return base_temp; }

    let pressure = (1.0 - vitality as f32).max(0.0);
    // Compress toward 0.3 (analytical floor) as pressure increases.
    let analytical_floor = 0.3;
    base_temp - (base_temp - analytical_floor) * pressure * 0.5
}
```

Reasoning effort policies

When to use each level

| Level | Cost Multiplier | When | Example Subsystems |
|---|---|---|---|
| None | 1x (no reasoning tokens) | Simple classification, scoring, fragment generation | hypnagogic_induction, hypnagogic_dali, mind_wandering |
| Low | ~1.2x | Routine decisions, emotional appraisals | heartbeat_t1, daimon, hypnagogic_capture |
| Medium | ~1.5-2x | Balanced analysis, knowledge evaluation | curator, dream_nrem, playbook |
| High | ~2-4x | Complex decisions, creative development, threat analysis | heartbeat_t2, daimon_complex, dream_rem, dream_threat |
| Max | ~4-8x | Critical decisions, death reflection, risk assessment | risk, death_reflect |
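When budgeting a call, these multipliers can be folded into a cost estimate. A sketch using the midpoints of the ranges above; the midpoint values are an assumption for illustration, not normative figures:

```rust
/// Sketch: rough per-call cost estimate at a given reasoning effort,
/// using midpoints of the multiplier ranges in the table above
/// (midpoints are an illustrative assumption).
#[derive(Clone, Copy)]
enum ReasoningEffort { None, Low, Medium, High, Max }

fn effort_multiplier(effort: ReasoningEffort) -> f64 {
    match effort {
        ReasoningEffort::None => 1.0,
        ReasoningEffort::Low => 1.2,
        ReasoningEffort::Medium => 1.75, // midpoint of ~1.5-2x
        ReasoningEffort::High => 3.0,    // midpoint of ~2-4x
        ReasoningEffort::Max => 6.0,     // midpoint of ~4-8x
    }
}

fn estimated_cost(base_cost_usd: f64, effort: ReasoningEffort) -> f64 {
    base_cost_usd * effort_multiplier(effort)
}

fn main() {
    // A $0.003 Haiku call at Max effort costs roughly $0.018 in this estimate.
    assert!((estimated_cost(0.003, ReasoningEffort::Max) - 0.018).abs() < 1e-9);
}
```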

Prompt caching strategy

Session-level caching

Every Golem uses a persistent prompt_cache_key derived from its Golem ID. This ensures that sequential inference calls within the same Golem’s lifecycle hit the same server with warm cache, maximizing cache hit rates for the system prompt, PLAYBOOK.md, and STRATEGY.md that are present in every call.

```rust
/// Generate prompt cache key for a Golem.
pub fn golem_cache_key(golem_id: &str, subsystem: &str) -> String {
    format!("golem-{}-{}", golem_id, subsystem)
}
```

Cache-aligned prompt structure

The 8-layer context engineering pipeline (see 04-context-engineering.md) already optimizes for cache hits by placing static content first. The InferenceProfile reinforces this:

| Position | Content | Cached? | Changes? |
|---|---|---|---|
| 1 | System prompt (identity, archetype) | yes | Never (within a lifecycle) |
| 2 | STRATEGY.md (owner-authored) | yes | Rarely (owner edits) |
| 3 | PLAYBOOK.md (evolved heuristics) | yes | Every 50 ticks (Curator cycle) |
| 4 | Tool definitions (pruned) | no | Per-tick (dynamic pruning) |
| 5 | Retrieved Grimoire entries | no | Per-tick |
| 6 | Current market context | no | Per-tick |
| 7 | User message / query | no | Per-tick |

Items 1-3 are cache-eligible (thousands of tokens, stable across ticks). Items 4-7 are dynamic. The gateway auto-adds cache_control: { type: "ephemeral" } markers at the boundary between static and dynamic content for Anthropic models, and uses prompt_cache_key for Venice.
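The boundary placement can be sketched as follows; `Message` and the layer numbering are illustrative stand-ins for the gateway's actual types:

```rust
/// Sketch of the boundary rule above: the cache marker belongs immediately
/// after the last static layer (1-3), so everything before it is a stable,
/// cacheable prefix. Types are illustrative, not the gateway's own.
struct Message { layer: u8 }

/// Returns the index of the last static-layer message, i.e. where a
/// cache_control marker (or prompt_cache_key prefix split) should go.
fn cache_boundary_index(messages: &[Message]) -> Option<usize> {
    messages.iter().rposition(|m| m.layer <= 3)
}

fn main() {
    let msgs = vec![
        Message { layer: 1 }, // system prompt
        Message { layer: 2 }, // STRATEGY.md
        Message { layer: 3 }, // PLAYBOOK.md
        Message { layer: 6 }, // market context (dynamic)
        Message { layer: 7 }, // query (dynamic)
    ];
    // The marker lands after PLAYBOOK.md, index 2.
    assert_eq!(cache_boundary_index(&msgs), Some(2));
}
```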

Cache economics per provider

| Provider | Cache Discount | Write Premium | Min Tokens | TTL | Auto-managed? |
|---|---|---|---|---|---|
| Venice (Claude) | 90% | 25% | ~4,000 | 5 min | yes (gateway adds markers) |
| Venice (other) | 50-90% | None | ~1,024 | 5 min | yes (auto by prefix) |
| Anthropic (Direct) | 90% | 25% | ~4,000 | 5 min | yes (gateway adds markers) |
| OpenAI (Direct) | 90% | None | 1,024 | 5-10 min | yes (auto by prefix) |
| Bankr | Passthrough (depends on underlying) | Passthrough | Passthrough | Passthrough | yes |
| BlockRun | Passthrough | Passthrough | Passthrough | Passthrough | yes |

Economic impact: For a Golem making 20 T1+ calls/day with a 3,000-token system prompt, prompt caching saves approximately 90% on the static prefix. At Claude Haiku rates, this is ~$0.02/day saved – modest per-Golem but significant across a Clade.
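A back-of-envelope check of that figure, assuming a Haiku input rate of $0.40/M tokens (the rate is an assumption for illustration; actual pricing varies):

```rust
// Back-of-envelope check of the ~$0.02/day claim above.
// Savings = calls/day x cached prefix tokens x rate per token x discount.
fn daily_cache_savings(calls_per_day: u32, prefix_tokens: u32, rate_per_m: f64, discount: f64) -> f64 {
    calls_per_day as f64 * prefix_tokens as f64 * (rate_per_m / 1_000_000.0) * discount
}

fn main() {
    // 20 calls x 3,000 tokens x $0.40/M x 90% discount ~= $0.02/day.
    let savings = daily_cache_savings(20, 3_000, 0.40, 0.90);
    assert!((savings - 0.0216).abs() < 1e-6);
}
```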


Venice-specific deep integration

Web search during REM dreams

Venice’s web search feature is enabled during the REM imagination phase to allow the Golem to incorporate current market information into its creative scenario generation. This is controlled by DreamVeniceConfig.web_search_enabled and capped by web_search_budget_per_cycle_usdc.

```rust
/// Build REM inference profile with web search.
pub fn rem_profile(config: &DreamVeniceConfig) -> InferenceProfile {
    InferenceProfile {
        temperature: Some(0.9),
        min_p: Some(0.1),
        reasoning_effort: Some(ReasoningEffort::High),
        visible_thinking: Some(true),
        web_search: Some(config.web_search_enabled),
        prompt_cache_key: Some(format!("dream-rem-{}", config.golem_id)),
        ..Default::default()
    }
}
```

Web search triggers are contextual: the REM imagination engine identifies when a counterfactual scenario involves a protocol or token the Golem has limited knowledge about, and constructs a focused search query. Results are injected into the scenario context, not the system prompt (to avoid cache invalidation).
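A sketch of that injection rule; `RemContext` and the marker format are illustrative, not the actual gateway types:

```rust
/// Sketch: search results are appended to the per-scenario context, never to
/// the system prompt, so the cached static prefix stays byte-identical.
struct RemContext { system_prompt: String, scenario_context: String }

fn inject_search_results(ctx: &mut RemContext, query: &str, results: &str) {
    // Appending here leaves the static prefix untouched and the cache warm.
    ctx.scenario_context.push_str(&format!("\n[web_search: {}]\n{}", query, results));
}

fn main() {
    let mut ctx = RemContext {
        system_prompt: "You are a Golem.".into(),
        scenario_context: "Counterfactual: unfamiliar protocol exploit.".into(),
    };
    let prefix_before = ctx.system_prompt.clone();
    inject_search_results(&mut ctx, "protocol TVL", "TVL: $12M");
    assert_eq!(ctx.system_prompt, prefix_before); // cache prefix unchanged
    assert!(ctx.scenario_context.contains("TVL: $12M"));
}
```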

TEE mode for death testaments

The death testament – the Golem’s final knowledge artifact – can optionally be generated inside a Trusted Execution Environment to provide cryptographic attestation that the testament was produced by the dying Golem’s own reasoning, not modified post-hoc. Venice’s TEE models are selected by appending -tee to the model ID.

```rust
/// Death testament inference profile.
pub fn death_testament_profile(config: &GolemConfig) -> InferenceProfile {
    InferenceProfile {
        temperature: Some(0.4),
        top_p: Some(0.9),
        reasoning_effort: Some(ReasoningEffort::High),
        visible_thinking: Some(true),
        tee_mode: Some(config.sealed_testament),
        response_schema: Some(ResponseSchema {
            name: "death_testament".into(),
            schema: death_testament_schema(),
            strict: true,
        }),
        ..Default::default()
    }
}
```

Venice embeddings for the Grimoire

The Golem’s episodic memory (LanceDB) and the HomuncularObserver’s novelty scoring both require embedding generation. Venice’s embeddings endpoint (text-embedding-bge-m3) provides a privacy-preserving alternative to the gateway’s local ONNX embedding model (nomic-embed-text-v1.5).

The choice is configurable: local embeddings (default, zero cost, ~768-dim) or Venice embeddings (API cost, potentially higher quality, privacy-preserving since Venice retains no data).

```rust
/// Embedding provider selection.
pub enum EmbeddingProvider {
    /// Local ONNX model. Default. Zero cost. ~5ms latency.
    Local,
    /// Venice API. API cost. ~50ms latency. Zero data retention.
    Venice { model: String },
}

impl EmbeddingProvider {
    pub async fn embed(&self, text: &str) -> Result<Vec<f32>> {
        match self {
            Self::Local => {
                // Constructed per call for brevity; a real implementation
                // would cache the loaded model.
                let model = fastembed::TextEmbedding::try_new(Default::default())?;
                let embeddings = model.embed(vec![text], None)?;
                Ok(embeddings[0].clone())
            }
            Self::Venice { model } => {
                let response = reqwest::Client::new()
                    .post("https://api.venice.ai/api/v1/embeddings")
                    .header("Authorization", format!("Bearer {}", venice_api_key()))
                    .json(&serde_json::json!({
                        "model": model,
                        "input": text,
                        "encoding_format": "float"
                    }))
                    .send()
                    .await?;
                let body: EmbeddingResponse = response.json().await?;
                Ok(body.data[0].embedding.clone())
            }
        }
    }
}
```

Locked profiles

The following subsystems have locked profiles that cannot be overridden by owner configuration:

  • risk: Temperature 0.1, reasoning Max. Safety-critical. Always maximum precision.
  • death_reflect: Temperature 0.5, reasoning Max. The Golem’s final honest self-assessment cannot be constrained.
  • operator: Temperature 0.7, reasoning High. Owner communication quality is never degraded.

Attempting to override a locked profile logs a warning and uses the locked defaults.
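Enforcement can be sketched as follows; the reduced `Profile` type and `resolve_profile` are illustrative, with the profile trimmed to temperature for brevity:

```rust
/// Sketch of locked-profile enforcement: locked subsystems always get their
/// locked defaults, and any divergent override is logged and discarded.
#[derive(Clone, PartialEq, Debug)]
struct Profile { temperature: f32 }

/// Locked subsystems and their locked temperatures, per the list above.
const LOCKED: &[(&str, f32)] = &[("risk", 0.1), ("death_reflect", 0.5), ("operator", 0.7)];

fn resolve_profile(subsystem: &str, requested: Profile) -> Profile {
    if let Some((_, t)) = LOCKED.iter().find(|(name, _)| *name == subsystem) {
        let locked = Profile { temperature: *t };
        if requested != locked {
            eprintln!("warning: override of locked profile '{}' ignored", subsystem);
        }
        return locked;
    }
    requested // non-locked subsystems keep the requested profile
}

fn main() {
    // Owner tries to loosen risk; the locked default wins.
    let p = resolve_profile("risk", Profile { temperature: 0.8 });
    assert!((p.temperature - 0.1).abs() < 1e-6);
    // Non-locked subsystems are configurable.
    let p = resolve_profile("curator", Profile { temperature: 0.3 });
    assert!((p.temperature - 0.3).abs() < 1e-6);
}
```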


Profile configuration

```toml
# bardo.toml -- inference profile overrides
[inference.profiles]
# Override any subsystem's default profile.
# Unspecified fields use the defaults from the tables above.

[inference.profiles.heartbeat_t1]
temperature = 0.3
reasoning_effort = "low"

[inference.profiles.dream_rem]
temperature = 1.0   # Owner wants more creative dreams
min_p = 0.08

[inference.profiles.risk]
# Risk is never overridden. This section is ignored with a warning.
# Safety-critical subsystems have locked profiles.

[inference.embeddings]
provider = "local"   # "local" or "venice"
venice_model = "text-embedding-bge-m3"

[inference.caching]
auto_cache_control = true
prompt_cache_key_prefix = "golem"
```

Profile flow through the system

Subsystem (e.g., heartbeat_t2)
    |
    +- builds Intent { quality: High, prefer: ["interleaved_thinking"] }
    +- builds InferenceProfile { temperature: 0.5, reasoning_effort: High, ... }
    |
    v
ModelRouter extension (golem-inference)
    |
    +- applies mortality pressure to Intent
    +- applies mortality temperature adjustment to Profile
    +- resolves Intent -> Resolution (provider + model)
    |
    v
Bardo Gateway (bardo-gateway)
    |
    +- 8-layer context engineering pipeline
    +- apply_profile() -> maps Profile to provider-specific params
    +- records degradations in Resolution.degraded
    +- adds cache_control markers if applicable
    |
    v
Provider Backend (Venice / BlockRun / Bankr / Direct)
    |
    +- receives fully parameterized request
    +- returns response with usage stats
    |
    v
Gateway post-processing
    |
    +- extracts reasoning_content if visible_thinking enabled
    +- validates structured output against schema if response_schema set
    +- records cost, latency, cache hit rate
    +- emits InferenceEnd event with all metadata
    |
    v
Subsystem receives typed response

Profile events

| Event | Trigger | Payload |
|---|---|---|
| inference:profile_applied | Profile parameters set on request | { subsystem, temperature, reasoning_effort, structured, cached } |
| inference:profile_degraded | One or more profile fields unsupported | { subsystem, degraded: ["min_p -> top_p", ...] } |
| inference:schema_validated | Structured output matches schema | { subsystem, schema_name, valid: bool } |
| inference:schema_fallback | Schema unsupported, fell back to prompt-guided | { subsystem, schema_name } |
| inference:reasoning_captured | Visible thinking extracted from response | { subsystem, reasoning_tokens, content_tokens } |
| inference:web_search_used | Venice web search triggered | { subsystem, query, results_count } |
| inference:cache_stats | Per-call cache statistics | { subsystem, cached_tokens, total_tokens, savings_usd } |

Backend routing algorithm (inside Bardo Inference)

When the Golem sends a request to Bardo Inference as its resolved provider, Bardo Inference has its own internal router that selects the optimal backend. This routing is invisible to the Golem – it just gets the best possible response. The decision considers the subsystem’s feature requirements, the Golem’s mortality pressure, the security class, available backends, and real-time health of each backend.

This is delegation, not configuration. The Golem doesn’t say “use Claude for risk.” The Golem says “I need risk assessment with interleaved thinking.” Bardo Inference routes to Claude because Claude provides interleaved thinking through BlockRun at the lowest cost with acceptable latency.

Routing decision flow

Golem sends request to Bardo Inference
    |
    +- 1. Context engineering pipeline (universal, all backends)
    |     Caching -> compression -> pruning -> optimization
    |
    +- 2. Feature extraction
    |     What does this request need? Citations? Thinking? Privacy?
    |     (From bardo.subsystem hints or request analysis)
    |
    +- 3. Hard filters
    |     Security class -> filter backends (private -> Venice only)
    |     Required features -> filter backends that support them
    |     Model specification -> filter backends that have the model
    |
    +- 4. Soft scoring (Pareto optimization)
    |     Cost x Quality x Privacy x Latency x Feature match
    |     Weights shift with Golem mortality pressure
    |
    +- 5. Health check
    |     Skip unhealthy backends
    |
    +- 6. Route to selected backend
         Apply provider-specific parameters
         Return response with backend metadata

Backend score computation

```rust
pub fn compute_backend_score(
    backend: &BackendConfig,
    ctx: &RoutingContext,
) -> f64 {
    let cost_score = 1.0 - (backend.estimated_cost_per_m_token / MAX_COST);
    let quality_score = backend.quality_rating; // from arena evals
    let latency_score = 1.0 - (backend.p90_latency_ms as f64 / ctx.max_latency_ms as f64);
    let feature_score = ctx.required_features.iter()
        .filter(|f| backend.supported_features.contains(f))
        .count() as f64
        / ctx.required_features.len().max(1) as f64;

    // Weights shift with cost sensitivity:
    let cs = ctx.cost_sensitivity;
    let cw = 0.25 + cs * 0.35;  // 0.25 -> 0.60
    let qw = 0.40 - cs * 0.20;  // 0.40 -> 0.20
    let lw = 0.15;
    let fw = 0.20 - cs * 0.10;  // 0.20 -> 0.10
    let total = cw + qw + lw + fw;

    cost_score * (cw / total)
        + quality_score * (qw / total)
        + latency_score * (lw / total)
        + feature_score * (fw / total)
}
```
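The hard-filter step (step 3 of the decision flow) that precedes this scoring can be sketched as follows; `Backend` and its fields are illustrative, not the gateway's actual types:

```rust
/// Sketch of step 3 (hard filters): backends failing a hard constraint are
/// excluded before any soft score is computed. Types are illustrative.
struct Backend { name: &'static str, private: bool, features: Vec<&'static str> }

fn hard_filter<'a>(
    backends: &'a [Backend],
    require_private: bool,
    required_features: &[&str],
) -> Vec<&'a Backend> {
    backends
        .iter()
        // Security class: a private request excludes non-private backends.
        .filter(|b| !require_private || b.private)
        // Required features: the backend must support every one.
        .filter(|b| required_features.iter().all(|f| b.features.contains(f)))
        .collect()
}

fn main() {
    let backends = vec![
        Backend { name: "blockrun", private: false, features: vec!["interleaved_thinking"] },
        Backend { name: "venice", private: true, features: vec!["web_search"] },
    ];
    // A private request needing web_search can only survive on Venice.
    let survivors = hard_filter(&backends, true, &["web_search"]);
    assert_eq!(survivors.len(), 1);
    assert_eq!(survivors[0].name, "venice");
}
```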

Two routing layers

The Golem’s resolver and Bardo Inference’s backend router are independent:

Golem's view:
  providers: [venice, bardo, directAnthropic]
  resolver: try venice -> try bardo -> try directAnthropic

Bardo Inference's internal view (invisible to Golem):
  backends: [blockrun, openrouter, operatorVenice]
  router: try blockrun -> try openrouter -> try operatorVenice

The Golem picks a provider. If that provider is Bardo Inference, Bardo Inference picks a backend. Clean separation.


Multi-model orchestration routing

Concrete routing: BlockRun only

heartbeat_t0 -> BlockRun/nvidia-gpt-oss-120b (FREE)
heartbeat_t1 -> BlockRun/gemini-3-flash ($0.50/M input)
heartbeat_t2 -> BlockRun/claude-opus-4-6 ($5/M, adaptive thinking)
risk          -> BlockRun/claude-opus-4-6 (interleaved thinking)
dream         -> BlockRun/deepseek-r1 ($0.55/M, visible <think>)
daimon        -> BlockRun/gemini-3-flash (cheapest, fast)
curator       -> BlockRun/claude-sonnet-4-6 ($3/M, structured outputs)
playbook      -> BlockRun/claude-sonnet-4-6 (full regeneration)
operator      -> BlockRun/claude-opus-4-6 (best quality)
death         -> BlockRun/deepseek-r1 (visible reasoning, maximum tokens)

Est. daily cost: ~$2.50

Concrete routing: full stack (all backends)

heartbeat_t0 -> Direct/local/qwen3-7b (FREE, zero latency)
heartbeat_t1 -> BlockRun/gemini-3-flash (cheapest, cached)
heartbeat_t2 -> BlockRun/claude-opus-4-6 (interleaved thinking, citations)
risk          -> BlockRun/claude-opus-4-6 + Bankr cross-model verify
dream         -> Venice/deepseek-r1 (visible, private, DIEM-funded)
daimon        -> Venice/llama-3.3-70b (private, fast)
curator       -> BlockRun/claude-sonnet-4-6 (citations for provenance)
context       -> OpenRouter/qwen-plus (/think toggle, cheap)
playbook      -> Direct/openai/gpt-5.4 (Predicted Outputs, 3x speed)
operator      -> Bankr/claude-opus-4-6 (self-funded from trading revenue)
death         -> Venice/deepseek-r1 (visible, private, DIEM, unlimited)
session       -> BlockRun/claude-opus-4-6 (Compaction with DeFi instructions)
batch dreams  -> Direct/anthropic/sonnet-4-6 (Batch API, 50% discount)

Est. daily cost: ~$1.50 (DIEM covers Venice, self-funding offsets Bankr)

Estimated daily costs by configuration

| Configuration | Est. Daily Cost | Notes |
|---|---|---|
| BlockRun only | ~$2.50 | Context engineering savings |
| BlockRun + OpenRouter | ~$2.30 | OpenRouter :floor for background tasks |
| BlockRun + Venice (DIEM staked) | ~$1.80 | Dreams/daimon via DIEM = free |
| Full stack (all backends) | ~$1.50 | Optimal routing per subsystem |
| Bankr self-sustaining | Net $0 | Revenue > cost |
| Venice DIEM-only | ~$0.00 | All inference via DIEM |
| Naive single-model (no Bardo Inference) | ~$85 | Every tick -> Opus with all tools |

Provider health and failover

Bardo Inference monitors all configured backends with periodic health checks. When the selected backend fails: (1) retry once on the same backend (transient error), (2) failover to the next-best backend that satisfies the request’s requirements, (3) degrade if all backends are down – return cached response if available, or error. The failover is invisible to the Golem.
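The three-step policy can be sketched as follows; the closure-based `call` and the types are illustrative, and the cached-response fallback in step 3 is omitted for brevity:

```rust
/// Sketch of the failover policy above: one retry on the selected backend
/// for transient errors, then the next candidate, then an error.
/// Candidates are assumed pre-filtered to healthy backends that satisfy
/// the request's requirements.
fn call_with_failover<F>(candidates: &[&str], mut call: F) -> Result<String, String>
where
    F: FnMut(&str) -> Result<String, String>,
{
    for (i, &backend) in candidates.iter().enumerate() {
        // Step 1: the first-choice backend gets one retry (transient error).
        let attempts = if i == 0 { 2 } else { 1 };
        for _ in 0..attempts {
            if let Ok(resp) = call(backend) {
                return Ok(resp); // Step 2: failover is invisible to the caller.
            }
        }
    }
    Err("all backends down".into()) // Step 3: degrade.
}

fn main() {
    let mut calls = 0;
    let result = call_with_failover(&["blockrun", "openrouter"], |backend| {
        calls += 1;
        if backend == "blockrun" { Err("timeout".into()) } else { Ok("ok".into()) }
    });
    assert_eq!(result, Ok("ok".to_string()));
    assert_eq!(calls, 3); // two failed blockrun attempts, then openrouter succeeds
}
```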

OpenRouter’s built-in provider fallback creates two layers of redundancy:

Request -> Bardo Inference -> BlockRun (down!) -> OpenRouter -> Provider A (down!) -> Provider B

References

  • [ROUTELLM-ICLR2025] Ong, I. et al. (2025). “RouteLLM: Learning to Route LLMs with Preference Data.” ICLR 2025. Demonstrates that a lightweight learned router can match GPT-4 quality at 2x lower cost by routing easy queries to weaker models; validates Bardo’s tiered routing approach.
  • [FRUGALGPT-TMLR2024] Chen, L. et al. (2024). “FrugalGPT.” TMLR. Shows that cascading LLM calls (try cheap first, escalate if uncertain) reduces cost by up to 98% with minimal quality loss; the theoretical basis for T0->T1->T2 escalation.
  • [WESTON-S2S1-2024] Weston, J. & Sukhbaatar, S. (2024). “Distilling System 2 into System 1.” arXiv:2407.06023. Proposes training fast models on slow-model reasoning traces to internalize deliberative behavior; informs the design of T0 FSM probes that capture patterns originally requiring T2 reasoning.
  • [DPT-AGENT-2025] Zhang, H. et al. (2025). “DPT-Agent: Dual Process Theory for Language Agents.” arXiv:2502.11882. Applies Kahneman’s dual-process theory to LLM agents, showing that separating fast intuitive responses from slow deliberative reasoning improves both cost and quality; directly maps to Bardo’s T0/T1 (System 1) and T2 (System 2) split.
  • [IBM-KVFlow] IBM/Google/Red Hat. “llm-d: KV-Cache-Aware Routing for LLM Inference.” Demonstrates routing requests to inference servers that already hold relevant KV-cache state, reducing time-to-first-token; informs Bardo’s KV-cache routing layer (L6).
  • [CHAIN-OF-RESPONSIBILITY] Gamma, E. et al. “Design Patterns.” 1994. The classic pattern where a request passes along a chain of handlers until one handles it; the structural basis for Bardo’s ordered provider resolution algorithm.

See 13-reasoning.md (unified reasoning chain integration: extended thinking, reasoning traces, and provider-agnostic chain-of-thought normalization) for how reasoning features map to provider selection. See 12-providers.md (five provider backends with full Rust trait implementations, self-describing resolution, and Venice private cognition deep-dive) for provider-specific parameter support and the Venice/Bankr deep integration specifications.