Inference Parameters: Temperature, Sampling, and Reasoning Policies

Version: 1.0 | Status: Draft | Type: SPEC (normative)

Parent: prd2/11-inference/

Crate: golem-inference

Depends on: golem-core, golem-daimon, golem-dreams, bardo-gateway, bardo-providers

Purpose: Define the complete inference parameter policy for every Golem subsystem across every cognitive state. Today, the inference gateway routes requests to the right model and provider (see 01a-routing.md), but it does not specify how the model should reason – what temperature to use, what sampling strategy, what reasoning effort, what response format. This document fills that gap. Written for a first-time reader with no assumed familiarity with the Bardo architecture.


Reader orientation: This document specifies the inference parameter policies for every Golem (a mortal autonomous DeFi agent managed by the Bardo runtime) subsystem across every cognitive state. It belongs to the Bardo Inference layer and defines temperature, sampling, reasoning effort, and response format per subsystem (heartbeat, risk, dream, Daimon, curator, playbook, operator, death). The key concept is that different subsystems need different inference parameters: risk assessment demands near-deterministic output (temperature 0.1), while dream cycles need creative recombination (temperature 0.8-0.9). For term definitions, see prd2/shared/glossary.md.

Why This Matters

An LLM call is not just a string of prompt text sent to a model. Every provider exposes parameters that fundamentally alter the character of the response: temperature controls randomness, sampling strategies control which tokens are reachable, reasoning effort controls how deeply the model thinks before answering, and response format controls whether the output is free text or structured data. Using the same parameters for every call – the naive default – wastes money, produces worse results, and misses provider-specific features that exist precisely to solve the problems each Golem subsystem faces.

The Golem has 12+ subsystems that make inference calls, each with radically different needs. The risk engine needs maximum precision and zero creativity. Hypnagogic onset needs maximum creativity and loosened constraints. The Curator needs structured data extraction. The death testament needs deep, visible reasoning. Treating all of these as “send prompt, get text” is like using the same wrench for every bolt.

This document defines the InferenceProfile – a per-subsystem, per-cognitive-state specification of exactly which parameters to set on every inference call. The profile is attached to the Intent (see 01a-routing.md) and applied by the gateway after provider resolution, before the call reaches the backend.


The InferenceProfile

```rust
/// Complete inference parameter specification for a single call.
/// Attached to the Intent by the subsystem, applied by the gateway.
///
/// All fields are Option<T>: None means "use provider default."
/// The gateway merges the profile with provider-specific defaults
/// and capabilities before sending the request.
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
pub struct InferenceProfile {
    // ── Sampling ──────────────────────────────────────────────
    /// Temperature: controls randomness.
    /// 0.0 = deterministic, 2.0 = maximum randomness.
    /// None = provider default (typically 1.0).
    pub temperature: Option<f32>,

    /// Top-p (nucleus sampling): cumulative probability threshold.
    /// 0.1 = very focused, 1.0 = no filtering.
    /// Mutually exclusive with top_k in practice.
    pub top_p: Option<f32>,

    /// Top-k: number of highest-probability tokens to consider.
    /// 1 = greedy, 100+ = broad. Not all providers support this.
    pub top_k: Option<u32>,

    /// Min-p: dynamic probability floor relative to top token.
    /// 0.1 = tokens with <10% of top token's probability are excluded.
    /// More principled than fixed top-p. Venice + open models support this.
    pub min_p: Option<f32>,

    /// Frequency penalty: penalizes tokens that appear frequently.
    /// 0.0 = no penalty, 2.0 = strong penalty.
    pub frequency_penalty: Option<f32>,

    /// Presence penalty: penalizes tokens that have appeared at all.
    /// 0.0 = no penalty, 2.0 = strong penalty.
    pub presence_penalty: Option<f32>,

    // ── Reasoning ─────────────────────────────────────────────
    /// Reasoning effort: controls depth of chain-of-thought.
    /// Maps to Venice's reasoning_effort, Anthropic's extended thinking,
    /// OpenAI's reasoning effort parameter.
    /// None = provider default. Some models always reason.
    pub reasoning_effort: Option<ReasoningEffort>,

    /// Whether to request visible thinking/reasoning traces.
    /// Venice: reasoning_content field. Anthropic: thinking blocks.
    /// None = provider default.
    pub visible_thinking: Option<bool>,

    // ── Output format ─────────────────────────────────────────
    /// Structured output schema (JSON Schema).
    /// When set, the provider enforces this schema on the response.
    /// Falls back gracefully to free text + parsing if unsupported.
    pub response_schema: Option<ResponseSchema>,

    /// Maximum output tokens.
    pub max_tokens: Option<u32>,

    /// Stop sequences.
    pub stop_sequences: Option<Vec<String>>,

    // ── Caching ───────────────────────────────────────────────
    /// Prompt cache key for session affinity.
    /// Venice/Anthropic: routes to same server for cache hits.
    pub prompt_cache_key: Option<String>,

    /// Explicit cache control markers for Anthropic models.
    /// When true, the gateway auto-adds cache_control to system prompts
    /// and long static content blocks.
    pub cache_control: Option<bool>,

    // ── Provider-specific ─────────────────────────────────────
    /// Venice-specific: enable web search for this call.
    pub web_search: Option<bool>,

    /// Venice-specific: TEE (Trusted Execution Environment) mode.
    /// Ensures inference runs in an encrypted enclave.
    pub tee_mode: Option<bool>,

    /// OpenAI-specific: predicted output for diffing (PLAYBOOK edits).
    pub predicted_output: Option<String>,

    /// Seed for reproducibility. Not all providers support this.
    pub seed: Option<u64>,
}

/// Reasoning effort levels, normalized across providers.
/// The gateway maps these to provider-specific values.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum ReasoningEffort {
    /// No reasoning. Fast, cheap. Use for simple classification.
    None,
    /// Minimal reasoning. Quick chain-of-thought.
    Low,
    /// Balanced reasoning. Default for most tasks.
    Medium,
    /// Deep reasoning. For complex analysis and decisions.
    High,
    /// Maximum reasoning. For critical decisions (risk, death).
    Max,
}

/// Structured output schema specification.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ResponseSchema {
    /// Schema name (for provider registration).
    pub name: String,
    /// JSON Schema definition.
    pub schema: serde_json::Value,
    /// Whether strict mode is required.
    /// Venice/OpenAI: strict=true. Falls back to prompt-guided if unsupported.
    pub strict: bool,
}
```

Provider Parameter Mapping

The gateway translates InferenceProfile fields to provider-specific API parameters. Not all providers support all fields. The gateway applies the best available approximation and records what was degraded.

| Profile Field | Venice | Anthropic (Direct/BlockRun) | OpenAI (Direct/BlockRun) | Bankr | OpenRouter |
|---|---|---|---|---|---|
| temperature | temperature | temperature | temperature | temperature | temperature |
| top_p | top_p | top_p | top_p | top_p | top_p |
| top_k | top_k | top_k | ✗ (ignored) | ✗ (passthrough) | top_k |
| min_p | min_p | ✗ (use top_p approx) | ✗ (use top_p approx) | — | model-dependent |
| frequency_penalty | frequency_penalty | ✗ (ignored) | frequency_penalty | passthrough | passthrough |
| presence_penalty | presence_penalty | ✗ (ignored) | presence_penalty | passthrough | passthrough |
| reasoning_effort | reasoning.effort | thinking.budget_tokens | reasoning_effort | passthrough | model-dependent |
| visible_thinking | reasoning_content field | thinking blocks | reasoning field | passthrough | model-dependent |
| response_schema | response_format.json_schema | tool_use workaround | response_format.json_schema | passthrough | model-dependent |
| prompt_cache_key | prompt_cache_key | ✗ (auto by prefix) | ✗ (auto by prefix) | — | — |
| cache_control | cache_control on blocks | cache_control on blocks | ✗ (auto) | — | — |
| web_search | venice_parameters.web_search | — | — | — | — |
| tee_mode | model suffix -tee | — | — | — | — |
| predicted_output | — | — | prediction.content | — | — |
| seed | seed | — | seed | — | seed |

Graceful degradation rules

When a profile field is not supported by the resolved provider:

  1. temperature, top_p, max_tokens: Always supported. No degradation.
  2. top_k, min_p: If unsupported, approximate via top_p. Log degradation.
  3. reasoning_effort: If unsupported, map to temperature/prompt adjustments. High -> add “Think step by step” to system prompt. None -> add “Answer directly without explanation.”
  4. response_schema: If unsupported, fall back to prompt-guided JSON + post-parse validation. See 01-structured-outputs.md.
  5. web_search, tee_mode, predicted_output: Provider-exclusive. If provider doesn’t support, field is silently dropped. Logged in Resolution.degraded.
  6. prompt_cache_key, cache_control: Provider-exclusive caching. If unsupported, ignored. No functional degradation – just higher cost.
```rust
/// Apply InferenceProfile to a provider request, handling degradation.
pub fn apply_profile(
    request: &mut ProviderRequest,
    profile: &InferenceProfile,
    provider: &dyn Provider,
) -> Vec<String> {
    let mut degraded = Vec::new();
    let caps = provider.capabilities();

    // Temperature: always supported
    if let Some(t) = profile.temperature {
        request.set_temperature(t);
    }

    // Reasoning effort: provider-specific mapping
    if let Some(effort) = profile.reasoning_effort {
        if caps.supports_reasoning_effort {
            request.set_reasoning_effort(effort);
        } else if caps.supports_reasoning {
            // Map effort to budget_tokens for Anthropic
            let budget = match effort {
                ReasoningEffort::None => 0,
                ReasoningEffort::Low => 1024,
                ReasoningEffort::Medium => 4096,
                ReasoningEffort::High => 16384,
                ReasoningEffort::Max => 65536,
            };
            request.set_thinking_budget(budget);
        } else {
            // Fallback: prompt-level reasoning guidance
            match effort {
                ReasoningEffort::None => {
                    request.prepend_system("Answer directly. Do not explain your reasoning.");
                }
                ReasoningEffort::High | ReasoningEffort::Max => {
                    request.prepend_system(
                        "Think through this step by step. Show your reasoning."
                    );
                }
                _ => {} // Medium/Low: no modification
            }
            degraded.push(format!("reasoning_effort:{:?} -> prompt fallback", effort));
        }
    }

    // Structured output: see 01-structured-outputs.md for full logic
    if let Some(ref schema) = profile.response_schema {
        if caps.supports_response_schema {
            request.set_response_format(schema);
        } else {
            // Fallback: inject schema into prompt, parse output
            request.append_system(&format!(
                "\n\nRespond ONLY with valid JSON matching this schema:\n{}",
                serde_json::to_string_pretty(&schema.schema).unwrap()
            ));
            degraded.push("response_schema -> prompt-guided JSON".into());
        }
    }

    // Min-p: Venice and some open models only
    if let Some(mp) = profile.min_p {
        if caps.supports_min_p {
            request.set_min_p(mp);
        } else {
            // Approximate: min_p 0.1 ≈ top_p 0.9
            let approx_top_p = 1.0 - mp;
            request.set_top_p(approx_top_p);
            degraded.push(format!("min_p:{} -> top_p:{}", mp, approx_top_p));
        }
    }

    // Prompt caching: Venice and Anthropic
    if let Some(ref key) = profile.prompt_cache_key {
        if caps.supports_prompt_cache_key {
            request.set_prompt_cache_key(key);
        }
        // No degradation — just cost impact
    }

    if profile.cache_control.unwrap_or(false) {
        if caps.supports_cache_control {
            request.add_cache_control_markers();
        }
    }

    // Web search: Venice only
    if profile.web_search.unwrap_or(false) {
        if caps.supports_web_search {
            request.set_web_search(true);
        } else {
            degraded.push("web_search -> unsupported".into());
        }
    }

    // Predicted output: OpenAI only
    if let Some(ref predicted) = profile.predicted_output {
        if caps.supports_predicted_output {
            request.set_predicted_output(predicted);
        } else {
            degraded.push("predicted_output -> unsupported".into());
        }
    }

    degraded
}
```

Subsystem Parameter Table

The master table. Every subsystem, every parameter, with rationale.

Waking Subsystems

| Subsystem | Temperature | Sampling | Reasoning Effort | Structured Output | Cache | Rationale |
|---|---|---|---|---|---|---|
| heartbeat_t1 | 0.3 | top_p=0.9 | Low | HeartbeatDecision schema | cache_control | Fast, cheap, focused. Low creativity needed. Structured output extracts action/severity cleanly. |
| heartbeat_t2 | 0.5 | top_p=0.95 | High | HeartbeatDecision schema | cache_control | Novel situation needs deeper reasoning. Higher temp allows consideration of less obvious options. |
| risk | 0.1 | top_p=0.85, top_k=40 | Max | RiskAssessment schema | cache_control | Maximum precision. Near-deterministic. Structured output ensures all five risk layers are evaluated. Never degraded by mortality pressure. |
| daimon | 0.4 | top_p=0.9 | Low | DaimonAppraisal schema | none | Emotional appraisal needs consistency but not rigidity. PAD vector extraction via structured output. Privacy preferred (Venice). |
| daimon_complex | 0.6 | top_p=0.95 | High | DaimonAppraisal schema | none | Complex emotional situations need deeper processing. Visible thinking captures reasoning chain. Privacy required (Venice). |
| curator | 0.3 | top_p=0.9 | Medium | CuratorEvaluation schema | cache_control | Systematic evaluation. Structured output extracts quality scores, retention decisions, cross-references. |
| playbook | 0.4 | top_p=0.9 | Medium | None (free text) | predicted_output | PLAYBOOK.md edits are free-text diffs. OpenAI's predicted_output saves tokens by diffing against current PLAYBOOK. |
| operator | 0.7 | top_p=0.95 | High | None (free text) | cache_control | Owner chat. Natural language, higher creativity for explanations. Never degraded by mortality pressure. |
| mind_wandering | 0.8 | min_p=0.1 | None | None (free text) | none | Brief reverie during waking. Loosened constraints. Cheap (T0/T1). No reasoning overhead needed. |
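To make the table concrete, a hypothetical lookup helper (the function name and the subset of rows shown are illustrative, not part of the normative API) might encode the temperature column like this:

```rust
/// Illustrative only: default temperatures for a few waking subsystems,
/// taken from the table above. Remaining rows omitted for brevity.
fn waking_temperature(subsystem: &str) -> Option<f32> {
    match subsystem {
        "heartbeat_t1" => Some(0.3),
        "heartbeat_t2" => Some(0.5),
        "risk" => Some(0.1),
        "operator" => Some(0.7),
        "mind_wandering" => Some(0.8),
        _ => None,
    }
}

fn main() {
    assert_eq!(waking_temperature("risk"), Some(0.1));
    assert_eq!(waking_temperature("unknown_subsystem"), None);
}
```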

Dream Subsystems

| Subsystem | Temperature | Sampling | Reasoning Effort | Structured Output | Cache | Special |
|---|---|---|---|---|---|---|
| dream_nrem (replay) | 0.4 | top_p=0.9 | Medium | ReplayAnalysis schema | prompt_cache_key per cycle | Systematic replay. Structured output extracts lessons, surprise scores, counterfactual markers. |
| dream_rem (imagination) | 0.9 | min_p=0.1 | High | None (free text) | prompt_cache_key | Creative scenario generation. High temperature + min-p for principled diversity. Web search enabled (Venice). |
| dream_rem_creative | 1.2 | min_p=0.08 | Medium | None (free text) | none | Boden-mode creative recombination. Highest temperature in the waking/dream cycle. |
| dream_integration | 0.3 | top_p=0.85 | High | DreamIntegration schema | tee_mode | Consolidation. Analytical. Structured output extracts promoted/staged/discarded decisions with rationale. TEE for attestation. |
| dream_threat | 0.5 | top_p=0.9 | High | ThreatAssessment schema | prompt_cache_key | Threat rehearsal. Balanced creativity (to imagine novel attacks) with analytical depth. |

Hypnagogic Subsystems

| Subsystem | Temperature | Sampling | Reasoning Effort | Structured Output | Cache | Special |
|---|---|---|---|---|---|---|
| hypnagogic_induction | 1.0-1.2 | min_p=0.1 | None | None (free text) | none | Initial associative scan. Executive loosening. No reasoning – raw association. Temperature ramps within session. |
| hypnagogic_dali | 1.2-1.5 | min_p=0.08 | None | None (free text) | none | Peak creative range. Dali interrupt: 50-100 token partials at max temperature. Highest temp in the entire system. |
| hypnagogic_observer | 0.3 | top_p=0.85 | None | FragmentEvaluation schema | none | HomuncularObserver. Analytical evaluation of fragments. Structured output for novelty/relevance/coherence scores. Cheapest tier (T0). |
| hypnagogic_capture | 0.5 | top_p=0.9 | Low | CaptureResult schema | none | Lucid capture. Moderate analytical. Structured output for promote/stage/discard decisions. |
| hypnopompic_return | 0.6 | top_p=0.9 | Low | None (free text) | none | Gradual re-engagement. Slightly creative to allow dream insights to surface before full analytical reassertion. |

Terminal Subsystems

| Subsystem | Temperature | Sampling | Reasoning Effort | Structured Output | Cache | Special |
|---|---|---|---|---|---|---|
| death_reflect | 0.5 | top_p=0.95 | Max | None (free text) | none | Death Protocol Phase II. Maximum reasoning for honest self-assessment. Free text for narrative quality. Visible thinking required. Venice (privacy + visible thinking). |
| death_testament | 0.4 | top_p=0.9 | High | DeathTestament schema (partial) | tee_mode | Death Protocol Phase III. Structured for machine-parseable sections (metrics, heuristics, warnings). Free text for reflection narrative. TEE for sealed attestation. |

Temperature Scheduling Within Sessions

Some subsystems use temperature annealing – the temperature changes within a single inference session or across a sequence of related calls. This is distinct from the static per-subsystem temperature above.

Hypnagogic cosine annealing

Each Dali cycle within hypnagogic onset follows a cosine schedule:

```text
Cycle start:  T = T_high (1.2-1.5)   <- Peak creative range
Mid-cycle:    T = T_mid  (0.8-1.0)   <- Transitional
Cycle end:    T = T_low  (0.3-0.5)   <- Evaluation (HomuncularObserver)
Reanneal:     T = T_high * 0.8       <- Next cycle starts slightly cooler
```
```rust
/// Cosine temperature annealing for Dali cycles.
pub fn dali_temperature(
    step: usize,
    total_steps: usize,
    t_high: f32,
    t_low: f32,
) -> f32 {
    let progress = step as f32 / total_steps as f32;
    let cosine = (1.0 + (progress * std::f32::consts::PI).cos()) / 2.0;
    t_low + (t_high - t_low) * cosine
}

/// Reanneal after fragment capture.
/// Each successive cycle starts slightly cooler, modeling
/// natural descent toward sleep.
pub fn reanneal(cycle: u8, base_high: f32) -> f32 {
    let decay = 0.95_f32.powi(cycle as i32);
    base_high * decay
}
```
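A quick sanity check of the schedule's endpoints: at step 0 the cosine factor is 1, so the output is t_high; at the final step the factor is (near) 0, so it is t_low. Reproducing dali_temperature in a runnable sketch:

```rust
fn dali_temperature(step: usize, total_steps: usize, t_high: f32, t_low: f32) -> f32 {
    let progress = step as f32 / total_steps as f32;
    let cosine = (1.0 + (progress * std::f32::consts::PI).cos()) / 2.0;
    t_low + (t_high - t_low) * cosine
}

fn main() {
    // Cycle start: peak creative range.
    assert!((dali_temperature(0, 10, 1.4, 0.4) - 1.4).abs() < 1e-5);
    // Cycle end: evaluation temperature.
    assert!((dali_temperature(10, 10, 1.4, 0.4) - 0.4).abs() < 1e-5);
}
```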

Dream phase transitions

The dream cycle transitions between temperatures across phases:

```text
NREM (replay):      T = 0.4 (analytical, systematic)
    | gradual increase
REM (imagination):  T = 0.9-1.2 (creative, exploratory)
    | sharp decrease
Integration:        T = 0.3 (analytical, consolidating)
```

The temperature transition is not instantaneous – the first REM call starts at 0.7 and ramps to 0.9 over 3-4 calls. This prevents jarring cognitive mode shifts.
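A minimal sketch of that ramp; the linear shape and the helper name are illustrative (the text above only fixes the 0.7 start, the 0.9 target, and the 3-4 call window):

```rust
/// Illustrative REM warm-up: call 0 starts at 0.7, calls 3+ run at 0.9.
fn rem_ramp_temperature(call_index: usize) -> f32 {
    let progress = call_index.min(3) as f32 / 3.0;
    0.7 + (0.9 - 0.7) * progress
}

fn main() {
    assert!((rem_ramp_temperature(0) - 0.7).abs() < 1e-5);
    assert!((rem_ramp_temperature(3) - 0.9).abs() < 1e-5);
    // Later calls stay at the full REM temperature.
    assert!((rem_ramp_temperature(10) - 0.9).abs() < 1e-5);
}
```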

Mortality-aware temperature adjustment

A dying Golem’s temperature is compressed toward the analytical end for all non-exempt subsystems:

```rust
/// Compress temperature range based on mortality pressure.
/// As vitality drops, temperature moves toward analytical (lower).
/// Exempt: hypnagogia (creativity is the point), death (max effort).
pub fn apply_mortality_temperature(
    base_temp: f32,
    vitality: f64,
    subsystem: &str,
) -> f32 {
    let exempt = ["hypnagogic_induction", "hypnagogic_dali",
                  "death_reflect", "death_testament", "operator"];
    if exempt.contains(&subsystem) { return base_temp; }

    let pressure = (1.0 - vitality as f32).max(0.0);
    // Compress toward 0.3 (analytical floor) as pressure increases
    let analytical_floor = 0.3;
    base_temp - (base_temp - analytical_floor) * pressure * 0.5
}
```
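Worked example, reusing apply_mortality_temperature from above: at vitality 0.2 the pressure is 0.8, so a non-exempt subsystem running at base 0.8 is compressed to 0.8 - (0.8 - 0.3) * 0.8 * 0.5 = 0.6, while an exempt subsystem such as operator keeps its base temperature:

```rust
pub fn apply_mortality_temperature(base_temp: f32, vitality: f64, subsystem: &str) -> f32 {
    let exempt = ["hypnagogic_induction", "hypnagogic_dali",
                  "death_reflect", "death_testament", "operator"];
    if exempt.contains(&subsystem) { return base_temp; }
    let pressure = (1.0 - vitality as f32).max(0.0);
    let analytical_floor = 0.3;
    base_temp - (base_temp - analytical_floor) * pressure * 0.5
}

fn main() {
    // Dying Golem (vitality 0.2): mind_wandering compressed 0.8 -> 0.6.
    assert!((apply_mortality_temperature(0.8, 0.2, "mind_wandering") - 0.6).abs() < 1e-5);
    // Exempt subsystem: unchanged.
    assert!((apply_mortality_temperature(0.7, 0.2, "operator") - 0.7).abs() < 1e-5);
}
```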

Reasoning Effort Policies

When to use each level

| Level | Cost Multiplier | When | Example Subsystems |
|---|---|---|---|
| None | 1x (no reasoning tokens) | Simple classification, scoring, fragment generation | hypnagogic_induction, hypnagogic_dali, mind_wandering |
| Low | ~1.2x | Routine decisions, emotional appraisals | heartbeat_t1, daimon, hypnagogic_capture |
| Medium | ~1.5-2x | Balanced analysis, knowledge evaluation | curator, dream_nrem, playbook |
| High | ~2-4x | Complex decisions, creative development, threat analysis | heartbeat_t2, daimon_complex, dream_rem, dream_threat |
| Max | ~4-8x | Critical decisions, death reflection, risk assessment | risk, death_reflect |
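Treating the multiplier ranges as rough midpoints, the marginal cost of raising effort can be estimated. The midpoint values below are illustrative, not normative:

```rust
/// Illustrative midpoints of the cost-multiplier ranges in the table.
fn effort_cost_multiplier(effort: &str) -> f64 {
    match effort {
        "none" => 1.0,
        "low" => 1.2,
        "medium" => 1.75,
        "high" => 3.0,
        "max" => 6.0,
        _ => 1.0,
    }
}

fn main() {
    // A call costing $0.001 at effort None costs roughly 6x at Max.
    let base_usd = 0.001;
    let max_usd = base_usd * effort_cost_multiplier("max");
    assert!((max_usd - 0.006).abs() < 1e-9);
}
```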

Provider mapping for reasoning effort

```rust
/// Map normalized ReasoningEffort to provider-specific parameters.
pub fn map_reasoning_effort(
    effort: ReasoningEffort,
    provider: &str,
    model: &str,
) -> ReasoningParams {
    match provider {
        "venice" => {
            // Venice uses string-valued reasoning_effort
            let level = match effort {
                ReasoningEffort::None => "none",
                ReasoningEffort::Low => "low",
                ReasoningEffort::Medium => "medium",
                ReasoningEffort::High => "high",
                ReasoningEffort::Max => {
                    // "max" only supported on Claude Opus 4.6 via Venice
                    if model.contains("opus-4-6") { "max" } else { "high" }
                }
            };
            ReasoningParams::Venice { effort: level.into() }
        }
        "anthropic" | "blockrun" => {
            // Anthropic uses budget_tokens for extended thinking
            let budget = match effort {
                ReasoningEffort::None => 0,
                ReasoningEffort::Low => 1024,
                ReasoningEffort::Medium => 4096,
                ReasoningEffort::High => 16384,
                ReasoningEffort::Max => 65536,
            };
            ReasoningParams::Anthropic { budget_tokens: budget }
        }
        "openai" | "direct_openai" => {
            // OpenAI uses string-valued reasoning_effort
            let level = match effort {
                ReasoningEffort::None => "none",
                ReasoningEffort::Low => "low",
                ReasoningEffort::Medium => "medium",
                ReasoningEffort::High => "high",
                ReasoningEffort::Max => "xhigh", // OpenAI's "extra high"
            };
            ReasoningParams::OpenAI { effort: level.into() }
        }
        "bankr" => {
            // Bankr passes through to underlying provider.
            // Detect underlying provider from model name.
            if model.starts_with("claude") {
                map_reasoning_effort(effort, "anthropic", model)
            } else if model.starts_with("gpt") {
                map_reasoning_effort(effort, "openai", model)
            } else {
                // Gemini, Kimi, Qwen: use low/medium/high if supported
                let level = match effort {
                    ReasoningEffort::None => "none",
                    ReasoningEffort::Low => "low",
                    ReasoningEffort::Medium => "medium",
                    ReasoningEffort::High | ReasoningEffort::Max => "high",
                };
                ReasoningParams::Generic { effort: level.into() }
            }
        }
        _ => ReasoningParams::Unsupported,
    }
}
```

Prompt Caching Strategy

Session-level caching

Every Golem uses a persistent prompt_cache_key derived from its Golem ID. This ensures that sequential inference calls within the same Golem’s lifecycle hit the same server with warm cache, maximizing cache hit rates for the system prompt, PLAYBOOK.md, and STRATEGY.md that are present in every call.

```rust
/// Generate prompt cache key for a Golem.
pub fn golem_cache_key(golem_id: &str, subsystem: &str) -> String {
    // Group by subsystem to maximize prefix sharing.
    // All heartbeat calls share one cache; all dream calls share another.
    format!("golem-{}-{}", golem_id, subsystem)
}
```

Cache-aligned prompt structure

The 8-layer context engineering pipeline (see 04-context-engineering.md) already optimizes for cache hits by placing static content first. The InferenceProfile reinforces this:

| Position | Content | Cached? | Changes? |
|---|---|---|---|
| 1 | System prompt (identity, archetype) | yes | Never (within a lifecycle) |
| 2 | STRATEGY.md (owner-authored) | yes | Rarely (owner edits) |
| 3 | PLAYBOOK.md (evolved heuristics) | yes | Every 50 ticks (Curator cycle) |
| 4 | Tool definitions (pruned) | no | Per-tick (dynamic pruning) |
| 5 | Retrieved Grimoire entries | no | Per-tick |
| 6 | Current market context | no | Per-tick |
| 7 | User message / query | no | Per-tick |

Items 1-3 are cache-eligible (thousands of tokens, stable across ticks). Items 4-7 are dynamic. The gateway auto-adds cache_control: { type: "ephemeral" } markers at the boundary between static and dynamic content for Anthropic models, and uses prompt_cache_key for Venice.
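A sketch of that boundary rule with hypothetical types (Block and cache_boundary are illustrative, not part of the gateway API): the marker belongs on the last static block, after which all content changes per-tick.

```rust
/// Illustrative context block: only the static/dynamic flag matters here.
struct Block {
    is_static: bool,
}

/// Index of the block that should carry the cache_control marker:
/// the last static block before dynamic content begins.
fn cache_boundary(blocks: &[Block]) -> Option<usize> {
    blocks.iter().rposition(|b| b.is_static)
}

fn main() {
    // Positions 1-3 static (system prompt, STRATEGY.md, PLAYBOOK.md),
    // positions 4-7 dynamic.
    let blocks: Vec<Block> = [true, true, true, false, false, false, false]
        .iter()
        .map(|&is_static| Block { is_static })
        .collect();
    assert_eq!(cache_boundary(&blocks), Some(2));
}
```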

Cache economics per provider

| Provider | Cache Discount | Write Premium | Min Tokens | TTL | Auto-managed? |
|---|---|---|---|---|---|
| Venice (Claude) | 90% | 25% | ~4,000 | 5 min | yes (gateway adds markers) |
| Venice (other) | 50-90% | None | ~1,024 | 5 min | yes (auto by prefix) |
| Anthropic (Direct) | 90% | 25% | ~4,000 | 5 min | yes (gateway adds markers) |
| OpenAI (Direct) | 90% | None | 1,024 | 5-10 min | yes (auto by prefix) |
| Bankr | Passthrough (depends on underlying) | Passthrough | Passthrough | Passthrough | yes |
| BlockRun | Passthrough | Passthrough | Passthrough | Passthrough | yes |

Economic impact: For a Golem making 20 T1+ calls/day with a 3,000-token system prompt, prompt caching saves approximately 90% on the static prefix. At Claude Haiku rates, this is ~$0.02/day saved – modest per-Golem but significant across a Clade.
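The arithmetic behind that estimate, with an illustrative Haiku-class input rate (the exact per-million-token price is an assumption here, not part of the spec):

```rust
fn main() {
    let calls_per_day = 20.0_f64;
    let static_prefix_tokens = 3_000.0; // cached system prompt prefix
    let usd_per_million_input = 0.25;   // illustrative Haiku-class rate
    let cache_discount = 0.90;          // 90% off cached tokens

    let daily_prefix_tokens = calls_per_day * static_prefix_tokens; // 60,000
    let full_cost = daily_prefix_tokens / 1_000_000.0 * usd_per_million_input;
    let savings = full_cost * cache_discount;
    // On the order of $0.01-0.02/day, matching the estimate above.
    assert!((savings - 0.0135).abs() < 1e-9);
}
```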


Venice-Specific Deep Integration

Web search during REM dreams

Venice’s web search feature is enabled during the REM imagination phase to allow the Golem to incorporate current market information into its creative scenario generation. This is controlled by DreamVeniceConfig.web_search_enabled and capped by web_search_budget_per_cycle_usdc.

```rust
/// Build REM inference profile with web search.
pub fn rem_profile(config: &DreamVeniceConfig) -> InferenceProfile {
    InferenceProfile {
        temperature: Some(0.9),
        min_p: Some(0.1),
        reasoning_effort: Some(ReasoningEffort::High),
        visible_thinking: Some(true), // Capture reasoning chain
        web_search: Some(config.web_search_enabled),
        prompt_cache_key: Some(format!("dream-rem-{}", config.golem_id)),
        ..Default::default()
    }
}
```

Web search triggers are contextual: the REM imagination engine identifies when a counterfactual scenario involves a protocol or token the Golem has limited knowledge about, and constructs a focused search query. Results are injected into the scenario context, not the system prompt (to avoid cache invalidation).

TEE mode for death testaments

The death testament – the Golem’s final knowledge artifact – can optionally be generated inside a Trusted Execution Environment to provide cryptographic attestation that the testament was produced by the dying Golem’s own reasoning, not modified post-hoc. Venice’s TEE models are selected by appending -tee to the model ID.

```rust
/// Death testament inference profile.
pub fn death_testament_profile(config: &GolemConfig) -> InferenceProfile {
    InferenceProfile {
        temperature: Some(0.4),
        top_p: Some(0.9),
        reasoning_effort: Some(ReasoningEffort::High),
        visible_thinking: Some(true),
        tee_mode: Some(config.sealed_testament),
        // Partial structured output: metrics + heuristics are structured,
        // narrative reflection is free text.
        response_schema: Some(ResponseSchema {
            name: "death_testament".into(),
            schema: death_testament_schema(),
            strict: true,
        }),
        ..Default::default()
    }
}
```

Venice embeddings for the Grimoire

The Golem’s episodic memory (LanceDB) and the HomuncularObserver’s novelty scoring both require embedding generation. Venice’s embeddings endpoint (text-embedding-bge-m3) provides a privacy-preserving alternative to the gateway’s local ONNX embedding model (nomic-embed-text-v1.5).

The choice is configurable: local embeddings (default, zero cost, ~768-dim) or Venice embeddings (API cost, potentially higher quality, privacy-preserving since Venice retains no data).

```rust
/// Embedding provider selection.
pub enum EmbeddingProvider {
    /// Local ONNX model. Default. Zero cost. ~5ms latency.
    Local,
    /// Venice API. API cost. ~50ms latency. Zero data retention.
    Venice { model: String },
}

impl EmbeddingProvider {
    pub async fn embed(&self, text: &str) -> Result<Vec<f32>> {
        match self {
            Self::Local => {
                // fastembed-rs with nomic-embed-text-v1.5.
                // Note: in production the model is loaded once and reused;
                // it is constructed inline here only for brevity.
                let model = fastembed::TextEmbedding::try_new(Default::default())?;
                let embeddings = model.embed(vec![text], None)?;
                Ok(embeddings[0].clone())
            }
            Self::Venice { model } => {
                let response = reqwest::Client::new()
                    .post("https://api.venice.ai/api/v1/embeddings")
                    .header("Authorization", format!("Bearer {}", venice_api_key()))
                    .json(&serde_json::json!({
                        "model": model,
                        "input": text,
                        "encoding_format": "float"
                    }))
                    .send()
                    .await?;
                let body: EmbeddingResponse = response.json().await?;
                Ok(body.data[0].embedding.clone())
            }
        }
    }
}
```

Integration: How Profiles Flow Through the System

```text
Subsystem (e.g., heartbeat_t2)
    |
    +- builds Intent { quality: High, prefer: ["interleaved_thinking"] }
    +- builds InferenceProfile { temperature: 0.5, reasoning_effort: High, ... }
    |
    v
ModelRouter extension (golem-inference)
    |
    +- applies mortality pressure to Intent
    +- applies mortality temperature adjustment to Profile
    +- resolves Intent -> Resolution (provider + model)
    |
    v
Bardo Gateway (bardo-gateway)
    |
    +- 8-layer context engineering pipeline
    +- apply_profile() -> maps Profile to provider-specific params
    +- records degradations in Resolution.degraded
    +- adds cache_control markers if applicable
    |
    v
Provider Backend (Venice / BlockRun / Bankr / Direct)
    |
    +- receives fully parameterized request
    +- returns response with usage stats
    |
    v
Gateway post-processing
    |
    +- extracts reasoning_content if visible_thinking enabled
    +- validates structured output against schema if response_schema set
    +- records cost, latency, cache hit rate
    +- emits InferenceEnd event with all metadata
    |
    v
Subsystem receives typed response
```

Events emitted

| Event | Trigger | Payload |
|---|---|---|
| inference:profile_applied | Profile parameters set on request | { subsystem, temperature, reasoning_effort, structured, cached } |
| inference:profile_degraded | One or more profile fields unsupported | { subsystem, degraded: ["min_p -> top_p", ...] } |
| inference:schema_validated | Structured output matches schema | { subsystem, schema_name, valid: bool } |
| inference:schema_fallback | Schema unsupported, fell back to prompt-guided | { subsystem, schema_name } |
| inference:reasoning_captured | Visible thinking extracted from response | { subsystem, reasoning_tokens, content_tokens } |
| inference:web_search_used | Venice web search triggered | { subsystem, query, results_count } |
| inference:cache_stats | Per-call cache statistics | { subsystem, cached_tokens, total_tokens, savings_usd } |

Configuration

```toml
# bardo.toml -- inference profile overrides
[inference.profiles]
# Override any subsystem's default profile.
# Unspecified fields use the defaults from this document.

[inference.profiles.heartbeat_t1]
temperature = 0.3
reasoning_effort = "low"

[inference.profiles.dream_rem]
temperature = 1.0   # Owner wants more creative dreams
min_p = 0.08

[inference.profiles.risk]
# Risk is never overridden. This section is ignored with a warning.
# Safety-critical subsystems have locked profiles.

[inference.embeddings]
provider = "local"   # "local" or "venice"
venice_model = "text-embedding-bge-m3"

[inference.caching]
auto_cache_control = true
prompt_cache_key_prefix = "golem"
```

Locked profiles

The following subsystems have locked profiles that cannot be overridden by owner configuration:

  • risk: Temperature 0.1, reasoning Max. Safety-critical. Always maximum precision.
  • death_reflect: Temperature 0.5, reasoning Max. The Golem’s final honest self-assessment cannot be constrained.
  • operator: Temperature 0.7, reasoning High. Owner communication quality is never degraded.

Attempting to override a locked profile logs a warning and uses the locked defaults.
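A sketch of that enforcement with hypothetical names (effective_temperature is illustrative; the real gateway merges full profiles, not just temperature, and logs the warning):

```rust
const LOCKED: [&str; 3] = ["risk", "death_reflect", "operator"];

/// Illustrative: owner overrides for locked subsystems are ignored
/// and the locked default wins.
fn effective_temperature(subsystem: &str, locked_default: f32, override_temp: Option<f32>) -> f32 {
    if LOCKED.contains(&subsystem) {
        // Real system: log a warning about the ignored override here.
        locked_default
    } else {
        override_temp.unwrap_or(locked_default)
    }
}

fn main() {
    // Owner tries to make risk creative; the locked 0.1 wins.
    assert_eq!(effective_temperature("risk", 0.1, Some(0.9)), 0.1);
    // Unlocked subsystems accept the override.
    assert_eq!(effective_temperature("dream_rem", 0.9, Some(1.0)), 1.0);
}
```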


Cross-References

| Topic | Document | What it covers |
|---|---|---|
| Model routing and provider resolution | 01a-routing.md | Self-describing providers, declarative intents, and the full routing algorithm including subsystem intent declarations |
| Structured output schemas | 01-structured-outputs.md (companion doc) | StructuredOutput abstraction across providers with JSON Schema enforcement and graceful degradation |
| Context engineering pipeline | 04-context-engineering.md | 8-layer pipeline where inference profile parameters interact with prompt cache alignment and tool pruning |
| Venice provider specification | 12-providers.md section Venice | Venice private cognition plane: TEE attestation, E2EE inference, sensitivity classification, and DIEM staking |
| Hypnagogic temperature scheduling | ../06-hypnagogia/02-architecture.md | Temperature ramp during dream-cycle onset/return transitions between waking and sleep states |
| Dream cycle phases | ../05-dreams/01-architecture.md | NREM memory consolidation, REM creative recombination, and integration insight promotion phases |
| Mortality pressure on inference | 01a-routing.md section Mortality | How dying Golems become more cost-sensitive: quality downgrade, cost_sensitivity increase, exempt subsystems |
| Daimon emotional appraisal | ../03-daimon/01-appraisal.md | The Daimon's emotional regulation subsystem: PAD vector extraction, appraisal schemas, and affective bias |