Inference Parameters: Temperature, Sampling, and Reasoning Policies
Version: 1.0 | Status: Draft | Type: SPEC (normative)
Parent: prd2/11-inference/
Crate: golem-inference
Depends on: golem-core, golem-daimon, golem-dreams, bardo-gateway, bardo-providers
Purpose: Define the complete inference parameter policy for every Golem subsystem across every cognitive state. Today, the inference gateway routes requests to the right model and provider (see 01a-routing.md), but it does not specify how the model should reason – what temperature to use, what sampling strategy, what reasoning effort, what response format. This document fills that gap. Written for a first-time reader with no assumed familiarity with the Bardo architecture.
Reader orientation: This document specifies the inference parameter policies for every Golem (a mortal autonomous DeFi agent managed by the Bardo runtime) subsystem across every cognitive state. It belongs to the Bardo Inference layer and defines temperature, sampling, reasoning effort, and response format per subsystem (heartbeat, risk, dream, Daimon, curator, playbook, operator, death). The key concept is that different subsystems need different inference parameters: risk assessment demands near-deterministic output (temperature 0.1), while dream cycles need creative recombination (temperature 0.8-0.9). For term definitions, see prd2/shared/glossary.md.
Why This Matters
An LLM call is not just a string of prompt text sent to a model. Every provider exposes parameters that fundamentally alter the character of the response: temperature controls randomness, sampling strategies control which tokens are reachable, reasoning effort controls how deeply the model thinks before answering, and response format controls whether the output is free text or structured data. Using the same parameters for every call – the naive default – wastes money, produces worse results, and misses provider-specific features that exist precisely to solve the problems each Golem subsystem faces.
The Golem has 12+ subsystems that make inference calls, each with radically different needs. The risk engine needs maximum precision and zero creativity. Hypnagogic onset needs maximum creativity and loosened constraints. The Curator needs structured data extraction. The death testament needs deep, visible reasoning. Treating all of these as “send prompt, get text” is like using the same wrench for every bolt.
This document defines the InferenceProfile – a per-subsystem, per-cognitive-state specification of exactly which parameters to set on every inference call. The profile is attached to the Intent (see 01a-routing.md) and applied by the gateway after provider resolution, before the call reaches the backend.
The InferenceProfile
/// Complete inference parameter specification for a single call.
/// Attached to the Intent by the subsystem, applied by the gateway.
///
/// All fields are Option<T>: None means "use provider default."
/// The gateway merges the profile with provider-specific defaults
/// and capabilities before sending the request.
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
pub struct InferenceProfile {
// ── Sampling ──────────────────────────────────────────────
/// Temperature: controls randomness.
/// 0.0 = deterministic, 2.0 = maximum randomness.
/// None = provider default (typically 1.0).
pub temperature: Option<f32>,
/// Top-p (nucleus sampling): cumulative probability threshold.
/// 0.1 = very focused, 1.0 = no filtering.
/// Mutually exclusive with top_k in practice.
pub top_p: Option<f32>,
/// Top-k: number of highest-probability tokens to consider.
/// 1 = greedy, 100+ = broad. Not all providers support this.
pub top_k: Option<u32>,
/// Min-p: dynamic probability floor relative to top token.
/// 0.1 = tokens with <10% of top token's probability are excluded.
/// More principled than fixed top-p. Venice + open models support this.
pub min_p: Option<f32>,
/// Frequency penalty: penalizes tokens that appear frequently.
/// 0.0 = no penalty, 2.0 = strong penalty.
pub frequency_penalty: Option<f32>,
/// Presence penalty: penalizes tokens that have appeared at all.
/// 0.0 = no penalty, 2.0 = strong penalty.
pub presence_penalty: Option<f32>,
// ── Reasoning ─────────────────────────────────────────────
/// Reasoning effort: controls depth of chain-of-thought.
/// Maps to Venice's reasoning_effort, Anthropic's extended thinking,
/// OpenAI's reasoning effort parameter.
/// None = provider default. Some models always reason.
pub reasoning_effort: Option<ReasoningEffort>,
/// Whether to request visible thinking/reasoning traces.
/// Venice: reasoning_content field. Anthropic: thinking blocks.
/// None = provider default.
pub visible_thinking: Option<bool>,
// ── Output format ─────────────────────────────────────────
/// Structured output schema (JSON Schema).
/// When set, the provider enforces this schema on the response.
/// Falls back gracefully to free text + parsing if unsupported.
pub response_schema: Option<ResponseSchema>,
/// Maximum output tokens.
pub max_tokens: Option<u32>,
/// Stop sequences.
pub stop_sequences: Option<Vec<String>>,
// ── Caching ───────────────────────────────────────────────
/// Prompt cache key for session affinity.
/// Venice/Anthropic: routes to same server for cache hits.
pub prompt_cache_key: Option<String>,
/// Explicit cache control markers for Anthropic models.
/// When true, the gateway auto-adds cache_control to system prompts
/// and long static content blocks.
pub cache_control: Option<bool>,
// ── Provider-specific ─────────────────────────────────────
/// Venice-specific: enable web search for this call.
pub web_search: Option<bool>,
/// Venice-specific: TEE (Trusted Execution Environment) mode.
/// Ensures inference runs in an encrypted enclave.
pub tee_mode: Option<bool>,
/// OpenAI-specific: predicted output for diffing (PLAYBOOK edits).
pub predicted_output: Option<String>,
/// Seed for reproducibility. Not all providers support this.
pub seed: Option<u64>,
}
/// Reasoning effort levels, normalized across providers.
/// The gateway maps these to provider-specific values.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum ReasoningEffort {
/// No reasoning. Fast, cheap. Use for simple classification.
None,
/// Minimal reasoning. Quick chain-of-thought.
Low,
/// Balanced reasoning. Default for most tasks.
Medium,
/// Deep reasoning. For complex analysis and decisions.
High,
/// Maximum reasoning. For critical decisions (risk, death).
Max,
}
/// Structured output schema specification.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ResponseSchema {
/// Schema name (for provider registration).
pub name: String,
/// JSON Schema definition.
pub schema: serde_json::Value,
/// Whether strict mode is required.
/// Venice/OpenAI: strict=true. Falls back to prompt-guided if unsupported.
pub strict: bool,
}
Provider Parameter Mapping
The gateway translates InferenceProfile fields to provider-specific API parameters. Not all providers support all fields. The gateway applies the best available approximation and records what was degraded.
| Profile Field | Venice | Anthropic (Direct/BlockRun) | OpenAI (Direct/BlockRun) | Bankr | OpenRouter |
|---|---|---|---|---|---|
| temperature | temperature | temperature | temperature | temperature | temperature |
| top_p | top_p | top_p | top_p | top_p | top_p |
| top_k | top_k | top_k | ✗ (ignored) | ✗ (passthrough) | top_k |
| min_p | min_p | ✗ (use top_p approx) | ✗ (use top_p approx) | ✗ | model-dependent |
| frequency_penalty | frequency_penalty | ✗ (ignored) | frequency_penalty | passthrough | passthrough |
| presence_penalty | presence_penalty | ✗ (ignored) | presence_penalty | passthrough | passthrough |
| reasoning_effort | reasoning.effort | thinking.budget_tokens | reasoning_effort | passthrough | model-dependent |
| visible_thinking | reasoning_content field | thinking blocks | reasoning field | passthrough | model-dependent |
| response_schema | response_format.json_schema | tool_use workaround | response_format.json_schema | passthrough | model-dependent |
| prompt_cache_key | prompt_cache_key | ✗ (auto by prefix) | ✗ (auto by prefix) | ✗ | ✗ |
| cache_control | cache_control on blocks | cache_control on blocks | ✗ (auto) | ✗ | ✗ |
| web_search | venice_parameters.web_search | ✗ | ✗ | ✗ | ✗ |
| tee_mode | model suffix -tee | ✗ | ✗ | ✗ | ✗ |
| predicted_output | ✗ | ✗ | prediction.content | ✗ | ✗ |
| seed | seed | ✗ | seed | ✗ | seed |
Graceful degradation rules
When a profile field is not supported by the resolved provider:
- temperature, top_p, max_tokens: Always supported. No degradation.
- top_k, min_p: If unsupported, approximate via top_p. Log degradation.
- reasoning_effort: If unsupported, map to temperature/prompt adjustments. High -> add "Think step by step" to the system prompt. None -> add "Answer directly without explanation."
- response_schema: If unsupported, fall back to prompt-guided JSON + post-parse validation. See 01-structured-outputs.md.
- web_search, tee_mode, predicted_output: Provider-exclusive. If the provider doesn't support the field, it is silently dropped and logged in Resolution.degraded.
- prompt_cache_key, cache_control: Provider-exclusive caching. If unsupported, ignored. No functional degradation – just higher cost.
/// Apply InferenceProfile to a provider request, handling degradation.
pub fn apply_profile(
request: &mut ProviderRequest,
profile: &InferenceProfile,
provider: &dyn Provider,
) -> Vec<String> {
let mut degraded = Vec::new();
let caps = provider.capabilities();
// Temperature: always supported
if let Some(t) = profile.temperature {
request.set_temperature(t);
}
// Reasoning effort: provider-specific mapping
if let Some(effort) = profile.reasoning_effort {
if caps.supports_reasoning_effort {
request.set_reasoning_effort(effort);
} else if caps.supports_reasoning {
// Map effort to budget_tokens for Anthropic
let budget = match effort {
ReasoningEffort::None => 0,
ReasoningEffort::Low => 1024,
ReasoningEffort::Medium => 4096,
ReasoningEffort::High => 16384,
ReasoningEffort::Max => 65536,
};
request.set_thinking_budget(budget);
} else {
// Fallback: prompt-level reasoning guidance
match effort {
ReasoningEffort::None => {
request.prepend_system("Answer directly. Do not explain your reasoning.");
}
ReasoningEffort::High | ReasoningEffort::Max => {
request.prepend_system(
"Think through this step by step. Show your reasoning."
);
}
_ => {} // Medium/Low: no modification
}
degraded.push(format!("reasoning_effort:{:?} -> prompt fallback", effort));
}
}
// Structured output: see 01-structured-outputs.md for full logic
if let Some(ref schema) = profile.response_schema {
if caps.supports_response_schema {
request.set_response_format(schema);
} else {
// Fallback: inject schema into prompt, parse output
request.append_system(&format!(
"\n\nRespond ONLY with valid JSON matching this schema:\n{}",
serde_json::to_string_pretty(&schema.schema).unwrap()
));
degraded.push("response_schema -> prompt-guided JSON".into());
}
}
// Min-p: Venice and some open models only
if let Some(mp) = profile.min_p {
if caps.supports_min_p {
request.set_min_p(mp);
} else {
// Approximate: min_p 0.1 ≈ top_p 0.9
let approx_top_p = 1.0 - mp;
request.set_top_p(approx_top_p);
degraded.push(format!("min_p:{} -> top_p:{}", mp, approx_top_p));
}
}
// Prompt caching: Venice and Anthropic
if let Some(ref key) = profile.prompt_cache_key {
if caps.supports_prompt_cache_key {
request.set_prompt_cache_key(key);
}
// No degradation — just cost impact
}
if profile.cache_control.unwrap_or(false) {
if caps.supports_cache_control {
request.add_cache_control_markers();
}
}
// Web search: Venice only
if profile.web_search.unwrap_or(false) {
if caps.supports_web_search {
request.set_web_search(true);
} else {
degraded.push("web_search -> unsupported".into());
}
}
// Predicted output: OpenAI only
if let Some(ref predicted) = profile.predicted_output {
if caps.supports_predicted_output {
request.set_predicted_output(predicted);
} else {
degraded.push("predicted_output -> unsupported".into());
}
}
degraded
}
Subsystem Parameter Table
The master table. Every subsystem, every parameter, with rationale.
Waking Subsystems
| Subsystem | Temperature | Sampling | Reasoning Effort | Structured Output | Cache | Rationale |
|---|---|---|---|---|---|---|
| heartbeat_t1 | 0.3 | top_p=0.9 | Low | HeartbeatDecision schema | cache_control | Fast, cheap, focused. Low creativity needed. Structured output extracts action/severity cleanly. |
| heartbeat_t2 | 0.5 | top_p=0.95 | High | HeartbeatDecision schema | cache_control | Novel situation needs deeper reasoning. Higher temp allows consideration of less obvious options. |
| risk | 0.1 | top_p=0.85, top_k=40 | Max | RiskAssessment schema | cache_control | Maximum precision. Near-deterministic. Structured output ensures all five risk layers are evaluated. Never degraded by mortality pressure. |
| daimon | 0.4 | top_p=0.9 | Low | DaimonAppraisal schema | none | Emotional appraisal needs consistency but not rigidity. PAD vector extraction via structured output. Privacy preferred (Venice). |
| daimon_complex | 0.6 | top_p=0.95 | High | DaimonAppraisal schema | none | Complex emotional situations need deeper processing. Visible thinking captures reasoning chain. Privacy required (Venice). |
| curator | 0.3 | top_p=0.9 | Medium | CuratorEvaluation schema | cache_control | Systematic evaluation. Structured output extracts quality scores, retention decisions, cross-references. |
| playbook | 0.4 | top_p=0.9 | Medium | None (free text) | predicted_output | PLAYBOOK.md edits are free-text diffs. OpenAI’s predicted_output saves tokens by diffing against current PLAYBOOK. |
| operator | 0.7 | top_p=0.95 | High | None (free text) | cache_control | Owner chat. Natural language, higher creativity for explanations. Never degraded by mortality pressure. |
| mind_wandering | 0.8 | min_p=0.1 | None | None (free text) | none | Brief reverie during waking. Loosened constraints. Cheap (T0/T1). No reasoning overhead needed. |
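To make the table concrete, here is a minimal, non-normative sketch of how two of the rows above could be encoded as default profiles using the InferenceProfile type defined earlier. The heartbeat_decision_schema() and risk_assessment_schema() helpers are hypothetical placeholders for the schemas named in the table.

/// Sketch only: two rows of the waking table as default profiles.
/// heartbeat_decision_schema() / risk_assessment_schema() are assumed helpers.
pub fn waking_default_profile(subsystem: &str) -> InferenceProfile {
    match subsystem {
        "heartbeat_t1" => InferenceProfile {
            temperature: Some(0.3),
            top_p: Some(0.9),
            reasoning_effort: Some(ReasoningEffort::Low),
            response_schema: Some(ResponseSchema {
                name: "heartbeat_decision".into(),
                schema: heartbeat_decision_schema(), // hypothetical helper
                strict: true,
            }),
            cache_control: Some(true),
            ..Default::default()
        },
        "risk" => InferenceProfile {
            temperature: Some(0.1),
            top_p: Some(0.85),
            top_k: Some(40),
            reasoning_effort: Some(ReasoningEffort::Max),
            response_schema: Some(ResponseSchema {
                name: "risk_assessment".into(),
                schema: risk_assessment_schema(), // hypothetical helper
                strict: true,
            }),
            cache_control: Some(true),
            ..Default::default()
        },
        _ => InferenceProfile::default(),
    }
}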
Dream Subsystems
| Subsystem | Temperature | Sampling | Reasoning Effort | Structured Output | Cache | Special |
|---|---|---|---|---|---|---|
| dream_nrem (replay) | 0.4 | top_p=0.9 | Medium | ReplayAnalysis schema | prompt_cache_key per cycle | Systematic replay. Structured output extracts lessons, surprise scores, counterfactual markers. |
| dream_rem (imagination) | 0.9 | min_p=0.1 | High | None (free text) | prompt_cache_key | Creative scenario generation. High temperature + min-p for principled diversity. Web search enabled (Venice). |
| dream_rem_creative | 1.2 | min_p=0.08 | Medium | None (free text) | none | Boden-mode creative recombination. Highest temperature in the waking/dream cycle. |
| dream_integration | 0.3 | top_p=0.85 | High | DreamIntegration schema | tee_mode | Consolidation. Analytical. Structured output extracts promoted/staged/discarded decisions with rationale. TEE for attestation. |
| dream_threat | 0.5 | top_p=0.9 | High | ThreatAssessment schema | prompt_cache_key | Threat rehearsal. Balanced creativity (to imagine novel attacks) with analytical depth. |
Hypnagogic Subsystems
| Subsystem | Temperature | Sampling | Reasoning Effort | Structured Output | Cache | Special |
|---|---|---|---|---|---|---|
| hypnagogic_induction | 1.0-1.2 | min_p=0.1 | None | None (free text) | none | Initial associative scan. Executive loosening. No reasoning – raw association. Temperature ramps within session. |
| hypnagogic_dali | 1.2-1.5 | min_p=0.08 | None | None (free text) | none | Peak creative range. Dali interrupt: 50-100 token partials at max temperature. Highest temp in the entire system. |
| hypnagogic_observer | 0.3 | top_p=0.85 | None | FragmentEvaluation schema | none | HomuncularObserver. Analytical evaluation of fragments. Structured output for novelty/relevance/coherence scores. Cheapest tier (T0). |
| hypnagogic_capture | 0.5 | top_p=0.9 | Low | CaptureResult schema | none | Lucid capture. Moderate analytical. Structured output for promote/stage/discard decisions. |
| hypnopompic_return | 0.6 | top_p=0.9 | Low | None (free text) | none | Gradual re-engagement. Slightly creative to allow dream insights to surface before full analytical reassertion. |
Terminal Subsystems
| Subsystem | Temperature | Sampling | Reasoning Effort | Structured Output | Cache | Special |
|---|---|---|---|---|---|---|
| death_reflect | 0.5 | top_p=0.95 | Max | None (free text) | none | Death Protocol Phase II. Maximum reasoning for honest self-assessment. Free text for narrative quality. Visible thinking required. Venice (privacy + visible thinking). |
| death_testament | 0.4 | top_p=0.9 | High | DeathTestament schema (partial) | tee_mode | Death Protocol Phase III. Structured for machine-parseable sections (metrics, heuristics, warnings). Free text for reflection narrative. TEE for sealed attestation. |
Temperature Scheduling Within Sessions
Some subsystems use temperature annealing – the temperature changes within a single inference session or across a sequence of related calls. This is distinct from the static per-subsystem temperature above.
Hypnagogic cosine annealing
Each Dali cycle within hypnagogic onset follows a cosine schedule:
Cycle start: T = T_high (1.2-1.5) <- Peak creative range
Mid-cycle: T = T_mid (0.8-1.0) <- Transitional
Cycle end: T = T_low (0.3-0.5) <- Evaluation (HomuncularObserver)
Reanneal: T = T_high * 0.8 <- Next cycle starts slightly cooler
/// Cosine temperature annealing for Dali cycles.
pub fn dali_temperature(
step: usize,
total_steps: usize,
t_high: f32,
t_low: f32,
) -> f32 {
let progress = step as f32 / total_steps as f32;
let cosine = (1.0 + (progress * std::f32::consts::PI).cos()) / 2.0;
t_low + (t_high - t_low) * cosine
}
/// Reanneal after fragment capture.
/// Each successive cycle starts slightly cooler, modeling
/// natural descent toward sleep.
pub fn reanneal(cycle: u8, base_high: f32) -> f32 {
let decay = 0.95_f32.powi(cycle as i32);
base_high * decay
}
Dream phase transitions
The dream cycle transitions between temperatures across phases:
NREM (replay): T = 0.4 (analytical, systematic)
| gradual increase
REM (imagination): T = 0.9-1.2 (creative, exploratory)
| sharp decrease
Integration: T = 0.3 (analytical, consolidating)
The temperature transition is not instantaneous – the first REM call starts at 0.7 and ramps to 0.9 over 3-4 calls. This prevents jarring cognitive mode shifts.
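A minimal sketch of that ramp, assuming linear interpolation (the document fixes only the endpoints: start at 0.7, reach the REM target over the first few calls):

/// Sketch: REM onset temperature ramp. The linear shape is an assumption;
/// only the endpoints (0.7 -> target over `ramp_calls` calls) come from this spec.
pub fn rem_onset_temperature(call_index: usize, ramp_calls: usize, target: f32) -> f32 {
    let start = 0.7_f32;
    if ramp_calls == 0 || call_index >= ramp_calls {
        return target;
    }
    start + (target - start) * (call_index as f32 / ramp_calls as f32)
}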
Mortality-aware temperature adjustment
A dying Golem’s temperature is compressed toward the analytical end for all non-exempt subsystems:
/// Compress temperature range based on mortality pressure.
/// As vitality drops, temperature moves toward analytical (lower).
/// Exempt: hypnagogia (creativity is the point), death (max effort).
pub fn apply_mortality_temperature(
base_temp: f32,
vitality: f64,
subsystem: &str,
) -> f32 {
let exempt = ["hypnagogic_induction", "hypnagogic_dali",
"death_reflect", "death_testament", "operator"];
if exempt.contains(&subsystem) { return base_temp; }
let pressure = (1.0 - vitality as f32).max(0.0);
// Compress toward 0.3 (analytical floor) as pressure increases
let analytical_floor = 0.3;
base_temp - (base_temp - analytical_floor) * pressure * 0.5
}
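A worked example of the compression above, as a sketch test: at vitality 0.4 a non-exempt subsystem with base temperature 0.9 gets pressure 0.6, so the adjusted temperature is 0.9 - (0.9 - 0.3) × 0.6 × 0.5 = 0.72.

#[test]
fn mortality_compression_example() {
    // vitality 0.4 -> pressure 0.6; 0.9 - (0.9 - 0.3) * 0.6 * 0.5 = 0.72
    let t = apply_mortality_temperature(0.9, 0.4, "dream_rem");
    assert!((t - 0.72).abs() < 1e-4);
}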
Reasoning Effort Policies
When to use each level
| Level | Cost Multiplier | When | Example Subsystems |
|---|---|---|---|
| None | 1x (no reasoning tokens) | Simple classification, scoring, fragment generation | hypnagogic_induction, hypnagogic_dali, mind_wandering |
| Low | ~1.2x | Routine decisions, emotional appraisals | heartbeat_t1, daimon, hypnagogic_capture |
| Medium | ~1.5-2x | Balanced analysis, knowledge evaluation | curator, dream_nrem, playbook |
| High | ~2-4x | Complex decisions, creative development, threat analysis | heartbeat_t2, daimon_complex, dream_rem, dream_threat |
| Max | ~4-8x | Critical decisions, death reflection, risk assessment | risk, death_reflect |
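For pre-call budgeting, the multipliers above can be encoded directly. The sketch below simply restates the table (using approximate midpoints where a range is given) and is not a pricing source.

/// Sketch: rough cost multiplier per reasoning effort level, from the table above.
pub fn reasoning_cost_multiplier(effort: ReasoningEffort) -> f32 {
    match effort {
        ReasoningEffort::None => 1.0,
        ReasoningEffort::Low => 1.2,
        ReasoningEffort::Medium => 1.75, // ~1.5-2x
        ReasoningEffort::High => 3.0,    // ~2-4x
        ReasoningEffort::Max => 6.0,     // ~4-8x
    }
}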
Provider mapping for reasoning effort
/// Map normalized ReasoningEffort to provider-specific parameters.
pub fn map_reasoning_effort(
effort: ReasoningEffort,
provider: &str,
model: &str,
) -> ReasoningParams {
match provider {
"venice" => {
// Venice uses string-valued reasoning_effort
let level = match effort {
ReasoningEffort::None => "none",
ReasoningEffort::Low => "low",
ReasoningEffort::Medium => "medium",
ReasoningEffort::High => "high",
ReasoningEffort::Max => {
// "max" only supported on Claude Opus 4.6 via Venice
if model.contains("opus-4-6") { "max" } else { "high" }
}
};
ReasoningParams::Venice { effort: level.into() }
}
"anthropic" | "blockrun" => {
// Anthropic uses budget_tokens for extended thinking
let budget = match effort {
ReasoningEffort::None => 0,
ReasoningEffort::Low => 1024,
ReasoningEffort::Medium => 4096,
ReasoningEffort::High => 16384,
ReasoningEffort::Max => 65536,
};
ReasoningParams::Anthropic { budget_tokens: budget }
}
"openai" | "direct_openai" => {
// OpenAI uses string-valued reasoning_effort
let level = match effort {
ReasoningEffort::None => "none",
ReasoningEffort::Low => "low",
ReasoningEffort::Medium => "medium",
ReasoningEffort::High => "high",
ReasoningEffort::Max => "xhigh", // OpenAI's "extra high"
};
ReasoningParams::OpenAI { effort: level.into() }
}
"bankr" => {
// Bankr passes through to underlying provider.
// Detect underlying provider from model name.
if model.starts_with("claude") {
map_reasoning_effort(effort, "anthropic", model)
} else if model.starts_with("gpt") {
map_reasoning_effort(effort, "openai", model)
} else {
// Gemini, Kimi, Qwen: use low/medium/high if supported
let level = match effort {
ReasoningEffort::None => "none",
ReasoningEffort::Low => "low",
ReasoningEffort::Medium => "medium",
ReasoningEffort::High | ReasoningEffort::Max => "high",
};
ReasoningParams::Generic { effort: level.into() }
}
}
_ => ReasoningParams::Unsupported,
}
}
Prompt Caching Strategy
Session-level caching
Every Golem uses a persistent prompt_cache_key derived from its Golem ID. This ensures that sequential inference calls within the same Golem’s lifecycle hit the same server with warm cache, maximizing cache hit rates for the system prompt, PLAYBOOK.md, and STRATEGY.md that are present in every call.
/// Generate prompt cache key for a Golem.
pub fn golem_cache_key(golem_id: &str, subsystem: &str) -> String {
// Group by subsystem to maximize prefix sharing.
// All heartbeat calls share one cache; all dream calls share another.
format!("golem-{}-{}", golem_id, subsystem)
}
Cache-aligned prompt structure
The 8-layer context engineering pipeline (see 04-context-engineering.md) already optimizes for cache hits by placing static content first. The InferenceProfile reinforces this:
| Position | Content | Cached? | Changes? |
|---|---|---|---|
| 1 | System prompt (identity, archetype) | yes | Never (within a lifecycle) |
| 2 | STRATEGY.md (owner-authored) | yes | Rarely (owner edits) |
| 3 | PLAYBOOK.md (evolved heuristics) | yes | Every 50 ticks (Curator cycle) |
| 4 | Tool definitions (pruned) | no | Per-tick (dynamic pruning) |
| 5 | Retrieved Grimoire entries | no | Per-tick |
| 6 | Current market context | no | Per-tick |
| 7 | User message / query | no | Per-tick |
Items 1-3 are cache-eligible (thousands of tokens, stable across ticks). Items 4-7 are dynamic. The gateway auto-adds cache_control: { type: "ephemeral" } markers at the boundary between static and dynamic content for Anthropic models, and uses prompt_cache_key for Venice.
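A minimal sketch of that boundary-marking step, assuming the context pipeline hands the gateway an ordered list of blocks with a static/dynamic flag. ContextBlock and its fields are illustrative, not part of the specified API.

/// Illustrative only: attach a cache_control marker to the last block of the
/// static prefix (layers 1-3) so everything before the boundary is cacheable.
pub struct ContextBlock {
    pub text: String,
    pub is_static: bool,               // system prompt, STRATEGY.md, PLAYBOOK.md
    pub cache_control: Option<String>, // e.g. Some("ephemeral") for Anthropic models
}

pub fn mark_cache_boundary(blocks: &mut [ContextBlock]) {
    match blocks.iter().position(|b| !b.is_static) {
        // Dynamic content starts at `first_dynamic`: mark the block just before it.
        Some(first_dynamic) if first_dynamic > 0 => {
            blocks[first_dynamic - 1].cache_control = Some("ephemeral".to_string());
        }
        // No dynamic blocks at all: mark the final block.
        None => {
            if let Some(last) = blocks.last_mut() {
                last.cache_control = Some("ephemeral".to_string());
            }
        }
        // First block is already dynamic: nothing to cache.
        _ => {}
    }
}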
Cache economics per provider
| Provider | Cache Discount | Write Premium | Min Tokens | TTL | Auto-managed? |
|---|---|---|---|---|---|
| Venice (Claude) | 90% | 25% | ~4,000 | 5 min | yes (gateway adds markers) |
| Venice (other) | 50-90% | None | ~1,024 | 5 min | yes (auto by prefix) |
| Anthropic (Direct) | 90% | 25% | ~4,000 | 5 min | yes (gateway adds markers) |
| OpenAI (Direct) | 90% | None | 1,024 | 5-10 min | yes (auto by prefix) |
| Bankr | Passthrough (depends on underlying) | Passthrough | Passthrough | Passthrough | yes |
| BlockRun | Passthrough | Passthrough | Passthrough | Passthrough | yes |
Economic impact: For a Golem making 20 T1+ calls/day with a 3,000-token system prompt, prompt caching saves approximately 90% on the static prefix. At Claude Haiku rates, this is ~$0.02/day saved – modest per-Golem but significant across a Clade.
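The arithmetic behind that estimate, as a sketch: daily saving ≈ calls/day × cached prefix tokens × input price per token × cache discount. The per-million-token price is a caller-supplied parameter here, not a quoted rate.

/// Sketch of the daily cache-saving arithmetic quoted above.
/// `input_price_per_mtok` is supplied by the caller; no provider price is hardcoded.
pub fn daily_cache_saving_usd(
    calls_per_day: u32,
    cached_prefix_tokens: u32,
    input_price_per_mtok: f64,
    cache_discount: f64, // e.g. 0.9 for a 90% discount
) -> f64 {
    let daily_tokens = calls_per_day as f64 * cached_prefix_tokens as f64;
    daily_tokens * (input_price_per_mtok / 1_000_000.0) * cache_discount
}

With 20 calls/day and a 3,000-token prefix (60,000 cached tokens per day), a sub-dollar per-Mtok input rate yields a saving on the order of a few cents, consistent with the ~$0.02/day figure above.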
Venice-Specific Deep Integration
Web search during REM dreams
Venice’s web search feature is enabled during the REM imagination phase to allow the Golem to incorporate current market information into its creative scenario generation. This is controlled by DreamVeniceConfig.web_search_enabled and capped by web_search_budget_per_cycle_usdc.
/// Build REM inference profile with web search.
pub fn rem_profile(config: &DreamVeniceConfig) -> InferenceProfile {
InferenceProfile {
temperature: Some(0.9),
min_p: Some(0.1),
reasoning_effort: Some(ReasoningEffort::High),
visible_thinking: Some(true), // Capture reasoning chain
web_search: Some(config.web_search_enabled),
prompt_cache_key: Some(format!("dream-rem-{}", config.golem_id)),
..Default::default()
}
}
Web search triggers are contextual: the REM imagination engine identifies when a counterfactual scenario involves a protocol or token the Golem has limited knowledge about, and constructs a focused search query. Results are injected into the scenario context, not the system prompt (to avoid cache invalidation).
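A sketch of that trigger under stated assumptions: the Scenario type, the knowledge-depth check against the Grimoire, and the query template are all illustrative, not specified interfaces.

/// Illustrative trigger: search only when the scenario references an entity
/// the Golem knows little about. `Scenario` and `knowledge_depth` are
/// hypothetical stand-ins for the REM engine's real types.
pub struct Scenario {
    pub referenced_entities: Vec<String>, // protocols / tokens in the counterfactual
}

pub fn web_search_query(
    scenario: &Scenario,
    knowledge_depth: impl Fn(&str) -> f32, // 0.0 = unknown, 1.0 = well covered
) -> Option<String> {
    scenario
        .referenced_entities
        .iter()
        .find(|entity| knowledge_depth(entity) < 0.3)
        .map(|entity| format!("{entity} protocol overview, liquidity, and recent incidents"))
}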
TEE mode for death testaments
The death testament – the Golem’s final knowledge artifact – can optionally be generated inside a Trusted Execution Environment to provide cryptographic attestation that the testament was produced by the dying Golem’s own reasoning, not modified post-hoc. Venice’s TEE models are selected by appending -tee to the model ID.
/// Death testament inference profile.
pub fn death_testament_profile(config: &GolemConfig) -> InferenceProfile {
InferenceProfile {
temperature: Some(0.4),
top_p: Some(0.9),
reasoning_effort: Some(ReasoningEffort::High),
visible_thinking: Some(true),
tee_mode: Some(config.sealed_testament),
// Partial structured output: metrics + heuristics are structured,
// narrative reflection is free text.
response_schema: Some(ResponseSchema {
name: "death_testament".into(),
schema: death_testament_schema(),
strict: true,
}),
..Default::default()
}
}
Venice embeddings for the Grimoire
The Golem’s episodic memory (LanceDB) and the HomuncularObserver’s novelty scoring both require embedding generation. Venice’s embeddings endpoint (text-embedding-bge-m3) provides a privacy-preserving alternative to the gateway’s local ONNX embedding model (nomic-embed-text-v1.5).
The choice is configurable: local embeddings (default, zero cost, ~768-dim) or Venice embeddings (API cost, potentially higher quality, privacy-preserving since Venice retains no data).
/// Embedding provider selection.
pub enum EmbeddingProvider {
/// Local ONNX model. Default. Zero cost. ~5ms latency.
Local,
/// Venice API. API cost. ~50ms latency. Zero data retention.
Venice { model: String },
}
impl EmbeddingProvider {
pub async fn embed(&self, text: &str) -> Result<Vec<f32>> {
match self {
Self::Local => {
// fastembed-rs with nomic-embed-text-v1.5
let model = fastembed::TextEmbedding::try_new(Default::default())?;
let embeddings = model.embed(vec![text], None)?;
Ok(embeddings[0].clone())
}
Self::Venice { model } => {
let response = reqwest::Client::new()
.post("https://api.venice.ai/api/v1/embeddings")
.header("Authorization", format!("Bearer {}", venice_api_key()))
.json(&serde_json::json!({
"model": model,
"input": text,
"encoding_format": "float"
}))
.send()
.await?;
let body: EmbeddingResponse = response.json().await?;
Ok(body.data[0].embedding.clone())
}
}
}
}
Integration: How Profiles Flow Through the System
Subsystem (e.g., heartbeat_t2)
|
+- builds Intent { quality: High, prefer: ["interleaved_thinking"] }
+- builds InferenceProfile { temperature: 0.5, reasoning_effort: High, ... }
|
v
ModelRouter extension (golem-inference)
|
+- applies mortality pressure to Intent
+- applies mortality temperature adjustment to Profile
+- resolves Intent -> Resolution (provider + model)
|
v
Bardo Gateway (bardo-gateway)
|
+- 8-layer context engineering pipeline
+- apply_profile() -> maps Profile to provider-specific params
+- records degradations in Resolution.degraded
+- adds cache_control markers if applicable
|
v
Provider Backend (Venice / BlockRun / Bankr / Direct)
|
+- receives fully parameterized request
+- returns response with usage stats
|
v
Gateway post-processing
|
+- extracts reasoning_content if visible_thinking enabled
+- validates structured output against schema if response_schema set
+- records cost, latency, cache hit rate
+- emits InferenceEnd event with all metadata
|
v
Subsystem receives typed response
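The same flow condensed into code, as a sketch for heartbeat_t2. The router and gateway method names (resolve, call, parse_structured) and the Prompt/HeartbeatDecision types are assumptions for illustration; only InferenceProfile and the Intent/Resolution concepts are defined by this document and 01a-routing.md.

/// Sketch of the flow above for heartbeat_t2. Router/gateway method names are assumed.
pub async fn heartbeat_t2_call(
    router: &ModelRouter,
    gateway: &BardoGateway,
    vitality: f64,
    prompt: Prompt,
) -> Result<HeartbeatDecision> {
    let mut profile = InferenceProfile {
        temperature: Some(0.5),
        top_p: Some(0.95),
        reasoning_effort: Some(ReasoningEffort::High),
        cache_control: Some(true),
        ..Default::default()
    };
    // Mortality pressure compresses temperature before resolution.
    profile.temperature = profile
        .temperature
        .map(|t| apply_mortality_temperature(t, vitality, "heartbeat_t2"));
    let resolution = router.resolve(heartbeat_t2_intent())?; // assumed API
    let response = gateway.call(&resolution, &profile, prompt).await?; // assumed API
    response.parse_structured::<HeartbeatDecision>() // assumed API: validates against schema
}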
Events emitted
| Event | Trigger | Payload |
|---|---|---|
| inference:profile_applied | Profile parameters set on request | { subsystem, temperature, reasoning_effort, structured, cached } |
| inference:profile_degraded | One or more profile fields unsupported | { subsystem, degraded: ["min_p -> top_p", ...] } |
| inference:schema_validated | Structured output matches schema | { subsystem, schema_name, valid: bool } |
| inference:schema_fallback | Schema unsupported, fell back to prompt-guided | { subsystem, schema_name } |
| inference:reasoning_captured | Visible thinking extracted from response | { subsystem, reasoning_tokens, content_tokens } |
| inference:web_search_used | Venice web search triggered | { subsystem, query, results_count } |
| inference:cache_stats | Per-call cache statistics | { subsystem, cached_tokens, total_tokens, savings_usd } |
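If these events carry typed payloads, a sketch of two of them might look like the following; the enum shape and serde attributes are assumptions, while the field names come from the table.

use serde::Serialize;

/// Sketch: two event payloads from the table above as a tagged enum.
#[derive(Debug, Serialize)]
#[serde(tag = "event")]
pub enum InferenceEvent {
    #[serde(rename = "inference:profile_degraded")]
    ProfileDegraded { subsystem: String, degraded: Vec<String> },
    #[serde(rename = "inference:cache_stats")]
    CacheStats {
        subsystem: String,
        cached_tokens: u32,
        total_tokens: u32,
        savings_usd: f64,
    },
}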
Configuration
# bardo.toml -- inference profile overrides
[inference.profiles]
# Override any subsystem's default profile.
# Unspecified fields use the defaults from this document.
[inference.profiles.heartbeat_t1]
temperature = 0.3
reasoning_effort = "low"
[inference.profiles.dream_rem]
temperature = 1.0 # Owner wants more creative dreams
min_p = 0.08
[inference.profiles.risk]
# Risk is never overridden. This section is ignored with a warning.
# Safety-critical subsystems have locked profiles.
[inference.embeddings]
provider = "local" # "local" or "venice"
venice_model = "text-embedding-bge-m3"
[inference.caching]
auto_cache_control = true
prompt_cache_key_prefix = "golem"
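A sketch of how these overrides might be deserialized and merged onto the defaults from this document. The ProfileOverride shape mirrors the TOML keys above; the merge rule is the one stated in the comment (unspecified fields keep their defaults).

use std::collections::HashMap;
use serde::Deserialize;

/// Sketch: [inference.profiles.*] overrides. Field set mirrors the TOML above.
#[derive(Debug, Default, Deserialize)]
pub struct ProfileOverride {
    pub temperature: Option<f32>,
    pub min_p: Option<f32>,
    pub reasoning_effort: Option<String>, // "none" | "low" | "medium" | "high" | "max"
}

#[derive(Debug, Deserialize)]
pub struct InferenceProfilesConfig {
    #[serde(default)]
    pub profiles: HashMap<String, ProfileOverride>,
}

/// Merge an owner override onto a subsystem's default profile.
pub fn merge_override(mut base: InferenceProfile, ov: &ProfileOverride) -> InferenceProfile {
    if let Some(t) = ov.temperature { base.temperature = Some(t); }
    if let Some(mp) = ov.min_p { base.min_p = Some(mp); }
    if let Some(effort) = ov.reasoning_effort.as_deref() {
        base.reasoning_effort = Some(match effort {
            "none" => ReasoningEffort::None,
            "low" => ReasoningEffort::Low,
            "high" => ReasoningEffort::High,
            "max" => ReasoningEffort::Max,
            _ => ReasoningEffort::Medium, // unknown values fall back to the balanced default
        });
    }
    base
}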
Locked profiles
The following subsystems have locked profiles that cannot be overridden by owner configuration:
- risk: Temperature 0.1, reasoning Max. Safety-critical. Always maximum precision.
- death_reflect: Temperature 0.5, reasoning Max. The Golem’s final honest self-assessment cannot be constrained.
- operator: Temperature 0.7, reasoning High. Owner communication quality is never degraded.
Attempting to override a locked profile logs a warning and uses the locked defaults.
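A sketch of the enforcement rule, reusing the ProfileOverride/merge_override sketch from the Configuration section above. The warn-and-ignore behaviour is what the text specifies; the tracing macro is used illustratively.

/// Subsystems whose profiles cannot be overridden by owner configuration.
const LOCKED_PROFILES: &[&str] = &["risk", "death_reflect", "operator"];

/// Sketch: apply an owner override unless the subsystem's profile is locked.
pub fn resolve_profile(
    subsystem: &str,
    default: InferenceProfile,
    override_: Option<&ProfileOverride>,
) -> InferenceProfile {
    match override_ {
        Some(ov) if !LOCKED_PROFILES.contains(&subsystem) => merge_override(default, ov),
        Some(_) => {
            // Locked profile: log a warning and keep the defaults from this document.
            tracing::warn!(subsystem, "ignoring override for locked inference profile");
            default
        }
        None => default,
    }
}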
Cross-References
| Topic | Document | What it covers |
|---|---|---|
| Model routing and provider resolution | 01a-routing.md | Self-describing providers, declarative intents, and the full routing algorithm including subsystem intent declarations |
| Structured output schemas | 01-structured-outputs.md (companion doc) | StructuredOutput abstraction across providers with JSON Schema enforcement and graceful degradation |
| Context engineering pipeline | 04-context-engineering.md | 8-layer pipeline where inference profile parameters interact with prompt cache alignment and tool pruning |
| Venice provider specification | 12-providers.md section Venice | Venice private cognition plane: TEE attestation, E2EE inference, sensitivity classification, and DIEM staking |
| Hypnagogic temperature scheduling | ../06-hypnagogia/02-architecture.md | Temperature ramp during dream-cycle onset/return transitions between waking and sleep states |
| Dream cycle phases | ../05-dreams/01-architecture.md | NREM memory consolidation, REM creative recombination, and integration insight promotion phases |
| Mortality pressure on inference | 01a-routing.md section Mortality | How dying Golems become more cost-sensitive: quality downgrade, cost_sensitivity increase, exempt subsystems |
| Daimon emotional appraisal | ../03-daimon/01-appraisal.md | The Daimon’s emotional regulation subsystem: PAD vector extraction, appraisal schemas, and affective bias |