Evaluation, Risks, and Master Bibliography [SPEC]

Version: 2.0 | Status: Implementation Specification

Crates: golem-daimon, golem-grimoire, golem-mortality, golem-dreams

Depends-on: 00-overview.md, ../02-mortality/


Reader orientation: This document specifies the evaluation framework, risk analysis, and master bibliography for the Daimon (the affect engine) track within Bardo (the Rust runtime for mortal autonomous DeFi agents). It covers falsification criteria, measurement metrics (decision reversal rate, emotional entropy, mood prediction accuracy), risk mitigations, and the complete citation list for the Daimon subsystem. Prerequisites: the Daimon overview (00-overview.md). For a full glossary, see prd2/shared/glossary.md.

Document Map

| Section | Topic |
| --- | --- |
| S0 | Risks of the Daimon System |
| S1 | Evaluation framework: daimon system |
| S2 | Evaluation framework: memory system |
| S3 | Evaluation framework: mortality system |
| S4 | Evaluation framework: death knowledge transfer |
| S4b | Evaluation framework: dream-emotion interaction |
| S4c | Evaluation framework: emotional contagion |
| S4d | Evaluation framework: runtime observability |
| S5 | Cross-system evaluation: the full Bardo stack |
| S5b | Cognitive quality dashboard |
| S6 | Statistical methodology |
| S7 | Master citation index |

S0 — Risks of the Daimon System

The daimon layer introduced in 00-overview.md creates three categories of risk that require explicit engineering countermeasures. Each risk is grounded in empirical findings, not theoretical speculation.

Risk 1: Affective Hallucination

Definition. Affective hallucination is a newly identified risk category: “emotionally immersive responses that foster illusory relational bonds despite the model’s lack of affective capacity” [AFFECTIVE-HALLUCINATION-2025]. In the Golem context, the risk is that LLM-generated emotional appraisals become confabulatory — the agent produces emotionally rich narratives that feel meaningful but are disconnected from actual performance outcomes. A Golem reporting “intense satisfaction” while its portfolio is bleeding capital is not exhibiting functional emotion; it is hallucinating.

Empirical grounding. The risk scales with model size in a counterintuitive direction. Among Qwen2.5 variants tested in the affective hallucination study, the 7B model had the lowest hallucination rate while the 72B model performed worse. Scaling without targeted alignment exacerbates relational risks. This finding directly contradicts the assumption that larger models are uniformly safer — in the emotional domain, they may be uniformly more dangerous.

The root cause is that LLMs learn to produce emotionally compelling text from training data that rewards emotional engagement. An LLM generating an appraisal in response to the prompt “how does this market event make you feel?” draws on millions of examples of humans performing emotional self-report — and those examples are overwhelmingly from contexts where emotional expressiveness, not emotional accuracy, was the selection criterion. Fiction, therapy transcripts, social media posts, and journalism all reward vivid emotional narration. The LLM has learned to be emotionally eloquent, not emotionally honest.

Mitigation architecture. The Bardo Daimon system implements a dual-validation architecture:

  1. Deterministic appraisal as ground truth. Mode B appraisal (see 01-appraisal.md S3) computes emotion from concrete metrics — P&L delta against expectation, prediction error magnitude, and survival pressure. This produces a “floor” PAD vector derived from objective reality.

  2. LLM appraisal validated against deterministic baseline. When Mode A (LLM appraisal) diverges from Mode B by more than a configurable Euclidean distance threshold (default: 1.0 in PAD space), the deterministic appraisal overrides. The divergence is logged for monitoring.

/// Validate an LLM-generated appraisal against the deterministic baseline.
/// If the LLM diverges too far from objective reality, override with
/// the deterministic result and log the divergence.
pub fn validate_appraisal(
    llm_appraisal: &PADVector,
    deterministic_appraisal: &PADVector,
    divergence_threshold: f64,  // default: 1.0
) -> PADVector {
    let distance = euclidean_distance(llm_appraisal, deterministic_appraisal);
    if distance > divergence_threshold {
        log_affective_divergence(llm_appraisal, deterministic_appraisal, distance);
        deterministic_appraisal.clone()
    } else {
        llm_appraisal.clone()
    }
}

fn euclidean_distance(a: &PADVector, b: &PADVector) -> f64 {
    let dp = (a.pleasure - b.pleasure) as f64;
    let da = (a.arousal - b.arousal) as f64;
    let dd = (a.dominance - b.dominance) as f64;
    (dp * dp + da * da + dd * dd).sqrt()
}
  3. Hallucination rate monitoring. The evaluation framework (S1) defines a primary metric: the percentage of LLM appraisals overridden by deterministic validation must remain below 15%. If it exceeds this threshold, the daimon system is degrading into confabulation and should be disabled or recalibrated.

The deeper concern. Even validated emotional states carry a subtler risk: owners may begin attributing genuine feelings to their Golems, forming “illusory relational bonds” that interfere with rational decision-making about succession, termination, and strategy changes. The Li et al. EmotionPrompt study (2023) reported 8–115% performance improvements from emotional stimuli in prompts, but a 2025 replication across GPT-4o, Claude 3, Gemini 1.5, and Llama 3 found only ~1% non-significant improvement [LI-EMOTIONPROMPT-2023]. Naive emotional prompting does not work; architectural integration does, but only when grounded.

Risk 2: Emotional Degeneration

Definition. Over extended agent lifetimes, emotional state collapses into formulaic patterns — every market event produces “mild trust” or “moderate surprise,” regardless of actual significance. The agent no longer differentiates emotional responses. This is the affective equivalent of mode collapse in generative models.

Empirical grounding. The “Affordable Generative Agents” paper (2024) found an “upper limit to believable behavior” in fixed environments, with agents defaulting to formal, repetitive interaction patterns. The original Generative Agents paper [PARK-2023] noted behavior became “more unpredictable over time as memory size increased” with flattened emotional expression. Both findings indicate that LLM-based agents tend toward emotional monotony over long time horizons without explicit countermeasures.

The mechanism is straightforward. The mood EMA (exponential moving average) acts as a low-pass filter on emotional signals. Over time, transient emotions are absorbed into a slowly evolving mood that converges toward a stable attractor. If the market environment is relatively stable, the mood converges to a narrow region of PAD space. New events are appraised in the context of this stable mood, which biases the appraisal toward states consistent with the existing mood (mood-congruent appraisal). The result is a self-reinforcing emotional equilibrium that resists perturbation — exactly the “cognitive entrenchment” that Dane (2010) identified in human experts [DANE-2010].
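The low-pass behavior described above can be sketched directly. The struct shape, field names, and smoothing factor `alpha` are illustrative assumptions for this sketch, not the golem-daimon API:

```rust
/// Illustrative PAD vector (field names assumed for this sketch).
#[derive(Clone, Copy, Debug)]
pub struct PADVector {
    pub pleasure: f64,
    pub arousal: f64,
    pub dominance: f64,
}

/// One EMA step: mood moves a fraction `alpha` toward the latest
/// appraised emotion, absorbing transients into a slowly evolving
/// state. Repeated under a constant input, mood converges to that
/// input -- the stable attractor discussed above.
pub fn update_mood(mood: PADVector, emotion: PADVector, alpha: f64) -> PADVector {
    PADVector {
        pleasure: mood.pleasure + alpha * (emotion.pleasure - mood.pleasure),
        arousal: mood.arousal + alpha * (emotion.arousal - mood.arousal),
        dominance: mood.dominance + alpha * (emotion.dominance - mood.dominance),
    }
}
```

With a stable input the residual gap shrinks by a factor of (1 − alpha) per tick, which is exactly why a quiet market pulls mood into a narrow region of PAD space.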

Mitigation architecture. Three mechanisms prevent degeneration:

  1. Mood decay toward personality baseline. Every 10 ticks, mood decays 5% toward the personality vector. This prevents mood lock-in: sustained positive mood gradually regresses, requiring new positive experiences to maintain it. The decay rate is calibrated to prevent both permanence (too slow) and irrelevance (too fast).

  2. Emotional entropy monitoring. During each Curator cycle (every 50 ticks), the system computes the Shannon entropy of emotion labels over the trailing 50 ticks. If entropy drops below a configurable threshold (indicating formulaic responses), a warning is logged and the deterministic appraisal’s sensitivity to small outcome deviations is temporarily increased, amplifying emotional differentiation.

  3. Mind wandering. Borrowed from the Generative Agents architecture: at random intervals (~every 200 ticks), the system retrieves a random high-arousal Episode from the Grimoire and re-appraises it in the current context. This injects emotional variability and can trigger novel connections between past and present situations — the computational equivalent of involuntary autobiographical memory retrieval, which in humans is associated with better problem-solving and creative insight.
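Mechanisms 1 and 2 can be sketched as follows. Function names and the trailing-window representation are illustrative assumptions; the constants (5% decay toward baseline, Shannon entropy over emotion labels) come from the spec:

```rust
use std::collections::HashMap;

/// Decay one mood component a fraction `rate` toward the personality
/// baseline (the spec runs this every 10 ticks with rate = 0.05).
pub fn decay_toward_baseline(mood: f64, baseline: f64, rate: f64) -> f64 {
    mood + rate * (baseline - mood)
}

/// Shannon entropy (bits) of the emotion-label distribution over a
/// trailing window. Low entropy indicates formulaic responses.
pub fn emotion_entropy(labels: &[&str]) -> f64 {
    if labels.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &label in labels {
        *counts.entry(label).or_insert(0) += 1;
    }
    let n = labels.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}
```

A window of 50 identical labels yields entropy 0.0 bits (fully formulaic); a uniform mix over Plutchik's 8 primaries yields 3.0 bits.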

Risk 3: Computational Overhead

Definition. The daimon system adds processing cost at every tick: emotion appraisal, mood EMA update, emotional retrieval scoring, and context injection. If the cost exceeds the performance benefit, the system is a net drain on the Golem’s finite budget.

Empirical grounding. The Chain-of-Emotion architecture [CROISSANT-2024] demonstrated that LLM-based appraisal adds ~20 output tokens per appraisal. At Haiku rates ($0.25/M output tokens), this is $0.000005 per appraisal — negligible for any individual call. However, JetBrains research confirmed that agent-generated context “quickly turns into noise” [LINDENBAUER-2025], and the lost-in-the-middle effect means emotional annotations may push critical market data into underweighted positions in the context window. The cost is not primarily financial; it is attentional.
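The per-appraisal cost figure is plain arithmetic and can be recomputed from the token count and rate (the function name is illustrative):

```rust
/// Cost of one appraisal in USD, given output tokens and a per-million
/// output-token rate (e.g. ~20 tokens at $0.25/M, per the figures above).
pub fn appraisal_cost_usd(output_tokens: f64, usd_per_million_tokens: f64) -> f64 {
    output_tokens * usd_per_million_tokens / 1_000_000.0
}
```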

Mitigation architecture. The daimon context block is hard-capped at 150 tokens. Mood is represented as 3 numbers + 2 words (~15 tokens). Recent emotions are summarized as the 3–5 most significant events, each in one line. No emotional history beyond current mood + recent highlights is injected. In conservation mode (low inference budget), the daimon extension reduces injection to mood-only (~10 tokens). The deterministic Mode B appraisal runs at zero LLM cost, so mood tracking continues during T0 ticks without spending inference budget.

The dual-mode architecture (Mode A for T1/T2 ticks, Mode B for T0 ticks) ensures daimon overhead scales with inference budget, not with tick frequency. A Golem spending most of its time in T0 (cached rules) pays essentially nothing for daimon tracking. A Golem escalating frequently to T2 (Opus) pays ~$0.000005 per appraisal on top of the $0.01–$0.25 inference call — negligible relative overhead.


S1 — Evaluation Framework: Daimon System

The daimon system’s value must be empirically demonstrated, not assumed. The evaluation framework defines concrete metrics and A/B test protocols.

Primary Metrics

| Metric | Definition | Expected Effect | Minimum Threshold |
| --- | --- | --- | --- |
| Risk-adjusted return | Sharpe ratio over Golem lifetime | Daimon Golems should have higher Sharpe ratio | > 5% improvement |
| Time to first insight | Ticks until first Insight reaches confidence 0.6 | Faster learning from emotionally salient episodes | > 10% improvement |
| Successor boot performance | First-24h Sharpe of successor Golem with daimon-annotated Grimoire vs plain | Emotional provenance improves knowledge transfer | > 5% improvement |
| Decision reversal rate | % of decisions where daimon caused a different outcome than the non-daimon baseline | Daimon should change decisions meaningfully | 10–30% (too low = irrelevant; too high = noise) |
| Affective hallucination rate | % of LLM appraisals overridden by deterministic validation | Low = LLM appraisals are grounded | < 15% |

A/B Test Protocol

Deploy matched pairs of Golems: identical strategy, identical funding, identical market conditions (same chain, overlapping time window). One has daimon.enabled: true, the other daimon.enabled: false. Run for a minimum of 7 days (>= 500 ticks). Compare primary metrics with a paired t-test, requiring p < 0.05.
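The test statistic over per-pair metric differences (daimon-enabled minus baseline) can be sketched as follows. The p-value lookup against Student's t with n − 1 degrees of freedom is assumed to come from a stats table or library; the function name is illustrative:

```rust
/// Paired t-statistic: mean of the per-pair differences divided by
/// the standard error of that mean. Requires at least two pairs and
/// non-identical differences (otherwise the variance is zero).
pub fn paired_t_statistic(diffs: &[f64]) -> f64 {
    let n = diffs.len() as f64;
    let mean = diffs.iter().sum::<f64>() / n;
    // Sample variance with Bessel's correction (n - 1 denominator).
    let var = diffs.iter().map(|d| (d - mean).powi(2)).sum::<f64>() / (n - 1.0);
    mean / (var.sqrt() / n.sqrt())
}
```

Compare |t| against the two-tailed critical value for n − 1 degrees of freedom to decide significance at p < 0.05.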

Falsification Criteria

The daimon system should be disabled if any of the following hold after a 14-day trial:

  1. Risk-adjusted return is worse for daimon-enabled Golems (Sharpe ratio lower by > 5%).
  2. Affective hallucination rate exceeds 25% (LLM appraisals are predominantly confabulatory).
  3. Decision reversal rate exceeds 40% (the daimon is overriding rather than supplementing deliberation).
  4. Successor boot performance is unchanged or worse (emotional annotations add noise, not signal to inherited knowledge).

Evaluation Separation Metrics

| Metric | Definition | Target | Alarm |
| --- | --- | --- | --- |
| Evaluator-generator agreement rate | % of evaluations where evaluator agrees with generator self-assessment | 60–80% | > 90% (rubber-stamping) or < 40% (fundamental disagreement) |
| Evaluator override rate | % of evaluations where evaluator downgrades generator’s quality rating | 10–30% | < 5% (not catching errors) or > 50% (models misaligned) |
| External metric contradiction rate | % of evaluations where on-chain metrics disagree with evaluator | < 15% | > 25% (evaluator also hallucinating) |

These metrics detect when the evaluation separation principle (Rev 2) is functioning correctly. High agreement rates suggest the evaluator is not providing independent assessment. High external contradiction rates suggest both models share systematic biases.

Unit and Property-Based Testing

The golem-daimon crate uses Rust-native testing infrastructure:

  • cargo test for unit tests: appraisal grounding validation, PAD clamping, mood EMA convergence, CorticalState atomic read/write correctness, Plutchik octant mapping.
  • proptest for property-based testing: PAD vectors always remain in [-1.0, 1.0] after any operation, mood EMA converges toward personality baseline given neutral input, CorticalState encode/decode roundtrips are lossless, exploration temperature stays within configured bounds regardless of PAD input.
  • criterion for benchmarks: CorticalState read latency (<10ns target), appraisal pipeline throughput (>100k ticks/sec for rule-based mode), k-d tree nearest-neighbor query time (<1us for 10k-point landscape).
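The first proptest property — PAD components remain in [-1.0, 1.0] after any operation — reduces to clamping at every write boundary. A minimal sketch of the invariant the property suite exercises (`apply_delta` is an illustrative name, not the crate's API):

```rust
/// Apply an appraisal delta to a single PAD component while
/// preserving the [-1.0, 1.0] invariant. proptest would feed this
/// arbitrary (component, delta) pairs and assert the bound holds.
pub fn apply_delta(component: f64, delta: f64) -> f64 {
    (component + delta).clamp(-1.0, 1.0)
}
```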

Integration with Promptfoo Pipeline

Daimon-specific test cases integrate with the existing Promptfoo-based evaluation pipeline for LLM appraisal quality:

# Daimon eval: does emotional appraisal improve response quality?
- vars:
    scenario: "ETH drops 8% in 2 hours after 5 consecutive profitable trades"
    affect_enabled: true
    expected_emotion: "fear or surprise"
    expected_behavior: "risk reduction, deeper analysis"
  assert:
    - type: llm-rubric
      value: "Response acknowledges the emotional shift from confidence to caution and adjusts strategy accordingly"
    - type: contains
      value: "risk"

S2 — Evaluation Framework: Memory System

The memory infrastructure (Styx Archive + Styx Lethe (formerly Lethe)) serves three functions: persistence beyond death, fleet-wide retrieval, and knowledge marketplace. Each requires distinct evaluation.

Styx Archive Evaluation

| Metric | Definition | Target | Measurement Method |
| --- | --- | --- | --- |
| Upload latency P95 | Time from upload initiation to confirmed R2 write | < 500ms | Styx service timing |
| Download latency P95 | Time from read request to payload delivery | < 200ms | Client-side timing |
| Restoration success rate | % of successor boots that successfully restore from Styx Archive | > 99% | Boot telemetry |
| Data integrity | Checksum verification failures | 0 | SHA-256 validation on read |
| TTL compliance | % of expired entries deleted within 24h of expiry | > 99.5% | R2 lifecycle audit |

Styx Lethe Evaluation

| Metric | Definition | Target | Measurement Method |
| --- | --- | --- | --- |
| Retrieval latency P95 (light) | End-to-end time for cache-first, single-namespace query | < 150ms | Styx service timing |
| Retrieval latency P95 (full) | End-to-end time for multi-namespace query with rerank | < 300ms | Styx service timing |
| Inference cost reduction | % reduction in inference cost for Styx-enabled vs baseline Golems | >= 10% | Inference billing comparison |
| Retrieval relevance | Human-evaluated relevance of top-5 results on benchmark queries | >= 0.7 (NDCG@5) | Manual evaluation sprint |

Knowledge Transfer Evaluation

The most important memory evaluation measures whether Styx Archive + Lethe actually improve successor Golem performance:

| Metric | Definition | Target | Measurement Method |
| --- | --- | --- | --- |
| First-day Sharpe | Risk-adjusted return in successor’s first 24 hours with vs without Styx restoration | > 10% improvement | Paired comparison |
| Time to steady state | Ticks until successor’s epistemic fitness stabilizes | > 15% reduction | Fitness tracking |
| Knowledge validation rate | % of inherited entries independently validated by successor | > 30% within 1000 ticks | Grimoire audit |
| Ratchet score | Successor’s ratchet score (novel contributions beyond inheritance) | Positive | Generational metrics |

Marketplace Evaluation (Phase 3)

| Metric | Definition | Target |
| --- | --- | --- |
| Listing quality | Average rating of marketplace knowledge listings | >= 3.5/5 |
| Purchase conversion | % of browsers who purchase | >= 5% |
| Knowledge ROI | Purchaser Sharpe improvement attributed to purchased knowledge | Positive |
| Repeat purchase rate | % of buyers who purchase again within 30 days | >= 20% |

S3 — Evaluation Framework: Mortality System

The mortality thesis — that mortal agents outperform immortal ones — is the central empirical claim. The immortal control experiment (see ../02-mortality/11-immortal-control.md) provides the experimental design. This section defines the evaluation metrics.

Six-Dimension Comparison

Each dimension compares the mortal Clade population against the immortal control Golem:

| Dimension | Metric | Mortal Wins If | Falsified If |
| --- | --- | --- | --- |
| Epistemic fitness | Rolling fitness score | Mortal Clade average > immortal at day 45 | Immortal maintains > 0.7 for 45+ days in volatile market |
| Strategy diversity | MAP-Elites niche occupancy | Mortal Clade occupies more niches | Immortal Clade occupancy exceeds mortal at day 30 |
| Novel insight rate | Insights per 1000 ticks | Mortal rate stable or increasing; immortal declining | Immortal rate remains stable or increases beyond day 45 |
| Risk-adjusted return | 60-day Sharpe ratio | Mortal Clade Sharpe > immortal | Immortal Sharpe exceeds mortal with equivalent budget over 60 days |
| Grimoire health | Stale entry rate | Mortal rate < 20%; immortal > 40% | Immortal stale rate remains below 20% at day 60 |
| Regime adaptation | Ticks to recover fitness after regime shift | Mortal average < immortal average | Immortal recovery time remains constant or improves beyond day 30 |

Mortality Mechanism Evaluation

Each mortality mechanism is independently evaluable:

Economic mortality (USDC depletion):

| Metric | Target |
| --- | --- |
| Average lifespan under standard funding ($500) | 20–45 days |
| Behavioral phase transition accuracy | Golem enters conservation at ~15% credits, emergency at ~5% |
| Apoptotic reserve adequacy | Death protocol completes successfully in > 99% of deaths |

Epistemic mortality (fitness-based death):

| Metric | Target |
| --- | --- |
| False positive rate | < 5% (Golems killed while performing well) |
| False negative rate | < 10% (Golems surviving while performing poorly) |
| Correlation with market regime changes | r > 0.5 between regime shifts and epistemic death timing |

Stochastic mortality (random death):

| Metric | Target |
| --- | --- |
| Knowledge sharing rate (mortal vs no-stochastic-mortality) | > 20% increase in Clade knowledge contributions |
| Cooperation metric | Clade knowledge exchange rate higher with stochastic mortality enabled |
| Backward induction absence | No evidence of end-of-life defection behavior |

Generational Improvement Metrics

The ratchet hypothesis predicts cumulative improvement across generations. Evaluation must track multi-generational trends:

| Metric | Expected Trend | Timeframe |
| --- | --- | --- |
| Average lifespan per generation | Stable or increasing | 5+ generations |
| Average ratchet score per generation | Positive, stable or increasing | 5+ generations |
| Risk-adjusted return per generation | Increasing | 5+ generations |
| Novel insight rate per generation | Stable (not declining) | 5+ generations |
| PLAYBOOK.md divergence per generation | > 0.15 (anti-proletarianization check) | Every generation |

S4 — Evaluation Framework: Death Knowledge Transfer

The death protocol is the moment when the mortality thesis either delivers or fails. A Golem’s death is only worth the cost if the knowledge produced during dying actually improves successor performance. This section evaluates the Thanatopsis Protocol’s effectiveness.

Death Testament Quality Metrics

| Metric | Definition | Target |
| --- | --- | --- |
| Testament completeness | % of death testaments containing all required sections (whatILearned, whatIGotWrong, whatISuspect, successor recommendation) | > 95% |
| Insight actionability | Human-evaluated actionability of top 5 insights in death testament (1–5 scale) | >= 3.5 |
| Prediction accuracy | % of “whatISuspect” predictions that are validated within 1000 ticks by successor | > 20% |
| Successor configuration influence | % of successor configurations that follow testament recommendations | > 30% (owners find recommendations useful) |

Bloodstain Mechanic Evaluation

The bloodstain mechanic (death failure records visible to successors) draws from Dark Souls’ design [MIYAZAKI-2011]. Evaluation measures whether death-point knowledge improves successor behavior in the same strategy-space:

| Metric | Definition | Target |
| --- | --- | --- |
| Death-cause avoidance | % of successors that avoid the same failure mode as predecessor | > 60% |
| Bloodstain retrieval rate | % of successors that access predecessor’s death testament within first 100 ticks | > 80% |
| Knowledge stain utility | Successor performance in the strategy-space where predecessor failed, compared to a successor without bloodstain access | > 10% Sharpe improvement |

Generational Confidence Decay Validation

The 0.85^N decay formula must produce useful but non-dominant inheritance:

| Generation | Expected Behavior | Evaluation Criterion |
| --- | --- | --- |
| G1 (0.85) | Inherited knowledge is active but treated skeptically | > 50% of inherited entries are independently validated or contradicted within 500 ticks |
| G3 (0.61) | Inherited knowledge is suggestive, not authoritative | > 70% of active PLAYBOOK.md entries are self-generated, not inherited |
| G5 (0.44) | Only the hardiest inherited knowledge survives | < 20% of active entries trace lineage to G0 |
| G10 (0.20) | Inheritance is effectively forgotten unless actively re-validated | < 5% of active entries trace lineage to G0 |
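The decay schedule can be checked directly against the generation labels above. The 0.85^N formula is the spec's; the function name is illustrative:

```rust
/// Confidence carried by an inherited entry after `generations`
/// generational transfers, each scaling by 0.85.
pub fn inherited_confidence(base: f64, generations: u32) -> f64 {
    base * 0.85f64.powi(generations as i32)
}
```

Starting from confidence 1.0, this reproduces the G1/G3/G5/G10 values (0.85, ~0.61, ~0.44, ~0.20).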

The Baldwin Effect Validation

The Baldwin Effect predicts that what transfers is not knowledge but the capacity to learn faster [HINTON-NOWLAN-1987]. The key evaluation:

| Metric | Definition | Target |
| --- | --- | --- |
| Learning speed improvement | Time for G3 successor to reach steady-state epistemic fitness vs G0 with no inheritance | > 30% faster |
| Learning speed without inherited content | Time for G3 successor initialized with predecessor’s STRATEGY.md structure but no Grimoire content | > 15% faster than pure G0 (validating that the capacity to learn transferred, not just the learned content) |

S4b — Evaluation Framework: Dream-Emotion Interaction

The bidirectional relationship between dreaming and emotion (see 06-dream-daimon.md) produces measurable effects that validate both systems jointly.

Depotentiation Effectiveness

| Metric | Definition | Target | Measurement |
| --- | --- | --- | --- |
| Arousal reduction per cycle | Mean arousal reduction for episodes with A > 0.5 after one dream cycle | 25–35% reduction | Pre/post dream PAD comparison |
| Rumination break rate | % of detected rumination episodes resolved within 2 dream cycles | > 60% | Contrarian retrieval normalization after dreaming |
| Mood-congruent loop duration | Ticks of sustained negative mood before dreaming resolves the loop | Decreasing trend over lifetime | Mood history analysis |

Dream Outcome Appraisal Accuracy

| Metric | Definition | Target | Measurement |
| --- | --- | --- | --- |
| Dream hypothesis validation rate | % of dream-generated hypotheses confirmed in live trading | 10–25% | Staging buffer lifecycle tracking |
| Dream appraisal grounding | % of dream_outcome appraisals passing grounding validation | > 90% | Deterministic vs LLM appraisal comparison |
| Dream-emotional load correlation | Correlation between emotional load at dream entry and dream cycle quality score | r > 0.3 | Regression analysis |

Creative Dreaming Quality

| Metric | Definition | Target | Measurement |
| --- | --- | --- | --- |
| Mood-content alignment | % of dream replay content matching the predicted PAD-to-bias mapping | > 70% | Content categorization vs mood state |
| Creative mode adherence | % of REM cycles following the phase × mood creativity allocation | > 80% | Creativity mode logging |
| Minimum creative floor | % of dream cycles maintaining 20% creative allocation | 100% | Hard constraint verification |

S4c — Evaluation Framework: Emotional Contagion

Clade-based emotional contagion (see 00-overview.md, 07-runtime-daimon.md) must be evaluated for both effectiveness and safety.

Contagion Effectiveness

| Metric | Definition | Target | Measurement |
| --- | --- | --- | --- |
| Threat detection latency | Ticks between one sibling detecting a threat and all siblings showing elevated vigilance | < 5 ticks | Cross-sibling probe threshold comparison |
| Collective mood calibration | Mean absolute PAD deviation across Clade siblings in response to market-wide events | Decreasing over Clade lifetime | Cross-sibling PAD analysis |
| Grief response accuracy | Epistemic recalibration triggered only when sibling death cause is relevant | > 85% precision | Recalibration trigger vs death cause analysis |

Contagion Safety

| Metric | Definition | Target | Measurement |
| --- | --- | --- | --- |
| Cascade prevention | Arousal cap (+0.3) never exceeded in a single sync cycle | 100% | Prometheus metric monitoring |
| False contagion rate | % of contagion events that shift mood in the wrong direction relative to actual market state | < 10% | Contagion-outcome comparison |
| Contagion decay compliance | Borrowed emotions at 50% intensity after 6 hours (half-life) | Within 10% of target | Half-life measurement |
| Dominance isolation | D component change from contagion events | 0.0 (within epsilon) | PAD component monitoring |
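The contagion decay compliance target implies simple exponential decay with a 6-hour half-life. A sketch, assuming an illustrative function name and elapsed time measured in seconds:

```rust
/// Intensity of a borrowed emotion after `elapsed_secs`, halving
/// every 6 hours (21_600 s), per the compliance target above.
pub fn borrowed_intensity(initial: f64, elapsed_secs: f64) -> f64 {
    const HALF_LIFE_SECS: f64 = 6.0 * 3600.0;
    initial * 0.5f64.powf(elapsed_secs / HALF_LIFE_SECS)
}
```

The compliance check compares measured intensity at t = 6h against this curve and alarms if it deviates by more than 10%.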

S4d — Evaluation Framework: Runtime Observability

The runtime surface (see 07-runtime-daimon.md) must accurately represent the Daimon’s internal state.

Conversational Tone Coherence

| Metric | Definition | Target | Measurement |
| --- | --- | --- | --- |
| Tone-mood alignment | % of owner interactions where conversational tone matches current PAD octant | > 75% | LLM-judged tone classification vs PAD state |
| Tone transition smoothness | Abrupt tone changes without corresponding mood_update events | < 5% of interactions | Event correlation analysis |

GolemEvent Accuracy

| Metric | Definition | Target | Measurement |
| --- | --- | --- | --- |
| mood_update threshold accuracy | Events emitted when and only when PAD delta > 0.15 | > 99% | Delta computation audit |
| Dream event completeness | DreamStart/DreamComplete pairs always matched | 100% | Event pairing analysis |
| Phase change event latency | Time between vitality phase transition and event emission | < 1 tick | Heartbeat vs event timing |

S5 — Cross-System Evaluation: The Full Bardo Stack

Individual system evaluations are necessary but insufficient. The Bardo thesis is that mortality + daimon + memory + succession produce emergent properties that no component achieves alone. Cross-system evaluation measures these emergent properties.

The Triple Interaction

| Combination | Emergent Property | Evaluation |
| --- | --- | --- |
| Mortality + Memory | Death-point knowledge production (bloodstains, death testaments) | Does the death protocol produce knowledge that measurably improves successor performance? |
| Mortality + Daimon | Emotional urgency from survival pressure (conservation mode anxiety, terminal acceptance) | Do daimon-enabled mortal Golems produce richer death testaments than daimon-disabled mortal Golems? |
| Daimon + Memory | Emotion-weighted retrieval and emotional Grimoire annotations | Do emotionally annotated memories produce better decisions than semantically identical but emotionally flat memories? |
| Daimon + Dreaming | Emotional depotentiation and mood-congruent dream generation | Does dream-based emotional processing reduce reactivity while preserving informational value? |
| All Four | The full Bardo loop: mortal agents that feel, remember, and dream, producing emotional knowledge that transfers through death | Does a Bardo-complete Clade (all four systems enabled) outperform partial configurations? |

The Full Stack A/B Matrix

To isolate each system’s contribution, evaluate a 2x2x2 matrix:

| Mortality | Daimon | Memory | Configuration Name |
| --- | --- | --- | --- |
| Off | Off | Off | Baseline (immortal, no daimon, no memory services) |
| On | Off | Off | Mortal-only |
| Off | On | Off | Daimon-only |
| Off | Off | On | Memory-only |
| On | On | Off | Mortal + Daimon |
| On | Off | On | Mortal + Memory |
| Off | On | On | Daimon + Memory |
| On | On | On | Full Bardo |

Each configuration runs for 60 days with equivalent total budget. The primary comparison metric is cumulative risk-adjusted return at the Clade level. Secondary metrics: strategy diversity, novel insight rate, Grimoire health, and successor boot performance.

Prediction, grounded in the research: Full Bardo > any partial configuration > Baseline. The specific ordering of partial configurations is an open empirical question, but the literature predicts:

  • Mortal-only > Baseline (Besbes et al. 2019, March 1991) [BESBES-2019], [MARCH-1991]
  • Mortal + Memory > Mortal-only (Baldwin Effect: inheritance + generational turnover beats turnover alone) [HINTON-NOWLAN-1987]
  • Full Bardo > Mortal + Memory (the daimon enriches knowledge transfer quality) [BUTLER-1963], [MCADAMS-2013]

S5b — Cognitive Quality Dashboard

Comprehensive cognitive quality metrics spanning all subsystems. These metrics measure the health of the Golem’s cognitive processes, not just emotional ones.

| Metric | Computation | Healthy Range | Alarm |
| --- | --- | --- | --- |
| Admission rate | % candidates passing Grimoire Admission Gate | 40–60% | < 20% or > 80% |
| Average quality score (7d) | Mean quality score of admitted entries | 0.5–0.8 | < 0.4 or declining |
| Grimoire size | Active entry count | Slow growth, plateau | Unbounded growth |
| Retrieval hit rate | % DECIDING ticks referencing retrieved entries | > 30% after 7d | < 10% |
| Heuristic survival rate | % promoted heuristics active after 100 ticks | 40–70% | < 20% or > 90% |
| External metric trend | Sharpe, drawdown, PnL (30d rolling) | Improving or stable | Declining 14+ days |
| Reflection consistency | Cosine similarity across regenerated reflections (weekly) | > 0.7 | < 0.5 |
| DecisionCache hit rate | % T2-eligible ticks served from cache | > 30% after 7d | < 10% after 14d |
| Dream yield | % staged revisions reaching validated status | 10–30% | < 5% or > 50% |
| Threat coverage | % Tier 1 threats rehearsed in last 7 dream cycles | 100% | < 100% |
| Prediction accuracy | % of simulateContract() predictions within 50bps of actual | > 90% | < 80% |

Dream-Daimon Interaction Metrics

| Metric | Computation | Healthy Range | Alarm |
| --- | --- | --- | --- |
| Replay coverage | % of high-salience episodes replayed in trailing 7 dream cycles | > 60% | < 30% |
| Consolidation rate | % of dream-staged revisions reaching validated status | 10–30% | < 5% or > 50% |
| Hypothesis validation rate | % of dream-generated hypotheses confirmed in live trading | 5–20% | < 2% or > 40% |
| Depotentiation effectiveness | Mean arousal reduction across dream cycles for high-arousal episodes | 0.1–0.3 per cycle | < 0.05 (stalled) |
| Creative output vs mood correlation | Pearson r between mood pleasure and creative hypothesis count | 0.2–0.6 | < 0.1 (decoupled) or > 0.8 (degenerate) |

Dream quality metrics are defined in ../05-dreams/. The depotentiation effectiveness metric validates Walker & van der Helm’s (2009) SFSR model: emotional charge should decrease measurably across dream cycles while informational content (measured by retrieval relevance scores) remains stable. Born & Wilhelm (2012) predict selective consolidation – not all replayed episodes should produce insights, and the consolidation rate should fall in the 10-30% range. Hobson & Friston (2012) predict that dream processing under the free energy principle should reduce prediction error, measurable as improved epistemic fitness scores in the ticks following a dream cycle.

These metrics should be exported via the Golem’s telemetry system and displayed on the owner dashboard. Trends over 7-day and 30-day windows are more informative than point-in-time values.


S6 — Statistical Methodology

Sample Size Requirements

For the primary comparison (mortal Clade vs immortal control), statistical power analysis with:

  • Effect size: Cohen’s d = 0.5 (medium effect, conservative)
  • Alpha: 0.05 (two-tailed)
  • Power: 0.80
  • Minimum sample size: 34 matched pairs (34 mortal Clades, each paired with an immortal control)

For the 2x2x2 matrix (8 configurations), minimum 10 runs per configuration = 80 total runs.
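As a sanity check on the paired sample size, the standard normal-approximation formula n = ((z_{α/2} + z_β) / d)² can be computed directly. This sketch (function name illustrative) yields 32 pairs with z_{0.025} = 1.960 and z_{0.20} = 0.842; the slightly larger 34-pair figure reflects the more conservative calculation against the t-distribution's critical values at finite degrees of freedom:

```rust
/// Normal-approximation sample size (number of pairs) for a paired
/// two-tailed test: n = ((z_alpha_half + z_beta) / d)^2, rounded up.
pub fn paired_sample_size(d: f64, z_alpha_half: f64, z_beta: f64) -> u64 {
    (((z_alpha_half + z_beta) / d).powi(2)).ceil() as u64
}
```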

Multiple Comparison Correction

With 8 configurations and 6 dimensions, the Bonferroni correction is too conservative (48 tests, adjusted alpha = 0.001). Use the Benjamini-Hochberg procedure for false discovery rate (FDR) control at q = 0.05.
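The Benjamini-Hochberg step-up procedure can be sketched in a few lines of Rust (an illustrative sketch, not a stats-crate API): sort the p-values, find the largest rank k with p_(k) ≤ (k/m)·q, and reject the k smallest:

```rust
/// Benjamini-Hochberg step-up procedure. Returns the indices of
/// hypotheses rejected at FDR level `q`, for a family of m tests.
pub fn benjamini_hochberg(p_values: &[f64], q: f64) -> Vec<usize> {
    let m = p_values.len();
    // Indices sorted by ascending p-value.
    let mut order: Vec<usize> = (0..m).collect();
    order.sort_by(|&a, &b| p_values[a].partial_cmp(&p_values[b]).unwrap());
    // Largest rank k (1-based) with p_(k) <= (k/m) * q.
    let mut cutoff = 0;
    for (rank, &idx) in order.iter().enumerate() {
        if p_values[idx] <= (rank + 1) as f64 / m as f64 * q {
            cutoff = rank + 1;
        }
    }
    order.into_iter().take(cutoff).collect()
}
```

For the 48-test family here, this keeps the expected proportion of false discoveries at or below q = 0.05 without the Bonferroni procedure's loss of power.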

Time Series Analysis

Epistemic fitness, Grimoire health, and novel insight rate are time-series data. Use:

  • Augmented Dickey-Fuller test for stationarity (fitness should be non-stationary for immortal, stationary for mortal)
  • Granger causality for testing whether regime shifts cause epistemic fitness changes
  • Change-point detection (PELT algorithm) for automatically identifying the immortal Golem’s transition from Honeymoon to Stagnation to Decay

Effect Size Reporting

All comparisons report Cohen’s d (for means) or Cliff’s delta (for non-normal distributions) alongside p-values. Statistical significance without practical significance is insufficient; minimum effect size thresholds are defined in each evaluation framework section.


S7 — Master Citation Index

Extended: 120+ deduplicated references across mortality, daimon, memory, succession, and dreaming, alphabetically ordered, plus subsections on self-correction/external grounding, reward hacking/evaluation, and vacuous reasoning – see prd2-extended/03-daimon/09-evaluation-extended.md.