
The Complete Evaluation Map [SPEC]

Version: 1.0 | Status: Draft

Depends on: All previous testing documents. This is the capstone showing how evaluation systems compose.

Source: active-inference-research/new/17-evaluation-map.md

Reader orientation: This is the capstone document for Section 16 (Testing), showing how all 14 feedback loops compose across five speed tiers. It provides a single view of what evaluates what, at what speed, and how the loops feed each other – from machine-speed residual correction (~15,000/day) through weekly meta-learning evaluation. If you are reading one testing document, this is the map. See prd2/shared/glossary.md for full term definitions.


Purpose

The Bardo architecture has 14 distinct feedback loops spread across multiple documents. A reader who has been through them all needs a single view: what evaluates what, at what speed, and how the loops feed each other.


The 14 Feedback Loops, Ordered by Speed

Tier 1: Machine Speed (per-resolution, 5-15 seconds, ~zero cost)

| # | Loop | What It Evaluates | Metric | Gate Action |
|---|------|-------------------|--------|-------------|
| 1 | Residual correction | Systematic bias in predictions | Mean residual per (category, regime) | Shift prediction centers, adjust widths |
| 2 | Confidence calibration | LLM stated confidence vs actual accuracy | ECE per (category, regime) | Calibrate stated confidence before use |
| 3 | Adversarial awareness | Slippage excess, sandwich patterns | Mean excess bps, sandwich frequency | Escalate to private mempool, adjust gas |

These three loops run on every prediction resolution or every action. Pure arithmetic – no LLM, no human, no waiting. They compound at ~15,000 iterations/day.
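Loop 1 is simple enough to sketch in full. The following is a minimal illustration, not the spec's implementation; the class name, window size, and update rule are assumptions:

```python
from collections import defaultdict, deque

class ResidualCorrector:
    """Sketch of Loop 1: track mean residual per (category, regime)
    and shift prediction centers by the learned bias.
    Window size and names are illustrative assumptions."""

    def __init__(self, window=500):
        # One bounded buffer of residuals per (category, regime) arena.
        self.buffers = defaultdict(lambda: deque(maxlen=window))

    def record(self, category, regime, predicted, actual):
        # Residual = actual - predicted; positive means we under-predicted.
        self.buffers[(category, regime)].append(actual - predicted)

    def bias(self, category, regime):
        buf = self.buffers[(category, regime)]
        return sum(buf) / len(buf) if buf else 0.0

    def correct(self, category, regime, predicted):
        # Gate action: shift the prediction center by the mean residual.
        return predicted + self.bias(category, regime)
```

Because the update is pure arithmetic over a bounded buffer, it can run on every resolution, which is what allows ~15,000 iterations/day at near-zero cost.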

Tier 2: Cognitive Speed (per-theta-tick, 30-120 seconds, T0 cost)

T0, T1, and T2 are the three inference tiers: T0 is fast, cheap local inference; T1 is mid-tier cloud inference; T2 is the most capable and most expensive model.

| # | Loop | What It Evaluates | Metric | Gate Action |
|---|------|-------------------|--------|-------------|
| 4 | Prediction accuracy | Were specific claims correct? | Hit rate per category | Action gate permits/blocks |
| 5 | Action gating | Should the golem trade this tick? | Action accuracy vs inaction accuracy | Permit or suppress execution |
| 6 | Context attribution | Were the right memories retrieved? | Credit/debit per Grimoire entry | Boost/demote retrieval ranking |
| 7 | Cost-effectiveness | Is expensive inference worth it? | Δ-accuracy per dollar per tier | Shift inference tier routing |
| 8 | Attention foraging | What should the golem monitor? | Prediction error per item | Promote/demote attention tier |

These five loops run on every theta tick (or a subset). Mostly arithmetic with occasional cheap reads. ~1,000 iterations/day.
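Loop 5's permit/suppress decision, for example, reduces to comparing two hit rates. A minimal sketch, assuming simple counters; the minimum sample size is an illustrative threshold, not from the spec:

```python
def action_gate(action_hits, action_total, inaction_hits, inaction_total,
                min_samples=20):
    """Sketch of Loop 5: permit execution only when accuracy on ticks
    where the golem acted beats accuracy on ticks where it held back.
    min_samples is an illustrative threshold."""
    if action_total < min_samples or inaction_total < min_samples:
        return False  # insufficient data: suppress execution
    return (action_hits / action_total) > (inaction_hits / inaction_total)
```

Suppressing by default under small samples is the conservative choice: the gate only opens once there is evidence that acting outperforms holding back.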

Tier 3: Consolidation Speed (per-dream-cycle, 4-12 hours, T1-T2 cost)

| # | Loop | What It Evaluates | Metric | Gate Action |
|---|------|-------------------|--------|-------------|
| 9 | NREM residual replay | What patterns exist in prediction errors? | Pattern detection in residuals | Propose bias corrections, environmental models |
| 10 | REM counterfactual generation | What would have happened if…? | Counterfactual prediction accuracy | Generate novel testable hypotheses |
| 11 | Reasoning quality review | Is reasoning consistent and aligned? | Alignment rate, contradiction count | Flag fragile reasoning for replay |
| 12 | Tool selection evaluation | Did we use the best tool? | Execution quality vs alternatives | Update tool routing preferences |

Loops 9-11 run during dream cycles; loop 12 runs per action. All involve LLM reasoning (T1/T2). ~30-50 iterations/day.
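Loop 9's "pattern detection in residuals" can be approximated with a significance test over the Tier 1 buffers. A sketch assuming a z-score test; the function name and thresholds are illustrative, not from the spec:

```python
import statistics

def nrem_replay_candidates(residuals_by_bucket, z=2.0, min_n=30):
    """Sketch of Loop 9's pattern detection: flag (category, regime)
    buckets whose mean residual differs from zero by more than z
    standard errors, proposing each as a bias correction.
    Thresholds are illustrative."""
    proposals = {}
    for bucket, res in residuals_by_bucket.items():
        if len(res) < min_n:
            continue  # not enough data to call it a pattern
        mean = statistics.fmean(res)
        stderr = statistics.stdev(res) / len(res) ** 0.5
        if stderr > 0 and abs(mean) > z * stderr:
            proposals[bucket] = mean  # proposed center shift
    return proposals
```

The output is a set of proposals, not applied corrections: the dream cycle surfaces candidates, and the Tier 1 corrector remains the mechanism that actually shifts prediction centers.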

Tier 4: Retrospective Speed (daily/weekly, T1 cost)

| # | Loop | What It Evaluates | Metric | Gate Action |
|---|------|-------------------|--------|-------------|
| 13 | Retrospective PnL | Was this decision good in hindsight? | Position lifecycle PnL, regret score | Re-tag episodes, update somatic markers |
| 14 | Heuristic audit | Which PLAYBOOK rules make money? | PnL per heuristic citation | Promote/demote/investigate heuristics |

These two loops run on daily/weekly schedules. They require full position lifecycle data before producing meaningful results.
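Loop 14's three-way gate (promote/demote/investigate) can be sketched directly from per-citation PnL. The function name and thresholds are illustrative assumptions:

```python
def audit_heuristics(citations, promote_above=0.0, min_citations=5):
    """Sketch of Loop 14: average PnL per PLAYBOOK citation.
    `citations` maps heuristic id -> list of realized PnL values for
    trades that cited it. Thresholds are illustrative."""
    verdicts = {}
    for rule, pnls in citations.items():
        if len(pnls) < min_citations:
            verdicts[rule] = "investigate"  # too few samples to judge
        elif sum(pnls) / len(pnls) > promote_above:
            verdicts[rule] = "promote"
        else:
            verdicts[rule] = "demote"
    return verdicts
```

This is why the loop can only run weekly: the PnL lists require full position lifecycle data, which takes days to accumulate per heuristic.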

Tier 5: Meta Speed (weekly/generational)

Not numbered above because meta-learning evaluates the other loops, not domain performance:

| Metric | What It Measures | Healthy Signal |
|--------|------------------|----------------|
| Corrector convergence rate | How fast residuals converge per category | Decreasing over time |
| Dream yield | Fraction of creative predictions confirmed | Stable or increasing |
| Attention precision | Fraction of promoted items that become useful | Above 50% |
| Heuristic half-life | How long PLAYBOOK rules survive | Increasing over time |
| Time-to-competence | Ticks for gen N+1 to reach 70% accuracy | Decreasing over generations |
| Shadow strategy comparison | Would different params have been better? | Real config outperforms shadows |
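The first meta-metric can be read off as the slope of |mean residual| across successive evaluation windows: a negative slope means the corrector is converging. A sketch using an ordinary least-squares slope; the function name is an assumption:

```python
def convergence_rate(abs_bias_history):
    """Sketch of a Tier 5 meta-metric: least-squares slope of
    |mean residual| over successive evaluation windows.
    Negative slope = converging (healthy)."""
    n = len(abs_bias_history)
    if n < 2:
        return 0.0  # one window is not a trend
    mean_x = (n - 1) / 2
    mean_y = sum(abs_bias_history) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(abs_bias_history))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var
```

Note that the meta-metric never looks at domain performance directly; it only asks whether the Tier 1 loop's own error statistic is shrinking.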

How the Loops Feed Each Other

TIER 1 (machine speed)
+-------------------+  +--------------------+  +---------------------+
| 1. Residual       |  | 2. Confidence      |  | 3. Adversarial      |
|    correction     |  |    calibration     |  |    awareness        |
+--------+----------+  +---------+----------+  +---------+-----------+
         |                       |                        |
         | corrected             | calibrated             | slippage
         | predictions           | confidence             | patterns
         |                       |                        |
         v                       v                        v
TIER 2 (cognitive speed)
+-------------------+  +--------------------+  +---------------------+
| 4. Prediction     |  | 5. Action gate     |  | 6. Context          |
|    accuracy       |<-| (uses 4's data)    |  |    attribution      |
+--------+----------+  +--------------------+  +---------+-----------+
         |                                                |
+--------+----------+  +--------------------+             |
| 7. Cost-          |  | 8. Attention       |             |
|    effectiveness  |  |    foraging        |             |
+--------+----------+  +---------+----------+             |
         |                       |                        |
         | tier routing          | promoted items         | entry scores
         | adjustments           | for richer             | for retrieval
         |                       | predictions            | ranking
         v                       v                        v
TIER 3 (consolidation speed)
+-------------------+  +--------------------+  +---------------------+
| 9. NREM replay    |  | 10. REM            |  | 11. Reasoning       |
| (largest          |-->|  counterfactuals   |  |     quality         |
|  residuals)       |  |  (seeded by 9)     |  |     review          |
+--------+----------+  +---------+----------+  +---------+-----------+
         |                       |                        |
         | bias corrections      | novel hypotheses       | alignment flags
         | environmental         | testable               | contradiction
         | models                | predictions            | counts
         |                       |                        |
         +-----------------------+-----------------------+
                                 |
+----------------------------------------------------------------+
| 12. Tool selection evaluation                                  |
|     (feeds tool routing, prediction calibration)               |
+-----------------------------+----------------------------------+
                              |
                              v
TIER 4 (retrospective speed)
+-------------------------------+  +----------------------------+
| 13. Retrospective PnL         |  | 14. Heuristic audit        |
|     position lifecycle        |  |     which rules work?      |
|     vs-inaction comparison    |  |     promote/demote         |
|     regret scoring            |  |                            |
+-------------------------------+  +----------------------------+
         |                                    |
         | re-tagged episodes                 | heuristic
         | somatic markers                    | adjustments
         | long-horizon bias corrections      |
         v                                    v
TIER 5 (meta speed)
+----------------------------------------------------------------+
| META-LEARNING EVALUATION                                       |
|                                                                |
| Evaluates all loops above:                                     |
| - Is the corrector converging faster?                          |
| - Are dreams producing better hypotheses?                      |
| - Is attention finding useful items?                           |
| - Are heuristics lasting longer?                               |
| - Is each generation reaching competence faster?               |
|                                                                |
| Adjusts: dream frequency, attention thresholds, corrector      |
| parameters, inheritance compression ratio                      |
+----------------------------------------------------------------+
         |
         | If meta-learning score negative for 2 consecutive weeks:
         v
+----------------------------------------------------------------+
| MORTALITY ENGINE                                               |
| Accelerate epistemic death clock -- the golem can't improve    |
+----------------------------------------------------------------+

The Compounding Property

Each tier’s output feeds the tier below it:

  • Tier 1 produces corrected, calibrated predictions. Tier 2 makes better gating and attention decisions with better data.
  • Tier 2 identifies high-error items and useful context. Tier 3 dreams about the right things and reviews the right reasoning.
  • Tier 3 discovers patterns and novel hypotheses. Tier 4 evaluates whether those discoveries translate to PnL.
  • Tier 4 audits strategic effectiveness. Tier 5 evaluates whether the evaluation loops are improving.
  • Tier 5 adjusts the loops, feeding back to Tier 1 with better parameters.

After 30 days: ~450,000 Tier 1 corrections, ~30,000 Tier 2 assessments, ~150 Tier 3 dream evaluations, ~4 Tier 4 retrospectives, ~4 Tier 5 meta-evaluations. Each slower tier compounds the value of the faster tiers above it.


The Karpathy Property Across All Loops

Every loop has the same structure: one metric, one arena, one gate, running as fast as the arena allows.

| Loop | Arena | Metric | Gate | Speed |
|------|-------|--------|------|-------|
| Residual correction | (category, regime) buffer | Mean residual | Shift center, adjust width | Per-resolution |
| Confidence calibration | (category, regime) calibration curve | ECE | Calibrate stated confidence | Per-resolution |
| Adversarial awareness | Per-transaction slippage | Excess bps | Escalate gas/mempool strategy | Per-action |
| Prediction accuracy | Per-category ledger | Hit rate | Permit/block actions | Per-tick |
| Action gating | Action vs inaction accuracy | Relative accuracy | Permit/block | Per-tick |
| Context attribution | Per-entry credit/debit | Context value score | Boost/demote retrieval | Per-tick |
| Cost-effectiveness | Per-tier accuracy gain | Δ-accuracy / $ | Shift tier routing | Per-tick |
| Attention foraging | Per-item prediction error | Violation count | Promote/demote tier | Per-tick |
| NREM replay | Residual patterns | Pattern significance | Propose corrections | Per-dream |
| REM counterfactuals | Hypothetical scenarios | Counterfactual accuracy | Register predictions | Per-dream |
| Reasoning quality | Trace consistency | Alignment rate | Flag for replay | Per-dream |
| Tool selection | Execution vs alternatives | Execution quality gap | Update preferences | Per-action |
| Retrospective PnL | Position lifecycle | PnL trajectory, regret | Re-tag episodes, somatic markers | Daily/weekly |
| Heuristic audit | PLAYBOOK citations vs PnL | Avg PnL per citation | Promote/demote/investigate | Weekly |
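The shared structure can be made literal. A sketch of the one-metric, one-arena, one-gate shape, with Loop 4 instantiated as an example; the class name and the 0.6 threshold are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class EvalLoop:
    """Sketch of the shared shape: one arena (observations),
    one metric, one gate, run as often as the arena allows."""
    name: str
    metric: Callable[[List[float]], float]  # arena -> single number
    gate: Callable[[float], str]            # number -> gate action
    arena: List[float] = field(default_factory=list)

    def observe(self, x: float) -> str:
        self.arena.append(x)
        return self.gate(self.metric(self.arena))

# Loop 4 (prediction accuracy) as an instance of the shared shape.
hit_rate = EvalLoop(
    name="prediction-accuracy",
    metric=lambda xs: sum(xs) / len(xs),              # hit rate
    gate=lambda m: "permit" if m >= 0.6 else "block", # 0.6 is illustrative
)
```

Every row of the table above fits this shape by swapping in a different arena, metric function, and gate function; the speed column is just how often `observe` fires.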

The power is not in any individual loop. It is in all 14 running simultaneously and compounding across each other.


The Owner’s View: What They See and When

| When | What the Owner Sees | Which Loops Are Visible |
|------|---------------------|-------------------------|
| Every glance | Spectre clarity (accuracy), eye state (affect), status bar (accuracy %, vitality) | 1, 2, 4 |
| Every few minutes | Decision ring blazes on escalated ticks, toasts for promotions/blocks | 4, 5, 8 |
| After dream cycle | Dream results toast, creative predictions registered | 9, 10, 11 |
| Daily | Daily review in FATE > Reviews: PnL attribution, position retrospectives | 13 |
| Weekly | Weekly review: heuristic audit, shadow strategy comparison, meta-learning dashboard | 6, 7, 12, 13, 14, Meta |
| At death | Death testament: comprehensive epoch review, inheritance report, successor recommendations | All loops contribute |

The owner does not need to understand the 14 loops to benefit from them. They see:

  1. Is the golem getting smarter? (Spectre clarity + accuracy trend)
  2. Is it making money? (PnL in status bar + weekly review)
  3. What did it learn? (Dream results + PLAYBOOK evolution)
  4. Should I change anything? (Shadow comparisons + heuristic audit)
  5. Is the system itself improving? (Meta-learning dashboard)

What Could Go Wrong

Feedback loop oscillation

If meta-learning adjusts dream frequency, and dream frequency changes creative yield, which changes meta-learning scores, the system could oscillate. Mitigation: meta-learning adjustments are rate-limited (one per weekly review) with asymmetric bias (reducing > increasing).
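The asymmetric rate limit can be sketched as a clamp applied once per weekly review. The function name and the 5%/20% bounds are illustrative assumptions:

```python
def adjust_dream_frequency(current, proposed, max_up=0.05, max_down=0.20):
    """Sketch of the oscillation mitigation: one adjustment per weekly
    review, clamped asymmetrically so reductions can move faster than
    increases. The 5%/20% bounds are illustrative."""
    delta = proposed - current
    if delta > 0:
        delta = min(delta, current * max_up)    # increases crawl
    else:
        delta = max(delta, -current * max_down)  # decreases move faster
    return current + delta
```

Asymmetric clamping damps oscillation because the loop can back out of a bad setting quickly but can only ratchet upward slowly, so over- and under-shoots cannot alternate at full amplitude.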

Over-evaluation paralysis

With 14 feedback loops, the system might spend more time evaluating than acting. Mitigation: Tiers 1-2 are near-zero cost. Tiers 3-4 are bounded by dream cycle and retrospective schedules. Total evaluation cost is <5% of inference budget.

Metric gaming across loops

If the action gate uses calibrated confidence, and calibration is computed from the same predictions the gate evaluates, there is circularity. Mitigation: calibration is computed on a rolling window excluding the most recent 50 predictions.
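The holdout can be sketched as a windowing rule applied before fitting the calibration curve. The 50-prediction holdout comes from the text; the 1,000-prediction window and the function name are illustrative assumptions:

```python
def calibration_window(predictions, holdout=50, window=1000):
    """Sketch of the circularity mitigation: compute calibration on a
    rolling window that excludes the most recent `holdout` predictions,
    so the gate never scores itself on the data it was calibrated on.
    Window size is an illustrative assumption."""
    # Drop the newest `holdout` predictions, then keep at most `window`.
    eligible = predictions[:-holdout] if len(predictions) > holdout else []
    return eligible[-window:]
```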

False signal from small samples

Many loops need significant data before metrics are reliable. Every loop reports sample size, and metrics are marked “insufficient data” below configurable thresholds.


References

  • [KARPATHY-2026] Karpathy, A. “autoresearch.” GitHub, March 2026. — Demonstrates a single-metric, single-arena evaluation loop running at maximum speed; the structural pattern adopted by every tier in the evaluation map.
  • [XIONG-2023] Xiong, M. et al. “Can LLMs Express Their Uncertainty?” arXiv:2306.13063, 2023. — Shows LLMs are systematically overconfident (80-100% stated vs 50-60% actual accuracy), motivating the Tier 1 confidence calibration loop.
  • [LIU-METACOGNITION-2025] Liu, T. & Hernández-Lobato, J.M. “Truly Self-Improving Agents Require Intrinsic Metacognitive Learning.” ICML 2025. arXiv:2506.05109. — Argues that agents need loops that evaluate their own evaluation process; the theoretical basis for Tier 5 meta-learning.
  • [CHEN-2025] Chen, A. et al. “Reasoning Models Don’t Always Say What They Think.” arXiv:2505.05410, 2025. — Shows reasoning traces may not reflect actual computation, motivating the Tier 3 reasoning quality review loop.
  • [VOVK-2005] Vovk, V. et al. Algorithmic Learning in a Random World. Springer, 2005. — Foundational conformal prediction framework that underpins the residual correction loop’s interval width computation.
  • [BARRETT-2017] Barrett, L.F. How Emotions Are Made: The Secret Life of the Brain. Houghton Mifflin, 2017. — Constructionist emotion theory informing how the Daimon’s affect signals feed into retrospective episode re-tagging.
  • [HUANG-2024] Huang, J. et al. “Large Language Models Cannot Self-Correct Reasoning Yet.” ICLR 2024. — Demonstrates that intrinsic self-correction degrades LLM performance, constraining what the reasoning quality review (Loop 11) can trust from LLM-generated narratives.
  • [PAN-2024] Pan, A. et al. “Feedback Loops Drive In-Context Reward Hacking in LLMs.” ICML 2024. — Warns that iterated LLM feedback loops can produce reward hacking; the reason meta-learning adjustments are rate-limited to one per weekly review.