[SUPERSEDED by 14-rust-implementation.md]
Bardo Inference: Performance, Latency & Practicality
Document Type: SPEC (normative) | Version: 1.0 | Status: Draft
Last Updated: 2026-03-14
Package: @bardo/inference
Depends on: prd2-bardo-inference.md
Purpose: Honest analysis of the performance cost of Bardo Inference’s 8-layer context engineering pipeline. Per-layer latency budgets, parallelization strategy, conditional bypass rules, streaming implications, and the architecture decisions that keep total overhead under the threshold where users notice.
Reader orientation: This document (superseded by 14-rust-implementation.md) analyzes the performance cost of the 8-layer context engineering pipeline in Bardo Inference, the LLM inference gateway for mortal autonomous DeFi agents called Golems. It belongs to the inference plane and covers per-layer latency budgets, parallelization strategy, conditional bypass rules, and streaming implications from the original TypeScript/Bun perspective. The key takeaway is that the full pipeline adds < 50ms at P95, dominated by ONNX model inference (DeBERTa at 8ms, nomic-embed at 5ms), keeping overhead well below the threshold where users notice. For term definitions, see prd2/shared/glossary.md.
1. The Core Tension
Every layer in the context engineering pipeline adds value (cost savings, quality improvement, safety) but also adds latency. An 8-layer pipeline that takes 500ms to process before the first token streams from the LLM is a terrible UX — the user stares at a blank screen for half a second before anything happens, on top of the LLM’s own time-to-first-token (TTFT).
The target: total pipeline overhead < 50ms for the p95 request. This is achievable because most layers are either sub-millisecond (hash lookups, reordering), parallelizable, or conditionally skipped.
For context, production LLM gateway benchmarks show:
- Bifrost (Go-based): 11µs overhead at 5,000 RPS — effectively invisible
- Cloudflare AI Gateway: 10–50ms proxy latency
- TrueFoundry: ~3–4ms latency at 350+ RPS
- LiteLLM (Python-based): 15–30ms reported by users, with concurrency bottlenecks
Bardo Inference is a TypeScript/Bun service with heavier per-request processing than a pure proxy, but lighter than a Python-based gateway. The goal is to sit in the 20–50ms range for the full pipeline, which is imperceptible when the LLM itself takes 500ms–30s.
2. Per-Layer Latency Budget
2.1 Layer Breakdown
| Layer | Operation | Expected Latency | Parallelizable? | Can Skip? |
|---|---|---|---|---|
| L1: Prompt cache alignment | Reorder system prompt segments, place cache_control breakpoints | < 1ms | N/A (pure computation) | Skip if the request is not Anthropic-backed |
| L2: Semantic cache check | Embed request → cosine similarity search against vector store | 5–25ms | ✅ Parallel with L3, L7, L8 | Skip for streaming conversations, tool-heavy requests |
| L3: Hash cache check | SHA-256 hash → O(1) lookup | < 0.5ms | ✅ Parallel with L2 | Never skip — always cheap |
| L4: Tool pruning | Analyze tool definitions, replace with meta-tools | 1–3ms | N/A (must complete before request is sent) | Skip if < 5 tools defined |
| L5: History compression | Check token count; if over threshold, summarize older turns | 0ms (check) / 200–2000ms (compress) | ⚠️ Only when triggered | Skip if under 80% context limit |
| L6: Lost-in-the-middle | Reorder context by priority within message array | < 1ms | N/A (pure computation) | Skip if < 10 context items |
| L7: PII masking | Presidio NER scan + regex patterns + entity replacement | 5–50ms | ✅ Parallel with L2 | Skip if no PII detected in fast pre-scan |
| L8: Injection detection | DeBERTa-v3 classifier on user input | 10–40ms | ✅ Parallel with L2, L7 | Skip for Golem-internal requests (trusted source) |
2.2 The Critical Insight: Most Layers Are Parallel
The pipeline is NOT a sequential 8-step chain. Layers 2, 3, 7, and 8 all operate on the raw input and can run simultaneously. The execution graph looks like this:
Request arrives
│
├─ PARALLEL GROUP A (fire all at once):
│ ├─ L2: Semantic cache check (~15ms)
│ ├─ L3: Hash cache check (~0.5ms)
│ ├─ L7: PII masking pre-scan (~10ms)
│ └─ L8: Injection detection (~20ms)
│
│ ← Wait for all to complete (~20ms wall clock)
│
├─ DECISION POINT:
│ ├─ L2 or L3 cache hit? → Return cached response. Done. (0ms more)
│ ├─ L8 injection detected? → Block or warn. Done.
│ └─ Continue to sequential layers ↓
│
├─ SEQUENTIAL GROUP B (must happen in order):
│ ├─ L7: Full PII masking (if pre-scan found entities) (~15ms)
│ ├─ L1: Prompt cache alignment (~1ms)
│ ├─ L4: Tool pruning (~2ms)
│ ├─ L6: Position optimization (~1ms)
│ └─ L5: History compression check (~0ms, or triggers async)
│
└─► Send to backend (~0ms — the request is ready)
← LLM TTFT: 200ms–2000ms (this dominates)
Total pipeline overhead: ~20ms (parallel) + ~19ms (sequential) = ~39ms p50
~25ms (parallel) + ~40ms (sequential) = ~65ms p95
The key point: The LLM’s own time-to-first-token is typically 200ms (Groq/Cerebras fast models) to 2000ms+ (Opus 4.6 with deep thinking). Even at the p95 of 65ms, the pipeline overhead is 3–30% of the total TTFT, which is imperceptible to users.
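A minimal TypeScript sketch of parallel group A, assuming each layer is exposed as an async function; the types and function names (checkSemanticCache, detectInjection, and so on) are illustrative stand-ins, not the actual pipeline API:

```typescript
// Illustrative types and layer signatures; the real module API may differ.
interface InferenceRequest { messages: unknown[]; bardo?: { subsystem?: string } }
interface CacheResult { hit: boolean; response?: unknown }
interface GroupAResult { done: boolean; response?: unknown; piiLikely?: boolean }

declare function checkSemanticCache(req: InferenceRequest): Promise<CacheResult>;      // ~15ms
declare function checkHashCache(req: InferenceRequest): Promise<CacheResult>;          // ~0.5ms
declare function piiPreScanRegex(req: InferenceRequest): Promise<{ found: boolean }>;  // ~2ms
declare function detectInjection(req: InferenceRequest): Promise<{ blocked: boolean }>; // ~20ms

async function runParallelGroupA(req: InferenceRequest): Promise<GroupAResult> {
  // Fire all four layers at once; wall-clock cost is the slowest layer, not the sum.
  const [semantic, hash, pii, injection] = await Promise.all([
    checkSemanticCache(req),
    checkHashCache(req),
    piiPreScanRegex(req),
    detectInjection(req),
  ]);

  if (hash.hit) return { done: true, response: hash.response };         // exact repeat
  if (semantic.hit) return { done: true, response: semantic.response }; // near-duplicate
  if (injection.blocked) return { done: true };                         // block or warn, no backend call

  // Fall through to sequential group B, carrying the PII pre-scan verdict.
  return { done: false, piiLikely: pii.found };
}
```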
2.3 Per-Layer Deep Dive
L2: Semantic Cache — The Biggest Variable
The semantic cache is the most latency-variable layer. It requires:
- Embedding the request — converting the input to a vector
- Similarity search — finding the nearest neighbor in the cache
Embedding latency depends heavily on the model:
- all-MiniLM-L6-v2 (CPU, ONNX optimized): ~10–20ms per sentence on modern CPU. Sub-30ms confirmed by benchmark. This is a ~23M parameter model that runs entirely on CPU.
- Model2Vec (static embeddings): ~0.04ms per sentence (25,000+ sentences/sec). 500x faster than MiniLM with ~90% of the quality. Suitable if MiniLM is too slow.
- Ollama embedding endpoint: ~5–15ms (local, GPU-backed if available)
Similarity search latency depends on the vector store:
- In-memory (HNSW via hnswlib-node): < 1ms for 100K vectors
- SQLite + vector extension: 1–3ms
- External (Turbopuffer, Qdrant): 5–20ms (network hop)
Recommendation: Use ONNX-optimized MiniLM-L6-v2 with in-memory HNSW for the semantic cache. Total: ~15ms. If this is too slow, switch to Model2Vec (~2ms total) with acceptable quality tradeoff.
Bypass condition: Skip semantic cache for streaming conversations where the user is mid-flow (the response needs to be fresh, not cached). Also skip for requests with tool calls (tool outputs vary, caching is rarely useful).
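A sketch of the recommended setup, assuming @xenova/transformers for the ONNX MiniLM embedding and hnswlib-node for the in-memory HNSW index; the similarity threshold and index size are illustrative:

```typescript
import { pipeline } from "@xenova/transformers"; // ONNX MiniLM on CPU
import { HierarchicalNSW } from "hnswlib-node";  // in-memory HNSW index

const DIM = 384;                   // all-MiniLM-L6-v2 embedding size
const SIMILARITY_THRESHOLD = 0.92; // illustrative; tune per workload

const embedder = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");
const index = new HierarchicalNSW("cosine", DIM);
index.initIndex(100_000);                        // 100K cached entries fit in process RAM
const responses = new Map<number, string>();     // label -> cached response
let nextLabel = 0;

async function embed(text: string): Promise<number[]> {
  const out = await embedder(text, { pooling: "mean", normalize: true }); // ~10–20ms
  return Array.from(out.data as Float32Array);
}

async function semanticLookup(prompt: string): Promise<string | null> {
  if (index.getCurrentCount() === 0) return null;
  const vec = await embed(prompt);
  const { neighbors, distances } = index.searchKnn(vec, 1); // < 1ms for 100K vectors
  const similarity = 1 - distances[0];                      // cosine distance -> similarity
  return similarity >= SIMILARITY_THRESHOLD ? responses.get(neighbors[0]) ?? null : null;
}

async function semanticStore(prompt: string, response: string): Promise<void> {
  index.addPoint(await embed(prompt), nextLabel);
  responses.set(nextLabel++, response);
}
```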
L7: PII Masking — Presidio Overhead
Presidio uses spaCy NER (en_core_web_lg, ~750MB model) plus regex patterns. The NER step is the bottleneck.
Latency profile:
- spaCy NER on short text (< 500 chars): ~5–15ms CPU
- spaCy NER on medium text (500–2000 chars): ~15–30ms CPU
- spaCy NER on long text (2000+ chars): ~30–80ms CPU
- Regex-only patterns (no NER): < 1ms
Optimization strategy: Two-phase approach.
- Fast pre-scan (< 2ms): Regex-only check for obvious PII patterns (emails, phone numbers, SSNs, credit cards, wallet addresses). If nothing found, skip NER entirely. This catches ~60% of PII cases.
- Full NER scan (only when pre-scan finds patterns OR for high-security requests): Run spaCy NER. Run in parallel with other layers.
Alternative: Use a lighter spaCy model (en_core_web_sm, ~12MB) for the NER step. 3–5x faster with ~10% accuracy loss. Acceptable for most use cases — false negatives go to the LLM, which is less sensitive than sending unmasked PII.
Bypass condition: Skip PII masking entirely for Venice-routed requests (Venice has structural zero data retention — PII masking is redundant). Skip for Golem-internal requests (heartbeat, dreams) where no operator PII is present.
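A sketch of the regex-only pre-scan; the patterns are illustrative starting points rather than a production-grade PII list:

```typescript
// Regex-only PII pre-scan (< 2ms). Patterns are illustrative, not exhaustive.
const PII_PATTERNS: Record<string, RegExp> = {
  email: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/,
  phone: /\+?\d[\d\s().-]{8,}\d/,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/,
  creditCard: /\b(?:\d[ -]?){13,16}\b/,
  evmAddress: /\b0x[a-fA-F0-9]{40}\b/,
};

function piiPreScan(text: string): { found: boolean; kinds: string[] } {
  const kinds = Object.entries(PII_PATTERNS)
    .filter(([, re]) => re.test(text))
    .map(([kind]) => kind);
  return { found: kinds.length > 0, kinds };
}

// Usage: only run the expensive NER pass when the pre-scan finds something.
// if (piiPreScan(userMessage).found) await runFullNerMasking(userMessage);
```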
L8: Injection Detection — DeBERTa Overhead
The DeBERTa-v3-base prompt injection classifier is a ~184M parameter model.
Latency profile:
- DeBERTa-v3-base (CPU, PyTorch): ~30–80ms
- DeBERTa-v3-base (CPU, ONNX Runtime): ~15–35ms
- DeBERTa-v3-small (CPU, ONNX Runtime): ~8–20ms (slightly less accurate)
- DeBERTa-v3-base (GPU): ~3–5ms
Recommendation: Use ONNX-optimized DeBERTa-v3-small for default, DeBERTa-v3-base for high-security. Run in parallel with semantic cache and PII masking.
Bypass condition: Skip for Golem-internal requests (heartbeat, dreams, curator — the prompts are generated by trusted code, not user input). Only run on operator-facing requests and tool results from external sources.
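A sketch of the classifier call from TypeScript, assuming an ONNX export of the ProtectAI classifier is loadable via Transformers.js; the model id comes from the HuggingFace card referenced at the end of this document, and the label name and threshold should be treated as assumptions to verify:

```typescript
import { pipeline } from "@xenova/transformers";

// Assumption: an ONNX export of the classifier is available to Transformers.js.
// Verify the label names on the model card before relying on them.
const classifier = await pipeline(
  "text-classification",
  "protectai/deberta-v3-base-prompt-injection-v2",
);

const INJECTION_THRESHOLD = 0.8; // illustrative; tune against false-positive tolerance

async function detectInjection(userInput: string): Promise<{ blocked: boolean; score: number }> {
  const output = await classifier(userInput); // ~15–35ms on CPU with ONNX
  const result = (Array.isArray(output) ? output[0] : output) as { label: string; score: number };
  const blocked = result.label === "INJECTION" && result.score >= INJECTION_THRESHOLD;
  return { blocked, score: result.score };
}
```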
L5: History Compression — The Sleeping Giant
History compression is not on the critical path for most requests. It only triggers when the conversation approaches the context limit (80% of the model’s window). When it does trigger, it’s expensive: an LLM call to summarize older turns (200–2000ms depending on the summarization model).
Mitigation:
- Never block on compression. Compress asynchronously: send the current request with the full history, and compress in the background for the next request.
- Pre-compress proactively: When history reaches 60% of the limit, start compressing in the background (during idle time between heartbeats).
- Use Anthropic Compaction when available: Server-side compaction happens in-band and is typically faster + higher quality than client-side.
- Compression model choice matters: Use Haiku or Gemini Flash (~200ms TTFT) for compression, not Opus (~1500ms TTFT). The summary doesn’t need to be brilliant — it needs to be fast and accurate.
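A sketch of the never-block rule: the current request ships with the full history and compression runs in the background for the next turn. countTokens, summarizeTurns, and the thresholds are illustrative, not the real module API:

```typescript
interface Message { role: string; content: string }
declare function countTokens(messages: Message[]): number;
declare function summarizeTurns(older: Message[]): Promise<Message>; // Haiku/Flash summary call, 200–2000ms

const COMPRESS_AT = 0.6; // start background compression at 60% of the context window
let compressing = false;

function maybeCompressInBackground(history: Message[], contextLimit: number): void {
  const usage = countTokens(history) / contextLimit;
  if (usage < COMPRESS_AT || compressing || history.length <= 4) return;

  compressing = true;
  // Fire and forget: the summary replaces older turns before the *next* request, never this one.
  summarizeTurns(history.slice(0, -4))
    .then((summary) => {
      history.splice(0, history.length - 4, summary); // keep the 4 most recent turns verbatim
    })
    .catch((err) => console.error("background compression failed:", err))
    .finally(() => { compressing = false; });
}
```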
3. Streaming Considerations
3.1 Time-to-First-Token (TTFT) Impact
TTFT is the most perceptible latency metric. Users notice when the screen is blank.
Without Bardo Inference:
User sends message → LLM processes → first token streams
TTFT = LLM processing time (200ms–2000ms)
With Bardo Inference:
User sends message → Pipeline (20–65ms) → LLM processes → first token streams
TTFT = Pipeline (20–65ms) + LLM processing time (200ms–2000ms)
Overhead: 1–30% increase in TTFT
For fast models (Groq Llama at ~200ms TTFT), the pipeline adds a noticeable ~30% overhead. For standard models (Claude Opus at ~1500ms TTFT), it adds ~3%. For thinking models (R1 extended thinking at 5000ms+), it’s invisible.
3.2 Streaming Passthrough
Once the pipeline completes and the request reaches the backend, streaming works identically to a direct connection. Bardo Inference forwards SSE events as they arrive — there is no buffering or batching of streaming tokens. The bardo-ui-bridge parser operates on individual events as they stream through.
Backend SSE event → Bardo Inference (passthrough, ~0ms) → Client
The only streaming-specific overhead is the bardo-ui-bridge parsing (think tag detection, citation extraction), which is pure string processing at < 0.1ms per event.
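A sketch of the passthrough, assuming a fetch-style upstream Response whose SSE body is piped to the client through a TransformStream; parseUiBridgeEvent stands in for the bardo-ui-bridge string processing:

```typescript
// parseUiBridgeEvent stands in for the bardo-ui-bridge parsing (< 0.1ms per event).
declare function parseUiBridgeEvent(chunk: string): void;

function streamPassthrough(upstream: Response): Response {
  const decoder = new TextDecoder();
  const passthrough = new TransformStream<Uint8Array, Uint8Array>({
    transform(chunk, controller) {
      parseUiBridgeEvent(decoder.decode(chunk, { stream: true })); // think tags, citations
      controller.enqueue(chunk); // bytes are forwarded untouched: no buffering, no batching
    },
  });

  return new Response(upstream.body!.pipeThrough(passthrough), {
    status: upstream.status,
    headers: { "content-type": "text/event-stream" },
  });
}
```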
3.3 Cache Hits: Dramatically Faster
When the semantic or hash cache hits, the response is returned without any LLM call:
Cache hit path:
User sends message → Pipeline parallel group (~20ms) → Cache hit → Return cached response
TTFT: ~20ms (vs. 200–2000ms without cache)
This is 10–100x faster than a fresh LLM call.
For Golem heartbeat ticks in calm markets (where ~20% of requests are semantically identical to recent ones), this means 1 in 5 ticks completes in ~20ms instead of ~500ms. This is a massive UX win for the TUI and web dashboard — the Golem appears to “think” instantly for routine checks.
4. Conditional Bypass Rules
Not every request needs every layer. The pipeline dynamically skips layers based on request characteristics:
4.1 Bypass Matrix
| Layer | Skip When | Rationale |
|---|---|---|
| L1: Cache alignment | Request targets non-Anthropic model | Only Anthropic benefits from prefix caching |
| L2: Semantic cache | Streaming multi-turn conversation, tool-heavy request, cache: false header | Mid-conversation responses should be fresh |
| L3: Hash cache | Never | Always < 0.5ms, always worth checking |
| L4: Tool pruning | < 5 tools defined, or model has native tool_search (GPT-5.4) | Overhead not worth it for small tool sets |
| L5: History compression | Under 80% context limit | No compression needed |
| L6: Position optimization | < 10 context items | Reordering is negligible for small contexts |
| L7: PII masking | Venice-routed request, Golem-internal request, fast pre-scan negative | Venice = structural privacy. Internal = no user PII |
| L8: Injection detection | Golem-internal request (trusted source), security: skip header (non-Golem power users) | Heartbeat/dream prompts are generated code, not user input |
4.2 Fast Path Profiles
The pipeline automatically selects a fast path based on request metadata:
type PipelineProfile = "full" | "standard" | "fast" | "minimal";
function selectProfile(request: InferenceRequest): PipelineProfile {
  // High-security request (death reflection, MEV-sensitive): checked first so it is
  // never shadowed by the generic operator-facing default below
  if (request.securityClass === "private") return "full";
  // Golem heartbeat T0 — deterministic, no LLM call
  if (request.bardo?.subsystem === "heartbeat_t0") return "minimal";
  // Golem internal subsystems (dreams, curator, daimon)
  if (request.bardo?.subsystem && !request.bardo.isOperatorFacing) return "fast";
  // Operator-facing or external user request
  return "standard";
}
const PROFILES: Record<PipelineProfile, LayerConfig> = {
  full: {
    // All 8 layers, no skips. Maximum optimization + safety.
    cacheAlignment: true, semanticCache: true, hashCache: true,
    toolPruning: true, historyCompression: true, positionOptimization: true,
    piiMasking: true, injectionDetection: true,
  },
  standard: {
    // Standard path for most requests. All layers enabled; the per-request bypass rules in 4.1 still apply.
    cacheAlignment: true, semanticCache: true, hashCache: true,
    toolPruning: true, historyCompression: true, positionOptimization: true,
    piiMasking: true, injectionDetection: true,
  },
  fast: {
    // Golem internal. Skip PII masking and injection detection.
    cacheAlignment: true, semanticCache: true, hashCache: true,
    toolPruning: true, historyCompression: true, positionOptimization: true,
    piiMasking: false, injectionDetection: false,
  },
  minimal: {
    // T0 heartbeat. Only hash cache (in case of an exact repeat).
    cacheAlignment: false, semanticCache: false, hashCache: true,
    toolPruning: false, historyCompression: false, positionOptimization: false,
    piiMasking: false, injectionDetection: false,
  },
};
Latency by profile:
| Profile | Typical Latency | When Used |
|---|---|---|
| minimal | < 1ms | T0 heartbeat (no LLM call anyway) |
| fast | ~5–15ms | Golem internal (dreams, curator, daimon) |
| standard | ~20–40ms | Operator conversation, external users |
| full | ~30–65ms | Private/high-security requests |
5. Infrastructure Sizing
5.1 Compute Requirements for Low Latency
| Component | Minimum for < 50ms | Recommended |
|---|---|---|
| CPU | 4 cores (embedding + DeBERTa compete for CPU) | 8 cores (comfortable headroom) |
| RAM | 4GB (MiniLM ~100MB + DeBERTa ~370MB + spaCy ~750MB + cache) | 8GB |
| GPU | Not required (ONNX on CPU is sufficient) | Optional: small GPU (T4/L4) for DeBERTa (~5ms instead of ~20ms) |
| Vector store | In-memory HNSW (part of process RAM) | Same |
| Disk | 1GB (models + SQLite cache) | 10GB (larger cache) |
5.2 Model Loading Strategy
The biggest single latency penalty is cold start — loading models into memory on first request. This can take 5–15 seconds for spaCy + DeBERTa + MiniLM.
Mitigation:
- Load all models at server startup, not on first request
- Use Bun’s fast startup (vs. Node.js) for ~2x faster cold start
- Keep the server warm with a health check endpoint that exercises the pipeline
- If using serverless (not recommended for Bardo Inference), use provisioned concurrency
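A sketch of eager loading under Bun, with illustrative loader functions for each model; requests that arrive before warm-up completes wait on a shared promise instead of hitting cold models:

```typescript
// Illustrative loaders; each wraps the actual model initialization for its layer.
declare function loadEmbedder(): Promise<unknown>;
declare function loadInjectionClassifier(): Promise<unknown>;
declare function loadNerPipeline(): Promise<unknown>;
declare function runPipelineWarmup(): Promise<void>;   // exercises every layer once
declare function handleInference(req: Request): Promise<Response>;

// Load everything at startup so the first real request never pays the 5–15s cold start.
const ready = (async () => {
  await Promise.all([loadEmbedder(), loadInjectionClassifier(), loadNerPipeline()]);
  await runPipelineWarmup();
})();

Bun.serve({
  port: 8080,
  async fetch(req) {
    await ready; // early requests wait for warm-up instead of hitting cold models
    if (new URL(req.url).pathname === "/healthz") return new Response("ok");
    return handleInference(req);
  },
});
```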
5.3 Concurrency
Each request’s parallel group (L2, L3, L7, L8) uses ~4 concurrent operations. With Bun’s event loop + worker threads for CPU-bound ONNX inference:
- 1 concurrent request: Full CPU available, ~20ms parallel group
- 10 concurrent requests: CPU contention on ONNX inference, ~40ms parallel group
- 50+ concurrent requests: Queue depth increases, consider horizontal scaling
Recommendation for single-instance public Bardo Inference: 8-core machine handles ~20 concurrent requests with < 50ms pipeline latency. Above that, horizontal scaling (multiple instances behind a load balancer) with shared cache (Redis or Turso for SQLite).
6. The “Is It Worth It?” Analysis
6.1 Latency Cost vs. Dollar Cost Savings
| Scenario | Pipeline Overhead | Dollar Savings | Worth It? |
|---|---|---|---|
| Operator typing a message (Claude Opus) | +40ms on 1500ms TTFT | 60–70% cost reduction | Yes — imperceptible latency, massive savings |
| Heartbeat T1 (Gemini Flash) | +15ms on 300ms TTFT | 40–50% cost reduction | Yes — 5% latency increase, significant savings |
| Heartbeat T0 (no LLM call) | +1ms on 0ms | $0 (no LLM call to save on) | Neutral — minimal profile adds ~nothing |
| Dream cycle (DeepSeek R1, 120s timeout) | +40ms on 5000ms+ | 20–30% cost reduction | Yes — 0.8% latency increase, good savings |
| Risk assessment (Claude Opus, tools) | +40ms on 2000ms TTFT | 50% cost reduction (tool pruning alone) | Yes — 2% latency increase, critical savings |
| Fast model (Groq Llama, 200ms TTFT) | +40ms on 200ms TTFT | Semantic cache hits = 0ms | Maybe — 20% latency increase, but cache hits are 10x faster |
6.2 When Bardo Inference Adds Negative Latency
Cache hits don’t just save money — they’re faster than direct provider calls:
Direct to provider: ~800ms average TTFT (Claude Sonnet)
Bardo Inference miss: ~840ms average TTFT (40ms overhead + 800ms LLM)
Bardo Inference hit: ~15ms TTFT (cached response, no LLM call)
At 20% hit rate:
Weighted average TTFT = 0.20 × 15ms + 0.80 × 840ms = 675ms
675ms < 800ms → Bardo Inference is FASTER on average despite per-request overhead
This is the key insight: semantic caching transforms Bardo Inference from a latency cost into a latency benefit at hit rates above ~5%, which heartbeat ticks easily achieve in calm markets. (Break-even: with ~15ms hits and ~840ms misses against an 800ms direct call, h × 15ms + (1 − h) × 840ms = 800ms gives h ≈ 4.8%.)
7. What to Monitor
7.1 Latency Metrics
interface PipelineMetrics {
// Per-layer timing
layers: {
cacheAlignment: { p50Ms: number; p95Ms: number; p99Ms: number; skipRate: number };
semanticCache: { p50Ms: number; p95Ms: number; hitRate: number; skipRate: number };
hashCache: { p50Ms: number; p95Ms: number; hitRate: number };
toolPruning: { p50Ms: number; p95Ms: number; tokensSaved: number; skipRate: number };
historyCompression: { p50Ms: number; p95Ms: number; triggerRate: number };
positionOptimization: { p50Ms: number; p95Ms: number; skipRate: number };
piiMasking: { p50Ms: number; p95Ms: number; detectionRate: number; skipRate: number };
injectionDetection: { p50Ms: number; p95Ms: number; flagRate: number; skipRate: number };
};
// Aggregate
pipeline: {
totalP50Ms: number; // Target: < 30ms
totalP95Ms: number; // Target: < 50ms
totalP99Ms: number; // Target: < 100ms
cacheHitRate: number; // Target: > 15%
skipRate: number; // How often layers are bypassed
};
// End-to-end (including LLM)
endToEnd: {
ttftP50Ms: number;
ttftP95Ms: number;
totalP50Ms: number; // Including full response streaming
};
}
7.2 Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| Pipeline p95 latency | > 80ms | > 150ms |
| Semantic cache embedding latency | > 30ms | > 50ms |
| DeBERTa inference latency | > 40ms | > 80ms |
| PII masking latency | > 40ms | > 100ms |
| History compression trigger rate | > 30% of requests | > 50% (context window too small or conversations too long) |
| Cache hit rate (heartbeat) | < 10% | < 5% (cache not working) |
8. Implementation Recommendations
8.1 Language & Runtime
Bun over Node.js: Bun’s faster startup (~2x), native SQLite support, and efficient event loop make it the better choice. The ONNX Runtime for DeBERTa and MiniLM runs via native bindings (C++) regardless of the JS runtime.
Not Go: Unlike Bifrost (11µs overhead in Go), Bardo Inference does real computation per request (embedding, NER, classification). The proxy routing overhead is negligible; the ML inference layers dominate. Go’s advantage in raw proxy throughput doesn’t apply here. TypeScript is the right choice for integration with the Pi ecosystem (all TypeScript).
8.2 ONNX for All ML Models
Convert all ML models (MiniLM, DeBERTa, spaCy) to ONNX format for inference:
- 2–4x faster than PyTorch on CPU
- Consistent latency (no Python GIL contention)
- Smaller memory footprint
- Native C++ execution via onnxruntime-node
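A sketch of direct onnxruntime-node inference; the model path, session options, and input/output tensor names are assumptions that depend on how the model was exported:

```typescript
import * as ort from "onnxruntime-node";

// Model path and tensor names are illustrative; they depend on the export.
const session = await ort.InferenceSession.create("./models/deberta-v3-small-injection.onnx", {
  executionProviders: ["cpu"],
  graphOptimizationLevel: "all",
});

async function runClassifier(inputIds: bigint[], attentionMask: bigint[]): Promise<Float32Array> {
  const len = inputIds.length;
  const feeds: Record<string, ort.Tensor> = {
    input_ids: new ort.Tensor("int64", BigInt64Array.from(inputIds), [1, len]),
    attention_mask: new ort.Tensor("int64", BigInt64Array.from(attentionMask), [1, len]),
  };
  const results = await session.run(feeds);
  return results["logits"].data as Float32Array; // raw logits; apply softmax downstream
}
```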
8.3 Consider Model2Vec for Semantic Cache
If MiniLM embedding latency (~15ms) is too high, switch to Model2Vec static embeddings:
- 500x faster (~0.04ms per sentence)
- ~90% of MiniLM quality on retrieval tasks
- 2.5MB model size (vs. 100MB for MiniLM)
- No ONNX needed — pure JavaScript computation
The quality tradeoff is acceptable for semantic caching (threshold-based matching), where near-misses don’t matter (they just result in a cache miss, not a wrong answer).
8.4 Presidio Alternatives
If Presidio’s spaCy dependency is too heavy (~750MB for en_core_web_lg + ~30ms latency):
- Regex-only PII masking: Handles emails, phone numbers, SSNs, credit cards, wallet addresses. Covers ~80% of PII cases at < 1ms. Good enough for most DeFi use cases (wallet addresses are the primary PII concern).
- Lighter NER model: Use spaCy’s en_core_web_sm (12MB, ~5ms) instead of en_core_web_lg. Catches names and locations with lower accuracy but much faster.
- LLM-based PII detection: Use the LLM itself to detect and redact PII in its response. This adds zero pre-request latency but means PII reaches the LLM (acceptable for Venice, not for other providers).
8.5 Async Everything That Can Be Async
- History compression: Never block. Compress in the background after responding.
- PII masking on responses: Mask in the streaming response handler, not pre-response.
- Injection detection: In the fast profile, detection can be deferred to post-response logging (detect-and-flag rather than detect-and-block).
- Usage tracking: Log asynchronously. Never block a response to write a metric.
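A sketch of the fire-and-forget pattern for usage tracking; recordUsage is an illustrative async sink (SQLite insert, metrics push, or similar):

```typescript
declare function recordUsage(entry: { model: string; tokens: number; latencyMs: number }): Promise<void>;

function trackUsageAsync(entry: { model: string; tokens: number; latencyMs: number }): void {
  // queueMicrotask keeps the response path hot; failures are logged, never propagated.
  queueMicrotask(() => {
    recordUsage(entry).catch((err) => console.error("usage tracking failed:", err));
  });
}
```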
9. Failure Modes & Graceful Degradation
| Failure | Impact | Mitigation |
|---|---|---|
| Embedding model crashes | Semantic cache unavailable | Hash cache still works. Requests pass through uncached. |
| DeBERTa model crashes | Injection detection unavailable | Log the failure, continue without detection. Alert operator. |
| spaCy/Presidio crashes | PII masking unavailable | Regex-only fallback (< 1ms). Log the failure. |
| Vector store corrupted | Semantic cache miss on everything | Rebuild from scratch (cache is ephemeral). No data loss. |
| Cache full (memory pressure) | Eviction rate increases | LRU eviction is automatic. Worst case: lower hit rate. |
| All layers fail | Pipeline degrades to passthrough | Request goes directly to backend with zero optimization. Still works. |
Core principle: Every pipeline layer failing degrades to passthrough, not to error. The worst case is “Bardo Inference acts like a dumb proxy.” The user’s request always goes through.
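A sketch of the degrade-to-passthrough rule, wrapping each layer so that a thrown error returns the input unchanged instead of failing the request; the layer signature is illustrative:

```typescript
// Any layer failure degrades to passthrough, never to an error.
type Layer<T> = (input: T) => Promise<T>;

function degradeToPassthrough<T>(name: string, layer: Layer<T>): Layer<T> {
  return async (input: T) => {
    try {
      return await layer(input);
    } catch (err) {
      // Worst case: Bardo Inference acts like a dumb proxy for this layer.
      console.error(`layer ${name} failed, passing request through unchanged`, err);
      return input;
    }
  };
}

// Usage (illustrative layer names):
// const safePiiMasking = degradeToPassthrough("pii_masking", piiMaskingLayer);
```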
10. Benchmarking Plan
Before shipping, run these benchmarks on the target hardware:
// Benchmark suite
const benchmarks = [
// Individual layer benchmarks
{ name: "hash_cache_lookup", iterations: 10000, target_p95_ms: 1 },
{ name: "semantic_cache_embed", iterations: 1000, target_p95_ms: 25 },
{ name: "semantic_cache_search", iterations: 1000, target_p95_ms: 5 },
{ name: "tool_pruning_171_tools", iterations: 1000, target_p95_ms: 5 },
{ name: "prompt_cache_alignment", iterations: 1000, target_p95_ms: 2 },
{ name: "pii_masking_short_text", iterations: 1000, target_p95_ms: 15 },
{ name: "pii_masking_long_text", iterations: 100, target_p95_ms: 50 },
{ name: "injection_detection", iterations: 1000, target_p95_ms: 30 },
// Full pipeline benchmarks
{ name: "pipeline_minimal_profile", iterations: 1000, target_p95_ms: 2 },
{ name: "pipeline_fast_profile", iterations: 1000, target_p95_ms: 20 },
{ name: "pipeline_standard_profile", iterations: 1000, target_p95_ms: 50 },
{ name: "pipeline_full_profile", iterations: 100, target_p95_ms: 80 },
// Concurrent load benchmarks
{ name: "10_concurrent_standard", iterations: 100, target_p95_ms: 60 },
{ name: "50_concurrent_standard", iterations: 50, target_p95_ms: 100 },
// End-to-end (including backend call)
{ name: "e2e_cache_hit", iterations: 100, target_p95_ms: 25 },
{ name: "e2e_cache_miss_haiku", iterations: 50, target_p95_ms: 600 },
{ name: "e2e_cache_miss_opus", iterations: 20, target_p95_ms: 2500 },
];
References
- [BIFROST-BENCHMARKS-2026] Bifrost. “11µs overhead at 5,000 RPS.” Maxim AI benchmarks, 2026.
- [CLOUDFLARE-AI-GATEWAY-2025] Cloudflare. “AI Gateway: 10–50ms proxy latency.” Docs.
- [TRUEFOUNDRY-BENCHMARK-2025] TrueFoundry. “~3–4ms latency, 350+ RPS on 1 vCPU.” Blog.
- [LITELLM-LATENCY-2025] Reddit users. “15–30ms additional latency.” r/LLMDevs.
- [MINILM-BENCHMARK-2025] AIMultiple. “Sub-30ms cluster: MiniLM-L6-v2 < 30ms latency.” Benchmark, 2025.
- [MODEL2VEC-2025] Hrishikesh. “500x faster than MiniLM.” Medium, 2025.
- [PRESIDIO-LATENCY-2024] Microsoft Presidio. “Latency depends heavily on recognizers and NLP models.” GitHub Discussion #1097.
- [PROTECTAI-DEBERTA-2024] ProtectAI. “deberta-v3-base-prompt-injection-v2.” HuggingFace.
- [PORTKEY-SEMANTIC-CACHE-2025] Portkey. “Semantic Caching for AI Gateway.” 2025.
- [ONNX-INFERENCE-2025] ONNX Runtime. “2–4x faster than PyTorch on CPU.” Docs.