14 – Rust gateway implementation [SPEC]

Single-binary inference gateway with sub-millisecond routing

Cross-references: 00-overview.md (gateway architecture and design principles), 12-providers.md (five provider backends with Rust trait implementations), 13-reasoning.md (unified reasoning chain integration and streaming parser), 04-context-engineering.md (8-layer pipeline this gateway implements)


Reader orientation: This document specifies the Rust implementation of Bardo Inference (the LLM inference gateway for mortal autonomous DeFi agents called Golems). It belongs to the inference plane and describes the 10-crate Cargo workspace, Axum HTTP server, dependency catalog, WASM compilation targets, and build/deployment process. The key concept is that a single static ~50 MB Rust binary replaces a ~2 GB Node/Bun deployment, with <50ms cold start and sub-millisecond routing at thousands of concurrent connections. For term definitions, see prd2/shared/glossary.md.

1. Why Rust

TypeScript on Bun or Node degrades above ~50 concurrent users. The event loop stalls when ONNX inference (thread pool) contends with SSE streaming (I/O). At 200 concurrent Claude Code sessions the tail latency doubles; at 500 it becomes unusable.

Rust eliminates this. Tokio’s work-stealing scheduler handles thousands of concurrent connections without a shared event loop bottleneck. ONNX inference runs on dedicated threads via ort, completely decoupled from async I/O.

| Dimension | TypeScript (Bun/Node) | Rust (Tokio + Axum) |
|---|---|---|
| Deployable artifact | ~2 GB (runtime + node_modules) | ~50 MB static binary |
| Cold start | 500 ms+ | <50 ms |
| P99 at 500 concurrent | >200 ms pipeline overhead | <50 ms pipeline overhead |
| Memory at 100 concurrent | ~800 MB | <200 MB |
| ONNX thread contention | Shares libuv pool | Isolated thread pool |

If Bardo serves hundreds of concurrent Claude Code and Cursor users alongside autonomous Golems, the gateway must be Rust. Pi runtime stays TypeScript. The boundary is HTTP.

Key Rust ecosystem advantages

fastembed-rs: Production-grade embedding library (used by Qdrant) wrapping ort + HuggingFace tokenizers. The entire embed-search-compare path is a single native binary with zero FFI overhead. In TypeScript, this path crosses JS-to-C++ boundaries on every call.

alloy: Canonical Rust Ethereum library. Zero-copy ECDSA signing for x402 payment authorizations (~0.1ms native vs. ~2ms through ethers.js). Transaction batching via sol! macro for contract interaction.

In-process vector search: HNSW implementations (instant-distance, qdrant-client embedded) run similarity search with zero serialization overhead. The embedding-to-search-to-result path stays in the same memory space.

Zero-copy SSE streaming: Ownership model enables forwarding SSE bytes from backends to clients without copying into string buffers, parsing JSON, and re-serializing. At high throughput (128K output tokens), TypeScript creates thousands of GC-pressured string allocations. Rust forwards with zero allocations.
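The concatenation-free forwarding claim can be illustrated with std vectored I/O alone: the SSE prefix, the upstream chunk, and the terminator reach the sink as three slices, with no intermediate string built. This is a minimal sketch, not the gateway's actual streaming code (which forwards Bytes buffers through Axum):

```rust
use std::io::{self, IoSlice, Write};

/// Forward one upstream chunk as an SSE event using vectored I/O.
/// The chunk bytes are never copied into an intermediate buffer; the
/// writer receives the "data: " prefix, the chunk, and the terminator
/// as three slices in one write call.
fn forward_sse_chunk<W: Write>(sink: &mut W, chunk: &[u8]) -> io::Result<usize> {
    let parts = [
        IoSlice::new(b"data: "),
        IoSlice::new(chunk),
        IoSlice::new(b"\n\n"),
    ];
    // A production loop would retry on partial writes; Vec<u8> and most
    // buffered sinks accept all three slices at once.
    sink.write_vectored(&parts)
}
```

Against a Vec<u8> sink this produces exactly the wire framing, with the chunk slice borrowed rather than copied into a JSON round-trip.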

Prior art

  • TensorZero – Rust. <1 ms P99 routing at 10K QPS. Similar architecture.
  • Bifrost – Rust. 11 µs median at 5K RPS. Zero-copy SSE forwarding.
  • Helicone – TypeScript, heavily optimized. 8 ms P50. Hits a wall at high concurrency (V8 GC pauses).
  • LiteLLM – Python. Adds 15-30 ms per request. Compatibility reference, not a performance target.

2. Crate workspace layout

[workspace]
members = [
    "crates/bardo-gateway",
    "crates/bardo-router",
    "crates/bardo-pipeline",
    "crates/bardo-cache",
    "crates/bardo-ml",
    "crates/bardo-providers",
    "crates/bardo-x402",
    "crates/bardo-safety",
    "crates/bardo-telemetry",
    "crates/bardo-wasm",
]

bardo-gateway – Axum HTTP server. SSE streaming with zero-copy chunk forwarding. Auth middleware (API key + x402 payment verification). Rate limiting via token bucket. Health endpoints. Only crate that binds a port. See: 09-api.md.
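The token-bucket limiter named above has a small core. A std-only sketch of the accounting, with illustrative names (the real limiter sits in Axum middleware, not in this shape):

```rust
use std::time::Instant;

/// Minimal token bucket: `capacity` bounds the burst, `refill_per_sec`
/// the sustained rate. Names and structure are illustrative only.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last: Instant::now() }
    }

    /// Refill based on elapsed time, then try to spend one token.
    fn try_acquire(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last).as_secs_f64();
        self.last = now;
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```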

bardo-router – Provider resolution, model-to-provider mapping, health monitoring with exponential backoff, failover chains. Stateless: routing state lives in Arc<DashMap>. See: 01a-routing.md.
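Failover resolution itself is a one-liner once health state is in hand. A std-only sketch under that assumption (the real router reads an Arc<DashMap> kept current by background probes with exponential backoff):

```rust
use std::collections::HashMap;

/// Pick the first healthy provider in a model's failover chain.
/// Simplified: health state here is a plain map snapshot.
fn resolve<'a>(chain: &'a [&'a str], healthy: &HashMap<&str, bool>) -> Option<&'a str> {
    chain
        .iter()
        .copied()
        .find(|p| healthy.get(p).copied().unwrap_or(false))
}
```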

bardo-pipeline – 8-layer context engineering pipeline. Parallel execution of independent layers via tokio::join!. Owns the InferenceRequest lifecycle from receipt to provider dispatch. See: 04-context-engineering.md.

bardo-cache – Three tiers. L1: prompt alignment (prefix matching for KV cache reuse). L2: semantic similarity via fastembed-rs + HNSW index. L3: deterministic hash via DashMap (lock-free concurrent hash map). See: 02-caching.md.
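The L3 lookup shape, sketched with std types only (the real tier hashes a canonicalized request with SHA-256 into a lock-free DashMap; DefaultHasher and HashMap stand in here to show the lookup, not the production hashing):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// L3 sketch: deterministic key -> cached completion.
struct HashCache {
    entries: HashMap<u64, String>,
}

impl HashCache {
    /// Derive a deterministic key from the request's identity fields.
    fn key(model: &str, prompt: &str) -> u64 {
        let mut h = DefaultHasher::new();
        model.hash(&mut h);
        prompt.hash(&mut h);
        h.finish()
    }

    fn get(&self, model: &str, prompt: &str) -> Option<&String> {
        self.entries.get(&Self::key(model, prompt))
    }

    fn put(&mut self, model: &str, prompt: &str, completion: String) {
        self.entries.insert(Self::key(model, prompt), completion);
    }
}
```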

bardo-ml – ONNX Runtime via ort. Two models: DeBERTa-v3 for injection detection (L8, quantized INT8, <8ms) and nomic-embed-text-v1.5 for semantic embeddings (L2, 768-dim, <5ms via fastembed-rs). Both load once at startup, infer on dedicated threads via rayon. See: 07-safety.md.

bardo-providers – Provider trait with five implementations: BlockRun, OpenRouter, Venice, Bankr, Direct Key. Each adapter handles auth, request translation, SSE parsing, error mapping. Streaming is the default path. See: 12-providers.md.
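As a rough illustration of the adapter pattern, a hypothetical simplified sync shape; the actual Provider trait (streaming, auth, error mapping) is specified in 12-providers.md:

```rust
/// Hypothetical, simplified adapter interface. The real trait is async
/// and streaming; this sketch only shows the translation seam.
trait Provider {
    fn name(&self) -> &'static str;
    /// Translate a gateway-internal request into the backend's wire format.
    fn translate(&self, model: &str, prompt: &str) -> String;
}

struct OpenRouter;

impl Provider for OpenRouter {
    fn name(&self) -> &'static str {
        "openrouter"
    }
    fn translate(&self, model: &str, prompt: &str) -> String {
        // Real adapters build typed serde structs; format! stands in here.
        format!(r#"{{"model":"{model}","messages":[{{"role":"user","content":"{prompt}"}}]}}"#)
    }
}
```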

bardo-x402 – On-chain verification via alloy. EIP-3009 payment proof validation. ERC-8004 identity reads for agent authentication. Zero-copy ECDSA signing (~0.1ms in native Rust vs. ~2ms through JS bindings). Stateless: each request carries its own proof. See: 00-overview.md.

bardo-safety – PII detection via compiled regex sets (SSN, credit card, email, phone). Injection classification delegates to bardo-ml for DeBERTa inference. Returns typed verdicts. See: 07-safety.md.

bardo-telemetry – tracing subscriber config. OpenTelemetry metrics (request count, latency histograms, cache hit rates, model inference times). Structured JSON for production, pretty output for development. See: 08-observability.md.

bardo-wasm – wasm32-wasi compilation target. Exposes pipeline subset (L2, L3, L8) for edge deployment. Uses tract-onnx (pure-Rust ONNX implementation) since native ort cannot compile to WASM. See: section 7 below.


3. Crate-to-domain file mapping

Each crate’s business logic, data types, and protocol details are specified in a domain file. This file covers Rust-specific concerns: workspace layout, dependencies, performance, and compilation targets.

| Crate | Primary domain file | Content |
|---|---|---|
| bardo-gateway | 09-api.md | Router, handlers, AppState, SSE streaming |
| bardo-router | 01a-routing.md | Classification, tier assignment, provider selection |
| bardo-pipeline | 04-context-engineering.md | 8-layer pipeline, prompt assembly, tool pruning |
| bardo-cache | 02-caching.md | Hash cache, semantic cache, encryption |
| bardo-ml | 07-safety.md | DeBERTa injection detection, embedding |
| bardo-providers | 12-providers.md | Provider trait, 5 implementations |
| bardo-x402 | 00-overview.md | Payment verification, pricing |
| bardo-safety | 07-safety.md | PII detection, injection classification |
| bardo-telemetry | 08-observability.md | OTEL metrics, audit logging, spend analytics |
| bardo-wasm | (this file, section 7) | Edge deployment, WASM compilation |

4. Dependency manifest

[workspace.dependencies]
# HTTP & async
axum = "0.8"
tokio = { version = "1.43", features = ["full"] }
hyper = "1.6"
tower = "0.5"
reqwest = { version = "0.12", features = ["json", "stream"] }

# Serialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

# ML inference
ort = "2.0"
fastembed = "5.0"
tokenizers = "0.21"

# Blockchain
alloy = { version = "1.0", features = ["full"] }

# Concurrency & data structures
dashmap = "6.1"
bytes = "1.9"
futures = "0.3"

# Observability
tracing = "0.1"
tracing-subscriber = "0.3"
opentelemetry = "0.28"

# Safety & parsing
regex = "1.11"
uuid = { version = "1.0", features = ["v4"] }

# Error handling
anyhow = "1.0"
thiserror = "2.0"

Version pins are minimums. ort = "2.0" requires ONNX Runtime 1.19+ shared libraries; the build script downloads them on first compile. alloy = "1.0" with full includes providers, signers, and contract bindings. fastembed = "5.0" bundles ort internally but we configure it to share the workspace instance to avoid duplicate runtime loads.


5. Gateway and pipeline

Router setup, AppState, and the SSE streaming handler are defined in 09-api.md.

The streaming path is the hot path. Bytes buffers from upstream providers are forwarded to the client without copying. The only allocation per chunk is the SSE event wrapper itself.

Pipeline implementation details are in 04-context-engineering.md. Four layers (L2, L3, L7, L8) are independent and run in parallel via tokio::join!, collapsing ~30 ms of sequential work into ~8 ms wall clock (bounded by DeBERTa inference).
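The fan-out property (wall clock equals the slowest layer, not the sum) can be sketched with scoped std threads. The real pipeline uses tokio::join!; the layer bodies below are trivial stand-ins, not the actual layer logic:

```rust
use std::thread;

/// Run the four independent layers concurrently and collect their results.
/// Stand-in bodies: a length for L2/L3 probes, booleans for L7/L8 verdicts.
fn run_independent_layers(input: &str) -> (usize, u64, bool, bool) {
    thread::scope(|s| {
        let l2 = s.spawn(|| input.len());                       // semantic cache probe
        let l3 = s.spawn(|| input.len() as u64);                // hash cache probe
        let l7 = s.spawn(|| input.contains('@'));               // PII scan
        let l8 = s.spawn(|| input.contains("ignore previous")); // injection scan
        (
            l2.join().unwrap(),
            l3.join().unwrap(),
            l7.join().unwrap(),
            l8.join().unwrap(),
        )
    })
}
```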

Pipeline timing breakdown

Common case (no compression):

L1 align         -- <1ms  --+
                             |
L2 semantic      -- ~5ms  --+
L3 hash          -- <0.1ms --+-- parallel: ~8ms wall clock
L7 PII regex     -- <1ms  --+
L8 DeBERTa       -- ~8ms  --+
                             |
L4 tool prune    -- <1ms  --+
L6 route         -- <1ms  --+

Total (no L5)    -- ~12ms

When L5 fires it adds 200-2000 ms. Acceptable because compression only triggers when the context window is nearly full, and the alternative (truncation) loses information.


6. Performance targets

| Metric | Target | Notes |
|---|---|---|
| Pipeline P95 (no compression) | <50 ms | L5 excluded; gate-to-gate |
| Routing overhead | <1 ms | Provider resolution + health check |
| DeBERTa inference | <8 ms | Quantized INT8, single input |
| Embedding (nomic-embed) | <5 ms | 768-dim via fastembed-rs |
| Hash cache lookup | <0.1 ms | DashMap, lock-free read |
| Semantic cache lookup | <5 ms | HNSW search, top-1 |
| Concurrent connections | 500+ | Tokio work-stealing |
| Memory (idle) | <80 MB | Models loaded, no active requests |
| Memory (100 concurrent) | <200 MB | Per-request allocation minimal |
| Cold start | <50 ms | Binary startup to first request |
| Binary size | ~50 MB | Static link, release build with LTO |

Targets assume x86-64 with 4+ cores. ARM (Graviton, Apple Silicon) is faster on ML inference due to better SIMD throughput on quantized models.

Concurrency under load

The decisive Rust advantage: latency stays flat under concurrent load. Tokio handles I/O across all cores; rayon handles CPU-bound ML inference with work-stealing that naturally batches concurrent requests.

| Concurrent requests | P95 target |
|---|---|
| 1 | 12 ms |
| 10 | 13 ms |
| 50 | 15 ms |
| 100 | 18 ms |
| 500 | 25 ms |

TypeScript degrades above ~50 concurrent ONNX calls because all ML inference contends for the same thread pool (default 4 threads). Request 50 waits ~187ms in queue time alone. In Rust, 50 concurrent embeddings become 6-7 batched ONNX calls via rayon work-stealing. Request 50 completes in the same time as request 1.
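The queue-time claim follows from simple arithmetic. An illustrative model under stated assumptions (uniform per-call latency, no batching):

```rust
/// Approximate queue wait for the n-th concurrent ONNX call when all
/// calls share a fixed thread pool (the TypeScript failure mode above).
/// Illustrative model only, not a measurement.
fn queue_wait_ms(n: u32, pool_threads: u32, per_call_ms: u32) -> u32 {
    // Request n sits behind floor((n - 1) / threads) full rounds of the pool.
    ((n - 1) / pool_threads) * per_call_ms
}
```

With 50 requests on a 4-thread pool at roughly 15 ms per call, request 50 waits 12 full rounds, about 180 ms, in line with the ~187 ms figure quoted above.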

Comparison to alternatives

| Gateway | Language | P50 | P99 | Concurrent |
|---|---|---|---|---|
| Bardo (target) | Rust | <10 ms | <50 ms | 500+ |
| TensorZero | Rust | <0.5 ms | <1 ms | 10K+ |
| Bifrost | Rust | 11 µs | ~50 µs | 5K+ |
| Helicone | TypeScript | 8 ms | ~40 ms | ~200 |
| LiteLLM | Python | 15 ms | ~80 ms | ~100 |

TensorZero and Bifrost are faster because they skip ML inference in the hot path. Bardo’s DeBERTa injection scan adds ~8 ms. Deliberate trade: safety costs single-digit milliseconds.


7. WASM compilation

Target: wasm32-wasi for edge deployment on Cloudflare Workers and Fastly Compute.

What runs at the edge

| Layer | Edge | Origin | Reason |
|---|---|---|---|
| L1 Prompt alignment | Yes | Yes | Pure string manipulation |
| L2 Semantic cache | Yes | Yes | tract-onnx for embedding |
| L3 Hash cache | Yes | Yes | SHA-256 + in-memory map |
| L4 Tool pruning | Yes | Yes | String matching |
| L5 History compression | No | Yes | Requires LLM call |
| L6 Provider routing | No | Yes | Needs provider health state |
| L7 PII detection | No | Yes | Regex works but model is better |
| L8 Injection detection | Yes* | Yes | Quantized model via tract-onnx |

*L8 at edge uses a smaller quantized variant. Full DeBERTa runs on origin.

Build configuration

# crates/bardo-wasm/Cargo.toml
[lib]
crate-type = ["cdylib"]

[dependencies]
bardo-cache = { path = "../bardo-cache", features = ["wasm"] }
bardo-safety = { path = "../bardo-safety", features = ["wasm"] }
bardo-pipeline = { path = "../bardo-pipeline", features = ["wasm"] }
tract-onnx = "0.21"

[target.'cfg(target_arch = "wasm32")'.dependencies]
getrandom = { version = "0.2", features = ["js"] }

Constraints

  • Module size: ~120 MB with quantized INT8 models embedded. Cloudflare Workers paid plan allows up to 200 MB.
  • No native ort: ONNX Runtime’s C++ core does not compile to WASM. tract-onnx is a pure-Rust reimplementation covering the ONNX operators DeBERTa and nomic-embed need.
  • No outbound HTTP during edge execution: Cache lookups and safety checks run locally. Cache misses proxy to origin.
  • Startup: WASM instantiation ~200 ms on Cloudflare (model deserialization dominates). Per-request overhead after instantiation matches native targets.

8. Build instructions

# Release build (native)
cargo build --release

# Run tests
cargo test

# WASM target (edge deployment)
cargo build --target wasm32-wasi -p bardo-wasm

# Check all targets without building
cargo check --workspace

The release build uses LTO and codegen-units = 1 for size optimization. WASM builds require the wasm32-wasi target installed via rustup target add wasm32-wasi.
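The flags described above correspond to a release profile along these lines; the exact profile location and any keys beyond the two named are assumptions, not part of this spec:

```toml
# Cargo.toml (workspace root), matching the build notes above.
[profile.release]
lto = true           # whole-program link-time optimization
codegen-units = 1    # trade compile time for binary size and speed
```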


9. Migration path

Phase 1: Rust gateway from day one

The Rust binary is the gateway. There is no TypeScript gateway to replace. Pi runtime stays TypeScript; the boundary is HTTP. The gateway calls Pi over localhost or Unix socket for agent state, memory, and tool execution.

Client --HTTP--> Rust Gateway --HTTP--> Pi Runtime (TypeScript)
                     |
                     +-- Provider (Anthropic, OpenAI, ...)
                     +-- Cache (L1, L2, L3)
                     +-- Safety (L7, L8)

Acceptance criteria:

  • All API tests pass against the Rust binary
  • P95 latency below 50 ms
  • Memory below 200 MB at 100 concurrent
  • Zero downtime deployment via blue-green

Phase 2: Edge deployment

WASM binary on Cloudflare Workers. Edge handles L1-L4 and L8. Cache misses proxy to origin.

Client --HTTP--> Edge (WASM)
                   |  cache hit --> respond directly
                   |  cache miss --> proxy to Origin (Rust Gateway)
                   +--------------> Rust Gateway --> Provider

Expected edge cache hit rate: 15-25% for repeated queries within a session, offloading up to a quarter of origin traffic.

Phase 3: Client-side WASM

WASM module in the browser or Claude Code’s Node process. Offline hash cache and tool pruning before the request leaves the client. No ML models (too large for client distribution).

Claude Code --> Client WASM (L1, L3, L4) --HTTP--> Gateway

Phase 3 is speculative. Depends on whether ~5-10 ms client-side savings justify integration complexity.


10. Key Rust ecosystem crates

fastembed-rs (embedding)

Production-grade embedding library (used by Qdrant) wrapping ort + HuggingFace tokenizers. The entire embed-search-compare path is a single native binary with zero FFI overhead. In TypeScript, this path crosses JS-to-C++ boundaries on every call. Supports MiniLM-L6-v2, BGE, SPLADE sparse embeddings, and reranking.

alloy (x402 / crypto)

Canonical Rust Ethereum library. Zero-copy ECDSA signing for x402 payment authorizations (~0.1ms native vs. ~2ms through ethers.js). Transaction batching via sol! macro. The payment layer and inference layer run in the same process, eliminating a service boundary.

candle (optional ML)

HuggingFace’s pure-Rust ML framework runs transformer models natively with CUDA, Metal, and CPU backends. No ONNX conversion needed. Enables loading newer embedding models immediately when released without waiting for ONNX export support.

HNSW implementations (instant-distance, qdrant-client embedded) run similarity search with zero serialization overhead. The embedding -> search -> result path stays in the same memory space. No external vector service needed for the default deployment.

PII masking in Rust

No Presidio equivalent in Rust. Built as compiled regex sets (SSN, credit card, email, phone, wallet addresses) + a small ONNX NER model via ort for name/location detection. The regex crate is compiled to native code with DFA optimization – 5-10x faster than JavaScript’s RegExp for pattern-heavy PII scanning.
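For flavor, a naive std-only check of the US SSN shape. It is far weaker than the compiled regex sets described above, but it shows the kind of verdict the scanner produces:

```rust
/// Naive shape check for DDD-DD-DDDD. The actual bardo-safety detectors
/// are DFA-backed regex sets covering SSNs, cards, emails, phones, and
/// wallet addresses; this sketch is illustrative only.
fn looks_like_ssn(s: &str) -> bool {
    let b = s.as_bytes();
    b.len() == 11
        && b.iter().enumerate().all(|(i, c)| {
            // Positions 3 and 6 must be dashes; every other byte a digit.
            if i == 3 || i == 6 { *c == b'-' } else { c.is_ascii_digit() }
        })
}
```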


11. Estimated performance after Rust rewrite

Pipeline latency comparison

| Profile | TypeScript/Bun p95 | Rust p95 | Improvement |
|---|---|---|---|
| minimal (T0 heartbeat) | <1 ms | <0.1 ms | Both negligible |
| fast (Golem internal) | ~15 ms | ~5 ms | 3x |
| standard (user-facing) | ~40 ms | ~12 ms | 3.3x |
| full (high-security) | ~65 ms | ~20 ms | 3.2x |

Resource usage

| Metric | TypeScript/Bun | Rust |
|---|---|---|
| Memory (idle) | ~200 MB (V8 heap + models) | ~80 MB (models only) |
| Memory (100 concurrent) | ~500 MB | ~120 MB |
| CPU at 50 RPS | ~40% (4 cores) | ~15% (8 cores, work-stealing) |
| Binary/deployment size | ~2 GB (runtime + deps + models) | ~50 MB (static binary + models) |
| Cold start | ~500 ms (Bun) | ~50 ms |

Development effort estimate

| Component | Effort |
|---|---|
| HTTP proxy + SSE streaming + routing | 1 week |
| Backend integrations (5 providers) | 1 week |
| Semantic cache (fastembed-rs + HNSW) | 3 days |
| Hash cache (in-memory DashMap) | 1 day |
| Prompt cache alignment + tool pruning | 4 days |
| PII masking (regex + ONNX NER) | 1 week |
| Injection detection (ort + DeBERTa) | 3 days |
| x402 payment integration (alloy) | 3 days |
| History compression (LLM call) | 1 day |
| Configuration + API surface | 3 days |
| Testing + benchmarking | 1 week |
| Total | ~6 weeks |

12. Cross-references

| Document | Relationship | What it covers |
|---|---|---|
| 00-overview.md | System architecture; gateway position in the stack | Top-level gateway spec: architecture, payment flows, five-provider routing, and deployment topology |
| 04-context-engineering.md | Defines the 8-layer pipeline this gateway implements | Prompt cache alignment, semantic cache, hash cache, tool pruning, history compression, KV-cache routing, PII masking, and injection detection |
| 12-providers.md | Provider routing logic, failover chains, health monitoring | Five provider backends with full Rust Provider trait implementations and self-describing resolution |
| 13-reasoning.md | Reasoning token parsing applied during SSE streaming | Unified reasoning chain integration: extended thinking, visible think tags, and provider-agnostic normalization |
| 03-economics.md | Cost budgets enforced at the routing layer | x402 spread revenue model, per-tenant cost attribution, and budget enforcement logic |
| 07-safety.md | Safety policy that L7 and L8 enforce | PII detection via compiled regex, prompt injection defense via DeBERTa classifier, and audit logging |