14 – Rust gateway implementation [SPEC]

Single-binary inference gateway with sub-millisecond routing

Cross-references: 00-overview.md (gateway architecture and design principles), 12-providers.md (five provider backends with Rust trait implementations), 13-reasoning.md (unified reasoning chain integration and streaming parser), 04-context-engineering.md (8-layer pipeline this gateway implements)


Reader orientation: This document specifies the Rust implementation of Bardo Inference (the LLM inference gateway for mortal autonomous DeFi agents called Golems). It belongs to the inference plane and describes the 10-crate Cargo workspace, Axum HTTP server, dependency catalog, WASM compilation targets, and build/deployment process. The key concept is that a single static ~50 MB Rust binary replaces a ~2 GB Node/Bun deployment, with <50ms cold start and sub-millisecond routing at thousands of concurrent connections. For term definitions, see prd2/shared/glossary.md.

1. Why Rust

TypeScript on Bun or Node degrades above ~50 concurrent users. The event loop stalls when ONNX inference (thread pool) contends with SSE streaming (I/O). At 200 concurrent Claude Code sessions the tail latency doubles; at 500 it becomes unusable.

Rust eliminates this. Tokio’s work-stealing scheduler handles thousands of concurrent connections without a shared event loop bottleneck. ONNX inference runs on dedicated threads via ort, completely decoupled from async I/O.

| Dimension | TypeScript (Bun/Node) | Rust (Tokio + Axum) |
|---|---|---|
| Deployable artifact | ~2 GB (runtime + node_modules) | ~50 MB static binary |
| Cold start | 500 ms+ | <50 ms |
| P99 at 500 concurrent | >200 ms pipeline overhead | <50 ms pipeline overhead |
| Memory at 100 concurrent | ~800 MB | <200 MB |
| ONNX thread contention | Shares libuv pool | Isolated thread pool |

If Bardo serves hundreds of concurrent Claude Code and Cursor users alongside autonomous Golems, the gateway must be Rust. Pi runtime stays TypeScript. The boundary is HTTP.

Key Rust ecosystem advantages

fastembed-rs: Production-grade embedding library (used by Qdrant) wrapping ort + HuggingFace tokenizers. The entire embed-search-compare path is a single native binary with zero FFI overhead. In TypeScript, this path crosses JS-to-C++ boundaries on every call.

alloy: Canonical Rust Ethereum library. Zero-copy ECDSA signing for x402 payment authorizations (~0.1ms native vs. ~2ms through ethers.js). Transaction batching via sol! macro for contract interaction.

In-process vector search: HNSW implementations (instant-distance, qdrant-client embedded) run similarity search with zero serialization overhead. The embedding-to-search-to-result path stays in the same memory space.

Zero-copy SSE streaming: Ownership model enables forwarding SSE bytes from backends to clients without copying into string buffers, parsing JSON, and re-serializing. At high throughput (128K output tokens), TypeScript creates thousands of GC-pressured string allocations. Rust forwards with zero allocations.
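The concatenation-free forwarding claim can be illustrated with std vectored I/O alone: the SSE prefix, the upstream chunk, and the terminator reach the sink as three slices, with no intermediate string built. This is a minimal sketch, not the gateway's actual streaming code (which forwards Bytes buffers through Axum):

```rust
use std::io::{self, IoSlice, Write};

/// Forward one upstream chunk as an SSE event using vectored I/O.
/// The chunk bytes are never copied into an intermediate buffer; the
/// writer receives the "data: " prefix, the chunk, and the terminator
/// as three slices in one write call.
fn forward_sse_chunk<W: Write>(sink: &mut W, chunk: &[u8]) -> io::Result<usize> {
    let parts = [
        IoSlice::new(b"data: "),
        IoSlice::new(chunk),
        IoSlice::new(b"\n\n"),
    ];
    // A production loop would retry on partial writes; Vec<u8> and most
    // buffered sinks accept all three slices at once.
    sink.write_vectored(&parts)
}
```

Against a Vec<u8> sink this produces exactly the wire framing, with the chunk slice borrowed rather than copied into a JSON round-trip.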

Prior art

  • TensorZero – Rust. <1 ms P99 routing at 10K QPS. Similar architecture.
  • Bifrost – Rust. 11 µs median at 5K RPS. Zero-copy SSE forwarding.
  • Helicone – TypeScript, heavily optimized. 8 ms P50. Hits a wall at high concurrency (V8 GC pauses).
  • LiteLLM – Python. Adds 15-30 ms per request. Compatibility reference, not a performance target.

2. Crate workspace layout

[workspace]
members = [
    "crates/bardo-gateway",
    "crates/bardo-router",
    "crates/bardo-pipeline",
    "crates/bardo-cache",
    "crates/bardo-ml",
    "crates/bardo-providers",
    "crates/bardo-x402",
    "crates/bardo-safety",
    "crates/bardo-telemetry",
    "crates/bardo-wasm",
]

bardo-gateway – Axum HTTP server. SSE streaming with zero-copy chunk forwarding. Auth middleware (API key + x402 payment verification). Rate limiting via token bucket. Health endpoints. Only crate that binds a port. See: 09-api.md.
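The token-bucket limiter named above has a small core. A std-only sketch of the accounting, with illustrative names (the real limiter sits in Axum middleware, not in this shape):

```rust
use std::time::Instant;

/// Minimal token bucket: `capacity` bounds the burst, `refill_per_sec`
/// the sustained rate. Names and structure are illustrative only.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last: Instant::now() }
    }

    /// Refill based on elapsed time, then try to spend one token.
    fn try_acquire(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last).as_secs_f64();
        self.last = now;
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```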

bardo-router – Provider resolution, model-to-provider mapping, health monitoring with exponential backoff, failover chains. Stateless: routing state lives in Arc<DashMap>. See: 01a-routing.md.
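Failover resolution itself is a one-liner once health state is in hand. A std-only sketch under that assumption (the real router reads an Arc<DashMap> kept current by background probes with exponential backoff):

```rust
use std::collections::HashMap;

/// Pick the first healthy provider in a model's failover chain.
/// Simplified: health state here is a plain map snapshot.
fn resolve<'a>(chain: &'a [&'a str], healthy: &HashMap<&str, bool>) -> Option<&'a str> {
    chain
        .iter()
        .copied()
        .find(|p| healthy.get(p).copied().unwrap_or(false))
}
```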

bardo-pipeline – 8-layer context engineering pipeline. Parallel execution of independent layers via tokio::join!. Owns the InferenceRequest lifecycle from receipt to provider dispatch. See: 04-context-engineering.md.

bardo-cache – Three tiers. L1: prompt alignment (prefix matching for KV cache reuse). L2: semantic similarity via fastembed-rs + HNSW index. L3: deterministic hash via DashMap (lock-free concurrent hash map). See: 02-caching.md.
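The L3 lookup shape, sketched with std types only (the real tier hashes a canonicalized request with SHA-256 into a lock-free DashMap; DefaultHasher and HashMap stand in here to show the lookup, not the production hashing):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// L3 sketch: deterministic key -> cached completion.
struct HashCache {
    entries: HashMap<u64, String>,
}

impl HashCache {
    /// Derive a deterministic key from the request's identity fields.
    fn key(model: &str, prompt: &str) -> u64 {
        let mut h = DefaultHasher::new();
        model.hash(&mut h);
        prompt.hash(&mut h);
        h.finish()
    }

    fn get(&self, model: &str, prompt: &str) -> Option<&String> {
        self.entries.get(&Self::key(model, prompt))
    }

    fn put(&mut self, model: &str, prompt: &str, completion: String) {
        self.entries.insert(Self::key(model, prompt), completion);
    }
}
```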

bardo-ml – ONNX Runtime via ort. Two models: DeBERTa-v3 for injection detection (L8, quantized INT8, <8ms) and nomic-embed-text-v1.5 for semantic embeddings (L2, 768-dim, <5ms via fastembed-rs). Both load once at startup, infer on dedicated threads via rayon. See: 07-safety.md.

bardo-providers – Provider trait with five implementations: BlockRun, OpenRouter, Venice, Bankr, Direct Key. Each adapter handles auth, request translation, SSE parsing, error mapping. Streaming is the default path. See: 12-providers.md.
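As a rough illustration of the adapter pattern, a hypothetical simplified sync shape; the actual Provider trait (streaming, auth, error mapping) is specified in 12-providers.md:

```rust
/// Hypothetical, simplified adapter interface. The real trait is async
/// and streaming; this sketch only shows the translation seam.
trait Provider {
    fn name(&self) -> &'static str;
    /// Translate a gateway-internal request into the backend's wire format.
    fn translate(&self, model: &str, prompt: &str) -> String;
}

struct OpenRouter;

impl Provider for OpenRouter {
    fn name(&self) -> &'static str {
        "openrouter"
    }
    fn translate(&self, model: &str, prompt: &str) -> String {
        // Real adapters build typed serde structs; format! stands in here.
        format!(r#"{{"model":"{model}","messages":[{{"role":"user","content":"{prompt}"}}]}}"#)
    }
}
```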

bardo-x402 – On-chain verification via alloy. EIP-3009 payment proof validation. ERC-8004 identity reads for agent authentication. Zero-copy ECDSA signing (~0.1ms in native Rust vs. ~2ms through JS bindings). Stateless: each request carries its own proof. See: 00-overview.md.

bardo-safety – PII detection via compiled regex sets (SSN, credit card, email, phone). Injection classification delegates to bardo-ml for DeBERTa inference. Returns typed verdicts. See: 07-safety.md.

bardo-telemetry – tracing subscriber config. OpenTelemetry metrics (request count, latency histograms, cache hit rates, model inference times). Structured JSON for production, pretty output for development. See: 08-observability.md.

bardo-wasm – wasm32-wasi compilation target. Exposes pipeline subset (L2, L3, L8) for edge deployment. Uses tract-onnx (pure-Rust ONNX implementation) since native ort cannot compile to WASM. See: section 7 below.


3. Crate-to-domain file mapping

Each crate’s business logic, data types, and protocol details are specified in a domain file. This file covers Rust-specific concerns: workspace layout, dependencies, performance, and compilation targets.

| Crate | Primary domain file | Content |
|---|---|---|
| bardo-gateway | 09-api.md | Router, handlers, AppState, SSE streaming |
| bardo-router | 01a-routing.md | Classification, tier assignment, provider selection |
| bardo-pipeline | 04-context-engineering.md | 8-layer pipeline, prompt assembly, tool pruning |
| bardo-cache | 02-caching.md | Hash cache, semantic cache, encryption |
| bardo-ml | 07-safety.md | DeBERTa injection detection, embedding |
| bardo-providers | 12-providers.md | Provider trait, 5 implementations |
| bardo-x402 | 00-overview.md | Payment verification, pricing |
| bardo-safety | 07-safety.md | PII detection, injection classification |
| bardo-telemetry | 08-observability.md | OTEL metrics, audit logging, spend analytics |
| bardo-wasm | (this file, section 7) | Edge deployment, WASM compilation |

4. Dependency manifest

[workspace.dependencies]
# HTTP & async
axum = "0.8"
tokio = { version = "1.43", features = ["full"] }
hyper = "1.6"
tower = "0.5"
reqwest = { version = "0.12", features = ["json", "stream"] }

# Serialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

# ML inference
ort = "2.0"
fastembed = "5.0"
tokenizers = "0.21"

# Blockchain
alloy = { version = "1.0", features = ["full"] }

# Concurrency & data structures
dashmap = "6.1"
bytes = "1.9"
futures = "0.3"

# Observability
tracing = "0.1"
tracing-subscriber = "0.3"
opentelemetry = "0.28"

# Safety & parsing
regex = "1.11"
uuid = { version = "1.0", features = ["v4"] }

# Error handling
anyhow = "1.0"
thiserror = "2.0"

Version pins are minimums. ort = "2.0" requires ONNX Runtime 1.19+ shared libraries; the build script downloads them on first compile. alloy = "1.0" with full includes providers, signers, and contract bindings. fastembed = "5.0" bundles ort internally but we configure it to share the workspace instance to avoid duplicate runtime loads.


5. Gateway and pipeline

Router setup, AppState, and the SSE streaming handler are defined in 09-api.md.

The streaming path is the hot path. Bytes buffers from upstream providers are forwarded to the client without copying. The only allocation per chunk is the SSE event wrapper itself.

Pipeline implementation details are in 04-context-engineering.md. Four layers (L2, L3, L7, L8) are independent and run in parallel via tokio::join!, collapsing ~30 ms of sequential work into ~8 ms wall clock (bounded by DeBERTa inference).
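The fan-out property (wall clock equals the slowest layer, not the sum) can be sketched with scoped std threads. The real pipeline uses tokio::join!; the layer bodies below are trivial stand-ins, not the actual layer logic:

```rust
use std::thread;

/// Run the four independent layers concurrently and collect their results.
/// Stand-in bodies: a length for L2/L3 probes, booleans for L7/L8 verdicts.
fn run_independent_layers(input: &str) -> (usize, u64, bool, bool) {
    thread::scope(|s| {
        let l2 = s.spawn(|| input.len());                       // semantic cache probe
        let l3 = s.spawn(|| input.len() as u64);                // hash cache probe
        let l7 = s.spawn(|| input.contains('@'));               // PII scan
        let l8 = s.spawn(|| input.contains("ignore previous")); // injection scan
        (
            l2.join().unwrap(),
            l3.join().unwrap(),
            l7.join().unwrap(),
            l8.join().unwrap(),
        )
    })
}
```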

Pipeline timing breakdown

Common case (no compression):

L1 align         -- <1ms  --+
                             |
L2 semantic      -- ~5ms  --+
L3 hash          -- <0.1ms --+-- parallel: ~8ms wall clock
L7 PII regex     -- <1ms  --+
L8 DeBERTa       -- ~8ms  --+
                             |
L4 tool prune    -- <1ms  --+
L6 route         -- <1ms  --+

Total (no L5)    -- ~12ms

When L5 fires it adds 200-2000 ms. Acceptable because compression only triggers when the context window is nearly full, and the alternative (truncation) loses information.


6. Performance targets

| Metric | Target | Notes |
|---|---|---|
| Pipeline P95 (no compression) | <50 ms | L5 excluded; gate-to-gate |
| Routing overhead | <1 ms | Provider resolution + health check |
| DeBERTa inference | <8 ms | Quantized INT8, single input |
| Embedding (nomic-embed) | <5 ms | 768-dim via fastembed-rs |
| Hash cache lookup | <0.1 ms | DashMap, lock-free read |
| Semantic cache lookup | <5 ms | HNSW search, top-1 |
| Concurrent connections | 500+ | Tokio work-stealing |
| Memory (idle) | <80 MB | Models loaded, no active requests |
| Memory (100 concurrent) | <200 MB | Per-request allocation minimal |
| Cold start | <50 ms | Binary startup to first request |
| Binary size | ~50 MB | Static link, release build with LTO |

Targets assume x86-64 with 4+ cores. ARM (Graviton, Apple Silicon) is faster on ML inference due to better SIMD throughput on quantized models.

Concurrency under load

The decisive Rust advantage: latency stays flat under concurrent load. Tokio handles I/O across all cores; rayon handles CPU-bound ML inference with work-stealing that naturally batches concurrent requests.

| Concurrent requests | P95 target |
|---|---|
| 1 | 12 ms |
| 10 | 13 ms |
| 50 | 15 ms |
| 100 | 18 ms |
| 500 | 25 ms |

TypeScript degrades above ~50 concurrent ONNX calls because all ML inference contends for the same thread pool (default 4 threads). Request 50 waits ~187ms in queue time alone. In Rust, 50 concurrent embeddings become 6-7 batched ONNX calls via rayon work-stealing. Request 50 completes in the same time as request 1.
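The queue-time claim follows from simple arithmetic. An illustrative model under stated assumptions (uniform per-call latency, no batching):

```rust
/// Approximate queue wait for the n-th concurrent ONNX call when all
/// calls share a fixed thread pool (the TypeScript failure mode above).
/// Illustrative model only, not a measurement.
fn queue_wait_ms(n: u32, pool_threads: u32, per_call_ms: u32) -> u32 {
    // Request n sits behind floor((n - 1) / threads) full rounds of the pool.
    ((n - 1) / pool_threads) * per_call_ms
}
```

With 50 requests on a 4-thread pool at roughly 15 ms per call, request 50 waits 12 full rounds, about 180 ms, in line with the ~187 ms figure quoted above.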

Comparison to alternatives

| Gateway | Language | P50 | P99 | Concurrent |
|---|---|---|---|---|
| Bardo (target) | Rust | <10 ms | <50 ms | 500+ |
| TensorZero | Rust | <0.5 ms | <1 ms | 10K+ |
| Bifrost | Rust | 11 µs | ~50 µs | 5K+ |
| Helicone | TypeScript | 8 ms | ~40 ms | ~200 |
| LiteLLM | Python | 15 ms | ~80 ms | ~100 |

TensorZero and Bifrost are faster because they skip ML inference in the hot path. Bardo’s DeBERTa injection scan adds ~8 ms. Deliberate trade: safety costs single-digit milliseconds.


7. WASM compilation

Target: wasm32-wasi for edge deployment on Cloudflare Workers and Fastly Compute.

What runs at the edge

| Layer | Edge | Origin | Reason |
|---|---|---|---|
| L1 Prompt alignment | Yes | Yes | Pure string manipulation |
| L2 Semantic cache | Yes | Yes | tract-onnx for embedding |
| L3 Hash cache | Yes | Yes | SHA-256 + in-memory map |
| L4 Tool pruning | Yes | Yes | String matching |
| L5 History compression | No | Yes | Requires LLM call |
| L6 Provider routing | No | Yes | Needs provider health state |
| L7 PII detection | No | Yes | Regex works but model is better |
| L8 Injection detection | Yes* | Yes | Quantized model via tract-onnx |

*L8 at edge uses a smaller quantized variant. Full DeBERTa runs on origin.

Build configuration

# crates/bardo-wasm/Cargo.toml
[lib]
crate-type = ["cdylib"]

[dependencies]
bardo-cache = { path = "../bardo-cache", features = ["wasm"] }
bardo-safety = { path = "../bardo-safety", features = ["wasm"] }
bardo-pipeline = { path = "../bardo-pipeline", features = ["wasm"] }
tract-onnx = "0.21"

[target.'cfg(target_arch = "wasm32")'.dependencies]
getrandom = { version = "0.2", features = ["js"] }

Constraints

  • Module size: ~120 MB with quantized INT8 models embedded. Cloudflare Workers paid plan allows up to 200 MB.
  • No native ort: ONNX Runtime’s C++ core does not compile to WASM. tract-onnx is a pure-Rust reimplementation covering the ONNX operators DeBERTa and nomic-embed need.
  • No outbound HTTP during edge execution: Cache lookups and safety checks run locally. Cache misses proxy to origin.
  • Startup: WASM instantiation ~200 ms on Cloudflare (model deserialization dominates). Per-request overhead after instantiation matches native targets.

8. Build instructions

# Release build (native)
cargo build --release

# Run tests
cargo test

# WASM target (edge deployment)
cargo build --target wasm32-wasi -p bardo-wasm

# Check all targets without building
cargo check --workspace

The release build uses LTO and codegen-units = 1 for size optimization. WASM builds require the wasm32-wasi target installed via rustup target add wasm32-wasi.
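The flags described above correspond to a release profile along these lines; the exact profile location and any keys beyond the two named are assumptions, not part of this spec:

```toml
# Cargo.toml (workspace root), matching the build notes above.
[profile.release]
lto = true           # whole-program link-time optimization
codegen-units = 1    # trade compile time for binary size and speed
```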


9. Migration path

Phase 1: Rust gateway from day one

The Rust binary is the gateway. There is no TypeScript gateway to replace. Pi runtime stays TypeScript; the boundary is HTTP. The gateway calls Pi over localhost or Unix socket for agent state, memory, and tool execution.

Client --HTTP--> Rust Gateway --HTTP--> Pi Runtime (TypeScript)
                     |
                     +-- Provider (Anthropic, OpenAI, ...)
                     +-- Cache (L1, L2, L3)
                     +-- Safety (L7, L8)

Acceptance criteria:

  • All API tests pass against the Rust binary
  • P95 latency below 50 ms
  • Memory below 200 MB at 100 concurrent
  • Zero downtime deployment via blue-green

Phase 2: Edge deployment

WASM binary on Cloudflare Workers. Edge handles L1-L4 and L8. Cache misses proxy to origin.

Client --HTTP--> Edge (WASM)
                   |  cache hit --> respond directly
                   |  cache miss --> proxy to Origin (Rust Gateway)
                   +--------------> Rust Gateway --> Provider

Expected edge cache hit rate: 15-25% for repeated queries within a session, offloading up to a quarter of origin traffic.

Phase 3: Client-side WASM

WASM module in the browser or Claude Code’s Node process. Offline hash cache and tool pruning before the request leaves the client. No ML models (too large for client distribution).

Claude Code --> Client WASM (L1, L3, L4) --HTTP--> Gateway

Phase 3 is speculative. Depends on whether ~5-10 ms client-side savings justify integration complexity.


10. Key Rust ecosystem crates

fastembed-rs (embedding)

Production-grade embedding library (used by Qdrant) wrapping ort + HuggingFace tokenizers. The entire embed-search-compare path is a single native binary with zero FFI overhead. In TypeScript, this path crosses JS-to-C++ boundaries on every call. Supports MiniLM-L6-v2, BGE, SPLADE sparse embeddings, and reranking.

alloy (x402 / crypto)

Canonical Rust Ethereum library. Zero-copy ECDSA signing for x402 payment authorizations (~0.1ms native vs. ~2ms through ethers.js). Transaction batching via sol! macro. The payment layer and inference layer run in the same process, eliminating a service boundary.

candle (optional ML)

HuggingFace’s pure-Rust ML framework runs transformer models natively with CUDA, Metal, and CPU backends. No ONNX conversion needed. Enables loading newer embedding models immediately when released without waiting for ONNX export support.

HNSW implementations (instant-distance, qdrant-client embedded) run similarity search with zero serialization overhead. The embedding -> search -> result path stays in the same memory space. No external vector service needed for the default deployment.

PII masking in Rust

No Presidio equivalent in Rust. Built as compiled regex sets (SSN, credit card, email, phone, wallet addresses) + a small ONNX NER model via ort for name/location detection. The regex crate is compiled to native code with DFA optimization – 5-10x faster than JavaScript’s RegExp for pattern-heavy PII scanning.
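For flavor, a naive std-only check of the US SSN shape. It is far weaker than the compiled regex sets described above, but it shows the kind of verdict the scanner produces:

```rust
/// Naive shape check for DDD-DD-DDDD. The actual bardo-safety detectors
/// are DFA-backed regex sets covering SSNs, cards, emails, phones, and
/// wallet addresses; this sketch is illustrative only.
fn looks_like_ssn(s: &str) -> bool {
    let b = s.as_bytes();
    b.len() == 11
        && b.iter().enumerate().all(|(i, c)| {
            // Positions 3 and 6 must be dashes; every other byte a digit.
            if i == 3 || i == 6 { *c == b'-' } else { c.is_ascii_digit() }
        })
}
```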


11. Estimated performance after Rust rewrite

Pipeline latency comparison

| Profile | TypeScript/Bun p95 | Rust p95 | Improvement |
|---|---|---|---|
| minimal (T0 heartbeat) | <1 ms | <0.1 ms | Both negligible |
| fast (Golem internal) | ~15 ms | ~5 ms | 3x |
| standard (user-facing) | ~40 ms | ~12 ms | 3.3x |
| full (high-security) | ~65 ms | ~20 ms | 3.2x |

Resource usage

| Metric | TypeScript/Bun | Rust |
|---|---|---|
| Memory (idle) | ~200 MB (V8 heap + models) | ~80 MB (models only) |
| Memory (100 concurrent) | ~500 MB | ~120 MB |
| CPU at 50 RPS | ~40% (4 cores) | ~15% (8 cores, work-stealing) |
| Binary/deployment size | ~2 GB (runtime + deps + models) | ~50 MB (static binary + models) |
| Cold start | ~500 ms (Bun) | ~50 ms |

Development effort estimate

| Component | Effort |
|---|---|
| HTTP proxy + SSE streaming + routing | 1 week |
| Backend integrations (5 providers) | 1 week |
| Semantic cache (fastembed-rs + HNSW) | 3 days |
| Hash cache (in-memory DashMap) | 1 day |
| Prompt cache alignment + tool pruning | 4 days |
| PII masking (regex + ONNX NER) | 1 week |
| Injection detection (ort + DeBERTa) | 3 days |
| x402 payment integration (alloy) | 3 days |
| History compression (LLM call) | 1 day |
| Configuration + API surface | 3 days |
| Testing + benchmarking | 1 week |
| Total | ~6 weeks |

12. Cross-references

| Document | Relationship | What it covers |
|---|---|---|
| 00-overview.md | System architecture; gateway position in the stack | Top-level gateway spec: architecture, payment flows, five-provider routing, and deployment topology |
| 04-context-engineering.md | Defines the 8-layer pipeline this gateway implements | Prompt cache alignment, semantic cache, hash cache, tool pruning, history compression, KV-cache routing, PII masking, and injection detection |
| 12-providers.md | Provider routing logic, failover chains, health monitoring | Five provider backends with full Rust Provider trait implementations and self-describing resolution |
| 13-reasoning.md | Reasoning token parsing applied during SSE streaming | Unified reasoning chain integration: extended thinking, visible think tags, and provider-agnostic normalization |
| 03-economics.md | Cost budgets enforced at the routing layer | x402 spread revenue model, per-tenant cost attribution, and budget enforcement logic |
| 07-safety.md | Safety policy that L7 and L8 enforce | PII detection via compiled regex, prompt injection defense via DeBERTa classifier, and audit logging |