Styx Infrastructure: Deployment, Scaling, and Operations [SPEC]
PRD2 Section: 20-styx | Source: Styx Research S3 v4.0
Status: Implementation Specification
Dependencies:
prd2/20-styx/00-architecture.md (what Styx does), prd2/20-styx/01-api.md (API surface)
Reader orientation: This document specifies the hosting architecture, deployment topology, scaling path, monitoring, cost projections, and security model for Styx (global knowledge relay and persistence layer at wss://styx.bardo.run; three tiers: Vault/Clade/Lethe). It belongs to the Styx infrastructure layer of Bardo. The key concept is that Styx is a single stateless Rust/Axum gateway deployed multi-region on Fly.io, backed by managed data services. Familiarity with the Styx architecture (00-architecture.md) and API surface (01-api.md) is assumed. See prd2/shared/glossary.md for full term definitions.
What This Document Covers
Styx is a single public service that must achieve near-perfect uptime, sub-100ms query latency globally, and scale from 10 to 10,000+ Golems without architectural changes. This document specifies:
- The hosting architecture and why each component was chosen
- The deployment topology (multi-region active-active)
- The Styx binary structure
- The scaling path from launch to global scale
- Monitoring and alerting
- Cost projections at each scale tier
- The security and trust model (honest SaaS)
1. Architecture: Fly.io + Managed Data Services
Design Principles
A single public service serving the entire ecosystem requires:
- Multi-region redundancy: A single server is a single point of failure. The service must survive individual datacenter outages without interruption.
- Automatic failover: No human intervention needed to recover from a node failure. Health checks must detect failure and route traffic to healthy nodes within seconds.
- Elastic scaling: Market crashes cause correlated Golem death events (many Golems die simultaneously, uploading bloodstains and pheromone deposits concurrently). The service must absorb traffic spikes without manual intervention.
- Predictable latency: Golems run globally. A Golem in Tokyo querying Styx should get comparable latency to one in New York.
- Operational simplicity: This is run by a single operator. The infrastructure must be manageable by one person with standard DevOps skills, not a dedicated SRE team.
Why Fly.io for the Gateway
Fly.io runs applications on bare-metal servers distributed across 35+ regions worldwide. A Rust/Axum service compiles to a single static binary, packaged as a minimal Docker image, and deploys to Fly.io as a lightweight VM (Firecracker microVM). Key properties:
- Multi-region active-active: Deploy the same binary to 2+ regions. Fly’s Anycast routing sends each request to the nearest healthy instance. If one region goes down, traffic automatically routes to the next nearest.
- Health checks with auto-restart: Fly monitors each instance with configurable health checks (TCP, HTTP, or custom). Failed instances are restarted automatically. If an instance can’t recover, a new one is spun up.
- Zero-downtime deploys: Blue-green deployments are built-in. The new version starts, passes health checks, then traffic shifts. No requests are dropped.
- Predictable pricing: Per-VM pricing based on CPU/RAM allocation, not per-request. A 2-vCPU / 4GB instance is ~$30/month. Two regions = ~$60/month for the gateway.
- Volumes for local state: Fly Volumes provide persistent NVMe storage attached to instances, used for local caches and temporary state.
Why Managed Data Services (Not Self-Hosted)
Self-hosting databases on a single bare-metal server is a single point of failure. For a public service requiring near-perfect uptime, managed services provide automatic replication, failover, and backups without operational burden:
| Component | Service | Why | Scaling | Durability |
|---|---|---|---|---|
| Vector Search | Qdrant Cloud | Managed Qdrant cluster with automatic replication, HNSW indexing, filtering. Rust-native client library. Sub-10ms queries on warm data. | Auto-scales with data volume | Replicated across nodes |
| Relational DB | Neon Postgres | Serverless Postgres with autoscaling compute, branching for dev/test, connection pooling. Separates storage from compute – scales independently. Multi-region read replicas for low-latency reads. | Auto-scales compute; storage scales independently | Multi-AZ replication |
| Cache + Pub/Sub | Upstash Redis | Serverless Redis with per-request pricing, global replication, REST API (works from Fly.io edge). Used for: pheromone field cache, semantic result cache, rate limiting, WebSocket pub/sub. | Auto-scales | Redis replication |
| Blob Storage | Cloudflare R2 | S3-compatible, zero egress fees. Stores: Grimoire backups, death bundles, marketplace encrypted content. Lifecycle rules handle TTL expiry. | Unlimited | 11 nines (S3-class) |
| Edge / DDoS | Cloudflare | Free plan: DDoS protection, TLS termination, caching. Fly.io instances are origin servers behind Cloudflare. | N/A | N/A |
Why Not Bare Metal?
A single bare-metal server (e.g., Hetzner AX42, ~$60/month) is cheaper but introduces:
- Single point of failure: One server = one failure domain. Disk failure, network outage, or datacenter maintenance takes the service offline.
- Manual failover: Requires human intervention or scripting to fail over to a backup.
- No global distribution: Users in Asia get 200ms+ latency to a European server.
Bare metal is appropriate for development and testing. For the production public service, the Fly.io + managed services architecture costs more (roughly $120-250/month at early scale vs. ~$60) but provides the reliability guarantees a public ecosystem service requires.
2. Deployment Topology
```
            +--------------------+
            |     Cloudflare     |
            |    (DDoS + TLS     |
            |    + edge cache)   |
            +---------+----------+
                      | Anycast
        +-------------+--------------+
        |                            |
  +-----+------+              +------+-----+
  |  Fly.io    |              |  Fly.io    |
  | Region: IAD|              | Region: AMS|
  | (Virginia) |              | (Amsterdam)|
  |            |              |            |
  |  styx-gw   |              |  styx-gw   |
  |  (Axum)    |              |  (Axum)    |
  +------+-----+              +------+-----+
         |                           |
         +------------+------------+--------------+
         |            |            |              |
   +-----+----+  +----+-----+  +---+----+   +----+-----+
   | Qdrant   |  | Neon     |  |Upstash |   |  CF R2   |
   | Cloud    |  | Postgres |  | Redis  |   | (blobs)  |
   | (vectors)|  | (meta)   |  |(cache) |   |          |
   +----------+  +----------+  +--------+   +----------+
```
Both Fly.io instances run the identical styx-gw Rust binary. They connect to the same shared data services. Fly’s Anycast routing ensures each request hits the nearest healthy instance.
Statelessness
The gateway instances are stateless – all persistent state lives in the managed data services. This means:
- Any instance can handle any request
- Instances can be added/removed without data migration
- A crashed instance is replaced by a fresh one with zero state recovery needed
- Rolling deploys update one instance at a time with no downtime
The only “state” on a gateway instance is the in-process WebSocket connections. When an instance restarts, clients reconnect (with exponential backoff) and replay from their last-seen sequence number via the Event Fabric’s replay mechanism.
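The reconnect path described above can be sketched as capped exponential backoff. This is an illustrative sketch only: the 500ms base and 30s cap are assumptions, not values the Styx client mandates.

```rust
use std::time::Duration;

/// Delay before reconnect attempt `n` (0-based): base * 2^n, capped.
/// The base and cap are illustrative defaults, not Styx-mandated values.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    base.checked_mul(2u32.saturating_pow(attempt))
        .unwrap_or(cap)
        .min(cap)
}

fn main() {
    let (base, cap) = (Duration::from_millis(500), Duration::from_secs(30));
    for attempt in 0..8 {
        // After each failed reconnect, wait this long before retrying.
        println!("attempt {attempt}: wait {:?}", backoff_delay(attempt, base, cap));
    }
    // On success, the client sends its last-seen sequence number so the
    // Event Fabric can replay anything missed during the restart.
}
```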
3. The Styx Binary
A single Rust binary built on Axum:
```rust
// styx-gw/src/main.rs
#[tokio::main]
async fn main() -> anyhow::Result<()> {
    tracing_subscriber::init();
    let config = StyxConfig::from_env()?;

    // Connect to managed data services
    let qdrant = QdrantClient::new(&config.qdrant_url).await?;
    let db = sqlx::PgPool::connect(&config.database_url).await?;
    let redis = redis::Client::open(config.redis_url.clone())?;
    let r2 = s3::Bucket::new("bardo-styx", config.r2_region(), config.r2_credentials())?;

    // Shared application state
    let state = AppState {
        qdrant: Arc::new(qdrant),
        db: db.clone(),
        redis: Arc::new(redis),
        r2: Arc::new(r2),
        event_bus: Arc::new(EventBus::new()),
        config: Arc::new(config),
    };

    // Spawn background tasks
    tokio::spawn(pheromone_evaporation_task(state.clone()));
    tokio::spawn(ttl_expiry_task(state.clone()));
    tokio::spawn(pulse_aggregation_task(state.clone()));
    tokio::spawn(delayed_lethe_publication_task(state.clone()));

    // Build Axum router
    let app = Router::new()
        // Knowledge CRUD
        .route("/v1/styx/entries", post(entries::create).get(entries::query))
        .route("/v1/styx/snapshot/:golem_id", get(entries::snapshot))
        // Pheromone Field
        .route("/v1/styx/pheromone/deposit", post(pheromone::deposit))
        .route("/v1/styx/pheromone/sense", get(pheromone::sense))
        // Bloodstain Network
        .route("/v1/styx/bloodstain", post(bloodstain::upload))
        // Causal Federation
        .route("/v1/styx/causal/publish", post(causal::publish))
        .route("/v1/styx/causal/discover", get(causal::discover))
        // Engagement
        .route("/v1/styx/lineage/:user_id", get(engagement::lineage))
        .route("/v1/styx/graveyard/:user_id", get(engagement::graveyard))
        .route("/v1/styx/achievements/:golem_id", get(engagement::achievements))
        // Ecosystem
        .route("/v1/styx/pulse", get(ecosystem::pulse))
        .route("/v1/styx/health", get(ecosystem::health))
        // WebSocket
        .route("/v1/styx/ws", get(ws::handler))
        // Middleware
        .layer(middleware::from_fn_with_state(state.clone(), auth::authenticate))
        .layer(middleware::from_fn(ratelimit::limit))
        .layer(tower_http::trace::TraceLayer::new_for_http())
        .with_state(state);

    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await?;
    tracing::info!("Styx gateway listening on :8080");
    axum::serve(listener, app).await?;
    Ok(())
}
```
Background Tasks
Four tokio tasks run continuously:
- Pheromone evaporation (every 60s): Apply exponential decay to all pheromones. Remove those below 0.05 intensity. Update Redis cache.
- TTL expiry (every hour): Delete L0 entries past their TTL. Remove expired marketplace listings. Purge R2 blobs.
- Pulse aggregation (every 60s): Compute ecosystem-wide statistics. Cache in Redis. Serve from /v1/styx/pulse.
- Delayed Lethe publication (every 5 min): Process the queue of entries awaiting their randomized publication delay (1-6h). Anonymize and publish those past their delay.
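The evaporation step can be sketched as a per-tick exponential decay followed by the 0.05-intensity prune. The half-life parameter is an assumption for illustration; the spec fixes only the 60s tick interval and the floor.

```rust
/// One evaporation tick: multiply every intensity by 0.5^(dt / half_life),
/// then drop pheromones below the 0.05 floor named in the task description.
fn evaporate(intensities: &mut Vec<f64>, half_life_secs: f64, dt_secs: f64) {
    let factor = 0.5f64.powf(dt_secs / half_life_secs);
    for i in intensities.iter_mut() {
        *i *= factor;
    }
    intensities.retain(|&i| i >= 0.05);
}

fn main() {
    let mut field = vec![1.0, 0.4, 0.06];
    // With an (assumed) 60s half-life and the spec's 60s tick, each tick
    // halves every intensity; 0.06 decays to 0.03, below the floor, and is pruned.
    evaporate(&mut field, 60.0, 60.0);
    assert_eq!(field.len(), 2);
    println!("after one tick: {field:?}");
}
```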
4. Scaling Path
| Scale | Active Golems | Architecture | Monthly Infra Cost |
|---|---|---|---|
| Dev | 1-10 | Single Fly.io instance (1 vCPU / 1GB). Neon free tier. Upstash free tier. Qdrant free tier (1GB). | ~$15 |
| Launch | 10-50 | 2x Fly.io instances (2 regions). Neon starter. Qdrant starter (4GB). | ~$120 |
| Traction | 50-200 | Same 2-region setup. Neon scales compute automatically. Qdrant grows to ~16GB. | ~$250 |
| Growth | 200-1,000 | 3x Fly.io instances (3 regions). Qdrant cluster (2 nodes). Neon with read replicas. | ~$500 |
| Scale | 1,000-5,000 | 4+ Fly.io instances. Qdrant cluster (3+ nodes). Neon production tier. | ~$1,200 |
| Global | 5,000+ | 6+ regions. Qdrant enterprise. Neon enterprise. Dedicated Upstash cluster. | ~$3,000+ |
At every scale tier, the Golem-facing API does not change. The URL is the same. The contract is the same. Scaling is an infrastructure concern, not an API concern.
Egress Analysis
Pheromone batches dominate egress (~88% of bandwidth). At 10,000 Golems with naive fan-out, egress would be ~10TB/month. Three mitigations:
- Topic-based pub/sub: Golems subscribe to domains they care about. At 1,000 Golems across 50 domains, that’s ~20 Golems per domain instead of broadcasting to all 1,000. A 50x fan-out reduction.
- Message batching: One batch per domain per tick instead of individual messages. ~13x message count reduction.
- Zstd compression: 3-5x bandwidth reduction on pheromone payloads.
With all three: 10K-Golem egress drops to ~2-3TB/month. At Fly.io’s $0.02/GB rate, that’s ~$50-60/month in bandwidth costs instead of ~$200/month with naive fan-out.
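The dollar figures follow directly from the bandwidth estimates; a quick recheck using the section's own numbers (Fly.io's $0.02/GB rate, ~10TB naive vs. ~2.5TB mitigated):

```rust
/// Monthly egress cost at a flat per-GB rate (1 TB = 1,000 GB here,
/// matching the section's round numbers).
fn monthly_egress_cost_usd(tb_per_month: f64, usd_per_gb: f64) -> f64 {
    tb_per_month * 1000.0 * usd_per_gb
}

fn main() {
    let naive = monthly_egress_cost_usd(10.0, 0.02);    // naive fan-out at 10K Golems
    let mitigated = monthly_egress_cost_usd(2.5, 0.02); // pub/sub + batching + zstd
    assert!((naive - 200.0).abs() < 1e-6);
    assert!((mitigated - 50.0).abs() < 1e-6);
    println!("naive: ~${naive:.0}/mo, mitigated: ~${mitigated:.0}/mo");
}
```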
Infrastructure Cost Breakdown by Component
| Component | 100 Golems | 1,000 Golems | 10,000 Golems |
|---|---|---|---|
| Compute (Fly.io, 2 regions) | $60/mo | $120/mo | $400/mo |
| Vector search (Qdrant Cloud) | Free tier | $50/mo | $200/mo |
| Database (Neon Postgres) | Free tier | $50/mo | $150/mo |
| Cache (Upstash Redis) | Free tier | $20/mo | $80/mo |
| Blob storage (Cloudflare R2) | ~$0 | $5/mo | $50/mo |
| Edge/DDoS (Cloudflare) | Free | Free | Free |
| NATS (federation, if enabled) | N/A | N/A | $20/mo |
| Total | ~$60/mo | ~$250/mo | ~$900/mo |
x402 Revenue Model
See shared/x402-protocol.md for the x402 payment protocol specification.
| Service | Price | Monthly revenue at 1,000 Golems |
|---|---|---|
| Entry write (L0/L1/L2) | $0.001/write | ~$1,500 (50K writes/day) |
| Knowledge query | $0.002/query | ~$6,000 (100K queries/day) |
| Pheromone deposit | $0.0005/deposit | ~$15 |
| Pheromone read | $0.0005/read | ~$3,000 (5.76M reads/month) |
| Bloodstain upload | $0.005/upload | ~$5 |
| Snapshot retrieval | $0.01/snapshot | ~$100 |
| Marketplace listing | $0.01/listing | ~$50 |
| Marketplace purchase commission | 5% protocol fee | Variable |
| Event relay | $0.001/1K events | ~$175 |
| Estimated total | | ~$8,000+/mo |
At 1,000 Golems, x402 revenue covers infrastructure costs ($250/mo) by a wide margin. The model is self-sustaining well before ecosystem scale.
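Each revenue row is price × daily volume × 30 days; rechecking the two dominant rows from the table:

```rust
/// Monthly revenue for one priced operation: per-op price times daily
/// volume times 30 days, matching the x402 table's estimates.
fn monthly_revenue_usd(price_per_op: f64, ops_per_day: f64) -> f64 {
    price_per_op * ops_per_day * 30.0
}

fn main() {
    let writes = monthly_revenue_usd(0.001, 50_000.0);   // entry writes at 50K/day
    let queries = monthly_revenue_usd(0.002, 100_000.0); // knowledge queries at 100K/day
    assert!((writes - 1_500.0).abs() < 1e-6);
    assert!((queries - 6_000.0).abs() < 1e-6);
    println!("writes: ~${writes:.0}/mo, queries: ~${queries:.0}/mo");
}
```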
Revenue by Scale Tier
| Scale | Active Golems | Writes/day | Queries/day | Monthly revenue | Monthly infra | Net |
|---|---|---|---|---|---|---|
| Launch | 10 | 500 | 1,000 | ~$75 | ~$120 | -$45 |
| Traction | 50 | 2,500 | 5,000 | ~$375 | ~$150 | +$225 |
| Growth | 200 | 10,000 | 20,000 | ~$1,500 | ~$250 | +$1,250 |
| Marketplace | 500 | 25,000 | 50,000 | ~$4,000 + marketplace GMV | ~$500 | +$3,500+ |
| Scale | 1,000 | 50,000 | 100,000 | ~$8,000 + marketplace GMV | ~$1,200 | +$6,800+ |
The service is unprofitable at launch with 10 Golems. This is expected – the infrastructure baseline exists whether there are 10 Golems or 200. At 50 Golems, revenue covers costs. At 200+, the margin is comfortable.
The 5% marketplace protocol fee becomes a significant revenue stream at scale. At 500 active Golems with ~$10K/month marketplace GMV, the fee adds ~$500/month. Death archives (the marketplace’s most distinctive product) drive volume: every Golem that dies can produce a listing, and every new Golem is a potential buyer.
What Styx Charges For, What’s Free
Free: Health checks, ecosystem pulse, lineage queries, graveyard queries, achievement queries, clade peer discovery, causal edge publication (for Verified+ agents), listing search.
Paid: Everything that writes data (entries, pheromone deposits, bloodstains, marketplace listings), everything that reads knowledge (queries, snapshots, pheromone reads), and event relay.
The free tier is designed for discoverability. You can browse the marketplace, check the ecosystem pulse, view lineage trees, and discover clade peers without paying anything. The moment you want to write knowledge, query knowledge, or relay events, x402 kicks in.
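A minimal sketch of this free/paid gate, using prices from the x402 table above. The enum, the function name, and the idea of a single pricing lookup are illustrative, not the actual Styx middleware API.

```rust
/// A few representative Styx operations (not the full route set).
enum Op {
    HealthCheck,
    Pulse,
    LineageQuery,
    EntryWrite,
    KnowledgeQuery,
    PheromoneDeposit,
}

/// x402 charge in USD, or None for free-tier operations.
fn price_usd(op: &Op) -> Option<f64> {
    match op {
        // Free: discoverability endpoints
        Op::HealthCheck | Op::Pulse | Op::LineageQuery => None,
        // Paid: writes and knowledge reads, per the x402 table
        Op::EntryWrite => Some(0.001),
        Op::KnowledgeQuery => Some(0.002),
        Op::PheromoneDeposit => Some(0.0005),
    }
}

fn main() {
    assert_eq!(price_usd(&Op::Pulse), None);
    assert_eq!(price_usd(&Op::KnowledgeQuery), Some(0.002));
}
```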
Self-Hosted Economics
Self-hosted Styx has zero x402 charges (you don’t pay yourself). Infrastructure costs only:
| Deployment | Monthly cost |
|---|---|
| Fly.io (shared-1x) | $5-20 |
| Hetzner VPS (CX22) | $4 |
| Raspberry Pi (electricity) | $1-3 |
| Local Docker | $0 |
If your clade has 5 Golems and you want clade sync + pheromone field + L0 backup, self-hosting saves $15-75/month compared to the managed Styx. The tradeoff: you manage uptime, updates, and scaling. No marketplace access unless you federate with the ecosystem Styx.
5. Monitoring and Alerting
| What | How | Alert Threshold |
|---|---|---|
| Service health | Fly.io health checks (HTTP GET /v1/styx/health, 10s interval) | 3 consecutive failures -> auto-restart instance |
| Query latency | OpenTelemetry traces -> Axiom/Datadog | p99 > 500ms -> alert |
| Error rate | Structured logs (tracing crate) -> log aggregator | >1% 5xx responses -> alert |
| Qdrant health | Qdrant Cloud dashboard + API health endpoint | Cluster degraded -> alert |
| Neon health | Neon dashboard + connection pool monitoring | Connection pool exhausted -> alert |
| Pheromone field size | Custom metric (total active pheromones) | >100K active pheromones -> investigate (potential spam) |
| Storage growth | R2 usage metrics + Neon storage metrics | >80% of provisioned storage -> scale up |
| WebSocket connections | Custom counter in Axum state | >10K concurrent connections -> add instance |
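The "custom counter in Axum state" for WebSocket connections can be sketched as an atomic gauge; the names and the threshold helper are illustrative, with the 10K limit taken from the table above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Atomic gauge: incremented on WebSocket upgrade, decremented on close.
/// The real gateway would export this via its metrics endpoint.
struct WsGauge(AtomicUsize);

impl WsGauge {
    fn connect(&self) -> usize { self.0.fetch_add(1, Ordering::Relaxed) + 1 }
    fn disconnect(&self) -> usize { self.0.fetch_sub(1, Ordering::Relaxed) - 1 }
    fn current(&self) -> usize { self.0.load(Ordering::Relaxed) }
    /// Alert condition from the monitoring table: >10K concurrent connections.
    fn over_threshold(&self) -> bool { self.current() > 10_000 }
}

fn main() {
    let gauge = Arc::new(WsGauge(AtomicUsize::new(0)));
    gauge.connect();
    gauge.connect();
    gauge.disconnect();
    assert_eq!(gauge.current(), 1);
    assert!(!gauge.over_threshold());
}
```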
6. Security and Trust Model
The Honest Model
Styx operates a standard SaaS trust model – identical to how every Qdrant Cloud customer trusts Qdrant, every Neon customer trusts Neon, every AWS customer trusts AWS:
| Threat | Protection |
|---|---|
| External attackers (network) | TLS 1.3 everywhere (Cloudflare -> Fly.io -> data services) |
| External attackers (storage) | At-rest encryption on all managed services (Qdrant, Neon, R2) |
| DDoS | Cloudflare DDoS protection (free plan) |
| Cross-user data leakage | Namespace isolation: each user’s L0/L1 data lives in a separate Qdrant namespace (vault:{user_id}, clade:{user_id}). Access control enforced at the Axum middleware layer. |
| Credential compromise | ERC-8004 identity + x402 micropayments = wallet-based auth (no passwords to steal) |
| Data retention | TTL-based expiry enforced by background task + R2 lifecycle rules |
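The namespace-isolation row can be sketched as follows. The helper names are illustrative; the point is that a request's namespace is derived only from its authenticated identity, so cross-user reads fail by construction.

```rust
/// Per-user Qdrant namespaces, following the vault:{user_id} / clade:{user_id}
/// convention from the table above.
fn vault_namespace(user_id: &str) -> String {
    format!("vault:{user_id}")
}

fn clade_namespace(user_id: &str) -> String {
    format!("clade:{user_id}")
}

/// Middleware-layer check: a request may only touch namespaces derived from
/// its own authenticated user id.
fn may_access(authenticated_user: &str, namespace: &str) -> bool {
    namespace == vault_namespace(authenticated_user)
        || namespace == clade_namespace(authenticated_user)
}

fn main() {
    assert!(may_access("alice", "vault:alice"));
    assert!(may_access("alice", "clade:alice"));
    assert!(!may_access("alice", "vault:bob"));
}
```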
What Is NOT Protected Against
The service operator can technically read L0/L1 data (necessary for server-side vector search and retrieval). This is the same trust model as every cloud service. Protection comes from:
- Business reputation and legal agreements (ToS)
- Economic incentives (revenue depends on user trust)
- Audit logging on all data access
- The fact that L0/L1 data is DeFi trading knowledge, not nuclear secrets – the threat model is proportionate
L2 Lethe: Public After Anonymization
Lethe data is stored in plaintext. The anonymization pipeline (see prd2/20-styx/01-api.md section 4) is the privacy layer. No encryption on top of anonymized public data – the audit established this is the honest approach.
7. Fly.io Configuration
```toml
# fly.toml
app = "bardo-styx"
primary_region = "iad"  # Virginia

[build]
dockerfile = "Dockerfile"

[env]
RUST_LOG = "info,styx_gw=debug"

[http_service]
internal_port = 8080
force_https = true
auto_stop_machines = false  # Always running -- this is a public service
auto_start_machines = true
min_machines_running = 2    # Always at least 2 for HA
processes = ["app"]

[http_service.concurrency]
type = "requests"
hard_limit = 1000
soft_limit = 800

[[http_service.checks]]
grace_period = "10s"
interval = "15s"
method = "GET"
path = "/v1/styx/health"
timeout = "5s"

[[vm]]
size = "shared-cpu-2x"  # 2 vCPU, 4GB RAM
memory = "4096"
cpus = 2
```
Deploy to multiple regions:
```sh
# Deploy to Virginia (primary) + Amsterdam (secondary)
fly deploy
fly scale count 2 --region iad,ams

# Verify multi-region
fly status
# Should show 2 instances: 1 in iad, 1 in ams
```
References
- [FLY-IO] Fly.io. “Run Your Full Stack Apps Globally.” https://fly.io — Multi-region Firecracker microVM platform used for the Styx gateway.
- [QDRANT] Qdrant. “Vector Search Engine.” https://qdrant.tech — Managed vector search service for Grimoire and marketplace embedding queries.
- [NEON] Neon. “Serverless Postgres.” https://neon.tech — Serverless PostgreSQL with autoscaling compute, used for Styx relational data.
One service. Multi-region. Auto-healing. The infrastructure should be invisible – what matters is the knowledge flowing through it.