language: en kae3g ← back to index
(unlisted) — This essay is not shown on the main index but remains accessible via direct link

kae3g 9992: Microbrewery → Compute Cluster — Self-Hosted mantraOS Backend

Timestamp: 12025-10-05–06thhouse01980
Series: Technical Writings (9999 → 0000)
Category: Infrastructure, Self-Hosting, Cost Analysis, Hardware Architecture
Reading Time: 50 minutes

Opening: From Fermentation to Computation

{:microbrewery-as-data-center
 "A microbrewery has what compute clusters need:
  - Temperature control (for brewing lager → for cooling CPUs)
  - Humidity management (for fermentation → for hardware longevity)
  - Power infrastructure (for boiling kettles → for power supplies)
  - Plumbing (for brewing water → for liquid cooling)
  - Space (for fermentation tanks → for server racks)
  - Ventilation (for CO2 exhaust → for heat exhaust)
  
  The brewery IS a data center, waiting to be repurposed."
 
 :mantraos-context
 "From 9993: mantraOS needs LOCAL compute.
  Not AWS. Not GCP. Not Azure.
  
  Self-hosted, city-state-owned, cooperative-operated.
  Running on commodity hardware, not cloud monopolies.
  
  This document: How to actually build it."}

Part I: The Microbrewery — Baseline Specifications

Typical American Microbrewery (1-5 BBL System)

Physical Infrastructure:

Why Microbreweries Fail (and become available):

Acquisition Cost (Q4 2025 projection):

We assume LEASE model for cost analysis (more flexible, lower upfront).

Part II: Conversion Plan — Brewery → Data Center

Phase 1: Infrastructure Assessment & Modification (Month 1-2)

{:infrastructure-modifications
 {:power-upgrade
  {:existing "200A service, single-phase 240V"
   :needed "400A service, 3-phase 208V (more efficient for PSUs)"
   :cost-budget "$8,000 - $15,000"
   :cost-better "$12,000 - $20,000 (with UPS integration)"
   :cost-best "$18,000 - $30,000 (with generator backup)"
   :timeline "2-4 weeks (utility coordination required)"}
  
  :cooling-system
  {:existing "Glycol chiller (3-5 ton) for fermentation tanks"
   :adaptation "Repurpose for liquid cooling loops OR
                Use existing building HVAC + add precision AC"
   :option-1-air-cooling
   {:approach "Remove glycol system, install precision AC units"
    :cost-budget "$8,000 (used Liebert/APC units, 5-10 ton)"
    :cost-better "$15,000 (new mid-tier precision AC, 10-15 ton)"
    :cost-best "$25,000 (redundant N+1 precision AC with hot aisle containment)"
    :power-draw "3-5 kW per 10 ton unit"}
   
   :option-2-liquid-cooling
   {:approach "Adapt glycol chiller for server liquid cooling"
    :cost-budget "$5,000 (retrofit chiller, add manifolds)"
    :cost-better "$12,000 (new chiller + distribution)"
    :cost-best "$20,000 (redundant chillers + cold plates for CPUs)"
    :advantage "More efficient (PUE ~1.1 vs 1.4 for air)"
    :complexity "Higher (leak risk, maintenance)"}
   
   :recommendation "Start with air cooling (simpler), upgrade to liquid if needed"}
  
  :networking-infrastructure
  {:fiber-internet
   {:speed-budget "1 Gbps symmetrical ($300-500/month)"
    :speed-better "10 Gbps symmetrical ($800-1500/month)"
    :speed-best "2x 10 Gbps diverse paths ($2000-3000/month)"
    :note "Business fiber, not residential"}
   
   :internal-network
   {:switches-budget "Used 10GbE switches (2x 48-port), $1,500"
    :switches-better "New 10GbE managed switches, $4,000"
    :switches-best "25GbE/40GbE spine-leaf topology, $12,000"
    :cabling "CAT6A or fiber, $2,000 - $5,000"}}
  
  :racks-and-physical
  {:server-racks
   {:budget "Used 42U open-frame racks (5x), $1,000"
    :better "New enclosed 42U racks w/ PDUs (5x), $4,000"
    :best "Seismic-rated racks w/ cable mgmt (5x), $8,000"}
   
   :flooring
   {:budget "Use existing concrete, add anti-static mats, $500"
    :better "Raised floor tiles (partial), $5,000"
    :best "Full raised floor w/ cable routing, $15,000"
    :note "Brewery concrete usually good enough"}
   
   :fire-suppression
   {:budget "Dry chemical extinguishers (10x), $500"
    :better "Pre-action sprinkler system, $8,000"
    :best "FM-200 clean agent suppression, $20,000"
    :requirement "May be required by insurance/code"}}
  
  :total-infrastructure-costs
  {:budget "$27,000 - $35,000"
   :better "$58,000 - $75,000"
   :best "$120,000 - $150,000"
   :timeline "2-3 months for buildout"}}}

Phase 2: Hardware Procurement (Month 2-3)

Strategy: Buy commodity used server hardware from decommissioned data centers, Bitcoin mining farms going bust, and enterprise upgrades.

Option A: AMD EPYC CPU-Based Cluster (Recommended for mantraOS)

{:amd-epyc-cluster
 {:node-specification-budget
  {:cpu "AMD EPYC 7402P (24C/48T, 2.8GHz)"
   :source "Used enterprise servers (Dell R7525, HPE DL385)"
   :cost-per-cpu "$800 - $1,200 (used market)"
   :motherboard "Included in server chassis"
   :ram "256GB DDR4-3200 ECC RDIMM"
   :ram-cost "$400 - $600 (used/refurb)"
   :storage "2x 1TB NVMe (OS + cache), 4x 4TB SATA SSD (data)"
   :storage-cost "$300 NVMe + $800 SATA = $1,100"
   :network "Dual 10GbE (onboard)"
   :psu "Redundant 1600W (included)"
   :total-per-node "$2,300 - $2,900"
   :compute-power "~35,000 PassMark score"
   :power-draw "250W avg, 350W peak"}
  
  :node-specification-better
  {:cpu "AMD EPYC 7543 (32C/64T, 2.8GHz)"
   :source "Recent decommissions, 1-2 years old"
   :cost-per-cpu "$1,800 - $2,400"
   :ram "512GB DDR4-3200 ECC RDIMM"
   :ram-cost "$800 - $1,200"
   :storage "2x 2TB NVMe, 8x 8TB SATA SSD"
   :storage-cost "$600 NVMe + $1,600 SATA = $2,200"
   :total-per-node "$4,800 - $6,200"
   :compute-power "~55,000 PassMark score"
   :power-draw "280W avg, 400W peak"}
  
  :node-specification-best
  {:cpu "AMD EPYC 9254 (24C/48T, 2.9GHz, Zen 4)"
   :source "Current gen, new or lightly used"
   :cost-per-cpu "$3,500 - $4,500"
   :ram "768GB DDR5-4800 ECC RDIMM"
   :ram-cost "$2,000 - $2,800"
   :storage "4x 4TB NVMe Gen4, 8x 16TB SATA SSD"
   :storage-cost "$2,400 NVMe + $3,200 SATA = $5,600"
   :total-per-node "$11,100 - $14,900"
   :compute-power "~70,000 PassMark score"
   :power-draw "320W avg, 450W peak"}
  
  :cluster-configuration
  {:budget-cluster
   {:nodes 10
    :cost-per-node "$2,500 avg"
    :total-cost "$25,000"
    :total-cores "240C/480T"
    :total-ram "2.5TB"
    :total-storage "40TB NVMe + 160TB SSD"
    :aggregate-power "2.5kW avg, 3.5kW peak"
    :note "Good for dev/test, small production"}
   
   :better-cluster
   {:nodes 15
    :cost-per-node "$5,500 avg"
    :total-cost "$82,500"
    :total-cores "480C/960T"
    :total-ram "7.5TB"
    :total-storage "120TB NVMe + 480TB SSD"
    :aggregate-power "4.2kW avg, 6kW peak"
    :note "Production-ready for medium workloads"}
   
   :best-cluster
   {:nodes 20
    :cost-per-node "$13,000 avg"
    :total-cost "$260,000"
    :total-cores "480C/960T (Zen 4, higher IPC)"
    :total-ram "15TB"
    :total-storage "320TB NVMe + 1.28PB SSD"
    :aggregate-power "6.4kW avg, 9kW peak"
    :note "High-performance production cluster"}}}}

Option B: AMD GPU-Based Cluster (For AI/ML Workloads)

{:amd-gpu-cluster
 {:why-amd-not-nvidia
  "NVIDIA has monopolistic pricing + vendor lock-in.
   AMD Instinct/Radeon Pro:
   - 40-60% cheaper per TFLOP
   - Open-source ROCm stack (vs CUDA lock-in)
   - Better for mantraOS philosophy (open, accessible)"
  
  :node-specification-budget
  {:cpu "AMD EPYC 7402P (24C/48T)" 
   :cost-cpu "$1,000"
   :gpu "2x AMD Radeon Pro W6800 (32GB)"
   :cost-gpu "$1,600 used ($800 each)"
   :ram "128GB DDR4 ECC"
   :cost-ram "$300"
   :storage "2x 1TB NVMe, 2x 4TB SSD"
   :cost-storage "$700"
   :total-per-node "$3,600"
   :gpu-tflops "2x 17.83 = 35.66 TFLOPS (FP32)"
   :gpu-vram "64GB total"
   :power-draw "600W avg, 850W peak"}
  
  :node-specification-better
  {:cpu "AMD EPYC 7543 (32C/64T)"
   :cost-cpu "$2,000"
   :gpu "4x AMD Radeon Pro W7900 (48GB)"
   :cost-gpu "$8,000 ($2,000 each, used)"
   :ram "256GB DDR4 ECC"
   :cost-ram "$600"
   :storage "2x 2TB NVMe, 4x 8TB SSD"
   :cost-storage "$1,400"
   :total-per-node "$12,000"
   :gpu-tflops "4x 61 = 244 TFLOPS (FP32)"
   :gpu-vram "192GB total"
   :power-draw "1,200W avg, 1,600W peak"}
  
  :node-specification-best
  {:cpu "AMD EPYC 9354 (32C/64T, Zen 4)"
   :cost-cpu "$5,000"
   :gpu "8x AMD Instinct MI210 (64GB HBM2e)"
   :cost-gpu "$40,000 ($5,000 each, used from HPC decommissions)"
   :ram "512GB DDR5 ECC"
   :cost-ram "$1,500"
   :storage "4x 4TB NVMe Gen4, 8x 16TB SSD"
   :cost-storage "$3,000"
   :total-per-node "$49,500"
   :gpu-tflops "8x 45.3 = 362.4 TFLOPS (FP32), 181.2 TFLOPS (FP64)"
   :gpu-vram "512GB total HBM2e"
   :power-draw "2,500W avg, 3,500W peak"
   :note "Datacenter-grade AI/HPC"}
  
  :cluster-configuration
  {:budget-cluster
   {:nodes 8
    :cost-per-node "$3,600"
    :total-cost "$28,800"
    :total-gpu-tflops "285 TFLOPS"
    :total-vram "512GB"
    :aggregate-power "4.8kW avg, 6.8kW peak"}
   
   :better-cluster
   {:nodes 6
    :cost-per-node "$12,000"
    :total-cost "$72,000"
    :total-gpu-tflops "1,464 TFLOPS"
    :total-vram "1.15TB"
    :aggregate-power "7.2kW avg, 9.6kW peak"}
   
   :best-cluster
   {:nodes 4
    :cost-per-node "$49,500"
    :total-cost "$198,000"
    :total-gpu-tflops "1,450 TFLOPS (FP32)"
    :total-vram "2TB HBM2e"
    :aggregate-power "10kW avg, 14kW peak"
    :note "Overkill for most needs, matches small GPU cloud"}}}}}

Option C: ARM-Based Cluster (Energy Efficiency Focus)

{:arm-cluster
 {:why-arm
  "Lower power consumption (aligns with mantraOS sustainability).
   Ampere Altra Max (ARM Neoverse N1):
   - 128 cores per socket, 250W TDP
   - 2x cores per watt vs x86
   - Works with standard Linux (no emulation needed)"
  
  :node-specification-budget
  {:cpu "Ampere Altra Q80-30 (80C, 3.0GHz)"
   :source "Used servers (Gigabyte, Lenovo, HPE)"
   :cost-cpu "$1,500 - $2,000"
   :ram "256GB DDR4-3200 ECC"
   :cost-ram "$500"
   :storage "2x 1TB NVMe, 4x 4TB SSD"
   :cost-storage "$1,100"
   :total-per-node "$3,100 - $3,600"
   :compute-power "80 cores (no SMT on ARM)"
   :power-draw "150W avg, 200W peak"
   :note "Excellent performance-per-watt"}
  
  :node-specification-better
  {:cpu "Ampere Altra Max M128-30 (128C, 3.0GHz)"
   :source "Recent releases, some used available"
   :cost-cpu "$3,500 - $4,500"
   :ram "512GB DDR4-3200 ECC"
   :cost-ram "$1,000"
   :storage "4x 2TB NVMe, 8x 8TB SSD"
   :cost-storage "$2,800"
   :total-per-node "$7,300 - $8,300"
   :compute-power "128 cores"
   :power-draw "220W avg, 280W peak"}
  
  :cluster-configuration
  {:budget-cluster
   {:nodes 12
    :cost-per-node "$3,400"
    :total-cost "$40,800"
    :total-cores 960
    :total-ram "3TB"
    :aggregate-power "1.8kW avg, 2.4kW peak"
    :efficiency "Best $/watt, best cores/watt"}
   
   :better-cluster
   {:nodes 16
    :cost-per-node "$7,800"
    :total-cost "$124,800"
    :total-cores 2048
    :total-ram "8TB"
    :aggregate-power "3.5kW avg, 4.5kW peak"
    :note "Massive parallel throughput"}}
  
  :arm-limitations
  "Not all software has ARM builds (though improving).
   Docker images: Many x86-only (can use QEMU, performance hit).
   Best for: Web services, databases, batch processing.
   Not ideal for: x86-specific legacy apps, some ML frameworks."}}

Phase 3: Software Stack & Orchestration (Month 3-4)

{:software-stack
 {:operating-system
  {:choice "NixOS (from 9996 writing)"
   :why ["Declarative config (entire cluster defined in code)"
         "Reproducible (same config = same result)"
         "Atomic upgrades (rollback if fails)"
         "Perfect for mantraOS philosophy"]
   :deployment "NixOps or deploy-rs for cluster management"}
  
  :container-orchestration
  {:option-1-kubernetes
   {:distribution "K3s (lightweight) or vanilla Kubernetes"
    :pros "Industry standard, huge ecosystem, battle-tested"
    :cons "Complex, resource-heavy (3-5% overhead)"
    :cost "Free (open-source)"
    :learning-curve "Steep (2-3 months to proficiency)"}
   
   :option-2-nomad
   {:vendor "HashiCorp"
    :pros "Simpler than K8s, less overhead, multi-workload (containers, VMs, binaries)"
    :cons "Smaller ecosystem than K8s"
    :cost "Free (open-source, or Nomad Enterprise $$$)"
    :learning-curve "Moderate (1-2 weeks)"}
   
   :recommendation "Nomad for simplicity, K8s if need ecosystem"}
  
  :storage-layer
  {:option-1-ceph
   {:type "Distributed object/block/file storage"
    :pros "Scales infinitely, self-healing, production-proven"
    :cons "Complex, requires 3+ nodes minimum"
    :deployment "Rook operator (for K8s) or standalone"}
   
   :option-2-minio
   {:type "S3-compatible object storage"
    :pros "Simple, fast, S3 API (app compatibility)"
    :cons "Not block storage (need separate for DBs)"
    :deployment "Docker container or bare metal"}
   
   :option-3-longhorn
   {:type "Cloud-native distributed block storage"
    :pros "K8s-native, simple, good for small clusters"
    :cons "Less mature than Ceph"
    :deployment "Helm chart"}}
  
  :networking
  {:cni-plugin "Cilium (eBPF-based, high performance)"
   :service-mesh "Linkerd (lightweight) or Istio (feature-rich)"
   :ingress "Traefik or NGINX Ingress Controller"
   :load-balancing "MetalLB (on-prem load balancer)"}
  
  :observability
  {:metrics "Prometheus + Thanos (long-term storage)"
   :logging "Loki (Grafana's log aggregator)"
   :tracing "Jaeger or Tempo"
   :dashboards "Grafana"
   :cost "All free, open-source"
   :resource-overhead "~5-8% cluster capacity"}
  
  :ai-ml-stack
  {:inference-runtime "ONNX Runtime or TorchServe"
   :model-serving "KServe (K8s-native) or BentoML"
   :vector-database "Qdrant or Weaviate (self-hosted)"
   :llm-inference "vLLM or Text Generation Inference (TGI)"
   :amd-gpu-support "ROCm 6.x (install on nodes w/ GPUs)"}}}

Part III: Total Cost of Ownership (TCO) Analysis

Scenario 1: CPU-Based Cluster (AMD EPYC, Better Tier)

{:tco-cpu-cluster-better
 {:capital-expenditure-capex
  {:infrastructure-buildout "$65,000"
   :compute-hardware "$82,500 (15 nodes)"
   :networking "$6,000"
   :ups-battery "$8,000 (N+1 redundant, 30min runtime)"
   :monitoring-tools "$2,000 (hardware, sensors)"
   :misc-cables-tools "$3,000"
   :total-capex "$166,500"
   :depreciation "3-year straight-line = $55,500/year"}
  
  :operating-expenditure-opex-monthly
  {:lease "$5,000/month"
   :power {:consumption "6kW cluster + 2kW cooling + 1kW networking = 9kW"
           :hours-per-month 730
           :kwh-per-month 6570
           :rate-per-kwh "$0.12 (commercial rate)"
           :monthly-cost "$788"}
   :internet "$1,000/month (10 Gbps)"
   :insurance "$500/month"
   :maintenance-labor "$2,000/month (part-time admin, or coop member)"
   :cooling-maintenance "$200/month"
   :total-opex-monthly "$9,488"}
  
  :annual-costs
  {:opex-annual "$113,856 (opex × 12)"
   :depreciation-annual "$55,500"
   :total-annual-cost "$169,356"
   :cost-per-month "$14,113"}
  
  :capacity
  {:cpu-cores "480C/960T"
   :ram "7.5TB"
   :storage "120TB NVMe + 480TB SSD"
   :cost-per-core-per-month "$29.40"
   :cost-per-gb-ram-per-month "$1.88"}}}

Scenario 2: GPU-Based Cluster (AMD GPU, Better Tier)

{:tco-gpu-cluster-better
 {:capital-expenditure-capex
  {:infrastructure-buildout "$75,000 (beefier power/cooling for GPUs)"
   :compute-hardware "$72,000 (6 nodes w/ 4x W7900 each)"
   :networking "$8,000 (higher bandwidth for GPU traffic)"
   :ups-battery "$12,000 (larger for GPU power spikes)"
   :monitoring-tools "$3,000"
   :misc "$4,000"
   :total-capex "$174,000"
   :depreciation "3-year = $58,000/year"}
  
  :operating-expenditure-opex-monthly
  {:lease "$5,000/month"
   :power {:consumption "9.6kW cluster + 4kW cooling = 13.6kW"
           :kwh-per-month 9928
           :monthly-cost "$1,191"}
   :internet "$1,000/month"
   :insurance "$600/month (higher value equipment)"
   :maintenance-labor "$2,500/month"
   :cooling-maintenance "$300/month"
   :total-opex-monthly "$10,591"}
  
  :annual-costs
  {:opex-annual "$127,092"
   :depreciation-annual "$58,000"
   :total-annual-cost "$185,092"
   :cost-per-month "$15,424"}
  
  :capacity
  {:gpu-tflops "1,464 TFLOPS (FP32)"
   :gpu-vram "1.15TB"
   :cpu-cores "192C/384T"
   :ram "1.5TB"
   :cost-per-tflop-per-month "$10.54"
   :cost-per-gb-vram-per-month "$13.41"}}}

Scenario 3: ARM Cluster (Ampere, Better Tier)

{:tco-arm-cluster-better
 {:capital-expenditure-capex
  {:infrastructure-buildout "$55,000 (lower power = less cooling)"
   :compute-hardware "$124,800 (16 nodes)"
   :networking "$6,000"
   :ups-battery "$6,000 (smaller due to lower power)"
   :monitoring-tools "$2,000"
   :misc "$3,000"
   :total-capex "$196,800"
   :depreciation "3-year = $65,600/year"}
  
  :operating-expenditure-opex-monthly
  {:lease "$5,000/month"
   :power {:consumption "4.5kW cluster + 1.5kW cooling = 6kW"
           :kwh-per-month 4380
           :monthly-cost "$526"}
   :internet "$1,000/month"
   :insurance "$500/month"
   :maintenance-labor "$2,000/month"
   :cooling-maintenance "$150/month"
   :total-opex-monthly "$9,176"}
  
  :annual-costs
  {:opex-annual "$110,112"
   :depreciation-annual "$65,600"
   :total-annual-cost "$175,712"
   :cost-per-month "$14,643"}
  
  :capacity
  {:cpu-cores "2,048 ARM cores"
   :ram "8TB"
   :storage "256TB NVMe + 512TB SSD"
   :cost-per-core-per-month "$7.15"
   :cost-per-gb-ram-per-month "$1.83"}}}

Part IV: Cloud Cost Comparisons

AWS EKS (Elastic Kubernetes Service) — Equivalent Workload

{:aws-eks-cpu-equivalent
 {:scenario "Match CPU cluster (480C/960T, 7.5TB RAM)"
  
  :instance-type "m6i.32xlarge (128 vCPU, 512GB RAM)"
  :instances-needed "4 (to get 512 vCPU, but AWS vCPU = 1 hyperthread)"
  :on-demand-price "$6.144/hour per instance"
  :monthly-cost-compute "$17,694 (4 × $6.144 × 730)"
  
  :storage-ebs
  {:nvme-equivalent "120TB gp3 SSD"
   :cost "$9,600/month ($0.08/GB-month)"
   :ssd-equivalent "480TB gp3"
   :cost "$38,400/month"
   :total-storage "$48,000/month"}
  
  :networking
  {:data-transfer-out "5TB/month (conservative)"
   :cost "$450/month ($0.09/GB after first 10TB free tier)"
   :load-balancer "$22/month (NLB)"
   :total-networking "$472/month"}
  
  :eks-control-plane "$73/month per cluster"
  
  :total-monthly-aws "$66,239"
  :total-annual-aws "$794,868"
  
  :self-hosted-annual "$169,356 (from Scenario 1)"
  :savings-annual "$625,512 (79% cheaper self-hosted)"
  :break-even "2.5 months"}
 
 :aws-eks-gpu-equivalent
  {:scenario "Match GPU cluster (1,464 TFLOPS, 1.15TB VRAM)"
   
   :instance-type "p4d.24xlarge (8x A100 40GB, 96 vCPU, 1.15TB RAM)"
   :instances-needed "1 (8x A100 = ~1,248 TFLOPS FP32, close enough)"
   :on-demand-price "$32.77/hour"
   :monthly-cost-compute "$23,922"
   
   :storage-ebs
   {:nvme "12TB gp3"
    :cost "$960/month"
    :ssd "48TB gp3"
    :cost "$3,840/month"
    :total-storage "$4,800/month"}
   
   :networking "$500/month"
   :eks-control-plane "$73/month"
   
   :total-monthly-aws "$29,295"
   :total-annual-aws "$351,540"
   
   :self-hosted-annual "$185,092 (from Scenario 2)"
   :savings-annual "$166,448 (47% cheaper self-hosted)"
   :break-even "5.9 months"
   
   :note "AWS uses NVIDIA A100, we use AMD W7900 (cheaper/TFLOP but less mature ecosystem)"}}}

GCP GKE (Google Kubernetes Engine) — Equivalent Workload

{:gcp-gke-cpu-equivalent
 {:scenario "Match CPU cluster (480C/960T, 7.5TB RAM)"
  
  :instance-type "n2-highmem-128 (128 vCPU, 864GB RAM)"
  :instances-needed "4"
  :on-demand-price "$5.7536/hour per instance"
  :monthly-cost-compute "$16,800"
  
  :storage-pd-ssd
  {:nvme-equiv "120TB SSD"
   :cost "$20,400/month ($0.17/GB-month)"
   :standard-ssd "480TB"
   :cost "$81,600/month"
   :total-storage "$102,000/month"
   :note "GCP storage MORE expensive than AWS"}
  
  :networking
  {:egress "5TB/month"
   :cost "$600/month ($0.12/GB)"
   :load-balancer "$18/month"
   :total-networking "$618/month"}
  
  :gke-control-plane "$73/month"
  
  :total-monthly-gcp "$119,491"
  :total-annual-gcp "$1,433,892"
  
  :self-hosted-annual "$169,356"
  :savings-annual "$1,264,536 (88% cheaper self-hosted)"
  :break-even "1.4 months"
  :conclusion "GCP storage costs MURDER the budget"}
 
 :gcp-gke-gpu-equivalent
  {:scenario "Match GPU cluster (1,464 TFLOPS, 1.15TB VRAM)"
   
   :instance-type "a2-ultragpu-8g (8x A100 80GB, 96 vCPU)"
   :note "Overkill (640GB VRAM vs our 1.15TB), but closest match"
   :instances-needed "2 (to get ~1.15TB VRAM)"
   :on-demand-price "$29.39/hour per instance"
   :monthly-cost-compute "$42,909 (2 × $29.39 × 730)"
   
   :storage-pd-ssd
   {:total "60TB (lower due to higher compute)"
    :cost "$10,200/month"}
   
   :networking "$600/month"
   :gke-control-plane "$73/month"
   
   :total-monthly-gcp "$53,782"
   :total-annual-gcp "$645,384"
   
   :self-hosted-annual "$185,092"
   :savings-annual "$460,292 (71% cheaper self-hosted)"
   :break-even "3.8 months"}}

Part V: LLM Service Comparisons (Cursor, Claude Code)

{:llm-services-cost
 {:cursor-pro
  {:price "$20/month per seat"
   :included "500 premium requests/month (GPT-4 class)"
   :overage "$0.01 per request after 500"
   :use-case "Code completion, chat, inline edits"
   :self-hosted-equivalent
   {:model "CodeLlama 34B or DeepSeek Coder 33B"
    :hardware-req "1x AMD W7900 (48GB VRAM) can run it"
    :inference-cost "~$0.0001/request (electricity only)"
    :setup-effort "Medium (deploy vLLM, fine-tune for code)"
    :quality "90-95% of Cursor quality (rapidly improving)"}
   
   :heavy-user-scenario
   {:requests-per-month 5000
    :cursor-cost "$20 + $45 overage = $65/month"
    :self-hosted-cost "$0.50/month (electricity)"
    :annual-savings "$774 per user"
    :team-of-10 "$7,740/year savings"}}
  
  :cursor-max
  {:price "$40/month per seat"
   :included "Unlimited premium requests (fair use)"
   :use-case "Heavy users, AI pair programming"
   :self-hosted-equivalent
   {:model "Mixtral 8x22B or Qwen2.5 Coder 32B"
    :hardware-req "2x AMD W7900 (96GB VRAM)"
    :inference-cost "~$0.0002/request"
    :quality "95-98% of Cursor Max"
    :latency "Slightly higher (local GPU vs Cursor's optimized infra)"}
   
   :team-of-10-annual
   {:cursor-cost "$4,800/year"
    :self-hosted-cost "$24/year (electricity)"
    :hardware-amortized "$4,000/year (2x W7900, 3-year depreciation)"
    :total-self-hosted "$4,024/year"
    :savings "$776/year (16% cheaper, BUT you own hardware)"
    :true-benefit "After year 3, only electricity cost ($24/year)"}}
  
  :claude-code-pro
  {:price "$20/month per seat (hypothetical, not released Q4 2025)"
   :included "Assumed similar to Cursor"
   :model "Claude 3.7 Sonnet or Opus"
   :use-case "Code generation, refactoring, architecture chat"
   
   :self-hosted-equivalent
   {:model "No open-source Claude equivalent yet"
    :alternatives "Llama 3.3 405B (too big), Qwen2.5 72B (fits on 4x W7900)"
    :quality "70-85% of Claude (as of Q4 2025, gap closing)"
    :hardware-req "4x AMD W7900 (192GB VRAM) for Qwen 72B"
    :inference-cost "$0.0004/request"}}
  
  :claude-code-max
  {:price "$40/month per seat (hypothetical)"
   :included "Unlimited, extended context (1M tokens)"
   :model "Claude 3.7 Opus or next-gen"
   
   :self-hosted-challenge
   "No open-source model matches Claude Opus quality yet.
    Would need to wait for Llama 4 or similar (2026?).
    
    Alternative: Run multiple smaller models (MOE approach):
    - Code: Qwen2.5 Coder 32B
    - Reasoning: Llama 3.3 70B
    - Refactoring: DeepSeek Coder 33B
    
    Hardware: 6x AMD W7900 (288GB VRAM total)
    Cost: $12,000 (used GPUs)
    Depreciation: $4,000/year
    
    Team of 10:
    - Claude Max: $4,800/year
    - Self-hosted: $4,000/year (hardware) + $50/year (power) = $4,050
    - Savings: $750/year (16%)
    - After year 3: Only $50/year (99% savings)"}}
 
 :combined-llm-cost-team-of-10
 {:cursor-max-claude-max "$9,600/year (both services)"
  :self-hosted-full-stack
  {:hardware "8x AMD W7900 (384GB VRAM) - covers all models"
   :hardware-cost "$16,000 (used market)"
   :depreciation-annual "$5,333"
   :power-annual "$100 (8 GPUs × ~250W × 10% utilization)"
   :total-annual "$5,433"
   :savings-year-1 "$4,167 (43%)"
   :savings-year-2 "$4,167"
   :savings-year-3 "$4,167"
   :savings-year-4+ "$9,500/year (99% savings, only power)"}
  
  :break-even "~3.8 years, then massive savings"
  :non-financial-benefits
  ["Data privacy (code never leaves your infra)"
   "Customization (fine-tune on your codebase)"
   "No rate limits (run 24/7 if needed)"
   "Independence from vendor pricing changes"
   "Aligns with mantraOS philosophy (self-hosted, cooperative)"]}}}

Part VI: Comprehensive Cost Comparison (3-Year TCO)

Summary Table: All Options

OptionYear 1Year 2Year 33-Year TotalNotes
Self-Hosted CPU Cluster (Better)$166k CAPEX + $114k OPEX = $280k$114k$114k$508kOwn hardware, full control
Self-Hosted GPU Cluster (Better)$174k CAPEX + $127k OPEX = $301k$127k$127k$555kAI/ML focused
Self-Hosted ARM Cluster (Better)$197k CAPEX + $110k OPEX = $307k$110k$110k$527kBest efficiency
AWS EKS CPU Equivalent$795k$795k$795k$2,385k4.7x more expensive
AWS EKS GPU Equivalent$352k$352k$352k$1,056k1.9x more expensive
GCP GKE CPU Equivalent$1,434k$1,434k$1,434k$4,302k8.5x more expensive!
GCP GKE GPU Equivalent$645k$645k$645k$1,935k3.5x more expensive
Cursor Max (team of 10)$4.8k$4.8k$4.8k$14.4kPer-seat SaaS
Claude Code Max (team of 10)$4.8k$4.8k$4.8k$14.4kHypothetical pricing
Self-Hosted LLM (8x GPU)$16k CAPEX + $0.1k OPEX = $16.1k$0.1k$0.1k$16.3kCheaper after year 1

Key Insights

{:insights
 {:self-hosted-cpu-vs-aws
  "AWS EKS costs $2.4M over 3 years vs $508k self-hosted.
   Savings: $1.9M (79%)
   Break-even: 2.5 months
   
   Conclusion: AWS makes NO financial sense for steady workloads."
  
  :self-hosted-gpu-vs-aws
  "AWS p4d costs $1.1M vs $555k self-hosted.
   Savings: $501k (47%)
   Break-even: 5.9 months
   
   Conclusion: Self-hosted wins, but margin smaller (GPU depreciation faster)."
  
  :gcp-storage-costs
  "GCP persistent disk is INSANELY expensive.
   $102k/month for 600TB vs AWS $48k.
   
   Unless you're all-in on GCP services (BigQuery, etc.), avoid for storage-heavy workloads."
  
  :arm-efficiency
  "ARM cluster: 2,048 cores for $527k over 3 years.
   Cost per core: $257/core over 3 years.
   
   x86 cluster: 480 cores for $508k.
   Cost per core: $1,058/core over 3 years.
   
   ARM is 4x cheaper per core, IF your software supports ARM."
  
  :llm-self-hosting
  "For a team of 10, Cursor + Claude = $9.6k/year.
   Self-hosting 8x GPUs: $5.4k/year (year 1), then $100/year.
   
   Break-even: 3.8 years.
   After that: 99% cost reduction.
   
   Non-financial: Data privacy, customization, no rate limits."
  
  :mantraos-recommendation
  "For mantraOS city-states:
   
   1. Start with CPU cluster (Better tier): $508k/3yr
      - Handles web services, databases, batch processing
      - Proven ROI vs cloud
   
   2. Add GPU nodes incrementally as AI workloads grow
      - 2-4 nodes initially ($24k-48k)
      - Scale based on demand
   
   3. Deploy LLM self-hosting for dev tools
      - 4-8 GPUs dedicated to code assistance
      - Saves $5k+/year per 10 developers after year 1
   
   4. Total Year 1: ~$550k (CPU cluster + small GPU + LLM)
      Total Years 2-3: ~$250k/year
      
      vs AWS equivalent: $800k+/year
      
      3-year savings: ~$1.4M"}}

Part VII: Implementation Timeline

Month-by-Month Breakdown (Better Tier, CPU + GPU Hybrid)

{:implementation-timeline
 {:month-1 "Planning & Acquisition"
  {:tasks
   ["Identify microbrewery location (Sacramento Valley pilot)"
    "Negotiate lease ($5k/month, 2-year term)"
    "Electrical assessment (hire licensed electrician)"
    "Order long-lead items (precision AC units, UPS systems)"
    "Source used server hardware (eBay, LabGopher, /r/homelabsales)"
    "Plan fiber internet installation (business tier)"]
   :team "2-3 people (project lead, electrician consultant, procurement)"
   :spend "$25k (deposits, initial hardware purchases)"}
  
  :month-2 "Infrastructure Buildout"
  {:tasks
   ["Electrical upgrade to 400A 3-phase"
    "Install precision AC (2x 10-ton units, N+1 redundancy)"
    "Network fiber installation (10 Gbps)"
    "Rack installation (5x 42U racks)"
    "UPS installation (30min runtime at 10kW load)"
    "Fire suppression (pre-action sprinkler)"]
   :team "4-6 people (electricians, HVAC, network techs, general labor)"
   :spend "$60k (infrastructure completion)"}
  
  :month-3 "Hardware Deployment"
  {:tasks
   ["Receive and inventory servers (15x EPYC CPU, 4x EPYC+GPU)"
    "Rack and stack all nodes"
    "Cable management (power, network)"
    "Install NixOS on all nodes (via PXE boot)"
    "Configure networking (VLANs, routing)"
    "Initial cluster bring-up (K3s or Nomad)"]
   :team "3-4 people (sysadmins, network engineers)"
   :spend "$110k (hardware arrival, final components)"}
  
  :month-4 "Software Configuration"
  {:tasks
   ["Deploy storage layer (Ceph or Longhorn)"
    "Configure service mesh (Linkerd)"
    "Set up observability (Prometheus, Grafana, Loki)"
    "Deploy first test workloads"
    "Load testing and optimization"
    "Documentation (runbooks, architecture diagrams)"]
   :team "2-3 people (DevOps, SRE)"
   :spend "$5k (monitoring hardware, misc)"}
  
  :month-5 "AI/ML Stack Deployment"
  {:tasks
   ["Install ROCm on GPU nodes"
    "Deploy vLLM for LLM inference"
    "Set up model registry (MinIO + metadata DB)"
    "Deploy vector database (Qdrant)"
    "Fine-tune code models (CodeLlama, DeepSeek)"
    "Integration testing with dev tools"]
   :team "2 people (ML engineers)"
   :spend "$2k (model weights, test credits)"}
  
  :month-6 "Production Cutover & Optimization"
  {:tasks
   ["Migrate dev/staging workloads from cloud"
    "Performance tuning (CPU governor, NUMA, huge pages)"
    "Security hardening (SELinux, firewall rules)"
    "Disaster recovery testing (backup/restore)"
    "Cost validation (compare actual vs projected)"
    "Team training (on-call rotation)"]
   :team "3-5 people (full team)"
   :spend "$3k (DR testing, training)"}
  
  :total-timeline "6 months from lease signing to production"
  :total-capex "$205k (infrastructure + hardware)"
  :total-opex-6mo "$57k (6 × $9.5k)"
  :total-investment-6mo "$262k"
  
  :ongoing-monthly "$9.5k (opex) + $5.4k (depreciation) = $14.9k/month"}

Part VIII: Risk Assessment & Mitigation

{:risks-and-mitigation
 {:technical-risks
  {:hardware-failure
   {:probability "Medium (used hardware has higher failure rate)"
    :impact "Medium (redundancy limits blast radius)"
    :mitigation
    ["N+1 redundancy on critical components"
     "Hot spares (1-2 nodes extra, not racked)"
     "Comprehensive monitoring (detect failures early)"
     "Vendor support contracts for critical parts (power supplies, etc.)"]
    :residual-risk "Low"}
   
   :cooling-failure
   {:probability "Low (HVAC is mature tech)"
    :impact "High (thermal shutdown, potential hardware damage)"
    :mitigation
    ["Redundant AC units (N+1)"
     "Temperature monitoring with alerts"
     "Automatic workload migration if temps exceed threshold"
     "Emergency procedure: shut down non-critical workloads first"]
    :residual-risk "Low"}
   
   :network-outage
   {:probability "Low (business fiber has SLAs)"
    :impact "High (cluster inaccessible)"
    :mitigation
    ["Dual fiber providers (diverse paths) for Best tier"
     "4G/5G backup link (Starlink or cellular) for critical management"
     "Offline operational capability (cluster can function without internet for internal tasks)"]
    :residual-risk "Low (Budget tier), Very Low (Best tier)"}
   
   :software-bugs
   {:probability "Medium (Kubernetes is complex)"
    :impact "Variable (depends on bug)"
    :mitigation
    ["Staging environment (separate cluster or namespace)"
     "Blue-green deployments"
     "Automated rollback on failure"
     "Comprehensive testing (unit, integration, load)"]
    :residual-risk "Low"}}
  
  :business-risks
  {:landlord-issues
   {:probability "Low-Medium (breweries in distressed markets)"
    :impact "Medium (forced relocation)"
    :mitigation
    ["2-year lease minimum (time to find alternative)"
     "Portable infrastructure (racks on wheels, quick-disconnect)"
     "Backup site identified (in advance)"
     "Documentation for rapid rebuild (NixOS configs)"]
    :residual-risk "Low"}
   
   :power-cost-volatility
   {:probability "Medium (energy markets fluctuate)"
    :impact "Low-Medium (10-20% cost increase)"
    :mitigation
    ["Solar panels (future phase, reduce grid dependence)"
     "Power purchase agreement (PPA) to lock rates"
     "ARM cluster option (if power costs spike, migrate to lower-watt)"]
    :residual-risk "Low"}
   
   :regulatory-compliance
   {:probability "Low (data centers generally allowed in industrial zones)"
    :impact "High (shutdown order)"
    :mitigation
    ["Zoning verification before lease signing"
     "Permits for electrical work (licensed contractors)"
     "Fire code compliance (suppression, exits)"
     "Environmental (no hazardous materials, proper e-waste disposal)"]
    :residual-risk "Very Low"}}
  
  :security-risks
  {:physical-security
   {:probability "Low (industrial area, not high-crime)"
    :impact "High (hardware theft, data breach if drives stolen)"
    :mitigation
    ["Cameras (internal and external)"
     "Access control (keycard or biometric)"
     "Full-disk encryption on all drives"
     "Alarm system (monitored)"]
    :residual-risk "Low"}
   
   :cyber-security
   {:probability "Medium (internet-facing services)"
    :impact "High (data breach, ransomware)"
    :mitigation
    ["Defense in depth (firewall, WAF, IDS/IPS)"
     "Regular patching (NixOS makes this easy)"
     "Penetration testing (quarterly)"
     "Incident response plan"
     "Offline backups (air-gapped, immutable)"]
    :residual-risk "Low-Medium (ongoing vigilance required)"}}}}

Part IX: Conclusion & Recommendations

The mantraOS Self-Hosted Compute Strategy

{:strategic-recommendation
 {:phase-1-pilot "Sacramento Valley Microbrewery (Q4 2025 - Q2 2026)"
  {:investment "$205k CAPEX + $57k OPEX (6 months) = $262k"
   :configuration
   {:cpu-nodes "10x AMD EPYC 7543 (320C/640T, 5TB RAM)"
    :gpu-nodes "4x EPYC + 4x AMD W7900 (768 TFLOPS, 768GB VRAM)"
    :storage "80TB NVMe + 320TB SSD"
    :networking "10 Gbps fiber, dual-provider"}
   
   :workloads
   ["mantraOS web services (Svelte frontend, Clojure backend)"
    "Agricultural data processing (sensor streams, weather models)"
    "Local AI inference (crop recommendations, pest detection)"
    "Dev tools (self-hosted Cursor/Claude equivalent for 20 devs)"
    "Voucher system backend (transaction processing, ledger)"]
   
   :expected-outcomes
   ["Prove TCO model (validate $1.9M savings vs AWS over 3 years)"
    "Demonstrate technical feasibility (NixOS + K3s + ROCm stack)"
    "Train local team (5-8 people become proficient)"
    "Showcase to other city-states (marketing for model replication)"]}
  
  :phase-2-expansion "Willamette Valley, Shenandoah Valley (Q3 2026 - Q2 2027)"
  {:investment "$500k (2 more sites, lessons learned from pilot)"
   :configuration "Replicate Sacramento model with minor optimizations"
   :workloads "Same as Phase 1 + federated learning (share models across city-states)"
   :networking "Dark fiber between sites (private WAN, 100 Gbps)"
   :expected-outcomes
   ["3 operational clusters (geographic redundancy)"
    "Federated compute (burst workloads across sites)"
    "Cost reduction (bulk hardware purchasing)"
    "Proof of inter-city-state cooperation"]}
  
  :phase-3-federation "5-10 City-States (2027-2028)"
  {:investment "$2-4M (scale economics kick in)"
   :configuration
   {:total-nodes "200-400 servers across 10 sites"
    :total-compute "10,000+ cores, 1+ PB RAM"
    :total-gpu "20-40 PB VRAM (AI powerhouse)"
    :total-storage "10+ PB NVMe, 50+ PB SSD"}
   
   :advanced-features
   ["Multi-cluster K8s federation (Cluster API)"
    "Cross-site storage replication (geo-distributed Ceph)"
    "Shared model registry (all city-states access same LLMs)"
    "Cooperative compute market (city-states rent excess capacity to each other)"
    "Disaster recovery (any 2 sites can fail, system continues)"]
   
   :expected-outcomes
   ["Compete with regional cloud providers (AWS us-west, etc.)"
    "Complete independence from Big Tech"
    "Revenue generation (sell excess compute to aligned orgs)"
    "Model for national replication (100+ city-states by 2030)"]}
  
  :mantraos-alignment
  "This infrastructure IS mantraOS philosophy manifested:
   
   - Keter (Sovereignty): Own our compute, not rent from AWS
   - Chokhmah (Wisdom): Repurpose breweries, commodity hardware
   - Binah (Recycling): Buy used servers, extend their life
   - Chesed (Open Source): All software GPL/open, share with network
   - Gevurah (Boundaries): Local control, no cloud vendor lock-in
   - Tiferet (Integration): Hybrid CPU/GPU, multiple workloads
   - Netzach (Resilience): Geographic redundancy, N+1 everything
   - Hod (Humility): Right-sized tech, not cloud maximalism
   - Yesod (Foundation): Physical infrastructure in each city-state
   - Malkhut (Manifestation): Real clusters, serving real users
   
   The microbrewery → data center is LITERAL Teshuvah (return):
   Fermentation → Computation
   Brewing → Building
   Local beer → Local data
   
   Both serve the community.
   Both are rooted in place.
   Both resist corporate extraction."}}

Final Cost Comparison (3-Year, Fully Loaded)

Scenario3-Year CostCost/MonthComparison
Self-Hosted Pilot (Better)$508k$14.1kBaseline
AWS EKS Equivalent$2,385k$66.2k4.7x more
GCP GKE Equivalent$4,302k$119.5k8.5x more
Hybrid Cloud (80% self, 20% cloud burst)$850k$23.6k1.7x more

Conclusion: Self-hosting saves $1.9M - $3.8M over 3 years compared to cloud.

Break-even: 2.5 - 5.9 months depending on configuration.

Epilogue: From Hops to Hopes

{:closing-meditation
 {:the-brewery-metaphor
  "Brewing is alchemy: water + grain + hops + yeast → beer.
   Computing is alchemy: power + silicon + code + data → intelligence.
   
   Both require:
   - Precise temperature control
   - Sterile environment (clean room vs clean install)
   - Patience (fermentation vs compilation)
   - Iteration (recipe tweaking vs debugging)
   - Community (beer culture vs open source)
   
   The microbrewery failed because it tried to compete globally.
   The data center succeeds because it serves locally.
   
   This is the mantraOS lesson:
   Not bigger, but better.
   Not everywhere, but here.
   Not extracted, but rooted."
  
  :the-economic-case
  "Self-hosting is not nostalgia.
   It's not 'going back' to on-prem 1990s.
   
   It's recognizing that:
   - Cloud economics ONLY work for spiky/variable workloads
   - Steady workloads (like city-state infrastructure) are 5-8x cheaper on-prem
   - Owning hardware builds equity (vs renting eternally)
   - Data sovereignty matters (your data, your hardware, your control)
   
   AWS/GCP profit margins are 20-30%.
   That's YOUR money if you self-host.
   
   Over 10 years: $5-8M vs $20-40M.
   The choice is obvious."
  
  :the-philosophical-case
  "From 9993: Gevurah is strength through boundaries.
   The boundary is: THIS is ours.
   
   The cloud is tzimtzum in reverse—
   infinite expansion, no limits, 'everywhere' computing.
   
   But mantraOS contracts, creates space:
   This microbrewery.
   This city-state.
   This community.
   
   In that bounded space, we build what's ours.
   
   The fermentation tanks held beer.
   Now they hold the promise of digital sovereignty.
   
   From hops to hopes.
   From brewing to building.
   From extraction to regeneration.
   
   This is the way."}}

References & Resources

{:technical-resources
 ["AMD EPYC Processors: Specifications and TCO Models"
  "AMD Instinct/Radeon Pro: AI/ML Performance Benchmarks"
  "Ampere Altra: ARM Server Architecture"
  "NixOS: Declarative Cluster Management (docs.nixos.org)"
  "K3s: Lightweight Kubernetes (k3s.io)"
  "HashiCorp Nomad: Simple Orchestration (nomadproject.io)"
  "Ceph Storage: Distributed Storage (ceph.io)"
  "ROCm: AMD GPU Compute Platform (rocm.docs.amd.com)"
  "vLLM: Fast LLM Inference (vllm.ai)"]
 
 :cost-analysis-sources
 ["AWS EC2 Pricing Calculator (calculator.aws)"
  "GCP Pricing Calculator (cloud.google.com/products/calculator)"
  "Server Monkey: Used Enterprise Hardware Pricing"
  "LabGopher: Data Center Surplus Search Engine"
  "/r/homelab, /r/homelabsales: Community pricing data"
  "Gartner TCO Reports: On-Prem vs Cloud (2024-2025)"]
 
 :brewery-conversion-guides
 ["Industrial Space Reuse: Zoning and Permits (ICMA guides)"
  "Data Center Power and Cooling: ASHRAE TC 9.9 Standards"
  "Microbrewery Failure Analysis (Brewers Association data)"
  "Adaptive Reuse Architecture: Industrial to Tech"]
 
 :mantraos-context
 ["9993: mantraOS — American Agricultural Tech City-States"
  "9994: European Veganic Vision Through Dante's Paradiso"
  "9995: Robotic Veganic Farming & American Unity"
  "9996: NixOS VirtualBox Development on macOS"]}

Next Writing: 9991 — (To be determined)
Previous Writings:

Commit Hash: (To be generated)
Timestamp: 12025-10-05--06thhouse01980
Iteration: 18 of 2000
Remaining: 1982

From fermentation to computation.
From cloud dependency to local sovereignty.
From renting to owning.
From extraction to building.

The microbrewery becomes the data center.
The city-state becomes digitally sovereign.
mantraOS boots from soil and silicon.

🍺 → 💻 The transformation is complete. 🌱

Copyright © 2025 kae3g | Dual-licensed under Apache-2.0 / MIT
Competitive technology in service of clarity and beauty

View Hidden Docs Index | Return to Main Index


← back to index