Building My Own Cloud

Six dedicated servers in a German data centre, and a multi-tenant Kubernetes platform on top of them. Most of what I run my life on now lives there. This is an honest account of why I built it, what it actually costs, and the situations in which you absolutely should not do the same.
I rent six dedicated servers from a company in Germany. Together they have more cores, more memory, and more SSD storage than most production clusters I worked on a decade ago.

I run my own Kubernetes on them. Not managed. Not EKS. Not GKE. The whole stack, from the immutable OS up to the workloads.

People who hear this ask me why, and the question usually arrives in one of two tones. The dangerous tone is “that’s amazing, how do I do it”. The responsible tone is “why on earth would you do that to yourself”.

This post is for the second group.

Why the cloud is the right answer for almost everyone

Let me get this out of the way honestly: by every conventional metric, I should be using the cloud.

Managed Kubernetes has become genuinely good. EKS has dramatically improved over the last three years. GKE has always been better than people gave it credit for. The serverless options are mature. The serverless databases are mature. The observability is mature. The bill is predictable in the way a Tuesday is predictable.

Self-hosting violates almost every assumption that makes a startup productive. Time is the most expensive resource you have. The cloud sells you abstractions that turn that time into product. Running your own substrate means the time goes into the substrate.

If you are trying to ship a product to customers — go use the cloud. Stop reading this post. It will only confuse you.

What the cloud does not sell you

Here is what the cloud will not sell you, even if you are willing to pay extra: control over your own roadmap.

The cloud’s roadmap is the cloud’s. They decide which APIs deprecate. They decide which regions get the new feature. They decide what your egress bill looks like. They decide whether your monitoring vendor — sitting on top of their infrastructure — is allowed to charge you eight times what it would cost you to host the same software yourself. They decide whether the small ML company hosting your fine-tuned model gets acquired by someone with very different priorities than the founders had.

Most of the time, none of this matters. The roadmap is fine. The bill is fine. The egress is fine. You do not think about it.

Then one day a regulation, an acquisition, or a pricing change makes you think about it, and you realise that the abstraction you bought was not an abstraction. It was a contract. With one customer.

I bought my own substrate for the same reason I write on my own blog instead of posting on a single platform. Not because the platform is bad — but because the platform is not mine.

What I actually run

The shape of the thing, layer by layer, told plainly.

The hardware. Six dedicated servers at a Hetzner data centre. Bare metal — not VMs, not “instances.” Each node has multiple cores, real NVMe, real network bandwidth. The bill is a fraction of what equivalent compute on AWS would cost, and most months I am using a fraction of that.

The OS. Talos Linux on every node. Immutable, API-driven, no SSH. You do not log into Talos boxes; you reconcile them. After running mutable Linux for fifteen years, the experience of “I cannot break this server even if I try” is psychologically corrective. I wish I had switched sooner.
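To make “reconcile, don’t log in” concrete: configuration changes go out as machine-config patches through talosctl. A minimal sketch, with the disk, hostname, and sysctl purely illustrative rather than my real values:

```yaml
# Minimal Talos machine-config patch. Values are illustrative, not my real config.
# Applied with: talosctl patch machineconfig --nodes <node-ip> --patch @patch.yaml
machine:
  install:
    disk: /dev/nvme0n1          # hypothetical install target
  network:
    hostname: metal-01          # hypothetical hostname
  sysctls:
    vm.max_map_count: "262144"  # example tuning knob, not a recommendation
```

The node converges on the new config or rejects it; there is no shell session for it to drift away from.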

The Kubernetes substrate. A multi-tenant Kubernetes distribution that gives me per-tenant control planes, per-tenant ingress, built-in storage, built-in observability, and a real package model. I treat tenants as isolation boundaries — one for each environment or product surface I want to keep separate. This is the layer that took me the longest to understand and the one that pays off most consistently.
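I am deliberately not turning this into a review of the distribution, so the object below is a hypothetical sketch of the shape rather than its real API: one declared tenant that owns its control plane, its ingress, and its resource ceiling.

```yaml
# Hypothetical Tenant object: a sketch of the shape, not a real CRD.
apiVersion: platform.example.com/v1alpha1
kind: Tenant
metadata:
  name: prod-apps
spec:
  controlPlane:
    replicas: 2                      # nested API servers run as pods on the parent
  ingress:
    host: "*.prod-apps.example.com"  # per-tenant ingress domain
  quotas:
    cpu: "16"
    memory: 64Gi
```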

Serverless. Knative on top of the substrate, with resource patches tuned aggressively for the cluster size. Cold starts matter on a small cluster, so I rewrote four Python services in Go specifically to keep them tolerable for my AI workloads. There is a separate post about that rewrite.
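For flavour, here is the kind of patch I mean, as a Knative Service tuned for a small cluster; the service name, image, and scale bounds are examples, not my production values.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: embedder                                  # hypothetical service name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "1"    # keep one replica warm so the hot path never cold-starts
        autoscaling.knative.dev/max-scale: "4"    # cap fan-out so six nodes are not overwhelmed
    spec:
      containerConcurrency: 20
      containers:
        - image: ghcr.io/example/embedder:latest  # placeholder image
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi
```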

Durable workflows. Temporal, exposed inside one tenant and bridged into another via a selector-based service so the UI works through the cluster’s ingress. Replaced what would otherwise have been a sprawl of SQS-and-Lambda glue with a single workflow engine that survives restarts and treats retries as a first-class concept.
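Because the child clusters’ pods physically run on the parent, the bridge is nothing more exotic than a Service in the parent namespace selecting the Temporal UI pods, fronted by an Ingress. The labels, namespace, and hostname below are placeholders, not my actual layout.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: temporal-ui
  namespace: tenant-tools                  # hypothetical tenant namespace
spec:
  selector:
    app.kubernetes.io/name: temporal-ui    # assumed label; match the real UI pods
  ports:
    - port: 8080
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: temporal-ui
  namespace: tenant-tools
spec:
  rules:
    - host: temporal.example.com           # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: temporal-ui
                port:
                  number: 8080
```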

Secrets. OpenBAO — a Vault drop-in — with AppRole auth scoped per service. Every service that needs a credential gets one through OpenBAO. Nothing lives in environment variables in a Helm values file. Rotating a credential is a single operation, not an archaeology dig.
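The wiring from OpenBAO into workloads can be done several ways; the sketch below assumes the External Secrets Operator, which is one common choice and an assumption on my part rather than a description of my exact setup. Each service gets its own AppRole, and the materialised Secret is the only thing the workload ever sees.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: openbao
  namespace: blog                                # hypothetical service namespace
spec:
  provider:
    vault:                                       # OpenBAO speaks the Vault API
      server: "http://openbao.openbao.svc:8200"  # placeholder address
      path: "kv"
      version: "v2"
      auth:
        appRole:
          path: "approle"
          roleId: "REPLACE-WITH-ROLE-ID"         # AppRole scoped to this one service
          secretRef:
            name: openbao-approle
            key: secret-id
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: blog-database
  namespace: blog
spec:
  secretStoreRef:
    name: openbao
    kind: SecretStore
  target:
    name: blog-database                          # the Kubernetes Secret that gets created
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: blog/database                       # hypothetical path in OpenBAO
        property: url
```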

LLM gateway. LiteLLM in front of a small fleet of model providers, with per-token pricing tiers configured explicitly so my AI agents have a budget. When an agent goes off the rails — and they do — the budget is the seatbelt.
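A cut-down version of the proxy config, to show what “explicit pricing plus a budget” looks like; the model names, per-token prices, and dollar figures are illustrative, not my real tiers.

```yaml
# LiteLLM proxy config sketch. Prices and budgets are made-up examples.
model_list:
  - model_name: fast-tier                    # cheap default for agent chatter
    litellm_params:
      model: openai/gpt-4o-mini              # example provider/model
      api_key: os.environ/OPENAI_API_KEY
      input_cost_per_token: 0.00000015       # explicit pricing so spend tracking stays honest
      output_cost_per_token: 0.0000006
  - model_name: smart-tier                   # reserved for steps that earn it
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  max_budget: 25                             # hard ceiling in USD: the seatbelt
  budget_duration: 30d
```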

Backups. Velero, snapshotting the platform on a schedule. I restored from one of these snapshots in anger exactly once, which was enough to convert me from “backups are a chore” to “backups are oxygen.”
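“Snapshotting on a schedule” amounts to one small object; something like this, with the cron expression and retention as examples rather than my real values.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly
  namespace: velero
spec:
  schedule: "0 3 * * *"        # every night at 03:00
  template:
    includedNamespaces:
      - "*"                    # back up everything; exclusions can come later
    snapshotVolumes: true
    ttl: 720h                  # keep thirty days of restore points
```

The restore that converted me was a single velero restore create --from-backup <name>; the trick is to have rehearsed it before the morning it matters.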

Edge. A Nostr relay running on Cloudflare Workers, handling globally distributed traffic for a fraction of what a VPS would cost. The edge does what the edge is good at; the metal does what the metal is good at. The discipline is figuring out which is which.

That is the substrate. Everything else I build — the personal blog, the family Bitcoin wallet, the consulting tools, the AI agents — runs on this stack.

The mistakes

Three I will admit to in public.

The first one was treating Kubernetes like AWS. I tried to recreate cloud primitives one for one. NAT Gateways. Per-tenant load balancers. Internet egress proxies. Each of these has a Kubernetes-native equivalent that is better than the cloud version, and I spent two months reinventing things that were already in the platform’s package catalogue. If the platform ships an opinion, take the opinion before you write a flag to override it.

The second was undersizing the control plane. When you run nested control planes — child clusters whose API servers are pods on the parent — you can absolutely starve them by being stingy with CPU. A child cluster whose API server is being CPU-throttled looks indistinguishable from a child cluster that is just slow. I lost a weekend tracing a problem that turned out to be a requests.cpu: 100m set by a copy-paste two months earlier.
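The fix was as boring as the bug: give the nested API server a real CPU request and room to burst. The numbers below are illustrative for my node sizes, not a general recommendation.

```yaml
# Resource block for a nested kube-apiserver container (illustrative values).
# requests.cpu: 100m was the copy-paste mistake; an API server under load wants a full core.
resources:
  requests:
    cpu: "1"
    memory: 1Gi
  limits:
    cpu: "2"        # headroom for list/watch storms without starving the neighbours
    memory: 2Gi
```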

The third was ignoring the runbook. I wrote a Day-2 operations runbook early. Excellent. Then I did not update it for six months. Excellent operational hygiene right up until the morning I had to follow it, and discovered the version I was following described a cluster topology I had since changed. The runbook is not a write-once artefact. It is a contract you renew every time the cluster moves.

What it actually costs

The bill, all in: a couple of hundred dollars a month for the bare metal. A small fraction of that for the edge bits and DNS. The Cloudflare side is free at my volume.

The same workload on AWS — managed control plane, NAT Gateways, egress, observability, secrets — would cost me five to ten times more. I have done the math. I keep doing the math. The math keeps coming out the same.

But the real cost is not the bill. The real cost is the time I spend keeping it running. Some weeks that is an hour. Some weeks it is the entire weekend, and I think dark thoughts about EKS at two in the morning.

Average it out and the time cost is significant — easily worth more than the bill differential at any honest hourly rate. So if you only care about cash plus time, this is a bad trade.

When you absolutely should not do this

You should not do this if:

  • You are building a product and your customers are not yet sure they want it. Self-hosting will eat the runway you need to find product-market fit.
  • You do not enjoy infrastructure. This is a hobby that pays in skill, not in money. If you do not like the hobby, you will resent every minute of it.
  • You do not have a fallback. The cluster will go down. You need to be okay with that, or you need someone else who is.
  • You think it will be cheaper without budgeting for time. It is not cheaper if you bill yourself honestly.

In other words: do this if and only if the learning is worth the cost on its own, separate from the workload it hosts.

Why I keep running it anyway

A platform you control is a platform that will host whatever you decide to build next.

I have a Bitcoin wallet I am building for my family. I have a content engine that publishes to a personal blog. I have a Frappe ERP I deploy for consulting clients. I have AI agents that need to run cheaply and reliably. I have a CLI that searches forty-seven thousand vectors of my own writing in under a second.

None of these existed when I built the cluster. The cluster was built so that they could exist when I needed them — on terms I controlled, with no third party between me and the workload.

That is the deepest difference between renting your substrate and building one. The rented substrate is a contract. The substrate you built is an option — a low-cost option, with a long expiry, on whatever you decide to ship next.

I will keep paying my Hetzner bill and my Talos curiosity tax for as long as I am building things. The day I stop is the day I migrate to managed Kubernetes and admit it.

That day is not today.

