Bootstrapping a Home K8s Cluster for a Multiplayer-Game Side Project
The multiplayer chess game needed somewhere to live, so the home k3s box became the cluster. Four weeks later it runs 14 Knative functions under FluxCD, serves Grafana at dashboard.wintersalmon.com, and answers the first request after a lull in under a second. The trick was layering: get reachable first, monitor second, defer everything else.
- NGINX ingress + cert-manager DNS-01 was the reachable baseline; Traefik templates had to be re-keyed before anything deployed.
- FluxCD reconciled 14 Knative functions only after the workflow's dir_names_max_depth went from 2 to 4.
- kube-prometheus-stack HelmRelease landed a week after the apps; k8s-sidecar:2.5.0 regressed, pinned to 2.3.0 in PR #85.
- min-scale: "1" on every Knative service traded 700m CPU for sub-second response (PR #82).
- All of it sits in infra/home-cluster/, declaratively, so the rebuild story is one flux reconcile away.
Home cluster keeps the rebuild story honest
The migration rule, written before any of this: provision a machine, install Kubernetes, clone the repo, apply the infra, services come online. A managed control plane breaks that — cloud console clicks aren’t in the repo. One AMD64 box on the home network is enough for the chess game and three friends. The cost is owning ingress-nginx, local-path storage, the letsencrypt-dns issuer, and the self-hosted runner. The surface area fits in one afternoon’s reading.
Traefik templates collided with an NGINX cluster
The first PR shipped manifests templated for a generic Traefik k3s. The actual cluster had been swapped to NGINX. Every ingress needed:
- kubernetes.io/ingress.class: traefik -> spec.ingressClassName: nginx
- Drop the traefik.ingress.kubernetes.io/router.middlewares annotation
- cert-manager.io/cluster-issuer: letsencrypt-prod -> letsencrypt-dns (HTTP-01 can’t reach the home box)
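A quick way to confirm that no manifest still carries the old keys is to grep for them. This is a minimal sketch, not something from the original PR: the function name find_traefik_leftovers is hypothetical, and the patterns assume the annotations are written in the plain `key: value` form shown in the list above.

```shell
# Hypothetical helper: list manifest lines that still use the Traefik-era keys.
# Prints file:line:match for each leftover and returns non-zero when none are found.
find_traefik_leftovers() {
  grep -rnE \
    'kubernetes\.io/ingress\.class: *"?traefik|traefik\.ingress\.kubernetes\.io/|cluster-issuer: *"?letsencrypt-prod' \
    "$1"
}
```

Running it against the manifest directory (for example `find_traefik_leftovers infra/home-cluster/`) should print nothing once every ingress has been re-keyed.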
The second landmine was the shared myapps namespace. Both auth-service and multiplayer-chess-service shipped a mongodb StatefulSet with a generic name and a generic Service — guaranteed collision. Renamed to auth-mongodb and chess-mongodb. namespace.yaml came out of the deploy script too — kubectl apply is idempotent but the manifest had labels that would have overwritten hand-added ones.
DNS for auth.wintersalmon.com and chess-api.wintersalmon.com pointed at the MetalLB VIP 192.168.10.240. After that, kubectl rollout status and curl -k https://auth.wintersalmon.com/health.
Nested function paths broke the CI detector
Two weeks in, services split into Knative functions with a deep directory layout:
functions/auth/auth/login
functions/chess/lobby/create-game
functions/chess/game/submit-action
.github/workflows/functions.yml used dir_names_max_depth: 2, which detected functions/auth and never looked deeper. Bumped to depth 4 and added an awk-based path_to_name() helper to flatten paths into image names:
functions/auth/auth/login -> func-auth-service-login
functions/chess/lobby/create-game -> func-chess-service-lobby-create-game
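The actual helper is not reproduced here, so the following is a sketch of what an awk-based path_to_name() could look like. The naming rule is inferred from the two examples above and is an assumption: prefix with func-, append -service to the first directory under functions/, and skip a second segment that merely repeats the service name.

```shell
# Sketch only: the rule below is inferred from the example mappings,
# not copied from the real workflow.
path_to_name() {
  printf '%s\n' "$1" | awk -F'/' '{
    # Field 1 is "functions", field 2 is the service (auth, chess).
    name = "func-" $2 "-service"
    for (i = 3; i <= NF; i++) {
      # Skip a segment that only repeats the service name (auth/auth/login).
      if (i == 3 && $i == $2) continue
      # Ignore empty fields left by a trailing slash.
      if ($i != "") name = name "-" $i
    }
    print name
  }'
}

path_to_name functions/auth/auth/login         # func-auth-service-login
path_to_name functions/chess/lobby/create-game # func-chess-service-lobby-create-game
```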
The deeper trap: each Dockerfile does COPY functions/_shared/, which only resolves from the repo root. CI ran docker build from the function directory and silently missed _shared. Fix in 37cfd7f — explicit file: parameter, context set to repo root.
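The same fix can be reproduced locally with plain docker build. This is a hedged sketch rather than the CI step itself: build_function_image and the DRY_RUN switch are hypothetical names, and the call must be made from the repo root so that the trailing `.` context contains functions/_shared/.

```shell
# Hypothetical local equivalent of the CI fix: Dockerfile chosen with --file,
# build context fixed to the repo root (the current directory).
build_function_image() {
  fn_dir="$1"   # e.g. functions/auth/auth/login
  image="$2"    # e.g. func-auth-service-login
  set -- docker build --file "$fn_dir/Dockerfile" --tag "$image" .
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

With `DRY_RUN=1 build_function_image functions/auth/auth/login func-auth-service-login` the function only prints the command, which is handy for checking paths before a real build.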
_shared is still a directory, not a real workspace package, so bun typecheck and bun test can’t resolve @cloudnest/functions-shared. Marked both continue-on-error: true (3f3c3a8, ce16e4a); the Docker build is the real validation. Future-me problem.
Monitoring waited a week, then got bitten by a single image tag
Pattern: get to production, then add observability before adding features. Once the 14 Knative functions were live, kube-prometheus-stack (chart 82.x) went in via a FluxCD HelmRelease, with Grafana at dashboard.wintersalmon.com behind NGINX basic auth, 30Gi for Prometheus and 10Gi for Grafana on local-path, and the admin password SOPS-encrypted. Built-in dashboards covered cluster health for free. Blackbox probes and Telegram alert routing: deferred.
Then the chart pulled k8s-sidecar:2.5.0 for Grafana’s dynamic ConfigMap-driven dashboards. The sidecars went into CrashLoopBackOff:
couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s":
dial tcp [::1]:8080: connect: connection refused
The 2.5.0 image switched to Go-client internals for in-cluster config detection and silently fell back to localhost:8080 instead of the service account token. Grafana itself was healthy. RBAC (get/watch/list on configmaps and secrets) was correct. PR #85 pinned tag: "2.3.0". Sidecars healthy on next sync.
A follow-up gotcha: the init-chown-data init container failed with Permission denied on /var/lib/grafana/csv|pdf|png. Grafana creates those with mode 2740; the init container drops all caps except CHOWN, and without CAP_DAC_READ_SEARCH it can’t traverse a directory with no other-permissions. chmod 755 on the three directories from inside the running pod, then the next init retried clean.
min-scale=0 made first moves take 10 seconds
Knative defaults to min-scale: 0 — pods scale to zero after five idle minutes. First request after a lull paid for container start plus MongoDB connection setup. Measured: 10+ seconds. For a chess move, unusable.
PR #82 set autoscaling.knative.dev/min-scale: "1" on all 14 services. The trade-off is textbook:
| Item | Value |
|---|---|
| Always-on pods | 14 |
| Added CPU requests | 700m (14 x 50m) |
| Added memory requests | 896Mi (14 x 64Mi) |
| Steady state across 15 services | ~750m CPU, ~928Mi RAM |
Right call for an interactive game on a node with headroom. Wrong call for a batch job.
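The added-requests rows in the table are simple multiplication, sketched here so the numbers can be re-derived if the service count or per-pod requests change. The per-pod values (50m CPU, 64Mi memory) come from the table; the variable names are illustrative.

```shell
# Per-pod requests as listed in the table above.
services=14
cpu_request_m=50
mem_request_mi=64

total_cpu_m=$((services * cpu_request_m))
total_mem_mi=$((services * mem_request_mi))

echo "Added CPU requests: ${total_cpu_m}m"       # 700m
echo "Added memory requests: ${total_mem_mi}Mi"  # 896Mi
```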
FluxCD reconciliation jammed once on an immutable serving.knative.dev/creator annotation when updating func-chess-service-lobby-games in place. Deleting the Knative service and letting FluxCD recreate it cleared the annotation.
What this enabled
Four weeks: GitOps, ingress-nginx, cert-manager DNS-01, MongoDB in myapps, 14 Knative functions, Prometheus + Grafana, sub-second first responses. Nothing exotic. All of it is in infra/home-cluster/. The next chapters (staging environment, the Bun-to-Go auth API rewrite, the LLM gateway, this blog) sit on top of this base without changing it.
AI workflow note
Claude was on the keyboard for most of this period. The pattern that worked: ask for a phased plan first, then execute one phase at a time with a task-log entry per phase. The monitoring rollout split into six phases before any YAML got written, which made deferring blackbox probes a one-line decision. The architect agent was useful for the letsencrypt-prod vs letsencrypt-dns and min-scale: 0 vs 1 trade-offs — naming both options and the cost of each, not just picking one. When Claude proposed wholesale rewrites (“let me refactor _shared into a workspace package”), those got deferred to follow-up tasks rather than growing the current diff.
