Serverless was the wrong shape for the games: 17 Knative functions to 4 Go servers
Serverless was the wrong shape for steady-state game traffic, so the cluster collapsed 17 Knative Bun functions into 4 Go HTTP servers. The decision wasn’t aesthetic: the games were already always-on from the moment the earlier post, Bootstrapping a Home K8s Cluster for a Multiplayer-Game Side Project, set autoscaling.knative.dev/min-scale: "1" to fix cold starts. Knative’s only real benefit died that day; the migration just admitted it ten months later.
- Four servers: auth-api-go (7 routes), game-api-go (11 routes), ws-relay-go (WS + Mongo change streams), doc-search-api-go (Qdrant + Neo4j + LLM Gateway RAG).
- Always-on overhead removed: 17 × (50m CPU + 64Mi RAM) ≈ 750m CPU + 928Mi RAM permanently committed before any traffic. Down to 4 pods.
- Five-phase plan, ~20 PRs: foundation, auth, game, WS relay, doc-search, cutover. Each phase staging-deployed and smoke-tested before the next started.
- Deferred deletion: PR #359 scaled Bun to zero, manifests stayed. PR #23 (the actual rm) was scheduled for 2026-04-20, two weeks later.
- Three parallel cleanups: monorepo prune (March 9), function naming (March 19), Docker layer cache (April 12) ran on adjacent branches and made the Go cutover cheaper.
The games forced this, not Go fandom
min-scale: 1 was the fix from the first post that made the games usable. Once it shipped, the 17 Bun functions stopped scaling to zero. Scale-to-zero was no longer a feature; cold starts were a thing being suppressed at the cost of 750m CPU. The WS relay was already a Deployment because Knative scale-to-zero is incompatible with persistent WebSocket connections. The lesson: when every Knative service in your namespace has min-scale: 1, you’re paying serverless overhead for an HTTP server you forgot to write.
Go’s value here was narrow and concrete: a ~15MB image (Alpine plus a static binary) versus a ~150MB Bun image, native goroutines for the relay’s connection manager, and no build step on cluster pull. The dependency list stayed boring: mongo-driver/v2, golang-jwt/v5, bcrypt, github.com/coder/websocket (the maintained fork of the deprecated nhooyr.io/websocket), slog, testify.
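To make the "goroutines for the connection manager" point concrete, here is a minimal sketch of the fan-out shape such a relay could take, using only the standard library. It is not the code from ws-relay-go: the `Hub` type, its method names, and the string connection IDs are all assumptions made for illustration, and the WebSocket and change-stream plumbing is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub is a hypothetical connection manager: it keeps one outbound queue
// per connected client and fans each event out to all of them.
type Hub struct {
	mu   sync.RWMutex
	subs map[string]chan []byte // keyed by an assumed connection ID
}

// NewHub returns an empty hub.
func NewHub() *Hub {
	return &Hub{subs: make(map[string]chan []byte)}
}

// Subscribe registers a connection and returns its outbound queue.
// buf bounds how many undelivered events a slow client may hold.
func (h *Hub) Subscribe(id string, buf int) <-chan []byte {
	ch := make(chan []byte, buf)
	h.mu.Lock()
	defer h.mu.Unlock()
	if old, ok := h.subs[id]; ok {
		close(old) // a reconnect replaces the previous queue
	}
	h.subs[id] = ch
	return ch
}

// Unsubscribe removes a connection and closes its queue so the
// per-connection goroutine's range loop ends.
func (h *Hub) Unsubscribe(id string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if ch, ok := h.subs[id]; ok {
		delete(h.subs, id)
		close(ch)
	}
}

// Broadcast offers msg to every connection and returns how many accepted it.
// A full queue means a slow client; the event is dropped for that client
// instead of blocking the whole relay.
func (h *Hub) Broadcast(msg []byte) int {
	h.mu.RLock()
	defer h.mu.RUnlock()
	delivered := 0
	for _, ch := range h.subs {
		select {
		case ch <- msg:
			delivered++
		default:
		}
	}
	return delivered
}

func main() {
	hub := NewHub()
	var wg sync.WaitGroup
	for _, id := range []string{"alice", "bob"} {
		queue := hub.Subscribe(id, 8)
		wg.Add(1)
		// One goroutine per connection; a real relay would write to the socket here.
		go func(id string, queue <-chan []byte) {
			defer wg.Done()
			for msg := range queue {
				fmt.Printf("%s <- %s\n", id, msg)
			}
		}(id, queue)
	}
	hub.Broadcast([]byte(`{"type":"move"}`))
	hub.Unsubscribe("alice")
	hub.Unsubscribe("bob")
	wg.Wait()
}
```

The design choice worth noting is the non-blocking send in `Broadcast`: one stalled client cannot hold up delivery to everyone else, at the price of that client missing events.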
Same routes, different binary
| Server | Endpoints | Replaces |
|---|---|---|
| auth-api-go | 7 (register, login, logout, refresh, me, guest, google) | 7 Knative auth functions |
| game-api-go | 11 (lobby, room, share-links, player stats) | 10 Knative game functions |
| ws-relay-go | WS upgrade + Mongo change streams | game-ws-relay-service (Bun Deployment) |
| doc-search-api-go | RAG over Qdrant + Neo4j + LLM Gateway | doc-search-api (Bun Deployment) |
Same paths under func-api.wintersalmon.com. Same MongoDB collections. Same JWT cookies. The ingress manifest in PR #354 changed exactly one field per route: backend.service.name from kourier-proxy to the matching *-go ClusterIP.
The five-phase plan
Phase 0 built the foundation — single Go module at go/ (github.com/wintersalmon/cloudnest-go), shared packages (internal/{mongo,jwt,middleware,crypto,response}), Makefile, golangci-lint v2, GHCR pipeline. Phases 1-4 each delivered one server with a staging deploy and smoke test before the next started. Phase 5 was the production cutover and Bun scale-down.
Order wasn’t arbitrary. Auth had to ship first because every other server reused the JWT middleware. The relay couldn’t ship until at least one game server was emitting events for it to watch.
Three afternoons that tried to derail it
PR #343 was a one-line Dockerfile fix and an hour of confusion. golang:1.24-alpine sets GOTOOLCHAIN=local, so a go.mod declaring go 1.26.1 doesn’t auto-download — the build silently used 1.24 and crashed on a 1.26-only stdlib call.
PR #346 replaced the mongo.NewClient() call that mongo-driver v2 removed, and it exposed a quieter trap: client.Database("") does NOT fall back to the URI default. It returns a handle to a database whose name is literally the empty string. Every server now extracts the DB name from the URI.
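One way to extract the database name from the URI is sketched below with plain string handling, so it runs without the driver installed. The helper name `dbNameFromURI` and the sample URIs are hypothetical, and the real servers may well do this differently (the driver ships its own connection-string parser, for instance). The sketch relies on the MongoDB rule that reserved characters in credentials must be percent-encoded, so the first `/` after the scheme always ends the host list.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// dbNameFromURI returns the database segment of a MongoDB connection string,
// e.g. "games" for mongodb://host:27017/games?retryWrites=true.
// A URI without a database yields an error rather than an empty name,
// so the caller can never pass "" on to client.Database.
func dbNameFromURI(uri string) (string, error) {
	_, rest, ok := strings.Cut(uri, "://")
	if !ok {
		return "", errors.New("mongo uri: missing scheme")
	}
	rest, _, _ = strings.Cut(rest, "?")  // drop connection options
	_, name, _ := strings.Cut(rest, "/") // the host list ends at the first slash
	if name == "" {
		return "", errors.New("mongo uri: no database name in path")
	}
	return name, nil
}

func main() {
	// Hypothetical URIs, for illustration only.
	for _, uri := range []string{
		"mongodb://mongo.default.svc:27017/games?retryWrites=true",
		"mongodb://mongo.default.svc:27017/?retryWrites=true",
	} {
		name, err := dbNameFromURI(uri)
		fmt.Printf("name=%q err=%v\n", name, err)
	}
}
```

Failing fast at startup is the point: a missing name becomes a visible configuration error instead of silent writes to an unnamed database.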
PR #349 fixed a cookie name flip. Bun set access-token, Go set accessToken. Clients pinned to the old name broke during staging cutover. The lesson: staging cutover needs HTTPS, not curl localhost:18001 — cookies behave differently behind the real ingress.
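A cheap guard against this kind of drift is to define the cookie name once and build every cookie through one helper. The sketch below assumes the fix standardized on the Bun name, access-token; the helper name and every attribute other than the cookie name are assumptions, not values taken from the real servers.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// accessCookieName matches the name the Bun functions set, so existing
// clients keep working after the cutover.
const accessCookieName = "access-token"

// newAccessCookie builds the access-token cookie in one place.
// Path, lifetime, and security attributes here are illustrative assumptions.
func newAccessCookie(token string, ttl time.Duration) *http.Cookie {
	return &http.Cookie{
		Name:     accessCookieName,
		Value:    token,
		Path:     "/",
		MaxAge:   int(ttl.Seconds()),
		HttpOnly: true,
		Secure:   true, // assumed; Secure cookies are only sent back over HTTPS
		SameSite: http.SameSiteLaxMode,
	}
}

func main() {
	c := newAccessCookie("example-token", 15*time.Minute)
	fmt.Println(c.String())

	// Round trip: attach the cookie to a request and read it back by name.
	req, err := http.NewRequest(http.MethodGet, "https://example.invalid/me", nil)
	if err != nil {
		panic(err)
	}
	req.AddCookie(c)
	got, err := req.Cookie(accessCookieName)
	fmt.Println(got, err)
}
```

If the cookie is marked Secure, as assumed here, a client will not return it over plain HTTP, which is one concrete way a localhost smoke test can diverge from traffic behind the real ingress.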
Deferred deletion was the safety net
Production cut over on 2026-04-06. PR #354 flipped the ingress; PR #359 set min-scale: 0 and max-scale: 0 on all 17 Knative services plus replicas: 0 on the two Bun Deployments. The manifests stayed in the repo for two weeks. Rollback was two git reverts in a known order — restore Bun pods first (revert #359), then ingress (revert #354) — both single-commit operations FluxCD reconciles within a minute. PR #23, the actual deletion, was scheduled for 2026-04-20.
A delete-immediately cutover tells you nothing new. Two weeks of clean Go traffic catches every off-hours request, every weekly cron, every reconnect storm.
Three cleanups landed on the same arc
The migration didn’t run in a vacuum. The monorepo cleanup (March 9) deleted 12 deprecated apps (auth-playground, three standalone multiplayer clients, the v1 chess functions), leaving 9 apps and 9 packages and fewer Bun functions to port. The function naming cleanup (March 19) replaced the -service- injection in the CI path_to_name() helper with a mechanical rule: func-{domain}-{category}-{function}. By the time PR #23 deleted the Bun functions, their names were finally consistent. The Docker cache fix (April 12) removed no-cache: true from 18 workflows and added cache-from/cache-to: type=gha with a per-workflow scope, which made second builds 30-50% faster across every Dockerfile, including the four new Go ones.
Pick the right shape for the traffic you actually have, not the traffic the architecture was designed for.
AI workflow note
Claude wrote the five-phase split as a planning document (.claude/plans/foamy-mixing-zephyr.md) before any Go code shipped — same pattern that worked for the cluster bootstrap. The code-reviewer agent (Go subagent) ran on every phase PR; most small fixes — the errcheck cursor close, the revive auth.AuthResponse → auth.Response stutter rename, the gocritic http.NoBody swap — surfaced from those reviews, not from running golangci-lint myself. The hard discipline written into the task-log and made non-negotiable: no Bun manifest deletions until production has run on Go for at least one full week. The handoff section in the task-log was the verification gate before each cutover — every smoke-test command and every secret-key gotcha lived there, and re-reading it before opening PR #354 caught two stale references that would have shipped otherwise.
