Two Control Planes, Clear Ownership: FluxCD for State, ArgoCD for Eyes — Wintersalmon

The games started shipping multiplayer, which meant every push to main was now affecting friends mid-match — so the cluster from the previous post needed a staging lane and a visual sync gate. Three weeks later FluxCD still wrote everything and ArgoCD became the dashboard, because ArgoCD cannot decrypt the SOPS secrets the functions namespace depends on. The trade was clean: one writer, one viewer, no fight over ownership.

Staging: same cluster, every name suffixed -stag, min-scale: 0, separate mongodb-stag (5Gi PVC, replica set rs0-stag).
WireGuard to MongoDB needed three independent fixes — route, masquerade, NetworkPolicy — plus directConnection=true in the URI.
ArgoCD-as-writer died on SOPS; FluxCD stayed sole writer, ArgoCD became read-only plus manual-sync UI at argocd.wintersalmon.com.
Loki + a zero-dep JSON logger at functions/_shared/src/logger/ made cross-service log search a single query.
A month later: a Secret committed but unlisted in kustomization.yaml silently never reconciled.

Staging shares the cluster, not the names

A second cluster was the textbook answer; I have one box. Rule instead: same namespaces, every resource suffixed -stag, no shared state except the control plane. mongodb-stag runs as its own StatefulSet with separate mongodb-stag-credentials. All Knative functions are duplicated under apps/functions-stag/ with min-scale: 0 (cold starts are fine for staging). Ingress lives at stag-api.wintersalmon.com. CI on the new staging branch tags images -stag; FluxCD’s existing ImageUpdateAutomation already scanned ./infra/home-cluster/apps, so adding stag ImagePolicy and ImageRepository resources just worked.

The cost is that one rogue kubectl apply without a suffix can stomp prod. The benefit is no duplicated namespace-level resources (pull-secrets, network policies, kourier-proxy) on a single-node cluster.
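One cheap guard against the unsuffixed-apply risk is a name check that runs before anything reaches the cluster. The sketch below assumes the resource names have already been collected from the staging manifests; missingStagMarker is a hypothetical helper, not something that exists in the repo. It matches -stag as a name segment rather than a strict suffix, because names like mongodb-stag-credentials carry the marker mid-name.

```typescript
// Hypothetical guard: report staging resource names that carry no "-stag" segment.
// Matches "-stag" at the end of the name or followed by another "-segment".
const STAG_MARKER = /-stag(-|$)/;

function missingStagMarker(names: string[]): string[] {
  return names.filter((name) => !STAG_MARKER.test(name));
}

// Example: only the unmarked name is reported.
const offenders = missingStagMarker([
  "mongodb-stag",
  "mongodb-stag-credentials",
  "mongodb",
]);
console.info(offenders); // [ "mongodb" ]
```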

WireGuard to MongoDB needed three fixes, not one

Compass on my Mac had to read both Mongos over the home VPN. It failed three different ways before working — recorded in PR #198.

Route: WireGuard AllowedIPs only had 10.100.0.0/24. The k8s service CIDR 10.43.0.0/16 was missing on both peers, so packets to a ClusterIP exited via the default route and died.
Masquerade: iptables -t nat -I POSTROUTING 1 -s 10.100.0.0/24 -d 10.43.0.0/16 -j MASQUERADE. Position 1 is load-bearing — at position 6 it sits behind KUBE-POSTROUTING and FLANNEL-POSTRTG, both of which return early.
NetworkPolicy: pod firewall chain KUBE-POD-FW-... ended in REJECT with icmp-port-unreachable for any source not allowed. New policy allow-ingress-from-wireguard with ipBlock: 10.100.0.0/24 opened port 27017 on mongodb and mongodb-stag.
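
The route failure is easy to reproduce on paper: a destination only enters the tunnel if it falls inside one of the peer's AllowedIPs ranges. A minimal sketch of that membership test, IPv4 only; 10.43.0.12 is a made-up ClusterIP used for illustration, and the helper names are hypothetical.

```typescript
// Convert dotted-quad IPv4 to an unsigned integer (arithmetic, no bitwise sign issues).
function ipv4ToInt(ip: string): number {
  const octets = ip.split(".").map(Number);
  if (octets.length !== 4 || octets.some((o) => !Number.isInteger(o) || o < 0 || o > 255)) {
    throw new Error(`not an IPv4 address: ${ip}`);
  }
  return octets.reduce((acc, o) => acc * 256 + o, 0);
}

// True when `ip` falls inside `cidr` (for example "10.43.0.0/16").
function inCidr(ip: string, cidr: string): boolean {
  const slash = cidr.indexOf("/");
  if (slash === -1) throw new Error(`not a CIDR range: ${cidr}`);
  const bits = Number(cidr.slice(slash + 1));
  const blockSize = 2 ** (32 - bits);
  return (
    Math.floor(ipv4ToInt(ip) / blockSize) ===
    Math.floor(ipv4ToInt(cidr.slice(0, slash)) / blockSize)
  );
}

// A destination is sent through the tunnel only if some AllowedIPs entry covers it.
function routedThroughTunnel(dest: string, allowedIps: string[]): boolean {
  return allowedIps.some((cidr) => inCidr(dest, cidr));
}

const clusterIp = "10.43.0.12"; // hypothetical ClusterIP
console.info(routedThroughTunnel(clusterIp, ["10.100.0.0/24"])); // false
console.info(routedThroughTunnel(clusterIp, ["10.100.0.0/24", "10.43.0.0/16"])); // true
```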

Then Compass connected, ran rs.status(), got member hostname localhost:27017, dialed it, and reported ECONNREFUSED 127.0.0.1:27017. Fix: directConnection=true in the URI to skip replica-set discovery. None of these are guessable from the symptom; the runbook at docs/guides/infrastructure/operational-runbook.md exists for that reason.
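
For scripted clients, the same fix can be applied when the URI is built. A small sketch, assuming the URI already has a path segment before any options; withDirectConnection is a hypothetical helper and the host below is a placeholder.

```typescript
// Add (or overwrite) directConnection=true while keeping any existing options.
function withDirectConnection(uri: string): string {
  const q = uri.indexOf("?");
  const base = q === -1 ? uri : uri.slice(0, q);
  const params = new URLSearchParams(q === -1 ? "" : uri.slice(q + 1));
  params.set("directConnection", "true");
  return `${base}?${params.toString()}`;
}

console.info(withDirectConnection("mongodb://10.43.0.12:27017/games"));
// mongodb://10.43.0.12:27017/games?directConnection=true
```

With directConnection=true the client talks only to the address in the URI instead of following the member hostnames the replica set advertises, which is why the localhost:27017 entry stops mattering.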

ArgoCD lost the writer fight to SOPS

Original plan: ArgoCD-as-writer for production, FluxCD pruned out, image promotion via GitHub Action. Phase 5 of the cutover hit the wall fast: ArgoCD has no built-in SOPS decryption, so it cannot read the SOPS-encrypted secrets in the repo. The functions namespace carries mongodb-stag-credentials, jwt-secret, and ghcr-pull-secret. ArgoCD would either skip them or apply ciphertext. Neither is useful.

The revision (recorded under “Architecture (Revised)” in the task-log) is short:

| Layer | Manager |
| --- | --- |
| Infrastructure, secrets, shared resources | FluxCD |
| Staging apps (auto-deploy via ImageUpdateAutomation) | FluxCD |
| Production apps | FluxCD writer + ArgoCD dashboard / manual sync |
| Staging visualization | ArgoCD (read-only) |

ArgoCD installs via FluxCD HelmRelease — same pattern as cert-manager and ingress-nginx. Production image auto-update is off ($imagepolicy markers removed); promotion is a workflow_dispatch action that bumps the tag in Git, after which ArgoCD shows “OutOfSync” and a human clicks Sync. The architect agent helped name the call: cleanliness loses to “the secrets actually decrypt.”
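
The promotion step boils down to rewriting one image tag in a manifest and committing the result. The sketch below shows only that rewrite, as an illustration rather than the actual workflow_dispatch action; bumpImageTag and the image name are hypothetical, and it assumes tags are written inline as image: name:tag.

```typescript
// Escape characters that have special meaning inside a RegExp.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace the tag of every `image: <image>:<tag>` line for the given image.
// Other images in the same manifest are left untouched.
function bumpImageTag(manifest: string, image: string, newTag: string): string {
  const pattern = new RegExp(`(image:\\s*${escapeRegExp(image)}):[^\\s#]+`, "g");
  return manifest.replace(pattern, `$1:${newTag}`);
}

const before = [
  "containers:",
  "  - name: app",
  "    image: ghcr.io/example/game-submit-event:1.4.2",
  "  - name: sidecar",
  "    image: ghcr.io/example/other:0.9.0",
].join("\n");

console.info(bumpImageTag(before, "ghcr.io/example/game-submit-event", "1.5.0"));
```

Once a change like this lands in Git, the live Deployment no longer matches the repository, which is the drift ArgoCD reports as OutOfSync.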

Loki without a vendor, JSON logs without a library

Loki landed as Phase 7: Helm chart, 7-day retention, one 10Gi PVC on local-path, datasource registered in Grafana. Promtail scrapes pod stdout. The interesting bit is the writer side — a zero-dependency JSON logger at functions/_shared/src/logger/, no pino, no winston. Bun’s native console.info(JSON.stringify(...)) is fast enough and stdout is the only sink. createLogger(service) returns level-keyed methods; withRequestLogging(handler, logger) emits one info line per request with requestId, method, path, statusCode, durationMs.
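
A logger of this shape fits in a few dozen lines. What follows is a minimal sketch, not the code in functions/_shared/src/logger/: the createLogger and withRequestLogging names and the per-request fields come from the description above, while the handler types, the injectable write sink, the ts and msg fields, and the request id scheme are assumptions made to keep the example self-contained.

```typescript
type Level = "debug" | "info" | "warn" | "error";
type Fields = Record<string, unknown>;
type Logger = Record<Level, (msg: string, fields?: Fields) => void>;

// `write` is an assumption added so the sink can be swapped in tests;
// by default each entry is one JSON line on stdout via console.info.
function createLogger(
  service: string,
  write: (line: string) => void = (line) => console.info(line),
): Logger {
  const emit = (level: Level) => (msg: string, fields: Fields = {}): void => {
    write(JSON.stringify({ ts: new Date().toISOString(), level, service, msg, ...fields }));
  };
  return { debug: emit("debug"), info: emit("info"), warn: emit("warn"), error: emit("error") };
}

// Minimal structural types (assumed); fetch-style Request/Response objects satisfy them.
interface Req { method: string; url: string }
interface Res { status: number }
type Handler = (req: Req) => Res | Promise<Res>;

// Wrap a handler so every completed request produces exactly one info line.
// A handler that throws is not logged here; real code would likely add a catch.
function withRequestLogging(handler: Handler, logger: Logger): (req: Req) => Promise<Res> {
  return async (req) => {
    const requestId = Math.random().toString(36).slice(2, 10); // placeholder id scheme
    const start = Date.now();
    const res = await handler(req);
    logger.info("request", {
      requestId,
      method: req.method,
      path: new URL(req.url).pathname,
      statusCode: res.status,
      durationMs: Date.now() - start,
    });
    return res;
  };
}

// Usage: wrap a handler once, then serve it as usual.
const logger = createLogger("game-submit-event");
const handler = withRequestLogging(async () => ({ status: 200 }), logger);
void handler({ method: "POST", url: "http://localhost/events" });
```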

Promtail’s pipeline promotes level and service to indexed labels (low cardinality) and leaves requestId in the line for query-time | json extraction. The api-logs.yaml dashboard filters by service, level, and free text. “Find all 500s in the last hour for game-submit-event” is one query.
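
The split between indexed labels and in-line fields shows up directly in the query text: labels go in the stream selector, everything else is filtered after | json. A sketch of a query builder under that assumption; buildLogQL is a hypothetical helper, and the time range ("last hour") is supplied by the Grafana time picker rather than by the query.

```typescript
interface LogQuery {
  service: string;    // indexed label
  level?: string;     // indexed label
  text?: string;      // free-text line filter
  minStatus?: number; // filter on the statusCode field extracted by `| json`
}

function buildLogQL(q: LogQuery): string {
  // JSON.stringify yields a double-quoted, escaped string literal.
  const labels = [`service=${JSON.stringify(q.service)}`];
  if (q.level !== undefined) labels.push(`level=${JSON.stringify(q.level)}`);
  let expr = `{${labels.join(", ")}}`;
  if (q.text !== undefined) expr += ` |= ${JSON.stringify(q.text)}`;
  if (q.minStatus !== undefined) expr += ` | json | statusCode >= ${q.minStatus}`;
  return expr;
}

console.info(buildLogQL({ service: "game-submit-event", minStatus: 500 }));
// {service="game-submit-event"} | json | statusCode >= 500
```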

Kustomization is explicit-only — anything unlisted silently skips

Five weeks later, two notifier-api-go pods went into ImagePullBackOff and CreateContainerConfigError. Diagnosis is at docs/task-log/20260424-pod-issues/progress.md. The Secret notifier-credentials had been authored, SOPS-encrypted, and committed at infra/home-cluster/apps/functions-go/notifier-credentials.secret.yaml, but never added to the sibling kustomization.yaml. FluxCD reconciled everything listed there cleanly; the Secret simply never existed in the cluster. The bug stayed invisible because the running pod was still on a previously cached image; the crash only surfaced when ImageUpdateAutomation rolled a new tag. The same trap caught a doc-search-api-config ConfigMap that a Deployment referenced but that was never created in infra.

The remediation is workflow, not tooling: for every new resource file, the next file I open is the sibling kustomization.yaml. A linter is a future task.
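
The core of such a linter is a set difference: YAML files present in a directory minus the entries under resources:. A rough sketch under stated assumptions: the caller supplies the directory listing and the kustomization text, resources: is a block-style list of plain file names, and both function names are hypothetical. It does not follow directories, patches, or generators.

```typescript
// Collect the entries of a block-style `resources:` list.
function listedResources(kustomizationYaml: string): Set<string> {
  const listed = new Set<string>();
  let inResources = false;
  for (const raw of kustomizationYaml.split("\n")) {
    const line = raw.replace(/(^|\s)#.*$/, "").trimEnd(); // drop comments
    if (/^resources:\s*$/.test(line)) {
      inResources = true;
      continue;
    }
    if (!inResources) continue;
    const item = /^\s*-\s+(\S.*)$/.exec(line);
    if (item) {
      listed.add(String(item[1]).trim().replace(/^["']|["']$/g, ""));
    } else if (line.trim() !== "") {
      inResources = false; // any other non-blank line ends the list
    }
  }
  return listed;
}

// Return YAML files that exist in the directory but are not listed.
function findUnlisted(filesInDir: string[], kustomizationYaml: string): string[] {
  const listed = listedResources(kustomizationYaml);
  return filesInDir
    .filter((f) => /\.ya?ml$/.test(f) && f !== "kustomization.yaml")
    .filter((f) => !listed.has(f) && !listed.has(`./${f}`));
}

const kustomization = [
  "apiVersion: kustomize.config.k8s.io/v1beta1",
  "kind: Kustomization",
  "resources:",
  "  - deployment.yaml",
].join("\n");

console.info(
  findUnlisted(
    ["deployment.yaml", "notifier-credentials.secret.yaml", "kustomization.yaml"],
    kustomization,
  ),
); // [ "notifier-credentials.secret.yaml" ]
```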

#gitops #observability

AI workflow note

Claude wrote the staging plan as a phased five-step roadmap before any YAML — the same pattern that worked for the monitoring rollout in the previous post. When the ArgoCD-as-writer design hit SOPS, I asked the architect agent to compare ArgoCD-as-writer against FluxCD-writer-with-ArgoCD-visualization head-to-head; the SOPS asymmetry decided it in two messages. The pod-issues lesson reshaped my prompting too — when adding a Secret or ConfigMap, I now explicitly ask Claude to enumerate every resource in the existing kustomization.yaml and confirm the new file is listed. That one habit catches the silent-failure case before it ships.