Wintersalmon | Blog

Three event-sourcing bugs, three pillars of one contract

Three bugs over six weeks broke the same contract three different ways: an applier that called Math.random(), a sequence allocator that did read-then-write, and a relay that trusted senderId from the client body. The unified platform described in One API, N Games: Client-Authoritative Event Sourcing for a Unified Game Client is what lets a new game stay client-only: clients run the engine, the server stores opaque events, and state is the replay from seq 0. That promise has a price, and these three bugs are what paying it looked like in practice.

  • March 21, Augmented Chess: getRandomAugmentationChoices ran inside applyMove, so each client rolled different choices and validation desynced. Lesson: appliers must be deterministic.
  • April 12, submit-event backend: getLatestSequenceNumber then create raced under load, two writers got seq 8, and Mongo E11000 surfaced as a 500. Lesson: allocation and write are one operation, or they’re a race.
  • April 24, game-validator: the relay believed whatever senderId the client sent, so a guest could trigger host-only resolveRound. Lesson: identity is the one thing the server cannot delegate.

Bug 1: randomness in an applier desyncs replay

Round 1 of a chess augmentation worked because choices were generated inside createInitialState() on the host and shipped as the init event — both clients replayed identical bytes. Round 2 broke because choices were generated inside applyMove itself:

if (turnNumber === 6) {
  state.triggeredAugmentationChoices = getRandomAugmentationChoices(3);
}

Player A saw [X1, X2, X3], Player B saw [Y1, Y2, Y3]. The moment A picked X1, B’s validateSelectAugmentation rejected it (augmentationId not in choices) and the clients silently desynced.

The fix hoists the non-determinism into the action. The submitting client rolls once, the action carries the result, every client applies the same bytes:

interface MoveAction {
  type: "move";
  from: Square;
  to: Square;
  triggeredAugmentationChoices?: AugmentationChoice[];
}

The fix touched three files: packages/augmented-chess-engine/src/types/actions.ts, move.applier.ts, and apps/board-game-client/src/games/chess/game-state-store.ts. The general rule: appliers are pure functions of (state, action) -> state. Math.random(), Date.now(), network reads: anything that isn’t a function of those inputs has to be hoisted into the action and serialized.
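To make the hoisting concrete, here is a minimal sketch of the shape, not the actual engine code. GameState, POOL, rollChoices, and replay are simplified stand-ins invented for illustration, and Square is reduced to a plain string. The only impure call lives outside the applier; the applier just copies what the action carries.

```typescript
// Hypothetical, simplified shapes; the real engine types carry far more.
type AugmentationChoice = string;

interface GameState {
  turnNumber: number;
  triggeredAugmentationChoices?: AugmentationChoice[];
}

interface MoveAction {
  type: "move";
  from: string; // Square in the real engine
  to: string;
  triggeredAugmentationChoices?: AugmentationChoice[];
}

const POOL: AugmentationChoice[] = ["A", "B", "C", "D", "E"]; // made-up augmentation ids

// Impure: runs once, on the submitting client, before the action is sent.
function rollChoices(count: number, rng: () => number = Math.random): AugmentationChoice[] {
  const pool = [...POOL];
  const picked: AugmentationChoice[] = [];
  while (picked.length < count && pool.length > 0) {
    // Clamp so a misbehaving rng that returns 1 cannot index past the end.
    const index = Math.min(pool.length - 1, Math.floor(rng() * pool.length));
    const [choice] = pool.splice(index, 1);
    if (choice !== undefined) picked.push(choice);
  }
  return picked;
}

// Pure: the next state depends only on (state, action).
function applyMove(state: GameState, action: MoveAction): GameState {
  const next: GameState = { ...state, turnNumber: state.turnNumber + 1 };
  if (action.triggeredAugmentationChoices) {
    next.triggeredAugmentationChoices = action.triggeredAugmentationChoices;
  }
  return next;
}

// State is the fold of the log, starting from the initial state.
function replay(initial: GameState, log: MoveAction[]): GameState {
  let current = initial;
  for (const entry of log) current = applyMove(current, entry);
  return current;
}

const moveAction: MoveAction = {
  type: "move",
  from: "e2",
  to: "e4",
  triggeredAugmentationChoices: rollChoices(3), // rolled exactly once
};

const clientA = replay({ turnNumber: 5 }, [moveAction]);
const clientB = replay({ turnNumber: 5 }, [moveAction]);
console.log(JSON.stringify(clientA) === JSON.stringify(clientB)); // true
```

Passing rng as a parameter is a convenience for tests; the property that matters is that applyMove never calls it.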

Bug 2: non-atomic sequence allocation races under load

functions/game/room/submit-event/logic.ts allocated sequence numbers with a read, then a write:

const seqNum = await eventRepository.getLatestSequenceNumber(roomId);
const nextSeq = seqNum + 1;
await eventRepository.create(roomId, eventType, payload, nextSeq);

Two concurrent submits both read seqNum = 7 and both tried to write 8. The unique compound index on (gameId, sequenceNumber) in game_room_events caught it (the second writer got Mongo error code 11000), but the error was unhandled and surfaced to the client as a 500. Without the index, the log would have held two events at seq 8 and replayed engine state would have diverged.
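The interleaving is easier to see in a toy model. The sketch below is an in-memory stand-in, not the repository code: InMemoryEventLog and DuplicateKeyError are hypothetical names, and the two "writers" are interleaved by hand rather than run concurrently. It only borrows the numeric code 11000 so the failure looks like the one Mongo reported.

```typescript
// Hypothetical in-memory stand-in for a collection with a unique index
// on (roomId, sequenceNumber).
class DuplicateKeyError extends Error {
  readonly code = 11000; // same numeric code MongoDB uses for duplicate keys
  constructor(key: string) {
    super(`E11000 duplicate key: ${key}`);
  }
}

class InMemoryEventLog {
  private readonly rooms = new Map<string, Map<number, string>>();

  private room(roomId: string): Map<number, string> {
    let events = this.rooms.get(roomId);
    if (!events) {
      events = new Map<number, string>();
      this.rooms.set(roomId, events);
    }
    return events;
  }

  latestSequenceNumber(roomId: string): number {
    let latest = 0;
    this.room(roomId).forEach((_payload, seq) => {
      if (seq > latest) latest = seq;
    });
    return latest;
  }

  create(roomId: string, payload: string, sequenceNumber: number): void {
    const events = this.room(roomId);
    if (events.has(sequenceNumber)) {
      throw new DuplicateKeyError(`${roomId}/${sequenceNumber}`);
    }
    events.set(sequenceNumber, payload);
  }
}

const eventLog = new InMemoryEventLog();
for (let seq = 1; seq <= 7; seq++) eventLog.create("room-1", `event-${seq}`, seq);

// Both writers read before either writes: this gap is the race window.
const readByA = eventLog.latestSequenceNumber("room-1"); // 7
const readByB = eventLog.latestSequenceNumber("room-1"); // 7

eventLog.create("room-1", "from-A", readByA + 1); // seq 8 succeeds

let secondWriterCode: number | undefined;
try {
  eventLog.create("room-1", "from-B", readByB + 1); // seq 8 again
} catch (err) {
  secondWriterCode = (err as { code?: number }).code;
}
console.log(secondWriterCode); // 11000
```

Remove the has() check from create and the second write silently overwrites the first, which is the in-memory analogue of two events at seq 8.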

createWithAtomicSequence() in game-event.repository.ts collapses the window into one round trip:

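// Atomically bump the per-room counter and read the new value in one round trip.
const counter = await db.collection("game-sequence-counters").findOneAndUpdate(
  { _id: `room-${roomId}` },
  { $inc: { seq: 1 } },
  { upsert: true, returnDocument: "after" },
);
// Assumes a driver version where findOneAndUpdate resolves to the document itself
// (Node driver v6+); older versions wrap it, so this would be counter.value.seq.
await db.collection("game_room_events").insertOne({ ...event, sequenceNumber: counter.seq });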

findOneAndUpdate with $inc is atomic per document. Concurrent callers serialize on the counter. The unique index stays as the safety net, and a defensive catch in handleRequest returns 409 Conflict on any residual 11000. If replay correctness depends on a monotonic sequence, allocation and write must be one operation.
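The defensive catch might look roughly like the sketch below. It is an assumption-heavy outline rather than the real handleRequest: submitEvent, SubmitResult, the 201 success status, and the error message are all invented for illustration. The only fact it leans on is that a duplicate-key failure carries code 11000.

```typescript
const DUPLICATE_KEY = 11000; // MongoDB duplicate-key error code

// Hypothetical response shape, not the real handler's.
interface SubmitResult {
  status: number;
  body: { sequenceNumber?: number; error?: string };
}

function isDuplicateKeyError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === DUPLICATE_KEY
  );
}

// `write` stands in for the repository call and resolves to the allocated seq.
async function submitEvent(write: () => Promise<number>): Promise<SubmitResult> {
  try {
    const sequenceNumber = await write();
    return { status: 201, body: { sequenceNumber } };
  } catch (err) {
    if (isDuplicateKeyError(err)) {
      // The client can refetch and retry; this is a conflict, not a crash.
      return { status: 409, body: { error: "sequence conflict" } };
    }
    throw err; // anything else is still a genuine server error
  }
}
```

Checking the numeric code instead of an instanceof test keeps the helper independent of which driver error class is in play.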

Bug 3: identity from a client field is not identity

The relay used senderId from the client’s submit-event body to mark who an event was from:

{ "type": "resolve_round", "senderId": "<host-user-id>", "payload": { ... } }

Host-only validators (resolveRound on fruit-shop, revealCard on horror-race) checked action.resolverId === state.hostPlayerId — but resolverId came from senderId, which came from the client. A guest could trigger host-only actions by lying about who they were.

The fix wasn’t a one-line patch — the relay had to own identity end-to-end (PRs #398, #400, #402):

  1. Engine identity metadata. hostPlayerId added to InitGameOptions and persisted state for fruit-shop-engine, horror-race-engine, stone-flicking-core-engine, augmented-stone-flicking-engine. Host-only validators check the field on state, not on the action.
  2. Validation service. New apps/game-validator Bun service with POST /validate. Identity is injected from the verified session, not read from the body — action.resolverId = senderId (verified from JWT), action.revealerId = senderId.
  3. Relay integration. go/cmd/game-api/handlers/room.go calls the validator over HTTP, persists currentState on the room, only writes the event if validation passes.

In a client-authoritative system the server can let go of almost everything — engine, rules, most validation. The one thing it cannot let go of is who you are. Anything the client can write, the client can lie about.
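The core of the validator change can be sketched in a few lines. Everything here is a simplified assumption: VerifiedSession stands for whatever the JWT verification step yields, and stampIdentity and validateResolveRound are hypothetical names, not the functions in apps/game-validator. The point is only the direction of data flow: identity fields are overwritten from the session, never read from the body.

```typescript
// Assumed to be produced by JWT verification upstream; never built from the body.
interface VerifiedSession {
  userId: string;
}

// Whatever the client posted. Identity fields in here are untrusted.
interface ClientAction {
  type: string;
  payload?: unknown;
  [field: string]: unknown;
}

interface RoomState {
  hostPlayerId: string;
}

const IDENTITY_FIELDS = ["senderId", "resolverId", "revealerId"] as const;

// Overwrite every identity field with the verified user id.
function stampIdentity(raw: ClientAction, session: VerifiedSession): ClientAction {
  const stamped: ClientAction = { ...raw };
  for (const field of IDENTITY_FIELDS) {
    stamped[field] = session.userId;
  }
  return stamped;
}

// Host-only check against persisted state, as in the resolveRound validator.
function validateResolveRound(roomState: RoomState, action: ClientAction): boolean {
  return action.type === "resolve_round" && action.resolverId === roomState.hostPlayerId;
}

const persistedRoom: RoomState = { hostPlayerId: "host-1" };

// A guest claims to be the host in the request body.
const forged: ClientAction = { type: "resolve_round", senderId: "host-1", payload: {} };

const fromGuest = stampIdentity(forged, { userId: "guest-1" });
const fromHost = stampIdentity(forged, { userId: "host-1" });

console.log(validateResolveRound(persistedRoom, fromGuest)); // false
console.log(validateResolveRound(persistedRoom, fromHost)); // true
```

The same body is accepted or rejected purely on who the session says sent it, which is the property the original relay lacked.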

The three pillars in one checklist

| Pillar | Concrete rule | Caught by |
| --- | --- | --- |
| Deterministic appliers | No random, no time, no I/O; hoist into the action | Replay divergence in tests |
| Atomic sequence allocation | One operation; unique index as backstop | code 11000 → 409 |
| Server-owned identity | Rewrite identity from authenticated session on entry | Host-only validators on persisted state |
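For the first row, "replay divergence in tests" can be as small as replaying the same log twice and comparing the results. The helper below is a generic sketch with invented names (Applier, replayTwiceMatches), and the impure applier uses a hidden counter as a deterministic stand-in for Math.random() or Date.now() so the demo fails the same way every run.

```typescript
type Applier<S, A> = (state: S, action: A) => S;

function replayLog<S, A>(apply: Applier<S, A>, initial: S, log: A[]): S {
  let current = initial;
  for (const entry of log) current = apply(current, entry);
  return current;
}

// Replays the same log twice; a pure applier must produce identical states.
function replayTwiceMatches<S, A>(apply: Applier<S, A>, initial: S, log: A[]): boolean {
  const first = replayLog(apply, initial, log);
  const second = replayLog(apply, initial, log);
  return JSON.stringify(first) === JSON.stringify(second);
}

interface Tally {
  value: number;
}
interface AddAction {
  amount: number;
}

const pureApplier: Applier<Tally, AddAction> = (s, a) => ({ value: s.value + a.amount });

// Hidden input that is not part of (state, action).
let hiddenInput = 0;
const impureApplier: Applier<Tally, AddAction> = (s, a) => ({
  value: s.value + a.amount + hiddenInput++,
});

const sampleLog: AddAction[] = [{ amount: 1 }, { amount: 2 }];

console.log(replayTwiceMatches(pureApplier, { value: 0 }, sampleLog)); // true
console.log(replayTwiceMatches(impureApplier, { value: 0 }, sampleLog)); // false
```

With real randomness the check can pass by coincidence on a short log, so treat it as a smoke test rather than a proof.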

Future-me checklist for a new game: could randomness or time leak into an applier? Is sequence allocation atomic on the server? Who decides who I am? If any answer is shaky, the bug already exists; it just hasn’t been triggered yet.

#event-sourcing #determinism

AI workflow note

Claude was the pattern-matcher across the three task-logs. I asked it to read 20260321-chess-augmentation-event-sync-fix.md, 20260412-backend-atomic-event-sequence.md, and 20260424-game-event-validation/progress.md side by side and surface what they shared — the “three pillars” framing came from that pass, not from any individual log. After bug 1, I ran the code-reviewer agent across the other engine appliers with one question: “where else might Math.random() or Date.now() be hiding inside an applier?” It surfaced two more sites before they shipped. The discipline that paid off most was writing the fix as a regression test first — the duplicate-key race in bug 2 was easy to assert against once I had the test, hard to reason about without it.


Hungjoon

I'm Hungjoon, a software engineer based in South Korea. This is my long-form notebook — homelab, Kubernetes, AI infra, and whatever else keeps me up at night.