ADR-008: Review Session Access State Delivery and Time-Based Testing Conventions¶
Status¶
Draft
Context¶
PR #2467 introduces real-time active-reviewer tracking for annotation. A reviewer opens a study, the server assigns them a SlotReservation, and the form is locked if the study is at capacity and the stage admin has enforcement enabled.
Design review of the PR surfaced three issues that cross service boundaries:
1. Access-denial state is delivered on the wrong channel¶
On direct-URL navigation, the flow today is:
- REST
loadStageStudyreturns the Study DTO. It does not attempt a slot reservation and does not report access state. - The component renders the study with the annotation form visible.
- SignalR
JoinStudyReviewattempts the slot reservation. If the study is at capacity, the hub throwsInvalidOperationException, and the client fire-and-forget call swallows it. SubscribeToStudyPresencesubscribes to the Study's MongoDB change stream. No initial snapshot. A capacity-full study with no other live reviewers never delivers a snapshot at all.
Observed outcomes:
- Case A (other live pre-save reviewers on the page): the lock only engages once someone else perturbs the Study document. Reviewer can start typing before the overlay engages.
- Case B (capacity held by already-saved
AnnotationSessions): the snapshot DTO projects onlySlotReservations, not saved sessions, sosessions.lengthis0 < sessionCountTarget.isReviewerSurplusisfalse. Form never locks. Reviewer types a full annotation, clicks Save, gets 409 Conflict, receives a generic error toast. Work is lost.
The issue is that the server knows the access answer at join time, but no channel delivers it authoritatively to the client.
2. Time-sensitive state duplicates constants across the boundary¶
Reservation expiration (the 2-hour disconnect grace period) is computed twice:
- Server:
suspendedAt + ReviewSessionConstants.SuspendedSessionGracePeriodschedules a Quartz trigger. - Client:
disconnectedSince_c + SUSPENDED_GRACE_PERIOD_MScomputesexpiresAt_cfor the countdown.
The constant is duplicated (study-presence.store.ts:20 mirrors the server). The two now reads happen on independent clocks at slightly different wall-clock moments. Over a two-hour window the two values drift, accumulating message-delivery latency, NTP skew, and tab-throttling effects into a display that can be seconds-to-minutes off the actual release time.
3. Test coverage for scheduled messages is mocked-only¶
The scheduling infrastructure is MassTransit-over-Quartz with SQL Server persistence (QuartzServiceCollectionExtensions.cs:67-93). Consumer tests pass Mock<IMessageScheduler> and verify what consumers call, not what actually gets scheduled, fired, or cancelled. The whole reliability story ("slots are released exactly when they're supposed to be") rests on behaviour no test exercises.
Decision¶
Decision 1 — Dual-channel access state with last-write-wins reconciliation¶
The server reports the same access-state DTO on both channels:
- REST
loadStageStudyresponse is extended to include anaccessResult: { granted, denialReason, snapshot, serverTimestamp }field alongside the existing study/reviewStatus payload. The component renders with the correct lock state on first paint. - SignalR
JoinStudyReviewreturns aJoinStudyReviewResultwith the same shape from the hub invocation. The client'shubConnection.invoke(...)Promise is awaited and its result feeds the same store. StudyPresenceUpdatedbroadcasts (the existing change-stream-driven snapshots) also use the same shape.
Consistency rule: every payload includes a monotonic serverTimestamp (UTC). The store applies the most recent payload and ignores staler data. There is no "authoritative" channel — whichever message carries the highest timestamp wins. This works because neither channel is a writer from the store's perspective; they are both reporters of server-observed state.
Redundancy is intentional. If SignalR is down, REST carries the state. If a snapshot arrives between a REST call dispatching and completing, the later timestamp wins. If the server pushes a delta after the REST response, the store integrates it normally.
The snapshot DTO shape:
public class StudyReviewPresenceSnapshot {
public Guid StudyId { get; }
public IReadOnlyList<SlotReservationDto> Reservations { get; }
public int TotalAllocatedSessionCount { get; }
public DateTimeOffset ServerTimestamp { get; }
}
TotalAllocatedSessionCount (derived from the same tally the server uses for its capacity filter) replaces the client-side sessions.length >= sessionCountTarget check in isReviewerSurplus. This fixes Case B.
No nested AdminOnlySessionInfo in v1 (YAGNI). If the admin console ever needs saved-reviewer IDs, that is a stage-stats query, not a presence-snapshot field.
Decision 2 — Absolute UTC timestamps for all time-sensitive state¶
The server is the sole owner of timing constants. Snapshots carry absolute UTC instants for state transitions, not relative durations:
public class SlotReservationDto {
// ... existing fields ...
public DateTimeOffset? SuspendedAt { get; }
public DateTimeOffset? ReleaseAt { get; } // SuspendedAt + grace, if suspended
public DateTimeOffset? IdleAt { get; }
public DateTimeOffset? IdleRemovalAt { get; } // IdleAt + stage.IdleSessionTimeoutMinutes, if set
}
Client countdown becomes releaseAt - Date.now(). The duplicated SUSPENDED_GRACE_PERIOD_MS constant in study-presence.store.ts is deleted. Drift reduces from "accumulated over the grace period" to "current NTP skew between client and server clocks", which is typically tens of milliseconds.
Manual-clock-offset robustness (a user with their clock set wrong by hours) is handled by a single serverTimestamp field on the initial SignalR handshake. The client records serverOffset = server_utc - client_now and applies it to countdown math. This is a small amount of code and eliminates the last realistic drift scenario.
Display policy: the client renders its "expired" state ~10 seconds before releaseAt to avoid telling the user "you have 5 seconds" immediately before a save rejection. Server enforces strictly at releaseAt. This small conservative buffer lives entirely in the client; no server change.
Decision 3 — Integration-test Quartz scheduling directly¶
Unit tests continue to mock IMessageScheduler for consumer behaviour. A new integration-test tier is added that exercises real Quartz scheduling with in-process RAMJobStore (not the SQL Server store used in production — we are testing the scheduling behaviour, not the persistence tier).
Required coverage:
Schedule_FiresAfterDelay— fires within[delay, delay + 500ms].Cancel_BeforeFire_PreventsDelivery— cancelled trigger never delivers.Cancel_AfterFire_IsNoOp— late cancel is safe.Reschedule_OverwritesPrevious— same reservation rescheduled after dirty→clean→dirty does not leak triggers.Consumer_IsIdempotent_AgainstDuplicateDelivery— delivery-guarantee tests around the existing consumer state guards.
Delays are 1-3 seconds; full suite runs under a minute. A separate small suite (2-3 tests) using Testcontainers + SQL Server covers the persistence path as a smoke-check. This is the only place the production backend is exercised directly.
Consequences¶
Positive¶
- Direct-URL navigation to a capacity-full study locks correctly on first paint. No flash window.
- Saved-session capacity is detected correctly; Case B (the bigger of the two gaps) is fixed.
- The 2-hour countdown display agrees with the server's actual release time within NTP skew, not compounding drift.
- We gain confidence in scheduling correctness through tests that exercise the real primitive rather than mocks.
- Deleting the duplicated client-side constant removes a known coupling pitfall.
Negative¶
- Three endpoints now carry the same DTO shape. Schema changes touch three places (plus generated client).
- Last-write-wins requires the server to timestamp every outgoing snapshot. Cheap, but mandatory — a missing timestamp degrades the store silently.
- Integration tests against real Quartz add ~30s to the test suite (amortised across services that schedule). Acceptable.
Neutral¶
- The existing
EnsureActiveReviewSessionForDirectNavigationAsync(written but unwired) is now wired intoloadStageStudy. Code that was dead becomes live. - The
saveRejectedcopy (currently unreferenced) gets wired into a modal; a 409 response from submit also fires a Sentry exception so engineering learns when a client reached a state it should have been prevented from reaching.
Alternatives Considered¶
Make REST the single authority for slot reservation¶
Move all slot-reservation logic into REST endpoints; SignalR becomes push-only. Cleaner separation of concerns but much larger scope — touches the next-study selection flow, the existing atomic-assign pipeline, and the hub's role entirely. Rejected for this PR; revisit if presence semantics keep growing.
Keep SignalR as sole authority, add REST-side read-only preview¶
REST carries only a best-effort snapshot with no timestamp; SignalR remains authoritative. Simpler but reintroduces a precedence question: "my REST snapshot says X, my SignalR snapshot also says X, both at different times — which is now?" Last-write-wins with timestamps is barely more code and removes the question entirely.
Relative durations + longer client-side buffer¶
Keep the SUSPENDED_GRACE_PERIOD_MS constant; tolerate drift with a 30-60s display buffer. Works, but bakes the duplication in and ages poorly (any change to the server constant becomes a two-repo dance). Rejected.
References¶
- PR #2467 —
feat/signalr-active-reviewer-tracking - active-reviewer-tracking-overhaul.md — original design discussion
- review-access-state-overhaul.md — implementation plan for this ADR
- MassTransit + Quartz scheduler integration: QuartzServiceCollectionExtensions.cs