PR #2467 — Outstanding E2E investigations (handover)¶
Two E2E tests still fail on PR #2467 (feat/signalr-active-reviewer-tracking) after a long fix-pr-checks pass. This doc captures branch state, the architectural questions the user raised, the design decisions that came out of those, and concrete investigation paths so a fresh session can pick up cold without re-deriving everything.
Editorial corrections in this revision (2026-04-28): the previous draft incorrectly claimed there was an automatic
router.navigatetriggered byisSessionExpired/sessionRemoved. There isn't. Reading the code carefully, those signals are consumed by exactly two places — the idle banner and the form-dirty auto-rejoin. Treat all "session-removed redirects" claims in earlier drafts of this doc (or earlier conversation) as wrong.
Branch state¶
- Branch:
feat/signalr-active-reviewer-tracking - Head:
016aeacfa(handover doc) on top ofabd05aeff(web e2e flag fix). - Mergeable: yes (BLOCKED only by the failing E2E and
disconnection-and-save-guarditself). - Passing: 19/22 E2E tests, all unit tests, all SonarCloud gates, validate-docs, OpenAPI, build, etc.
- Flaky (passes on retry):
bulk-update-validation.spec.ts:96— unrelated to this PR's feature. - Failing (the focus of this handover): 2 tests, both genuinely about active-reviewer-tracking behaviour.
Recent merge: 8d630ba07 brought in main's ADR-009 refactor — IStageReviewService was split into IStageReviewService + IReviewStatsQueryService (in SyRF.ProjectManagement.Application), and StudyAssignmentPolicy was extracted as a pure domain service. StageReviewService ctor now takes (IPmUnitOfWork, IStudyAssignmentPolicy, FeatureFlags). Test fixups for that landed in 561c7d19e.
How slot-reservation lifecycle actually works (the corrected mental model)¶
Three ways a SlotReservation can be removed today:
| Path | Where | Atomic with what? |
|---|---|---|
| Annotation save (graduation) | Study.AddSessionData:191-193 — when a save creates a new committed AnnotationSession AND a matching reservation existed, SlotReservations.Remove(reservation!) runs in-memory; the whole Study is then written via SaveWithCapacityGuardAsync inside a Mongo transaction (SubmitAnnotationSessionService.cs:94-119). |
✅ Yes — graduation + AnnotationSession write are one transaction. |
LeaveStudyReview SignalR call |
NotificationHub.cs:392-437. Web's pairwise route-change effect (stage-review.component.ts:545-562) calls this when the (projectId, stageId, studyId) tuple changes. If no other connections remain for the user on that study/stage, the hub removes the reservation and saves. | ❌ No — separate write, racing with whatever else may have just touched the Study (e.g. a screening write that just landed). |
| Idle / suspended-grace consumers | Background MassTransit consumers (RemoveIdleSessionConsumer, RemoveSuspendedSessionConsumer) that fire after a configured timeout to clean up orphaned reservations. |
Independent write, much later. |
What does NOT remove a reservation¶
Study.AddScreening(Study.cs:146-150) — one line, just updatesScreeningInfo. No reservation logic. By design (see "Why screening doesn't graduate" below).- The Skip button (stage-review.component.ts:480) — it just calls
router.navigate(...). Cleanup happens via the URL-change → pairwise →LeaveStudyReviewchain. - Browser tab close / refresh — relies on SignalR's
OnDisconnectedAsync(NotificationHub.cs:OnDisconnectedAsync) firing reliably.
Why screening doesn't graduate (and why that's the right call)¶
Annotation submission has a clean handoff: the saved AnnotationSession takes over the slot. The reservation has a clear successor. Removing it as part of the same write is sound.
Screening does not have a parallel object — ScreeningInfo is a record of the decision, not a holder of capacity. There's no "graduation event" to anchor the removal on. More importantly, the user is still on the page after a screening save. The auto-routing to the next study is initiated by the client after it gets the response. If the screening save commits but the response is lost (network blip, browser tab buried), the user sits looking at the same study with the form visible. If we'd already removed their reservation as part of AddScreening, the form is now in a "no slot" state and the user has no idea why.
The current design — release the slot when the user actually navigates away — correctly defers the irreversible side-effect to the moment the user demonstrates they're done with the page. That's the right ordering. The fragility isn't in the design; it's in LeaveStudyReview being one network hop away from the route change instead of in-process with the navigation.
The two failing tests¶
Test 1: disconnection-and-save-guard.spec.ts:35¶
save guard rejection › surfaces 409 rejection when reviewer's slot was taken during disconnection
What it tests¶
disconnection-and-save-guard.spec.ts:35-144. Walks the slot-stolen-during-disconnect scenario:
- Project with
sessionCountTarget=1, one annotation question. - Admin opens the only study → gets a
SlotReservation(capacity 1/1). - Admin fills the annotation form (form is dirty).
- Test directly mutates Mongo to simulate the slot being released and taken by another reviewer:
$pulladmin'sSlotReservationsentry (simulates suspended-grace consumer firing while admin was offline).$pushanExtractionInfo.Sessionsentry forreviewer, filling capacity via committed session.- Admin clicks
[data-testid="annotation-save-btn"]. - Expected:
PUT /session/{id}fires; server runsSaveWithCapacityGuardAsyncinside its Mongo transaction; capacity guard rejects (no slot for admin + capacity full); returns HTTP 409. Client's review-effects.ts:374-396 opens AccessDeniedDialog, surfaces error notification with text "couldn't save your review" / "review spot".
Observed failure¶
TimeoutError: page.waitForResponse: Timeout 30000ms exceeded while waiting for event "response"
at e2e/tests/disconnection-and-save-guard.spec.ts:129:39
Error: locator.click: Target page, context or browser has been closed
- element is visible, enabled and stable
- scrolling into view if needed
[page closes]
The save button does become visible/enabled, click is attempted, but no PUT to /session/ is observed and the page closes mid-test. The page closing is most likely the test's 30s waitForResponse timeout causing Playwright to tear down the context, not an in-app redirect.
Hypotheses (in order of plausibility, after corrected reading of the code)¶
- The form-dirty auto-rejoin opens AccessDeniedDialog before the click reaches the save button. stage-review.component.ts:524-531 — when the form transitions dirty AND
sessionRemovedis true, it callsjoinStudy(...)which callsJoinStudyReviewover SignalR. The hub returns a denial result (no slot, capacity full). study-presence.store.ts:397-430handleAccessResultopens AccessDeniedDialog. The dialog overlay is now in front of the save button. Subsequent click events go to the dialog, not the underlying form. However: the form-dirty effect usesdistinctUntilChanged()— it only fires on transitions. The test fills the form before mutating the DB, so the dirty transition has already fired (when slot was still held, no rejoin needed). Unless the form goes clean and dirty again somewhere, this path shouldn't re-fire. onSaveform validation silently bails into a confirmation dialog. annotation-form.component.ts:722-755 — ifgetErrorMapfinds errors, opens a "Form Cannot Be Saved with Invalid Data" dialog and never dispatches the save. The test fills a single textbox; if there are hidden required fields the test isn't filling, validation could silently block. Worth checking.- The save URL pattern doesn't match the test's
waitForResponsefilter. Filter isresp.url().includes('/session/') && resp.request().method() === 'PUT'. Server route isPUT /api/projects/{p}/stages/{s}/studies/{st}/session/{sid}. Pattern should match. Unless the client batched it through a different endpoint. - A race between SignalR snapshot arrival and the click. Snapshot says "no slot for you",
sessionRemovedflips, but as established this only triggers a banner + the form-dirty auto-rejoin (which already fired). Unlikely to explain page closure on its own.
The cheapest hypothesis to disprove is #2 — add console.log('onSave called', { errors }) and rerun.
Test 2: session-lifecycle.spec.ts:252¶
session lifecycle — active reviewer tracking › session cleanup on navigation away › session is removed when reviewer screens and moves to next study
What it tests¶
session-lifecycle.spec.ts:252-310:
- Project with reviewer as member, 5 studies, default screening mode.
- Reviewer opens screening page → web auto-routes to study A;
JoinStudyReview(A)createsSlotReservationon A. - Confirms
pmStudy[A].SlotReservations.length >= 1. - Reviewer screens (includes) study A via
ScreeningPage.includeStudy()— server records the screening decision; web auto-routes to study B. - Test waits for the URL change to a different study.
- Failure point: poll
pmStudy[A].SlotReservationsfiltered to the reviewer for 30s, expect 0, observed 1.
Why this is happening (corrected from earlier draft)¶
AddScreening doesn't graduate (correctly, per the design rationale above). Cleanup is supposed to come from LeaveStudyReview fired by the route-change pairwise. The simpler test at session-lifecycle.spec.ts:197 (just navigate away, no screening) does pass — so the leave path can work. The thing different about test 252 is that a screening save runs concurrently with the leave.
Hypotheses (cheapest first):
- Pairwise doesn't fire because the screening flow uses a different navigation primitive. If
ScreeningPage.includeStudy()'s post-save router-navigate doesn't change the studyId observable (e.g. uses a child-route navigation that the parent'sstudyId$observable doesn't see), then the pairwise tuple doesn't change,leaveStudyis never called, the reservation stays. - Race between screening-save commit and
LeaveStudyReview. Both end up writing topmStudy[A]. IfLeaveStudyReview's read happens before the screening-save's commit, it removes the reservation but its write loses to the screening write that includes the (still present) reservation. The screening write happens viaIPmUnitOfWork.SaveAsync(study)which uses optimistic concurrency onAudit.Version, so one of them retries — but the retry would re-load the study which now has the reservation back from the loser-rolled-back state. End state could be reservation present. - The connection lookup in
LeaveStudyReviewreturns null. Hub doesawait _reviewSessionConnectionRepository.RemoveByConnectionIdAsync(connectionId)— if some upstream flow removed that connection record (e.g. the page refresh that happens after screening, or the SignalR connection reconnecting and getting a newconnectionId), the lookup returns null and the early-return at NotificationHub.cs:406-411 skips the cleanup entirely.
Possible fixes¶
| Approach | Effort | Pros | Cons |
|---|---|---|---|
Make LeaveStudyReview more reliable (better connection-id lookup, retry on race, telemetry to surface the misses) |
Medium | Keeps the "release on navigate" semantics that the user argued for. | Doesn't address the case of OnDisconnectedAsync racing the route-change leave; still a distributed write story. |
Have LeaveStudyReview and the screening save coordinate (e.g. screening write also clears the reservation if it was for the same reviewer/stage) |
Higher | Removes the race; aligns with the screening user-visible state. | Mixes lifecycle concerns into the screening write; may surprise other call paths that don't involve the reviewer's own slot. |
| Drive cleanup from the AnnotationSession lifecycle on the server, not from the client's route change | Highest | One write to a single aggregate; client just notifies. | Bigger refactor. |
The user's preferred direction (see "Design vision" below) is to make the client UX more honest about state — banner + disabled button + tooltip + warning dialog — and improve LeaveStudyReview reliability rather than restructure the domain.
Design vision (from user, as of 2026-04-28)¶
When the slot is lost (server says "no slot for you")¶
- Modal explainer dialog opens on first detection — once-per-study, gracefully dismissable.
- Persistent warning banner stays at the top of the page after the modal is dismissed, reminding the user the slot is no longer theirs.
- Save / Submit buttons disabled while no slot exists —
disabledshould be linked to the live presence/reservation status from SignalR snapshots, not just the form'spristine/editMode/sessionSaving. - Tooltip on disabled button hover: explains why the button is disabled.
- Click handler on disabled button still fires a warning dialog (helpful for keyboard users / cases where the disabled state was missed).
When SignalR connectivity is lost¶
This is indicative of a bigger problem — server unreachable, network outage, etc. Needs explicit handling rather than silently leaving the user in an "everything looks fine" state.
- Save buttons should be disabled (the action is dependent on a healthy live link).
- Banner: "Lost connection — reconnecting…"
- Persistent until reconnected.
- If reconnect fails after N tries, escalate (modal explanation, recommended action).
Slots available but reservation missing¶
The user's specific question: if there are slots available but the reservation is missing from the database and this is communicated to the front end via SignalR, how should the Angular application respond?
Current behaviour (study-presence.store.ts:308-316): set sessionRemoved: true, banner appears via idle-notification-banner.component.ts:33-41, user can keep using the form. If they type, stage-review.component.ts:524-531 auto-rejoins. If they don't, they only find out it's gone when they save and get a 409.
Recommended behaviour (matches user's design vision):
- Show a non-dismissable warning banner: "Your slot was released. Auto-reconnecting…"
- Try to silently reacquire immediately — call
JoinStudyReviewregardless of form-dirty state. (Currently the auto-rejoin is gated on form-dirty transition, which means a user who's just sitting reading doesn't get reacquired.) - If reacquire succeeds — clear the banner, log a Sentry breadcrumb tagged
transient-slot-loss, the user notices nothing. (Useful telemetry to see how often this happens organically.) - If reacquire fails (capacity now genuinely full) — open AccessDeniedDialog with explanation; user picks next study or views read-only.
- While reacquire is in flight — disable save with tooltip "Reconnecting your slot…".
This centralises slot-state into a single computed signal (saveDisabledReason: 'no-slot' | 'reconnecting' | 'signalr-down' | 'sessionSaving' | null) that drives the button, tooltip, banner, and click-warning consistently.
Investigation playbook for the next session¶
# 1. Get on the branch
cd /home/chris/workspace/syrf/pr/pr2467.signalr-active-reviewer-tracking
git fetch origin && git pull --ff-only
# 2. Run the two failing tests in isolation, locally
cd e2e
./scripts/setup.sh
# (start API/PM/Quartz manually per CLAUDE.md, or use the e2e CI workflow as reference)
pnpm exec playwright test \
tests/disconnection-and-save-guard.spec.ts \
tests/session-lifecycle.spec.ts:252 \
--project=@full --headed --workers=1 --trace=on
Before running, instrument:
- stage-review.component.ts:555-561 — log every pairwise emission with
prevandcurr. This disproves/proves hypothesis 1 for test 2 immediately. - annotation-form.component.ts:722 — log when
onSaveis called and the value oferrors. Disproves/proves test 1 hypothesis 2. - study-presence.store.ts:308-316 — log when
sessionRemovedflips and what triggered it. - signal-r.service.ts:307-318 — log every
LeaveStudyReviewinvoke arg. For test 2, see whether it fires at all after screening. - Server-side: bump
LogInformationtoLogWarninginLeaveStudyReview(NotificationHub.cs:401-436) so it's easy to grep. Add a log inMongoStudyRepository.SaveWithCapacityGuardAsyncshowing pre-writeAudit.Versionand the result code.
Capture the Playwright trace (--trace=on) so the network/console can be replayed afterwards.
Files modified during this session that the next session should be aware of¶
src/services/api/SyRF.API.Endpoint/appsettings.e2etest.json— addedFeatureFlags.ActiveReviewerTrackingEnabled: truesrc/services/project-management/SyRF.ProjectManagement.Endpoint/appsettings.e2etest.json— samesrc/services/web/src/assets/data/appConfig.e2e.json—activeReviewerTrackingEnabledflipped from"false"to"true"e2e/tests/{annotation-capacity,session-lifecycle,disconnection-and-save-guard}.spec.ts— bulk renamedActiveReviewSessions→SlotReservations,ActiveReviewSession→SlotReservationto match the post-66f6237bedomain model- Test scaffolding: added
IStudyAssignmentPolicyandIReviewStatsQueryServicemocks toStageReviewServiceTests,StageReviewServiceAtomicTests,NotificationHubStudyPresenceTests,SubmitSessionTransactionTestsafter the ADR-009 merge - New unit tests:
MongoUpdateBuilderTests.cs,MongoRepositoryBasePartialUpdateTests.cs,ReviewerPresenceTests.cs— added to lift SonarCloud coverage above 80% in the right way (real tests, not exclusions) - Reverted dead
sonar.coverage.exclusionsentries fromsrc/services/quartz/sonar-project.properties(file isn't loaded bydotnet-sonarscanner)
Loose ends and follow-ups (not blocking the PR)¶
- The wider UX work (modal + banner + disabled button + tooltip + click-warning + SignalR-down handling) is its own piece of work. Probably wants its own ticket and doc — list of behaviours captured in "Design vision" above.
LeaveStudyReviewreliability — the connection-id lookup race in hypothesis 3 above deserves its own investigation regardless of these tests. Suggest emitting telemetry every timeRemoveByConnectionIdAsyncreturns null in a context where it shouldn't.- Screening's interaction with reservations — the architectural argument is documented above; if the team disagrees, the alternative is to graduate-on-screening with the trade-offs noted in "Test 2 → Possible fixes".
This doc lives at docs/planning/ per project conventions — feel free to delete it once the two tests are green and the design-vision items have their own home.