Skip to content

PR #2467 — Outstanding E2E investigations (handover)

Two E2E tests still fail on PR #2467 (feat/signalr-active-reviewer-tracking) after a long fix-pr-checks pass. This doc captures branch state, the architectural questions the user raised, the design decisions that came out of those, and concrete investigation paths so a fresh session can pick up cold without re-deriving everything.

Editorial corrections in this revision (2026-04-28): the previous draft incorrectly claimed there was an automatic router.navigate triggered by isSessionExpired / sessionRemoved. There isn't. Reading the code carefully, those signals are consumed by exactly two places — the idle banner and the form-dirty auto-rejoin. Treat all "session-removed redirects" claims in earlier drafts of this doc (or earlier conversation) as wrong.

Branch state

  • Branch: feat/signalr-active-reviewer-tracking
  • Head: 016aeacfa (handover doc) on top of abd05aeff (web e2e flag fix).
  • Mergeable: yes (BLOCKED only by the failing E2E and disconnection-and-save-guard itself).
  • Passing: 19/22 E2E tests, all unit tests, all SonarCloud gates, validate-docs, OpenAPI, build, etc.
  • Flaky (passes on retry): bulk-update-validation.spec.ts:96 — unrelated to this PR's feature.
  • Failing (the focus of this handover): 2 tests, both genuinely about active-reviewer-tracking behaviour.

Recent merge: 8d630ba07 brought in main's ADR-009 refactor — IStageReviewService was split into IStageReviewService + IReviewStatsQueryService (in SyRF.ProjectManagement.Application), and StudyAssignmentPolicy was extracted as a pure domain service. StageReviewService ctor now takes (IPmUnitOfWork, IStudyAssignmentPolicy, FeatureFlags). Test fixups for that landed in 561c7d19e.

How slot-reservation lifecycle actually works (the corrected mental model)

Three ways a SlotReservation can be removed today:

Path Where Atomic with what?
Annotation save (graduation) Study.AddSessionData:191-193 — when a save creates a new committed AnnotationSession AND a matching reservation existed, SlotReservations.Remove(reservation!) runs in-memory; the whole Study is then written via SaveWithCapacityGuardAsync inside a Mongo transaction (SubmitAnnotationSessionService.cs:94-119). ✅ Yes — graduation + AnnotationSession write are one transaction.
LeaveStudyReview SignalR call NotificationHub.cs:392-437. Web's pairwise route-change effect (stage-review.component.ts:545-562) calls this when the (projectId, stageId, studyId) tuple changes. If no other connections remain for the user on that study/stage, the hub removes the reservation and saves. ❌ No — separate write, racing with whatever else may have just touched the Study (e.g. a screening write that just landed).
Idle / suspended-grace consumers Background MassTransit consumers (RemoveIdleSessionConsumer, RemoveSuspendedSessionConsumer) that fire after a configured timeout to clean up orphaned reservations. Independent write, much later.

What does NOT remove a reservation

  • Study.AddScreening (Study.cs:146-150) — one line, just updates ScreeningInfo. No reservation logic. By design (see "Why screening doesn't graduate" below).
  • The Skip button (stage-review.component.ts:480) — it just calls router.navigate(...). Cleanup happens via the URL-change → pairwise → LeaveStudyReview chain.
  • Browser tab close / refresh — relies on SignalR's OnDisconnectedAsync (NotificationHub.cs:OnDisconnectedAsync) firing reliably.

Why screening doesn't graduate (and why that's the right call)

Annotation submission has a clean handoff: the saved AnnotationSession takes over the slot. The reservation has a clear successor. Removing it as part of the same write is sound.

Screening does not have a parallel object — ScreeningInfo is a record of the decision, not a holder of capacity. There's no "graduation event" to anchor the removal on. More importantly, the user is still on the page after a screening save. The auto-routing to the next study is initiated by the client after it gets the response. If the screening save commits but the response is lost (network blip, browser tab buried), the user sits looking at the same study with the form visible. If we'd already removed their reservation as part of AddScreening, the form is now in a "no slot" state and the user has no idea why.

The current design — release the slot when the user actually navigates away — correctly defers the irreversible side-effect to the moment the user demonstrates they're done with the page. That's the right ordering. The fragility isn't in the design; it's in LeaveStudyReview being one network hop away from the route change instead of in-process with the navigation.

The two failing tests

Test 1: disconnection-and-save-guard.spec.ts:35

save guard rejection › surfaces 409 rejection when reviewer's slot was taken during disconnection

What it tests

disconnection-and-save-guard.spec.ts:35-144. Walks the slot-stolen-during-disconnect scenario:

  1. Project with sessionCountTarget=1, one annotation question.
  2. Admin opens the only study → gets a SlotReservation (capacity 1/1).
  3. Admin fills the annotation form (form is dirty).
  4. Test directly mutates Mongo to simulate the slot being released and taken by another reviewer:
  5. $pull admin's SlotReservations entry (simulates suspended-grace consumer firing while admin was offline).
  6. $push an ExtractionInfo.Sessions entry for reviewer, filling capacity via committed session.
  7. Admin clicks [data-testid="annotation-save-btn"].
  8. Expected: PUT /session/{id} fires; server runs SaveWithCapacityGuardAsync inside its Mongo transaction; capacity guard rejects (no slot for admin + capacity full); returns HTTP 409. Client's review-effects.ts:374-396 opens AccessDeniedDialog, surfaces error notification with text "couldn't save your review" / "review spot".

Observed failure

TimeoutError: page.waitForResponse: Timeout 30000ms exceeded while waiting for event "response"
  at e2e/tests/disconnection-and-save-guard.spec.ts:129:39

Error: locator.click: Target page, context or browser has been closed
  - element is visible, enabled and stable
  - scrolling into view if needed
  [page closes]

The save button does become visible/enabled, click is attempted, but no PUT to /session/ is observed and the page closes mid-test. The page closing is most likely the test's 30s waitForResponse timeout causing Playwright to tear down the context, not an in-app redirect.

Hypotheses (in order of plausibility, after corrected reading of the code)

  1. The form-dirty auto-rejoin opens AccessDeniedDialog before the click reaches the save button. stage-review.component.ts:524-531 — when the form transitions dirty AND sessionRemoved is true, it calls joinStudy(...) which calls JoinStudyReview over SignalR. The hub returns a denial result (no slot, capacity full). study-presence.store.ts:397-430 handleAccessResult opens AccessDeniedDialog. The dialog overlay is now in front of the save button. Subsequent click events go to the dialog, not the underlying form. However: the form-dirty effect uses distinctUntilChanged() — it only fires on transitions. The test fills the form before mutating the DB, so the dirty transition has already fired (when slot was still held, no rejoin needed). Unless the form goes clean and dirty again somewhere, this path shouldn't re-fire.
  2. onSave form validation silently bails into a confirmation dialog. annotation-form.component.ts:722-755 — if getErrorMap finds errors, opens a "Form Cannot Be Saved with Invalid Data" dialog and never dispatches the save. The test fills a single textbox; if there are hidden required fields the test isn't filling, validation could silently block. Worth checking.
  3. The save URL pattern doesn't match the test's waitForResponse filter. Filter is resp.url().includes('/session/') && resp.request().method() === 'PUT'. Server route is PUT /api/projects/{p}/stages/{s}/studies/{st}/session/{sid}. Pattern should match. Unless the client batched it through a different endpoint.
  4. A race between SignalR snapshot arrival and the click. Snapshot says "no slot for you", sessionRemoved flips, but as established this only triggers a banner + the form-dirty auto-rejoin (which already fired). Unlikely to explain page closure on its own.

The cheapest hypothesis to disprove is #2 — add console.log('onSave called', { errors }) and rerun.

Test 2: session-lifecycle.spec.ts:252

session lifecycle — active reviewer tracking › session cleanup on navigation away › session is removed when reviewer screens and moves to next study

What it tests

session-lifecycle.spec.ts:252-310:

  1. Project with reviewer as member, 5 studies, default screening mode.
  2. Reviewer opens screening page → web auto-routes to study A; JoinStudyReview(A) creates SlotReservation on A.
  3. Confirms pmStudy[A].SlotReservations.length >= 1.
  4. Reviewer screens (includes) study A via ScreeningPage.includeStudy() — server records the screening decision; web auto-routes to study B.
  5. Test waits for the URL change to a different study.
  6. Failure point: poll pmStudy[A].SlotReservations filtered to the reviewer for 30s, expect 0, observed 1.

Why this is happening (corrected from earlier draft)

AddScreening doesn't graduate (correctly, per the design rationale above). Cleanup is supposed to come from LeaveStudyReview fired by the route-change pairwise. The simpler test at session-lifecycle.spec.ts:197 (just navigate away, no screening) does pass — so the leave path can work. The thing different about test 252 is that a screening save runs concurrently with the leave.

Hypotheses (cheapest first):

  1. Pairwise doesn't fire because the screening flow uses a different navigation primitive. If ScreeningPage.includeStudy()'s post-save router-navigate doesn't change the studyId observable (e.g. uses a child-route navigation that the parent's studyId$ observable doesn't see), then the pairwise tuple doesn't change, leaveStudy is never called, the reservation stays.
  2. Race between screening-save commit and LeaveStudyReview. Both end up writing to pmStudy[A]. If LeaveStudyReview's read happens before the screening-save's commit, it removes the reservation but its write loses to the screening write that includes the (still present) reservation. The screening write happens via IPmUnitOfWork.SaveAsync(study) which uses optimistic concurrency on Audit.Version, so one of them retries — but the retry would re-load the study which now has the reservation back from the loser-rolled-back state. End state could be reservation present.
  3. The connection lookup in LeaveStudyReview returns null. Hub does await _reviewSessionConnectionRepository.RemoveByConnectionIdAsync(connectionId) — if some upstream flow removed that connection record (e.g. the page refresh that happens after screening, or the SignalR connection reconnecting and getting a new connectionId), the lookup returns null and the early-return at NotificationHub.cs:406-411 skips the cleanup entirely.

Possible fixes

Approach Effort Pros Cons
Make LeaveStudyReview more reliable (better connection-id lookup, retry on race, telemetry to surface the misses) Medium Keeps the "release on navigate" semantics that the user argued for. Doesn't address the case of OnDisconnectedAsync racing the route-change leave; still a distributed write story.
Have LeaveStudyReview and the screening save coordinate (e.g. screening write also clears the reservation if it was for the same reviewer/stage) Higher Removes the race; aligns with the screening user-visible state. Mixes lifecycle concerns into the screening write; may surprise other call paths that don't involve the reviewer's own slot.
Drive cleanup from the AnnotationSession lifecycle on the server, not from the client's route change Highest One write to a single aggregate; client just notifies. Bigger refactor.

The user's preferred direction (see "Design vision" below) is to make the client UX more honest about state — banner + disabled button + tooltip + warning dialog — and improve LeaveStudyReview reliability rather than restructure the domain.

Design vision (from user, as of 2026-04-28)

When the slot is lost (server says "no slot for you")

  • Modal explainer dialog opens on first detection — once-per-study, gracefully dismissable.
  • Persistent warning banner stays at the top of the page after the modal is dismissed, reminding the user the slot is no longer theirs.
  • Save / Submit buttons disabled while no slot exists — disabled should be linked to the live presence/reservation status from SignalR snapshots, not just the form's pristine / editMode / sessionSaving.
  • Tooltip on disabled button hover: explains why the button is disabled.
  • Click handler on disabled button still fires a warning dialog (helpful for keyboard users / cases where the disabled state was missed).

When SignalR connectivity is lost

This is indicative of a bigger problem — server unreachable, network outage, etc. Needs explicit handling rather than silently leaving the user in an "everything looks fine" state.

  • Save buttons should be disabled (the action is dependent on a healthy live link).
  • Banner: "Lost connection — reconnecting…"
  • Persistent until reconnected.
  • If reconnect fails after N tries, escalate (modal explanation, recommended action).

Slots available but reservation missing

The user's specific question: if there are slots available but the reservation is missing from the database and this is communicated to the front end via SignalR, how should the Angular application respond?

Current behaviour (study-presence.store.ts:308-316): set sessionRemoved: true, banner appears via idle-notification-banner.component.ts:33-41, user can keep using the form. If they type, stage-review.component.ts:524-531 auto-rejoins. If they don't, they only find out it's gone when they save and get a 409.

Recommended behaviour (matches user's design vision):

  1. Show a non-dismissable warning banner: "Your slot was released. Auto-reconnecting…"
  2. Try to silently reacquire immediately — call JoinStudyReview regardless of form-dirty state. (Currently the auto-rejoin is gated on form-dirty transition, which means a user who's just sitting reading doesn't get reacquired.)
  3. If reacquire succeeds — clear the banner, log a Sentry breadcrumb tagged transient-slot-loss, the user notices nothing. (Useful telemetry to see how often this happens organically.)
  4. If reacquire fails (capacity now genuinely full) — open AccessDeniedDialog with explanation; user picks next study or views read-only.
  5. While reacquire is in flight — disable save with tooltip "Reconnecting your slot…".

This centralises slot-state into a single computed signal (saveDisabledReason: 'no-slot' | 'reconnecting' | 'signalr-down' | 'sessionSaving' | null) that drives the button, tooltip, banner, and click-warning consistently.

Investigation playbook for the next session

# 1. Get on the branch
cd /home/chris/workspace/syrf/pr/pr2467.signalr-active-reviewer-tracking
git fetch origin && git pull --ff-only

# 2. Run the two failing tests in isolation, locally
cd e2e
./scripts/setup.sh
# (start API/PM/Quartz manually per CLAUDE.md, or use the e2e CI workflow as reference)
pnpm exec playwright test \
  tests/disconnection-and-save-guard.spec.ts \
  tests/session-lifecycle.spec.ts:252 \
  --project=@full --headed --workers=1 --trace=on

Before running, instrument:

Capture the Playwright trace (--trace=on) so the network/console can be replayed afterwards.

Files modified during this session that the next session should be aware of

  • src/services/api/SyRF.API.Endpoint/appsettings.e2etest.json — added FeatureFlags.ActiveReviewerTrackingEnabled: true
  • src/services/project-management/SyRF.ProjectManagement.Endpoint/appsettings.e2etest.json — same
  • src/services/web/src/assets/data/appConfig.e2e.jsonactiveReviewerTrackingEnabled flipped from "false" to "true"
  • e2e/tests/{annotation-capacity,session-lifecycle,disconnection-and-save-guard}.spec.ts — bulk renamed ActiveReviewSessionsSlotReservations, ActiveReviewSessionSlotReservation to match the post-66f6237be domain model
  • Test scaffolding: added IStudyAssignmentPolicy and IReviewStatsQueryService mocks to StageReviewServiceTests, StageReviewServiceAtomicTests, NotificationHubStudyPresenceTests, SubmitSessionTransactionTests after the ADR-009 merge
  • New unit tests: MongoUpdateBuilderTests.cs, MongoRepositoryBasePartialUpdateTests.cs, ReviewerPresenceTests.cs — added to lift SonarCloud coverage above 80% in the right way (real tests, not exclusions)
  • Reverted dead sonar.coverage.exclusions entries from src/services/quartz/sonar-project.properties (file isn't loaded by dotnet-sonarscanner)

Loose ends and follow-ups (not blocking the PR)

  • The wider UX work (modal + banner + disabled button + tooltip + click-warning + SignalR-down handling) is its own piece of work. Probably wants its own ticket and doc — list of behaviours captured in "Design vision" above.
  • LeaveStudyReview reliability — the connection-id lookup race in hypothesis 3 above deserves its own investigation regardless of these tests. Suggest emitting telemetry every time RemoveByConnectionIdAsync returns null in a context where it shouldn't.
  • Screening's interaction with reservations — the architectural argument is documented above; if the team disagrees, the alternative is to graduate-on-screening with the trade-offs noted in "Test 2 → Possible fixes".

This doc lives at docs/planning/ per project conventions — feel free to delete it once the two tests are green and the design-vision items have their own home.