Terraform CI/CD

How terraform plan and terraform apply run against the terraform/** modules in this repo, and how to land changes safely.

Why this exists

Until April 2026, terraform applies in this repo were ad-hoc — committed to main, then someone ran terraform apply from a laptop with admin AWS creds. That worked when the repo had one operator, but it produced silent drift:

feat/terraform-cicd-workflow was triggered after a real incident (2026-04-28) where a SyRF systematic-search upload to staging hung indefinitely. Root cause: the staging Lambda's IAM role syrfS3NotifierStagingLambdaRole had no policy attachments in AWS, despite terraform/lambda/main.tf:170-191 declaring two of them. The terraform code was correct; it had just never been fully applied.
The github-actions-deployer IAM user did not have permissions to manage syrfS3NotifierStagingLambdaRole either (only the production and preview roles were in its policy resource list at the time), so even a CI-driven apply would have failed.

This workflow closes the gap. Going forward, every change to terraform/** is planned in CI on the PR (so you can review the actual diff against AWS), and applied only after an operator manually triggers terraform apply from the Actions tab.

How it works

flowchart LR
    PR[PR opened/updated] -->|paths: terraform/**| Plan[terraform plan]
    Plan -->|posts diff as PR comment| Review[Human review]
    Review -->|merge to main| MainPlan[plan re-runs on main<br/>surfaces post-merge drift]
    MainPlan -.->|operator reviews logs| Dispatch[Manual workflow_dispatch<br/>apply=true from Actions tab]
    Dispatch --> Replan[terraform plan again]
    Replan --> Apply[terraform apply]

On pull request

Trigger: any PR that touches terraform/** or .github/workflows/terraform.yml.
Static checks (always run, no AWS needed): terraform fmt -check, terraform init -backend=false, terraform validate.
Plan (gated on secrets.AWS_ACCESS_KEY_ID being set): real terraform init with backend, then terraform plan -detailed-exitcode -out=tfplan. Plan output is posted as a comment on the PR (and re-edited on each push, so reviewers see the current plan, not stale output). The tfplan binary is uploaded as an artifact for 14 days.
If AWS creds aren't configured on the repo yet, the workflow still runs static checks and posts a ::warning instead of failing, so PR signal is useful from the moment the workflow lands.
This stage is read-only: the workflow performs no AWS write operations here.
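The steps above could be expressed in shell roughly as follows. This is a sketch under assumptions, not the contents of the actual workflow file: the function names (`static_checks`, `plan_status`, `run_plan`) and the module directory argument are hypothetical. The exit codes are those `terraform plan -detailed-exitcode` documents: 0 for no changes, 1 for an error, 2 for a successful plan with changes.

```shell
# Sketch only: function names and the directory argument are hypothetical.

# Static checks that need no AWS access.
static_checks() {
  dir="$1"
  terraform -chdir="$dir" fmt -check &&
    terraform -chdir="$dir" init -backend=false &&
    terraform -chdir="$dir" validate
}

# Map the exit code of `terraform plan -detailed-exitcode` to a label.
plan_status() {
  case "$1" in
    0) echo "no-changes" ;;
    2) echo "changes" ;;
    *) echo "error" ;;
  esac
}

# Run the plan and print its status. Exit code 2 means "changes present",
# so it must not be treated as a failure.
run_plan() {
  dir="$1"
  rc=0
  terraform -chdir="$dir" plan -detailed-exitcode -out=tfplan || rc=$?
  plan_status "$rc"
  [ "$rc" -eq 0 ] || [ "$rc" -eq 2 ]
}
```

Capturing the exit code with `|| rc=$?` matters because a CI shell that exits on any non-zero status would otherwise fail every plan that contains changes.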

On push to `main`

Trigger: a merge to main that touches terraform/**.
The plan job re-runs on main so any post-merge drift surfaces in the workflow logs (e.g. resources someone changed in the AWS console between PR review and merge). This is read-only — apply does NOT run on push.
To apply, an operator manually triggers workflow_dispatch with apply=true (see below). That manual click is the human gate.

Apply: manual `workflow_dispatch`

Apply runs only when manually triggered from the Actions tab:

Go to Actions → terraform → Run workflow.
Branch: main (the workflow rejects dispatches from other branches).
Inputs: module=lambda (or other), apply=true.
Click Run workflow.
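The same dispatch can likely be issued from a terminal with the GitHub CLI instead of the web UI. A minimal sketch, assuming `gh` is installed and authenticated against this repo, that the workflow file is `terraform.yml`, and that the inputs are named `module` and `apply` as listed above. The wrapper function name is hypothetical.

```shell
# Hypothetical wrapper around `gh workflow run`; not part of the repo.
# Usage: dispatch_terraform <module> [true|false]
dispatch_terraform() {
  module="$1"
  apply="${2:-false}"   # default to a plan-only run
  case "$apply" in
    true|false) ;;
    *) echo "apply must be true or false" >&2; return 64 ;;
  esac
  # --ref main: the workflow rejects dispatches from other branches.
  gh workflow run terraform.yml --ref main \
    -f module="$module" -f apply="$apply"
}
```

For example, `dispatch_terraform lambda true` would request an apply of the lambda module. Defaulting to `false` means a mistyped invocation plans rather than applies.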

The dispatched run will:

Re-run terraform plan (drift check vs. the latest state).
If the re-plan errors, abort before any mutation.
Otherwise, run terraform apply against the saved plan.
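In shell terms, the three steps above might look like the sketch below. It is an illustration under assumptions rather than the workflow's actual script: `replan_then_apply` and its directory argument are hypothetical names.

```shell
# Sketch only: replan_then_apply is a hypothetical name.
replan_then_apply() {
  dir="$1"
  rc=0
  # Re-plan against the latest state and save the result.
  terraform -chdir="$dir" plan -detailed-exitcode -out=tfplan || rc=$?
  # 0 = no changes, 2 = changes present; anything else is an error.
  if [ "$rc" -ne 0 ] && [ "$rc" -ne 2 ]; then
    echo "re-plan failed (exit $rc); aborting before apply" >&2
    return 1
  fi
  # Apply exactly the plan that was just saved.
  terraform -chdir="$dir" apply tfplan
}
```

Applying the saved `tfplan` file, rather than re-running a bare `terraform apply`, means the changes applied are the ones the re-plan just computed.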

For plan-only ad-hoc runs (e.g. to check for drift without changing anything), trigger the same dispatch with apply=false.

Why a manual dispatch instead of an environment with required reviewers

The original design for this workflow used a protected GitHub Environment (infra-apply) with required reviewers — the apply job would pause at the environment gate and wait for a designated reviewer to click Approve and deploy. That mechanism is unavailable on this repo: required-reviewer protection on private repositories needs GitHub Enterprise, while camaradesuk is on the Team plan. Team-on-private supports environments and deployment-branch policies but not required reviewers, wait timers, or custom protection rules.

workflow_dispatch gives equivalent guarantees without requiring a plan upgrade:

Apply cannot start without explicit human action (clicking Run workflow).
The action is auditable in the Actions tab (who triggered, when, with what inputs).
The triggering branch is constrained to main by both if: github.ref == 'refs/heads/main' in the apply job and the infra-apply environment's deployment-branch policy.
The infra-apply environment is retained for deployment tracking (visible in the repo's Deployments view) and so the gate becomes a required-reviewer rule automatically if the repo is ever made internal or the org upgrades to Enterprise.

One-time setup

These resources must be configured before the workflow can run end-to-end. None are created by this PR.

1. GitHub Environment: `infra-apply`

In Settings → Environments → New environment, create infra-apply and configure:

Deployment branches: restrict to main so a feature branch can't trigger an apply.
Required reviewers (unavailable on Team-private — see "Why a manual dispatch…" above): configure these only if the repo is internal or the org is on Enterprise. They are belt-and-braces; the workflow doesn't depend on them, because the workflow_dispatch click already provides the human gate.
Wait timer (also unavailable on Team-private; needs Enterprise): optional even when available, because the manual dispatch already gives reviewers as much time as they want before clicking Run workflow.

2. AWS credentials secret

The workflow currently uses secrets.AWS_ACCESS_KEY_ID / secrets.AWS_SECRET_ACCESS_KEY for the existing github-actions-deployer IAM user. This is the same pair already used by the syrf monorepo's ci-cd.yml for Lambda packaging.

Long-term we should migrate to OIDC + assume-role (no static keys). The workflow already requests id-token: write so this is one secret swap away. See terraform-guide.md for context on the existing AWS auth model.

3. Backend state bucket

Already in place: s3://camarades-terraform-state-aws/, with DynamoDB locking via the terraform-locks table (eu-west-1). See terraform/lambda/backend.tf.

Safety properties

| Property | Mechanism |
| --- | --- |
| Plan can never mutate AWS | Plan job uses `terraform plan` only. No `apply`, no `import`, no destroy commands. |
| Static checks always run | `terraform fmt -check` + `init -backend=false` + `validate` run with no AWS access, catching syntax/format/schema errors pre-merge even before secrets are configured. |
| Apply requires explicit human action | Apply runs only via manual `workflow_dispatch` with `apply=true` from the Actions tab. Push-to-main does NOT auto-apply. |
| Plan output is reviewable on the PR | PR-comment integration shows the actual diff before merge. |
| Post-merge drift is surfaced on `main` | Plan job re-runs on push-to-main, logging any drift in the workflow run before an operator decides whether to dispatch an apply. |
| Drift between merge and apply is caught | Apply job re-runs `terraform plan` immediately before `apply` and aborts on hard errors. |
| Lock prevents concurrent applies | `concurrency:` group + DynamoDB state lock. |
| Apply is scoped to `main` | `if: github.ref == 'refs/heads/main'` in apply job + `infra-apply` environment's deployment-branch policy. |

Known gaps & follow-ups

These are explicitly not addressed by this workflow PR. They are tracked for follow-up:

Deployer IAM does not include the staging Lambda role. terraform/lambda/github-actions-iam.tf:124-126 lists syrfS3NotifierProductionLambdaRole and syrfS3NotifierPreviewLambdaRole only, so the deployer cannot create policy attachments on syrfS3NotifierStagingLambdaRole until that list is widened. The first widening apply must therefore be done with elevated creds (locally, by an infra owner), or via a separate workflow run that assumes a role with broader IAM scope.
First reconciliation apply will create the missing staging policies. The very first apply run after this workflow lands will plan to create:
aws_iam_role_policy_attachment.staging_lambda_basic (AWSLambdaBasicExecutionRole)
aws_iam_role_policy.staging_lambda_s3 (s3:GetObject on syrfapp-uploads-staging/*)

Both are additive and non-destructive. After the apply, the existing staging Lambda will be able to write CloudWatch logs and read the staging upload bucket, unblocking the systematic-search upload pipeline.

OIDC migration. Static access keys are a foot-gun. Replace with aws-actions/configure-aws-credentials@v4 role-to-assume and an OIDC provider on the AWS account. Tracked separately.
Multiple terraform modules. Today only terraform/lambda is wired up. Add to the matrix as new modules are introduced.
Periodic drift detection. Consider adding a schedule (cron) trigger that runs plan-only daily and alerts if drift is found, instead of relying on PR-driven discovery. A workflow_dispatch trigger cannot be scheduled; it only fires on manual or API-driven dispatch.
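For the first reconciliation apply listed above, one way to confirm that the staging role actually received its policies is to query IAM directly. A minimal sketch using the AWS CLI, assuming credentials with read access to IAM (`iam:ListAttachedRolePolicies` and `iam:ListRolePolicies`). The function name is hypothetical; the role name in the usage note comes from the incident description.

```shell
# Hypothetical helper: prints managed and inline policies on a role.
role_policy_report() {
  role="$1"
  echo "managed:"
  aws iam list-attached-role-policies --role-name "$role" \
    --query 'AttachedPolicies[].PolicyName' --output text
  echo "inline:"
  aws iam list-role-policies --role-name "$role" \
    --query 'PolicyNames' --output text
}
```

Running `role_policy_report syrfS3NotifierStagingLambdaRole` before and after the apply should show the difference. If neither call lists any policy, the role is still in the state described in the 2026-04-28 incident.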

terraform-guide.md — general Terraform usage in this repo.
cluster-setup-guide.md — bootstrap context.