Terraform CI/CD
How terraform plan / terraform apply runs against terraform/** modules in this repo, and how to land changes safely.
Why this exists
Until April 2026, terraform applies in this repo were ad hoc: changes were committed to main, then someone ran terraform apply from a laptop with admin AWS creds. That worked when the repo had one operator, but it produced silent drift:
- `feat/terraform-cicd-workflow` was triggered after a real incident (2026-04-28) where a SyRF systematic-search upload to staging hung indefinitely. Root cause: the staging Lambda's IAM role `syrfS3NotifierStagingLambdaRole` had no policy attachments in AWS, despite `terraform/lambda/main.tf:170-191` declaring two of them. The terraform code was correct; it had just never been fully applied.
- The `github-actions-deployer` IAM user did not have permissions to manage `syrfS3NotifierStagingLambdaRole` either (only the production and preview roles were in its policy resource list at the time), so even a CI-driven apply would have failed.
This workflow closes the gap. Going forward, every change to terraform/** is planned in CI on the PR (so you can review the actual diff against AWS), and applied only after an operator manually triggers terraform apply from the Actions tab.
How it works
flowchart LR
PR[PR opened/updated] -->|paths: terraform/**| Plan[terraform plan]
Plan -->|posts diff as PR comment| Review[Human review]
Review -->|merge to main| MainPlan[plan re-runs on main<br/>surfaces post-merge drift]
MainPlan -.->|operator reviews logs| Dispatch[Manual workflow_dispatch<br/>apply=true from Actions tab]
Dispatch --> Replan[terraform plan again]
Replan --> Apply[terraform apply]
On pull request
- Trigger: any PR that touches `terraform/**` or `.github/workflows/terraform.yml`.
- Static checks (always run, no AWS needed): `terraform fmt -check`, `terraform init -backend=false`, `terraform validate`.
- Plan (gated on `secrets.AWS_ACCESS_KEY_ID` being set): real `terraform init` with backend, then `terraform plan -detailed-exitcode -out=tfplan`. Plan output is posted as a comment on the PR (and re-edited on each push, so reviewers see the current plan, not stale output). The `tfplan` binary is uploaded as an artifact for 14 days.
- If AWS creds aren't configured on the repo yet, the workflow still runs static checks and posts a `::warning` instead of failing, so PR signal is useful from the moment the workflow lands.
- This stage is read-only: the workflow does not use any AWS write permissions here.
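The PR path described above could be laid out roughly as follows. This is a sketch under assumptions, not the repo's actual `.github/workflows/terraform.yml`: the job name, the `HAS_AWS_CREDS` and `TF_DIR` variables, and the action versions are illustrative, and the PR-comment and artifact-upload steps are left out.

```yaml
# Sketch only; not the repo's actual .github/workflows/terraform.yml.
name: terraform
on:
  pull_request:
    paths:
      - "terraform/**"
      - ".github/workflows/terraform.yml"

jobs:
  plan:
    runs-on: ubuntu-latest
    env:
      # Step-level `if:` cannot read secrets directly, so expose a flag (hypothetical name).
      HAS_AWS_CREDS: ${{ secrets.AWS_ACCESS_KEY_ID != '' }}
      TF_DIR: terraform/lambda
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false  # keep terraform's own exit codes
      # Static checks: no AWS access required.
      - run: terraform -chdir="$TF_DIR" fmt -check
      - run: terraform -chdir="$TF_DIR" init -backend=false -input=false
      - run: terraform -chdir="$TF_DIR" validate
      - name: Warn when AWS credentials are missing
        if: env.HAS_AWS_CREDS != 'true'
        run: echo "::warning::AWS credentials not configured; skipping terraform plan"
      - name: Plan against the real backend
        if: env.HAS_AWS_CREDS == 'true'
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          terraform -chdir="$TF_DIR" init -input=false
          code=0
          terraform -chdir="$TF_DIR" plan -input=false -detailed-exitcode -out=tfplan || code=$?
          # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
          if [ "$code" -eq 1 ]; then exit 1; fi
```

Capturing the exit code matters because `-detailed-exitcode` returns 2 when the plan contains changes, which would otherwise fail the step even though the plan itself succeeded.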
On push to main
- Trigger: a merge to `main` that touches `terraform/**`.
- The plan job re-runs on `main` so any post-merge drift surfaces in the workflow logs (e.g. resources someone changed in the AWS console between PR review and merge). This is read-only; apply does NOT run on push.
- To apply, an operator manually triggers `workflow_dispatch` with `apply=true` (see below). That manual click is the human gate.
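As a minimal sketch, the push trigger for this path might look like the fragment below. It is an assumption about the shape of the trigger block, not a copy of the real workflow.

```yaml
# Sketch only: re-run the plan job when a merge to main touches terraform/**.
on:
  push:
    branches:
      - main
    paths:
      - "terraform/**"
```

Nothing in this trigger starts an apply; it only makes the plan job run again on `main`.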
Apply: manual workflow_dispatch
Apply runs only when manually triggered from the Actions tab:
- Go to Actions → terraform → Run workflow.
- Branch: `main` (the workflow rejects dispatches from other branches).
- Inputs: `module=lambda` (or other), `apply=true`.
- Click Run workflow.
The dispatched run will:
- Re-run `terraform plan` (drift check vs. the latest state).
- If the re-plan errors, abort before any mutation.
- Otherwise `terraform apply` against the saved plan.
For plan-only ad-hoc runs (e.g. to check for drift without changing anything), trigger the same dispatch with apply=false.
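The dispatch inputs and the apply sequence above could be sketched like this. The input names `module` and `apply`, the `infra-apply` environment, and the `main` check come from this page; the job name, `TF_DIR`, the input types, and the action versions are assumptions.

```yaml
# Sketch only; input names follow the text above, everything else is assumed.
on:
  workflow_dispatch:
    inputs:
      module:
        description: Module directory under terraform/
        type: choice
        options:
          - lambda
        default: lambda
      apply:
        description: Run terraform apply after the re-plan
        type: boolean
        default: false

jobs:
  apply:
    # Runs only for a manual dispatch from main with apply=true.
    if: github.event_name == 'workflow_dispatch' && inputs.apply && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: infra-apply
    env:
      TF_DIR: terraform/${{ inputs.module }}
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false  # keep terraform's own exit codes
      - run: terraform -chdir="$TF_DIR" init -input=false
      - name: Re-plan (abort on error before any mutation)
        run: |
          code=0
          terraform -chdir="$TF_DIR" plan -input=false -detailed-exitcode -out=tfplan || code=$?
          # Exit code 1 is a hard error; 0 (no changes) and 2 (changes) both continue.
          if [ "$code" -eq 1 ]; then exit 1; fi
      - name: Apply the saved plan
        run: terraform -chdir="$TF_DIR" apply -input=false tfplan
```

Applying the saved `tfplan` file rather than planning again inside `apply` means the changes that run are exactly the ones the re-plan step produced.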
Why a manual dispatch instead of an environment with required reviewers
The original design for this workflow used a protected GitHub Environment (infra-apply) with required reviewers — the apply job would pause at the environment gate and wait for a designated reviewer to click Approve and deploy. That mechanism is unavailable on this repo: required-reviewer protection on private repositories needs GitHub Enterprise, while camaradesuk is on the Team plan. Team-on-private supports environments and deployment-branch policies but not required reviewers, wait timers, or custom protection rules.
workflow_dispatch gives equivalent guarantees without requiring a plan upgrade:
- Apply cannot start without explicit human action (clicking Run workflow).
- The action is auditable in the Actions tab (who triggered, when, with what inputs).
- The triggering branch is constrained to `main` by both `if: github.ref == 'refs/heads/main'` in the apply job and the `infra-apply` environment's deployment-branch policy.
- The `infra-apply` environment is retained for deployment tracking (visible in the repo's Deployments view), and so that a required-reviewer rule can be added to the same gate without changing the workflow if the repo is ever made internal or the org upgrades to Enterprise.
One-time setup
These resources must be configured before the workflow can run end-to-end. None are created by this PR.
1. GitHub Environment: infra-apply
In Settings → Environments → New environment, create infra-apply and configure:
- Deployment branches: restrict to `main` so a feature branch can't trigger an apply.
- Required reviewers (unavailable on Team-private; see "Why a manual dispatch…" above): configure these only if the repo is internal or the org is on Enterprise. They are belt-and-braces; the workflow doesn't depend on them, because the workflow_dispatch click already provides the human gate.
- Wait timer (also unavailable on Team-private): optional even when available; the manual dispatch already gives reviewers as much time as they want before clicking Run workflow.
2. AWS credentials secret
The workflow currently uses secrets.AWS_ACCESS_KEY_ID / secrets.AWS_SECRET_ACCESS_KEY for the existing github-actions-deployer IAM user. This is the same pair already used by the syrf monorepo's ci-cd.yml for Lambda packaging.
Long-term we should migrate to OIDC + assume-role (no static keys). The workflow already requests id-token: write so this is one secret swap away. See terraform-guide.md for context on the existing AWS auth model.
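For orientation, an OIDC-based credentials step might look roughly like the fragment below. The role ARN is a placeholder, the job name is illustrative, and the region is assumed to match the one used for the state lock table; none of these are taken from the real workflow.

```yaml
# Sketch only: OIDC + assume-role in place of static access keys.
permissions:
  id-token: write  # lets the job request an OIDC token
  contents: read

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-terraform  # hypothetical ARN
          aws-region: eu-west-1  # assumed region
```

This also requires an OIDC identity provider and a role trust policy on the AWS side, which are outside the workflow file.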
3. Backend state bucket
Already in place: s3://camarades-terraform-state-aws/, with DynamoDB locking via the terraform-locks table (eu-west-1). See terraform/lambda/backend.tf.
Safety properties
| Property | Mechanism |
|---|---|
| Plan can never mutate AWS | Plan job uses terraform plan only. No apply, no import, no destroy commands. |
| Static checks always run | terraform fmt -check + init -backend=false + validate run with no AWS access — catches syntax/format/schema errors pre-merge even before secrets are configured. |
| Apply requires explicit human action | Apply runs only via manual workflow_dispatch with apply=true from the Actions tab. Push-to-main does NOT auto-apply. |
| Plan output is reviewable on the PR | PR-comment integration shows the actual diff before merge. |
| Post-merge drift is surfaced on main | Plan job re-runs on push-to-main, logging any drift in the workflow run before an operator decides whether to dispatch an apply. |
| Drift between merge and apply is caught | Apply job re-runs terraform plan immediately before apply and aborts on hard errors. |
| Lock prevents concurrent applies | concurrency: group + DynamoDB state lock. |
| Apply is scoped to main | if: github.ref == 'refs/heads/main' in apply job + infra-apply environment's deployment-branch policy. |
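
The `concurrency:` mechanism named in the table could be expressed at job level roughly as below. The group name is hypothetical and the fragment omits `runs-on` and `steps`; it only illustrates the shape of the setting.

```yaml
# Sketch only: serialize apply runs per module (group name is hypothetical).
jobs:
  apply:
    concurrency:
      group: terraform-apply-${{ inputs.module }}
      cancel-in-progress: false  # let a running apply finish instead of interrupting it
```

The DynamoDB state lock remains the second layer: even if two runs did start together, only one could hold the lock on the state.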
Known gaps & follow-ups
These are explicitly not addressed by this workflow PR. They are tracked for follow-up:
- Deployer IAM does not include the staging Lambda role. `terraform/lambda/github-actions-iam.tf:124-126` lists `syrfS3NotifierProductionLambdaRole` and `syrfS3NotifierPreviewLambdaRole` only. The deployer cannot create policy attachments on `syrfS3NotifierStagingLambdaRole` until this is widened. The first widening apply must therefore be done with elevated creds (locally, by an infra owner), or via a separate workflow run that uses an assume-role with broader IAM scope.

- First reconciliation apply will create the missing staging policies. The very first apply run after this workflow lands will plan to create:
    - `aws_iam_role_policy_attachment.staging_lambda_basic` (AWSLambdaBasicExecutionRole)
    - `aws_iam_role_policy.staging_lambda_s3` (`s3:GetObject` on `syrfapp-uploads-staging/*`)

  Both are additive and non-destructive. After apply, the existing staging Lambda will be able to write CloudWatch logs and read the staging upload bucket, unblocking the systematic-search upload pipeline.
- OIDC migration. Static access keys are a foot-gun. Replace them with the `role-to-assume` input of `aws-actions/configure-aws-credentials@v4` and an OIDC provider on the AWS account. Tracked separately.

- Multiple terraform modules. Today only `terraform/lambda` is wired up. Add to the matrix as new modules are introduced.

- Periodic drift detection. Consider a `schedule` (cron) trigger that runs plan-only daily and alerts if drift is found, instead of relying on PR-driven discovery.
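The periodic drift check in the last item could be sketched with GitHub's `schedule` trigger, which is what cron-style runs use. The cron time and job name are arbitrary choices, and "alerting" here simply means the scheduled run fails; a real setup might notify through another channel.

```yaml
# Sketch only: daily plan-only run that fails when the plan is not empty.
on:
  schedule:
    - cron: "0 6 * * *"  # daily at 06:00 UTC; the time is arbitrary

jobs:
  drift:
    runs-on: ubuntu-latest
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false  # keep terraform's own exit codes
      - run: terraform -chdir=terraform/lambda init -input=false
      - name: Fail the run when drift is present
        # -detailed-exitcode exits 2 when changes exist, which fails this step.
        run: terraform -chdir=terraform/lambda plan -input=false -detailed-exitcode
```

Scheduled workflows run against the default branch, so this checks AWS against whatever is currently on `main`.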
Related
- terraform-guide.md — general Terraform usage in this repo.
- cluster-setup-guide.md — bootstrap context.