Terraform state recovery playbook for SaaS teams

When Terraform state disagrees with cloud reality, every apply is a coin flip. This is the sequence we run with clients — concrete commands, the order they run in, and the escalation points where you stop self-help and call for an SRE.

IaC recovery | 12 min read

Problem signals
  • terraform plan reports 20+ surprise changes you never staged, including destroys on stateful resources.
  • State lock held by a CI runner that finished hours ago; force-unlock feels risky.
  • Resources exist in AWS / GCP / Azure but terraform plan says they need to be created.
  • A previous emergency kubectl edit or console click was never reconciled back to HCL.
  • Engineers route around terraform apply by going directly to the cloud console.

Why this happens

State failures are ownership failures wearing a tool costume

Most Terraform state incidents do not start with Terraform. They start with an emergency change made through the cloud console at 2am, a Helm-managed CRD upgrade that ArgoCD reconciled but Terraform never noticed, or a refactor that renamed a module without adding the matching moved blocks. In every case the proximate trigger is human pressure, but the structural cause is that nobody owns the boundary between "what is in code" and "what is in cloud".

Once state and reality diverge, teams overcompensate by freezing applies. That feels safe in the moment, but compounds risk: drift accumulates, security patches stop landing, and every future apply gets scarier. The job of recovery isn't to make the next apply perfect — it's to restore trust that terraform plan describes reality so the team can ship again.

  • Shared state files across unrelated services create blast radius.
  • Module renames without moved blocks create phantom destroy plans.
  • Manual changes through the cloud console are treated as exceptions instead of as drift signals.
  • Lock acquisition failures get silently force-unlocked without investigating the holder.

Recovery sequence

Run these seven steps in order

Skipping a step usually recreates the same incident class within 4-8 weeks.

flowchart TD
  A[Freeze risky applies] --> B[Snapshot state to local backup]
  B --> C[Inventory drift<br/>terraform plan -refresh-only]
  C --> D{State == cloud truth?}
  D -- No --> E[Reconcile<br/>terraform import / state mv / moved]
  D -- Yes --> F[Split high-coupling state]
  E --> F
  F --> G[Re-enable applies behind guardrails]
  G --> H[Install drift monitoring]

The decision point between steps 3 and 4 is where most recoveries either succeed or fail. If state and cloud agree, you have a coupling problem and can go straight to step 5. If they don't, you have a reconciliation problem and step 4 comes first. Mixing the two is the failure mode that turns a one-day fix into a one-month fix.

1. Freeze risky applies

Block CI applies for the affected workspace. Allow critical security patches only with explicit named-owner approval.
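One way to enforce the freeze is a small guard that CI calls before any apply. The sketch below is a minimal illustration, not a prescribed implementation: the function name apply_allowed and the variables FROZEN_WORKSPACES and APPROVED_BY are hypothetical, and how your CI system populates them is up to you.

```shell
# Hypothetical CI guard. All names here are illustrative assumptions:
#   FROZEN_WORKSPACES  space-separated list of frozen workspace names
#   APPROVED_BY        named owner approving an exception (empty if none)

apply_allowed() {
  local workspace="$1"
  local frozen
  # Word-splitting on the list is intentional.
  for frozen in ${FROZEN_WORKSPACES:-}; do
    if [ "$frozen" = "$workspace" ]; then
      # Frozen: only proceed when a named owner is on record.
      if [ -n "${APPROVED_BY:-}" ]; then
        echo "allow: exception approved by ${APPROVED_BY}"
        return 0
      fi
      echo "deny: ${workspace} is frozen"
      return 1
    fi
  done
  echo "allow: ${workspace} is not frozen"
  return 0
}
```

A pipeline step would then read something like apply_allowed "$WORKSPACE" && terraform apply, so a frozen workspace fails the job before Terraform starts.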

2. Snapshot state

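Run terraform state pull > backup-$(date -u +%FT%H%M%SZ).tfstate to save a timestamped copy of the state, then terraform plan -out=baseline.tfplan to record what Terraform would do before you change anything. Treat both files as sensitive: state and plan files can contain secrets in plain text.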
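A backup is only useful if it is complete, so it helps to wrap the pull in a check. The following is a sketch under stated assumptions: snapshot_state is a hypothetical helper name, and the test for a "serial" field is a cheap sanity check, not a full validation of the state file.

```shell
# snapshot_state DIR: pull the current state into DIR and refuse an
# empty or obviously partial backup. Prints the backup path on success.
snapshot_state() {
  local dir="${1:-.}"
  local stamp file
  stamp="$(date -u +%FT%H%M%SZ)"
  file="${dir}/backup-${stamp}.tfstate"

  # terraform state pull writes the current state to stdout.
  if ! terraform state pull > "$file"; then
    echo "state pull failed; do not continue" >&2
    rm -f "$file"
    return 1
  fi

  # Assumption: a usable state file is non-empty JSON with a "serial" field.
  if [ ! -s "$file" ] || ! grep -q '"serial"' "$file"; then
    echo "backup looks empty or partial: $file" >&2
    return 1
  fi

  chmod 600 "$file"   # state can contain secrets in plain text
  echo "$file"
}
```

If the function fails, treat that as the state backend corruption signal described under "When to escalate" and do not move on to step 3.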

3. Inventory drift

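terraform plan -refresh-only flags changes the cloud has made since the last apply to resources already tracked in state, without touching state. It does not see resources that exist only in the cloud; find those by comparing the provider's own inventory (console, CLI, or billing export) against terraform state list.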

4. Reconcile

Use terraform import for in-cloud-only resources, state mv for renames, moved {} blocks for module refactors. Never blanket-destroy.

5. Split coupled state

If one state file spans network + compute + data, break it into focused workspaces. Use terraform_remote_state data sources for cross-references.
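Before moving anything between states, it helps to write the sequence down. The sketch below only prints commands and runs nothing; print_split_plan is a hypothetical helper, and the import-then-rm approach is one possible way to split state, assuming applies are still frozen from step 1.

```shell
# print_split_plan: read "ADDRESS CLOUD_ID" pairs on stdin and print,
# without running anything, the commands that move each resource from
# the old state to a new one.
print_split_plan() {
  local addr id
  while read -r addr id; do
    # Skip blank lines and comment lines.
    case "$addr" in ''|'#'*) continue ;; esac
    # Import into the new state first, so the resource is never untracked.
    echo "(new workspace) terraform import '${addr}' ${id}"
    # Then stop tracking it in the old state. state rm removes only the
    # state entry; it does not delete the cloud resource.
    echo "(old workspace) terraform state rm '${addr}'"
  done
}
```

Feeding it a reviewed list, for example printf 'aws_db_instance.primary db-prod-primary-1\n' | print_split_plan, gives a script a second engineer can read before anyone runs it. The resource block has to move to the new workspace's configuration in the same change, otherwise the old workspace will plan to create the resource again.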

6. Re-enable applies

Reintroduce automation behind narrow scopes, two-reviewer gates on destroy plans, and a rollback runbook the on-call can read.

7. Install drift monitoring

Scheduled terraform plan in CI that opens a PR when drift is detected. Drift is a signal, not a noise source.
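A scheduled job can lean on terraform plan -detailed-exitcode, which exits 0 when there are no changes, 1 on error, and 2 when changes are present. The wrapper below is a minimal sketch; drift_status is a hypothetical name, and opening the PR is left out because it depends on your CI platform.

```shell
# drift_status: run a normal plan and translate the detailed exit code
# into a single word for the calling CI job.
#   0 -> no changes, 1 -> error, 2 -> changes present
drift_status() {
  local rc=0
  terraform plan -detailed-exitcode -input=false -no-color >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error"; return 1 ;;
  esac
}
```

The "drift" branch is where the job would open its PR or ticket. The "error" branch deserves its own alert, since a plan that cannot run is also a signal.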

Artifact

The reconciliation commands you actually run at step 4

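These four operations cover most step-4 work. To constrain blast radius, add -target=<address> to any plan or apply you run while reconciling; import and state mv already act on one address at a time and do not take -target.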

# 1. Inventory drift without modifying state.
terraform plan -refresh-only -out=drift.tfplan

# 2. Pull a resource that exists in cloud but not in state.
#    Find the cloud-native id (e.g. RDS DB identifier, EC2 instance id).
terraform import 'aws_db_instance.primary' db-prod-primary-1

# 3. Rename a resource address (module refactor, no destroy/create).
terraform state mv 'aws_db_instance.primary' 'module.data.aws_db_instance.primary'

# 4. Alternative to 3: declare the rename in code so apply does not
#    destroy/recreate. Put this in the .tf file, not at the CLI:
moved {
  from = aws_db_instance.primary
  to   = module.data.aws_db_instance.primary
}

Lock handling: if you must force-unlock, first identify the holder via the state backend. With S3 + DynamoDB, the lock is an item in the lock table keyed by LockID, and its Info attribute records who took the lock, when, and for which operation. Confirm the runner is actually dead (kubectl get pods, the GitHub Actions run page, etc.) before you run terraform force-unlock <lock-id>. Force-unlocking a lock that a still-running apply holds is the most common path to a corrupted state file.
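For the S3 + DynamoDB case, the lock item can be read with the AWS CLI before anyone reaches for force-unlock. This is a sketch under assumptions: lock_info is a hypothetical helper, the table, bucket, and key arguments are placeholders for your own backend settings, and it assumes the lock item's LockID takes the form "<bucket>/<key>".

```shell
# lock_info TABLE BUCKET KEY: print the Info attribute of the lock item.
# Assumes the S3 backend with a DynamoDB lock table, where the lock
# item's LockID is "<bucket>/<key>".
lock_info() {
  local table="$1" bucket="$2" key="$3"
  aws dynamodb get-item \
    --table-name "$table" \
    --key "{\"LockID\": {\"S\": \"${bucket}/${key}\"}}" \
    --query 'Item.Info.S' \
    --output text
}
```

The output is the lock metadata to cross-check against CI run state. Only once the holder is confirmed dead does terraform force-unlock become the next command.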

Artifact

The state triage matrix (use during step 3 → step 4 transition)

Resource status                Action
─────────────────────────────  ────────────────────────────────────
In cloud + in code             Validate field-level diffs, keep managed
In cloud only                  terraform import (or explicitly retire)
In state only (ghost)          Verify deletion, then terraform state rm
In wrong module address        terraform state mv, or a moved {} block in code
In two states (split-brain)    Pick authoritative state, remove from other
Lock held by dead runner       Verify dead → force-unlock → re-run plan
Lock held by live runner       Wait. Do NOT force-unlock.

The matrix keeps recovery deterministic. Without it, teams attempt large applies while still unsure what is authoritative, which is where most secondary incidents begin.
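The first three rows of the matrix reduce to two yes/no questions per resource, which makes them easy to encode. The function below is an illustrative sketch; the name triage and its yes/no arguments are assumptions, and "in state" stands in for the matrix's "in code" on the premise that managed resources appear in both.

```shell
# triage IN_CLOUD IN_STATE: each argument is "yes" or "no".
# Prints the matrix action for that combination.
triage() {
  case "$1/$2" in
    yes/yes) echo "validate field-level diffs, keep managed" ;;
    yes/no)  echo "terraform import (or explicitly retire)" ;;
    no/yes)  echo "verify deletion, then terraform state rm" ;;
    no/no)   echo "nothing to reconcile" ;;
    *)       echo "usage: triage yes|no yes|no" >&2; return 2 ;;
  esac
}
```

Running it over an inventory list produces a per-resource worksheet, which is easier to review than a single large plan.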

Common mistakes

The patterns that turn a one-day fix into a one-month fix

  • Force-unlocking without confirming the holder is dead. Inspect the lock metadata in DynamoDB (or the equivalent backend) before unlocking. Cross-check with CI run state.
  • Running terraform apply before terraform plan agrees with reality. If plan still shows surprise changes, you're not done with step 4.
  • Skipping the moved block on a module refactor. Terraform sees this as destroy + create — for stateful resources (RDS, EBS, persistent volumes) this is a data loss event.
  • Reconciling everything in one apply window. Stage the recovery: one workspace per session, each with a clear rollback plan. Sequence over speed.
  • Treating manual cloud-console fixes as acceptable long-term state. Every console-only change is a drift event that compounds. Reconcile within 48h or tag the resource as "unmanaged" explicitly.

When to escalate

Three signals that say stop self-help and call for help

  • State backend corruption — terraform state pull fails or returns partial JSON. Don't apply over it. Pull a known-good snapshot from versioned S3 and verify before doing anything else.
  • Plan diff > 100 resources after a refresh-only — at this point you're reconciling a fork in reality. Per-resource imports get error-prone fast; get a second pair of eyes and a written sequence before touching apply.
  • Stateful resources on the destroy list of a plan you do not understand: RDS, persistent volumes, KMS keys, IAM roles in use by Lambda functions. Stop. Do not apply until every line of the destroy plan is explained.
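The third signal can be checked mechanically before a human reads the plan. The sketch below filters the human-readable plan text for destroy and replace headers on a hand-picked set of resource types. The type list and the name stateful_destroys are assumptions, and matching on plan text is a rough filter; parsing terraform show -json output would be more robust.

```shell
# Resource types treated as stateful here are an illustrative assumption;
# extend the list for your own providers.
STATEFUL_TYPES='aws_db_instance|aws_ebs_volume|aws_kms_key|aws_iam_role|kubernetes_persistent_volume'

# stateful_destroys: read human-readable plan text on stdin and print the
# header lines for stateful resources that would be destroyed or replaced.
# Exits non-zero when nothing matches.
stateful_destroys() {
  grep -E '^ *# .* (will be destroyed|must be replaced)' \
    | grep -E "(^|[ .])(${STATEFUL_TYPES})\."
}
```

A typical call would be terraform show -no-color baseline.tfplan | stateful_destroys. Any output at all is a reason to stop and get every listed line explained.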

We do focused recovery engagements for exactly this profile — fragile state, fear of apply, blocked delivery. Request an infrastructure review and send the most recent terraform plan output (redact secrets); we'll come back within 24h with a named-failure-path read.