Most cloud estates do not start as a deliberate architecture. They accrete. Someone clicks through the console to unblock a launch, a security group is widened "just for the demo", and three years later nobody can say with confidence what is actually running or why. This article is about replacing that ClickOps reality with a reviewed, version-controlled, automated workflow built on Terraform, Terragrunt and Atlantis — and about the failure modes you will hit along the way.

The real cost of ClickOps

Manual change feels fast because the cost is deferred, not avoided. You pay it later as:

  • Drift. The console state and your source of truth diverge silently. The first time you discover the gap is usually during an incident.
  • No audit trail. "Who opened 0.0.0.0/0 on the RDS security group, and when?" CloudTrail can sometimes answer this, but it is forensic archaeology, not a review gate.
  • Snowflakes. Each environment becomes subtly unique, so staging stops predicting production and your testing guarantees evaporate.

The goal of IaC is not "we use Terraform". It is that every change to infrastructure is a reviewable diff, applied by automation, with a durable record of who approved it. If a human can still mutate production by hand, you have not finished.

Terraform foundations done right

Before reaching for orchestration tooling, get the primitives right. The two that teams most often botch are remote state and state isolation.

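State must live remotely with locking. On AWS that has traditionally meant S3 for the object and DynamoDB for the lock (recent Terraform releases can also lock natively in S3 through the backend's use_lockfile setting), with versioning and encryption enabled on the bucket so a corrupted apply is recoverable.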

terraform {
  backend "s3" {
    bucket         = "acme-tfstate-prod"
    key            = "network/terraform.tfstate"
    region         = "eu-west-2"
    dynamodb_table = "acme-tf-locks"
    encrypt        = true
  }
}
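
The backend block only points at the bucket and lock table; something still has to create them with versioning and encryption switched on. A minimal sketch of that bootstrap follows, assuming it lives in its own small configuration that is applied once before any other state is migrated. The resource labels (tfstate, tf_locks) and the choice of SSE-KMS with the AWS-managed key are illustrative assumptions, not part of the original setup.

```hcl
# Bootstrap for the state backend (hypothetical layout; apply once, separately).
resource "aws_s3_bucket" "tfstate" {
  bucket = "acme-tfstate-prod"
}

# Versioning lets you roll back to a previous state object after a bad apply.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state at rest by default (assumption: the AWS-managed KMS key is enough).
resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# State should never be public.
resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket                  = aws_s3_bucket.tfstate.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The S3 backend expects a string partition key named LockID on the lock table.
resource "aws_dynamodb_table" "tf_locks" {
  name         = "acme-tf-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Keeping this bootstrap tiny and rarely changed sidesteps most of the chicken-and-egg problem of storing the backend's own state.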

Never share one state file across environments. A single terraform.tfstate for dev and prod means a fat-fingered destroy in dev can take production with it, and one stuck lock blocks every team at once. Isolate state per environment and per logical component (network, data, compute).

Module design is the other lever. A good module exposes intent, not implementation, and validates its inputs.

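# modules/vpc/variables.tf
variable "name" {
  type        = string
  description = "Value for the VPC's Name tag."
}

variable "cidr_block" {
  type        = string
  description = "IPv4 CIDR range for the VPC."
  validation {
    # cidrhost() errors on malformed input; can() turns that error into false.
    condition     = can(cidrhost(var.cidr_block, 0))
    error_message = "cidr_block must be a valid CIDR."
  }
}

variable "azs" {
  type        = list(string)
  description = "Availability zones; each one gets a private subnet."
  default     = []
}

# modules/vpc/main.tf
resource "aws_vpc" "this" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true
  tags                 = { Name = var.name }
}

# One private subnet per AZ, keyed by AZ name. The list index picks the
# subnet's slice of the VPC range, so reordering var.azs changes CIDRs.
resource "aws_subnet" "private" {
  for_each          = { for i, az in var.azs : az => i }
  vpc_id            = aws_vpc.this.id
  availability_zone = each.key
  cidr_block        = cidrsubnet(var.cidr_block, 4, each.value)
}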
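
Exposing intent also applies to what the module returns. The original module does not show its outputs, so the file below is an assumption: a sketch of an outputs.tf whose names (vpc_id, private_subnet_ids) are chosen to match what the EKS component consumes in the Terragrunt example later in this article.

```hcl
# modules/vpc/outputs.tf (hypothetical file, not part of the original listing)
output "vpc_id" {
  description = "ID of the VPC created by this module."
  value       = aws_vpc.this.id
}

output "private_subnet_ids" {
  description = "IDs of the private subnets, one per availability zone."
  # aws_subnet.private is a map keyed by AZ name; collect the IDs into a list.
  value = [for subnet in aws_subnet.private : subnet.id]
}
```

Callers then depend on two stable names rather than on how the subnets happen to be built inside the module.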

Pin module and provider versions explicitly. Floating versions turn a routine plan into an unscheduled upgrade.
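
As a rough sketch of what explicit pinning can look like in a root module: the version numbers below are illustrative assumptions, not recommendations, and should be replaced with whatever you have actually tested.

```hcl
# versions.tf (version numbers are placeholders for illustration)
terraform {
  # Allow patch and minor releases within the 1.x line you have tested.
  required_version = "~> 1.9"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Pessimistic constraint: accepts 5.60 and later 5.x, never 6.0.
      version = "~> 5.60"
    }
  }
}
```

Committing the generated .terraform.lock.hcl alongside this records the exact provider builds that were selected, so CI and laptops resolve the same ones. Modules fetched from Git are pinned separately, with a ref on the source URL.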

The DRY problem at scale, and Terragrunt

Pure Terraform forces a choice between two bad options once you have many environments: copy the same backend and provider boilerplate into every directory, or build sprawling workspaces with conditionals. Both rot. Terragrunt fixes this by treating each leaf directory as a thin call to a versioned module, generating the repetitive bits, and wiring dependencies between components.

A typical hierarchy:

live/
  terragrunt.hcl              # root: remote state + provider generation
  prod/
    eu-west-2/
      vpc/terragrunt.hcl
      eks/terragrunt.hcl      # depends on vpc (dependency block)
  staging/
    eu-west-2/
      vpc/terragrunt.hcl

The root file generates backend and provider config so no child ever repeats it. Note how the state key is derived from the path, guaranteeing isolation:

# live/terragrunt.hcl
remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite_terragrunt" }
  config = {
    bucket         = "acme-tfstate-${get_aws_account_id()}"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "eu-west-2"
    encrypt        = true
    dynamodb_table = "acme-tf-locks"
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "eu-west-2"
  default_tags { tags = { ManagedBy = "terragrunt" } }
}
EOF
}

A child component consumes a module and declares dependencies. The dependency block lets EKS read the VPC's outputs without hardcoding IDs, and mock_outputs keeps plan working before the VPC exists:

# live/prod/eu-west-2/eks/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::git@github.com:acme/tf-modules.git//eks?ref=v2.4.0"
}

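dependency "vpc" {
  config_path = "../vpc"

  # Placeholder values used only while the VPC has no real outputs yet.
  mock_outputs = {
    vpc_id             = "vpc-mock"
    private_subnet_ids = ["subnet-mock"]
  }
  # Never let an apply run against mocks; fail instead if outputs are missing.
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}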

inputs = {
  cluster_name = "prod-euw2"
  vpc_id       = dependency.vpc.outputs.vpc_id
  subnet_ids   = dependency.vpc.outputs.private_subnet_ids
}
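
For completeness, the other side of that dependency might look like the sketch below. It assumes the VPC module shown earlier is published in the same tf-modules repository under a vpc subdirectory and tagged with the same release; the path, CIDR range and AZ list are hypothetical values chosen for illustration.

```hcl
# live/prod/eu-west-2/vpc/terragrunt.hcl (illustrative; values are assumptions)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  # Assumed location of the VPC module inside the shared modules repo.
  source = "git::git@github.com:acme/tf-modules.git//vpc?ref=v2.4.0"
}

# Inputs map one-to-one onto the module's variables.
inputs = {
  name       = "prod-euw2"
  cidr_block = "10.0.0.0/16"
  azs        = ["eu-west-2a", "eu-west-2b", "eu-west-2c"]
}
```

The leaf carries only what is unique to this environment: a module version and a handful of inputs. Backend and provider configuration come from the root file.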

Running terragrunt run-all plan from a parent directory then walks the dependency graph in order. Be deliberate with run-all apply: across a large graph it is powerful but blunt. Prefer scoping applies to the directory that changed.

A PR-driven workflow with Atlantis

Terragrunt gives you DRY; Atlantis gives you the review gate. Atlantis runs plan automatically when a pull request touches infrastructure, posts the plan as a comment, locks the affected directory so no other pull request can plan against it, and applies only when an authorised user comments atlantis apply on an approved pull request. In the default flow the apply runs from the pull request branch, before the merge.

# atlantis.yaml
version: 3
automerge: false
parallel_plan: true
parallel_apply: false
projects:
  - name: prod-vpc
    dir: live/prod/eu-west-2/vpc
    workflow: terragrunt
    autoplan:
      when_modified: ["*.hcl", "../*.hcl", "../../*.hcl", "../../../*.hcl"]  # this dir plus parent configs, not sibling components
    apply_requirements: [approved, mergeable]
workflows:
  terragrunt:
    plan:
      steps:
        - run: terragrunt plan -out=$PLANFILE
    apply:
      steps:
        - run: terragrunt apply $PLANFILE

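The apply_requirements: [approved, mergeable] line is the heart of the control: no apply happens without a human approval and a mergeable PR. Atlantis honours the workflow and apply_requirements keys, and custom workflow definitions, in a repo-level atlantis.yaml only when the server-side repo config permits them through allowed_overrides and allow_custom_workflows. Defining them server-side instead also stops a pull request from weakening its own gate. Locking per directory prevents two PRs from racing on the same state.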

Concern           ClickOps         Plain Terraform + CI  Terragrunt + Atlantis
Audit trail       CloudTrail only  Git history           Git + plan/apply comments
Drift visibility  None             On next plan          Scheduled drift checks
DRY across envs   N/A              Poor                  Strong
Apply gate        None             Pipeline-defined      Approval + lock per dir
Blast radius      Whole account    Per state             Per directory

Run Atlantis with a narrowly scoped IAM role and require approval from someone other than the PR author. An apply bot that can do anything is a single compromised webhook away from being your worst day.
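
One way to start narrowing that role is to grant state access explicitly rather than through a broad managed policy. The sketch below covers only the backend permissions and assumes the bucket and table names used earlier; the policy name is hypothetical, and the permissions each component needs to manage its own resources would be added separately and just as narrowly.

```hcl
# State-backend permissions for the Atlantis role (sketch; names are assumptions).
data "aws_iam_policy_document" "atlantis_state" {
  statement {
    sid       = "ListStateBucket"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::acme-tfstate-prod"]
  }

  statement {
    sid       = "ReadWriteStateObjects"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::acme-tfstate-prod/*"]
  }

  statement {
    sid = "StateLocking"
    actions = [
      "dynamodb:DescribeTable",
      "dynamodb:GetItem",
      "dynamodb:PutItem",
      "dynamodb:DeleteItem",
    ]
    resources = ["arn:aws:dynamodb:eu-west-2:*:table/acme-tf-locks"]
  }
}

resource "aws_iam_policy" "atlantis_state" {
  name   = "atlantis-state-access" # hypothetical name
  policy = data.aws_iam_policy_document.atlantis_state.json
}
```

If the bucket is encrypted with a customer-managed KMS key, the role also needs decrypt and data-key permissions on that key. Note the absence of s3:DeleteObject: with versioning on, the bot can write new state but cannot erase history.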

Policy as code and drift detection

Approval catches what reviewers notice; policy catches what they miss. Gate plans with OPA via conftest (or Sentinel on Terraform Cloud) by evaluating the plan JSON before apply.

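# policy/security.rego
package main

import rego.v1

# Deny standalone ingress rules that expose SSH (port 22) to the world.
# Inline ingress blocks and all-protocol (-1) rules need separate checks.
deny contains msg if {
  some r in input.resource_changes
  r.type == "aws_security_group_rule"
  r.change.after.type == "ingress"
  "0.0.0.0/0" in r.change.after.cidr_blocks
  r.change.after.from_port <= 22
  r.change.after.to_port >= 22
  msg := sprintf("%s: SSH from 0.0.0.0/0 is forbidden", [r.address])
}

# Run from the component directory. terragrunt show reads the plan from the
# same working directory that terragrunt plan wrote it to.
terragrunt plan -out=tf.plan
terragrunt show -json tf.plan > plan.json
conftest test plan.json --policy policy/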

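Schedule terragrunt run-all plan --terragrunt-non-interactive nightly and alert on any non-empty diff. Terraform's -detailed-exitcode flag makes plan exit with status 2 when changes are pending, which gives the job a machine-readable signal instead of log scraping. Drift detection is the most direct signal that someone has quietly bypassed the workflow.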

Secrets handling

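Plaintext secrets in .tfvars or, worse, in state, are the most common and most damaging mistake. State is sensitive by definition: every resource attribute and data source result Terraform reads is stored in it, and marking a value sensitive only hides it from CLI output.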

  • SSM Parameter Store / Secrets Manager as the source of truth, read at apply time via data sources.
  • SOPS for encrypting values committed to Git, decrypted by KMS during the run.
  • Vault where you need dynamic, short-lived credentials.

data "aws_ssm_parameter" "db_password" {
  name            = "/prod/rds/password"
  with_decryption = true
}

Even with these patterns, secrets resolved during apply persist in state. Lock down the state bucket with bucket policies and KMS, restrict who can run terraform show, and treat the state backend as a tier-one secret store.
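
A bucket policy is one place to enforce that. The sketch below assumes the prod state bucket named earlier and introduces a hypothetical variable, state_reader_role_arns, listing the roles allowed to read state. Explicit denies are easy to lock yourself out with, so treat this as a starting point to test in a non-production account rather than a finished policy.

```hcl
# Roles permitted to read state objects (hypothetical variable for illustration).
variable "state_reader_role_arns" {
  type        = list(string)
  description = "IAM role ARNs allowed to read Terraform state."
}

data "aws_iam_policy_document" "tfstate_bucket" {
  # Refuse any request that does not use TLS.
  statement {
    sid     = "DenyInsecureTransport"
    effect  = "Deny"
    actions = ["s3:*"]
    resources = [
      "arn:aws:s3:::acme-tfstate-prod",
      "arn:aws:s3:::acme-tfstate-prod/*",
    ]
    principals {
      type        = "*"
      identifiers = ["*"]
    }
    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["false"]
    }
  }

  # Deny state reads to every principal that is not on the allow list.
  statement {
    sid       = "DenyReadsOutsideAllowList"
    effect    = "Deny"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::acme-tfstate-prod/*"]
    principals {
      type        = "*"
      identifiers = ["*"]
    }
    condition {
      test     = "ArnNotLike"
      variable = "aws:PrincipalArn"
      values   = var.state_reader_role_arns
    }
  }
}

resource "aws_s3_bucket_policy" "tfstate" {
  bucket = "acme-tfstate-prod"
  policy = data.aws_iam_policy_document.tfstate_bucket.json
}
```

The allow list would normally contain the Atlantis role and a break-glass role, and nothing else.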

Sequencing the migration

Do not big-bang this. Stand up the state backend and one module first. Import an existing low-risk component with terraform import (or import blocks) and reconcile the plan to zero diff before touching anything live. Layer Terragrunt once you have two or more environments to keep DRY. Add Atlantis only after a clean plan/apply cycle works locally, and freeze console write access last — once, and only once, the automated path is trusted.
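
As an illustration of the import step, an import block (available from Terraform 1.5) adopts an existing resource during a normal plan and apply, which means the import itself goes through review. The VPC ID, CIDR range and tag below are hypothetical stand-ins for whatever the console-built resource actually has.

```hcl
# Adopt a console-created VPC into state (ID and attributes are hypothetical).
import {
  to = aws_vpc.this
  id = "vpc-0123456789abcdef0"
}

# Written to match the live resource so the first plan shows an import and
# no changes. Adjust attributes until the plan reports nothing to modify.
resource "aws_vpc" "this" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  tags                 = { Name = "legacy-main" }
}
```

If hand-writing the resource is tedious, terraform plan -generate-config-out=generated.tf can draft it from the live object for you to tidy up. Once the import has been applied, the import block can be deleted.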


Moving a live estate from ClickOps to GitOps without an outage is a sequencing problem as much as a tooling one. Talk to i2zone and we will design the state isolation, module boundaries and Atlantis controls that fit your account topology — and run the migration with zero-diff imports so nothing breaks on the way.