A default Kubernetes cluster is a soft target: anonymous API access in some distributions, containers running as root with the full Linux capability set, flat pod networking where any compromised workload can reach any other, and no verification of where your images came from. Production hardening is not a single setting — it is defence in depth across four layers: the cluster control plane, the workloads, the supply chain that produces those workloads, and the runtime. Compromise of one layer should not hand an attacker the rest.
Cluster hardening
Start at the control plane. Benchmark every cluster against the CIS Kubernetes Benchmark using kube-bench, and treat failures as findings to remediate, not noise to ignore.
# Run CIS benchmark checks against the cluster
kube-bench run --targets master,node,etcd,policies \
  --benchmark cis-1.9 --json | jq '.Totals'
The non-negotiables for the API server:
- No anonymous auth: set --anonymous-auth=false. Unauthenticated requests should be rejected outright.
- RBAC, least privilege: no wildcard verbs/resources in production roles, no binding users to cluster-admin, and scope ServiceAccount permissions to exactly what each workload needs.
- Audit logging: enable an audit policy so every API call is recorded. Without it you cannot investigate a breach.
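# Minimal API server audit policy: metadata for every request, never bodies for Secrets
apiVersion: audit.k8s.io/v1
kind: Policy
# Skip the RequestReceived stage to avoid duplicate entries per request
omitStages: ["RequestReceived"]
rules:
  # Keep Secrets and ConfigMaps at Metadata: RequestResponse would write
  # secret values into the audit log itself
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Catch-all: record who did what and when for everything else.
  # If this level is raised later, the rule above still protects Secrets.
  - level: Metadata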
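The policy has no effect until the API server is pointed at it. As a rough sketch, assuming a kubeadm-style static Pod manifest, the wiring might look like the fragment below. The file paths and retention values are illustrative assumptions, not recommendations from this article; managed control planes usually expose audit logging through their own settings instead.

```yaml
# Fragment of a kube-apiserver static Pod manifest (kubeadm-style layout assumed).
# Paths and retention values are hypothetical; merge into the existing manifest.
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        # ...existing flags stay as they are...
        - --audit-policy-file=/etc/kubernetes/audit/policy.yaml
        - --audit-log-path=/var/log/kubernetes/audit/audit.log
        - --audit-log-maxage=30       # days to keep rotated log files
        - --audit-log-maxbackup=10    # number of rotated files to keep
        - --audit-log-maxsize=100     # size in MB before a file is rotated
      volumeMounts:
        - name: audit-policy
          mountPath: /etc/kubernetes/audit
          readOnly: true
        - name: audit-log
          mountPath: /var/log/kubernetes/audit
  volumes:
    - name: audit-policy
      hostPath:
        path: /etc/kubernetes/audit
        type: Directory
    - name: audit-log
      hostPath:
        path: /var/log/kubernetes/audit
        type: DirectoryOrCreate
```

Shipping the resulting log file off the control-plane node matters as much as writing it, since an attacker who reaches the node can otherwise tamper with the evidence.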
Audit RBAC regularly — permissions accrete over time. kubectl auth can-i --list --as=system:serviceaccount:default:my-sa quickly shows what a given identity can actually do.
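To make the least-privilege guidance concrete, here is a minimal sketch of a narrowly scoped ServiceAccount. The names (api, api-config, api-config-reader) and the choice of a single ConfigMap are hypothetical; the point is one namespace, one resource, one verb.

```yaml
# Hypothetical example: the "api" workload may read exactly one ConfigMap.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
  namespace: payments
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role                      # namespaced, not ClusterRole
metadata:
  name: api-config-reader
  namespace: payments
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["api-config"]   # a single named object, no wildcards
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-config-reader
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: api
    namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: api-config-reader
```

Running the kubectl auth can-i --list check against system:serviceaccount:payments:api should then show only that one permission on top of the defaults granted to every authenticated identity.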
Workload hardening
A container that runs as root with a writable root filesystem and all capabilities is one escape away from owning the node. Enforce a locked-down securityContext on every pod:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api                            # must match spec.selector
    spec:
      automountServiceAccountToken: false   # off unless the pod calls the API
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        fsGroup: 10001
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: api
          image: registry.example.com/api@sha256:...   # pin by digest
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            privileged: false
            capabilities:
              drop: ["ALL"]
          resources:
            limits: { cpu: "1", memory: "512Mi" }
            requests: { cpu: "250m", memory: "256Mi" }
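A read-only root filesystem often breaks applications that expect to write temporary files. One common workaround, sketched here as a fragment to merge into the Deployment above, is to mount a small emptyDir at the paths that need to be writable. The /tmp path and the 64Mi limit are assumptions; the right values depend on what the application writes.

```yaml
# Fragment to merge into the Deployment: a writable scratch volume for /tmp.
# Mount path and size limit are hypothetical.
spec:
  template:
    spec:
      containers:
        - name: api
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir:
            sizeLimit: 64Mi   # pod is evicted if the volume grows past this
```

Everything outside the mounted path stays immutable, so an attacker who gains code execution still cannot modify the application binaries or drop tools into system directories.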
Enforce this cluster-wide rather than trusting every author to remember it. The built-in Pod Security Admission controller enforces the Pod Security Standards; a namespace label applies the restricted profile:
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
The restricted profile rejects privileged pods, host namespace sharing, and root execution — closing the most common workload misconfigurations by default.
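Namespace labels only protect the namespaces that carry them. Where you control the API server flags, one way to cover unlabelled namespaces is a cluster-wide default for the PodSecurity admission plugin, passed in with --admission-control-config-file. The sketch below is an assumption about sensible defaults (baseline enforced, restricted audited and warned, kube-system exempt), not a setting taken from this article.

```yaml
# Cluster-wide Pod Security defaults for namespaces without explicit labels.
# Levels and the kube-system exemption are illustrative choices.
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: PodSecurity
    configuration:
      apiVersion: pod-security.admission.config.k8s.io/v1
      kind: PodSecurityConfiguration
      defaults:
        enforce: "baseline"
        enforce-version: "latest"
        audit: "restricted"
        audit-version: "latest"
        warn: "restricted"
        warn-version: "latest"
      exemptions:
        usernames: []
        runtimeClasses: []
        namespaces: ["kube-system"]
```

Namespace labels still override these defaults, so a namespace such as payments can opt in to restricted enforcement while the rest of the cluster catches up.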
Network policies with default-deny
By default, every pod can talk to every other pod. That flat network means one compromised front-end can reach your database directly. Adopt a default-deny posture per namespace, then allow only required flows.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}   # every pod in the namespace
  policyTypes: [Ingress, Egress]
  # no ingress/egress rules = deny all
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow
  namespace: payments
spec:
  podSelector:
    matchLabels: { app: api }
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector: { matchLabels: { app: gateway } }
      ports:
        - { protocol: TCP, port: 8080 }
  egress:
    - to:
        - podSelector: { matchLabels: { app: postgres } }
      ports:
        - { protocol: TCP, port: 5432 }
    - to:   # allow DNS
        # namespaceSelector and podSelector in one entry are ANDed:
        # kube-dns pods in any namespace
        - namespaceSelector: {}
          podSelector: { matchLabels: { k8s-app: kube-dns } }
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }   # DNS falls back to TCP for large responses
Default-deny that forgets DNS egress will break every pod in the namespace. Always explicitly allow port 53 to kube-dns when you lock down egress.
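Default-deny applies in both directions, so an egress rule on the api pods is only half of a flow: the destination also needs an ingress rule. Assuming the database pods run in the same namespace with the label app: postgres, as the egress rule above implies, the matching policy might look like this sketch.

```yaml
# Counterpart to the api egress rule: let postgres accept connections from api only.
# Assumes postgres pods live in "payments" and carry the label app: postgres.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-allow
  namespace: payments
spec:
  podSelector:
    matchLabels: { app: postgres }
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector: { matchLabels: { app: api } }
      ports:
        - { protocol: TCP, port: 5432 }
```

Because default-deny-all still selects the postgres pods for egress, the database can answer established connections but cannot initiate new ones, which is usually what you want for a data store.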
Secrets
Kubernetes Secrets are base64-encoded and, by default, stored unencrypted in etcd, so anyone with etcd access or permission to read Secrets in a namespace can decode them. Two things are mandatory:
- Encryption at rest with KMS envelope encryption. Configure an EncryptionConfiguration with a KMS provider (AWS KMS, Azure Key Vault, GCP KMS) so the data encryption key is itself wrapped by a key you control and rotate.
- Never store secrets in Git, plaintext manifests, or ConfigMaps. Pull them from an external store at runtime using the External Secrets Operator, which syncs from AWS Secrets Manager, Vault, or Azure Key Vault into Kubernetes Secrets without the secret material living in your repo.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: payments
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: db-credentials
  data:
    - secretKey: password
      remoteRef:
        key: prod/payments/db
        property: password
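For the first requirement, encryption at rest, a self-managed control plane takes an EncryptionConfiguration file via the API server's --encryption-provider-config flag. The sketch below assumes a KMS v2 plugin is already running on the control-plane node; the plugin name and socket path are hypothetical, and managed services generally expose the same capability as a cluster setting instead.

```yaml
# Envelope encryption for Secrets via a KMS v2 plugin.
# Plugin name and socket path are placeholders for your provider's plugin.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      # First provider is used for writes
      - kms:
          apiVersion: v2
          name: example-kms-plugin
          endpoint: unix:///var/run/kmsplugin/socket.sock
          timeout: 3s
      # Fallback so Secrets written before encryption was enabled stay readable
      - identity: {}
```

Existing Secrets are only encrypted when they are next written, so enabling the provider is normally followed by rewriting every Secret once.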
Supply chain
You must trust what you run. Three controls form the chain:
- Scan images for known CVEs with Trivy in CI, and fail the build on high/critical findings.
- Sign images with cosign so their provenance is verifiable.
- Enforce admission control so that anything unsigned, unscanned, or risky is rejected before it is admitted.
# Scan and fail the pipeline on serious vulnerabilities
trivy image --severity HIGH,CRITICAL --exit-code 1 \
  registry.example.com/api:$GIT_SHA

# Sign the image after a clean scan (keyless, OIDC-backed)
cosign sign --yes registry.example.com/api@sha256:$DIGEST
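In a pipeline, the two commands above typically run as consecutive steps after the image is built. The following is a rough sketch of how that could look as a GitHub Actions workflow. The registry credentials (REGISTRY_USER, REGISTRY_TOKEN), the job layout, and the pinned action versions are assumptions for illustration and should be checked against the versions you actually use.

```yaml
# .github/workflows/release.yml (sketch; secret names and action versions are assumptions)
name: release
on:
  push:
    branches: [main]
permissions:
  contents: read
  id-token: write   # lets cosign request an OIDC token for keyless signing
jobs:
  build-scan-sign:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: registry.example.com
          username: ${{ secrets.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_TOKEN }}
      - id: build
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: registry.example.com/api:${{ github.sha }}
      # Fail the job on HIGH/CRITICAL findings before anything is signed
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: registry.example.com/api@${{ steps.build.outputs.digest }}
          severity: HIGH,CRITICAL
          exit-code: "1"
      - uses: sigstore/cosign-installer@v3
      # Sign by digest so the signature refers to exactly the scanned image
      - run: cosign sign --yes registry.example.com/api@${{ steps.build.outputs.digest }}
```

In this layout the image is pushed before it is scanned, so an image that fails the scan exists in the registry but is never signed. That is only safe once the cluster refuses unsigned images at admission.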
Then enforce verification at admission. A Kyverno policy that rejects images lacking a valid signature:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: require-signed-images
      match:
        any:
          - resources:
              kinds: [Pod]
      verifyImages:
        - imageReferences: ["registry.example.com/*"]
          attestors:
            - entries:
                - keyless:
                    issuer: "https://token.actions.githubusercontent.com"
                    subject: "https://github.com/example/api/.github/workflows/release.yml@refs/heads/main"
Either Kyverno or OPA Gatekeeper works here; Kyverno's native image-verification rules are more ergonomic for cosign, while Gatekeeper's Rego is more flexible for arbitrary policy. Pick one and standardise.
| Control | Tool | Stops |
|---|---|---|
| Vulnerability scanning | Trivy | Running known-vulnerable images |
| Image signing/verify | cosign + Kyverno | Tampered or untrusted images |
| Policy admission | Kyverno / Gatekeeper | Misconfigured or non-compliant resources |
| Runtime detection | Falco | Live exploitation after admission |
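For the policy admission row, a general-purpose validation rule looks much like the image-verification policy. As an illustrative sketch modelled on Kyverno's commonly published sample policies, the rule below flags Pods that use the mutable :latest tag. The policy name and message are hypothetical, and it starts in Audit so it reports without blocking.

```yaml
# Sketch: report Pods whose containers use the mutable ":latest" tag.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Audit   # switch to Enforce once violations are fixed
  background: true                 # also evaluate existing resources
  rules:
    - name: require-image-tag-not-latest
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Images must not use the mutable ':latest' tag."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```

Images pinned by digest pass this rule, which fits the digest-pinning practice recommended for workloads.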
Runtime detection and node hardening
Admission control stops bad things being admitted; it cannot see what a workload does once running. Falco watches kernel syscalls and alerts on suspicious behaviour — a shell spawned in a container, an unexpected outbound connection, a write to a sensitive path. Wire its alerts into the same on-call pipeline as everything else.
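As an illustration of the first of those behaviours, a custom Falco rule for a shell spawned in a container could look like the sketch below. It assumes the default Falco ruleset is loaded first, since it relies on that ruleset's spawned_process and container macros, and that Kubernetes metadata enrichment is enabled. The rule name, shell list, and namespace filter are hypothetical.

```yaml
# Custom Falco rule (sketch). Relies on the spawned_process and container
# macros from Falco's default rules file being loaded before this one.
- rule: Shell spawned in payments container
  desc: A shell process started inside a container in the payments namespace
  condition: >
    spawned_process and container
    and proc.name in (bash, sh, zsh, ash, dash)
    and k8s.ns.name = "payments"
  output: >
    Shell spawned in container
    (user=%user.name command=%proc.cmdline pod=%k8s.pod.name
    ns=%k8s.ns.name image=%container.image.repository)
  priority: WARNING
  tags: [container, shell]
```

Scoping the rule to one namespace keeps the alert volume manageable while you tune it; legitimate entrypoint scripts that invoke a shell are the usual source of false positives.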
Finally, harden the nodes themselves: use a minimal, hardened OS image, keep the kernel and kubelet patched, disable the read-only kubelet port, restrict SSH, and run nodes in private subnets behind a bastion or session-manager equivalent. A workload's blast radius is bounded by how locked down the host underneath it is.
Treat nodes as cattle, not pets. Patch by replacing the node image and rolling the fleet, not by SSHing in to run apt upgrade. An immutable, frequently-recycled node is both easier to patch and harder to persist on.
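Several of the kubelet-level items above can be expressed in the kubelet's configuration file. The fragment below is a minimal sketch to merge into an existing KubeletConfiguration, not a complete file; how that file is delivered to nodes depends on your distribution or node image tooling.

```yaml
# Fragment of a kubelet configuration file; merge into the node's existing config.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
readOnlyPort: 0          # disable the unauthenticated read-only port
authentication:
  anonymous:
    enabled: false       # reject unauthenticated requests to the kubelet API
  webhook:
    enabled: true        # validate bearer tokens against the API server
authorization:
  mode: Webhook          # delegate authorization decisions to the API server
```

Baking this into the node image, rather than editing it on running hosts, keeps the setting consistent every time the fleet is rolled.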
Sequencing the rollout
Do not try to land all of this at once on a live cluster — you will break workloads and erode trust in the programme. Sequence it. Begin in audit or warn mode for both Pod Security Standards and your admission policies (validationFailureAction: Audit in Kyverno) so you can see what would be blocked without blocking it. Triage the violations, fix the offending workloads, and only then flip to Enforce. Apply the same discipline to network policies: deploy default-deny in a non-production namespace first, confirm DNS and inter-service traffic still flow, and roll forward namespace by namespace. The goal is a cluster that is locked down by default and stays that way as new workloads arrive — enforcement is what makes hardening durable rather than a snapshot that decays the moment the next team deploys.
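For the Pod Security side of that rollout, the observe-first stage can be expressed with namespace labels alone. In this sketch the namespace name is hypothetical; the enforce label is left off deliberately so nothing is rejected yet.

```yaml
# Stage 1: observe only. Violations appear as API warnings and audit annotations.
apiVersion: v1
kind: Namespace
metadata:
  name: payments-staging   # hypothetical non-production namespace
  labels:
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest
    # Stage 2, once violations are fixed, add:
    # pod-security.kubernetes.io/enforce: restricted
```

The warn label surfaces problems to whoever runs kubectl apply, while the audit label records them in the API server audit log for later triage.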
Defence in depth is a programme, not a one-off audit. If you want i2zone to benchmark, harden, and continuously enforce security across your Kubernetes estate, speak to our team.