After deploying Apache Airflow on Kubernetes across AWS EKS,
Google GKE, and Azure AKS in production environments processing billions of tasks annually, I’ve learned that the
KubernetesExecutor fundamentally changes how you architect data pipelines. This article shares battle-tested
patterns for running Airflow at scale on Kubernetes, from infrastructure setup to cost optimization.
1. Why Kubernetes for Airflow?
The KubernetesExecutor offers advantages that CeleryExecutor and LocalExecutor can’t match:
- Task-Level Isolation: Each task runs in its own pod with dedicated resources
- Auto-Scaling: Pods scale up/down automatically based on workload
- Resource Efficiency: No idle worker processes consuming resources
- Cost Optimization: Pay only for pods actively running tasks
- Dependency Isolation: Different tasks can have different Python dependencies
- Multi-Tenancy: Separate namespaces for different teams/environments
In my deployments, KubernetesExecutor reduced infrastructure costs by 40-60% compared to CeleryExecutor, while
improving task isolation and reliability.
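To make the first few bullets concrete: with the KubernetesExecutor, any single task can override its pod spec through executor_config. A minimal sketch (assuming the kubernetes Python client, which the KubernetesExecutor already requires, and reusing a hypothetical worker image like those shown later in this article):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s

# Override only this task's pod: its own image and right-sized resources,
# while every other task keeps the default worker pod template.
heavy_pod = k8s.V1Pod(
    spec=k8s.V1PodSpec(
        containers=[
            k8s.V1Container(
                name="base",  # the KubernetesExecutor's task container is named "base"
                image="my-registry.io/airflow-data-worker:latest",  # hypothetical image
                resources=k8s.V1ResourceRequirements(
                    requests={"cpu": "2", "memory": "4Gi"},
                    limits={"cpu": "4", "memory": "8Gi"},
                ),
            )
        ]
    )
)

with DAG("isolation_demo", start_date=datetime(2025, 1, 1), schedule=None, catchup=False) as dag:
    PythonOperator(
        task_id="heavy_transform",
        python_callable=lambda: print("runs in its own dedicated pod"),
        executor_config={"pod_override": heavy_pod},
    )
```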
Figure 1: Airflow on Kubernetes Production Architecture
2. Architecture: Production Kubernetes Deployment
2.1 Core Components
A production Airflow on Kubernetes deployment consists of:
- Scheduler Pod(s): Parse DAGs, create task pods
- Web Server Pod(s): Serve UI, handle API requests
- Worker Pods (ephemeral): Execute tasks, terminated after completion
- PostgreSQL Database: Metadata store (RDS/Cloud SQL recommended)
- Redis (optional): For Celery if using CeleryKubernetesExecutor
- DAG Storage: Git-sync or PersistentVolume
- Logs Storage: S3/GCS/Azure Blob for centralized logging
2.2 Helm Chart Deployment
The official Airflow Helm chart is the recommended deployment method:
```bash
# Add Airflow Helm repository
helm repo add apache-airflow https://airflow.apache.org
helm repo update

# Create namespace
kubectl create namespace airflow

# Install Airflow
helm install airflow apache-airflow/airflow \
  --namespace airflow \
  --values production-values.yaml
```
2.3 Production values.yaml
```yaml
# production-values.yaml

# Executor configuration
executor: "KubernetesExecutor"

# Airflow configuration (rendered into airflow.cfg)
config:
  core:
    parallelism: 128
    max_active_tasks_per_dag: 32   # renamed from dag_concurrency in Airflow 2.2+
    max_active_runs_per_dag: 16
  kubernetes_executor:
    namespace: airflow
    worker_container_repository: my-registry.io/airflow-worker
    worker_container_tag: "2.8.0"
    delete_worker_pods: "True"
    delete_worker_pods_on_failure: "False"  # Keep failed pods for debugging
    worker_pods_creation_batch_size: 16
    multi_namespace_mode: "False"
  logging:
    # Remote logging to S3 (the logs section below only controls log persistence)
    remote_logging: "True"
    remote_base_log_folder: "s3://my-airflow-logs/logs"
    remote_log_conn_id: "aws_default"

# Scheduler
scheduler:
  replicas: 2  # HA scheduler
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "8Gi"

# Web server
webserver:
  replicas: 2
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
  service:
    type: LoadBalancer  # Or use Ingress

# PostgreSQL (use external RDS/Cloud SQL in production)
postgresql:
  enabled: false
data:
  # The secret must contain the full connection string under the key `connection`, e.g.
  # postgresql://airflow:<password>@airflow-db.us-east-1.rds.amazonaws.com:5432/airflow
  metadataSecretName: airflow-db-secret

# DAG repository via Git-sync
dags:
  gitSync:
    enabled: true
    repo: https://github.com/yourorg/airflow-dags.git
    branch: main
    rev: HEAD
    depth: 1
    wait: 60  # Poll interval in seconds
    sshKeySecret: airflow-git-secret  # applies when the repo URL uses SSH

# Logs (persistence disabled; remote logging configured above)
logs:
  persistence:
    enabled: false

# Resource quotas for worker pods
workers:
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
  persistence:
    enabled: false

# RBAC
rbac:
  create: true
  createSCCRoleBinding: false
```
Figure 2: GitOps Deployment Pipeline
3. Advanced Patterns: Custom Worker Images
Different DAGs often require different dependencies. Use custom worker images per task:
3.1 Multiple Worker Images
```python
# dags/ml_pipeline.py
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

with DAG(
    'ml_pipeline',
    start_date=datetime(2025, 1, 1),
    schedule='@daily',
    catchup=False,
) as dag:

    # Data preprocessing with pandas/polars
    preprocess = KubernetesPodOperator(
        task_id='preprocess_data',
        name='preprocess-data',
        namespace='airflow',
        image='my-registry.io/airflow-data-worker:latest',
        cmds=['python'],
        arguments=['/scripts/preprocess.py'],
        env_vars={
            'EXECUTION_DATE': '{{ ds }}',
            'S3_BUCKET': 'my-data-bucket',
        },
        container_resources=k8s.V1ResourceRequirements(
            requests={'cpu': '2', 'memory': '4Gi'},
            limits={'cpu': '4', 'memory': '8Gi'},
        ),
        is_delete_operator_pod=True,
        get_logs=True,
    )

    # Model training with PyTorch on GPU
    train_model = KubernetesPodOperator(
        task_id='train_model',
        name='train-model',
        namespace='airflow',
        image='my-registry.io/airflow-ml-gpu:latest',
        cmds=['python'],
        arguments=['/scripts/train.py'],
        env_vars={
            'MODEL_VERSION': '{{ execution_date }}',
            'S3_MODEL_BUCKET': 'my-models-bucket',
        },
        container_resources=k8s.V1ResourceRequirements(
            requests={'cpu': '4', 'memory': '16Gi'},
            limits={'cpu': '8', 'memory': '32Gi'},
        ),
        node_selector={'node.kubernetes.io/instance-type': 'p3.2xlarge'},  # GPU nodes
        is_delete_operator_pod=True,
        get_logs=True,
    )

    preprocess >> train_model
```
3.2 Dockerfile for Custom Workers
```dockerfile
# Dockerfile.ml-worker
FROM apache/airflow:2.8.0-python3.11

# Install system build dependencies
USER root
RUN apt-get update && apt-get install -y \
        build-essential \
        libgomp1 \
    && apt-get clean

USER airflow

# Install Python packages
COPY requirements-ml.txt .
RUN pip install --no-cache-dir -r requirements-ml.txt

# requirements-ml.txt:
# pandas==2.1.4
# polars==0.20.3
# torch==2.1.0
# transformers==4.36.0
# scikit-learn==1.4.0
```
4. Auto-Scaling Strategies
4.1 Cluster Auto-Scaler
Configure Kubernetes Cluster Autoscaler to add/remove nodes based on pending pods:
```yaml
# AWS EKS node groups with auto-scaling
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: airflow-cluster
  region: us-east-1

nodeGroups:
  # General purpose nodes for scheduler/web
  - name: airflow-core
    instanceType: m5.xlarge
    desiredCapacity: 3
    minSize: 2
    maxSize: 5
    labels:
      workload: airflow-core

  # Worker nodes for task execution
  - name: airflow-workers
    instanceType: c5.2xlarge
    desiredCapacity: 5
    minSize: 2
    maxSize: 50  # Auto-scale based on demand
    labels:
      workload: airflow-workers
    taints:
      - key: workload
        value: airflow-workers
        effect: NoSchedule

  # GPU nodes for ML workloads
  - name: ml-gpu-workers
    instanceType: p3.2xlarge
    desiredCapacity: 0  # Scale from zero
    minSize: 0
    maxSize: 10
    labels:
      workload: ml-gpu
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
```
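Because the worker and GPU node groups above carry NoSchedule taints, task pods must tolerate those taints (and select the matching label) before the Cluster Autoscaler will scale the group up on their behalf. A short sketch with KubernetesPodOperator; the image and script path are placeholders:

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

# Attach inside a `with DAG(...)` block as usual.
on_worker_pool = KubernetesPodOperator(
    task_id="runs_on_worker_pool",
    name="runs-on-worker-pool",
    namespace="airflow",
    image="my-registry.io/airflow-data-worker:latest",  # placeholder image
    cmds=["python", "/scripts/job.py"],                 # placeholder script
    node_selector={"workload": "airflow-workers"},      # label from the node group above
    tolerations=[
        k8s.V1Toleration(
            key="workload",
            operator="Equal",
            value="airflow-workers",
            effect="NoSchedule",
        )
    ],
    get_logs=True,
)
```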
4.2 Horizontal Pod Autoscaler for Schedulers
```yaml
# scheduler-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: airflow-scheduler
  namespace: airflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: airflow-scheduler
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
Figure 3: Cost Comparison Analysis
5. Multi-Environment Setup
Production Airflow requires separate environments for dev, staging, and production:
5.1 Namespace-Based Isolation
```bash
# Create environments
kubectl create namespace airflow-dev
kubectl create namespace airflow-staging
kubectl create namespace airflow-prod

# Deploy to each environment
helm install airflow-dev apache-airflow/airflow \
  --namespace airflow-dev \
  --values values-dev.yaml

helm install airflow-staging apache-airflow/airflow \
  --namespace airflow-staging \
  --values values-staging.yaml

helm install airflow-prod apache-airflow/airflow \
  --namespace airflow-prod \
  --values values-prod.yaml
```
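One way to keep a single DAG codebase across these releases is to parameterize it by environment. A sketch, assuming an AIRFLOW_ENV environment variable that you inject into each release (for example via the chart's env settings) and hypothetical per-environment bucket names:

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

ENV = os.getenv("AIRFLOW_ENV", "dev")   # "dev" | "staging" | "prod" (assumed variable)
NAMESPACE = f"airflow-{ENV}"
DATA_BUCKET = f"my-data-bucket-{ENV}"   # hypothetical bucket naming convention

with DAG(
    "env_aware_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    KubernetesPodOperator(
        task_id="extract",
        name="extract",
        namespace=NAMESPACE,
        image="my-registry.io/airflow-data-worker:latest",  # placeholder image
        cmds=["python", "/scripts/extract.py"],             # placeholder script
        env_vars={"S3_BUCKET": DATA_BUCKET, "ENVIRONMENT": ENV},
        get_logs=True,
    )
```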
5.2 Terraform for Infrastructure as Code
```hcl
# terraform/airflow-prod.tf
module "airflow_prod" {
  source = "./modules/airflow-on-k8s"

  environment  = "prod"
  namespace    = "airflow-prod"
  cluster_name = var.eks_cluster_name

  # Database
  db_instance_class    = "db.r5.xlarge"
  db_allocated_storage = 100

  # Scheduler
  scheduler_replicas = 3
  scheduler_cpu      = "2"
  scheduler_memory   = "8Gi"

  # Workers
  worker_min_pods = 5
  worker_max_pods = 100
  worker_cpu      = "2"
  worker_memory   = "4Gi"

  # Logging
  log_bucket = "airflow-prod-logs-${var.account_id}"

  # Git
  dag_repo_url    = "git@github.com:org/dags-prod.git"
  dag_repo_branch = "main"

  tags = {
    Environment = "production"
    Team        = "data-platform"
  }
}
```
6. Cost Optimization
6.1 Spot Instances for Worker Nodes
```yaml
# EKS node group with Spot Instances
- name: airflow-workers-spot
  instanceTypes:
    - c5.2xlarge
    - c5.4xlarge
    - m5.2xlarge
  spot: true
  desiredCapacity: 10
  minSize: 5
  maxSize: 50
  labels:
    workload: airflow-workers
    lifecycle: spot
  taints:
    - key: spot
      value: "true"
      effect: NoSchedule
```
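Tasks routed to the spot pool should tolerate the spot taint and expect occasional preemption; retries plus idempotent task logic absorb the interruptions. A sketch, with image and script path as placeholders:

```python
from datetime import timedelta

from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

# Attach inside a `with DAG(...)` block as usual; route only idempotent work here.
spot_batch = KubernetesPodOperator(
    task_id="spot_batch_job",
    name="spot-batch-job",
    namespace="airflow",
    image="my-registry.io/airflow-data-worker:latest",  # placeholder image
    cmds=["python", "/scripts/batch_job.py"],           # placeholder script
    node_selector={"lifecycle": "spot"},
    tolerations=[
        k8s.V1Toleration(key="spot", operator="Equal", value="true", effect="NoSchedule")
    ],
    retries=3,                         # absorb spot reclamations
    retry_delay=timedelta(minutes=2),
    get_logs=True,
)
```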
6.2 Task-Level Resource Optimization
```python
# Right-size resources per task (image and script paths follow the examples above)
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

# Lightweight task - small resources
small_task = KubernetesPodOperator(
    task_id='lightweight_task',
    name='lightweight-task',
    namespace='airflow',
    image='my-registry.io/airflow-worker:2.8.0',
    cmds=['python', '/scripts/lightweight.py'],
    container_resources=k8s.V1ResourceRequirements(
        requests={'cpu': '250m', 'memory': '512Mi'},
        limits={'cpu': '500m', 'memory': '1Gi'},
    ),
)

# Heavy task - large resources
heavy_task = KubernetesPodOperator(
    task_id='heavy_processing',
    name='heavy-processing',
    namespace='airflow',
    image='my-registry.io/airflow-worker:2.8.0',
    cmds=['python', '/scripts/heavy.py'],
    container_resources=k8s.V1ResourceRequirements(
        requests={'cpu': '4', 'memory': '16Gi'},
        limits={'cpu': '8', 'memory': '32Gi'},
    ),
)
```
7. Monitoring and Observability
7.1 Prometheus + Grafana
```bash
# Install the Prometheus Operator stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
```

```yaml
# Airflow values.yaml: emit metrics via StatsD
# (pointed at a StatsD exporter service named prometheus-statsd-exporter)
config:
  metrics:
    statsd_on: "True"
    statsd_host: prometheus-statsd-exporter
    statsd_port: 9125
    statsd_prefix: airflow
```
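The same StatsD pipeline can carry your own metrics. A small sketch using Airflow's Stats client; the metric names are arbitrary examples, not metrics Airflow emits on its own:

```python
from airflow.decorators import task
from airflow.stats import Stats


@task
def load_rows():
    rows_loaded = 12_345  # placeholder for real work
    Stats.incr("custom.load_rows.runs")                     # counter
    Stats.gauge("custom.load_rows.row_count", rows_loaded)  # gauge
    return rows_loaded
```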
7.2 CloudWatch/Stackdriver Integration
```yaml
# Fluent Bit for log shipping (Helm values excerpt)
daemonset:
  enabled: true
config:
  outputs: |
    [OUTPUT]
        Name              cloudwatch_logs
        Match             kube.*
        region            us-east-1
        log_group_name    /aws/eks/airflow-prod
        log_stream_prefix airflow-
        auto_create_group true
```
8. Case Study: Healthcare Data Platform on GKE
8.1 Architecture
- Platform: Google Kubernetes Engine (GKE)
- Scale: 200+ DAGs, 50,000+ tasks/day
- Data Volume: 10TB processed daily
- Workers: 5-80 pods (auto-scaled)
- Cost: $8K/month (vs $18K with CeleryExecutor)
8.2 Key Configurations
```yaml
# GKE-specific optimizations (node pool settings)
nodeGroups:
  - name: airflow-workers
    machineType: n2-standard-4
    preemptible: true  # 60-80% cost savings
    minNodeCount: 3
    maxNodeCount: 50
    gke_accelerator:
      enabled: true
      type: nvidia-tesla-t4
      count: 1
```
8.3 Results
- ✅ 56% cost reduction vs CeleryExecutor
- ✅ 99.8% task success rate
- ✅ 30-second pod startup time (optimized images)
- ✅ Zero idle resources (pods terminate after completion)
- ✅ Isolated failures (one task failure doesn’t affect others)
9. Best Practices
9.1 DAG Development Workflow
- Local Development: Use kind/minikube for local Kubernetes testing
- CI/CD Pipeline: Automated DAG testing and deployment (see the test sketch after this list)
- GitOps: DAGs in Git, deployed via Git-sync
- Staged Rollout: dev → staging → prod promotion
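A minimal CI check for the pipeline step above, assuming pytest and a repo layout with DAGs under dags/: fail the build if any DAG fails to import or leaves catchup enabled.

```python
# tests/test_dag_integrity.py
from airflow.models import DagBag


def test_dags_import_cleanly():
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"


def test_dags_disable_catchup():
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    for dag_id, dag in dag_bag.dags.items():
        assert dag.catchup is False, f"{dag_id} should set catchup=False"
```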
9.2 Security
- RBAC: Kubernetes RBAC for pod creation permissions
- Network Policies: Restrict pod-to-pod communication
- Secrets Management: Use Kubernetes Secrets or external stores (AWS Secrets Manager, Vault); see the sketch after this list
- Pod Security Standards: Enforce restricted pod security
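For task pods, secrets are best injected at runtime rather than baked into images or DAG code. A sketch, assuming the cncf.kubernetes provider's Secret helper and a hypothetical pre-created Kubernetes Secret named warehouse-credentials:

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from airflow.providers.cncf.kubernetes.secret import Secret

# Expose one key of an existing Kubernetes Secret as an env var in the task pod.
warehouse_password = Secret(
    deploy_type="env",                   # inject as an environment variable
    deploy_target="WAREHOUSE_PASSWORD",  # env var name inside the pod
    secret="warehouse-credentials",      # Kubernetes Secret name (hypothetical)
    key="password",                      # key within that Secret
)

# Attach inside a `with DAG(...)` block as usual.
load_task = KubernetesPodOperator(
    task_id="load_to_warehouse",
    name="load-to-warehouse",
    namespace="airflow",
    image="my-registry.io/airflow-data-worker:latest",  # placeholder image
    cmds=["python", "/scripts/load.py"],                # placeholder script
    secrets=[warehouse_password],
    get_logs=True,
)
```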
9.3 Performance Tuning
- Optimize Images: Multi-stage builds, minimal base images
- Pre-pull Images: Use DaemonSet to pre-pull common images
- Fast Scheduling: Keep scheduler CPU/memory adequate
- Database Tuning: Connection pooling, read replicas
10. Conclusion
Running Airflow on Kubernetes with the KubernetesExecutor transforms how you build and scale data pipelines. Key
takeaways:
- 40-60% cost savings through auto-scaling and spot instances
- Better isolation prevents cascading failures
- Flexible resource allocation per task optimizes efficiency
- Cloud-native architecture leverages Kubernetes ecosystem
- Production-ready with Helm charts and Infrastructure as Code
Start with the official Helm chart, deploy to a dev environment, and gradually promote to production. Your data
platform will be more reliable, scalable, and cost-effective.
References and Further Reading
- Apache Airflow. (2025). “Running Airflow on Kubernetes.” https://airflow.apache.org/docs/apache-airflow/stable/kubernetes.html
- Apache Airflow Helm Chart. (2025). “Official Helm Chart Documentation.” https://airflow.apache.org/docs/helm-chart/stable/index.html
- Kubernetes Documentation. (2025). “Kubernetes Concepts.” https://kubernetes.io/docs/concepts/
- AWS. (2025). “Amazon EKS Best Practices.” https://aws.github.io/aws-eks-best-practices/
- Google Cloud. (2025). “GKE Best Practices.” https://cloud.google.com/kubernetes-engine/docs/best-practices
- Microsoft Azure. (2025). “AKS Best Practices.” https://learn.microsoft.com/azure/aks/best-practices
- Helm. (2025). “Helm Documentation.” https://helm.sh/docs/
- Terraform. (2025). “Terraform Kubernetes Provider.” https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs
- Astronomer. (2025). “Astronomer Runtime on Kubernetes.” https://docs.astronomer.io/astro
- Prometheus. (2025). “Monitoring Kubernetes.” https://prometheus.io/docs/prometheus/latest/configuration/configuration/
- CNCF. (2024). “Cloud Native Landscape.” https://landscape.cncf.io/
This article reflects production experience deploying Airflow on Kubernetes across AWS EKS, Google GKE, and Azure
AKS. Written for data platform engineers, DevOps teams, and technical leaders building cloud-native data
infrastructure.