Airflow on Kubernetes in Production: Architecture, Deployment, and Cost Optimization

After deploying Apache Airflow on Kubernetes across AWS EKS,
Google GKE, and Azure AKS in production environments processing billions of tasks annually, I’ve learned that the
KubernetesExecutor fundamentally changes how you architect data pipelines. This article shares battle-tested
patterns for running Airflow at scale on Kubernetes, from infrastructure setup to cost optimization.

1. Why Kubernetes for Airflow?

The KubernetesExecutor offers advantages that CeleryExecutor and LocalExecutor can’t match:

  • Task-Level Isolation: Each task runs in its own pod with dedicated resources
  • Auto-Scaling: Pods scale up/down automatically based on workload
  • Resource Efficiency: No idle worker processes consuming resources
  • Cost Optimization: Pay only for pods actively running tasks
  • Dependency Isolation: Different tasks can have different Python dependencies
  • Multi-Tenancy: Separate namespaces for different teams/environments

In my deployments, KubernetesExecutor reduced infrastructure costs by 40-60% compared to CeleryExecutor, while
improving task isolation and reliability.
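
A concrete illustration of task-level isolation: with the KubernetesExecutor, any task can override its pod spec (image, resources) through executor_config, so two tasks in the same DAG can run with entirely different dependencies. A minimal sketch; the image name and resource figures are illustrative assumptions, not values from a real registry:

# dags/isolation_demo.py (illustrative sketch)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s

with DAG(
    'isolation_demo',
    start_date=datetime(2025, 1, 1),
    schedule_interval='@daily',
    catchup=False
) as dag:

    def extract():
        print("Runs in its own pod, with its own image and resources")

    extract_task = PythonOperator(
        task_id='extract',
        python_callable=extract,
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",  # "base" targets the main worker container
                            image="my-registry.io/airflow-worker:2.8.0",  # hypothetical image
                            resources=k8s.V1ResourceRequirements(
                                requests={"cpu": "500m", "memory": "1Gi"},
                                limits={"cpu": "1", "memory": "2Gi"},
                            ),
                        )
                    ]
                )
            )
        },
    )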

Figure 1: Airflow on Kubernetes Production Architecture

2. Architecture: Production Kubernetes Deployment

2.1 Core Components

A production Airflow on Kubernetes deployment consists of:

  • Scheduler Pod(s): Parse DAGs, create task pods
  • Web Server Pod(s): Serve UI, handle API requests
  • Worker Pods (ephemeral): Execute tasks, terminated after completion
  • PostgreSQL Database: Metadata store (RDS/Cloud SQL recommended)
  • Redis (optional): For Celery if using CeleryKubernetesExecutor
  • DAG Storage: Git-sync or PersistentVolume
  • Logs Storage: S3/GCS/Azure Blob for centralized logging

2.2 Helm Chart Deployment

The official Airflow Helm chart is the recommended deployment method:

# Add Airflow Helm repository
helm repo add apache-airflow https://airflow.apache.org
helm repo update

# Create namespace
kubectl create namespace airflow

# Install Airflow
helm install airflow apache-airflow/airflow \
  --namespace airflow \
  --values production-values.yaml

2.3 Production values.yaml

# production-values.yaml
# Executor configuration
executor: "KubernetesExecutor"

# Airflow configuration
config:
  core:
    parallelism: 128
    max_active_tasks_per_dag: 32  # formerly dag_concurrency (renamed in Airflow 2.2)
    max_active_runs_per_dag: 16
  
  kubernetes_executor:
    namespace: airflow
    worker_container_repository: my-registry.io/airflow-worker
    worker_container_tag: "2.8.0"
    delete_worker_pods: "True"
    delete_worker_pods_on_failure: "False"  # Keep for debugging
    worker_pods_creation_batch_size: 16
    multi_namespace_mode: "False"

  # Remote logging (rendered into the [logging] section of airflow.cfg)
  logging:
    remote_logging: "True"
    remote_base_log_folder: "s3://my-airflow-logs/logs"
    remote_log_conn_id: "aws_default"

# Scheduler
scheduler:
  replicas: 2  # HA scheduler
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "8Gi"
  
# Web server
webserver:
  replicas: 2
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
  
  service:
    type: LoadBalancer  # Or use Ingress

# PostgreSQL (use external RDS/Cloud SQL in production)
postgresql:
  enabled: false

# External metadata DB (e.g. airflow-db.us-east-1.rds.amazonaws.com:5432/airflow):
# the official chart reads the full SQLAlchemy connection string from a secret
data:
  metadataSecretName: airflow-db-secret  # secret must contain the key "connection"

# DAG repository via Git-sync
dags:
  gitSync:
    enabled: true
    # Use the ssh:// form of the URL when authenticating with sshKeySecret
    repo: ssh://git@github.com/yourorg/airflow-dags.git
    branch: main
    rev: HEAD
    depth: 1
    wait: 60  # Poll interval in seconds
    
    sshKeySecret: airflow-git-secret

# Logs: local persistence off; logs ship to S3 via config.logging above
logs:
  persistence:
    enabled: false

# Resource quotas for worker pods
workers:
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
  
  persistence:
    enabled: false

# RBAC
rbac:
  create: true
  createSCCRoleBinding: false

Figure 2: GitOps Deployment Pipeline

3. Advanced Patterns: Custom Worker Images

Different DAGs often require different dependencies. Use custom worker images per task:

3.1 Multiple Worker Images

# dags/ml_pipeline.py
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s
from datetime import datetime

with DAG(
    'ml_pipeline',
    start_date=datetime(2025, 1, 1),
    schedule_interval='@daily',
    catchup=False
) as dag:
    
    # Data preprocessing with pandas/polars
    preprocess = KubernetesPodOperator(
        task_id='preprocess_data',
        name='preprocess-data',
        namespace='airflow',
        image='my-registry.io/airflow-data-worker:latest',
        cmds=['python'],
        arguments=['/scripts/preprocess.py'],
        env_vars={
            'EXECUTION_DATE': '{{ ds }}',
            'S3_BUCKET': 'my-data-bucket'
        },
        container_resources=k8s.V1ResourceRequirements(
            requests={'memory': '4Gi', 'cpu': '2'},
            limits={'memory': '8Gi', 'cpu': '4'}
        ),
        is_delete_operator_pod=True,
        get_logs=True
    )
    
    # Model training with PyTorch on GPU
    train_model = KubernetesPodOperator(
        task_id='train_model',
        name='train-model',
        namespace='airflow',
        image='my-registry.io/airflow-ml-gpu:latest',
        cmds=['python'],
        arguments=['/scripts/train.py'],
        env_vars={
            'MODEL_VERSION': '{{ execution_date }}',
            'S3_MODEL_BUCKET': 'my-models-bucket'
        },
        container_resources=k8s.V1ResourceRequirements(
            requests={'memory': '16Gi', 'cpu': '4'},
            limits={'memory': '32Gi', 'cpu': '8'}
        ),
        node_selector={'node.kubernetes.io/instance-type': 'p3.2xlarge'},  # GPU nodes
        is_delete_operator_pod=True,
        get_logs=True
    )
    
    preprocess >> train_model

3.2 Dockerfile for Custom Workers

# Dockerfile.ml-worker
FROM apache/airflow:2.8.0-python3.11

# Install ML dependencies
USER root
RUN apt-get update && apt-get install -y \
    build-essential \
    libgomp1 \
    && apt-get clean

USER airflow

# Install Python packages
COPY requirements-ml.txt .
RUN pip install --no-cache-dir -r requirements-ml.txt

# requirements-ml.txt:
# pandas==2.1.4
# polars==0.20.3
# torch==2.1.0
# transformers==4.36.0
# scikit-learn==1.4.0

4. Auto-Scaling Strategies

4.1 Cluster Auto-Scaler

Configure Kubernetes Cluster Autoscaler to add/remove nodes based on pending pods:

# AWS EKS Node Group with Auto-scaling
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: airflow-cluster
  region: us-east-1

nodeGroups:
  # General purpose nodes for scheduler/web
  - name: airflow-core
    instanceType: m5.xlarge
    desiredCapacity: 3
    minSize: 2
    maxSize: 5
    labels:
      workload: airflow-core
    
  # Worker nodes for task execution
  - name: airflow-workers
    instanceType: c5.2xlarge
    desiredCapacity: 5
    minSize: 2
    maxSize: 50  # Auto-scale based on demand
    labels:
      workload: airflow-workers
    taints:
      - key: workload
        value: airflow-workers
        effect: NoSchedule
    
  # GPU nodes for ML workloads
  - name: ml-gpu-workers
    instanceType: p3.2xlarge
    desiredCapacity: 0  # Scale from zero
    minSize: 0
    maxSize: 10
    labels:
      workload: ml-gpu
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
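
Because the worker node group above is tainted (workload=airflow-workers:NoSchedule), task pods only land there if they tolerate the taint and select the matching label. A sketch of the corresponding operator settings; the image and script path are illustrative:

# Tolerations/selector matching the tainted worker node group above (sketch)
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

heavy_transform = KubernetesPodOperator(
    task_id='heavy_transform',
    name='heavy-transform',
    namespace='airflow',
    image='my-registry.io/airflow-data-worker:latest',  # hypothetical image
    cmds=['python'],
    arguments=['/scripts/transform.py'],
    node_selector={'workload': 'airflow-workers'},
    tolerations=[
        k8s.V1Toleration(
            key='workload',
            operator='Equal',
            value='airflow-workers',
            effect='NoSchedule'
        )
    ],
    get_logs=True
)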

4.2 Horizontal Pod Autoscaler for Schedulers

# scheduler-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: airflow-scheduler
  namespace: airflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: airflow-scheduler
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Figure 3: Cost Comparison Analysis

5. Multi-Environment Setup

Production Airflow requires separate environments for dev, staging, and production:

5.1 Namespace-Based Isolation

# Create environments
kubectl create namespace airflow-dev
kubectl create namespace airflow-staging
kubectl create namespace airflow-prod

# Deploy to each environment
helm install airflow-dev apache-airflow/airflow \
  --namespace airflow-dev \
  --values values-dev.yaml

helm install airflow-staging apache-airflow/airflow \
  --namespace airflow-staging \
  --values values-staging.yaml

helm install airflow-prod apache-airflow/airflow \
  --namespace airflow-prod \
  --values values-prod.yaml
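
On the DAG side, a single code base can serve all three namespaces by reading an environment marker injected into each deployment (for example through the chart's env values). A sketch; the AIRFLOW_ENV variable and bucket names are assumptions:

# Environment-aware DAG configuration (sketch)
import os

ENVIRONMENT = os.environ.get('AIRFLOW_ENV', 'dev')  # injected per namespace

S3_BUCKETS = {
    'dev': 'my-data-bucket-dev',
    'staging': 'my-data-bucket-staging',
    'prod': 'my-data-bucket-prod',
}

DATA_BUCKET = S3_BUCKETS[ENVIRONMENT]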

5.2 Terraform for Infrastructure as Code

# terraform/airflow-prod.tf
module "airflow_prod" {
  source = "./modules/airflow-on-k8s"
  
  environment         = "prod"
  namespace          = "airflow-prod"
  cluster_name       = var.eks_cluster_name
  
  # Database
  db_instance_class  = "db.r5.xlarge"
  db_allocated_storage = 100
  
  # Scheduler
  scheduler_replicas = 3
  scheduler_cpu      = "2"
  scheduler_memory   = "8Gi"
  
  # Workers
  worker_min_pods    = 5
  worker_max_pods    = 100
  worker_cpu         = "2"
  worker_memory      = "4Gi"
  
  # Logging
  log_bucket         = "airflow-prod-logs-${var.account_id}"
  
  # Git
  dag_repo_url       = "git@github.com:org/dags-prod.git"
  dag_repo_branch    = "main"
  
  tags = {
    Environment = "production"
    Team        = "data-platform"
  }
}

6. Cost Optimization

6.1 Spot Instances for Worker Nodes

# EKS Spot node group (eksctl managedNodeGroups entry)
- name: airflow-workers-spot
  instanceTypes:
    - c5.2xlarge
    - c5.4xlarge
    - m5.2xlarge
  spot: true
  desiredCapacity: 10
  minSize: 5
  maxSize: 50
  labels:
    workload: airflow-workers
    lifecycle: spot
  taints:
    - key: spot
      value: "true"
      effect: NoSchedule
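
Tasks meant to run on this Spot capacity need to tolerate the spot taint and select the lifecycle: spot label, and they should carry retries because Spot nodes can be reclaimed mid-task. A sketch; the image and script path are illustrative:

# Routing a fault-tolerant task onto the Spot node group above (sketch)
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

batch_backfill = KubernetesPodOperator(
    task_id='batch_backfill',
    name='batch-backfill',
    namespace='airflow',
    image='my-registry.io/airflow-data-worker:latest',  # hypothetical image
    cmds=['python'],
    arguments=['/scripts/backfill.py'],
    node_selector={'lifecycle': 'spot'},
    tolerations=[
        k8s.V1Toleration(key='spot', operator='Equal', value='true', effect='NoSchedule')
    ],
    retries=3,  # Spot interruptions surface as pod failures; let Airflow retry
    get_logs=True
)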

6.2 Task-Level Resource Optimization

# Right-size resources per task
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

# Lightweight task - small resources
small_task = KubernetesPodOperator(
    task_id='lightweight_task',
    name='lightweight-task',
    image='my-registry.io/airflow-data-worker:latest',
    container_resources=k8s.V1ResourceRequirements(
        requests={'cpu': '250m', 'memory': '512Mi'},
        limits={'cpu': '500m', 'memory': '1Gi'}
    )
)

# Heavy task - large resources
heavy_task = KubernetesPodOperator(
    task_id='heavy_processing',
    name='heavy-processing',
    image='my-registry.io/airflow-data-worker:latest',
    container_resources=k8s.V1ResourceRequirements(
        requests={'cpu': '4', 'memory': '16Gi'},
        limits={'cpu': '8', 'memory': '32Gi'}
    )
)

7. Monitoring and Observability

7.1 Prometheus + Grafana

# Install Prometheus Operator
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

# Airflow StatsD metrics (Helm values; rendered into the [metrics] section of airflow.cfg)
# Alternatively, the chart's top-level `statsd.enabled: true` deploys a bundled exporter
config:
  metrics:
    statsd_on: "True"
    statsd_host: prometheus-statsd-exporter
    statsd_port: 9125
    statsd_prefix: airflow
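
Once metrics are flowing, tasks can also emit custom counters and gauges through Airflow's Stats facade; they show up in Prometheus under the configured prefix. A sketch with illustrative metric names:

# Custom metrics from task code via Airflow's StatsD client (sketch)
from airflow.stats import Stats

def publish_load_metrics(row_count: int) -> None:
    Stats.incr('custom.loads_completed')          # counter: one per successful load
    Stats.gauge('custom.rows_loaded', row_count)  # gauge: size of the latest load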

7.2 CloudWatch/Stackdriver Integration

# Fluent Bit for log shipping
daemonset:
  enabled: true
  
config:
  outputs: |
    [OUTPUT]
        Name cloudwatch_logs
        Match kube.*
        region us-east-1
        log_group_name /aws/eks/airflow-prod
        log_stream_prefix airflow-
        auto_create_group true

8. Case Study: Healthcare Data Platform on GKE

8.1 Architecture

  • Platform: Google Kubernetes Engine (GKE)
  • Scale: 200+ DAGs, 50,000+ tasks/day
  • Data Volume: 10TB processed daily
  • Workers: 5-80 pods (auto-scaled)
  • Cost: $8K/month (vs $18K with CeleryExecutor)

8.2 Key Configurations

# GKE-specific optimizations (illustrative; maps to GKE node pool and accelerator settings)
nodeGroups:
  - name: airflow-workers
    machineType: n2-standard-4
    preemptible: true  # 60-80% cost savings
    minNodeCount: 3
    maxNodeCount: 50
    
gke_accelerator:
  enabled: true
  type: nvidia-tesla-t4
  count: 1

8.3 Results

  • 56% cost reduction vs CeleryExecutor
  • 99.8% task success rate
  • 30-second pod startup time (optimized images)
  • Zero idle resources (pods terminate after completion)
  • Isolated failures (one task failure doesn’t affect others)

9. Best Practices

9.1 DAG Development Workflow

  1. Local Development: Use kind/minikube for local Kubernetes testing
  2. CI/CD Pipeline: Automated DAG testing and deployment (see the test sketch after this list)
  3. GitOps: DAGs in Git, deployed via Git-sync
  4. Staged Rollout: dev → staging → prod promotion
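
A minimal DAG integrity test that CI can run before promoting DAGs; it fails the build on import errors. It assumes DAGs live in a dags/ folder at the repository root, so adjust the path to your layout:

# tests/test_dag_integrity.py (sketch)
from airflow.models import DagBag

def test_dags_load_without_import_errors():
    dag_bag = DagBag(dag_folder='dags/', include_examples=False)
    assert not dag_bag.import_errors, f"DAG import errors: {dag_bag.import_errors}"

def test_at_least_one_dag_is_defined():
    dag_bag = DagBag(dag_folder='dags/', include_examples=False)
    assert len(dag_bag.dags) > 0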

9.2 Security

  • RBAC: Kubernetes RBAC for pod creation permissions
  • Network Policies: Restrict pod-to-pod communication
  • Secrets Management: Use Kubernetes Secrets or external (AWS Secrets Manager, Vault)
  • Pod Security Standards: Enforce restricted pod security

9.3 Performance Tuning

  • Optimize Images: Multi-stage builds, minimal base images
  • Pre-pull Images: Use DaemonSet to pre-pull common images
  • Fast Scheduling: Keep scheduler CPU/memory adequate
  • Database Tuning: Connection pooling, read replicas

10. Conclusion

Running Airflow on Kubernetes with the KubernetesExecutor transforms how you build and scale data pipelines. Key
takeaways:

  • 40-60% cost savings through auto-scaling and spot instances
  • Better isolation prevents cascading failures
  • Flexible resource allocation per task optimizes efficiency
  • Cloud-native architecture leverages Kubernetes ecosystem
  • Production-ready with Helm charts and Infrastructure as Code

Start with the official Helm chart, deploy to a dev environment, and gradually promote to production. Your data
platform will be more reliable, scalable, and cost-effective.

This article reflects production experience deploying Airflow on Kubernetes across AWS EKS, Google GKE, and Azure
AKS. Written for data platform engineers, DevOps teams, and technical leaders building cloud-native data
infrastructure.

