Infrastructure as Code for AI: Terraform Patterns for LLM Deployments

Expert Guide to Managing AI Infrastructure with Terraform

I’ve managed AI infrastructure across AWS, Azure, and GCP using Terraform. Infrastructure as Code isn’t just about automation—it’s about reproducibility, version control, and managing complex AI deployments consistently. When you’re deploying LLM services, vector databases, and GPU clusters, Terraform becomes essential.

In this guide, I’ll share the Terraform patterns I’ve used to deploy production AI infrastructure. You’ll learn how to structure Terraform modules, manage GPU resources, deploy Kubernetes clusters, and handle multi-cloud deployments.

What You’ll Learn

  • Terraform module patterns for AI infrastructure
  • GPU resource management with Terraform
  • Kubernetes cluster deployment patterns
  • Multi-cloud deployment strategies
  • State management and remote backends
  • Variable management and secrets handling
  • CI/CD integration for infrastructure
  • Real-world examples from production deployments
  • Common Terraform pitfalls and how to avoid them

Introduction: Why Infrastructure as Code for AI?

Managing AI infrastructure manually is error-prone, slow, and doesn’t scale. When you’re deploying GPU clusters, vector databases, and Kubernetes clusters across multiple environments, you need Infrastructure as Code. Terraform provides:

  • Reproducibility: Same infrastructure every time, across environments
  • Version control: Track infrastructure changes like code
  • Multi-cloud support: Deploy to AWS, Azure, GCP with the same tool
  • State management: Track infrastructure state and changes
  • Modularity: Reusable modules for common patterns

I’ve seen teams spend days deploying AI infrastructure manually, only to have it fail in production. With Terraform, the same infrastructure deploys in minutes, consistently, every time.

Figure 1: Terraform Architecture for AI Infrastructure

1. Terraform Module Structure

1.1 Module Organization

Organize Terraform code into reusable modules:

# Module Structure
terraform/
├── modules/
│   ├── kubernetes-cluster/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── llm-service/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── vector-database/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── monitoring/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── production/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
└── main.tf

1.2 Module Usage

Use modules to create reusable infrastructure components:

# main.tf - Using modules
module "kubernetes_cluster" {
  source = "./modules/kubernetes-cluster"
  
  cluster_name = "ai-cluster"
  node_count   = 3
  node_type    = "g4dn.xlarge"  # GPU-enabled
  
  tags = {
    Environment = "production"
    Application = "ai-llm"
  }
}

module "llm_service" {
  source = "./modules/llm-service"
  
  cluster_id = module.kubernetes_cluster.cluster_id
  gpu_count  = 2
  model_path = "s3://models/llm/v1"
  
  depends_on = [module.kubernetes_cluster]
}

module "vector_database" {
  source = "./modules/vector-database"
  
  instance_type = "r6g.2xlarge"
  storage_size  = 500  # GB
  replication   = true
}
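
The composition above assumes the kubernetes-cluster module exposes a cluster_id output. A minimal outputs.tf for that module might look like the sketch below; the output names are assumptions chosen to match the usage above, and the resource references match the EKS cluster defined in the next section.

# modules/kubernetes-cluster/outputs.tf (sketch)
output "cluster_id" {
  description = "Identifier of the EKS cluster, consumed by dependent modules"
  value       = aws_eks_cluster.ai_cluster.id
}

output "cluster_endpoint" {
  description = "API server endpoint, useful when configuring the kubernetes provider"
  value       = aws_eks_cluster.ai_cluster.endpoint
}

output "cluster_ca_certificate" {
  description = "Base64-encoded cluster CA certificate"
  value       = aws_eks_cluster.ai_cluster.certificate_authority[0].data
}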

Figure 2: Terraform Module Patterns for AI Infrastructure

2. Kubernetes Cluster Deployment

2.1 EKS Cluster with Terraform

Deploy an EKS cluster for AI workloads:

# modules/kubernetes-cluster/main.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# EKS Cluster
resource "aws_eks_cluster" "ai_cluster" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = "1.28"

  vpc_config {
    subnet_ids              = var.subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = true
  }

  enabled_cluster_log_types = ["api", "audit", "authenticator"]

  tags = var.tags
}

# Node Group with GPU Support
resource "aws_eks_node_group" "gpu_nodes" {
  cluster_name    = aws_eks_cluster.ai_cluster.name
  node_group_name = "gpu-nodes"
  node_role_arn   = aws_iam_role.node_group.arn
  subnet_ids      = var.subnet_ids

  instance_types = [var.node_type]  # g4dn.xlarge, p3.2xlarge, etc.

  scaling_config {
    desired_size = var.node_count
    max_size     = var.max_nodes
    min_size     = var.min_nodes
  }

  update_config {
    max_unavailable = 1
  }

  labels = {
    accelerator = "gpu"
  }

  taint {
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"
  }
}
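
Because the node group is tainted with nvidia.com/gpu, the cluster also needs the NVIDIA device plugin before GPU pods can be scheduled. One way to manage that from Terraform is a helm_release; the sketch below is an assumption, not part of the module above, and it requires the helm provider to be configured against this cluster.

# Sketch: install the NVIDIA device plugin (assumes a configured helm provider)
resource "helm_release" "nvidia_device_plugin" {
  name       = "nvidia-device-plugin"
  repository = "https://nvidia.github.io/k8s-device-plugin"
  chart      = "nvidia-device-plugin"
  namespace  = "kube-system"

  # The chart's default tolerations cover the nvidia.com/gpu taint applied above
  depends_on = [aws_eks_node_group.gpu_nodes]
}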

2.2 AKS Cluster Deployment

Deploy an AKS cluster for Azure:

# AKS Cluster
resource "azurerm_kubernetes_cluster" "ai_cluster" {
  name                = var.cluster_name
  location            = var.location
  resource_group_name = var.resource_group_name
  dns_prefix          = var.cluster_name

  default_node_pool {
    name                = "default"
    node_count          = var.node_count
    vm_size             = var.node_type  # Standard_NC6s_v3 for GPU
    enable_auto_scaling = true
    min_count           = var.min_nodes
    max_count           = var.max_nodes
  }

  identity {
    type = "SystemAssigned"
  }

  tags = var.tags
}

# GPU Node Pool
resource "azurerm_kubernetes_cluster_node_pool" "gpu_pool" {
  name                  = "gpunodepool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.ai_cluster.id
  vm_size               = "Standard_NC6s_v3"  # NVIDIA GPU
  node_count            = var.gpu_node_count

  node_taints = ["nvidia.com/gpu=true:NoSchedule"]

  tags = var.tags
}

3. GPU Resource Management

3.1 AWS GPU Instances

Provision GPU instances for LLM inference:

# EC2 GPU Instances
resource "aws_instance" "gpu_inference" {
  count         = var.gpu_instance_count
  ami           = data.aws_ami.gpu_ami.id
  instance_type = var.gpu_instance_type  # g4dn.xlarge, p3.2xlarge, etc.

  vpc_security_group_ids = [aws_security_group.gpu.id]
  subnet_id              = var.subnet_ids[count.index % length(var.subnet_ids)]

  iam_instance_profile = aws_iam_instance_profile.gpu.name

  root_block_device {
    volume_type = "gp3"
    volume_size = 100
    encrypted   = true
  }

  tags = merge(var.tags, {
    Name        = "${var.cluster_name}-gpu-${count.index}"
    Accelerator = "gpu"
  })
}

# Launch Template for Auto Scaling
resource "aws_launch_template" "gpu_template" {
  name_prefix   = "${var.cluster_name}-gpu-"
  image_id      = data.aws_ami.gpu_ami.id
  instance_type = var.gpu_instance_type

  vpc_security_group_ids = [aws_security_group.gpu.id]

  block_device_mappings {
    device_name = "/dev/sda1"
    ebs {
      volume_type = "gp3"
      volume_size = 100
      encrypted   = true
    }
  }

  iam_instance_profile {
    name = aws_iam_instance_profile.gpu.name
  }

  tag_specifications {
    resource_type = "instance"
    tags = merge(var.tags, {
      Accelerator = "gpu"
    })
  }
}

# Auto Scaling Group
resource "aws_autoscaling_group" "gpu_asg" {
  name                = "${var.cluster_name}-gpu-asg"
  vpc_zone_identifier = var.subnet_ids
  target_group_arns   = [aws_lb_target_group.gpu.arn]
  health_check_type   = "ELB"

  min_size         = var.min_gpu_instances
  max_size         = var.max_gpu_instances
  desired_capacity = var.desired_gpu_instances

  launch_template {
    id      = aws_launch_template.gpu_template.id
    version = "$Latest"
  }

  tag {
    key                 = "Name"
    value               = "${var.cluster_name}-gpu"
    propagate_at_launch = true
  }
}
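
The instances above reference data.aws_ami.gpu_ami, which isn't shown. A sketch of that lookup is below; the name filter targeting an AWS Deep Learning AMI is an assumption, so adjust it to whichever GPU-ready AMI family you standardize on.

# Sketch: look up a GPU-ready AMI (filter values are assumptions)
data "aws_ami" "gpu_ami" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["Deep Learning AMI GPU*"]
  }

  filter {
    name   = "architecture"
    values = ["x86_64"]
  }
}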

3.2 Azure GPU Instances

Provision GPU instances in Azure:

# Azure GPU VM
resource "azurerm_linux_virtual_machine" "gpu_vm" {
  count               = var.gpu_vm_count
  name                = "${var.cluster_name}-gpu-${count.index}"
  resource_group_name = var.resource_group_name
  location            = var.location
  size                = var.gpu_vm_size  # Standard_NC6s_v3, etc.

  admin_username = var.admin_username
  admin_ssh_key {
    username   = var.admin_username
    public_key = var.ssh_public_key
  }

  network_interface_ids = [azurerm_network_interface.gpu[count.index].id]

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Premium_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-focal"
    sku       = "20_04-lts-gen2"
    version   = "latest"
  }

  tags = merge(var.tags, {
    Accelerator = "gpu"
  })
}

# Virtual Machine Scale Set for Auto Scaling
resource "azurerm_linux_virtual_machine_scale_set" "gpu_vmss" {
  name                = "${var.cluster_name}-gpu-vmss"
  resource_group_name = var.resource_group_name
  location            = var.location
  sku                 = var.gpu_vm_size
  instances           = var.gpu_instance_count

  admin_username = var.admin_username
  admin_ssh_key {
    username   = var.admin_username
    public_key = var.ssh_public_key
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-focal"
    sku       = "20_04-lts-gen2"
    version   = "latest"
  }

  network_interface {
    name    = "gpu-nic"
    primary = true

    ip_configuration {
      name      = "internal"
      primary   = true
      subnet_id = var.subnet_id
    }
  }

  # Automatic OS upgrades typically also require an upgrade_mode of "Automatic"
  # or "Rolling" plus a health probe; adjust to match your upgrade strategy.
  automatic_os_upgrade_policy {
    disable_automatic_rollback  = false
    enable_automatic_os_upgrade = true
  }

  tags = var.tags
}
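
GPU VMs and scale-set instances also need NVIDIA drivers installed. On Azure, one option is the NvidiaGpuDriverLinux VM extension; the sketch below is an assumption (the extension version in particular), and cloud-init or a prebuilt image are equally valid alternatives.

# Sketch: install NVIDIA drivers on the scale set via the Azure GPU driver extension
resource "azurerm_virtual_machine_scale_set_extension" "nvidia_driver" {
  name                         = "NvidiaGpuDriverLinux"
  virtual_machine_scale_set_id = azurerm_linux_virtual_machine_scale_set.gpu_vmss.id
  publisher                    = "Microsoft.HpcCompute"
  type                         = "NvidiaGpuDriverLinux"
  type_handler_version         = "1.6"
}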

Figure 3: Terraform Deployment Patterns for AI Infrastructure

4. State Management and Remote Backends

4.1 Remote State Configuration

Use a remote backend for state management. Only one backend block can be active per root configuration, so the Azure and GCS examples below are alternatives to the S3 backend, not additions:

# backend.tf - Remote State Backend
terraform {
  backend "s3" {
    bucket         = "ai-infrastructure-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

# Alternative: Azure Backend
terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state"
    storage_account_name = "tfstate"
    container_name       = "tfstate"
    key                  = "production.terraform.tfstate"
  }
}

# Alternative: GCS Backend
terraform {
  backend "gcs" {
    bucket = "ai-infrastructure-terraform-state"
    prefix = "production"
  }
}
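
The S3 backend assumes the state bucket already exists, since Terraform can't create its own backend on the first run. A sketch of that bucket, usually bootstrapped in a separate one-off configuration, with versioning, encryption, and deletion protection:

# Sketch: bootstrap the state bucket (typically a separate, one-off configuration)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "ai-infrastructure-terraform-state"

  lifecycle {
    prevent_destroy = true  # protect the state bucket from accidental destroy
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}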

4.2 State Locking

Prevent concurrent modifications with state locking:

# DynamoDB Table for State Locking (AWS)
resource "aws_dynamodb_table" "terraform_state_lock" {
  name           = "terraform-state-lock"
  billing_mode   = "PAY_PER_REQUEST"
  hash_key       = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name = "Terraform State Lock"
  }
}

# State Locking Configuration
terraform {
  backend "s3" {
    # ... other config ...
    dynamodb_table = "terraform-state-lock"
  }
}

5. Variable Management and Secrets

5.1 Variable Files

Organize variables by environment:

# variables.tf
variable "cluster_name" {
  description = "Name of the Kubernetes cluster"
  type        = string
}

variable "node_count" {
  description = "Number of nodes in the cluster"
  type        = number
  default     = 3
}

variable "gpu_instance_type" {
  description = "GPU instance type"
  type        = string
  default     = "g4dn.xlarge"
}

variable "tags" {
  description = "Tags to apply to resources"
  type        = map(string)
  default     = {}
}

# terraform.tfvars (production)
cluster_name      = "ai-production-cluster"
node_count        = 5
gpu_instance_type = "p3.2xlarge"
tags = {
  Environment = "production"
  Application = "ai-llm"
  ManagedBy   = "terraform"
}
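
Where an input only makes sense for a narrow set of values, a validation block catches mistakes at plan time rather than apply time. A sketch extending the gpu_instance_type variable above; the allowed list is only an example, so expand it to whatever your workloads actually use:

# Sketch: restrict gpu_instance_type to an approved list (example values)
variable "gpu_instance_type" {
  description = "GPU instance type for inference nodes"
  type        = string
  default     = "g4dn.xlarge"

  validation {
    condition     = contains(["g4dn.xlarge", "g5.xlarge", "p3.2xlarge"], var.gpu_instance_type)
    error_message = "gpu_instance_type must be one of the approved GPU instance types."
  }
}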

5.2 Secrets Management

Handle secrets securely with Terraform:

# Using AWS Secrets Manager
data "aws_secretsmanager_secret" "api_key" {
  name = "ai-api-key"
}

data "aws_secretsmanager_secret_version" "api_key" {
  secret_id = data.aws_secretsmanager_secret.api_key.id
}

# Using Azure Key Vault
data "azurerm_key_vault_secret" "api_key" {
  name         = "ai-api-key"
  key_vault_id = azurerm_key_vault.main.id
}

# Using environment variables (sensitive)
variable "api_key" {
  description = "API key for LLM service"
  type        = string
  sensitive   = true
}

# Pass to resources
resource "kubernetes_secret" "api_key" {
  metadata {
    name = "api-key"
  }

  data = {
    api-key = var.api_key
  }
}
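
If the key lives in AWS Secrets Manager rather than a Terraform variable, the data source above can feed the Kubernetes secret directly. A sketch, as an alternative to the variable-based secret above, assuming the secret is stored as a plain string:

# Sketch: wire the Secrets Manager value into a Kubernetes secret
resource "kubernetes_secret" "api_key_from_sm" {
  metadata {
    name = "api-key"
  }

  data = {
    # The kubernetes provider base64-encodes values in `data` automatically
    api-key = data.aws_secretsmanager_secret_version.api_key.secret_string
  }
}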

6. Multi-Cloud Deployment

6.1 Provider Configuration

Configure multiple cloud providers:

# providers.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    google = {
      source  = "hashicorp/google"
      version = "~> 4.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.20"
    }
  }
}

# AWS Provider
provider "aws" {
  region = var.aws_region

  default_tags {
    tags = var.tags
  }
}

# Azure Provider
provider "azurerm" {
  features {}
}

# GCP Provider
provider "google" {
  project = var.gcp_project_id
  region  = var.gcp_region
}

6.2 Cross-Cloud Resources

Deploy resources across multiple clouds:

# AWS Resources
module "aws_infrastructure" {
  source = "./modules/aws"
  
  cluster_name = var.cluster_name
  region       = var.aws_region
}

# Azure Resources
module "azure_infrastructure" {
  source = "./modules/azure"
  
  cluster_name      = var.cluster_name
  resource_group_name = var.azure_resource_group
  location          = var.azure_location
}

# GCP Resources
module "gcp_infrastructure" {
  source = "./modules/gcp"
  
  cluster_name = var.cluster_name
  project_id   = var.gcp_project_id
  region       = var.gcp_region
}
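
With one module per cloud, it helps to surface a single map of endpoints so downstream tooling doesn't care where a cluster runs. A sketch; the per-module output names are assumptions:

# Sketch: aggregate per-cloud outputs into one map (module output names assumed)
output "cluster_endpoints" {
  description = "Kubernetes API endpoints keyed by cloud"
  value = {
    aws   = module.aws_infrastructure.cluster_endpoint
    azure = module.azure_infrastructure.cluster_endpoint
    gcp   = module.gcp_infrastructure.cluster_endpoint
  }
}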

7. CI/CD Integration

7.1 GitHub Actions Workflow

Automate Terraform deployments with CI/CD:

# .github/workflows/terraform.yml
name: Terraform Deploy

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  terraform:
    runs-on: ubuntu-latest
    env:
      # Credentials are needed by init (for the S3 backend), plan, and apply
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
    - uses: actions/checkout@v3
    
    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v2
      with:
        terraform_version: 1.5.0
    
    - name: Terraform Init
      run: terraform init
      working-directory: ./environments/production
    
    - name: Terraform Validate
      run: terraform validate
      working-directory: ./environments/production
    
    - name: Terraform Plan
      run: terraform plan -out=tfplan
      working-directory: ./environments/production
    
    - name: Terraform Apply
      if: github.ref == 'refs/heads/main'
      run: terraform apply -auto-approve tfplan
      working-directory: ./environments/production

7.2 Terraform Cloud Integration

Use Terraform Cloud for managed runs:

# terraform.tf - Terraform Cloud Backend
terraform {
  cloud {
    organization = "your-org"
    
    workspaces {
      name = "ai-infrastructure-production"
    }
  }
}

# Workspace Configuration
# - Auto-apply on merge to main
# - Manual apply for other branches
# - Cost estimation enabled
# - Policy checks (Sentinel)

8. Best Practices: Lessons from Production

After architecting multiple cloud-native AI systems, here are the practices I follow:

  1. Use modules: Create reusable, composable modules
  2. Remote state: Store state remotely for team collaboration
  3. State locking: Prevent concurrent modifications
  4. Version control: Track infrastructure changes in Git
  5. Environment separation: Separate dev, staging, production
  6. Variable management: Use tfvars files for environment configs
  7. Secrets handling: Never commit secrets—use secret managers
  8. CI/CD integration: Automate deployments with pipelines
  9. Plan before apply: Always review plans before applying
  10. Document modules: Document inputs, outputs, and usage

9. Common Mistakes to Avoid

I’ve made these mistakes so you don’t have to:

  • Not using remote state: Local state doesn’t work for teams
  • Committing secrets: Never commit secrets to Git
  • Not using modules: Duplicate code leads to drift
  • Ignoring state files: State files are critical—back them up
  • Not planning: Always run terraform plan before apply
  • Hardcoding values: Use variables for flexibility
  • Not versioning: Pin provider and Terraform versions
  • Ignoring costs: Use cost estimation tools

10. Conclusion

Infrastructure as Code with Terraform enables reproducible, version-controlled AI infrastructure deployments. The key is modular design, remote state management, proper variable handling, and CI/CD integration. Get these right, and your AI infrastructure will be consistent, maintainable, and scalable.

🎯 Key Takeaway

Terraform for AI infrastructure is about reproducibility, version control, and automation. Use modules for reusability, remote state for collaboration, proper variable management for flexibility, and CI/CD for automation. The result: consistent, maintainable, and scalable AI infrastructure deployments across environments and clouds.

