Infrastructure as Code for AI: Terraform Patterns for LLM Deployments
Expert Guide to Managing AI Infrastructure with Terraform
I’ve managed AI infrastructure across AWS, Azure, and GCP using Terraform. Infrastructure as Code isn’t just about automation—it’s about reproducibility, version control, and managing complex AI deployments consistently. When you’re deploying LLM services, vector databases, and GPU clusters, Terraform becomes essential.
In this guide, I’ll share the Terraform patterns I’ve used to deploy production AI infrastructure. You’ll learn how to structure Terraform modules, manage GPU resources, deploy Kubernetes clusters, and handle multi-cloud deployments.
What You’ll Learn
- Terraform module patterns for AI infrastructure
- GPU resource management with Terraform
- Kubernetes cluster deployment patterns
- Multi-cloud deployment strategies
- State management and remote backends
- Variable management and secrets handling
- CI/CD integration for infrastructure
- Real-world examples from production deployments
- Common Terraform pitfalls and how to avoid them
Introduction: Why Infrastructure as Code for AI?
Managing AI infrastructure manually is error-prone, slow, and doesn’t scale. When you’re deploying GPU node pools, vector databases, and Kubernetes clusters across multiple environments, you need Infrastructure as Code. Terraform provides:
- Reproducibility: Same infrastructure every time, across environments
- Version control: Track infrastructure changes like code
- Multi-cloud support: Deploy to AWS, Azure, GCP with the same tool
- State management: Track infrastructure state and changes
- Modularity: Reusable modules for common patterns
I’ve seen teams spend days deploying AI infrastructure manually, only to have it fail in production. With Terraform, the same infrastructure deploys in minutes, consistently, every time.

1. Terraform Module Structure
1.1 Module Organization
Organize Terraform code into reusable modules:
# Module Structure
terraform/
├── modules/
│   ├── kubernetes-cluster/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── llm-service/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── vector-database/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── monitoring/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── production/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
└── main.tf
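Each module exposes a small, documented interface through its variables.tf and outputs.tf. As a minimal sketch (the exact variable and output names are illustrative, not taken from a real module), the kubernetes-cluster module might look like this:
# modules/kubernetes-cluster/variables.tf (illustrative)
variable "cluster_name" {
  description = "Name of the Kubernetes cluster"
  type        = string
}

variable "node_count" {
  description = "Desired number of GPU nodes"
  type        = number
  default     = 3
}

# modules/kubernetes-cluster/outputs.tf (illustrative)
output "cluster_id" {
  description = "Cluster ID, consumed by dependent modules such as llm-service"
  value       = aws_eks_cluster.ai_cluster.id
}

output "cluster_endpoint" {
  description = "Kubernetes API server endpoint"
  value       = aws_eks_cluster.ai_cluster.endpoint
}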
1.2 Module Usage
Use modules to create reusable infrastructure components:
# main.tf - Using modules
module "kubernetes_cluster" {
  source = "./modules/kubernetes-cluster"

  cluster_name = "ai-cluster"
  node_count   = 3
  node_type    = "g4dn.xlarge" # GPU-enabled

  tags = {
    Environment = "production"
    Application = "ai-llm"
  }
}

module "llm_service" {
  source = "./modules/llm-service"

  cluster_id = module.kubernetes_cluster.cluster_id
  gpu_count  = 2
  model_path = "s3://models/llm/v1"

  depends_on = [module.kubernetes_cluster]
}

module "vector_database" {
  source = "./modules/vector-database"

  instance_type = "r6g.2xlarge"
  storage_size  = 500 # GB
  replication   = true
}
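To make values such as the cluster endpoint available to other tooling (kubectl configuration, CI pipelines), expose them from the root module as well. A minimal sketch, assuming the module outputs sketched earlier:
# outputs.tf - root module outputs (illustrative)
output "cluster_id" {
  description = "ID of the AI cluster"
  value       = module.kubernetes_cluster.cluster_id
}

output "cluster_endpoint" {
  description = "Kubernetes API endpoint for kubectl and CI"
  value       = module.kubernetes_cluster.cluster_endpoint
}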

2. Kubernetes Cluster Deployment
2.1 EKS Cluster with Terraform
Deploy an EKS cluster for AI workloads:
# modules/kubernetes-cluster/main.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# EKS Cluster
resource "aws_eks_cluster" "ai_cluster" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = "1.28"

  vpc_config {
    subnet_ids              = var.subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = true
  }

  enabled_cluster_log_types = ["api", "audit", "authenticator"]

  tags = var.tags
}

# Node Group with GPU Support
resource "aws_eks_node_group" "gpu_nodes" {
  cluster_name    = aws_eks_cluster.ai_cluster.name
  node_group_name = "gpu-nodes"
  node_role_arn   = aws_iam_role.node_group.arn
  subnet_ids      = var.subnet_ids
  instance_types  = [var.node_type] # g4dn.xlarge, p3.2xlarge, etc.

  scaling_config {
    desired_size = var.node_count
    max_size     = var.max_nodes
    min_size     = var.min_nodes
  }

  update_config {
    max_unavailable = 1
  }

  labels = {
    accelerator = "gpu"
  }

  # The block is named "taint" (singular) and may be repeated
  taint {
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"
  }
}
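The cluster and node group above reference IAM roles that the module must also create. Here is a minimal sketch of those roles; the role names are illustrative, and the attached policies are the standard EKS managed policies:
# IAM role for the EKS control plane (illustrative)
resource "aws_iam_role" "cluster" {
  name = "${var.cluster_name}-cluster-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "eks.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "cluster_policy" {
  role       = aws_iam_role.cluster.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
}

# IAM role for the GPU node group (illustrative)
resource "aws_iam_role" "node_group" {
  name = "${var.cluster_name}-node-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# Worker nodes need the standard node, CNI, and ECR read-only policies
resource "aws_iam_role_policy_attachment" "node_policies" {
  for_each = toset([
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
  ])

  role       = aws_iam_role.node_group.name
  policy_arn = each.value
}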
2.2 AKS Cluster Deployment
Deploy an AKS cluster for Azure:
# AKS Cluster
resource "azurerm_kubernetes_cluster" "ai_cluster" {
  name                = var.cluster_name
  location            = var.location
  resource_group_name = var.resource_group_name
  dns_prefix          = var.cluster_name

  default_node_pool {
    name                = "default"
    node_count          = var.node_count
    vm_size             = var.node_type # Standard_NC6s_v3 for GPU
    enable_auto_scaling = true
    min_count           = var.min_nodes
    max_count           = var.max_nodes
  }

  identity {
    type = "SystemAssigned"
  }

  tags = var.tags
}

# GPU Node Pool
resource "azurerm_kubernetes_cluster_node_pool" "gpu_pool" {
  name                  = "gpunodepool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.ai_cluster.id
  vm_size               = "Standard_NC6s_v3" # NVIDIA GPU
  node_count            = var.gpu_node_count
  node_taints           = ["nvidia.com/gpu=true:NoSchedule"]

  tags = var.tags
}
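On both EKS and AKS, GPU nodes only become schedulable for CUDA workloads once the NVIDIA device plugin runs on them (some GPU-optimized node images bundle it already). A minimal sketch using the Helm provider, assuming a helm provider configured against the cluster; the chart repository and version here are assumptions, so verify them against the NVIDIA k8s-device-plugin documentation:
# Install the NVIDIA device plugin so pods can request nvidia.com/gpu resources
resource "helm_release" "nvidia_device_plugin" {
  name       = "nvidia-device-plugin"
  repository = "https://nvidia.github.io/k8s-device-plugin" # assumed chart repo
  chart      = "nvidia-device-plugin"
  namespace  = "kube-system"
  version    = "0.14.5" # example pin; use a current chart version
}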
3. GPU Resource Management
3.1 AWS GPU Instances
Provision GPU instances for LLM inference:
# EC2 GPU Instances
resource "aws_instance" "gpu_inference" {
  count = var.gpu_instance_count

  ami                    = data.aws_ami.gpu_ami.id
  instance_type          = var.gpu_instance_type # g4dn.xlarge, p3.2xlarge, etc.
  vpc_security_group_ids = [aws_security_group.gpu.id]
  subnet_id              = var.subnet_ids[count.index % length(var.subnet_ids)]
  iam_instance_profile   = aws_iam_instance_profile.gpu.name

  root_block_device {
    volume_type = "gp3"
    volume_size = 100
    encrypted   = true
  }

  tags = merge(var.tags, {
    Name        = "${var.cluster_name}-gpu-${count.index}"
    Accelerator = "gpu"
  })
}

# Launch Template for Auto Scaling
resource "aws_launch_template" "gpu_template" {
  name_prefix   = "${var.cluster_name}-gpu-"
  image_id      = data.aws_ami.gpu_ami.id
  instance_type = var.gpu_instance_type

  vpc_security_group_ids = [aws_security_group.gpu.id]

  block_device_mappings {
    device_name = "/dev/sda1"
    ebs {
      volume_type = "gp3"
      volume_size = 100
      encrypted   = true
    }
  }

  iam_instance_profile {
    name = aws_iam_instance_profile.gpu.name
  }

  tag_specifications {
    resource_type = "instance"
    tags = merge(var.tags, {
      Accelerator = "gpu"
    })
  }
}

# Auto Scaling Group
resource "aws_autoscaling_group" "gpu_asg" {
  name                = "${var.cluster_name}-gpu-asg"
  vpc_zone_identifier = var.subnet_ids
  target_group_arns   = [aws_lb_target_group.gpu.arn]
  health_check_type   = "ELB"
  min_size            = var.min_gpu_instances
  max_size            = var.max_gpu_instances
  desired_capacity    = var.desired_gpu_instances

  launch_template {
    id      = aws_launch_template.gpu_template.id
    version = "$Latest"
  }

  tag {
    key                 = "Name"
    value               = "${var.cluster_name}-gpu"
    propagate_at_launch = true
  }
}
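The instances and launch template above reference data.aws_ami.gpu_ami, which must resolve to an image with NVIDIA drivers preinstalled. A minimal sketch using a Deep Learning AMI lookup; the name filter is an assumption and should be adjusted to the AMI family you actually run:
# Look up a GPU-ready AMI (illustrative filter; verify the name pattern)
data "aws_ami" "gpu_ami" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["Deep Learning AMI GPU PyTorch * (Ubuntu 20.04) *"]
  }

  filter {
    name   = "architecture"
    values = ["x86_64"]
  }
}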
3.2 Azure GPU Instances
Provision GPU instances in Azure:
# Azure GPU VM
resource "azurerm_linux_virtual_machine" "gpu_vm" {
  count = var.gpu_vm_count

  name                = "${var.cluster_name}-gpu-${count.index}"
  resource_group_name = var.resource_group_name
  location            = var.location
  size                = var.gpu_vm_size # Standard_NC6s_v3, etc.
  admin_username      = var.admin_username

  admin_ssh_key {
    username   = var.admin_username
    public_key = var.ssh_public_key
  }

  network_interface_ids = [azurerm_network_interface.gpu[count.index].id]

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Premium_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-focal"
    sku       = "20_04-lts-gen2"
    version   = "latest"
  }

  tags = merge(var.tags, {
    Accelerator = "gpu"
  })
}
# Virtual Machine Scale Set for Auto Scaling
resource "azurerm_linux_virtual_machine_scale_set" "gpu_vmss" {
  name                = "${var.cluster_name}-gpu-vmss"
  resource_group_name = var.resource_group_name
  location            = var.location
  sku                 = var.gpu_vm_size
  instances           = var.gpu_instance_count
  admin_username      = var.admin_username

  admin_ssh_key {
    username   = var.admin_username
    public_key = var.ssh_public_key
  }

  # os_disk is required by azurerm_linux_virtual_machine_scale_set
  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Premium_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-focal"
    sku       = "20_04-lts-gen2"
    version   = "latest"
  }

  network_interface {
    name    = "gpu-nic"
    primary = true

    ip_configuration {
      name      = "internal"
      primary   = true
      subnet_id = var.subnet_id
    }
  }

  automatic_os_upgrade_policy {
    disable_automatic_rollback  = false
    enable_automatic_os_upgrade = true
  }

  tags = var.tags
}
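The scale set above provisions a fixed number of instances; to scale on load, attach autoscale rules to it. A minimal sketch using average CPU as the scaling signal (GPU utilization requires a custom metrics pipeline, which is out of scope here); the thresholds and limits are illustrative:
# Scale the GPU VMSS out when average CPU exceeds 70%
resource "azurerm_monitor_autoscale_setting" "gpu_autoscale" {
  name                = "${var.cluster_name}-gpu-autoscale"
  resource_group_name = var.resource_group_name
  location            = var.location
  target_resource_id  = azurerm_linux_virtual_machine_scale_set.gpu_vmss.id

  profile {
    name = "default"

    capacity {
      default = var.gpu_instance_count
      minimum = 1
      maximum = 10
    }

    rule {
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.gpu_vmss.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "GreaterThan"
        threshold          = 70
      }

      scale_action {
        direction = "Increase"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT5M"
      }
    }
  }
}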

4. State Management and Remote Backends
4.1 Remote State Configuration
Use remote backends for state management:
# backend.tf - Remote State Backend (use only one backend block per configuration)
terraform {
  backend "s3" {
    bucket         = "ai-infrastructure-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

# Alternative: Azure Backend
terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state"
    storage_account_name = "tfstate"
    container_name       = "tfstate"
    key                  = "production.terraform.tfstate"
  }
}

# Alternative: GCS Backend
terraform {
  backend "gcs" {
    bucket = "ai-infrastructure-terraform-state"
    prefix = "production"
  }
}
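The state bucket itself must exist before the backend can be initialized, and it should be versioned and encrypted, since state can contain sensitive values. A minimal bootstrap sketch, typically applied once from a separate configuration that still uses local state:
# S3 bucket that holds Terraform state for the AI infrastructure
resource "aws_s3_bucket" "terraform_state" {
  bucket = "ai-infrastructure-terraform-state"
}

# Keep a history of state files so accidental corruption can be rolled back
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state at rest
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}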
4.2 State Locking
Prevent concurrent modifications with state locking:
# DynamoDB Table for State Locking (AWS)
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name = "Terraform State Lock"
  }
}

# State Locking Configuration
terraform {
  backend "s3" {
    # ... other config ...
    dynamodb_table = "terraform-state-lock"
  }
}
5. Variable Management and Secrets
5.1 Variable Files
Organize variables by environment:
# variables.tf
variable "cluster_name" {
  description = "Name of the Kubernetes cluster"
  type        = string
}

variable "node_count" {
  description = "Number of nodes in the cluster"
  type        = number
  default     = 3
}

variable "gpu_instance_type" {
  description = "GPU instance type"
  type        = string
  default     = "g4dn.xlarge"
}

variable "tags" {
  description = "Tags to apply to resources"
  type        = map(string)
  default     = {}
}

# terraform.tfvars (production)
cluster_name      = "ai-production-cluster"
node_count        = 5
gpu_instance_type = "p3.2xlarge"

tags = {
  Environment = "production"
  Application = "ai-llm"
  ManagedBy   = "terraform"
}
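Terraform can also reject invalid values at plan time with validation blocks, which is useful for catching a non-GPU instance type before anything is provisioned. A minimal sketch; the allowed list is illustrative:
# Guard against accidentally deploying inference on a non-GPU instance type
variable "gpu_instance_type" {
  description = "GPU instance type for inference nodes"
  type        = string
  default     = "g4dn.xlarge"

  validation {
    condition     = contains(["g4dn.xlarge", "g5.xlarge", "p3.2xlarge"], var.gpu_instance_type)
    error_message = "gpu_instance_type must be one of the approved GPU instance types."
  }
}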
5.2 Secrets Management
Handle secrets securely with Terraform:
# Using AWS Secrets Manager
data "aws_secretsmanager_secret" "api_key" {
  name = "ai-api-key"
}

data "aws_secretsmanager_secret_version" "api_key" {
  secret_id = data.aws_secretsmanager_secret.api_key.id
}

# Using Azure Key Vault
data "azurerm_key_vault_secret" "api_key" {
  name         = "ai-api-key"
  key_vault_id = azurerm_key_vault.main.id
}

# Using environment variables (sensitive)
variable "api_key" {
  description = "API key for LLM service"
  type        = string
  sensitive   = true
}

# Pass to resources
resource "kubernetes_secret" "api_key" {
  metadata {
    name = "api-key"
  }

  data = {
    api-key = var.api_key
  }
}
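To avoid routing the key through a variable at all, you can wire the Secrets Manager value straight into the Kubernetes secret. A minimal sketch, assuming the secret is stored as a plain string (use jsondecode if it is stored as JSON):
# Read the secret value from Secrets Manager and inject it into the cluster
resource "kubernetes_secret" "api_key_from_sm" {
  metadata {
    name = "api-key"
  }

  data = {
    api-key = data.aws_secretsmanager_secret_version.api_key.secret_string
  }
}
Note that the value still ends up in Terraform state, so the remote state backend must be encrypted and access-controlled.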
6. Multi-Cloud Deployment
6.1 Provider Configuration
Configure multiple cloud providers:
# providers.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    google = {
      source  = "hashicorp/google"
      version = "~> 4.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.20"
    }
  }
}

# AWS Provider
provider "aws" {
  region = var.aws_region

  default_tags {
    tags = var.tags
  }
}

# Azure Provider
provider "azurerm" {
  features {}
}

# GCP Provider
provider "google" {
  project = var.gcp_project_id
  region  = var.gcp_region
}
6.2 Cross-Cloud Resources
Deploy resources across multiple clouds:
# AWS Resources
module "aws_infrastructure" {
  source = "./modules/aws"

  cluster_name = var.cluster_name
  region       = var.aws_region
}

# Azure Resources
module "azure_infrastructure" {
  source = "./modules/azure"

  cluster_name        = var.cluster_name
  resource_group_name = var.azure_resource_group
  location            = var.azure_location
}

# GCP Resources
module "gcp_infrastructure" {
  source = "./modules/gcp"

  cluster_name = var.cluster_name
  project_id   = var.gcp_project_id
  region       = var.gcp_region
}
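When a root configuration holds several provider configurations (for example, multiple AWS regions or aliased accounts), pass the intended configuration to each module explicitly rather than relying on inheritance. A minimal sketch, assuming an aliased us-west-2 provider:
# Aliased provider for a second AWS region (illustrative)
provider "aws" {
  alias  = "us_west"
  region = "us-west-2"
}

# Bind the module to the aliased provider explicitly
module "aws_infrastructure_west" {
  source = "./modules/aws"

  providers = {
    aws = aws.us_west
  }

  cluster_name = "${var.cluster_name}-west"
  region       = "us-west-2"
}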
7. CI/CD Integration
7.1 GitHub Actions Workflow
Automate Terraform deployments with CI/CD:
# .github/workflows/terraform.yml
name: Terraform Deploy

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  terraform:
    runs-on: ubuntu-latest
    # Credentials are needed by both plan and apply, so set them at the job level
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.5.0

      - name: Terraform Init
        run: terraform init
        working-directory: ./environments/production

      - name: Terraform Validate
        run: terraform validate
        working-directory: ./environments/production

      - name: Terraform Plan
        run: terraform plan -out=tfplan
        working-directory: ./environments/production

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main'
        run: terraform apply -auto-approve tfplan
        working-directory: ./environments/production
7.2 Terraform Cloud Integration
Use Terraform Cloud for managed runs:
# terraform.tf - Terraform Cloud Backend
terraform {
  cloud {
    organization = "your-org"

    workspaces {
      name = "ai-infrastructure-production"
    }
  }
}

# Workspace Configuration
# - Auto-apply on merge to main
# - Manual apply for other branches
# - Cost estimation enabled
# - Policy checks (Sentinel)


8. Best Practices: Lessons from Production
After architecting multiple cloud-native AI systems, I follow these practices:
- Use modules: Create reusable, composable modules
- Remote state: Store state remotely for team collaboration
- State locking: Prevent concurrent modifications
- Version control: Track infrastructure changes in Git
- Environment separation: Separate dev, staging, production
- Variable management: Use tfvars files for environment configs
- Secrets handling: Never commit secrets—use secret managers
- CI/CD integration: Automate deployments with pipelines
- Plan before apply: Always review plans before applying
- Document modules: Document inputs, outputs, and usage


9. Common Mistakes to Avoid
I’ve made these mistakes so you don’t have to:
- Not using remote state: Local state doesn’t work for teams
- Committing secrets: Never commit secrets to Git
- Not using modules: Duplicate code leads to drift
- Ignoring state files: State files are critical—back them up
- Not planning: Always run terraform plan before apply
- Hardcoding values: Use variables for flexibility
- Not versioning: Pin provider and Terraform versions
- Ignoring costs: Use cost estimation tools
10. Conclusion
Infrastructure as Code with Terraform enables reproducible, version-controlled AI infrastructure deployments. The key is modular design, remote state management, proper variable handling, and CI/CD integration. Get these right, and your AI infrastructure will be consistent, maintainable, and scalable.
🎯 Key Takeaway
Terraform for AI infrastructure is about reproducibility, version control, and automation. Use modules for reusability, remote state for collaboration, proper variable management for flexibility, and CI/CD for automation. The result: consistent, maintainable, and scalable AI infrastructure deployments across environments and clouds.