2025-09-04

AWS Fargate 102: The Patterns Nobody Tells You About

Advanced Fargate patterns learned from running production workloads. From cost optimization to stateful containers, here's what the docs won't tell you.

In Fargate 101, we covered the basics of getting started. This post explores advanced patterns that emerge from running production workloads - the kind of insights that typically surface during troubleshooting when the actual mechanics suddenly become clear.

Cost Optimization Strategies

As mentioned in the previous post, Fargate does cost more than EC2. However, several approaches can help manage those costs effectively. When AWS bills start climbing unexpectedly, systematic optimization becomes essential.

Fargate Spot: A Significant Cost Reduction

Fargate Spot offers substantial savings (up to 70%) with the trade-off that AWS can terminate your containers with a 2-minute notice. While this sounds risky, it works well for many use cases when implemented thoughtfully.

Here’s a proven approach:

resource "aws_ecs_service" "batch_processor" {
  name            = "batch-processor"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.batch.arn

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 4
    base              = 0
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1
    base              = 2  # Always keep 2 on regular Fargate
  }

  desired_count = 10
}

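With desired_count = 10, the first 2 tasks go to regular Fargate (the base) and the remaining 8 are split 4:1, so roughly 6 land on Spot. As the task count grows, the split approaches 80% on Spot while the baseline on regular Fargate stays in place for stability. This pattern works particularly well for: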

Batch processing jobs
Development and staging environments
Asynchronous workers that can handle restarts gracefully
CI/CD runners
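Handling restarts gracefully mostly comes down to reacting to SIGTERM. ECS sends SIGTERM when a task is stopped (including Spot reclaims) and follows up with SIGKILL once the container's stopTimeout expires (30 seconds by default, up to 120 on Fargate). The sketch below is one way to drain a worker; `DrainableWorker` and the 100-second budget are illustrative assumptions, not part of any AWS SDK.

```typescript
type Job = () => Promise<void>;

class DrainableWorker {
  private draining = false;
  private inFlight = 0;

  // Runs the job unless draining has started; returns false if it was skipped.
  async run(job: Job): Promise<boolean> {
    if (this.draining) return false;
    this.inFlight++;
    try {
      await job();
    } finally {
      this.inFlight--;
    }
    return true;
  }

  // Stops accepting work, then waits for in-flight jobs or the deadline.
  async drain(deadlineMs: number, pollMs = 50): Promise<boolean> {
    this.draining = true;
    const deadline = Date.now() + deadlineMs;
    while (this.inFlight > 0 && Date.now() < deadline) {
      await new Promise((resolve) => setTimeout(resolve, pollMs));
    }
    return this.inFlight === 0;
  }
}

const worker = new DrainableWorker();

// Assumes stopTimeout is raised to 120 seconds in the container definition.
process.on("SIGTERM", async () => {
  const clean = await worker.drain(100_000);
  process.exit(clean ? 0 : 1);
});
```

Jobs that cannot finish inside the budget should be safe to retry, for example by letting the queue message become visible again.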

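A crucial lesson: always set up CloudWatch alarms for Spot interruptions. When interruption rates spike, temporarily shift more tasks to regular Fargate:

// Lambda function to handle Spot interruptions
import { ECSClient, PutClusterCapacityProvidersCommand } from "@aws-sdk/client-ecs";

// Create the client once so warm invocations reuse it
const ecs = new ECSClient({ region: process.env.AWS_REGION });

// Placeholder: schedule the revert yourself, for example with a one-time
// EventBridge Scheduler schedule. Not implemented here.
const scheduleReversion = async (): Promise<void> => {};

export const handleSpotInterruption = async (_event: unknown): Promise<void> => {
  // Temporarily favor regular Fargate in the cluster's default strategy.
  // This default only applies to services and tasks launched without their
  // own capacity provider strategy; a service that defines one (such as
  // batch-processor above) needs an UpdateService call instead.
  const command = new PutClusterCapacityProvidersCommand({
    cluster: 'production',
    capacityProviders: ['FARGATE', 'FARGATE_SPOT'],
    defaultCapacityProviderStrategy: [
      { capacityProvider: 'FARGATE', weight: 10, base: 5 },
      { capacityProvider: 'FARGATE_SPOT', weight: 1, base: 0 }
    ]
  });
  await ecs.send(command);

  // Set a timer to revert after 30 minutes
  await scheduleReversion();
};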
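To decide when an interruption rate counts as a spike, the handler first has to recognize interruptions. ECS publishes "ECS Task State Change" events to EventBridge, and a reclaimed Spot task carries the stop code `SpotInterruption`. The sketch below filters and counts those events; the interface covers only the fields read here, and `countRecentInterruptions` is a hypothetical helper.

```typescript
// Only the fields this sketch reads from the EventBridge event.
interface EcsTaskStateChangeEvent {
  "detail-type": string;
  time: string; // ISO 8601 timestamp
  detail: { stopCode?: string; stoppedReason?: string };
}

function isSpotInterruption(event: EcsTaskStateChangeEvent): boolean {
  return (
    event["detail-type"] === "ECS Task State Change" &&
    event.detail.stopCode === "SpotInterruption"
  );
}

// Counts interruptions whose timestamp falls inside the trailing window.
function countRecentInterruptions(
  events: EcsTaskStateChangeEvent[],
  nowMs: number,
  windowMs: number
): number {
  return events.filter((e) => {
    const t = Date.parse(e.time);
    return isSpotInterruption(e) && t <= nowMs && nowMs - t <= windowMs;
  }).length;
}
```

A threshold on this count (say, more than a handful in ten minutes) is a judgment call that depends on how many tasks the service runs.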

Right-Sizing: Finding the Sweet Spot

Many teams tend to over-provision Fargate tasks to avoid out-of-memory issues. Here’s a systematic approach for finding appropriate resource allocation:

Start big, measure, then shrink

# Deploy with generous resources
CPU: 1024
Memory: 2048

# After a week, check actual usage
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name MemoryUtilized \
  --dimensions Name=ClusterName,Value=production Name=ServiceName,Value=my-service \
  --statistics Average,Maximum \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-08T00:00:00Z \
  --period 3600

Use the 80% rule: size the task so that peak usage sits at about 80% of the allocation, not 100%. That 20% buffer handles spikes.
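As a rough illustration of the 80% rule, the helper below picks the smallest memory setting where the measured peak stays at or under the target utilization. The table is an excerpt of the smaller Fargate CPU/memory combinations only, and `recommendMemoryMiB` is a made-up name for this sketch.

```typescript
// Memory options in MiB per CPU size (excerpt: 0.25, 0.5, and 1 vCPU only).
const FARGATE_MEMORY_MIB: Record<number, number[]> = {
  256: [512, 1024, 2048],
  512: [1024, 2048, 3072, 4096],
  1024: [2048, 3072, 4096, 5120, 6144, 7168, 8192],
};

// Returns the smallest memory option where peak <= target utilization,
// or undefined if the CPU size has no option that is large enough.
function recommendMemoryMiB(
  cpuUnits: number,
  peakMiB: number,
  targetUtilization = 0.8
): number | undefined {
  const options = FARGATE_MEMORY_MIB[cpuUnits] ?? [];
  const required = peakMiB / targetUtilization;
  return options.find((memory) => memory >= required);
}
```

With a measured peak of 700 MiB on a 512-CPU task, the requirement is 875 MiB, so 1024 is enough. At a 900 MiB peak the requirement rises to 1125 MiB and the next step up is 2048.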

Different sizes for different environments:

locals {
  task_sizes = {
    production = {
      cpu    = "512"
      memory = "1024"
    }
    staging = {
      cpu    = "256"
      memory = "512"
    }
    development = {
      cpu    = "256"
      memory = "512"
    }
  }
}

resource "aws_ecs_task_definition" "app" {
  cpu    = local.task_sizes[var.environment].cpu
  memory = local.task_sizes[var.environment].memory
  # ...
}

ARM + Savings Plans: The Double Discount

AWS Graviton (ARM) tasks on Fargate are about 20% cheaper than x86 and often faster. Combine with Savings Plans for roughly another 20% off:

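# Multi-arch Dockerfile
# The builder stage runs on the build machine's native platform
FROM --platform=$BUILDPLATFORM node:18-alpine AS builder
ARG TARGETPLATFORM
ARG BUILDPLATFORM
RUN echo "Building on $BUILDPLATFORM for $TARGETPLATFORM"
WORKDIR /app

# Your build steps here...

# The final stage is pulled for the target platform
FROM node:18-alpine
# Without WORKDIR, "node index.js" would look for index.js in /
WORKDIR /app
COPY --from=builder /app /app
CMD ["node", "index.js"]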

Build for both architectures:

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag myapp:latest \
  --push .

Then in your task definition:

{
  "runtimePlatform": {
    "cpuArchitecture": "ARM64",
    "operatingSystemFamily": "LINUX"
  }
}

Note: Fargate currently supports Graviton2 processors, which offer excellent price-performance for most workloads.

Moving Node.js services to Graviton typically results in about 30-40% cost savings. Common compatibility issues include legacy Java applications with x86-specific JNI libraries, though most modern workloads transition smoothly.
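The two discounts multiply rather than add, which is why two 20% reductions come out below 40%. A quick arithmetic sketch, using the approximate percentages quoted above rather than current price-list values:

```typescript
// Each discount applies to what is left after the previous one.
function combinedSavings(discounts: number[]): number {
  const remainingCost = discounts.reduce((cost, d) => cost * (1 - d), 1);
  return 1 - remainingCost;
}

// Graviton (about 20%) followed by a Savings Plan (about 20%)
const total = combinedSavings([0.2, 0.2]);
console.log(`${(total * 100).toFixed(0)}% combined savings`);
```

The result is about 36%, which sits inside the 30-40% range mentioned above.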

Working With Stateful Containers

The conventional wisdom is that containers should be stateless, and generally that’s good advice. However, there are cases where you need persistent state, and EFS can be both helpful and challenging in this context.

EFS: The Good, Bad, and Ugly

Setting up EFS with Fargate:

resource "aws_efs_file_system" "shared" {
  creation_token = "shared-storage"

  performance_mode = "generalPurpose"  # or "maxIO" for more operations
  throughput_mode  = "bursting"        # or "provisioned" for consistent performance

  lifecycle_policy {
    transition_to_ia = "AFTER_30_DAYS"  # Save money on cold data
  }
}

resource "aws_ecs_task_definition" "app" {
  # ... other config ...

  volume {
    name = "shared-storage"

    efs_volume_configuration {
      file_system_id          = aws_efs_file_system.shared.id
      root_directory          = "/"
      transit_encryption      = "ENABLED"
      transit_encryption_port = 2999

      authorization_config {
        access_point_id = aws_efs_access_point.app.id
        iam             = "ENABLED"
      }
    }
  }

  container_definitions = jsonencode([{
    name = "app"
    # ...
    mountPoints = [{
      sourceVolume  = "shared-storage"
      containerPath = "/data"
    }]
  }])
}

Typical performance characteristics:

EFS latency: 0.25-10ms depending on operation type (metadata operations are faster than data operations), compared to roughly 0.1ms for local SSD
Throughput: Bursts to 100MB/s, sustains around 10MB/s in bursting throughput mode (the baseline scales with the amount of data stored)
Cost: About $0.30/GB/month (vs $0.10 for EBS)
Note: As of January 2024, Fargate also supports EBS volumes for better performance when you need faster, persistent storage

Where EFS proves useful:

Shared configuration files that multiple containers need to access
User uploads that require cross-container availability
Build caches (though locking can be tricky)
Legacy applications with hard filesystem requirements

Where alternatives are recommended:

Database storage (RDS or managed databases work better)
High-frequency write operations
Temporary files (container ephemeral storage is faster)
Caching layers (ElastiCache is more appropriate)

The Session Affinity Pattern

Sometimes you need sticky sessions. Here’s how to do it with Fargate:

resource "aws_lb_target_group" "app" {
  name        = "app-tg"
  port        = 80
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "ip"

  stickiness {
    type            = "app_cookie"
    cookie_duration = 86400
    cookie_name     = "FARGATE_SESSION"  # Your application must set this cookie itself
  }

  health_check {
    enabled = true
    path    = "/health"
    matcher = "200"
  }
}

However, sticky sessions and auto-scaling don’t play well together. When tasks terminate unexpectedly, those sessions disappear. A more robust approach:

Store session data in ElastiCache Redis for persistence
Use JWT tokens instead of server-side sessions when possible
Design for graceful session loss during deployments
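To show what moving session state out of the task can look like, here is a minimal signed-token sketch. It is not a full JWT implementation (in production a maintained JWT library is the safer choice); it uses an HMAC-signed payload so that any task can verify a session without shared memory. The function names and claim fields are assumptions made for this example.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface SessionClaims {
  sub: string; // user id
  exp: number; // expiry, seconds since epoch
}

function sign(payload: string, secret: string): string {
  return createHmac("sha256", secret).update(payload).digest("base64url");
}

function issueToken(claims: SessionClaims, secret: string): string {
  const payload = Buffer.from(JSON.stringify(claims)).toString("base64url");
  return `${payload}.${sign(payload, secret)}`;
}

// Returns the claims when the signature matches and the token has not expired.
function verifyToken(
  token: string,
  secret: string,
  nowSeconds: number = Math.floor(Date.now() / 1000)
): SessionClaims | null {
  const [payload, signature] = token.split(".");
  if (!payload || !signature) return null;
  const expected = Buffer.from(sign(payload, secret));
  const actual = Buffer.from(signature);
  if (expected.length !== actual.length || !timingSafeEqual(expected, actual)) {
    return null;
  }
  const claims = JSON.parse(Buffer.from(payload, "base64url").toString("utf8")) as SessionClaims;
  return claims.exp > nowSeconds ? claims : null;
}
```

Because verification needs only the shared secret, a request can land on any task after a scale-in or deployment and the session still validates.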

Monitoring Fargate Workloads

One challenge with Fargate is the reduced visibility into the underlying infrastructure. Here’s an effective monitoring approach:

The Three Pillars of Fargate Observability

CloudWatch Container Insights (The Basics)
```
aws ecs put-account-setting \
  --name containerInsights \
  --value enabled
```
This gives you CPU, memory, network, and disk metrics. It’s fine for basics but misses application-level stuff.

X-Ray for Distributed Tracing (The Connections)

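// Add to your Node.js app (AWS SDK v3)
// On Fargate there is no host-level X-Ray daemon, so run the daemon
// as a sidecar container in the same task.
const AWSXRay = require('aws-xray-sdk-core');
const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');

// Calls made through the wrapped client are traced
const s3 = AWSXRay.captureAWSv3Client(new S3Client({}));

async function readFile() {
  return s3.send(new GetObjectCommand({ Bucket: 'my-bucket', Key: 'file.txt' }));
}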

Custom Metrics via StatsD (The Details)

// Run a sidecar container for StatsD
const taskDef = {
  containerDefinitions: [
    {
      name: "app",
      // your app config
    },
    {
      name: "datadog-agent",
      image: "datadog/agent:latest",
      environment: [
        // In production, prefer the container's `secrets` field over a plaintext API key
        { name: "DD_API_KEY", value: process.env.DD_API_KEY },
        { name: "ECS_FARGATE", value: "true" }
      ]
    }
  ]
};
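On the application side, emitting a custom metric is a single UDP datagram to the sidecar. Containers in the same Fargate task share a network namespace, so the agent is reachable on localhost. The sketch below assumes the agent listens on the usual StatsD port 8125 and uses the DogStatsD tag extension; the metric names are placeholders.

```typescript
import { createSocket } from "node:dgram";

type MetricType = "c" | "g" | "ms"; // counter, gauge, timer

// Builds a line such as "orders.processed:1|c|#env:prod".
function formatMetric(
  name: string,
  value: number,
  type: MetricType,
  tags: Record<string, string> = {}
): string {
  const tagList = Object.entries(tags)
    .map(([key, val]) => `${key}:${val}`)
    .join(",");
  return `${name}:${value}|${type}` + (tagList ? `|#${tagList}` : "");
}

// Fire-and-forget send; UDP gives no delivery guarantee.
function sendMetric(line: string, host = "127.0.0.1", port = 8125): void {
  const socket = createSocket("udp4");
  socket.send(line, port, host, () => socket.close());
}
```

A call such as `sendMetric(formatMetric("orders.processed", 1, "c", { env: "prod" }))` increments a counter. A long-lived socket would be more efficient than opening one per metric; this version favors brevity.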

The Debug Pattern That Saves Hours

Can’t SSH into Fargate? Use ECS Exec, but make it useful:

# Add this to your Dockerfile (Alpine package names)
RUN apk add --no-cache \
    curl \
    net-tools \
    procps \
    htop \
    strace \
    tcpdump

# Enable ECS Exec on the service. The task role also needs ssmmessages
# permissions, and only tasks started after this change can be exec'd into.
aws ecs update-service \
  --cluster production \
  --service my-service \
  --enable-execute-command \
  --force-new-deployment

# Debug like a pro
aws ecs execute-command \
  --cluster production \
  --task abc123 \
  --container app \
  --interactive \
  --command "/bin/sh"

# Inside the container:
> netstat -tulpn  # Check what's listening
> ps aux  # See all processes
> strace -p 1  # Trace system calls (needs the SYS_PTRACE capability)
> tcpdump -i any  # Watch network traffic

Blue-Green Deployments: The Safe Way

Fargate + CodeDeploy = zero-downtime deployments. Here’s the setup that’s saved us from many bad deploys:

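resource "aws_codedeploy_deployment_group" "app" {
  app_name               = aws_codedeploy_app.app.name
  deployment_group_name  = "production"
  deployment_config_name = "CodeDeployDefault.ECSLinear10PercentEvery1Minutes"
  service_role_arn       = aws_iam_role.codedeploy.arn  # assumed role, defined elsewhere

  # ECS deployments must be blue/green with traffic control
  deployment_style {
    deployment_option = "WITH_TRAFFIC_CONTROL"
    deployment_type   = "BLUE_GREEN"
  }

  auto_rollback_configuration {
    enabled = true
    events  = ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"]
  }

  blue_green_deployment_config {
    terminate_blue_instances_on_deployment_success {
      action                           = "TERMINATE"
      termination_wait_time_in_minutes = 5
    }

    deployment_ready_option {
      action_on_timeout = "CONTINUE_DEPLOYMENT"
    }
    # green_fleet_provisioning_option is EC2-only, so it is omitted for ECS
  }

  ecs_service {
    cluster_name = aws_ecs_cluster.main.name
    service_name = aws_ecs_service.app.name
  }

  # Assumed listener and blue/green target groups, defined elsewhere
  load_balancer_info {
    target_group_pair_info {
      prod_traffic_route {
        listener_arns = [aws_lb_listener.app.arn]
      }
      target_group {
        name = aws_lb_target_group.blue.name
      }
      target_group {
        name = aws_lb_target_group.green.name
      }
    }
  }
}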

The killer feature? Automatic rollback on CloudWatch alarms:

const errorRateAlarm = new cloudwatch.Alarm(this, 'ErrorRate', {
  metric: new cloudwatch.Metric({
    namespace: 'MyApp',
    metricName: 'Errors',
    statistic: 'Sum'
  }),
  threshold: 10,
  evaluationPeriods: 2
});

// Link to deployment group
deploymentGroup.addAlarm(errorRateAlarm);

Multi-Region Fargate: Because Disasters Happen

Running Fargate across regions isn’t hard, but keeping them in sync is.

Common gotchas to watch for:

Environment variables per region: Don’t hardcode endpoints
S3 bucket names must be unique globally: Add region suffix
Cross-region latency: 80-150ms depending on specific regions (us-east-1 to eu-west-1 is typically around 90ms)
Failover isn’t instant: Route 53 health checks take 30-60 seconds
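The first two gotchas can be handled by deriving every region-specific value from a single source instead of hardcoding it. A small sketch, assuming `AWS_REGION` is present in the task environment; the bucket naming scheme, the `API_ENDPOINT` override, and the example.com hostname are illustrative only.

```typescript
interface RegionConfig {
  region: string;
  bucketName: string;
  apiEndpoint: string;
}

// Derives region-specific settings from the environment at startup.
function buildRegionConfig(
  env: Record<string, string | undefined>,
  appName: string
): RegionConfig {
  const region = env.AWS_REGION;
  if (!region) {
    throw new Error("AWS_REGION is not set");
  }
  return {
    region,
    // Region suffix keeps per-region bucket names distinct
    bucketName: `${appName}-uploads-${region}`,
    // Explicit override wins; otherwise fall back to a per-region default
    apiEndpoint: env.API_ENDPOINT ?? `https://api.${region}.example.com`,
  };
}
```

Calling `buildRegionConfig(process.env, "myapp")` once at startup and failing fast on a missing region tends to surface misconfigured task definitions during deployment rather than at the first request.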

Essential Patterns for Production Fargate

The Sidecar Pattern

{
  "containerDefinitions": [
    {
      "name": "app",
      "dependsOn": [{
        "containerName": "envoy",
        "condition": "HEALTHY"
      }]
    },
    {
      "name": "envoy",
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:9901/ready || exit 1"]
      }
    }
  ]
}

The Init Container Pattern (Sort Of)

Fargate doesn’t have true init containers, but you can fake it:

#!/bin/sh
# Entrypoint script: run initialization, then hand off to the app
echo "Running initialization..."
if ! /app/init-db.sh; then
  echo "Init failed, exiting"
  exit 1
fi
echo "Starting main application..."
exec node index.js
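ECS container dependencies offer another way to approximate an init container: a non-essential container runs to completion, and the app container waits for it with a `SUCCESS` condition. The sketch below expresses that as a TypeScript object in the same style as the StatsD sidecar example; image names and commands are placeholders, and `findInvalidInitContainers` is a hypothetical sanity check.

```typescript
interface ContainerDependency {
  containerName: string;
  condition: "START" | "COMPLETE" | "SUCCESS" | "HEALTHY";
}

interface ContainerDef {
  name: string;
  image: string;
  essential: boolean;
  command?: string[];
  dependsOn?: ContainerDependency[];
}

const containerDefinitions: ContainerDef[] = [
  // Runs once and exits; must be non-essential so its exit does not stop the task
  { name: "init-db", image: "myapp:latest", essential: false, command: ["/app/init-db.sh"] },
  {
    name: "app",
    image: "myapp:latest",
    essential: true,
    command: ["node", "index.js"],
    dependsOn: [{ containerName: "init-db", condition: "SUCCESS" }],
  },
];

// Flags dependencies that point at a missing container, or at an essential
// container that is expected to exit (SUCCESS or COMPLETE).
function findInvalidInitContainers(defs: ContainerDef[]): string[] {
  const byName = new Map<string, ContainerDef>(
    defs.map((d): [string, ContainerDef] => [d.name, d])
  );
  const invalid: string[] = [];
  for (const def of defs) {
    for (const dep of def.dependsOn ?? []) {
      const target = byName.get(dep.containerName);
      const mustExit = dep.condition === "SUCCESS" || dep.condition === "COMPLETE";
      if (!target || (mustExit && target.essential)) invalid.push(dep.containerName);
    }
  }
  return invalid;
}
```

Compared with the entrypoint script, this keeps initialization logs and exit codes in a separate container, at the cost of a slightly longer task definition.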

The Circuit Breaker Pattern

class FargateCircuitBreaker {
  private failures = 0;
  private lastFailTime = 0;
  private readonly threshold = 5;
  private readonly timeout = 60000; // 1 minute
  
  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isOpen()) {
      throw new Error('Circuit breaker is open');
    }
    
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  
  private isOpen(): boolean {
    return this.failures >= this.threshold && 
           Date.now() - this.lastFailTime < this.timeout;
  }
  
  private onSuccess(): void {
    this.failures = 0;
  }
  
  private onFailure(): void {
    this.failures++;
    this.lastFailTime = Date.now();
  }
}

Remember: Fargate is like a Swiss Army knife. Incredibly useful, but you’ll occasionally cut yourself if you’re not careful.
