AWS Fargate 102: The Patterns Nobody Tells You About

In Fargate 101, I covered the basics of getting started. This post explores some patterns I've discovered after running production workloads for a while—the kind of insights that usually emerge when you're troubleshooting an issue and suddenly understand how something actually works under the hood.

Cost Optimization Strategies

As I mentioned in the previous post, Fargate does cost more than EC2. However, I've learned some approaches that can help manage those costs. After experiencing a particularly surprising AWS bill one month, we got more systematic about optimization.

Fargate Spot: A Significant Cost Reduction

Fargate Spot offers substantial savings (up to 70%) with the trade-off that AWS can terminate your containers with a 2-minute notice. While this sounds risky, I've found it works well for many use cases when implemented thoughtfully.

Here's an approach that has worked for us:

hcl

resource "aws_ecs_service" "batch_processor" {
  name            = "batch-processor"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.batch.arn
  
  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 4
    base              = 0
  }
  
  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1
    base              = 2  # Always keep 2 on regular Fargate
  }
  
  desired_count = 10
}

This configuration always keeps 2 tasks on regular Fargate and splits the remaining tasks 4:1 in favor of Spot, so with a desired count of 10 roughly 6 of the 10 tasks run on Spot. I've found this pattern works well for:

Batch processing jobs
Development and staging environments
Asynchronous workers that can handle restarts gracefully
CI/CD runners
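
To see what a given base and weight combination implies, the split can be estimated with a few lines of arithmetic. This is a minimal sketch under the documented rule that base tasks are placed first and the remainder is divided in proportion to weight. The `estimateSplit` function and its types are illustrative, not part of any AWS SDK, and ECS's exact rounding is not modeled, so the results are left fractional.

```typescript
// Illustrative type; mirrors the fields of a capacity provider strategy item.
interface Strategy {
  capacityProvider: string;
  weight: number;
  base: number;
}

// Estimate how many tasks land on each capacity provider.
// Base tasks are assigned first; the rest are divided by weight ratio.
// Results are fractional because ECS's exact rounding is not modeled here.
function estimateSplit(desiredCount: number, strategies: Strategy[]): Record<string, number> {
  const totalBase = strategies.reduce((sum, s) => sum + s.base, 0);
  const totalWeight = strategies.reduce((sum, s) => sum + s.weight, 0);
  const remaining = Math.max(desiredCount - totalBase, 0);

  const split: Record<string, number> = {};
  for (const s of strategies) {
    const weighted = totalWeight > 0 ? (remaining * s.weight) / totalWeight : 0;
    split[s.capacityProvider] = Math.min(s.base, desiredCount) + weighted;
  }
  return split;
}

const split = estimateSplit(10, [
  { capacityProvider: "FARGATE_SPOT", weight: 4, base: 0 },
  { capacityProvider: "FARGATE", weight: 1, base: 2 },
]);
console.log(split); // about 6.4 tasks on FARGATE_SPOT and 3.6 on FARGATE
```

The base of 2 is taken out before the 4:1 weighting applies, which is why the Spot share of ten tasks comes out near 64% rather than 80%.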

One thing I learned the hard way: it's worth setting up CloudWatch alarms for Spot interruptions. When interruption rates spike, we temporarily shift more tasks to regular Fargate:

TypeScript

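// Lambda function to handle spot interruptions (AWS SDK for JavaScript v2)
import * as AWS from 'aws-sdk';

// Reverts the strategy later; implementation not shown
declare function scheduleReversion(): Promise<void>;

export const handleSpotInterruption = async (event: unknown): Promise<void> => {
  const ecs = new AWS.ECS();

  // Temporarily increase regular Fargate weight in the cluster's default strategy.
  // Note: the default only applies to services and tasks launched without their own
  // capacity provider strategy. A service that sets one explicitly needs an
  // updateService call with capacityProviderStrategy and forceNewDeployment instead.
  await ecs.putClusterCapacityProviders({
    cluster: 'production',
    capacityProviders: ['FARGATE', 'FARGATE_SPOT'],
    defaultCapacityProviderStrategy: [
      { capacityProvider: 'FARGATE', weight: 10, base: 5 },
      { capacityProvider: 'FARGATE_SPOT', weight: 1, base: 0 }
    ]
  }).promise();

  // Set a timer to revert after 30 minutes
  await scheduleReversion();
};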
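
On the task side, a Spot interruption shows up as a SIGTERM, followed by a forced stop once the container's stop timeout expires (at most 120 seconds on Fargate). A minimal sketch of a worker that stops accepting jobs and drains in-flight work might look like this. The `DrainingWorker` class, its method names, and the 110-second budget in the comment are assumptions for illustration, not part of any AWS SDK.

```typescript
type Job = () => Promise<void>;

// Hypothetical worker wrapper: tracks in-flight jobs so shutdown can wait for them.
class DrainingWorker {
  private stopping = false;
  private readonly inFlight = new Set<Promise<void>>();

  // Starts a job unless shutdown has begun; returns whether it was accepted.
  submit(job: Job): boolean {
    if (this.stopping) return false;
    const tracked: Promise<void> = job()
      .catch(() => undefined) // a failed job should not block draining
      .then(() => {
        this.inFlight.delete(tracked);
      });
    this.inFlight.add(tracked);
    return true;
  }

  // Waits for in-flight jobs, giving up after budgetMs.
  // Resolves true if everything finished, false if the budget ran out.
  drain(budgetMs: number): Promise<boolean> {
    this.stopping = true;
    return new Promise<boolean>((resolve) => {
      const timer = setTimeout(() => resolve(false), budgetMs);
      Promise.all(Array.from(this.inFlight)).then(() => {
        clearTimeout(timer);
        resolve(true);
      });
    });
  }
}

const worker = new DrainingWorker();
// In a real service the SIGTERM handler would call something like:
//   process.on("SIGTERM", () => worker.drain(110_000).then(() => process.exit(0)));
```

Keeping the drain budget slightly below the stop timeout leaves a little time for the process to exit cleanly before it is killed.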

Right-Sizing: Finding the Sweet Spot

I've noticed that many folks (myself included initially) tend to over-provision Fargate tasks to avoid out-of-memory issues. Here's the approach I've developed for finding appropriate resource allocation:

Start big, measure, then shrink

Bash

# Deploy with generous resources in the task definition, for example:
#   cpu: 1024
#   memory: 2048

# After a week, check actual usage
# (Container Insights service metrics are keyed by cluster and service name)
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name MemoryUtilized \
  --dimensions Name=ClusterName,Value=production Name=ServiceName,Value=my-service \
  --statistics Average Maximum \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-08T00:00:00Z \
  --period 3600

Use the 80% rule: size the task so that observed peak usage lands at about 80% of the allocation, not 100%. The remaining 20% is the buffer that handles spikes.
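
The headroom rule can be turned into a small helper that picks the smallest Fargate size covering observed peaks, reading the rule as "peak usage should sit at about 80% of the allocation". This sketch assumes the commonly documented CPU and memory combinations up to 4 vCPU (larger sizes are omitted) and a hypothetical `recommendSize` function; check the current Fargate documentation for the full list before relying on it.

```typescript
// Inclusive numeric range helper.
function range(from: number, to: number, step: number): number[] {
  const values: number[] = [];
  for (let v = from; v <= to; v += step) values.push(v);
  return values;
}

// Valid Fargate memory values (MiB) per CPU setting, up to 4 vCPU.
// Larger task sizes exist but are left out to keep the sketch short.
const FARGATE_SIZES: Array<{ cpu: number; memory: number[] }> = [
  { cpu: 256, memory: [512, 1024, 2048] },
  { cpu: 512, memory: range(1024, 4096, 1024) },
  { cpu: 1024, memory: range(2048, 8192, 1024) },
  { cpu: 2048, memory: range(4096, 16384, 1024) },
  { cpu: 4096, memory: range(8192, 30720, 1024) },
];

// Pick the smallest size where peak usage stays at or below (1 - headroom) of the allocation.
function recommendSize(
  peakCpuUnits: number,
  peakMemoryMiB: number,
  headroom = 0.2,
): { cpu: number; memory: number } | undefined {
  const neededCpu = peakCpuUnits / (1 - headroom);
  const neededMemory = peakMemoryMiB / (1 - headroom);
  for (const size of FARGATE_SIZES) {
    if (size.cpu < neededCpu) continue;
    const memory = size.memory.find((m) => m >= neededMemory);
    if (memory !== undefined) return { cpu: size.cpu, memory };
  }
  return undefined; // peaks exceed the sizes listed above
}

console.log(recommendSize(200, 700)); // { cpu: 256, memory: 1024 }
```

A peak of 700 MiB needs 875 MiB once headroom is included, so the helper moves up to the next valid memory value rather than stopping at 512.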

Different sizes for different environments:

hcl

locals {
  task_sizes = {
    production = {
      cpu    = "512"
      memory = "1024"
    }
    staging = {
      cpu    = "256"
      memory = "512"
    }
    development = {
      cpu    = "256"
      memory = "512"
    }
  }
}

resource "aws_ecs_task_definition" "app" {
  cpu    = local.task_sizes[var.environment].cpu
  memory = local.task_sizes[var.environment].memory
  # ...
}

ARM + Savings Plans: The Double Discount

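AWS Graviton (ARM) tasks are priced about 20% lower than x86 tasks on Fargate and are often faster. Combine them with a Compute Savings Plan for roughly another 20% off, depending on the term and payment option you commit to: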

Dockerfile

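# Multi-arch Dockerfile
# The builder stage runs on the build host's native platform, which suits
# pure-JavaScript builds; native modules must be built for $TARGETPLATFORM.
FROM --platform=$BUILDPLATFORM node:18-alpine AS builder
ARG TARGETPLATFORM
ARG BUILDPLATFORM
WORKDIR /app
RUN echo "Building on $BUILDPLATFORM for $TARGETPLATFORM"

# Your build steps here...

FROM node:18-alpine
# Without WORKDIR, "node index.js" would look for index.js in / and fail
WORKDIR /app
COPY --from=builder /app /app
CMD ["node", "index.js"]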

Build for both architectures:

Bash

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag myapp:latest \
  --push .

Then in your task definition:

JSON

{
  "runtimePlatform": {
    "cpuArchitecture": "ARM64",
    "operatingSystemFamily": "LINUX"
  }
}

In our experience, moving Node.js services to Graviton resulted in about 35% cost savings. The main compatibility issue we encountered was a legacy Java application with x86-specific JNI libraries, but most of our other workloads transitioned smoothly.
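
Stacked discounts compound multiplicatively rather than adding up, which is worth keeping in mind when estimating the combined effect of Graviton pricing and a Savings Plan. A quick sketch, using the 20% figures mentioned earlier as example inputs rather than guaranteed rates (the `combinedDiscount` helper is illustrative):

```typescript
// Combine independent percentage discounts: each applies to the already-reduced price.
function combinedDiscount(discounts: number[]): number {
  const remaining = discounts.reduce((price, d) => price * (1 - d), 1);
  return 1 - remaining;
}

const total = combinedDiscount([0.2, 0.2]);
console.log(`${(total * 100).toFixed(0)}%`); // prints "36%"
```

Two 20% discounts come to 36% off, not 40%.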

Working With Stateful Containers

The conventional wisdom is that containers should be stateless, and generally that's good advice. However, there are cases where you need persistent state, and EFS can be both helpful and challenging in this context.

EFS: The Good, Bad, and Ugly

Setting up EFS with Fargate:

hcl

resource "aws_efs_file_system" "shared" {
  creation_token = "shared-storage"
  
  performance_mode = "generalPurpose"  # or "maxIO" for more operations
  throughput_mode  = "bursting"        # or "provisioned" for consistent performance
  
  lifecycle_policy {
    transition_to_ia = "AFTER_30_DAYS"  # Save money on cold data
  }
}

resource "aws_ecs_task_definition" "app" {
  # ... other config ...
  
  volume {
    name = "shared-storage"
    
    efs_volume_configuration {
      file_system_id          = aws_efs_file_system.shared.id
      root_directory          = "/"
      transit_encryption      = "ENABLED"
      transit_encryption_port = 2999
      
      authorization_config {
        access_point_id = aws_efs_access_point.app.id
        iam             = "ENABLED"
      }
    }
  }
  
  container_definitions = jsonencode([{
    name = "app"
    # ...
    mountPoints = [{
      sourceVolume  = "shared-storage"
      containerPath = "/data"
    }]
  }])
}

Performance characteristics I've observed:

EFS latency: 5-10ms per operation (compared to 0.1ms for local SSD)
Throughput: Bursts to 100MB/s, sustains around 10MB/s in general purpose mode
Cost: About $0.30/GB/month (vs $0.10 for EBS)

Where I've found EFS useful:

Shared configuration files that multiple containers need to access
User uploads that require cross-container availability
Build caches (though locking can be tricky)
Legacy applications with hard filesystem requirements

Where I'd recommend alternatives:

Database storage (RDS or managed databases work better)
High-frequency write operations
Temporary files (container ephemeral storage is faster)
Caching layers (ElastiCache is more appropriate)

The Session Affinity Pattern

Sometimes you need sticky sessions. Here's how to do it with Fargate:

hcl

resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id
  target_type = "ip"
  
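  stickiness {
    # app_cookie follows a cookie that your application sets itself;
    # use type = "lb_cookie" (and drop cookie_name) to let the ALB generate one
    type            = "app_cookie"
    cookie_duration = 86400
    cookie_name     = "FARGATE_SESSION"
  }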
  
  health_check {
    enabled = true
    path    = "/health"
    matcher = "200"
  }
}

However, I've learned that sticky sessions and auto-scaling don't play well together. When tasks terminate unexpectedly, those sessions disappear. Our approach evolved to:

Store session data in ElastiCache Redis for persistence
Use JWT tokens instead of server-side sessions when possible
Design for graceful session loss during deployments
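
The first point, keeping session data outside the task, can be sketched with a small store interface. The `SessionStore` interface, the in-memory implementation, and the `TaskHandler` class below are stand-ins invented for illustration. In production the same interface would be backed by a Redis client, which is not shown here and would be asynchronous; the synchronous version keeps the sketch short.

```typescript
interface SessionStore {
  get(id: string): string | undefined;
  set(id: string, value: string, ttlSeconds: number): void;
}

// In-memory stand-in for a shared store such as Redis (illustration only).
class InMemorySessionStore implements SessionStore {
  private readonly entries = new Map<string, { value: string; expiresAt: number }>();

  // The clock is injectable so expiry can be tested without waiting.
  constructor(private readonly now: () => number = Date.now) {}

  get(id: string): string | undefined {
    const entry = this.entries.get(id);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(id); // expired sessions are dropped lazily
      return undefined;
    }
    return entry.value;
  }

  set(id: string, value: string, ttlSeconds: number): void {
    this.entries.set(id, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }
}

// Each "task" holds no session state of its own, so any task can serve any request.
class TaskHandler {
  constructor(private readonly store: SessionStore) {}

  login(sessionId: string, user: string): void {
    this.store.set(sessionId, user, 86400);
  }

  whoAmI(sessionId: string): string {
    return this.store.get(sessionId) ?? "anonymous";
  }
}

const shared = new InMemorySessionStore();
const taskA = new TaskHandler(shared);
const taskB = new TaskHandler(shared);
taskA.login("abc", "alice");
console.log(taskB.whoAmI("abc")); // prints "alice"
```

Because task B can answer for a session created on task A, losing task A during a scale-in or deployment no longer logs the user out.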

Monitoring Fargate Workloads

One challenge with Fargate is the reduced visibility into the underlying infrastructure. Here's the monitoring approach I've found most effective:

The Three Pillars of Fargate Observability

CloudWatch Container Insights (The Basics)

Bash

aws ecs put-account-setting \
  --name containerInsights \
  --value enabled

This gives you CPU, memory, network, and disk metrics. It's fine for basics but misses application-level stuff.

X-Ray for Distributed Tracing (The Connections)

JavaScript

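// Add to your Node.js app
// (segments are sent to an X-Ray daemon, which runs as a sidecar on Fargate)
const AWSXRay = require('aws-xray-sdk-core');
const AWS = AWSXRay.captureAWS(require('aws-sdk'));

// Now all AWS SDK calls made through this wrapped SDK are traced
const s3 = new AWS.S3();

async function readFile() {
  // await needs an async function in a CommonJS module
  return s3.getObject({ Bucket: 'my-bucket', Key: 'file.txt' }).promise();
}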

Custom Metrics via StatsD (The Details)

TypeScript

// Run a sidecar container for StatsD
const taskDef = {
  containerDefinitions: [
    {
      name: "app",
      // your app config
    },
    {
      name: "datadog-agent",
      image: "datadog/agent:latest",
      environment: [
        { name: "DD_API_KEY", value: process.env.DD_API_KEY },
        { name: "ECS_FARGATE", value: "true" }
      ]
    }
  ]
};
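
With the agent running as a sidecar, the application emits metrics as small text datagrams. The helper below sketches the DogStatsD line format (`name:value|type|#tags`). The function name and metric names are made up for illustration, and actually sending the line is left out; it would typically go over UDP to `localhost:8125`, since containers in a Fargate task share a network namespace.

```typescript
type MetricType = "c" | "g" | "ms"; // counter, gauge, timer

// Build one DogStatsD-style line, e.g. "orders.processed:1|c|#env:prod".
function formatMetric(
  name: string,
  value: number,
  type: MetricType,
  tags: Record<string, string> = {},
): string {
  const tagList = Object.keys(tags)
    .map((key) => `${key}:${tags[key]}`)
    .join(",");
  const base = `${name}:${value}|${type}`;
  // The tag section is only appended when at least one tag is present
  return tagList ? `${base}|#${tagList}` : base;
}

console.log(formatMetric("orders.processed", 1, "c", { env: "prod" }));
// prints "orders.processed:1|c|#env:prod"
```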

The Debug Pattern That Saves Hours

Can't SSH into Fargate? Use ECS Exec, but make it useful:

Bash

# Add this to your Dockerfile (Alpine package names:
# netstat ships in net-tools, ps in procps)
RUN apk add --no-cache \
    curl \
    net-tools \
    procps \
    htop \
    strace \
    tcpdump

# Enable ECS Exec on the service. The task role also needs the ssmmessages
# permissions, and running tasks only pick the setting up after a new deployment.
aws ecs update-service \
  --cluster production \
  --service my-service \
  --enable-execute-command \
  --force-new-deployment

# Debug like a pro
aws ecs execute-command \
  --cluster production \
  --task abc123 \
  --container app \
  --interactive \
  --command "/bin/sh"

# Inside the container:
> netstat -tulpn  # Check what's listening
> ps aux          # See all processes
> strace -p 1     # Trace system calls (needs the SYS_PTRACE capability)
> tcpdump -i any  # Watch network traffic

Blue-Green Deployments: The Safe Way

Fargate + CodeDeploy = zero-downtime deployments. Here's the setup that's saved us from many bad deploys:

hcl

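# The ECS service itself must use deployment_controller { type = "CODE_DEPLOY" }
resource "aws_codedeploy_deployment_group" "app" {
  app_name               = aws_codedeploy_app.app.name
  deployment_group_name  = "production"
  deployment_config_name = "CodeDeployDefault.ECSLinear10PercentEvery1Minutes"
  service_role_arn       = aws_iam_role.codedeploy.arn # assumed role name

  # ECS deployments must be blue/green with traffic control
  deployment_style {
    deployment_type   = "BLUE_GREEN"
    deployment_option = "WITH_TRAFFIC_CONTROL"
  }

  auto_rollback_configuration {
    enabled = true
    events  = ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"]
  }

  blue_green_deployment_config {
    terminate_blue_instances_on_deployment_success {
      action                           = "TERMINATE"
      termination_wait_time_in_minutes = 5
    }

    deployment_ready_option {
      action_on_timeout = "CONTINUE_DEPLOYMENT"
    }
    # green_fleet_provisioning_option applies to EC2 deployments only, so it is omitted
  }

  ecs_service {
    cluster_name = aws_ecs_cluster.main.name
    service_name = aws_ecs_service.app.name
  }

  # Listener and target group names below are assumed; adjust them to your ALB setup
  load_balancer_info {
    target_group_pair_info {
      prod_traffic_route {
        listener_arns = [aws_lb_listener.app.arn]
      }
      target_group {
        name = aws_lb_target_group.blue.name
      }
      target_group {
        name = aws_lb_target_group.green.name
      }
    }
  }
}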

The killer feature? Automatic rollback on CloudWatch alarms:

TypeScript

const errorRateAlarm = new cloudwatch.Alarm(this, 'ErrorRate', {
  metric: new cloudwatch.Metric({
    namespace: 'MyApp',
    metricName: 'Errors',
    statistic: 'Sum'
  }),
  threshold: 10,
  evaluationPeriods: 2
});

// Link to deployment group
deploymentGroup.addAlarm(errorRateAlarm);

Multi-Region Fargate: Because Disasters Happen

Running Fargate across regions isn't hard, but keeping the regions in sync is.

The gotchas we discovered:

Environment variables per region: Don't hardcode endpoints
S3 bucket names must be unique globally: Add region suffix
Cross-region latency: ~100ms between US and EU
Failover isn't instant: Route 53 health checks take 30-60 seconds
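
The first two gotchas can be handled by deriving region-specific names from the environment instead of hardcoding them. A minimal sketch, assuming the region is available in an `AWS_REGION` environment variable; `API_ENDPOINT`, the `myapp-uploads` bucket prefix, and the example URL are hypothetical names chosen for this example:

```typescript
interface RegionalConfig {
  region: string;
  uploadsBucket: string;
  apiEndpoint: string;
}

// Build per-region settings from environment variables rather than hardcoded values.
// The environment is passed in as a parameter so the function is easy to test.
function loadRegionalConfig(env: Record<string, string | undefined>): RegionalConfig {
  const region = env["AWS_REGION"];
  if (!region) throw new Error("AWS_REGION is not set");
  const apiEndpoint = env["API_ENDPOINT"];
  if (!apiEndpoint) throw new Error("API_ENDPOINT is not set");
  return {
    region,
    // The region suffix gives each regional deployment its own bucket name
    uploadsBucket: `myapp-uploads-${region}`,
    apiEndpoint,
  };
}

const config = loadRegionalConfig({
  AWS_REGION: "eu-west-1",
  API_ENDPOINT: "https://api.eu.example.com",
});
console.log(config.uploadsBucket); // prints "myapp-uploads-eu-west-1"
```

Failing fast on a missing variable makes a misconfigured region show up at task start rather than on the first request.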

The Patterns We Wished We Knew Earlier

The Sidecar Pattern

JSON

{
  "containerDefinitions": [
    {
      "name": "app",
      "dependsOn": [{
        "containerName": "envoy",
        "condition": "HEALTHY"
      }]
    },
    {
      "name": "envoy",
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:9901/ready || exit 1"]
      }
    }
  ]
}

The Init Container Pattern (Sort Of)

Fargate doesn't have true init containers, but you can fake it:

Bash

#!/bin/sh
# Entrypoint script: run initialization, then hand off to the app.
# The shebang has to be the first line of the file.
echo "Running initialization..."
if ! /app/init-db.sh; then
  echo "Init failed, exiting"
  exit 1
fi
echo "Starting main application..."
exec node index.js
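
ECS container dependencies offer another way to approximate an init container: a non-essential container runs to completion, and the main container waits for it with a `SUCCESS` condition. The sketch below writes that task definition fragment as a TypeScript object. The container names, image, and command are placeholders, and the type declarations cover only the fields used here.

```typescript
interface ContainerDependency {
  containerName: string;
  condition: "START" | "COMPLETE" | "SUCCESS" | "HEALTHY";
}

interface ContainerDefinition {
  name: string;
  image: string;
  essential: boolean;
  command?: string[];
  dependsOn?: ContainerDependency[];
}

const containerDefinitions: ContainerDefinition[] = [
  {
    name: "init-db",
    image: "myapp:latest", // placeholder image
    essential: false, // must be non-essential because it is expected to exit
    command: ["/app/init-db.sh"],
  },
  {
    name: "app",
    image: "myapp:latest", // placeholder image
    essential: true,
    // Start only after init-db has exited with code 0
    dependsOn: [{ containerName: "init-db", condition: "SUCCESS" }],
  },
];

// Emit the fragment in the JSON shape a task definition expects
console.log(JSON.stringify({ containerDefinitions }, null, 2));
```

Compared with the entrypoint approach, this keeps initialization logs and exit codes in a separate container, at the cost of a slightly longer task definition.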

The Circuit Breaker Pattern

TypeScript

class FargateCircuitBreaker {
  private failures = 0;
  private lastFailTime = 0;
  private readonly threshold = 5;
  private readonly timeout = 60000; // 1 minute
  
  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isOpen()) {
      throw new Error('Circuit breaker is open');
    }
    
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  
  private isOpen(): boolean {
    return this.failures >= this.threshold && 
           Date.now() - this.lastFailTime < this.timeout;
  }
  
  private onSuccess(): void {
    this.failures = 0;
  }
  
  private onFailure(): void {
    this.failures++;
    this.lastFailTime = Date.now();
  }
}

Remember: Fargate is like a Swiss Army knife. Incredibly useful, but you'll occasionally cut yourself if you're not careful.
