
AWS Fargate 103: Production Lessons That'll Save You Hours

Production incidents from running Fargate at scale. Memory leaks, ENI limits, subnet failures, and debugging techniques that work.

There's something humbling about thinking your Fargate setup is solid, seeing all green dashboards, and then discovering there were blind spots you hadn't considered. Running Fargate workloads at scale reveals challenges that don't show up in tutorials or basic implementations.

In previous parts of our Fargate series (101, 102), we covered the basics and advanced patterns. Here are some production scenarios that taught valuable lessons, along with debugging approaches and solutions that proved effective. Next up in 104, we'll explore Infrastructure-as-Code deployment patterns.

The ENI Limit Discovery

Context: We were preparing for Black Friday traffic on an e-commerce platform, expecting roughly 10x normal load. Auto-scaling was configured, load testing had passed, and our confidence was high.

The Issue: During the evening before Black Friday, our tasks started failing with an error we hadn't seen before:

ResourcesNotReady: The ENI allocation could not be completed

Fargate tasks were stuck in PENDING state. New deployments wouldn't complete, and auto-scaling couldn't provision additional capacity. Most dashboards showed normal operation, but we discovered that available ENIs in our VPC had reached the limit.

What We Learned

Each Fargate task requires its own ENI, and AWS enforces limits per VPC. We had been running around 200 tasks across 3 availability zones, but hadn't accounted for ENI consumption from other services. The limits we discovered:

  • Default ENI limit: 5,000 per VPC
  • Each Fargate task: 1 ENI
  • Each RDS instance: 1 ENI
  • Each Lambda in VPC: Shares ENI pool
  • Each ELB: Multiple ENIs
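Those line items add up faster than expected. A back-of-the-envelope budget check makes the math concrete; the per-service counts below are illustrative assumptions you'd pull from your own inventory, not from any AWS API:

```javascript
// Rough ENI budget check against the default per-VPC quota above.
// All counts are assumptions for illustration; one ENI per Fargate task
// and RDS instance, one per ELB node, plus the VPC Lambda pool.
const ENI_QUOTA_PER_VPC = 5000;

function eniBudget({ fargateTasks = 0, rdsInstances = 0, elbNodes = 0, vpcLambdaEnis = 0 }) {
  const inUse = fargateTasks + rdsInstances + elbNodes + vpcLambdaEnis;
  return {
    inUse,
    headroom: ENI_QUOTA_PER_VPC - inUse,
    utilizationPct: Math.round((inUse / ENI_QUOTA_PER_VPC) * 1000) / 10,
  };
}
```

Running this against our pre-incident numbers would have flagged the problem weeks earlier.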

We checked our limits:

```bash
aws ec2 describe-account-attributes \
  --attribute-names max-instances

# But the real limit is here:
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-0263D0A3  # ENIs per VPC
```

We discovered we were at 4,847 ENIs. Our load testing had focused on application performance but hadn't considered the cumulative ENI consumption across all services in the VPC.

Resolution Approach

Immediate steps:

```bash
# Scale down non-critical services in development
aws ecs update-service \
  --cluster development \
  --service api \
  --desired-count 0

# Request quota increase through AWS Support
aws support create-case \
  --subject "ENI quota increase needed - production capacity planning" \
  --service-code "service-quota-increase"
```

Longer-term improvements:

  1. Multiple VPCs: Split workloads across dev, staging, and prod VPCs
  2. ENI monitoring: CloudWatch custom metric tracking ENI usage
  3. Right-sizing: Reduced over-provisioned tasks
  4. Lambda optimization: Moved Lambdas out of VPC where possible
```typescript
// ENI monitoring Lambda (using AWS SDK v2 for compatibility)
// Note: Consider migrating to SDK v3 for better performance and tree-shaking
import AWS from 'aws-sdk';

export const monitorENIs = async () => {
  const ec2 = new AWS.EC2();
  const cloudWatch = new AWS.CloudWatch();

  // NB: a single call returns at most one page of results;
  // paginate if your VPC holds more than ~1,000 ENIs
  const enis = await ec2.describeNetworkInterfaces().promise();
  const inUse = enis.NetworkInterfaces?.length || 0;

  await cloudWatch.putMetricData({
    Namespace: 'Custom/VPC',
    MetricData: [{
      MetricName: 'ENIsInUse',
      Value: inUse,
      Unit: 'Count'
    }]
  }).promise();
};
```

Lessons learned:

  • Load testing should include all infrastructure components, not just your application
  • ENI limits are per VPC, not per service
  • AWS Support is surprisingly responsive during critical incidents

The Subnet That Went Rogue

The Setup: Multi-AZ Fargate deployment across three private subnets. Everything running smoothly for months.

What Happened: Tuesday morning, 40% of our tasks started showing intermittent connectivity issues. Some HTTP requests succeeded, others timed out after 30 seconds.

The weird part? Only tasks in one specific subnet (us-east-1a) were affected.
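We only spotted the subnet correlation after tagging request errors with the task's availability zone. A minimal sketch of that grouping; the `{ az, failed }` record shape is a hypothetical stand-in for whatever your request logs contain:

```javascript
// Group request outcomes by availability zone so a skewed failure rate
// stands out. The record shape is an assumption for this sketch.
function errorsByZone(requests) {
  const byZone = {};
  for (const r of requests) {
    const zone = (byZone[r.az] = byZone[r.az] || { total: 0, failed: 0 });
    zone.total++;
    if (r.failed) zone.failed++;
  }
  return byZone;
}
```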

The Investigation Journey

First, the obvious checks:

```bash
# Check task health
aws ecs list-tasks --cluster production --service-name api
aws ecs describe-tasks --cluster production --tasks task-abc123

# Check network interfaces
aws ec2 describe-network-interfaces \
  --filters "Name=subnet-id,Values=subnet-12345" \
  --query 'NetworkInterfaces[*].[NetworkInterfaceId,Status,PrivateIpAddress]'
```

Tasks looked healthy. Network interfaces were attached and active. But something was wrong.

The breakthrough: We enabled VPC Flow Logs and found the smoking gun:

```bash
# Enable VPC Flow Logs for the problem subnet
aws ec2 create-flow-logs \
  --resource-type Subnet \
  --resource-ids subnet-12345 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name /aws/vpc/flowlogs
```

Flow logs showed that packets were leaving our subnet but never reaching their destination. The return packets were getting dropped somewhere.
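That "leaving but never coming back" pattern can be confirmed mechanically instead of by eyeballing log lines. A rough sketch of the check, assuming the default flow log record format (version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log-status):

```javascript
// Group flow log records by 5-tuple and flag flows that only ever
// appear in one direction, i.e. traffic with no return packets.
function findOneWayFlows(records) {
  const seen = new Set();
  for (const line of records) {
    const f = line.trim().split(/\s+/);
    // f[3]=srcaddr, f[4]=dstaddr, f[5]=srcport, f[6]=dstport
    seen.add(`${f[3]}:${f[5]}->${f[4]}:${f[6]}`);
  }
  const oneWay = [];
  for (const flow of seen) {
    const [left, right] = flow.split('->');
    if (!seen.has(`${right}->${left}`)) oneWay.push(flow); // no reverse flow seen
  }
  return oneWay;
}
```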

The Culprit

Turns out, our network team had modified the route table for that subnet earlier that morning. They changed the NAT gateway route from 0.0.0.0/0 → nat-gateway-123 to 0.0.0.0/0 → nat-gateway-456 without realizing Fargate tasks were running there.

The new NAT gateway was in a different AZ, and its subnet sat behind different network ACL rules. Classic.

The fix:

```bash
# Check which route table is associated with the subnet
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-12345"

# Verify the routes
aws ec2 describe-route-tables --route-table-ids rtb-abc123 \
  --query 'RouteTables[*].Routes[*].[DestinationCidrBlock,GatewayId,State]'

# Fix the route (revert to original NAT gateway)
aws ec2 replace-route \
  --route-table-id rtb-abc123 \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id nat-gateway-123
```

Lessons learned:

  • Always test routing changes in non-production first
  • VPC Flow Logs are your friend for network debugging
  • Document which route tables serve which services
  • Set up monitoring for routing table changes

The Memory Leak Mystery (No SSH Edition)

The Setup: Node.js API running on Fargate, memory limit set to 2GB per task. Worked fine for weeks.

What Happened: Memory usage slowly climbing over 3-4 hours, then tasks getting OOM killed. Memory usage graphs looked like ski slopes.

But here's the kicker: no way to SSH into the container to debug.

The Investigation Arsenal

Since we can't SSH, we need to get creative:

1. ECS Exec (our savior):

```bash
# First, enable it on the service
# (only tasks launched after this change get the exec agent,
# so you may need to force a new deployment)
aws ecs update-service \
  --cluster production \
  --service api \
  --enable-execute-command

# Then connect to a running task
aws ecs execute-command \
  --cluster production \
  --task task-abc123 \
  --container api \
  --interactive \
  --command "/bin/bash"

# Inside the container, check memory usage
> ps aux --sort=-%mem | head -20
> cat /proc/meminfo
> pmap -x 1  # Memory map of PID 1
```

2. Application-level monitoring:

```javascript
// Add to your Node.js app
const express = require('express');
const app = express();

// Memory monitoring endpoint
app.get('/debug/memory', (req, res) => {
  const used = process.memoryUsage();
  const stats = {
    rss: Math.round(used.rss / 1024 / 1024 * 100) / 100,      // MB
    heapTotal: Math.round(used.heapTotal / 1024 / 1024 * 100) / 100,
    heapUsed: Math.round(used.heapUsed / 1024 / 1024 * 100) / 100,
    external: Math.round(used.external / 1024 / 1024 * 100) / 100,
    arrayBuffers: Math.round(used.arrayBuffers / 1024 / 1024 * 100) / 100
  };

  res.json(stats);
});

// Heap snapshot endpoint (for extreme debugging)
app.get('/debug/heapdump', (req, res) => {
  const heapdump = require('heapdump');
  const filename = `/tmp/heapdump-${Date.now()}.heapsnapshot`;

  heapdump.writeSnapshot(filename, (err, filename) => {
    if (err) {
      res.status(500).send(err.message);
    } else {
      res.download(filename);
    }
  });
});
```

3. The detective work:

We used ECS Exec to install debugging tools and found that our HTTP client wasn't properly closing connections:

```bash
# Inside the container
> npm install -g clinic
> clinic doctor --on-port 8080 -- node index.js &
> curl http://localhost:8080/debug/memory

# Check open file descriptors
> ls -la /proc/1/fd | wc -l
> lsof -p 1 | grep TCP
```

Bingo! Thousands of TCP connections in CLOSE_WAIT state.
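Counting socket states like this is easy to script once you capture the netstat/lsof output, which is handy when you want the count exposed on a debug endpoint rather than typed by hand during an incident. A sketch, assuming netstat-style text with one socket per line:

```javascript
// Tally TCP socket states from netstat-style output. A growing CLOSE_WAIT
// count is the classic signature of connections the app never closes.
function countSocketStates(netstatOutput) {
  const counts = {};
  for (const line of netstatOutput.split('\n')) {
    const m = line.match(/\b(ESTABLISHED|CLOSE_WAIT|TIME_WAIT|FIN_WAIT\d?|LISTEN)\b/);
    if (m) counts[m[1]] = (counts[m[1]] || 0) + 1;
  }
  return counts;
}
```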

The Root Cause

Our Node.js HTTP client code looked innocent enough:

```javascript
// The problematic code
const axios = require('axios');

async function callExternalAPI() {
  const response = await axios.get('https://api.example.com/data');
  return response.data;
}
```

But we weren't configuring connection pooling or timeouts properly. Each request created new connections that weren't being cleaned up.

The fix:

```javascript
// Fixed version with proper configuration
const axios = require('axios');
const https = require('https');
const http = require('http');

// Configure connection pooling
const httpAgent = new http.Agent({
  keepAlive: true,
  maxSockets: 50,
  timeout: 5000,
});

const httpsAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 50,
  timeout: 5000,
});

const axiosInstance = axios.create({
  httpAgent,
  httpsAgent,
  timeout: 10000, // 10 seconds
});

// Graceful shutdown
process.on('SIGTERM', () => {
  httpAgent.destroy();
  httpsAgent.destroy();
});

async function callExternalAPI() {
  const response = await axiosInstance.get('https://api.example.com/data');
  return response.data;
}
```

Lessons learned:

  • ECS Exec is invaluable for containerized debugging
  • Always configure HTTP clients properly in production
  • Monitor file descriptors, not just memory
  • Connection pools matter, even for "simple" HTTP clients

The 30-Second Connection Timeout Phantom

The Setup: Internal service-to-service communication between two Fargate services. Worked fine 99% of the time.

What Happened: Randomly, about 1% of requests would hang for exactly 30 seconds, then fail with a connection timeout.

The pattern was completely random. No correlation with load, time of day, or deployment history.

The Debugging Odyssey

Network layer investigation:

```bash
# VPC Flow Logs analysis
aws logs filter-log-events \
  --log-group-name /aws/vpc/flowlogs \
  --start-time 1645564800000 \
  --filter-pattern "REJECT"

# Security group rules audit
aws ec2 describe-security-groups \
  --group-ids sg-12345 \
  --query 'SecurityGroups[*].{GroupId:GroupId,IpPermissions:IpPermissions}'
```

Security groups looked fine. Flow logs showed packets flowing normally.

Application layer investigation:

```javascript
// Added detailed connection tracking
const net = require('net');
const original_connect = net.Socket.prototype.connect;

net.Socket.prototype.connect = function(...args) {
  const startTime = Date.now();
  console.log(`[${new Date().toISOString()}] Starting connection to ${args[0]?.host || args[0]?.path}`);

  const result = original_connect.apply(this, args);

  this.on('connect', () => {
    const duration = Date.now() - startTime;
    console.log(`[${new Date().toISOString()}] Connected after ${duration}ms`);
  });

  this.on('error', (err) => {
    const duration = Date.now() - startTime;
    console.log(`[${new Date().toISOString()}] Connection error after ${duration}ms:`, err.message);
  });

  return result;
};
```

The Breakthrough

The logs showed something interesting: successful connections were taking 2-5ms, but the hanging ones were taking exactly 30,000ms. That's not random - that's a timeout.
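You can make the "exactly 30,000ms" observation rigorous: random network delay spreads out across a range, while a timeout spikes at one value. A small sketch of the check; the tolerance window is an arbitrary assumption to tune:

```javascript
// Share of latency samples that cluster within a tolerance of a suspected
// timeout value. A high share at one value points to a timeout, not jitter.
function timeoutClusterShare(latenciesMs, timeoutMs, toleranceMs = 100) {
  const clustered = latenciesMs.filter(l => Math.abs(l - timeoutMs) <= toleranceMs);
  return clustered.length / latenciesMs.length;
}
```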

Then we noticed the pattern: it only happened when both services were in the same availability zone and the connection was going through the load balancer.

The issue: AWS ALB has a known quirk where connections from the same AZ can occasionally loop back through the load balancer infrastructure, causing delays.

The fix (multiple strategies):

  1. Direct service communication for same-AZ:
```javascript
// Service discovery with AZ awareness
const AWS = require('aws-sdk');
const axios = require('axios');
const ecs = new AWS.ECS();

async function getServiceEndpoints() {
  const tasks = await ecs.listTasks({
    cluster: 'production',
    serviceName: 'target-service'
  }).promise();

  const taskDetails = await ecs.describeTasks({
    cluster: 'production',
    tasks: tasks.taskArns
  }).promise();

  return taskDetails.tasks.map(task => ({
    ip: task.attachments[0].details.find(d => d.name === 'privateIPv4Address').value,
    az: task.availabilityZone,
    port: 8080
  }));
}

// Smart routing
async function callService(endpoint, data) {
  const currentAZ = process.env.AWS_AVAILABILITY_ZONE;
  const endpoints = await getServiceEndpoints();

  // Try same-AZ direct connection first
  const sameAZEndpoint = endpoints.find(e => e.az === currentAZ);
  if (sameAZEndpoint) {
    try {
      return await axios.post(`http://${sameAZEndpoint.ip}:${sameAZEndpoint.port}${endpoint}`, data);
    } catch (error) {
      // Fall back to load balancer
      return await axios.post(`https://internal-service.example.com${endpoint}`, data);
    }
  }

  // Use load balancer for cross-AZ
  return await axios.post(`https://internal-service.example.com${endpoint}`, data);
}
```
  2. Connection timeout tuning:
```javascript
const axios = require('axios');
const https = require('https');

const axiosInstance = axios.create({
  timeout: 5000,  // Fail fast instead of waiting 30s
  httpsAgent: new https.Agent({
    timeout: 2000,  // Connection timeout
    keepAlive: true,
  })
});
```

Lessons learned:

  • ALBs can introduce unexpected latency for same-AZ communication
  • Service discovery enables direct communication patterns
  • Always implement connection timeouts shorter than your SLA
  • Load balancers aren't always the fastest path

The Deployment That Wouldn't Deploy

The Setup: Standard blue-green deployment using CodeDeploy. Worked hundreds of times before.

What Happened: New deployment stuck at 50% for hours. Half the tasks were running the new version, half the old. CodeDeploy dashboard showed "In Progress" with no error messages.

Auto-rollback wasn't triggering because technically, nothing was "failing."
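Since nothing was technically failing, we later added our own wall-clock guard: a deployment that still reports InProgress past a hard time budget gets treated as stuck and alerted on. The 30-minute threshold below is an assumption; tune it to your slowest healthy deployment:

```javascript
// Flag a deployment as stuck once it has been InProgress longer than a
// wall-clock budget. CodeDeploy won't roll back on slowness alone.
function isDeploymentStuck({ status, startTimeMs, nowMs, maxDurationMs = 30 * 60 * 1000 }) {
  return status === 'InProgress' && (nowMs - startTimeMs) > maxDurationMs;
}
```

Run this on a schedule against `aws deploy get-deployment` output and page a human when it returns true.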

The Investigation

CodeDeploy logs were unhelpful:

```bash
aws deploy get-deployment --deployment-id d-XXXXXXXXX
# Status: InProgress, no error information

aws logs filter-log-events \
  --log-group-name /aws/codedeploy-agent \
  --start-time $(date -d '1 hour ago' +%s)000
```

ECS service events revealed the issue:

```bash
aws ecs describe-services \
  --cluster production \
  --services api \
  --query 'services[0].events[0:10]'
```

The events showed:

"(service api) failed to launch a task with (error ECS was unable to assume role...)"

The Root Cause

Our task execution role had been modified by another team for an unrelated service, and they accidentally removed the trust relationship that allows ECS to assume the role.

The role policy looked like this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"  // WRONG!
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

It should have been:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"  // CORRECT
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

The fix:

```bash
# Check the role trust policy
aws iam get-role --role-name fargate-task-execution-role \
  --query 'Role.AssumeRolePolicyDocument'

# Update it
aws iam update-assume-role-policy \
  --role-name fargate-task-execution-role \
  --policy-document file://trust-policy.json
```

Prevention strategy:

```typescript
// Automated role validation
import AWS from 'aws-sdk';

export const validateTaskRoles = async () => {
  const iam = new AWS.IAM();

  const role = await iam.getRole({
    RoleName: 'fargate-task-execution-role'
  }).promise();

  // IAM returns the trust policy URL-encoded
  const trustPolicy = JSON.parse(decodeURIComponent(role.Role.AssumeRolePolicyDocument));
  const ecsService = trustPolicy.Statement.some(statement =>
    statement.Principal?.Service === 'ecs-tasks.amazonaws.com'
  );

  if (!ecsService) {
    // sendAlert is your own notification helper (SNS, Slack, etc.)
    await sendAlert('Task execution role missing ECS trust relationship!');
    return false;
  }

  return true;
};
```

Lessons learned:

  • ECS service events are more detailed than CodeDeploy logs
  • Role trust policies are fragile and need monitoring
  • Blue-green deployments can get stuck in limbo
  • Always check IAM when things mysteriously stop working

The Debug Toolbox That Works

After all these incidents, here's the debugging toolkit that's saved us countless hours:

1. The Ultimate Fargate Debug Container

```dockerfile
FROM node:18-alpine

RUN apk add --no-cache \
    curl \
    wget \
    netcat-openbsd \
    bind-tools \
    tcpdump \
    strace \
    htop \
    iotop \
    lsof \
    procps \
    net-tools

# Add your app
COPY . /app
WORKDIR /app

# Debug endpoints
RUN npm install express heapdump clinic
```

2. Monitoring Stack

```typescript
// Health check endpoint with detailed diagnostics
app.get('/health/detailed', async (req, res) => {
  const health = {
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    memory: process.memoryUsage(),
    cpu: process.cpuUsage(),
    connections: {
      active: await getActiveConnections(),
      waiting: await getWaitingConnections()
    },
    environment: {
      nodeVersion: process.version,
      availabilityZone: process.env.AWS_AVAILABILITY_ZONE || 'unknown',
      region: process.env.AWS_REGION || 'unknown'
    }
  };

  res.json(health);
});

async function getActiveConnections() {
  return new Promise((resolve) => {
    require('child_process').exec('netstat -an | grep ESTABLISHED | wc -l',
      (error, stdout) => {
        resolve(parseInt(stdout.trim()) || 0);
      }
    );
  });
}
```

3. Automated Incident Response

```yaml
# CloudWatch alarms that help
ENIUtilizationAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: High-ENI-Utilization
    MetricName: ENIsInUse
    Namespace: Custom/VPC
    Statistic: Maximum
    Period: 300
    EvaluationPeriods: 2
    Threshold: 4500  # 90% of 5000 limit
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref SNSTopic

MemoryUtilizationAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: Fargate-Memory-High
    MetricName: MemoryUtilized
    Namespace: ECS/ContainerInsights
    Statistic: Average
    Period: 300
    EvaluationPeriods: 3
    Threshold: 80  # 80% memory usage
    ComparisonOperator: GreaterThanThreshold
```

The Universal Laws of Fargate Debugging

After all these adventures, here are the patterns that hold true:

  1. When tasks won't start: Check ENI limits, security groups, and IAM roles (in that order)

  2. When tasks are slow: It's usually the network (route tables, NAT gateways, DNS)

  3. When memory keeps climbing: It's always connection pooling or event listeners

  4. When deployments hang: Check service events, not deployment logs

  5. When 1% of requests fail: Look for load balancer quirks or cross-AZ issues

  6. When nothing makes sense: Enable VPC Flow Logs and ECS Exec

The lesson? Fargate removes a lot of infrastructure complexity, but when things go wrong, you need to understand the underlying AWS networking and compute primitives. The abstraction is leaky, and production always finds the leaks.

Keep these debugging techniques handy. Trust me, you'll need them during critical incidents, and when that happens, you'll be grateful for every monitoring endpoint and diagnostic tool you set up beforehand.

Next time someone tells you serverless containers are "set it and forget it," show them this series. Production has other plans, but now you're ready for them.

AWS Fargate Deep Dive Series

Complete guide to AWS Fargate from basics to production. Learn serverless containers, cost optimization, debugging techniques, and Infrastructure-as-Code deployment patterns through real-world experience.
