AWS Fargate 103: Production Lessons That'll Save You Hours
Production incidents from running Fargate at scale. Memory leaks, ENI limits, subnet failures, and debugging techniques that work.
There's something humbling about thinking your Fargate setup is solid, seeing all-green dashboards, and then discovering blind spots you hadn't considered. Running Fargate workloads at scale reveals challenges that don't show up in tutorials or basic implementations.
In previous parts of our Fargate series (101, 102), we covered the basics and advanced patterns. Here are some production scenarios that taught valuable lessons, along with debugging approaches and solutions that proved effective. Next up in 104, we'll explore Infrastructure-as-Code deployment patterns.
The ENI Limit Discovery
Context: We were preparing for Black Friday traffic on an e-commerce platform, expecting roughly 10x normal load. Auto-scaling was configured, load testing had passed, and our confidence was high.
The Issue: On the evening before Black Friday, our tasks started failing in a way we hadn't seen before:
Fargate tasks were stuck in PENDING state. New deployments wouldn't complete, and auto-scaling couldn't provision additional capacity. Most dashboards showed normal operation, but we eventually discovered that ENI usage in our account had hit the Region-wide quota.
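When tasks hang in PENDING, the fastest signals are the service's own event stream and the stopped-task reasons. A sketch (cluster and service names here are hypothetical):

```shell
# Service events usually name the real blocker (ENIs, capacity, IAM)
aws ecs describe-services --cluster prod --services checkout \
  --query 'services[0].events[0:5].message' --output text

# For tasks that already died, stoppedReason is the first thing to read
aws ecs describe-tasks --cluster prod \
  --tasks "$(aws ecs list-tasks --cluster prod --desired-status STOPPED \
             --query 'taskArns[0]' --output text)" \
  --query 'tasks[0].stoppedReason' --output text
```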
What We Learned
Each Fargate task requires its own ENI, and AWS enforces an account-wide quota on ENIs per Region. We had been running around 200 tasks across 3 availability zones, but hadn't accounted for ENI consumption from other services. The limits we discovered:
- Default ENI quota: 5,000 per Region (adjustable via Service Quotas)
- Each Fargate task: 1 ENI
- Each RDS instance: 1 ENI
- Each Lambda in VPC: Shares ENI pool
- Each ELB: Multiple ENIs (at least one per enabled subnet, more as it scales)
We checked our limits:
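A rough way to compare current usage against the quota from the CLI (the exact quota name may differ in your account; verify in the Service Quotas console):

```shell
# How many ENIs exist in this Region right now?
aws ec2 describe-network-interfaces \
  --query 'length(NetworkInterfaces)' --output text

# What is the account's ENI quota here?
aws service-quotas list-service-quotas --service-code vpc \
  --query "Quotas[?contains(QuotaName, 'Network interfaces')].Value" \
  --output text
```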
We discovered we were at 4,847 ENIs. Our load testing had focused on application performance but hadn't considered the cumulative ENI consumption across all services in the VPC.
Resolution Approach
Immediate steps: we opened a critical-severity case with AWS Support to raise the ENI quota, and ran a cleanup pass to delete unattached ENIs left behind by stopped tasks and old load balancers.
Longer-term improvements:
- Multiple VPCs: Split workloads across dev, staging, and prod VPCs
- ENI monitoring: CloudWatch custom metric tracking ENI usage
- Right-sizing: Reduced over-provisioned tasks
- Lambda optimization: Moved Lambdas out of VPC where possible
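The ENI-monitoring metric can be a small scheduled script. The namespace and metric name below are our own convention, not anything AWS defines:

```shell
# Publish current ENI usage so we can alarm well before the quota
COUNT=$(aws ec2 describe-network-interfaces \
  --query 'length(NetworkInterfaces)' --output text)

aws cloudwatch put-metric-data \
  --namespace "Custom/VPC" \
  --metric-name "ENIsInUse" \
  --value "$COUNT" \
  --unit Count
```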
Lessons learned:
- Load testing should include all infrastructure components, not just your application
- ENI quotas apply account-wide per Region, not per service
- AWS Support is surprisingly responsive during critical incidents
The Subnet That Went Rogue
The Setup: Multi-AZ Fargate deployment across three private subnets. Everything running smoothly for months.
What Happened: Tuesday morning, 40% of our tasks started showing intermittent connectivity issues. Some HTTP requests succeeded, others timed out after 30 seconds.
The weird part? Only tasks in one specific subnet (us-east-1a) were affected.
The Investigation Journey
First, the obvious checks:
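Roughly what we ran first (IDs and names are hypothetical):

```shell
# Are the tasks themselves healthy?
aws ecs describe-tasks --cluster prod --tasks "$TASK_ARN" \
  --query 'tasks[].{status:lastStatus,health:healthStatus}'

# Are their ENIs attached and in use in the suspect subnet?
aws ec2 describe-network-interfaces \
  --filters Name=subnet-id,Values=subnet-aaa111 \
  --query 'NetworkInterfaces[].{id:NetworkInterfaceId,status:Status}'
```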
Tasks looked healthy. Network interfaces were attached and active. But something was wrong.
The breakthrough: We enabled VPC Flow Logs and found the smoking gun:
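If flow logs aren't already on, enabling them for a single subnet is one command. The log group and IAM role below are assumptions; the role must allow CloudWatch Logs delivery:

```shell
# Turn on flow logs for just the suspect subnet, delivered to CloudWatch Logs
aws ec2 create-flow-logs \
  --resource-type Subnet --resource-ids subnet-aaa111 \
  --traffic-type ALL \
  --log-group-name /vpc/flow-logs \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role
```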
Flow logs showed that packets were leaving our subnet but never reaching their destination. The return packets were getting dropped somewhere.
The Culprit
Turns out, our network team had modified the route table for that subnet earlier that morning. They changed the NAT gateway route from 0.0.0.0/0 → nat-gateway-123 to 0.0.0.0/0 → nat-gateway-456 without realizing Fargate tasks were running there.
The new NAT gateway was in a different AZ and had different security group rules. Classic.
The fix:
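Reverting the default route is a single `replace-route` call (the route table and NAT gateway IDs below are placeholders matching the story above):

```shell
# Point the default route back at the original NAT gateway
aws ec2 replace-route \
  --route-table-id rtb-0abc123 \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id nat-gateway-123
```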
Lessons learned:
- Always test routing changes in non-production first
- VPC Flow Logs are your friend for network debugging
- Document which route tables serve which services
- Set up monitoring for routing table changes
The Memory Leak Mystery (No SSH Edition)
The Setup: Node.js API running on Fargate, memory limit set to 2GB per task. Worked fine for weeks.
What Happened: Memory usage climbed slowly over 3-4 hours, then tasks got OOM-killed. The memory graphs looked like a sawtooth: a slow climb, a cliff, then the climb starting over.
But here's the kicker: no way to SSH into the container to debug.
The Investigation Arsenal
Since we couldn't SSH in, we had to get creative:
1. ECS Exec (our savior):
2. Application-level monitoring: an internal debug endpoint exposing process.memoryUsage() and connection counts
3. The detective work:
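A sketch of the ECS Exec workflow (cluster, service, and container names are hypothetical; the task role also needs the SSM Session Manager permissions):

```shell
# One-time: enable ECS Exec on the service
aws ecs update-service --cluster prod --service api \
  --enable-execute-command --force-new-deployment

# Then open a shell inside a running task
aws ecs execute-command --cluster prod --task "$TASK_ID" \
  --container app --interactive --command "/bin/sh"
```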
We used ECS Exec to install debugging tools and found that our HTTP client wasn't properly closing connections:
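A sketch of the check that cracked it, run from the ECS Exec shell:

```shell
# Count sockets the app is holding in CLOSE_WAIT (state is column 6 of netstat -ant)
netstat -ant | awk '$6 == "CLOSE_WAIT"' | wc -l

# Open file descriptors tell the same story (PID 1 is usually the app in a container)
ls /proc/1/fd | wc -l
```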
Bingo! Thousands of TCP connections in CLOSE_WAIT state.
The Root Cause
Our Node.js HTTP client code looked innocent enough:
But we weren't configuring connection pooling or timeouts properly. Each request created new connections that weren't being cleaned up.
The fix:
Lessons learned:
- ECS Exec is invaluable for containerized debugging
- Always configure HTTP clients properly in production
- Monitor file descriptors, not just memory
- Connection pools matter, even for "simple" HTTP clients
The 30-Second Connection Timeout Phantom
The Setup: Internal service-to-service communication between two Fargate services. Worked fine 99% of the time.
What Happened: Randomly, about 1% of requests would hang for exactly 30 seconds, then fail with a connection timeout.
The pattern was completely random. No correlation with load, time of day, or deployment history.
The Debugging Odyssey
Network layer investigation:
Security groups looked fine. Flow logs showed packets flowing normally.
Application layer investigation:
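One trick that helps here is asking curl to break a request into phases, so you can see exactly where the 30 seconds goes (the hostname is hypothetical):

```shell
# A hang at connect vs. first byte points to very different layers
curl -s -o /dev/null \
  -w 'dns=%{time_namelookup}s connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  http://orders.internal.local/health
```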
The Breakthrough
The logs showed something interesting: successful connections were taking 2-5ms, but the hanging ones were taking exactly 30,000ms. That's not random - that's a timeout.
Then we noticed the pattern: it only happened when both services were in the same availability zone and the connection was going through the load balancer.
The issue: AWS ALB has a known quirk where connections from the same AZ can occasionally loop back through the load balancer infrastructure, causing delays.
The fix (multiple strategies):
- Direct service communication for same-AZ:
- Connection timeout tuning:
Lessons learned:
- ALBs can introduce unexpected latency for same-AZ communication
- Service discovery enables direct communication patterns
- Always implement connection timeouts shorter than your SLA
- Load balancers aren't always the fastest path
The Deployment That Wouldn't Deploy
The Setup: Standard blue-green deployment using CodeDeploy. Worked hundreds of times before.
What Happened: New deployment stuck at 50% for hours. Half the tasks were running the new version, half the old. CodeDeploy dashboard showed "In Progress" with no error messages.
Auto-rollback wasn't triggering because technically, nothing was "failing."
The Investigation
CodeDeploy logs were unhelpful: just "In Progress" status messages with no error detail.
ECS service events revealed the issue:
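Pulling the recent service events from the CLI (cluster and service names are hypothetical):

```shell
# The service event stream holds the errors CodeDeploy never surfaces
aws ecs describe-services --cluster prod --services api \
  --query 'services[0].events[0:10].message' --output text
```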
The events showed ECS repeatedly failing to assume the task execution role.
The Root Cause
Our task execution role had been modified by another team for an unrelated service, and they accidentally removed the trust relationship that allows ECS to assume the role.
The role's trust policy looked like this:
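We can't reproduce the exact document, but the effect was that the ECS service principal had been replaced by the other team's service, something like:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```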
It should have been:
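The standard trust policy for ECS task and execution roles allows the `ecs-tasks.amazonaws.com` service principal to assume the role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ecs-tasks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```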
The fix:
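Restoring the trust policy is one IAM call (role and file names are placeholders):

```shell
# Restore the trust relationship on the task execution role
aws iam update-assume-role-policy \
  --role-name prod-task-execution-role \
  --policy-document file://ecs-trust-policy.json
```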
Prevention strategy:
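One guardrail we can sketch: an EventBridge rule that fires whenever someone edits a trust policy. CloudTrail must be delivering management events, and the rule name is our own; you'd attach an SNS or chat-alert target to it:

```shell
# Alert on trust-policy edits before they strand a deployment
aws events put-rule --name iam-trust-policy-changes \
  --event-pattern '{
    "source": ["aws.iam"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": { "eventName": ["UpdateAssumeRolePolicy"] }
  }'
```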
Lessons learned:
- ECS service events are more detailed than CodeDeploy logs
- Role trust policies are fragile and need monitoring
- Blue-green deployments can get stuck in limbo
- Always check IAM when things mysteriously stop working
The Debug Toolbox That Works
After all these incidents, here's the debugging toolkit that's saved us countless hours:
1. The Ultimate Fargate Debug Container
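A sketch of what such an image might contain; the package choices are assumptions, so adapt them to your stack:

```dockerfile
# Debug sidecar: network and process tooling in one small image
FROM alpine:3.19
RUN apk add --no-cache \
    curl \
    bind-tools \
    netcat-openbsd \
    tcpdump \
    jq \
    procps \
    aws-cli
# Keep the container alive so it can be attached to via ECS Exec
ENTRYPOINT ["sleep", "infinity"]
```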
2. Monitoring Stack
3. Automated Incident Response
The Universal Laws of Fargate Debugging
After all these adventures, here are the patterns that hold true:
- When tasks won't start: Check ENI limits, security groups, and IAM roles (in that order)
- When tasks are slow: It's usually the network (route tables, NAT gateways, DNS)
- When memory keeps climbing: It's always connection pooling or event listeners
- When deployments hang: Check service events, not deployment logs
- When 1% of requests fail: Look for load balancer quirks or cross-AZ issues
- When nothing makes sense: Enable VPC Flow Logs and ECS Exec
The lesson? Fargate removes a lot of infrastructure complexity, but when things go wrong, you need to understand the underlying AWS networking and compute primitives. The abstraction is leaky, and production always finds the leaks.
Keep these debugging techniques handy. Trust me, you'll need them during critical incidents, and when that happens, you'll be grateful for every monitoring endpoint and diagnostic tool you set up beforehand.
Next time someone tells you serverless containers are "set it and forget it," show them this series. Production has other plans, but now you're ready for them.