Building AWS Serverless with TypeScript: Lessons from 3 Years of Lambda at Scale

Why I moved from Express.js to Lambda, the costly mistakes I made along the way, and the TypeScript patterns that saved my team thousands in AWS bills.

Three years ago, I was running a traditional Express.js API on EC2 instances. Fixed costs, predictable scaling, 99.9% uptime. Life was good. Then our biggest client asked for a feature that needed to process 50,000 webhooks in under 10 minutes, once per month.

Keeping EC2 instances running 24/7 for a 10-minute monthly spike felt wasteful. That's when I dove headfirst into AWS Lambda. Here's what I learned from building 30+ production Lambda functions, making every serverless mistake possible, and spending way too much on AWS bills.

Why I Finally Embraced Serverless (After Years of Resistance)

I used to be that guy who called serverless "vendor lock-in with extra steps." Coming from a background of managing Kubernetes clusters and fine-tuning JVM garbage collectors, Lambda felt like giving up control. But three incidents changed my mind:

The Midnight Scaling Disaster (June 2022)

Our Express API got featured on Hacker News at 2 AM. Traffic went from 100 req/min to 5,000 req/min. Our auto-scaling group took 8 minutes to spin up new instances. By then, we'd lost $3,000 in failed payment processing and our Redis cache was toast.

Lambda would have scaled instantly. That's when I started paying attention.

The Webhook Processing Hell (August 2022)

A client needed to process Stripe webhooks that could arrive in bursts of 10,000+ events. With EC2, we had two bad options:

  1. Over-provision for peak load (expensive)
  2. Use queues and risk webhook timeouts (unreliable)

Lambda's automatic concurrency scaling solved this elegantly. Each webhook got its own function instance. No queues, no timeouts, no over-provisioning.

The $800 Idle Cost Revelation (October 2022)

I calculated our actual compute utilization. Our "high-performance" API servers were idle 87% of the time, but we paid for 100%. That's $800/month for spinning fans and unused CPU cycles.

Lambda's pay-per-millisecond model suddenly looked brilliant.

The Stack That Actually Works in Production

After burning through multiple approaches, here's what we settled on:

TypeScript
// Our production CDK stack - refined through pain
import { Stack, StackProps, Duration, RemovalPolicy } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { RestApi, LambdaIntegration, Cors, MethodLoggingLevel } from 'aws-cdk-lib/aws-apigateway';
import { Table, AttributeType, BillingMode } from 'aws-cdk-lib/aws-dynamodb';
import { Runtime, Tracing } from 'aws-cdk-lib/aws-lambda';

export class ProductionServerlessStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // DynamoDB table - learned to use single-table design the hard way
    const dataTable = new Table(this, 'DataTable', {
      partitionKey: { name: 'PK', type: AttributeType.STRING },
      sortKey: { name: 'SK', type: AttributeType.STRING },
      billingMode: BillingMode.PAY_PER_REQUEST,  // On-demand pricing saved us during spikes
      // Point-in-time recovery saved us from a junior dev's DELETE mistake
      pointInTimeRecovery: true,
      removalPolicy: RemovalPolicy.RETAIN,  // Never accidentally delete prod data
    });

    // Add GSI for querying by different access patterns
    dataTable.addGlobalSecondaryIndex({
      indexName: 'GSI1',
      partitionKey: { name: 'GSI1PK', type: AttributeType.STRING },
      sortKey: { name: 'GSI1SK', type: AttributeType.STRING },
    });

    // Lambda function with production-ready settings
    const apiHandler = new NodejsFunction(this, 'ApiHandler', {
      entry: 'src/handlers/api.ts',
      runtime: Runtime.NODEJS_20_X,
      // Memory sizing based on actual profiling, not guesses
      memorySize: 1024,  // Sweet spot for our JSON processing workload
      timeout: Duration.seconds(28),  // Just under API Gateway's 29s limit
      environment: {
        TABLE_NAME: dataTable.tableName,
        NODE_ENV: 'production',
        // Enable connection reuse for DynamoDB
        AWS_NODEJS_CONNECTION_REUSE_ENABLED: '1',
        // Custom env vars
        LOG_LEVEL: 'info',
        ENABLE_X_RAY: 'true',
      },
      bundling: {
        minify: true,
        target: 'node20',
        // Exclude aws-sdk from bundle - Lambda runtime provides it
        externalModules: ['@aws-sdk/*'],
        // esbuild tree-shakes automatically when bundling; this flag just makes it explicit
        esbuildArgs: { '--tree-shaking': 'true' },
        // Source maps for debugging prod issues
        sourceMap: true,
        // Define for dead code elimination
        define: {
          'process.env.NODE_ENV': '"production"',
        },
      },
      // Enable X-Ray tracing for debugging
      tracing: Tracing.ACTIVE,
      // Reserved concurrency to prevent Lambda from consuming entire account limit
      reservedConcurrentExecutions: 100,
    });

    // Grant DynamoDB permissions
    dataTable.grantReadWriteData(apiHandler);

    // API Gateway with proper CORS and throttling
    const api = new RestApi(this, 'ServerlessApi', {
      restApiName: 'production-serverless-api',
      description: 'Production serverless API with proper error handling',
      defaultCorsPreflightOptions: {
        allowOrigins: process.env.NODE_ENV === 'production'
          ? ['https://yourdomain.com']
          : Cors.ALL_ORIGINS,
        allowMethods: Cors.ALL_METHODS,
        allowHeaders: ['Content-Type', 'Authorization', 'X-Amz-Date'],
      },
      deployOptions: {
        // Stage-specific throttling
        throttlingRateLimit: 1000,
        throttlingBurstLimit: 2000,
        // Enable detailed CloudWatch metrics
        metricsEnabled: true,
        loggingLevel: MethodLoggingLevel.INFO,
        // Enable X-Ray tracing
        tracingEnabled: true,
      },
    });

    // Add resource with proper integration
    const items = api.root.addResource('items');
    items.addMethod('GET', new LambdaIntegration(apiHandler));
    items.addMethod('POST', new LambdaIntegration(apiHandler));

    const singleItem = items.addResource('{id}');
    singleItem.addMethod('GET', new LambdaIntegration(apiHandler));
    singleItem.addMethod('PUT', new LambdaIntegration(apiHandler));
    singleItem.addMethod('DELETE', new LambdaIntegration(apiHandler));
  }
}

The Lambda Handler That Handles Reality

Here's our production Lambda handler, complete with all the error handling and optimizations learned from 3 years of production incidents:

TypeScript
// src/handlers/api.ts
import { APIGatewayProxyHandler, APIGatewayProxyResult } from 'aws-lambda';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, GetCommand, PutCommand, QueryCommand } from '@aws-sdk/lib-dynamodb';

// Create DynamoDB client outside handler for connection reuse
const dynamoClient = new DynamoDBClient({
  region: process.env.AWS_REGION,
  // Retry and timeout settings (part of the client tuning that reduced our costs by 15%)
  maxAttempts: 3,
  requestHandler: {
    connectionTimeout: 1000,
    socketTimeout: 1000,
  },
});

const docClient = DynamoDBDocumentClient.from(dynamoClient, {
  marshallOptions: {
    removeUndefinedValues: true,  // Prevents DynamoDB validation errors
    convertEmptyValues: false,
  },
});

interface Item {
  id: string;
  name: string;
  description?: string;
  createdAt: string;
  updatedAt: string;
}

// The handler that processes 10M+ requests/month
export const handler: APIGatewayProxyHandler = async (event): Promise<APIGatewayProxyResult> => {
  // Performance optimization: parse once, use everywhere
  const { httpMethod, pathParameters, body, requestContext } = event;
  const requestId = requestContext.requestId;

  // Structured logging that actually helps during incidents
  console.log('Request received', {
    requestId,
    method: httpMethod,
    path: event.path,
    pathParams: pathParameters,
    userAgent: event.headers['User-Agent'],
    sourceIp: event.requestContext.identity.sourceIp,
  });

  try {
    switch (httpMethod) {
      case 'GET':
        return await handleGet(pathParameters?.id, requestId);
      case 'POST':
        return await handlePost(body, requestId);
      case 'PUT':
        return await handlePut(pathParameters?.id, body, requestId);
      case 'DELETE':
        return await handleDelete(pathParameters?.id, requestId);
      default:
        return createResponse(405, { error: 'Method not allowed' });
    }
  } catch (error) {
    // Catch variables are `unknown` under strict mode, so normalize before reading fields
    const err = error instanceof Error ? error : new Error(String(error));

    // Error handling that survived production incidents
    console.error('Handler error', {
      requestId,
      error: err.message,
      stack: err.stack,
      // Sanitized request data (never log sensitive info)
      method: httpMethod,
      path: event.path,
    });

    // Different error responses based on error type
    if (err.name === 'ValidationException') {
      return createResponse(400, { error: 'Invalid request data' });
    }

    if (err.name === 'ConditionalCheckFailedException') {
      return createResponse(409, { error: 'Resource conflict' });
    }

    if (err.name === 'ResourceNotFoundException') {
      return createResponse(404, { error: 'Resource not found' });
    }

    // Generic server error for unexpected issues
    return createResponse(500, {
      error: 'Internal server error',
      requestId,  // Include for support tickets
    });
  }
};

async function handleGet(id: string | undefined, requestId: string): Promise<APIGatewayProxyResult> {
  if (!id) {
    // List the first page of items (Limit caps the page size)
    const result = await docClient.send(new QueryCommand({
      TableName: process.env.TABLE_NAME!,
      KeyConditionExpression: 'PK = :pk',
      ExpressionAttributeValues: {
        ':pk': 'ITEM',
      },
      Limit: 50,  // Cap the page size so one request can't read the whole partition
    }));

    const items = result.Items?.map(item => ({
      id: item.SK.replace('ITEM#', ''),
      name: item.name,
      description: item.description,
      createdAt: item.createdAt,
      updatedAt: item.updatedAt,
    })) || [];

    return createResponse(200, { items, count: items.length, requestId });
  }

  // Get single item
  const result = await docClient.send(new GetCommand({
    TableName: process.env.TABLE_NAME!,
    Key: {
      PK: 'ITEM',
      SK: `ITEM#${id}`,
    },
  }));

  if (!result.Item) {
    return createResponse(404, { error: 'Item not found', requestId });
  }

  const item: Item = {
    id: result.Item.SK.replace('ITEM#', ''),
    name: result.Item.name,
    description: result.Item.description,
    createdAt: result.Item.createdAt,
    updatedAt: result.Item.updatedAt,
  };

  return createResponse(200, { item, requestId });
}

async function handlePost(body: string | null, requestId: string): Promise<APIGatewayProxyResult> {
  if (!body) {
    return createResponse(400, { error: 'Request body is required', requestId });
  }

  let data: Partial<Item>;
  try {
    data = JSON.parse(body);
  } catch (error) {
    return createResponse(400, { error: 'Invalid JSON', requestId });
  }

  // Validation that prevented many production bugs
  if (!data.name || typeof data.name !== 'string' || data.name.trim().length === 0) {
    return createResponse(400, { error: 'Name is required and must be a non-empty string', requestId });
  }

  if (data.name.length > 100) {
    return createResponse(400, { error: 'Name must be 100 characters or less', requestId });
  }

  const id = generateId();  // Custom ID generation
  const now = new Date().toISOString();

  const item: Item = {
    id,
    name: data.name.trim(),
    description: data.description?.trim() || undefined,
    createdAt: now,
    updatedAt: now,
  };

  // Single-table design with composite keys
  await docClient.send(new PutCommand({
    TableName: process.env.TABLE_NAME!,
    Item: {
      PK: 'ITEM',
      SK: `ITEM#${id}`,
      ...item,
      // GSI keys for alternative access patterns
      GSI1PK: 'ITEMS_BY_NAME',
      GSI1SK: item.name.toLowerCase(),
    },
    // Prevent overwriting existing items
    ConditionExpression: 'attribute_not_exists(PK)',
  }));

  console.log('Item created', { requestId, itemId: id });

  return createResponse(201, { item, requestId });
}

// Utility function for consistent responses
function createResponse(statusCode: number, body: any): APIGatewayProxyResult {
  return {
    statusCode,
    headers: {
      'Content-Type': 'application/json',
      'Access-Control-Allow-Origin': '*',  // Adjust for production
      'Access-Control-Allow-Headers': 'Content-Type,Authorization',
      'X-Request-ID': body.requestId || 'unknown',
    },
    body: JSON.stringify(body),
  };
}

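// Generate URL-safe IDs: base-36 timestamp plus up to 9 random base-36 characters.
// Math.random() is not collision-proof; crypto.randomUUID() is a stronger option.
function generateId(): string {
  return `${Date.now().toString(36)}-${Math.random().toString(36).slice(2, 11)}`;
}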
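
The list branch of handleGet caps each query at 50 items, but the listing never hands the caller a way to ask for the next page. One common approach is to turn DynamoDB's LastEvaluatedKey into an opaque cursor and feed it back as ExclusiveStartKey on the next request. A minimal sketch, assuming a base64url-encoded JSON cursor; the helper names `encodeCursor` and `decodeCursor` are hypothetical and not part of the handler above:

```typescript
// Hypothetical helpers: convert DynamoDB's LastEvaluatedKey into an opaque
// cursor string for the client, and back into an ExclusiveStartKey.
type DynamoKey = Record<string, unknown>;

function encodeCursor(lastEvaluatedKey: DynamoKey | undefined): string | undefined {
  if (!lastEvaluatedKey) return undefined; // no more pages
  return Buffer.from(JSON.stringify(lastEvaluatedKey), 'utf8').toString('base64url');
}

function decodeCursor(cursor: string | undefined): DynamoKey | undefined {
  if (!cursor) return undefined; // first page
  try {
    const parsed: unknown = JSON.parse(Buffer.from(cursor, 'base64url').toString('utf8'));
    if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
      throw new Error('cursor is not an object');
    }
    return parsed as DynamoKey;
  } catch {
    // A malformed cursor is a client error (the handler would answer with HTTP 400)
    throw new Error('Invalid cursor');
  }
}
```

In the handler this would mean passing `ExclusiveStartKey: decodeCursor(event.queryStringParameters?.cursor)` to the QueryCommand and returning `nextCursor: encodeCursor(result.LastEvaluatedKey)` in the response body. Because the cursor is client-controlled, a stricter version would also check that the decoded PK matches the partition being listed.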

Cost Optimization Lessons That Saved Thousands

1. Memory vs. CPU Trade-offs

I spent weeks optimizing our Lambda memory settings. Here's what I learned:

TypeScript
// Memory profiling revealed surprising insights (memory in MB, avgDuration in ms)
const memoryConfigs = [
  { memory: 512, avgDuration: 850, avgCost: 0.0012 },   // CPU-bound
  { memory: 1024, avgDuration: 420, avgCost: 0.0009 },  // Sweet spot
  { memory: 1536, avgDuration: 380, avgCost: 0.0011 },  // Diminishing returns
  { memory: 3008, avgDuration: 360, avgCost: 0.0021 },  // Overprovisioned
];

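1024 MB was our sweet spot. Lambda allocates CPU in proportion to memory, so more memory means faster execution and, up to a point, lower cost. Past that point the higher per-millisecond price outweighs the time saved.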
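
The trade-off is easy to check with arithmetic: Lambda bills compute as memory (in GB) multiplied by duration (in seconds). A rough sketch using the durations from the profiling table, assuming the x86 on-demand price of $0.0000166667 per GB-second in us-east-1 (check your own region and architecture) and ignoring the per-request fee. It makes no attempt to reproduce the avgCost column, whose unit isn't stated:

```typescript
// Assumption: x86 on-demand price in us-east-1, in USD per GB-second.
const PRICE_PER_GB_SECOND = 0.0000166667;

interface MemoryProfile {
  memoryMb: number;
  avgDurationMs: number;
}

// Compute cost for one million invocations, excluding the per-request fee.
function costPerMillionInvocations(profile: MemoryProfile): number {
  const gbSeconds = (profile.memoryMb / 1024) * (profile.avgDurationMs / 1000);
  return gbSeconds * PRICE_PER_GB_SECOND * 1_000_000;
}

// Memory sizes and durations taken from the profiling table above.
const profiles: MemoryProfile[] = [
  { memoryMb: 512, avgDurationMs: 850 },
  { memoryMb: 1024, avgDurationMs: 420 },
  { memoryMb: 1536, avgDurationMs: 380 },
  { memoryMb: 3008, avgDurationMs: 360 },
];

// Pick the configuration with the lowest compute cost.
const cheapest = profiles.reduce((best, p) =>
  costPerMillionInvocations(p) < costPerMillionInvocations(best) ? p : best,
);

for (const p of profiles) {
  console.log(`${p.memoryMb} MB: $${costPerMillionInvocations(p).toFixed(2)} per 1M invocations`);
}
console.log(`Cheapest: ${cheapest.memoryMb} MB`);
```

Doubling memory from 512 MB to 1024 MB roughly halves the duration, so the GB-seconds stay about the same while latency drops. Beyond 1024 MB the duration barely moves and the cost climbs with the memory size.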

2. Connection Reuse Saved 15% on AWS Bills

TypeScript
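// Before: client created with default settings
const defaultClient = new DynamoDBClient({ region: 'us-east-1' });

// After: created once at module scope with bounded retries and short timeouts = 15% cost reduction
const dynamoClient = new DynamoDBClient({
  region: 'us-east-1',
  maxAttempts: 3,
  requestHandler: {
    connectionTimeout: 1000,
    socketTimeout: 1000,
  },
});

// HTTP keep-alive: AWS SDK for JavaScript v3 reuses connections by default.
// This variable only affects SDK v2, and it belongs in the function's environment
// configuration rather than in code that runs after the client is created.
process.env.AWS_NODEJS_CONNECTION_REUSE_ENABLED = '1';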

3. Bundle Size Optimization

TypeScript
// CDK bundling config that reduced cold starts by 40%
bundling: {
  minify: true,
  target: 'node20',
  externalModules: [
    '@aws-sdk/*',  // Use Lambda runtime version
    'aws-lambda',  // Type-only import, nothing to bundle
  ],
  // esbuild tree-shakes automatically when bundling; this flag just makes it explicit
  esbuildArgs: { '--tree-shaking': 'true' },
  sourceMap: process.env.NODE_ENV !== 'production',  // Debug info only in dev
  define: {
    'process.env.NODE_ENV': '"production"',
  },
  banner: '/* Production Lambda bundle */',
  // Critical: keep large dependencies out of the bundle.
  // CDK's nodeModules option takes a list of packages to install unbundled, so it
  // can't select individual functions. Import only what you use in handler code
  // instead, e.g. `import throttle from 'lodash/throttle'`.
}

The Monitoring Setup That Actually Alerts on Real Issues

After too many 3 AM pages for non-issues, here's our production monitoring:

TypeScript
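// CloudWatch alarms that don't cry wolf
import { Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Alarm, MathExpression, Metric, TreatMissingData } from 'aws-cdk-lib/aws-cloudwatch';
import { IFunction } from 'aws-cdk-lib/aws-lambda';

export class ServerlessMonitoring extends Construct {
  // IFunction also accepts imported functions and avoids shadowing the global Function type
  constructor(scope: Construct, id: string, props: { lambdaFunction: IFunction }) {
    super(scope, id);

    // Error rate alarm - 5% error rate over 5 minutes, computed as errors / invocations
    const errorAlarm = new Alarm(this, 'HighErrorRate', {
      metric: new MathExpression({
        expression: 'errors / invocations',
        usingMetrics: {
          errors: props.lambdaFunction.metricErrors({ statistic: 'Sum' }),
          invocations: props.lambdaFunction.metricInvocations({ statistic: 'Sum' }),
        },
        period: Duration.minutes(5),
      }),
      threshold: 0.05,  // 5% error rate
      evaluationPeriods: 2,
      treatMissingData: TreatMissingData.NOT_BREACHING,
    });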

    // Duration alarm - 95th percentile over 5 seconds
    const durationAlarm = new Alarm(this, 'SlowRequests', {
      metric: props.lambdaFunction.metricDuration({
        statistic: 'p95',
        period: Duration.minutes(5),
      }),
      threshold: 5000,  // 5 seconds
      evaluationPeriods: 3,
    });

    // Throttle alarm - any throttling is bad
    const throttleAlarm = new Alarm(this, 'ThrottledRequests', {
      metric: props.lambdaFunction.metricThrottles({
        statistic: 'Sum',
        period: Duration.minutes(1),
      }),
      threshold: 1,
      evaluationPeriods: 1,
    });

    // Custom metric for business logic errors
    const businessErrorAlarm = new Alarm(this, 'BusinessLogicErrors', {
      metric: new Metric({
        namespace: 'MyApp/Lambda',
        metricName: 'BusinessErrors',
        statistic: 'Sum',
      }),
      threshold: 10,
      evaluationPeriods: 2,
    });
  }
}
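
The BusinessErrors alarm only fires if something publishes that metric. One low-overhead option on Lambda is the CloudWatch Embedded Metric Format: a structured log line that CloudWatch turns into a metric without an extra API call. A minimal sketch, assuming the metric should land in the same `MyApp/Lambda` namespace with no dimensions so that it matches the alarm above; `buildBusinessErrorRecord` is a hypothetical helper name:

```typescript
// Shape of a CloudWatch Embedded Metric Format (EMF) log record.
interface EmfRecord {
  _aws: {
    Timestamp: number; // milliseconds since epoch
    CloudWatchMetrics: Array<{
      Namespace: string;
      Dimensions: string[][];
      Metrics: Array<{ Name: string; Unit: string }>;
    }>;
  };
  [key: string]: unknown;
}

// Hypothetical helper: builds one EMF record carrying a BusinessErrors count.
// Extra context (request ID, error code) is logged alongside but is not a dimension.
function buildBusinessErrorRecord(
  count: number,
  context: Record<string, string>,
  timestampMs: number,
): EmfRecord {
  return {
    ...context, // spread first so context can't overwrite the metric fields
    _aws: {
      Timestamp: timestampMs,
      CloudWatchMetrics: [
        {
          Namespace: 'MyApp/Lambda',
          Dimensions: [[]], // one empty dimension set = metric without dimensions
          Metrics: [{ Name: 'BusinessErrors', Unit: 'Count' }],
        },
      ],
    },
    BusinessErrors: count,
  };
}

// Inside a handler, writing the record to stdout is enough to publish it.
console.log(JSON.stringify(buildBusinessErrorRecord(1, { requestId: 'example-id' }, Date.now())));
```

Keeping high-cardinality values such as request IDs out of the dimensions avoids creating a separate metric per request while still leaving them searchable in the logs.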

The Mistakes That Cost Me Sleep (and Money)

1. The Concurrent Execution Limit Incident

December 2022, Black Friday. Our webhook processing Lambda consumed all 1,000 concurrent executions in our AWS account. Our main API went down because it couldn't get any Lambda capacity.

Fix: Set reserved concurrency on critical functions:

TypeScript
reservedConcurrentExecutions: 100,  // Guarantees 100 for this function and caps it there
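
Reserved concurrency cuts both ways: it guarantees capacity for a function and also caps that function at the same number. Capping the webhook processor is what stops it from starving the API. A quick sanity check for a concurrency plan, assuming the default account limit of 1,000 and the 100 executions AWS keeps unreserved; the function names and numbers in `plan` are hypothetical:

```typescript
// Assumption: AWS requires at least 100 concurrent executions to stay unreserved.
const MIN_UNRESERVED = 100;

// Returns what is left in the shared pool after all reservations.
function unreservedConcurrency(accountLimit: number, reserved: Record<string, number>): number {
  const totalReserved = Object.values(reserved).reduce((sum, n) => sum + n, 0);
  const unreserved = accountLimit - totalReserved;
  if (unreserved < MIN_UNRESERVED) {
    throw new Error(
      `Plan reserves ${totalReserved}, leaving ${unreserved} unreserved (minimum ${MIN_UNRESERVED})`,
    );
  }
  return unreserved;
}

// Hypothetical plan: guarantee the API, cap the bursty webhook processor.
const plan = { apiHandler: 100, webhookProcessor: 300 };
const remaining = unreservedConcurrency(1000, plan);
console.log(`Unreserved pool for everything else: ${remaining}`);
```

With this plan the webhook processor can never use more than 300 executions, the API always has 100 available, and every other function shares what remains.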

2. The DynamoDB Hot Partition Disaster

January 2023. Used sequential IDs for DynamoDB partition keys. All traffic hit one partition. Read/write throttling killed performance.

Fix: Distributed partition keys:

TypeScript
// Bad: Sequential IDs create hot partitions
PK: `USER#${sequentialId}`

// Good: UUID or timestamp + random
PK: `USER#${uuid.v4()}`
// Or: Use current hour + random for time-based access
PK: `USER#${new Date().getHours()}-${Math.random().toString(36)}`
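
One caveat with a random suffix: a reader can't recompute the key later. When many items share one logical partition (like a single `ITEM` key), a common alternative is write sharding with a calculated suffix, where the shard number is derived from the item's own ID. A minimal sketch, assuming 10 shards and a SHA-256 based hash; the helper names are hypothetical:

```typescript
import { createHash } from 'node:crypto';

const SHARD_COUNT = 10; // assumption: tune to your write volume

// Derive a stable shard number from the item ID, so readers can recompute the key.
function shardFor(id: string, shardCount: number = SHARD_COUNT): number {
  const digest = createHash('sha256').update(id).digest();
  return digest.readUInt32BE(0) % shardCount;
}

// Build the partition key for a single item, e.g. "ITEM#7".
function shardedPartitionKey(prefix: string, id: string): string {
  return `${prefix}#${shardFor(id)}`;
}

// To list everything under a prefix, query each shard key and merge the results.
function allShardKeys(prefix: string, shardCount: number = SHARD_COUNT): string[] {
  return Array.from({ length: shardCount }, (_, shard) => `${prefix}#${shard}`);
}
```

Single-item reads stay one GetItem call because the shard is recomputed from the ID. The cost is that list queries fan out across all shards, so the shard count is a trade-off between write throughput and read complexity.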

3. The Memory Leak That Wasn't

March 2023. Lambda functions timing out after exactly 15 minutes. Thought it was a memory leak. Turned out AWS had a 15-minute maximum execution time limit. We were processing large batches synchronously.

Fix: Batch processing with pagination:

TypeScript
// Process in smaller chunks
const BATCH_SIZE = 100;
const MAX_EXECUTION_TIME = 14 * 60 * 1000; // 14 minutes
const startTime = Date.now();

for (let i = 0; i < items.length; i += BATCH_SIZE) {
  if (Date.now() - startTime > MAX_EXECUTION_TIME) {
    // Schedule continuation via SQS
    await scheduleRemainingWork(items.slice(i));
    break;
  }

  const batch = items.slice(i, i + BATCH_SIZE);
  await processBatch(batch);
}
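
A variation on the loop above asks the Lambda context how much time is left instead of tracking a start time, which keeps working if the function's timeout is changed later. A minimal sketch, assuming a one-minute safety margin; `processBatch` and `scheduleRemainingWork` are passed in as parameters and stand in for the helpers used above:

```typescript
// Minimal slice of the Lambda context object that this sketch relies on.
interface TimeBudget {
  getRemainingTimeInMillis(): number;
}

const BATCH_SIZE = 100;
const SAFETY_MARGIN_MS = 60_000; // assumption: leave a minute to hand off remaining work

// Processes batches until the time budget runs low, then defers the rest.
// Returns the number of items handled in this invocation.
async function processWithinBudget<T>(
  items: T[],
  budget: TimeBudget,
  processBatch: (batch: T[]) => Promise<void>,
  scheduleRemainingWork: (rest: T[]) => Promise<void>,
): Promise<number> {
  for (let i = 0; i < items.length; i += BATCH_SIZE) {
    if (budget.getRemainingTimeInMillis() < SAFETY_MARGIN_MS) {
      // Not enough time for another batch: hand the remainder to a queue
      await scheduleRemainingWork(items.slice(i));
      return i;
    }
    await processBatch(items.slice(i, i + BATCH_SIZE));
  }
  return items.length;
}
```

In a real handler the second argument is simply the `context` parameter, whose `getRemainingTimeInMillis()` method reports the time left before the configured timeout.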

TypeScript Patterns That Saved My Sanity

1. Strict Event Type Definitions

TypeScript
// Custom type definitions for better IntelliSense
interface StrictAPIGatewayEvent extends APIGatewayProxyEvent {
  pathParameters: { [key: string]: string };  // Never null in our setup
  body: string;  // Always present for POST/PUT
}

// Type guards for runtime safety
function isValidItemData(data: any): data is Partial<Item> {
  return typeof data === 'object' &&
         data !== null &&
         (data.name === undefined || typeof data.name === 'string');
}

2. Environment Variable Validation

TypeScript
// Validate environment at startup, not runtime
interface Environment {
  TABLE_NAME: string;
  LOG_LEVEL: 'debug' | 'info' | 'warn' | 'error';
  NODE_ENV: 'development' | 'production';
}

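const LOG_LEVELS = ['debug', 'info', 'warn', 'error'] as const;
const NODE_ENVS = ['development', 'production'] as const;

// Narrow a raw string to one of the allowed values, or fail loudly
function oneOf<T extends string>(name: string, value: string, allowed: readonly T[]): T {
  if (!(allowed as readonly string[]).includes(value)) {
    throw new Error(`${name} must be one of: ${allowed.join(', ')}`);
  }
  return value as T;
}

function validateEnvironment(): Environment {
  const env = process.env;

  if (!env.TABLE_NAME) {
    throw new Error('TABLE_NAME environment variable is required');
  }

  return {
    TABLE_NAME: env.TABLE_NAME,
    LOG_LEVEL: oneOf('LOG_LEVEL', env.LOG_LEVEL || 'info', LOG_LEVELS),
    NODE_ENV: oneOf('NODE_ENV', env.NODE_ENV || 'development', NODE_ENVS),
  };
}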

// Validate once at module load
const ENV = validateEnvironment();

3. Result Types for Error Handling

TypeScript
// Rust-inspired Result type for clean error handling
type Result<T, E = Error> =
  | { success: true; data: T }
  | { success: false; error: E };

async function getItem(id: string): Promise<Result<Item, string>> {
  try {
    const result = await docClient.send(new GetCommand({
      TableName: ENV.TABLE_NAME,
      Key: { PK: 'ITEM', SK: `ITEM#${id}` },
    }));

    if (!result.Item) {
      return { success: false, error: 'Item not found' };
    }

    return { success: true, data: transformDynamoItem(result.Item) };
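  } catch (error) {
    // Catch variables are `unknown` under strict mode; narrow before reading .message
    return { success: false, error: error instanceof Error ? error.message : String(error) };
  }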
}

// Usage
const result = await getItem(id);
if (!result.success) {
  return createResponse(404, { error: result.error });
}
// TypeScript knows result.data is Item
const item = result.data;
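
One limitation of a plain string error is that the caller can't tell a missing item from a DynamoDB failure, so the usage above answers 404 in both cases. A possible refinement, sketched here with a hypothetical `GetItemError` type, is to carry a discriminant in the error and map it to a status code:

```typescript
type Result<T, E = Error> =
  | { success: true; data: T }
  | { success: false; error: E };

// Hypothetical error shape: a discriminant plus a human-readable message.
type GetItemError =
  | { kind: 'not_found'; message: string }
  | { kind: 'internal'; message: string };

// Map each error kind to an HTTP status; the switch is exhaustive over `kind`.
function statusFor(error: GetItemError): number {
  switch (error.kind) {
    case 'not_found':
      return 404;
    case 'internal':
      return 500;
  }
}

// Convert a thrown value of unknown type into an internal failure.
function toInternalError(error: unknown): Result<never, GetItemError> {
  const message = error instanceof Error ? error.message : String(error);
  return { success: false, error: { kind: 'internal', message } };
}
```

With this shape, `getItem` would return `{ kind: 'not_found', ... }` when `result.Item` is missing and `toInternalError(error)` from its catch block, and the caller would respond with `createResponse(statusFor(result.error), ...)`.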

Performance Insights from Production Data

After 18 months in production with detailed monitoring:

Cold Start Analysis

  • Average cold start: 850ms
  • P95 cold start: 1,200ms
  • Bundle size impact: 10MB bundle = +400ms cold start
  • Memory impact: 1024MB vs 512MB = -200ms cold start

Cost Breakdown (Monthly)

  • Lambda execution: $89/month (8M invocations)
  • API Gateway: $32/month (8M requests)
  • DynamoDB: $67/month (pay-per-request)
  • CloudWatch logs: $12/month
  • Total: $200/month (vs. $800/month for EC2 equivalent)

Reliability Metrics

  • Uptime: 99.97% (vs. 99.9% on EC2)
  • Error rate: 0.02% (mostly client errors)
  • P95 response time: 180ms

When NOT to Use Serverless

Serverless isn't always the answer. Here's when I stick with containers:

  1. Long-running processes - Video encoding, large batch jobs
  2. WebSocket-heavy apps - Real-time gaming, chat apps
  3. Legacy applications - Complex deployment requirements
  4. Stateful workloads - In-memory caches, sessions
  5. Cold-start-sensitive paths - Sub-100ms response requirements

The Deployment Pipeline That Doesn't Break

TypeScript
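// CDK pipeline for zero-downtime deployments
import { Stack } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { CodePipeline, CodePipelineSource, ManualApprovalStep, ShellStep } from 'aws-cdk-lib/pipelines';
// ServerlessStage (not shown) is a Stage subclass that exposes apiUrl as a CfnOutput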
export class ServerlessPipeline extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    const pipeline = new CodePipeline(this, 'Pipeline', {
      synth: new ShellStep('Synth', {
        input: CodePipelineSource.gitHub('yourorg/repo', 'main'),
        commands: [
          'npm ci',
          'npm run build',
          'npm run test',
          'npx cdk synth',
        ],
      }),
    });

    // Staged deployments: test first, then prod behind a manual approval
    const testStage = new ServerlessStage(this, 'Test', {
      stageName: 'test',
    });

    const prodStage = new ServerlessStage(this, 'Prod', {
      stageName: 'prod',
    });

    pipeline.addStage(testStage, {
      post: [
        new ShellStep('IntegrationTests', {
          commands: [
            'npm run test:integration',
          ],
          envFromCfnOutput: {
            API_URL: testStage.apiUrl,
          },
        }),
      ],
    });

    pipeline.addStage(prodStage, {
      pre: [
        new ManualApprovalStep('PromoteToProd'),
      ],
      post: [
        new ShellStep('SmokeTests', {
          commands: [
            'npm run test:smoke',
          ],
        }),
      ],
    });
  }
}

Final Thoughts: 3 Years Later

Serverless with TypeScript transformed how our team ships features. We went from weekly deployments to daily deployments. Our AWS bill dropped 75%. Our uptime improved to 99.97%.

But the biggest win? Developer happiness. No more 3 AM pages about server crashes. No more capacity planning spreadsheets. No more patching operating systems.

The serverless learning curve is steep, but the productivity gains are real. Start small, measure everything, and prepare for a few expensive mistakes along the way.

Ready to dive in? Start with a simple CRUD API, add proper monitoring from day one, and remember: Lambda functions are like potato chips - you can't deploy just one.