
AWS CDK Link Shortener Part 4: Production Deployment & Optimization

Multi-environment deployment strategies, performance optimization at scale, and cost management. Production insights and lessons learned with proper monitoring and incident response patterns.

Production Deployment & Optimization

Production optimization requires more than making things fast: it demands predictable performance under any load condition. When traffic spikes unexpectedly, infrastructure that works perfectly in staging can reveal scaling bottlenecks in production.

The most common oversight? Database provisioning for steady-state traffic rather than peak loads. A DynamoDB table optimized for normal operations can become a bottleneck when traffic increases 10x during campaigns or product launches.
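To make that concrete, here is a back-of-the-envelope sizing sketch. The 10x multiplier and 20% headroom are illustrative assumptions from the scenario above, not AWS guidance:

```typescript
// Hypothetical sizing helper: provisioned read capacity needed to survive
// a traffic spike. The multipliers are illustrative assumptions.
function capacityForSpike(
  steadyStateRps: number,
  spikeMultiplier: number,
  headroom = 1.2 // 20% buffer on top of the projected peak
): number {
  return Math.ceil(steadyStateRps * spikeMultiplier * headroom);
}

// A table sized for 100 reads/sec steady state...
const steady = capacityForSpike(100, 1);
// ...needs ten times that during a campaign spike
const peak = capacityForSpike(100, 10);

console.log(`steady: ${steady} RCU, campaign peak: ${peak} RCU`);
```

If you size only for `steady`, the campaign peak throttles; auto-scaling helps, but it reacts in minutes, not seconds.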

In Parts 1-3, we built the foundation, core functionality, and security. Now let's make it bulletproof for production deployment and optimization.

Multi-Environment Deployment: Beyond Dev and Prod

Most tutorials show you dev and prod environments. In practice, you need at least four: dev, staging, pre-prod, and production. Here's why and how to build them:

```typescript
#!/usr/bin/env node
// bin/link-shortener.ts - The app entry point that got us through launch day
import 'source-map-support/register';
import * as cdk from 'aws-cdk-lib';
import { LinkShortenerStack } from '../lib/link-shortener-stack';
import { DatabaseStack } from '../lib/database-stack';
import { MonitoringStack } from '../lib/monitoring-stack';

const app = new cdk.App();

// Environment configuration that scales with your team
const environments = {
  dev: {
    account: process.env.CDK_DEFAULT_ACCOUNT,
    region: 'us-west-2', // Cheaper region for dev
    stage: 'dev',
    domain: 'dev-links.yourcompany.com',
    customDomain: false,
    monitoring: {
      detailedMetrics: false,
      logRetention: 7, // Days
      alerting: false,
    },
    database: {
      billingMode: 'PAY_PER_REQUEST',
      pointInTimeRecovery: false,
      backupRetention: 7,
    },
    lambda: {
      reservedConcurrency: 10, // Limit dev costs
      memorySize: 512,
      timeout: 30,
    },
  },

  staging: {
    account: process.env.CDK_DEFAULT_ACCOUNT,
    region: 'us-east-1',
    stage: 'staging',
    domain: 'staging-links.yourcompany.com',
    customDomain: true,
    monitoring: {
      detailedMetrics: true,
      logRetention: 14,
      alerting: true,
    },
    database: {
      billingMode: 'PAY_PER_REQUEST',
      pointInTimeRecovery: true,
      backupRetention: 14,
    },
    lambda: {
      reservedConcurrency: 50,
      memorySize: 1024,
      timeout: 30,
    },
  },

  'pre-prod': {
    account: process.env.CDK_PREPROD_ACCOUNT,
    region: 'us-east-1',
    stage: 'pre-prod',
    domain: 'pp-links.yourcompany.com',
    customDomain: true,
    monitoring: {
      detailedMetrics: true,
      logRetention: 30,
      alerting: true,
    },
    database: {
      billingMode: 'PROVISIONED', // Match production patterns
      readCapacity: 100,
      writeCapacity: 50,
      pointInTimeRecovery: true,
      backupRetention: 30,
    },
    lambda: {
      reservedConcurrency: 200,
      memorySize: 1024,
      timeout: 30,
    },
  },

  production: {
    account: process.env.CDK_PROD_ACCOUNT,
    region: 'us-east-1',
    stage: 'prod',
    domain: 'go.yourcompany.com',
    customDomain: true,
    monitoring: {
      detailedMetrics: true,
      logRetention: 90,
      alerting: true,
      dashboard: true,
    },
    database: {
      billingMode: 'PROVISIONED',
      readCapacity: 500, // Start conservative, auto-scale up
      writeCapacity: 200,
      pointInTimeRecovery: true,
      backupRetention: 90,
      globalTables: true, // Multi-region disaster recovery
    },
    lambda: {
      reservedConcurrency: 1000,
      memorySize: 1024,
      timeout: 30,
      provisionedConcurrency: 10, // Keep some functions warm
    },
  },
};

const stage = app.node.tryGetContext('stage') || 'dev';
const config = environments[stage as keyof typeof environments];

if (!config) {
  throw new Error(`Invalid stage: ${stage}. Available stages: ${Object.keys(environments).join(', ')}`);
}

// Deploy in logical order with dependencies
const databaseStack = new DatabaseStack(app, `LinkShortener-Database-${stage}`, {
  env: { account: config.account, region: config.region },
  stage,
  config: config.database,
});

const appStack = new LinkShortenerStack(app, `LinkShortener-App-${stage}`, {
  env: { account: config.account, region: config.region },
  stage,
  config,
  database: databaseStack.database,
});

// Only deploy monitoring in staging+ environments
if (stage !== 'dev') {
  new MonitoringStack(app, `LinkShortener-Monitoring-${stage}`, {
    env: { account: config.account, region: config.region },
    stage,
    config: config.monitoring,
    appStack,
  });
}
```

Why four environments? Each serves a specific purpose:

  • Dev: Development isolation with cost controls for experimentation
  • Staging: Integration testing with production-like data and configurations
  • Pre-prod: Production replica for load testing and final validation
  • Production: Live environment with full monitoring and redundancy

Performance Optimization: Lambda Cold Starts and Beyond

Here are the optimizations that actually made a difference:

1. Lambda Configuration That Matters

```typescript
// lib/constructs/optimized-lambda.ts
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as nodejs from 'aws-cdk-lib/aws-lambda-nodejs';
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';

export interface OptimizedLambdaProps {
  entry: string;
  stage: string;
  reservedConcurrency?: number;
  provisionedConcurrency?: number;
  memorySize?: number;
}

export class OptimizedLambda extends Construct {
  public readonly function: nodejs.NodejsFunction;

  constructor(scope: Construct, id: string, props: OptimizedLambdaProps) {
    super(scope, id);

    this.function = new nodejs.NodejsFunction(this, 'Function', {
      entry: props.entry,
      handler: 'handler',
      runtime: lambda.Runtime.NODEJS_20_X,

      // Memory configuration affects CPU - sweet spot for most workloads
      memorySize: props.memorySize || 1024,

      // Timeout aggressive enough to fail fast
      timeout: cdk.Duration.seconds(30),

      // X-Ray tracing for performance insights
      tracing: lambda.Tracing.ACTIVE,

      // Environment variables for optimization
      environment: {
        NODE_OPTIONS: '--enable-source-maps',
        AWS_NODEJS_CONNECTION_REUSE_ENABLED: '1', // Reuse TCP connections
        POWERTOOLS_SERVICE_NAME: 'link-shortener',
        POWERTOOLS_METRICS_NAMESPACE: 'LinkShortener',
      },

      // Bundle optimization
      bundling: {
        minify: true,
        sourceMap: true,
        target: 'es2022',
        format: nodejs.OutputFormat.ESM,
        banner: 'import { createRequire } from "module"; const require = createRequire(import.meta.url);',
        externalModules: [
          '@aws-sdk/*', // Don't bundle AWS SDK - it's in the runtime
        ],
        esbuildArgs: {
          '--tree-shaking': 'true',
          '--platform': 'node',
          '--target': 'node20',
        },
      },

      // Reserved concurrency to prevent one function from eating all capacity
      reservedConcurrentExecutions: props.reservedConcurrency,

      // VPC configuration only if you need it (adds 1-2s to cold starts)
      // vpc: props.stage === 'prod' ? vpc : undefined,
    });

    // Provisioned concurrency for production critical paths
    if (props.provisionedConcurrency && props.stage === 'prod') {
      new lambda.Alias(this, 'ProductionAlias', {
        aliasName: 'prod',
        version: this.function.currentVersion,
        provisionedConcurrentExecutions: props.provisionedConcurrency,
      });
    }
  }
}
```

2. Connection Pooling That Actually Works

Creating new DynamoDB connections on every invocation was a major performance bottleneck. Here's a connection manager that helps:

```typescript
// src/utils/dynamodb-connection.ts
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient } from '@aws-sdk/lib-dynamodb';
import { NodeHttpHandler } from '@smithy/node-http-handler';
import { Agent } from 'https';

// Global connection pool - survives between Lambda invocations
let dynamoClient: DynamoDBDocumentClient | null = null;

export function getDynamoClient(): DynamoDBDocumentClient {
  if (!dynamoClient) {
    const client = new DynamoDBClient({
      region: process.env.AWS_REGION,
      maxAttempts: 3,

      // Optimize the HTTP layer for the Lambda runtime
      requestHandler: new NodeHttpHandler({
        connectionTimeout: 1000, // 1s connect timeout
        requestTimeout: 5000,    // 5s total request timeout

        // Connection pooling
        httpsAgent: new Agent({
          maxSockets: 10, // Reduced from default 50
          keepAlive: true,
          keepAliveMsecs: 30000,
        }),
      }),

      // Credentials come from Lambda's default provider chain
      // (execution role env vars) - no need to pass them explicitly
    });

    dynamoClient = DynamoDBDocumentClient.from(client, {
      marshallOptions: {
        convertEmptyValues: false,
        removeUndefinedValues: true,
        convertClassInstanceToMap: false,
      },
      unmarshallOptions: {
        wrapNumbers: false,
      },
    });

    // Log connection creation for debugging
    console.log('DynamoDB connection pool initialized');
  }

  return dynamoClient;
}

// Performance monitoring wrapper
export async function withPerformanceLogging<T>(
  operation: string,
  fn: () => Promise<T>
): Promise<T> {
  const start = Date.now();

  try {
    const result = await fn();
    const duration = Date.now() - start;

    console.log(JSON.stringify({
      operation,
      duration,
      success: true,
      timestamp: new Date().toISOString(),
    }));

    return result;
  } catch (error) {
    const duration = Date.now() - start;

    console.error(JSON.stringify({
      operation,
      duration,
      success: false,
      error: error instanceof Error ? error.message : String(error),
      timestamp: new Date().toISOString(),
    }));

    throw error;
  }
}
```

3. Production-Optimized Redirect Handler

```typescript
// src/handlers/redirect-optimized.ts
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';
import { GetCommand, UpdateCommand } from '@aws-sdk/lib-dynamodb';
import { getDynamoClient, withPerformanceLogging } from '../utils/dynamodb-connection';

// Declare cold start tracking outside the handler
let isColdStart = true;

export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  const startTime = Date.now();
  const coldStart = isColdStart;
  isColdStart = false;

  // Extract short code from path
  const shortCode = event.pathParameters?.proxy || event.pathParameters?.shortCode;

  if (!shortCode) {
    return createErrorResponse(404, 'Short code not found');
  }

  try {
    const dynamodb = getDynamoClient();

    // Optimized DynamoDB query with projection
    const result = await withPerformanceLogging(
      'GetShortUrl',
      () => dynamodb.send(new GetCommand({
        TableName: process.env.URLS_TABLE_NAME!,
        Key: { shortCode },
        ProjectionExpression: 'originalUrl, expiresAt, clickCount',
        ConsistentRead: false, // Eventually consistent is fine for redirects
      }))
    );

    if (!result.Item) {
      // Log 404 for analytics but don't block
      logAnalyticsAsync('404', shortCode, event).catch(console.error);
      return createErrorResponse(404, 'Link not found');
    }

    const { originalUrl, expiresAt } = result.Item;

    // Check expiration
    if (expiresAt && Date.now() > expiresAt) {
      logAnalyticsAsync('EXPIRED', shortCode, event).catch(console.error);
      return createErrorResponse(410, 'Link has expired');
    }

    // Update click count asynchronously (fire-and-forget)
    updateClickCountAsync(shortCode).catch(console.error);

    // Log successful redirect
    logAnalyticsAsync('SUCCESS', shortCode, event).catch(console.error);

    const responseTime = Date.now() - startTime;

    // Structured logging for monitoring
    console.log(JSON.stringify({
      event: 'redirect_success',
      shortCode,
      responseTime,
      coldStart,
      userAgent: event.headers['User-Agent']?.substring(0, 100),
      referer: event.headers['Referer']?.substring(0, 100),
      timestamp: new Date().toISOString(),
    }));

    return {
      statusCode: 301, // Permanent redirect for caching
      headers: {
        Location: originalUrl,
        'Cache-Control': 'public, max-age=300, s-maxage=3600', // 5min browser, 1hr CDN
        'X-Response-Time': responseTime.toString(),
        'X-Cold-Start': coldStart.toString(),
      },
      body: '',
    };
  } catch (error) {
    const responseTime = Date.now() - startTime;

    console.error(JSON.stringify({
      event: 'redirect_error',
      shortCode,
      error: error instanceof Error ? error.message : String(error),
      responseTime,
      coldStart,
      timestamp: new Date().toISOString(),
    }));

    return createErrorResponse(500, 'Internal server error');
  }
};

async function updateClickCountAsync(shortCode: string): Promise<void> {
  try {
    const dynamodb = getDynamoClient();

    await dynamodb.send(new UpdateCommand({
      TableName: process.env.URLS_TABLE_NAME!,
      Key: { shortCode },
      UpdateExpression: 'ADD clickCount :inc SET lastClickAt = :timestamp',
      ExpressionAttributeValues: {
        ':inc': 1,
        ':timestamp': Date.now(),
      },
    }));
  } catch (error) {
    // Don't fail the redirect if the analytics update fails
    console.error('Failed to update click count:', error);
  }
}

async function logAnalyticsAsync(
  eventType: string,
  shortCode: string,
  event: APIGatewayProxyEvent
): Promise<void> {
  // Implementation for async analytics logging
  // This would typically write to a separate analytics table or queue
}

function createErrorResponse(statusCode: number, message: string): APIGatewayProxyResult {
  return {
    statusCode,
    headers: {
      'Content-Type': 'text/html',
      'Cache-Control': 'no-cache',
    },
    body: `
      <!DOCTYPE html>
      <html>
        <head><title>Error</title></head>
        <body style="font-family: Arial, sans-serif; text-align: center; margin-top: 100px;">
          <h1>${statusCode}</h1>
          <p>${message}</p>
        </body>
      </html>
    `,
  };
}
```

Cost Optimization: Learning from Expensive Mistakes

Cost optimization becomes critical when traffic patterns change unexpectedly. Understanding how different AWS services scale and bill helps prevent budget surprises during high-traffic periods:
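Billing mode is the biggest lever. Here's a rough comparison of monthly DynamoDB read costs under the two modes; the unit prices are ballpark figures for illustration only, not current AWS list prices, so check the pricing page for your region before acting on them:

```typescript
// Illustrative DynamoDB read-cost comparison. The unit prices below are
// rough ballpark figures, NOT current AWS list prices.
const ON_DEMAND_PER_MILLION_READS = 0.25; // USD, illustrative
const PROVISIONED_PER_RCU_HOUR = 0.00013; // USD, illustrative
const HOURS_PER_MONTH = 730;

function onDemandReadCost(readsPerMonth: number): number {
  return (readsPerMonth / 1_000_000) * ON_DEMAND_PER_MILLION_READS;
}

function provisionedReadCost(rcu: number): number {
  return rcu * PROVISIONED_PER_RCU_HOUR * HOURS_PER_MONTH;
}

// 100 eventually-consistent reads/sec, steady all month (~263M reads):
const readsPerMonth = 100 * 3600 * HOURS_PER_MONTH;
console.log('on-demand:   $' + onDemandReadCost(readsPerMonth).toFixed(2));
// 100 reads/sec at 0.5 RCU each (eventually consistent) = 50 RCU provisioned:
console.log('provisioned: $' + provisionedReadCost(50).toFixed(2));
```

The takeaway: steady, predictable traffic strongly favors provisioned capacity, while spiky or unknown traffic favors on-demand, which is exactly why the stacks below switch modes by stage.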

1. DynamoDB Optimization Strategy

```typescript
// lib/database-stack-optimized.ts
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';

export class OptimizedDatabaseStack extends Construct {
  public readonly linksTable: dynamodb.Table;

  constructor(scope: Construct, id: string, props: {
    stage: string;
    expectedReadsPerSecond: number;
    expectedWritesPerSecond: number;
  }) {
    super(scope, id);

    this.linksTable = new dynamodb.Table(this, 'LinksTable', {
      partitionKey: {
        name: 'shortCode',
        type: dynamodb.AttributeType.STRING,
      },

      // Start with on-demand, switch to provisioned when you understand patterns
      billingMode: props.stage === 'prod'
        ? dynamodb.BillingMode.PROVISIONED
        : dynamodb.BillingMode.PAY_PER_REQUEST,

      // Provisioned capacity for production (20% headroom over expected load)
      ...(props.stage === 'prod' && {
        readCapacity: Math.max(5, Math.ceil(props.expectedReadsPerSecond * 1.2)),
        writeCapacity: Math.max(5, Math.ceil(props.expectedWritesPerSecond * 1.2)),
      }),

      pointInTimeRecovery: props.stage === 'prod',
      deletionProtection: props.stage === 'prod',

      // Encryption for compliance
      encryption: dynamodb.TableEncryption.AWS_MANAGED,

      stream: dynamodb.StreamViewType.NEW_AND_OLD_IMAGES, // For analytics
    });

    // Auto-scaling for production
    if (props.stage === 'prod') {
      this.setupAutoScaling();
    }

    // Global Secondary Index for analytics queries
    this.linksTable.addGlobalSecondaryIndex({
      indexName: 'UserIndex',
      partitionKey: {
        name: 'userId',
        type: dynamodb.AttributeType.STRING,
      },
      sortKey: {
        name: 'createdAt',
        type: dynamodb.AttributeType.NUMBER,
      },
      projectionType: dynamodb.ProjectionType.KEYS_ONLY, // Minimize costs

      // Same billing mode as the main table; the GSI absorbs every
      // base-table write, so its write capacity matches the table's
      ...(props.stage === 'prod' && {
        readCapacity: Math.max(5, Math.ceil(props.expectedReadsPerSecond * 0.1)),
        writeCapacity: Math.max(5, Math.ceil(props.expectedWritesPerSecond * 1.0)),
      }),
    });
  }

  private setupAutoScaling(): void {
    // Read capacity auto-scaling
    const readScaling = this.linksTable.autoScaleReadCapacity({
      minCapacity: 5,
      maxCapacity: 1000, // Reasonable ceiling
    });

    readScaling.scaleOnUtilization({
      targetUtilizationPercent: 70, // Conservative target
      scaleInCooldown: cdk.Duration.minutes(5),
      scaleOutCooldown: cdk.Duration.minutes(1),
    });

    // Write capacity auto-scaling
    const writeScaling = this.linksTable.autoScaleWriteCapacity({
      minCapacity: 5,
      maxCapacity: 500,
    });

    writeScaling.scaleOnUtilization({
      targetUtilizationPercent: 70,
      scaleInCooldown: cdk.Duration.minutes(5),
      scaleOutCooldown: cdk.Duration.minutes(1),
    });
  }
}
```

2. CloudFront Configuration for Maximum Cost Efficiency

```typescript
// lib/cdn-stack-optimized.ts
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';

export class OptimizedCDNStack extends Construct {
  public readonly distribution: cloudfront.Distribution;

  constructor(scope: Construct, id: string, props: {
    apiGateway: apigateway.RestApi;
    stage: string;
  }) {
    super(scope, id);

    this.distribution = new cloudfront.Distribution(this, 'Distribution', {
      defaultBehavior: {
        origin: new origins.RestApiOrigin(props.apiGateway),

        // Caching policy optimized for redirects
        cachePolicy: new cloudfront.CachePolicy(this, 'RedirectCachePolicy', {
          cachePolicyName: `link-shortener-${props.stage}`,
          defaultTtl: cdk.Duration.minutes(5),
          maxTtl: cdk.Duration.hours(24),
          minTtl: cdk.Duration.minutes(1),

          // Cache based on path only (ignore query strings and headers)
          queryStringBehavior: cloudfront.CacheQueryStringBehavior.none(),
          headerBehavior: cloudfront.CacheHeaderBehavior.none(),
          cookieBehavior: cloudfront.CacheCookieBehavior.none(),
        }),

        // Compression saves bandwidth costs
        compress: true,

        // Only allow GET/HEAD requests for redirects
        allowedMethods: cloudfront.AllowedMethods.ALLOW_GET_HEAD,
        cachedMethods: cloudfront.CachedMethods.CACHE_GET_HEAD,

        viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
      },

      // Additional behavior for API endpoints (no caching)
      additionalBehaviors: {
        '/api/*': {
          origin: new origins.RestApiOrigin(props.apiGateway),
          cachePolicy: cloudfront.CachePolicy.CACHING_DISABLED,
          viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
          allowedMethods: cloudfront.AllowedMethods.ALLOW_ALL,
        },
      },

      // Use the cheapest price class for non-critical applications
      priceClass: cloudfront.PriceClass.PRICE_CLASS_100, // US, Canada, Europe

      // Error handling
      errorResponses: [
        {
          httpStatus: 404,
          responseHttpStatus: 404,
          responsePagePath: '/404.html',
          ttl: cdk.Duration.minutes(5), // Cache 404s to prevent hammering origin
        },
        {
          httpStatus: 500,
          responseHttpStatus: 500,
          responsePagePath: '/500.html',
          ttl: cdk.Duration.minutes(1), // Short cache for server errors
        },
      ],

      // Enable logging for analytics (additional cost but necessary for insights)
      ...(props.stage === 'prod' && {
        enableLogging: true,
        logBucket: s3.Bucket.fromBucketName(this, 'LogsBucket', `cloudfront-logs-${props.stage}`),
        logFilePrefix: 'link-shortener/',
      }),
    });
  }
}
```
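The TTLs in that cache policy trade freshness for origin offload. A quick sketch of what that buys you, where the hit ratio is an assumption you would measure from CloudFront's `CacheHitRate` metric:

```typescript
// Rough origin-offload estimate for a CDN in front of API Gateway.
// hitRatio is an assumed measurement, not a guarantee.
function originRequestsPerSecond(edgeRps: number, hitRatio: number): number {
  return edgeRps * (1 - hitRatio);
}

// 1,000 req/s at the edge with a 90% hit ratio leaves roughly
// 100 req/s actually reaching API Gateway and Lambda
console.log(originRequestsPerSecond(1000, 0.9));
```

Since you pay for API Gateway requests, Lambda invocations, and DynamoDB reads only on cache misses, every point of hit ratio directly cuts the per-redirect cost.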

3. Cost Monitoring and Alerts

```typescript
// lib/cost-monitoring-stack.ts
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as subscriptions from 'aws-cdk-lib/aws-sns-subscriptions';
import * as actions from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';

export class CostMonitoringStack extends Construct {
  constructor(scope: Construct, id: string, props: {
    stage: string;
    alertEmail: string;
    monthlyBudget: number;
  }) {
    super(scope, id);

    // SNS topic for cost alerts
    const alertTopic = new sns.Topic(this, 'CostAlerts', {
      displayName: `Link Shortener Cost Alerts - ${props.stage}`,
    });

    alertTopic.addSubscription(
      new subscriptions.EmailSubscription(props.alertEmail)
    );

    // DynamoDB cost monitoring
    const dynamoReadAlarm = new cloudwatch.Alarm(this, 'DynamoReadUnitsHigh', {
      metric: new cloudwatch.Metric({
        namespace: 'AWS/DynamoDB',
        metricName: 'ConsumedReadCapacityUnits',
        dimensionsMap: {
          TableName: 'LinksTable', // Replace with actual table name
        },
        statistic: 'Sum',
        period: cdk.Duration.minutes(5),
      }),
      threshold: 1000, // Adjust based on your budget
      evaluationPeriods: 2,
      alarmDescription: 'DynamoDB read capacity usage is high',
    });

    dynamoReadAlarm.addAlarmAction(
      new actions.SnsAction(alertTopic)
    );

    // Lambda invocation cost monitoring
    const lambdaInvocationsAlarm = new cloudwatch.Alarm(this, 'LambdaInvocationsHigh', {
      metric: new cloudwatch.Metric({
        namespace: 'AWS/Lambda',
        metricName: 'Invocations',
        dimensionsMap: {
          FunctionName: 'redirect-handler', // Replace with actual function name
        },
        statistic: 'Sum',
        period: cdk.Duration.hours(1),
      }),
      threshold: 100000, // 100k invocations per hour
      evaluationPeriods: 1,
      alarmDescription: 'Lambda invocations are unusually high',
    });

    lambdaInvocationsAlarm.addAlarmAction(
      new actions.SnsAction(alertTopic)
    );

    // Create cost dashboard
    new cloudwatch.Dashboard(this, 'CostDashboard', {
      dashboardName: `LinkShortener-Costs-${props.stage}`,
      widgets: [
        [
          new cloudwatch.GraphWidget({
            title: 'DynamoDB Read Capacity Units',
            left: [dynamoReadAlarm.metric],
            width: 12,
          }),
        ],
        [
          new cloudwatch.GraphWidget({
            title: 'Lambda Invocations',
            left: [lambdaInvocationsAlarm.metric],
            width: 12,
          }),
        ],
        [
          new cloudwatch.GraphWidget({
            title: 'CloudFront Requests',
            left: [
              new cloudwatch.Metric({
                namespace: 'AWS/CloudFront',
                metricName: 'Requests',
                statistic: 'Sum',
                period: cdk.Duration.hours(1),
              }),
            ],
            width: 12,
          }),
        ],
      ],
    });
  }
}
```

Production Monitoring: Beyond "It Works"

Here's the monitoring approach that helped during production incidents:

1. Custom Metrics That Matter

```typescript
// src/utils/metrics.ts
import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

const cloudwatch = new CloudWatchClient({ region: process.env.AWS_REGION });

export class MetricsCollector {
  private namespace = 'LinkShortener/Production';
  private metrics: Array<{
    MetricName: string;
    Value: number;
    Unit: string;
    Timestamp: Date;
    Dimensions?: Array<{ Name: string; Value: string }>;
  }> = [];

  async recordRedirectSuccess(shortCode: string, responseTime: number, coldStart: boolean): Promise<void> {
    this.metrics.push(
      {
        MetricName: 'RedirectResponseTime',
        Value: responseTime,
        Unit: 'Milliseconds',
        Timestamp: new Date(),
        Dimensions: [
          { Name: 'ColdStart', Value: coldStart.toString() },
        ],
      },
      {
        MetricName: 'RedirectCount',
        Value: 1,
        Unit: 'Count',
        Timestamp: new Date(),
        Dimensions: [
          { Name: 'Status', Value: 'Success' },
        ],
      }
    );

    await this.flush();
  }

  async recordDatabaseLatency(operation: string, latency: number): Promise<void> {
    this.metrics.push({
      MetricName: 'DatabaseLatency',
      Value: latency,
      Unit: 'Milliseconds',
      Timestamp: new Date(),
      Dimensions: [
        { Name: 'Operation', Value: operation },
      ],
    });

    await this.flush();
  }

  async recordError(errorType: string, shortCode?: string): Promise<void> {
    this.metrics.push({
      MetricName: 'ErrorCount',
      Value: 1,
      Unit: 'Count',
      Timestamp: new Date(),
      Dimensions: [
        { Name: 'ErrorType', Value: errorType },
        ...(shortCode ? [{ Name: 'ShortCode', Value: shortCode }] : []),
      ],
    });

    await this.flush();
  }

  private async flush(): Promise<void> {
    if (this.metrics.length === 0) return;

    try {
      await cloudwatch.send(new PutMetricDataCommand({
        Namespace: this.namespace,
        MetricData: this.metrics,
      }));

      this.metrics = []; // Clear after successful send
    } catch (error) {
      console.error('Failed to send metrics:', error);
      // Don't throw - metrics failures shouldn't break the main functionality
    }
  }
}

// Singleton instance
export const metrics = new MetricsCollector();
```
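The core pattern here is buffer-then-flush. A self-contained sketch of that pattern with the CloudWatch client stubbed out (so the batching logic runs without AWS credentials; `MiniCollector` and the chunk size of 20 are illustrative choices, not part of the collector above):

```typescript
// Self-contained sketch of the buffer-then-flush pattern. The sender is a
// stub; real code would call PutMetricDataCommand with each batch.
type Datum = { MetricName: string; Value: number; Unit: string };

class MiniCollector {
  private buffer: Datum[] = [];
  constructor(private send: (batch: Datum[]) => Promise<void>) {}

  record(name: string, value: number, unit = 'Count'): void {
    this.buffer.push({ MetricName: name, Value: value, Unit: unit });
  }

  // Flush in conservative chunks of 20 data points per call
  // (check the current PutMetricData request limits for your SDK)
  async flush(): Promise<void> {
    while (this.buffer.length > 0) {
      const batch = this.buffer.splice(0, 20);
      await this.send(batch);
    }
  }
}

// Usage: stub sender that just counts API calls
let apiCalls = 0;
const collector = new MiniCollector(async () => { apiCalls++; });
for (let i = 0; i < 45; i++) collector.record('RedirectCount', 1);
collector.flush().then(() => console.log(`flushed in ${apiCalls} calls`));
```

Buffering more aggressively (e.g. flushing once at the end of the invocation instead of after every record) cuts PutMetricData call volume, at the cost of losing metrics if the function crashes mid-invocation.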

2. Load Testing That Simulates Reality

```typescript
// tests/load-test.ts - Load testing that helps catch scaling issues
import { performance } from 'perf_hooks';

interface LoadTestConfig {
  baseUrl: string;
  concurrentUsers: number;
  testDurationMs: number;
  rampUpMs: number;
  shortCodes: string[];
}

interface LoadTestResult {
  totalRequests: number;
  successfulRequests: number;
  failedRequests: number;
  averageResponseTime: number;
  p50ResponseTime: number;
  p95ResponseTime: number;
  p99ResponseTime: number;
  errorsPerSecond: number;
  requestsPerSecond: number;
}

export async function runLoadTest(config: LoadTestConfig): Promise<LoadTestResult> {
  const results: Array<{
    success: boolean;
    responseTime: number;
    timestamp: number;
    error?: string;
  }> = [];

  const startTime = performance.now();
  const endTime = startTime + config.testDurationMs;

  // Create a promise for each concurrent user
  const userPromises = Array.from({ length: config.concurrentUsers }, async (_, userIndex) => {
    // Stagger user start times during ramp-up
    const userStartDelay = (config.rampUpMs * userIndex) / config.concurrentUsers;
    await sleep(userStartDelay);

    while (performance.now() < endTime) {
      const requestStart = performance.now();

      try {
        // Random short code selection
        const shortCode = config.shortCodes[Math.floor(Math.random() * config.shortCodes.length)];
        const url = `${config.baseUrl}/${shortCode}`;

        const response = await fetch(url, {
          method: 'GET',
          redirect: 'manual', // Don't follow redirects - we just want timing
        });

        const responseTime = performance.now() - requestStart;

        results.push({
          success: response.status >= 200 && response.status < 400,
          responseTime,
          timestamp: performance.now(),
        });
      } catch (error) {
        const responseTime = performance.now() - requestStart;

        results.push({
          success: false,
          responseTime,
          timestamp: performance.now(),
          error: error instanceof Error ? error.message : String(error),
        });
      }

      // Wait before the next request (adjust for desired load)
      await sleep(100 + Math.random() * 200); // 100-300ms between requests per user
    }
  });

  // Wait for all users to complete
  await Promise.all(userPromises);

  // Calculate statistics
  const successfulResults = results.filter(r => r.success);
  const responseTimes = successfulResults.map(r => r.responseTime);
  responseTimes.sort((a, b) => a - b);

  const totalDurationSec = (performance.now() - startTime) / 1000;

  return {
    totalRequests: results.length,
    successfulRequests: successfulResults.length,
    failedRequests: results.length - successfulResults.length,
    averageResponseTime: responseTimes.reduce((a, b) => a + b, 0) / responseTimes.length,
    p50ResponseTime: responseTimes[Math.floor(responseTimes.length * 0.5)],
    p95ResponseTime: responseTimes[Math.floor(responseTimes.length * 0.95)],
    p99ResponseTime: responseTimes[Math.floor(responseTimes.length * 0.99)],
    errorsPerSecond: (results.length - successfulResults.length) / totalDurationSec,
    requestsPerSecond: results.length / totalDurationSec,
  };
}

async function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Example usage - run this before every deployment
async function validatePerformance() {
  console.log('Running pre-deployment load test...');

  const testConfig: LoadTestConfig = {
    baseUrl: 'https://staging-links.yourcompany.com',
    concurrentUsers: 50,
    testDurationMs: 60 * 1000, // 1 minute
    rampUpMs: 10 * 1000,       // 10 second ramp-up
    shortCodes: ['test1', 'test2', 'test3', 'popular-link', 'campaign-2024'],
  };

  const results = await runLoadTest(testConfig);

  // Performance assertions
  const maxAcceptableP95 = 500; // 500ms P95 response time
  const maxAcceptableErrorRate = 0.01; // 1% error rate

  if (results.p95ResponseTime > maxAcceptableP95) {
    throw new Error(`P95 response time too high: ${results.p95ResponseTime}ms > ${maxAcceptableP95}ms`);
  }

  const errorRate = results.failedRequests / results.totalRequests;
  if (errorRate > maxAcceptableErrorRate) {
    throw new Error(`Error rate too high: ${(errorRate * 100).toFixed(2)}% > ${(maxAcceptableErrorRate * 100)}%`);
  }

  console.log('Load test passed:', results);
}
```

Blue-Green Deployments: Deploy Without Fear

A deployment strategy that reduces deployment anxiety:

typescript
// deployment/blue-green-deploy.ts
import * as aws from '@aws-sdk/client-route53';
import * as lambda from '@aws-sdk/client-lambda';

interface DeploymentConfig {
  stage: 'blue' | 'green';
  domainName: string;
  hostedZoneId: string;
  healthCheckUrl: string;
}

export class BlueGreenDeployment {
  private route53 = new aws.Route53Client({});
  private lambdaClient = new lambda.LambdaClient({});

  async deployNewVersion(config: DeploymentConfig): Promise<void> {
    console.log(`Starting ${config.stage} deployment...`);

    // Step 1: Deploy new infrastructure
    await this.deployCDKStack(config.stage);

    // Step 2: Warm up the new environment
    await this.warmUpEnvironment(config);

    // Step 3: Run health checks
    await this.runHealthChecks(config.healthCheckUrl);

    // Step 4: Gradually shift traffic
    await this.shiftTraffic(config, [10, 25, 50, 100]);

    console.log(`${config.stage} deployment completed successfully`);
  }

  private async deployCDKStack(stage: string): Promise<void> {
    // This would typically use CDK CLI or AWS SDK to deploy
    console.log(`Deploying CDK stack for ${stage}...`);

    // Example: exec CDK deploy command
    const { spawn } = await import('child_process');

    return new Promise((resolve, reject) => {
      const deploy = spawn('npx', ['cdk', 'deploy', '--all', '--context', `stage=${stage}`], {
        stdio: 'inherit',
      });

      deploy.on('close', (code) => {
        if (code === 0) {
          resolve();
        } else {
          reject(new Error(`CDK deploy failed with code ${code}`));
        }
      });
    });
  }

  private async warmUpEnvironment(config: DeploymentConfig): Promise<void> {
    console.log('Warming up Lambda functions...');

    // Get all Lambda functions for this stage
    const functions = await this.lambdaClient.send(new lambda.ListFunctionsCommand({
      Marker: undefined,
      MaxItems: 100,
    }));

    const stageFunctions = functions.Functions?.filter(fn =>
      fn.FunctionName?.includes(config.stage)
    ) || [];

    // Warm up each function
    const warmUpPromises = stageFunctions.map(async (fn) => {
      if (!fn.FunctionName) return;

      try {
        await this.lambdaClient.send(new lambda.InvokeCommand({
          FunctionName: fn.FunctionName,
          Payload: JSON.stringify({
            source: 'warm-up',
            warmUp: true,
          }),
        }));

        console.log(`Warmed up ${fn.FunctionName}`);
      } catch (error) {
        console.warn(`⚠️  Failed to warm up ${fn.FunctionName}:`, error);
      }
    });

    await Promise.all(warmUpPromises);
  }

  private async runHealthChecks(healthCheckUrl: string): Promise<void> {
    console.log('Running health checks...');

    const checks = [
      { name: 'Basic redirect', path: '/test-redirect' },
      { name: 'API health', path: '/api/health' },
      { name: '404 handling', path: '/non-existent-link' },
    ];

    for (const check of checks) {
      const url = `${healthCheckUrl}${check.path}`;
      const response = await fetch(url);

      // Different expectations for different endpoints
      const expectedStatus = check.path === '/non-existent-link' ? 404 : 200;

      if (response.status !== expectedStatus) {
        throw new Error(`Health check failed for ${check.name}: ${response.status}`);
      }

      console.log(`${check.name} health check passed`);
    }
  }

  private async shiftTraffic(
    config: DeploymentConfig,
    trafficPercentages: number[]
  ): Promise<void> {
    for (const percentage of trafficPercentages) {
      console.log(`Shifting ${percentage}% traffic to ${config.stage}...`);

      // Update Route53 weighted routing
      await this.updateRoute53WeightedRecord(config, percentage);

      // Wait for DNS propagation and monitoring
      await this.sleep(120000); // 2 minutes

      // Check error rates during traffic shift
      await this.monitorErrorRates(config);

      console.log(`${percentage}% traffic shifted successfully`);
    }
  }

  private async updateRoute53WeightedRecord(
    config: DeploymentConfig,
    weight: number
  ): Promise<void> {
    const oppositeWeight = 100 - weight;
    const oppositeStage = config.stage === 'blue' ? 'green' : 'blue';

    // Update current stage weight
    await this.route53.send(new aws.ChangeResourceRecordSetsCommand({
      HostedZoneId: config.hostedZoneId,
      ChangeBatch: {
        Changes: [{
          Action: 'UPSERT',
          ResourceRecordSet: {
            Name: config.domainName,
            Type: 'CNAME',
            SetIdentifier: config.stage,
            Weight: weight,
            TTL: 60, // Short TTL for quick changes
            ResourceRecords: [{
              Value: `${config.stage}-api.example.com`
            }],
          },
        }],
      },
    }));

    // Update opposite stage weight
    await this.route53.send(new aws.ChangeResourceRecordSetsCommand({
      HostedZoneId: config.hostedZoneId,
      ChangeBatch: {
        Changes: [{
          Action: 'UPSERT',
          ResourceRecordSet: {
            Name: config.domainName,
            Type: 'CNAME',
            SetIdentifier: oppositeStage,
            Weight: oppositeWeight,
            TTL: 60,
            ResourceRecords: [{
              Value: `${oppositeStage}-api.example.com`
            }],
          },
        }],
      },
    }));
  }

  private async monitorErrorRates(config: DeploymentConfig): Promise<void> {
    // This would integrate with CloudWatch to check error rates
    // and automatically roll back if error rates exceed threshold

    console.log('Monitoring error rates...');

    // Example: Check CloudWatch metrics
    // If error rate > 1%, rollback
    // If response time P95 > 500ms, rollback

    await this.sleep(30000); // Monitor for 30 seconds
  }

  async rollback(config: DeploymentConfig): Promise<void> {
    console.log(`🔄 Rolling back ${config.stage} deployment...`);

    // Shift all traffic back to stable version
    const stableStage = config.stage === 'blue' ? 'green' : 'blue';
    await this.updateRoute53WeightedRecord({
      ...config,
      stage: stableStage,
    }, 100);

    console.log('Rollback completed');
  }

  private async sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
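One invariant worth pinning down before trusting a deployment tool like this: the two weighted records must always sum to 100, or Route53 serves a skewed split between the stages. A minimal sketch of that weight math, extracted as a pure function (the function name is hypothetical, not part of the class above):

```typescript
// Sketch of the weight invariant behind shiftTraffic: at every step the
// new stage receives `percentage` and the stable stage the remainder,
// so the two Route53 weighted records always sum to 100.
type Stage = 'blue' | 'green';
type Weights = Record<Stage, number>;

function weightsForStep(newStage: Stage, percentage: number): Weights {
  if (percentage < 0 || percentage > 100) {
    throw new Error(`Invalid traffic percentage: ${percentage}`);
  }
  const stable: Stage = newStage === 'blue' ? 'green' : 'blue';
  return {
    [newStage]: percentage,
    [stable]: 100 - percentage,
  } as Weights;
}

// The same ramp deployNewVersion uses:
for (const p of [10, 25, 50, 100]) {
  console.log(weightsForStep('green', p));
}
```

Keeping this logic pure makes it trivially unit-testable, which matters for code that only otherwise runs during a live deployment.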

Production Optimization Considerations

Running production infrastructure reveals important patterns about scaling and cost management:

1. Conservative provisioning with aggressive monitoring: Start with minimal capacity and rely on auto-scaling. Over-provisioning increases costs without improving reliability for most workloads.

2. Cold start impact on user experience: Even 2-3 seconds of cold start latency significantly degrades redirect performance. Provisioned concurrency for critical paths often justifies the additional cost.

3. DynamoDB auto-scaling timing: Auto-scaling takes 5-10 minutes to increase capacity but scales down quickly. Setting target utilization at 70% instead of 90% provides buffer for traffic spikes.

4. Business metrics over technical metrics: Tracking "redirects per campaign" and "conversion-generating links" provides more actionable insights than raw "Lambda invocations." Business context helps prioritize optimization efforts.

5. Staging load testing effectiveness: Comprehensive load testing catches most production issues, but real user patterns often differ from synthetic tests. Focus on simulating actual traffic patterns rather than theoretical peak loads.
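Points 2 and 3 translate directly into CDK. The sketch below is illustrative rather than our exact stack: the construct names, capacity numbers, and inline handler are assumptions, and provisioned concurrency is attached to an alias, which is where Lambda expects it.

```typescript
// Sketch: DynamoDB auto-scaling at 70% target utilization, plus
// provisioned concurrency on the redirect path. All names and numbers
// here are illustrative assumptions.
import * as cdk from 'aws-cdk-lib';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

export class CapacityTuningStack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    const table = new dynamodb.Table(this, 'LinksTable', {
      partitionKey: { name: 'shortCode', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PROVISIONED,
      readCapacity: 25,
      writeCapacity: 5,
    });

    // Point 3: scale at 70% utilization so the 30% headroom absorbs
    // spikes during the 5-10 minutes auto-scaling needs to add capacity.
    table.autoScaleReadCapacity({ minCapacity: 25, maxCapacity: 1000 })
      .scaleOnUtilization({ targetUtilizationPercent: 70 });

    const redirectFn = new lambda.Function(this, 'RedirectFn', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromInline(
        'exports.handler = async () => ({ statusCode: 302 });'
      ),
    });

    // Point 2: provisioned concurrency eliminates cold starts on the
    // hot redirect path for a predictable hourly cost.
    new lambda.Alias(this, 'RedirectLive', {
      aliasName: 'live',
      version: redirectFn.currentVersion,
      provisionedConcurrentExecutions: 25,
    });
  }
}
```

Whether 25 provisioned executions is right depends entirely on your steady-state concurrency; check the `ConcurrentExecutions` metric before committing to a number.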

Production Metrics That Matter

Here's the dashboard that provides useful insight at a glance each day:

typescript
// lib/production-dashboard.ts
import * as cdk from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import { Construct } from 'constructs';

export class ProductionDashboard extends Construct {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    new cloudwatch.Dashboard(this, 'LinkShortenerProduction', {
      dashboardName: 'LinkShortener-Production-Health',
      widgets: [
        // Row 1: Business metrics
        [
          new cloudwatch.SingleValueWidget({
            title: 'Redirects (24h)',
            metrics: [
              new cloudwatch.Metric({
                namespace: 'LinkShortener/Production',
                metricName: 'RedirectCount',
                statistic: 'Sum',
                period: cdk.Duration.hours(24),
              }),
            ],
            width: 6,
          }),

          new cloudwatch.SingleValueWidget({
            title: 'Success Rate (24h)',
            metrics: [
              new cloudwatch.MathExpression({
                expression: '(successful / total) * 100',
                usingMetrics: {
                  successful: new cloudwatch.Metric({
                    namespace: 'LinkShortener/Production',
                    metricName: 'RedirectCount',
                    dimensionsMap: { Status: 'Success' },
                    statistic: 'Sum',
                  }),
                  total: new cloudwatch.Metric({
                    namespace: 'LinkShortener/Production',
                    metricName: 'RedirectCount',
                    statistic: 'Sum',
                  }),
                },
              }),
            ],
            width: 6,
          }),
        ],

        // Row 2: Performance metrics
        [
          new cloudwatch.GraphWidget({
            title: 'Response Time Percentiles',
            left: [
              new cloudwatch.Metric({
                namespace: 'LinkShortener/Production',
                metricName: 'RedirectResponseTime',
                statistic: 'p50',
                period: cdk.Duration.minutes(5),
                label: 'P50',
              }),
              new cloudwatch.Metric({
                namespace: 'LinkShortener/Production',
                metricName: 'RedirectResponseTime',
                statistic: 'p95',
                period: cdk.Duration.minutes(5),
                label: 'P95',
              }),
              new cloudwatch.Metric({
                namespace: 'LinkShortener/Production',
                metricName: 'RedirectResponseTime',
                statistic: 'p99',
                period: cdk.Duration.minutes(5),
                label: 'P99',
              }),
            ],
            width: 12,
          }),
        ],

        // Row 3: Infrastructure health
        [
          new cloudwatch.GraphWidget({
            title: 'DynamoDB Throttling',
            left: [
              new cloudwatch.Metric({
                namespace: 'AWS/DynamoDB',
                metricName: 'ReadThrottledRequests',
                dimensionsMap: { TableName: 'LinksTable' },
                statistic: 'Sum',
              }),
              new cloudwatch.Metric({
                namespace: 'AWS/DynamoDB',
                metricName: 'WriteThrottledRequests',
                dimensionsMap: { TableName: 'LinksTable' },
                statistic: 'Sum',
              }),
            ],
            width: 6,
          }),

          new cloudwatch.GraphWidget({
            title: 'Lambda Cold Starts',
            left: [
              new cloudwatch.Metric({
                namespace: 'LinkShortener/Production',
                metricName: 'RedirectCount',
                dimensionsMap: { ColdStart: 'true' },
                statistic: 'Sum',
                period: cdk.Duration.minutes(5),
              }),
            ],
            width: 6,
          }),
        ],
      ],
    });
  }
}
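The widgets above assume custom metrics exist in the LinkShortener/Production namespace. One inexpensive way to publish them from the redirect handler is CloudWatch Embedded Metric Format (EMF): a structured JSON log line that CloudWatch Logs converts into metrics asynchronously, keeping PutMetricData calls off the hot path. A sketch under that assumption; the dimension names match the dashboard, but this isn't necessarily how the handler from earlier parts emits them:

```typescript
// Emit a RedirectCount metric via CloudWatch Embedded Metric Format:
// stdout from a Lambda goes to CloudWatch Logs, which parses the _aws
// envelope into metrics without any API call on the request path.
function emitRedirectMetric(status: 'Success' | 'NotFound', coldStart: boolean): string {
  const emf = {
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [{
        Namespace: 'LinkShortener/Production',
        Dimensions: [['Status'], ['ColdStart']],
        Metrics: [{ Name: 'RedirectCount', Unit: 'Count' }],
      }],
    },
    // Dimension values live at the root of the document
    Status: status,
    ColdStart: String(coldStart),
    RedirectCount: 1,
  };
  const line = JSON.stringify(emf);
  console.log(line); // In Lambda, this line lands in CloudWatch Logs
  return line;
}

emitRedirectMetric('Success', false);
```

Because EMF is just a log line, it also means a logging misconfiguration silently drops your metrics, so alert on metric absence as well as metric values.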

What's Next?

In Part 5, we'll tackle the final frontier: scaling to handle millions of redirects per day, cost optimization at scale, and the operational practices that let a small team manage a high-traffic service.

We'll cover advanced topics like multi-region deployments, database sharding strategies, and monitoring that alerts you before users notice problems.

The infrastructure we've built scales well, but there are specific patterns that help services handle increasing load efficiently.


Got stories from your own production optimizations? Performance tuning can be a deep rabbit hole, and it's always interesting to hear what worked (and what backfired) in different environments.

AWS CDK Link Shortener: From Zero to Production

A comprehensive 5-part series on building a production-grade link shortener service with AWS CDK, Node.js Lambda, and DynamoDB. Real war stories, performance optimization, and cost management included.

