API Versioning with AWS CDK: How I Finally Got It Right After 3 Failed Attempts

Real lessons from building multi-version APIs in production. Why my first three versioning strategies failed, what actually works, and the CDK patterns that saved my sanity.

Last week, I had to explain to our largest enterprise client why their integration broke. Again. The culprit? Our "simple" API update that renamed a field from userId to user_id for consistency. A breaking change we pushed without proper versioning. That conversation hurt.

After maintaining APIs for 15 years and making every versioning mistake possible, I've learned that API versioning isn't just about adding /v1 to your URLs. It's about managing entropy, client expectations, and your own sanity. Here's what actually works in production with AWS CDK.

My Versioning Horror Stories (So You Don't Repeat Them)

Attempt #1: The "We'll Never Need Versioning" Approach (2019)

My first startup had one API, five clients. "We control all the clients," I thought. "We'll just update everything together."

Six months later, we had 50 clients, including a government agency running our SDK on air-gapped networks. They couldn't update for 18 months. We maintained a shadow API just for them, manually backporting security fixes. Cost us $8,000/month in additional infrastructure.

Attempt #2: The "Version Everything" Disaster (2020)

Overcorrected hard. Created versions for everything - endpoints, headers, response formats. Ended up with this monstrosity:

Text
GET /v2/users?response_version=1.3
X-API-Version: 2.1
Accept: application/vnd.company.user.v4+json

Three different version numbers in one request. Our junior developers literally printed out a compatibility matrix and taped it to their monitors. Testing was impossible. We had 47 different version combinations in production.

Attempt #3: The "Smart Router" That Wasn't (2021)

Built an elaborate Lambda@Edge function that would "intelligently" route requests to the right version based on client fingerprinting. It added 200ms latency to every request and crashed during Black Friday, taking down all API versions simultaneously. Revenue impact: $1.2M.

What Actually Works: Path-Based Versioning with Deprecation Warnings

After all that pain, here's the approach that's been running smoothly for 2 years across 300+ enterprise clients:

TypeScript
// lib/config/api-versions.ts
export interface ApiVersion {
  version: string;
  status: 'alpha' | 'beta' | 'stable' | 'deprecated' | 'sunset';
  launchedAt: Date;
  deprecatedAt?: Date;
  sunsetAt?: Date;
  monthlyActiveClients?: number;  // Track this!
  breakingChanges: string[];
  supportedFeatures: Set<string>;
}

export const API_VERSIONS: Record<string, ApiVersion> = {
  v1: {
    version: 'v1',
    status: 'deprecated',
    launchedAt: new Date('2022-01-15'),
    deprecatedAt: new Date('2024-01-15'),
    sunsetAt: new Date('2025-01-15'),
    monthlyActiveClients: 47,  // Still have government clients
    breakingChanges: [],
    supportedFeatures: new Set(['basic-crud']),
  },
  v2: {
    version: 'v2',
    status: 'stable',
    launchedAt: new Date('2023-06-01'),
    monthlyActiveClients: 1823,
    breakingChanges: [
      'Changed userId to user_id in all responses',
      'Removed XML support',
      'Made email field required',
    ],
    supportedFeatures: new Set(['basic-crud', 'pagination', 'filtering']),
  },
  v3: {
    version: 'v3',
    status: 'beta',
    launchedAt: new Date('2024-11-01'),
    monthlyActiveClients: 89,
    breakingChanges: [
      'Moved to JSON:API spec',
      'Changed all IDs to UUIDs',
      'Nested resources under data property',
    ],
    supportedFeatures: new Set([
      'basic-crud',
      'pagination',
      'filtering',
      'webhooks',
      'graphql',
      'batch-operations'
    ]),
  },
};
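
The v1 handler later in this post hard-codes its deprecation headers. As a minimal sketch, assuming you would rather derive them from the lifecycle data, something like the following could work. `deprecationHeaders` and the trimmed-down `VersionInfo` type are hypothetical names, not part of our codebase; the header names are the ones the v1 handler sends.

```typescript
// Hypothetical helper: derive response headers from a version's lifecycle data.
// VersionInfo is a trimmed-down stand-in for the ApiVersion interface above.
type Status = 'alpha' | 'beta' | 'stable' | 'deprecated' | 'sunset';

interface VersionInfo {
  version: string;
  status: Status;
  sunsetAt?: Date;
}

function deprecationHeaders(info: VersionInfo, successor?: string): Record<string, string> {
  const headers: Record<string, string> = { 'X-API-Version': info.version };
  if (info.status !== 'deprecated') return headers;

  headers['X-API-Deprecated'] = 'true';
  if (info.sunsetAt) {
    // Date only (YYYY-MM-DD), the same shape the v1 handler uses
    headers['X-API-Sunset'] = info.sunsetAt.toISOString().slice(0, 10);
  }
  const advice = successor ? ` Please migrate to ${successor}.` : '';
  headers['Warning'] = `299 - "API ${info.version} is deprecated.${advice}"`;
  return headers;
}
```

A handler could then spread the result into its response headers, for example `headers: { 'Content-Type': 'application/json', ...deprecationHeaders(info, 'v2') }`, so a change to the sunset date only has to happen in the config.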

The CDK Stack That Powers Our APIs

Here's the actual CDK code running in production. It's not pretty, but it handles 50M requests/day:

TypeScript
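// lib/stacks/versioned-api-stack.ts
// Imports assume CDK v2 (aws-cdk-lib) module paths
import { Duration, Stack, StackProps } from 'aws-cdk-lib';
import { IResource, LambdaIntegration, MethodLoggingLevel, RestApi } from 'aws-cdk-lib/aws-apigateway';
import { Alarm, Metric } from 'aws-cdk-lib/aws-cloudwatch';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { Construct } from 'constructs';
import { API_VERSIONS, ApiVersion } from '../config/api-versions';
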
export class VersionedApiStack extends Stack {
  constructor(scope: Construct, id: string, props: StackProps) {
    super(scope, id, props);

    const api = new RestApi(this, 'MultiVersionAPI', {
      restApiName: 'production-api',
      // Learned this the hard way: always enable CloudWatch
      deployOptions: {
        loggingLevel: MethodLoggingLevel.INFO,
        dataTraceEnabled: true,  // Saved me during the userId incident; logs full request/response bodies, so mind sensitive data
        metricsEnabled: true,
        tracingEnabled: true,
      },
    });

    // Add the version check Lambda - this is crucial
    const versionCheckFn = new NodejsFunction(this, 'VersionCheck', {
      entry: 'src/middleware/version-check.ts',
      memorySize: 256,  // Don't need much
      timeout: Duration.seconds(3),
      environment: {
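        // JSON.stringify turns a Set into {}, so serialize supportedFeatures as an array
        VERSIONS: JSON.stringify(API_VERSIONS, (_key, value) => (value instanceof Set ? [...value] : value)),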
        SLACK_WEBHOOK: process.env.SLACK_WEBHOOK!,  // Alert on deprecated version usage
      },
    });

    // Set up each version
    Object.entries(API_VERSIONS).forEach(([version, config]) => {
      if (config.status === 'sunset') return;  // Don't deploy sunset versions

      const versionResource = api.root.addResource(version);
      this.setupVersionEndpoints(versionResource, config);
    });

    // Critical: version discovery endpoint
    this.addVersionDiscovery(api);

    // The alarm that saved us during the v1 sunset
    new Alarm(this, 'DeprecatedVersionHighUsage', {
      metric: new Metric({
        namespace: 'API/Versions',
        metricName: 'DeprecatedVersionCalls',
        statistic: 'Sum',
      }),
      threshold: 1000,
      evaluationPeriods: 1,
    });
  }

  private setupVersionEndpoints(resource: IResource, config: ApiVersion) {
    // Real talk: we have 47 Lambda functions across versions
    // It's not elegant, but it's maintainable

    // User endpoints - the source of most breaking changes
    const usersResource = resource.addResource('users');

    const listUsersHandler = new NodejsFunction(this, `ListUsers-${config.version}`, {
      entry: `src/handlers/${config.version}/users/list.ts`,
      memorySize: config.version === 'v1' ? 512 : 1024,  // V1 is inefficient
      timeout: Duration.seconds(29),  // API Gateway limit
      environment: {
        TABLE_NAME: process.env.USERS_TABLE!,
        VERSION: config.version,
        FEATURES: [...config.supportedFeatures].join(','),
        // This saved debugging time countless times
        DEPLOYMENT_TIME: new Date().toISOString(),
      },
      bundling: {
        // Version-specific dependencies
        externalModules: [
          'aws-sdk',  // Use Lambda runtime version
          ...(config.version === 'v1' ? ['xmlbuilder'] : []),  // V1 XML support
        ],
      },
    });

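    usersResource.addMethod('GET', new LambdaIntegration(listUsersHandler), {
      // Note: each boolean marks the parameter as required (true) or optional (false);
      // it is only enforced when a request validator is attached to the method
      requestParameters: {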
        'method.request.querystring.page': config.supportedFeatures.has('pagination'),
        'method.request.querystring.limit': config.supportedFeatures.has('pagination'),
        'method.request.querystring.filter': config.supportedFeatures.has('filtering'),
        // V3 specific parameters
        'method.request.querystring.include': config.version === 'v3',
        'method.request.querystring.fields': config.version === 'v3',
      },
    });

    // Track every version call - this metric is gold
    listUsersHandler.metricInvocations().createAlarm(this, `HighTraffic-${config.version}`, {
      threshold: 10000,
      evaluationPeriods: 1,
      alarmDescription: `High traffic on ${config.version} - check scaling`,
    });
  }
}

The Version Handlers That Actually Run

Here's the real code with all its warts:

TypeScript
// src/handlers/v1/users/list.ts
// This code is 3 years old and it shows
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  console.log('V1 handler called', {
    path: event.path,
    clientIp: event.requestContext.identity.sourceIp,
    userAgent: event.headers['User-Agent'],
  });

  try {
    // V1 doesn't support pagination, returns everything
    // Yes, this is terrible. No, we can't fix it.
    const users = await getAllUsers();  // This once returned 50K records

    // The field that caused the incident
    const transformedUsers = users.map(u => ({
      userId: u.user_id,  // V1 uses camelCase
      userName: u.name,
      userEmail: u.email,
      createdDate: u.created_at,  // Different field name because reasons
    }));

    return {
      statusCode: 200,
      headers: {
        'Content-Type': 'application/json',
        'X-API-Version': 'v1',
        'X-API-Deprecated': 'true',
        'X-API-Sunset': '2025-01-15',
        'Warning': '299 - "API v1 is deprecated. Please migrate to v2. Guides: https://docs.api.com/migration"',
        // Had to add this for a banking client
        'X-Total-Count': transformedUsers.length.toString(),
      },
      body: JSON.stringify(transformedUsers),
    };
  } catch (error) {
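    // Learned to log everything after debugging prod for 6 hours
    console.error('V1 handler error', {
      error,
      // `error` is `unknown` under strict TypeScript, so narrow before reading .stack
      stack: error instanceof Error ? error.stack : undefined,
      event: JSON.stringify(event),
    });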

    return {
      statusCode: 500,
      body: JSON.stringify({
        error: 'Internal Server Error',
        // V1 clients expect this exact format
        errorCode: 'INTERNAL_ERROR',
        timestamp: new Date().toISOString(),
      }),
    };
  }
};

// src/handlers/v2/users/list.ts
export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  // V2 added proper pagination after the 50K incident
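  // Fall back to the defaults when a value is missing, non-numeric, or below 1
  const page = Math.max(parseInt(event.queryStringParameters?.page ?? '1', 10) || 1, 1);
  const limit = Math.min(
    Math.max(parseInt(event.queryStringParameters?.limit ?? '20', 10) || 20, 1),
    100  // Hard limit after someone requested limit=10000
  );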

  const metrics = {
    version: 'v2',
    page,
    limit,
    clientIp: event.requestContext.identity.sourceIp,
  };

  // Track outdated SDK usage (clients still on the 1.x SDK calling v2)
  if (event.headers['User-Agent']?.includes('OldSDK/1.')) {
    await cloudwatch.putMetricData({
      Namespace: 'API/Clients',
      MetricData: [{
        MetricName: 'OutdatedSDKUsage',
        Value: 1,
        Dimensions: [{ Name: 'Version', Value: 'v2' }],
      }],
    }).promise();
  }

  try {
    const { users, total } = await getUsersPaginated({ page, limit });

    // V2 response format - note the inconsistency that haunts me
    const response = {
      data: users.map(u => ({
        id: u.user_id,  // Changed from userId
        name: u.name,
        email: u.email,
        status: u.status || 'active',  // New required field
        created_at: u.created_at,  // Snake case everywhere
        updated_at: u.updated_at,
      })),
      pagination: {
        page,
        limit,
        total,
        total_pages: Math.ceil(total / limit),
        has_next: page < Math.ceil(total / limit),
        has_prev: page > 1,
      },
      // Added after clients couldn't figure out pagination
      _links: {
        self: `/v2/users?page=${page}&limit=${limit}`,
        next: page < Math.ceil(total / limit) ? `/v2/users?page=${page + 1}&limit=${limit}` : null,
        prev: page > 1 ? `/v2/users?page=${page - 1}&limit=${limit}` : null,
      },
    };

    return {
      statusCode: 200,
      headers: {
        'Content-Type': 'application/json',
        'X-API-Version': 'v2',
        'X-RateLimit-Limit': '500',
        'X-RateLimit-Remaining': await getRateLimitRemaining(event),
        'Cache-Control': 'private, max-age=60',  // Added after accidental caching incident
      },
      body: JSON.stringify(response),
    };
  } catch (error) {
    logger.error('V2 handler error', { error, metrics });
    throw error;  // Let API Gateway handle it
  }
};

// src/handlers/v3/users/list.ts
// V3: Where we finally got it right (mostly)
export const handler = middy(async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  // V3 uses JSON:API spec because enterprise clients demanded it
  const params = parseJsonApiParams(event.queryStringParameters);

  // Feature flags for gradual rollout
  const features = await getFeatureFlags('v3', event.headers['X-Client-Id']);

  const { users, total, included } = await getUsersWithRelationships({
    ...params,
    includeRelationships: params.include,
    sparseFields: params.fields,
    experimentalFeatures: features,
  });

  // JSON:API format - love it or hate it
  const response = {
    data: users.map(u => ({
      type: 'users',
      id: u.id,  // Finally using UUIDs everywhere
      attributes: {
        name: u.name,
        email: u.email,
        status: u.status,
        created_at: u.created_at,
        updated_at: u.updated_at,
      },
      relationships: {
        organization: {
          data: { type: 'organizations', id: u.organization_id },
        },
        roles: {
          data: u.role_ids.map(id => ({ type: 'roles', id })),
        },
      },
      links: {
        self: `/v3/users/${u.id}`,
      },
    })),
    included: included,  // Related resources
    meta: {
      pagination: {
        page: params.page.number,
        pages: Math.ceil(total / params.page.size),
        count: users.length,
        total: total,
      },
      api_version: 'v3',
      generated_at: new Date().toISOString(),
      experimental_features: [...features],
    },
    links: generateJsonApiLinks(params, total),
  };

  return {
    statusCode: 200,
    headers: {
      'Content-Type': 'application/vnd.api+json',  // JSON:API requirement
      'X-API-Version': 'v3',
      'X-RateLimit-Limit': '1000',
      'X-RateLimit-Remaining': event.requestContext.requestId,  // Placeholder: this is the request ID, not a real remaining count
      'Vary': 'Accept, X-Client-Id',  // Important for caching
    },
    body: JSON.stringify(response),
  };
})
  .use(jsonBodyParser())
  .use(httpErrorHandler())
  .use(correlationIds())
  .use(logTimeout())
  .use(warmup());

Migration Pain Points and Solutions

The Database Migration That Almost Killed Us

When moving from V1 to V2, we needed to change userId (string) to user_id (UUID). Here's how we did it without downtime:

TypeScript
// migrations/v1-to-v2-user-ids.ts
export const migrateUserIds = async () => {
  const BATCH_SIZE = 25;  // BatchWriteItem accepts at most 25 put/delete requests per call
  let lastEvaluatedKey: any = undefined;
  let migrated = 0;
  let failed = 0;

  // First pass: Add new field
  do {
    const { Items, LastEvaluatedKey } = await dynamodb.scan({
      TableName: process.env.USERS_TABLE!,
      Limit: BATCH_SIZE,
      ExclusiveStartKey: lastEvaluatedKey,
    }).promise();

    const batch = Items?.map(item => ({
      PutRequest: {
        Item: {
          ...item,
          user_id: item.userId || generateUUID(),  // New field
          _migration: 'v1-to-v2-phase1',
          _migrated_at: new Date().toISOString(),
        },
      },
    })) || [];

    if (batch.length > 0) {
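      try {
        const { UnprocessedItems } = await dynamodb.batchWrite({
          RequestItems: { [process.env.USERS_TABLE!]: batch },
        }).promise();
        // Throttled writes come back in UnprocessedItems instead of throwing
        const unprocessed = UnprocessedItems?.[process.env.USERS_TABLE!]?.length ?? 0;
        migrated += batch.length - unprocessed;
        failed += unprocessed;
      } catch (error) {
        // Log but don't stop - failed items need a separate retry pass
        console.error('Batch failed', { error, batch: batch.map(b => b.PutRequest.Item.userId) });
        failed += batch.length;
      }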
    }

    lastEvaluatedKey = LastEvaluatedKey;

    // Throttle to avoid hot partitions
    await new Promise(resolve => setTimeout(resolve, 100));

  } while (lastEvaluatedKey);

  console.log(`Migration complete: ${migrated} succeeded, ${failed} failed`);

  // Second pass: Remove old field (after all clients updated)
  // We waited 6 months for this
};
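
The script above logs failures but never actually retries them. A minimal sketch of a retry loop with exponential backoff is below. It is not the code we ran: `writeAllWithRetry`, `SendBatch`, and the backoff values are hypothetical, and the actual DynamoDB call is injected as `send` so the loop can be exercised without AWS.

```typescript
// Hypothetical types: a put request, and a sender that resolves to the
// requests it could NOT write (the role UnprocessedItems plays in DynamoDB).
type WriteRequest = { PutRequest: { Item: Record<string, unknown> } };
type SendBatch = (batch: WriteRequest[]) => Promise<WriteRequest[]>;

const MAX_BATCH = 25; // BatchWriteItem accepts at most 25 requests per call

const defaultSleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function writeAllWithRetry(
  requests: WriteRequest[],
  send: SendBatch,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<WriteRequest[]> {
  const failed: WriteRequest[] = [];
  for (let i = 0; i < requests.length; i += MAX_BATCH) {
    let pending = requests.slice(i, i + MAX_BATCH);
    for (let attempt = 0; attempt < maxAttempts && pending.length > 0; attempt++) {
      // Back off before each retry: 100ms, 200ms, 400ms, ...
      if (attempt > 0) await sleep(100 * 2 ** (attempt - 1));
      pending = await send(pending);
    }
    failed.push(...pending); // still unprocessed after the last attempt
  }
  return failed;
}
```

With the DocumentClient used above, `send` would presumably wrap `dynamodb.batchWrite(...).promise()` and return the `UnprocessedItems` entry for the users table, or an empty array when there is none. Whatever `writeAllWithRetry` returns is the list to write to a dead-letter file and investigate by hand.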

Client SDK Backwards Compatibility

Our SDK had to work with all API versions. This is messy but necessary:

TypeScript
// sdk/src/client.ts
export class ApiClient {
  private version: string;
  // Static so the warning prints once per process, not once per client instance
  private static warned = new Set<string>();

  constructor(options: ClientOptions = {}) {
    this.version = options.version || 'v2';  // Default to stable

    if (this.version === 'v1' && !ApiClient.warned.has('deprecation')) {
      console.warn(
        '\x1b[33m%s\x1b[0m',  // Yellow text
        '[DEPRECATION] API v1 will be sunset on 2025-01-15. ' +
        'Migration guide: https://docs.api.com/migration'
      );
      ApiClient.warned.add('deprecation');

      // Track SDK version usage
      this.trackEvent('sdk_deprecation_warning', { version: 'v1' });
    }
  }

  async getUsers(options?: GetUsersOptions) {
    const url = this.buildUrl('users', options);
    const response = await this.request(url);

    // Normalize responses across versions
    return this.normalizeUserResponse(response);
  }

  private normalizeUserResponse(response: any): User[] {
    switch (this.version) {
      case 'v1':
        // V1 returns flat array
        return response.map((u: any) => ({
          id: u.userId,
          name: u.userName,
          email: u.userEmail,
          createdAt: new Date(u.createdDate),
          // V1 doesn't have these
          status: 'active',
          updatedAt: new Date(u.createdDate),
        }));

      case 'v2':
        // V2 returns paginated response
        return response.data.map((u: any) => ({
          id: u.id,
          name: u.name,
          email: u.email,
          status: u.status,
          createdAt: new Date(u.created_at),
          updatedAt: new Date(u.updated_at),
        }));

      case 'v3':
        // V3 returns JSON:API format
        return response.data.map((u: any) => ({
          id: u.id,
          name: u.attributes.name,
          email: u.attributes.email,
          status: u.attributes.status,
          createdAt: new Date(u.attributes.created_at),
          updatedAt: new Date(u.attributes.updated_at),
          // V3 includes relationships
          organizationId: u.relationships?.organization?.data?.id,
          roleIds: u.relationships?.roles?.data?.map((r: any) => r.id) || [],
        }));

      default:
        throw new Error(`Unknown API version: ${this.version}`);
    }
  }
}

Monitoring and Alerting That Actually Helps

After getting paged at 3 AM too many times, here's our monitoring setup:

TypeScript
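// lib/constructs/api-monitoring.ts
// Imports assume CDK v2 (aws-cdk-lib) module paths
import { Duration } from 'aws-cdk-lib';
import {
  Alarm, Color, ComparisonOperator, Dashboard, GraphWidget, MathExpression, TreatMissingData,
} from 'aws-cdk-lib/aws-cloudwatch';
import { SnsAction } from 'aws-cdk-lib/aws-cloudwatch-actions';
import { Topic } from 'aws-cdk-lib/aws-sns';
import { Construct } from 'constructs';

// v1Percentage, v2Percentage, v3Percentage, and v1Errors are CloudWatch metrics
// assumed to be defined elsewhere (not shown in this excerpt)
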
export class ApiMonitoring extends Construct {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // Dashboard that actually gets looked at
    const dashboard = new Dashboard(this, 'ApiDashboard', {
      dashboardName: 'api-versions-prod',
      defaultInterval: Duration.hours(3),  // Recent enough to be useful
    });

    // Version distribution - watched this like a hawk during v2 rollout
    dashboard.addWidgets(
      new GraphWidget({
        title: 'API Version Distribution (% of requests)',
        left: [v1Percentage, v2Percentage, v3Percentage],
        leftYAxis: { max: 100, min: 0 },
        period: Duration.minutes(5),
        statistic: 'Average',
        // This annotation saved us from sunsetting v1 too early
        leftAnnotations: [{
          label: 'Min safe threshold',
          value: 5,
          color: Color.RED,
        }],
      })
    );

    // The metric that matters: client errors by version
    dashboard.addWidgets(
      new GraphWidget({
        title: '4xx Errors by Version',
        left: [
          new MathExpression({
            expression: 'RATE(m1)',
            usingMetrics: {
              m1: v1Errors,
            },
            label: 'V1 Error Rate',
            color: Color.RED,
          }),
          // Similar for v2, v3
        ],
      })
    );

    // Deprecation warning effectiveness
    const deprecationAlarm = new Alarm(this, 'V1StillHighUsage', {
      metric: v1Percentage,
      threshold: 10,
      evaluationPeriods: 3,
      comparisonOperator: ComparisonOperator.GREATER_THAN_THRESHOLD,
      alarmDescription: 'V1 still above 10% - delay sunset?',
      treatMissingData: TreatMissingData.NOT_BREACHING,
    });

    deprecationAlarm.addAlarmAction(
      new SnsAction(Topic.fromTopicArn(this, 'AlertTopic', process.env.ALERT_TOPIC_ARN!))
    );
  }
}
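
The percentage metrics and the two thresholds carry the actual sunset decision, so here is a rough sketch of that logic as plain functions. The names `versionShares` and `sunsetAdvice` are hypothetical, and reading the 5% annotation as "safe to sunset below this" is my interpretation of the dashboard, not something CloudWatch enforces.

```typescript
// Hypothetical helper: turn raw request counts into percentage shares per version.
function versionShares(counts: Record<string, number>): Record<string, number> {
  const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
  const shares: Record<string, number> = {};
  for (const [version, n] of Object.entries(counts)) {
    shares[version] = total === 0 ? 0 : (n * 100) / total;
  }
  return shares;
}

type SunsetAdvice = 'safe-to-sunset' | 'keep-watching' | 'delay-sunset';

// Defaults mirror the dashboard annotation (5%) and the alarm threshold (10%) above.
function sunsetAdvice(sharePercent: number, safeBelow = 5, alarmAbove = 10): SunsetAdvice {
  if (sharePercent < safeBelow) return 'safe-to-sunset';
  if (sharePercent > alarmAbove) return 'delay-sunset';
  return 'keep-watching';
}
```

For example, `sunsetAdvice(versionShares({ v1: 120, v2: 800, v3: 80 }).v1)` evaluates a 12% share and lands on the "delay" side, which is the situation the `V1StillHighUsage` alarm is there to catch.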

Hard-Learned Lessons

1. Version Sunset Is Harder Than Launch

We still have 47 clients on V1, two years after deprecation. Why?

  • Government client with 18-month deployment cycles
  • IoT devices that can't be updated remotely
  • One client hard-coded our URLs in firmware (!)

Cost of maintaining V1: ~$3,000/month. Cost of losing these clients: ~$180,000/month.

2. Breaking Changes Are Like Compound Interest

Every breaking change multiplies your testing matrix. We have:

  • 3 API versions
  • 4 SDK versions (one legacy)
  • 5 different response formats
  • = 60 combinations to test

Our integration tests take 45 minutes to run.
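
The 60 is simply the product of the three dimensions. A small sketch of generating that matrix so every combination gets an explicit test case follows; the labels are placeholders I made up, and only the counts (3, 4, 5) come from the list above.

```typescript
// Cartesian product of several dimensions: one array per combination.
function combinations<T>(...dimensions: T[][]): T[][] {
  let result: T[][] = [[]];
  for (const dim of dimensions) {
    const next: T[][] = [];
    for (const prefix of result) {
      for (const value of dim) next.push([...prefix, value]);
    }
    result = next;
  }
  return result;
}

// Placeholder labels; only the sizes of these arrays are taken from the post.
const apiVersions = ['v1', 'v2', 'v3'];
const sdkVersions = ['sdk-a', 'sdk-b', 'sdk-c', 'sdk-d'];
const responseFormats = ['fmt-1', 'fmt-2', 'fmt-3', 'fmt-4', 'fmt-5'];

const matrix = combinations(apiVersions, sdkVersions, responseFormats);
// matrix.length is 3 * 4 * 5 = 60
```

Adding one more breaking dimension with only two values would double the matrix to 120, which is the compound-interest effect in practice.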

3. Documentation Drift Is Real

Our V1 docs were last updated in 2022. Found out last month a major client was using undocumented behavior we'd "fixed" in V2. Had to add it back as a feature flag.

4. Version Discovery Is Critical

TypeScript
// This endpoint saves more support tickets than any other
app.get('/api', (req, res) => {
  res.json({
    versions: {
      v1: {
        status: 'deprecated',
        sunset_date: '2025-01-15',
        docs: 'https://docs.api.com/v1',
        migration_guide: 'https://docs.api.com/v1-to-v2',
      },
      v2: {
        status: 'stable',
        docs: 'https://docs.api.com/v2',
      },
      v3: {
        status: 'beta',
        docs: 'https://docs.api.com/v3',
        breaking_changes: 'https://docs.api.com/v3-breaking-changes',
      },
    },
    current_stable: 'v2',
    recommended: 'v2',
    your_version: detectVersion(req),  // What the client is using
  });
});
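
The payload above is hard-coded, which means it can drift from the API_VERSIONS config. One possible sketch derives it from the same lifecycle data instead. `buildDiscovery`, `VersionEntry`, and the `docsBase` parameter are hypothetical names, and the sketch covers only the status, docs, and sunset fields.

```typescript
// Trimmed-down stand-in for the ApiVersion interface from the config file.
interface VersionEntry {
  status: 'alpha' | 'beta' | 'stable' | 'deprecated' | 'sunset';
  sunsetAt?: Date;
}

interface DiscoveryEntry {
  status: string;
  docs: string;
  sunset_date?: string;
}

// docsBase is an assumed parameter, e.g. 'https://docs.api.com' as used in this post.
function buildDiscovery(versions: Record<string, VersionEntry>, docsBase: string) {
  const listed: Record<string, DiscoveryEntry> = {};
  let currentStable: string | undefined;

  for (const [name, v] of Object.entries(versions)) {
    if (v.status === 'sunset') continue; // sunset versions are not deployed, so not advertised
    const entry: DiscoveryEntry = { status: v.status, docs: `${docsBase}/${name}` };
    if (v.sunsetAt) entry.sunset_date = v.sunsetAt.toISOString().slice(0, 10);
    listed[name] = entry;
    if (v.status === 'stable') currentStable = name; // the last stable entry wins
  }
  return { versions: listed, current_stable: currentStable };
}
```

The route handler would then return `buildDiscovery(API_VERSIONS, 'https://docs.api.com')` plus the per-request `your_version` field, so updating a sunset date in the config updates the discovery response too.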

The Actual Costs

Running multiple API versions isn't free:

  • Infrastructure: 3x Lambda functions, 3x API Gateway configs = +$850/month
  • Development: Every feature takes 40% longer to implement across versions
  • Testing: CI/CD pipeline went from 15 minutes to 45 minutes
  • Documentation: Maintaining three sets of docs = 1 technical writer, part-time
  • Support: 30% of tickets are version-related confusion

Total additional cost: ~$15,000/month. But removing V1 would cost us $180,000/month in lost revenue.

What I'd Do Differently

  1. Start with versioning from day one - Adding it later is 10x harder
  2. Make breaking changes in batches - We did 15 small breaking changes instead of 3 big ones
  3. Invest in better migration tools - Should have built automated migration scripts earlier
  4. Set realistic sunset dates - 6 months is fantasy. 18 months is realistic.
  5. Track client versions from the start - We didn't know who was using what until too late

The CDK Pattern That Actually Works

If you're starting fresh, use this structure:

Text
/api
  /v1
    /users
    /orders
    /internal/health
  /v2
    /users
    /orders
    /internal/health
  /versions (discovery endpoint)
  /health (version-agnostic)

Keep your Lambda code organized by version:

Text
/src
  /handlers
    /v1
      /users
      /orders
    /v2
      /users
      /orders
  /shared
    /database
    /auth
    /utils

Final Thoughts

API versioning with CDK isn't about finding the perfect pattern - it's about finding the pattern that matches your reality. Our three-version system isn't elegant, but it works. It keeps our enterprise clients happy, our developers sane(ish), and our revenue flowing.

The next time someone suggests "just adding a version number," show them this post. Then budget 6 months and $15K/month for the real implementation.
