API Versioning with AWS CDK: How I Finally Got It Right After 3 Failed Attempts
Real lessons from building multi-version APIs in production. Why my first three versioning strategies failed, what actually works, and the CDK patterns that saved my sanity.
Last week, I had to explain to our largest enterprise client why their integration broke. Again. The culprit? Our "simple" API update that renamed a field from userId to user_id for consistency. A breaking change we pushed without proper versioning. That conversation hurt.
After maintaining APIs for 15 years and making every versioning mistake possible, I've learned that API versioning isn't just about adding /v1 to your URLs. It's about managing entropy, client expectations, and your own sanity. Here's what actually works in production with AWS CDK.
My Versioning Horror Stories (So You Don't Repeat Them)#
Attempt #1: The "We'll Never Need Versioning" Approach (2019)#
My first startup had one API, five clients. "We control all the clients," I thought. "We'll just update everything together."
Six months later, we had 50 clients, including a government agency running our SDK on air-gapped networks. They couldn't update for 18 months. We maintained a shadow API just for them, manually backporting security fixes. Cost us $8,000/month in additional infrastructure.
Attempt #2: The "Version Everything" Disaster (2020)#
Overcorrected hard. Created versions for everything - endpoints, headers, response formats. Ended up with this monstrosity:
GET /v2/users?response_version=1.3
X-API-Version: 2.1
Accept: application/vnd.company.user.v4+json
Three different version numbers in one request. Our junior developers literally printed out a compatibility matrix and taped it to their monitors. Testing was impossible. We had 47 different version combinations in production.
Attempt #3: The "Smart Router" That Wasn't (2021)#
Built an elaborate Lambda@Edge function that would "intelligently" route requests to the right version based on client fingerprinting. It added 200ms latency to every request and crashed during Black Friday, taking down all API versions simultaneously. Revenue impact: $1.2M.
What Actually Works: Path-Based Versioning with Deprecation Warnings#
After all that pain, here's the approach that's been running smoothly for 2 years across 300+ enterprise clients:
// lib/config/api-versions.ts
export interface ApiVersion {
version: string;
status: 'alpha' | 'beta' | 'stable' | 'deprecated' | 'sunset';
launchedAt: Date;
deprecatedAt?: Date;
sunsetAt?: Date;
monthlyActiveClients?: number; // Track this!
breakingChanges: string[];
supportedFeatures: Set<string>;
}
export const API_VERSIONS: Record<string, ApiVersion> = {
v1: {
version: 'v1',
status: 'deprecated',
launchedAt: new Date('2022-01-15'),
deprecatedAt: new Date('2024-01-15'),
sunsetAt: new Date('2025-01-15'),
monthlyActiveClients: 47, // Still have government clients
breakingChanges: [],
supportedFeatures: new Set(['basic-crud']),
},
v2: {
version: 'v2',
status: 'stable',
launchedAt: new Date('2023-06-01'),
monthlyActiveClients: 1823,
breakingChanges: [
'Changed userId to user_id in all responses',
'Removed XML support',
'Made email field required',
],
supportedFeatures: new Set(['basic-crud', 'pagination', 'filtering']),
},
v3: {
version: 'v3',
status: 'beta',
launchedAt: new Date('2024-11-01'),
monthlyActiveClients: 89,
breakingChanges: [
'Moved to JSON:API spec',
'Changed all IDs to UUIDs',
'Nested resources under data property',
],
supportedFeatures: new Set([
'basic-crud',
'pagination',
'filtering',
'webhooks',
'graphql',
'batch-operations'
]),
},
};
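The dates in this config can also drive the status itself, so a version doesn't sit at 'deprecated' in the file after its sunset date has passed. A minimal sketch of that idea, not our production code: `effectiveStatus` and `daysUntilSunset` are hypothetical helper names, and `VersionDates` is a trimmed copy of the `ApiVersion` fields they need.

```typescript
type Status = 'alpha' | 'beta' | 'stable' | 'deprecated' | 'sunset';

// Trimmed copy of the ApiVersion fields these helpers need.
interface VersionDates {
  status: Status;
  deprecatedAt?: Date;
  sunsetAt?: Date;
}

// Hypothetical helper: derive the status from the dates so the declared
// status cannot lag behind the calendar.
function effectiveStatus(v: VersionDates, now: Date = new Date()): Status {
  if (v.sunsetAt && now.getTime() >= v.sunsetAt.getTime()) return 'sunset';
  if (v.deprecatedAt && now.getTime() >= v.deprecatedAt.getTime()) return 'deprecated';
  return v.status;
}

// Whole days left before sunset, or undefined when no sunset date is set.
function daysUntilSunset(v: VersionDates, now: Date = new Date()): number | undefined {
  if (!v.sunsetAt) return undefined;
  const msPerDay = 24 * 60 * 60 * 1000;
  return Math.max(0, Math.ceil((v.sunsetAt.getTime() - now.getTime()) / msPerDay));
}

const v1Dates: VersionDates = {
  status: 'deprecated',
  deprecatedAt: new Date('2024-01-15'),
  sunsetAt: new Date('2025-01-15'),
};
```

With something like this, the stack could skip any version whose effective status is 'sunset' instead of relying on someone remembering to edit the config.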
The CDK Stack That Powers Our APIs#
Here's the actual CDK code running in production. It's not pretty, but it handles 50M requests/day:
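// lib/stacks/versioned-api-stack.ts
import { Duration, Stack, StackProps } from 'aws-cdk-lib';
import { IResource, LambdaIntegration, MethodLoggingLevel, RestApi } from 'aws-cdk-lib/aws-apigateway';
import { Alarm, Metric } from 'aws-cdk-lib/aws-cloudwatch';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { Construct } from 'constructs';
import { API_VERSIONS, ApiVersion } from '../config/api-versions';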
export class VersionedApiStack extends Stack {
constructor(scope: Construct, id: string, props: StackProps) {
super(scope, id, props);
const api = new RestApi(this, 'MultiVersionAPI', {
restApiName: 'production-api',
// Learned this the hard way: always enable CloudWatch
deployOptions: {
loggingLevel: MethodLoggingLevel.INFO,
dataTraceEnabled: true, // Saved me during the userId incident
metricsEnabled: true,
tracingEnabled: true,
},
});
// Add the version check Lambda - this is crucial
const versionCheckFn = new NodejsFunction(this, 'VersionCheck', {
entry: 'src/middleware/version-check.ts',
memorySize: 256, // Don't need much
timeout: Duration.seconds(3),
environment: {
VERSIONS: JSON.stringify(API_VERSIONS),
SLACK_WEBHOOK: process.env.SLACK_WEBHOOK!, // Alert on deprecated version usage
},
});
// Set up each version
Object.entries(API_VERSIONS).forEach(([version, config]) => {
if (config.status === 'sunset') return; // Don't deploy sunset versions
const versionResource = api.root.addResource(version);
this.setupVersionEndpoints(versionResource, config);
});
// Critical: version discovery endpoint
this.addVersionDiscovery(api);
// The alarm that saved us during the v1 sunset
new Alarm(this, 'DeprecatedVersionHighUsage', {
metric: new Metric({
namespace: 'API/Versions',
metricName: 'DeprecatedVersionCalls',
statistic: 'Sum',
}),
threshold: 1000,
evaluationPeriods: 1,
});
}
private setupVersionEndpoints(resource: IResource, config: ApiVersion) {
// Real talk: we have 47 Lambda functions across versions
// It's not elegant, but it's maintainable
const handlers = new Map<string, Function>();
// User endpoints - the source of most breaking changes
const usersResource = resource.addResource('users');
const listUsersHandler = new NodejsFunction(this, `ListUsers-${config.version}`, {
entry: `src/handlers/${config.version}/users/list.ts`,
memorySize: config.version === 'v1' ? 512 : 1024, // V1 is inefficient
timeout: Duration.seconds(29), // API Gateway limit
environment: {
TABLE_NAME: process.env.USERS_TABLE!,
VERSION: config.version,
FEATURES: [...config.supportedFeatures].join(','),
// This saved debugging time countless times
DEPLOYMENT_TIME: new Date().toISOString(),
},
bundling: {
// Version-specific dependencies
externalModules: [
'aws-sdk', // Use Lambda runtime version
...(config.version === 'v1' ? ['xmlbuilder'] : []), // V1 XML support
],
},
});
usersResource.addMethod('GET', new LambdaIntegration(listUsersHandler), {
requestParameters: {
'method.request.querystring.page': config.supportedFeatures.has('pagination'),
'method.request.querystring.limit': config.supportedFeatures.has('pagination'),
'method.request.querystring.filter': config.supportedFeatures.has('filtering'),
// V3 specific parameters
'method.request.querystring.include': config.version === 'v3',
'method.request.querystring.fields': config.version === 'v3',
},
});
// Track every version call - this metric is gold
listUsersHandler.metricInvocations().createAlarm(this, `HighTraffic-${config.version}`, {
threshold: 10000,
evaluationPeriods: 1,
alarmDescription: `High traffic on ${config.version} - check scaling`,
});
}
}
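One thing to watch in requestParameters: in API Gateway the boolean means "required", not "supported". It only bites once a request validator is attached to the method, but at that point a `true` for page or limit would reject any call that omits them. A sketch of declaring the supported parameters as optional instead. `FEATURE_PARAMS` and `requestParametersFor` are hypothetical names, and the feature-to-parameter mapping is an assumption that mirrors the stack above.

```typescript
// Query parameters each feature unlocks (assumed mapping).
const FEATURE_PARAMS: Record<string, string[]> = {
  pagination: ['page', 'limit'],
  filtering: ['filter'],
};

// Declare only the parameters a version supports, and mark each one
// optional (false) so a request validator would not reject calls without them.
function requestParametersFor(features: Set<string>): Record<string, boolean> {
  const params: Record<string, boolean> = {};
  for (const [feature, names] of Object.entries(FEATURE_PARAMS)) {
    if (!features.has(feature)) continue;
    for (const name of names) {
      params[`method.request.querystring.${name}`] = false;
    }
  }
  return params;
}

const v2Params = requestParametersFor(new Set(['basic-crud', 'pagination', 'filtering']));
```

The result can be passed straight to `addMethod` as `requestParameters`. A version without pagination simply gets no page or limit entries.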
The Version Handlers That Actually Run#
Here's the real code with all its warts:
// src/handlers/v1/users/list.ts
// This code is 3 years old and it shows
export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
console.log('V1 handler called', {
path: event.path,
clientIp: event.requestContext.identity.sourceIp,
userAgent: event.headers['User-Agent'],
});
try {
// V1 doesn't support pagination, returns everything
// Yes, this is terrible. No, we can't fix it.
const users = await getAllUsers(); // This once returned 50K records
// The field that caused the incident
const transformedUsers = users.map(u => ({
userId: u.user_id, // V1 uses camelCase
userName: u.name,
userEmail: u.email,
createdDate: u.created_at, // Different field name because reasons
}));
return {
statusCode: 200,
headers: {
'Content-Type': 'application/json',
'X-API-Version': 'v1',
'X-API-Deprecated': 'true',
'X-API-Sunset': '2025-01-15',
'Warning': '299 - "API v1 is deprecated. Please migrate to v2. Guides: https://docs.api.com/migration"',
// Had to add this for a banking client
'X-Total-Count': transformedUsers.length.toString(),
},
body: JSON.stringify(transformedUsers),
};
} catch (error) {
// Learned to log everything after debugging prod for 6 hours
console.error('V1 handler error', {
error,
stack: error instanceof Error ? error.stack : undefined, // error is unknown in strict TypeScript
event: JSON.stringify(event),
});
return {
statusCode: 500,
body: JSON.stringify({
error: 'Internal Server Error',
// V1 clients expect this exact format
errorCode: 'INTERNAL_ERROR',
timestamp: new Date().toISOString(),
}),
};
}
};
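The deprecation headers in the V1 handler are hard-coded, which means the sunset date lives in two places. A sketch of generating them from the version config instead. `deprecationHeaders`, `successor` and `guideUrl` are hypothetical names; the header names and the date format are the ones the handler already sends.

```typescript
interface DeprecationInfo {
  version: string;
  status: 'alpha' | 'beta' | 'stable' | 'deprecated' | 'sunset';
  sunsetAt?: Date;
}

// Hypothetical helper: extra response headers for a deprecated version,
// and an empty object for every other status.
function deprecationHeaders(
  info: DeprecationInfo,
  successor: string,
  guideUrl: string,
): Record<string, string> {
  if (info.status !== 'deprecated') return {};

  const headers: Record<string, string> = {
    'X-API-Deprecated': 'true',
    Warning: `299 - "API ${info.version} is deprecated. Please migrate to ${successor}. Guides: ${guideUrl}"`,
  };
  if (info.sunsetAt) {
    // YYYY-MM-DD, matching the X-API-Sunset value in the V1 handler
    headers['X-API-Sunset'] = info.sunsetAt.toISOString().slice(0, 10);
  }
  return headers;
}

const v1Headers = deprecationHeaders(
  { version: 'v1', status: 'deprecated', sunsetAt: new Date('2025-01-15') },
  'v2',
  'https://docs.api.com/migration',
);
```

Spreading the result into the handler's `headers` object keeps stable versions untouched, since they get an empty object back.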
// src/handlers/v2/users/list.ts
export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
// V2 added proper pagination after the 50K incident
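// Guard against missing, non-numeric, or non-positive values
const page = Math.max(1, parseInt(event.queryStringParameters?.page || '1', 10) || 1);
const limit = Math.min(
Math.max(1, parseInt(event.queryStringParameters?.limit || '20', 10) || 20),
100 // Hard limit after someone requested limit=10000
);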
const metrics = {
version: 'v2',
page,
limit,
clientIp: event.requestContext.identity.sourceIp,
};
// Track outdated SDK usage (clients still on OldSDK 1.x calling v2)
if (event.headers['User-Agent']?.includes('OldSDK/1.')) {
await cloudwatch.putMetricData({
Namespace: 'API/Clients',
MetricData: [{
MetricName: 'OutdatedSDKUsage',
Value: 1,
Dimensions: [{ Name: 'Version', Value: 'v2' }],
}],
}).promise();
}
try {
const { users, total } = await getUsersPaginated({ page, limit });
// V2 response format - note the inconsistency that haunts me
const response = {
data: users.map(u => ({
id: u.user_id, // Changed from userId
name: u.name,
email: u.email,
status: u.status || 'active', // New required field
created_at: u.created_at, // Snake case everywhere
updated_at: u.updated_at,
})),
pagination: {
page,
limit,
total,
total_pages: Math.ceil(total / limit),
has_next: page < Math.ceil(total / limit),
has_prev: page > 1,
},
// Added after clients couldn't figure out pagination
_links: {
self: `/v2/users?page=${page}&limit=${limit}`,
next: page < Math.ceil(total / limit) ? `/v2/users?page=${page + 1}&limit=${limit}` : null,
prev: page > 1 ? `/v2/users?page=${page - 1}&limit=${limit}` : null,
},
};
return {
statusCode: 200,
headers: {
'Content-Type': 'application/json',
'X-API-Version': 'v2',
'X-RateLimit-Limit': '500',
'X-RateLimit-Remaining': await getRateLimitRemaining(event),
'Cache-Control': 'private, max-age=60', // Added after accidental caching incident
},
body: JSON.stringify(response),
};
} catch (error) {
logger.error('V2 handler error', { error, metrics });
throw error; // Let API Gateway handle it
}
};
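The V2 handler repeats the same pagination math three times. A sketch of pulling the parsing and the page arithmetic into pure functions that can be unit-tested without a Lambda event. `parsePaging`, `buildPaging` and `positiveInt` are hypothetical names; the defaults (page 1, limit 20, cap 100) and the response shape come from the handler above.

```typescript
type Query = Record<string, string | undefined> | null;

const MAX_LIMIT = 100; // Same hard cap as the V2 handler

// Parse a positive integer, falling back when the value is missing or invalid.
function positiveInt(raw: string | undefined, fallback: number): number {
  const n = Number.parseInt(raw ?? '', 10);
  return Number.isInteger(n) && n > 0 ? n : fallback;
}

function parsePaging(query: Query): { page: number; limit: number } {
  return {
    page: positiveInt(query?.['page'], 1),
    limit: Math.min(positiveInt(query?.['limit'], 20), MAX_LIMIT),
  };
}

// Build the pagination and _links blocks of the V2 response.
function buildPaging(page: number, limit: number, total: number) {
  const totalPages = Math.ceil(total / limit);
  const link = (p: number): string => `/v2/users?page=${p}&limit=${limit}`;
  return {
    pagination: {
      page,
      limit,
      total,
      total_pages: totalPages,
      has_next: page < totalPages,
      has_prev: page > 1,
    },
    _links: {
      self: link(page),
      next: page < totalPages ? link(page + 1) : null,
      prev: page > 1 ? link(page - 1) : null,
    },
  };
}
```

For example, `parsePaging({ page: 'abc', limit: '10000' })` falls back to page 1 and clamps the limit to 100, which is exactly the request that prompted the hard cap.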
// src/handlers/v3/users/list.ts
// V3: Where we finally got it right (mostly)
export const handler = middy(async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
// V3 uses JSON:API spec because enterprise clients demanded it
const params = parseJsonApiParams(event.queryStringParameters);
// Feature flags for gradual rollout
const features = await getFeatureFlags('v3', event.headers['X-Client-Id']);
const { users, total, included } = await getUsersWithRelationships({
...params,
includeRelationships: params.include,
sparseFields: params.fields,
experimentalFeatures: features,
});
// JSON:API format - love it or hate it
const response = {
data: users.map(u => ({
type: 'users',
id: u.id, // Finally using UUIDs everywhere
attributes: {
name: u.name,
email: u.email,
status: u.status,
created_at: u.created_at,
updated_at: u.updated_at,
},
relationships: {
organization: {
data: { type: 'organizations', id: u.organization_id },
},
roles: {
data: u.role_ids.map(id => ({ type: 'roles', id })),
},
},
links: {
self: `/v3/users/${u.id}`,
},
})),
included: included, // Related resources
meta: {
pagination: {
page: params.page.number,
pages: Math.ceil(total / params.page.size),
count: users.length,
total: total,
},
api_version: 'v3',
generated_at: new Date().toISOString(),
experimental_features: [...features],
},
links: generateJsonApiLinks(params, total),
};
return {
statusCode: 200,
headers: {
'Content-Type': 'application/vnd.api+json', // JSON:API requirement
'X-API-Version': 'v3',
'X-RateLimit-Limit': '1000',
'X-RateLimit-Remaining': event.requestContext.requestId, // Placeholder: not a real count yet, needs an actual rate-limit lookup
'Vary': 'Accept, X-Client-Id', // Important for caching
},
body: JSON.stringify(response),
};
})
.use(jsonBodyParser())
.use(httpErrorHandler())
.use(correlationIds())
.use(logTimeout())
.use(warmup());
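The V3 handler leans on `parseJsonApiParams`, which isn't shown. A minimal sketch of what such a parser might look like, assuming the query keys reach the handler decoded (`page[number]`, `fields[users]`) and assuming a default page size of 20 with a cap of 100. The real helper may differ; only the returned shape (`include`, `fields`, `page.number`, `page.size`) is taken from how the handler uses it.

```typescript
interface JsonApiParams {
  include: string[];
  fields: Record<string, string[]>; // resource type -> requested attributes
  page: { number: number; size: number };
}

// Split a comma-separated value into trimmed, non-empty parts.
const splitList = (value: string | undefined): string[] =>
  (value ?? '').split(',').map(s => s.trim()).filter(s => s.length > 0);

const toPositiveInt = (raw: string | undefined, fallback: number): number => {
  const n = Number.parseInt(raw ?? '', 10);
  return Number.isInteger(n) && n > 0 ? n : fallback;
};

function parseJsonApiParams(
  query: Record<string, string | undefined> | null,
): JsonApiParams {
  const q = query ?? {};
  const fields: Record<string, string[]> = {};

  for (const [key, value] of Object.entries(q)) {
    // Sparse fieldsets arrive as fields[users]=name,email
    const type = /^fields\[([^\]]+)\]$/.exec(key)?.[1];
    if (type) fields[type] = splitList(value);
  }

  return {
    include: splitList(q['include']),
    fields,
    page: {
      number: toPositiveInt(q['page[number]'], 1),
      size: Math.min(toPositiveInt(q['page[size]'], 20), 100), // Assumed default and cap
    },
  };
}
```

A request like `/v3/users?include=organization,roles&fields[users]=name,email&page[number]=2` would yield two includes, one sparse fieldset for `users`, and page 2 at the default size.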
Migration Pain Points and Solutions#
The Database Migration That Almost Killed Us#
When moving from V1 to V2, we needed to change userId (string) to user_id (UUID). Here's how we did it without downtime:
// migrations/v1-to-v2-user-ids.ts
export const migrateUserIds = async () => {
const BATCH_SIZE = 25; // DynamoDB batchWrite accepts at most 25 requests per call
let lastEvaluatedKey: any = undefined;
let migrated = 0;
let failed = 0;
// First pass: Add new field
do {
const { Items, LastEvaluatedKey } = await dynamodb.scan({
TableName: process.env.USERS_TABLE!,
Limit: BATCH_SIZE,
ExclusiveStartKey: lastEvaluatedKey,
}).promise();
const batch = Items?.map(item => ({
PutRequest: {
Item: {
...item,
user_id: item.userId || generateUUID(), // New field
_migration: 'v1-to-v2-phase1',
_migrated_at: new Date().toISOString(),
},
},
})) || [];
if (batch.length > 0) {
try {
await dynamodb.batchWrite({
RequestItems: { [process.env.USERS_TABLE!]: batch },
}).promise();
migrated += batch.length;
} catch (error) {
// Log but don't stop - we'll retry failed items
console.error('Batch failed', { error, batch: batch.map(b => b.PutRequest.Item.userId) });
failed += batch.length;
}
}
lastEvaluatedKey = LastEvaluatedKey;
// Throttle to avoid hot partitions
await new Promise(resolve => setTimeout(resolve, 100));
} while (lastEvaluatedKey);
console.log(`Migration complete: ${migrated} succeeded, ${failed} failed`);
// Second pass: Remove old field (after all clients updated)
// We waited 6 months for this
};
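Two DynamoDB details matter for a migration like this: BatchWriteItem accepts at most 25 put or delete requests per call, and it can hand some of them back as UnprocessedItems without throwing. A sketch of chunking and retrying around both, with the actual write injected as a function so the logic can be tested without AWS. `writeAll`, `chunk` and `BatchWriter` are hypothetical names, and the retry count and backoff values are assumptions.

```typescript
const MAX_BATCH_WRITE = 25; // DynamoDB BatchWriteItem limit per request

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Stands in for a batchWrite call: resolves with the requests DynamoDB
// did not process (the UnprocessedItems for the table).
type BatchWriter<T> = (requests: T[]) => Promise<T[]>;

async function writeAll<T>(
  requests: T[],
  write: BatchWriter<T>,
  maxAttempts = 5,
  baseDelayMs = 50,
): Promise<{ written: number; failed: T[] }> {
  let written = 0;
  const failed: T[] = [];

  for (const group of chunk(requests, MAX_BATCH_WRITE)) {
    let pending = group;
    for (let attempt = 1; attempt <= maxAttempts && pending.length > 0; attempt++) {
      const unprocessed = await write(pending);
      written += pending.length - unprocessed.length;
      pending = unprocessed;
      if (pending.length > 0 && attempt < maxAttempts) {
        // Exponential backoff before retrying the leftovers
        await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
      }
    }
    failed.push(...pending); // Still unprocessed after the last attempt
  }
  return { written, failed };
}
```

In the migration, `write` would wrap the `dynamodb.batchWrite` call and return the table's entry from `UnprocessedItems` (or an empty array), and the `failed` list is what gets logged for a later pass.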
Client SDK Backwards Compatibility#
Our SDK had to work with all API versions. This is messy but necessary:
// sdk/src/client.ts
export class ApiClient {
private version: string;
private warned = new Set<string>();
constructor(options: ClientOptions = {}) {
this.version = options.version || 'v2'; // Default to stable
if (this.version === 'v1' && !this.warned.has('deprecation')) {
console.warn(
'\x1b[33m%s\x1b[0m', // Yellow text
'[DEPRECATION] API v1 will be sunset on 2025-01-15. ' +
'Migration guide: https://docs.api.com/migration'
);
this.warned.add('deprecation');
// Track SDK version usage
this.trackEvent('sdk_deprecation_warning', { version: 'v1' });
}
}
async getUsers(options?: GetUsersOptions) {
const url = this.buildUrl('users', options);
const response = await this.request(url);
// Normalize responses across versions
return this.normalizeUserResponse(response);
}
private normalizeUserResponse(response: any): User[] {
switch (this.version) {
case 'v1':
// V1 returns flat array
return response.map((u: any) => ({
id: u.userId,
name: u.userName,
email: u.userEmail,
createdAt: new Date(u.createdDate),
// V1 doesn't have these
status: 'active',
updatedAt: new Date(u.createdDate),
}));
case 'v2':
// V2 returns paginated response
return response.data.map((u: any) => ({
id: u.id,
name: u.name,
email: u.email,
status: u.status,
createdAt: new Date(u.created_at),
updatedAt: new Date(u.updated_at),
}));
case 'v3':
// V3 returns JSON:API format
return response.data.map((u: any) => ({
id: u.id,
name: u.attributes.name,
email: u.attributes.email,
status: u.attributes.status,
createdAt: new Date(u.attributes.created_at),
updatedAt: new Date(u.attributes.updated_at),
// V3 includes relationships
organizationId: u.relationships?.organization?.data?.id,
roleIds: u.relationships?.roles?.data?.map((r: any) => r.id) || [],
}));
default:
throw new Error(`Unknown API version: ${this.version}`);
}
}
}
Monitoring and Alerting That Actually Helps#
After getting paged at 3 AM too many times, here's our monitoring setup:
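// lib/constructs/api-monitoring.ts
import { Duration } from 'aws-cdk-lib';
import {
Alarm, Color, ComparisonOperator, Dashboard, GraphWidget, MathExpression, TreatMissingData,
} from 'aws-cdk-lib/aws-cloudwatch';
import { SnsAction } from 'aws-cdk-lib/aws-cloudwatch-actions';
import { Topic } from 'aws-cdk-lib/aws-sns';
import { Construct } from 'constructs';
// v1Percentage, v2Percentage, v3Percentage and v1Errors are CloudWatch metrics defined elsewhere (not shown)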
export class ApiMonitoring extends Construct {
constructor(scope: Construct, id: string) {
super(scope, id);
// Dashboard that actually gets looked at
const dashboard = new Dashboard(this, 'ApiDashboard', {
dashboardName: 'api-versions-prod',
defaultInterval: Duration.hours(3), // Recent enough to be useful
});
// Version distribution - watched this like a hawk during v2 rollout
dashboard.addWidgets(
new GraphWidget({
title: 'API Version Distribution (% of requests)',
left: [v1Percentage, v2Percentage, v3Percentage],
leftYAxis: { max: 100, min: 0 },
period: Duration.minutes(5),
statistic: 'Average',
// This annotation saved us from sunsetting v1 too early
leftAnnotations: [{
label: 'Min safe threshold',
value: 5,
color: Color.RED,
}],
})
);
// The metric that matters: client errors by version
dashboard.addWidgets(
new GraphWidget({
title: '4xx Errors by Version',
left: [
new MathExpression({
expression: 'RATE(m1)',
usingMetrics: {
m1: v1Errors,
},
label: 'V1 Error Rate',
color: Color.RED,
}),
// Similar for v2, v3
],
})
);
// Deprecation warning effectiveness
const deprecationAlarm = new Alarm(this, 'V1StillHighUsage', {
metric: v1Percentage,
threshold: 10,
evaluationPeriods: 3,
comparisonOperator: ComparisonOperator.GREATER_THAN_THRESHOLD,
alarmDescription: 'V1 still above 10% - delay sunset?',
treatMissingData: TreatMissingData.NOT_BREACHING,
});
deprecationAlarm.addAlarmAction(
new SnsAction(Topic.fromTopicArn(this, 'AlertTopic', process.env.ALERT_TOPIC_ARN!))
);
}
}
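The percentage metrics the dashboard plots aren't defined in the snippet. The arithmetic behind them is just each version's share of total requests; in CloudWatch that could be a metric math expression along the lines of `100 * m1 / (m1 + m2 + m3)`. A sketch of the same calculation as plain functions, useful for sanity-checking the alarm threshold against real counts. `versionShare` and `sunsetBlocked` are hypothetical names.

```typescript
type VersionCounts = Record<string, number>;

// Percentage of total requests served by each version, rounded to one decimal.
function versionShare(counts: VersionCounts): Record<string, number> {
  const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
  const share: Record<string, number> = {};
  for (const [version, n] of Object.entries(counts)) {
    share[version] = total === 0 ? 0 : Math.round((n / total) * 1000) / 10;
  }
  return share;
}

// Mirrors the V1StillHighUsage alarm: is the version still above the
// threshold (10% in the alarm above)?
function sunsetBlocked(counts: VersionCounts, version: string, thresholdPct = 10): boolean {
  return (versionShare(counts)[version] ?? 0) > thresholdPct;
}

const share = versionShare({ v1: 50, v2: 900, v3: 50 });
```

With those sample counts V1 sits at 5%, right on the "min safe threshold" annotation and below the 10% alarm.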
Hard-Learned Lessons#
1. Version Sunset Is Harder Than Launch#
We still have 47 clients on V1, two years after deprecation. Why?
- Government client with 18-month deployment cycles
- IoT devices that can't be updated remotely
- One client hard-coded our URLs in firmware (!)
Cost of maintaining V1: ~$3,000/month. Cost of losing these clients: ~$180,000/month.
2. Breaking Changes Are Like Compound Interest#
Every breaking change multiplies your testing matrix. We have:
- 3 API versions
- 4 SDK versions (one legacy)
- 5 different response formats
- = 60 combinations to test
Our integration tests take 45 minutes to run.
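The 60 is a plain product: 3 × 4 × 5. A small sketch that enumerates the matrix makes the multiplication visible. The SDK and format names are placeholders, not our real identifiers.

```typescript
// Cartesian product of any number of option lists.
function combinations(...dimensions: string[][]): string[][] {
  let result: string[][] = [[]];
  for (const options of dimensions) {
    const next: string[][] = [];
    for (const prefix of result) {
      for (const option of options) next.push([...prefix, option]);
    }
    result = next;
  }
  return result;
}

const apiVersions = ['v1', 'v2', 'v3'];
const sdkVersions = ['sdk-a', 'sdk-b', 'sdk-c', 'sdk-d']; // Placeholder names
const responseFormats = ['f1', 'f2', 'f3', 'f4', 'f5']; // Placeholder names

const matrix = combinations(apiVersions, sdkVersions, responseFormats);
console.log(matrix.length); // 60
```

Add a fourth API version and the same call returns 80 combinations, which is the compound-interest effect in one line.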
3. Documentation Drift Is Real#
Our V1 docs were last updated in 2022. Found out last month a major client was using undocumented behavior we'd "fixed" in V2. Had to add it back as a feature flag.
4. Version Discovery Is Critical#
// This endpoint saves more support tickets than any other
app.get('/api', (req, res) => {
res.json({
versions: {
v1: {
status: 'deprecated',
sunset_date: '2025-01-15',
docs: 'https://docs.api.com/v1',
migration_guide: 'https://docs.api.com/v1-to-v2',
},
v2: {
status: 'stable',
docs: 'https://docs.api.com/v2',
},
v3: {
status: 'beta',
docs: 'https://docs.api.com/v3',
breaking_changes: 'https://docs.api.com/v3-breaking-changes',
},
},
current_stable: 'v2',
recommended: 'v2',
your_version: detectVersion(req), // What the client is using
});
});
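The payload above is hard-coded, so it can drift from the version config. A sketch of deriving the `versions` block from the config instead. `buildDiscovery` is a hypothetical name, `VersionInfo` is a trimmed copy of `ApiVersion`, and the docs URL pattern (`<base>/<version>`) is an assumption based on the URLs in the example.

```typescript
interface VersionInfo {
  status: 'alpha' | 'beta' | 'stable' | 'deprecated' | 'sunset';
  sunsetAt?: Date;
}

interface DiscoveryEntry {
  status: string;
  docs: string;
  sunset_date?: string;
}

function buildDiscovery(versions: Record<string, VersionInfo>, docsBase: string) {
  const entries: Record<string, DiscoveryEntry> = {};
  let currentStable: string | null = null;

  for (const [name, info] of Object.entries(versions)) {
    if (info.status === 'sunset') continue; // Not deployed, so not advertised
    const entry: DiscoveryEntry = { status: info.status, docs: `${docsBase}/${name}` };
    if (info.sunsetAt) entry.sunset_date = info.sunsetAt.toISOString().slice(0, 10);
    entries[name] = entry;
    if (info.status === 'stable') currentStable = name; // Last stable entry wins
  }
  return { versions: entries, current_stable: currentStable };
}

const discovery = buildDiscovery(
  {
    v1: { status: 'deprecated', sunsetAt: new Date('2025-01-15') },
    v2: { status: 'stable' },
    v3: { status: 'beta' },
  },
  'https://docs.api.com',
);
```

The migration-guide and breaking-changes links from the hand-written version would still need their own fields in the config; this sketch only covers what the config already knows.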
The Actual Costs#
Running multiple API versions isn't free:
- Infrastructure: 3x Lambda functions, 3x API Gateway configs = +$850/month
- Development: Every feature takes 40% longer to implement across versions
- Testing: CI/CD pipeline went from 15 minutes to 45 minutes
- Documentation: Maintaining three sets of docs = 1 technical writer, part-time
- Support: 30% of tickets are version-related confusion
Total additional cost: ~$15,000/month. But removing V1 would cost us $180,000/month in lost revenue.
What I'd Do Differently#
- Start with versioning from day one - Adding it later is 10x harder
- Make breaking changes in batches - We shipped 15 small breaking changes when we should have bundled them into 3 big ones
- Invest in better migration tools - Should have built automated migration scripts earlier
- Set realistic sunset dates - 6 months is fantasy. 18 months is realistic.
- Track client versions from the start - We didn't know who was using what until too late
The CDK Pattern That Actually Works#
If you're starting fresh, use this structure:
/api
  /v1
    /users
    /orders
    /internal/health
  /v2
    /users
    /orders
    /internal/health
  /versions (discovery endpoint)
  /health (version-agnostic)
Keep your Lambda code organized by version:
/src
  /handlers
    /v1
      /users
      /orders
    /v2
      /users
      /orders
  /shared
    /database
    /auth
    /utils
Final Thoughts#
API versioning with CDK isn't about finding the perfect pattern - it's about finding the pattern that matches your reality. Our three-version system isn't elegant, but it works. It keeps our enterprise clients happy, our developers sane(ish), and our revenue flowing.
The next time someone suggests "just adding a version number," show them this post. Then budget 6 months and $15K/month for the real implementation.