AWS CDK Link Shortener Part 5: Scaling & Long-term Maintenance
Multi-region deployment, database scaling strategies, disaster recovery patterns, and long-term maintenance approaches. Real lessons from running production systems at scale and the architectural decisions that matter years later.
Two years after launching our link shortener, I got the call during a quarterly business review. "We need to expand to APAC markets, and our European users are complaining about slow redirects." What started as a simple request turned into a six-month global scaling project that taught me more about distributed systems than any architecture course ever could.
The real kicker? Our "perfectly architected" single-region system wasn't just slow for international users—it was a single point of failure for a business that now depended on it for customer acquisition across three continents. Time to learn about scaling the hard way.
In Parts 1-4, we built, secured, and optimized our link shortener for production. Now let's scale it globally and build the operational excellence patterns that'll keep it running for years. This is where architecture decisions really start showing their consequences.
Multi-Region Architecture: When Simple Isn't Enough Anymore
Our original single-region setup worked great for 100K redirects per day. At 10M redirects across global markets, every millisecond of latency became a conversion rate problem. Here's how we evolved the architecture:
// lib/global-link-shortener-stack.ts - Multi-region deployment pattern
import * as cdk from 'aws-cdk-lib';
import * as acm from 'aws-cdk-lib/aws-certificatemanager';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as route53 from 'aws-cdk-lib/aws-route53';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';
import { Construct } from 'constructs';
export interface GlobalLinkShortenerProps {
readonly primaryRegion: string;
readonly replicationRegions: string[];
readonly domainName: string;
readonly certificateArn: string;
}
export class GlobalLinkShortenerStack extends cdk.Stack {
public readonly globalTable: dynamodb.Table;
public readonly distribution: cloudfront.Distribution;
constructor(scope: Construct, id: string, props: GlobalLinkShortenerProps) {
super(scope, id, {
env: { region: props.primaryRegion },
crossRegionReferences: true
});
// Global DynamoDB table with cross-region replication
this.globalTable = new dynamodb.Table(this, 'GlobalLinksTable', {
tableName: 'global-links-table',
partitionKey: { name: 'shortCode', type: dynamodb.AttributeType.STRING },
billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
// Point-in-time recovery for global data
pointInTimeRecovery: true,
// Global tables for multi-region active-active
replicationRegions: props.replicationRegions,
// Stream for real-time analytics across regions
stream: dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
// Deletion protection - learned this one the hard way
removalPolicy: cdk.RemovalPolicy.RETAIN,
deletionProtection: true,
});
// Global secondary index for analytics queries
this.globalTable.addGlobalSecondaryIndex({
indexName: 'domain-timestamp-index',
partitionKey: { name: 'domain', type: dynamodb.AttributeType.STRING },
sortKey: { name: 'createdAt', type: dynamodb.AttributeType.STRING },
});
// Route 53 health checks for each region
const healthChecks = props.replicationRegions.map((region, index) => {
return new route53.CfnHealthCheck(this, `HealthCheck-${region}`, {
healthCheckConfig: {
type: 'HTTPS',
resourcePath: '/health',
fullyQualifiedDomainName: `${region}.${props.domainName}`,
port: 443,
requestInterval: 30,
failureThreshold: 3,
},
});
});
// Global CloudFront distribution with regional origins
this.distribution = new cloudfront.Distribution(this, 'GlobalDistribution', {
comment: 'Global Link Shortener Distribution',
// Price class for global edge locations
priceClass: cloudfront.PriceClass.PRICE_CLASS_ALL,
// Custom domain configuration
domainNames: [props.domainName],
certificate: acm.Certificate.fromCertificateArn(
this, 'Certificate', props.certificateArn
),
// Regional origins with health check failover
additionalBehaviors: this.createRegionalBehaviors(props.replicationRegions),
// Cache policy for redirect responses
defaultBehavior: {
origin: new origins.HttpOrigin(`${props.primaryRegion}.${props.domainName}`),
cachePolicy: cloudfront.CachePolicy.CACHING_OPTIMIZED,
originRequestPolicy: cloudfront.OriginRequestPolicy.ALL_VIEWER_EXCEPT_HOST_HEADER,
viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
// Edge Lambda for geo-routing optimization
edgeLambdas: [{
functionVersion: this.createEdgeFunction(),
eventType: cloudfront.LambdaEdgeEventType.ORIGIN_REQUEST,
}],
},
});
}
private createRegionalBehaviors(regions: string[]) {
const behaviors: Record<string, cloudfront.BehaviorOptions> = {};
regions.forEach(region => {
behaviors[`/${region}/*`] = {
origin: new origins.HttpOrigin(`${region}.api.example.com`),
cachePolicy: cloudfront.CachePolicy.CACHING_OPTIMIZED,
viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
};
});
return behaviors;
}
}
The regional deployment pattern that saved our international performance:
// bin/global-deployment.ts - Regional deployment orchestration
#!/usr/bin/env node
import 'source-map-support/register';
import * as cdk from 'aws-cdk-lib';
import { GlobalLinkShortenerStack } from '../lib/global-link-shortener-stack';
import { RegionalLinkShortenerStack } from '../lib/regional-link-shortener-stack';
const app = new cdk.App();
// Configuration driven deployment
const regions = [
{ name: 'us-east-1', isPrimary: true, weight: 40 },
{ name: 'eu-west-1', isPrimary: false, weight: 35 },
{ name: 'ap-southeast-1', isPrimary: false, weight: 25 },
];
const domainName = app.node.tryGetContext('domainName') || 'links.example.com';
// Deploy primary global resources
const globalStack = new GlobalLinkShortenerStack(app, 'GlobalLinkShortener', {
primaryRegion: 'us-east-1',
replicationRegions: regions.filter(r => !r.isPrimary).map(r => r.name),
domainName,
certificateArn: app.node.tryGetContext('certificateArn'),
});
// Deploy regional stacks
regions.forEach(region => {
new RegionalLinkShortenerStack(app, `LinkShortener-${region.name}`, {
env: { region: region.name },
globalTable: globalStack.globalTable,
isPrimaryRegion: region.isPrimary,
trafficWeight: region.weight,
domainName,
// Cross-stack references for global resources
crossRegionReferences: true,
});
});
The Hard Truth About Multi-Region: It's not just about deploying to multiple regions. You need to think about data consistency, regional failover, cost implications, and operational complexity. Our first attempt took 3 months because we underestimated the operational overhead.
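Part of that operational overhead is making failover actually work: the Route 53 health checks above only help if each regional API exposes a /health endpoint that fails fast when its dependencies do. A minimal sketch of that handler (the file name and TABLE_NAME variable are assumptions, not the exact production code):
// functions/health.ts - regional health endpoint for Route 53 checks (illustrative sketch)
import { DynamoDBClient, DescribeTableCommand } from '@aws-sdk/client-dynamodb';

const client = new DynamoDBClient({});

export const handler = async () => {
  try {
    // A cheap DescribeTable call verifies credentials, networking, and the table itself
    await client.send(new DescribeTableCommand({ TableName: process.env.TABLE_NAME }));
    return {
      statusCode: 200,
      body: JSON.stringify({ status: 'healthy', region: process.env.AWS_REGION }),
    };
  } catch (error) {
    // Any failure returns 503 so the health check trips after the failure threshold
    console.error('Health check failed:', error);
    return { statusCode: 503, body: JSON.stringify({ status: 'unhealthy' }) };
  }
};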
Database Scaling Strategies: Beyond DynamoDB Auto-Scaling
When you hit 10M+ requests per day, even DynamoDB's auto-scaling has limits. Here are the patterns that actually worked in production:
// lib/database-scaling-stack.ts - Advanced DynamoDB scaling patterns
import * as cdk from 'aws-cdk-lib';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as elasticache from 'aws-cdk-lib/aws-elasticache';
import * as lambda from 'aws-cdk-lib/aws-lambda';
export class ScalableDatabaseStack extends cdk.Stack {
// Hot partition detection and mitigation
private createShardedTable() {
const table = new dynamodb.Table(this, 'ShardedLinksTable', {
partitionKey: { name: 'shardKey', type: dynamodb.AttributeType.STRING },
sortKey: { name: 'shortCode', type: dynamodb.AttributeType.STRING },
// On-demand scaling for unpredictable traffic
billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
// Contributor insights for hot partition detection
contributorInsightsEnabled: true,
});
// Add write sharding logic
const shardingFunction = new lambda.Function(this, 'ShardingFunction', {
runtime: lambda.Runtime.NODEJS_18_X,
handler: 'sharding.handler',
code: lambda.Code.fromAsset('functions'),
environment: {
SHARD_COUNT: '100', // Distribute load across shards
TABLE_NAME: table.tableName,
},
});
return table;
}
// Redis cluster for hot link caching
private createCacheCluster() {
const cacheSubnetGroup = new elasticache.CfnSubnetGroup(
this, 'CacheSubnetGroup', {
description: 'Subnet group for Redis cluster',
subnetIds: this.vpc.privateSubnets.map(subnet => subnet.subnetId),
}
);
return new elasticache.CfnCacheCluster(this, 'RedisCluster', {
engine: 'redis',
engineVersion: '7.0',
cacheNodeType: 'cache.r6g.large',
numCacheNodes: 1,
// Note: azMode applies to Memcached only, and a single-node Redis CacheCluster is not
// Multi-AZ; for HA, use a CfnReplicationGroup with automaticFailoverEnabled and replicas
// Subnet and security configuration
cacheSubnetGroupName: cacheSubnetGroup.ref,
vpcSecurityGroupIds: [this.cacheSecurityGroup.securityGroupId],
// Backup and maintenance
snapshotRetentionLimit: 5,
snapshotWindow: '03:00-05:00',
preferredMaintenanceWindow: 'sun:05:00-sun:07:00',
});
}
// Read replica pattern for analytics
private createAnalyticsReadReplicas() {
// Separate table for analytics to avoid impacting redirects
return new dynamodb.Table(this, 'AnalyticsTable', {
partitionKey: { name: 'date', type: dynamodb.AttributeType.STRING },
sortKey: { name: 'linkId', type: dynamodb.AttributeType.STRING },
// Time-based partitioning for analytics queries
timeToLiveAttribute: 'ttl',
// Stream processing for real-time aggregation
stream: dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
});
}
}
The sharding logic that solved our hot partition problems:
// functions/sharding.ts - Hot partition mitigation
import { DynamoDBClient, PutItemCommand } from '@aws-sdk/client-dynamodb';
import { createHash } from 'crypto';
interface LinkData {
shortCode: string;
targetUrl: string;
domain: string;
createdAt: string;
}
export const handler = async (event: any) => {
const { shortCode, targetUrl, domain } = event as LinkData;
// Shard key generation to distribute load
const shardKey = generateShardKey(shortCode, domain);
const client = new DynamoDBClient({});
// Write to sharded partition
const command = new PutItemCommand({
TableName: process.env.TABLE_NAME,
Item: {
shardKey: { S: shardKey },
shortCode: { S: shortCode },
targetUrl: { S: targetUrl },
domain: { S: domain },
createdAt: { S: new Date().toISOString() },
// TTL for automatic cleanup of old links
ttl: { N: String(Math.floor(Date.now() / 1000) + 365 * 24 * 60 * 60) },
},
// Conditional write to prevent overwrites
ConditionExpression: 'attribute_not_exists(shortCode)',
});
try {
await client.send(command);
return { statusCode: 201, body: JSON.stringify({ shortCode, shardKey }) };
} catch (error) {
console.error('Sharding write failed:', error);
throw new Error('Failed to create sharded link');
}
};
function generateShardKey(shortCode: string, domain: string): string {
const shardCount = parseInt(process.env.SHARD_COUNT || '10');
// Consistent hashing for even distribution
const hash = createHash('md5')
.update(`${shortCode}-${domain}`)
.digest('hex');
const shardIndex = parseInt(hash.substring(0, 8), 16) % shardCount;
return `shard-${shardIndex.toString().padStart(3, '0')}`;
}
Scaling Reality Check: Sharding looks elegant in theory, but debugging distributed queries across 100 shards at 3 AM is not fun. We learned to start with simple solutions and add complexity only when metrics proved it necessary.
Disaster Recovery: Planning for the Worst Day
Six months into our global deployment, AWS had a major outage in us-east-1. Our primary region was down for 4 hours. Here's what we learned about real disaster recovery:
// lib/disaster-recovery-stack.ts - Multi-region failover automation
import * as cdk from 'aws-cdk-lib';
import * as route53 from 'aws-cdk-lib/aws-route53';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cloudwatchActions from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
export class DisasterRecoveryStack extends cdk.Stack {
// Automated failover using Route 53 health checks
private createFailoverRouting() {
const hostedZone = route53.HostedZone.fromLookup(this, 'Zone', {
domainName: 'example.com',
});
// Primary region record with health check
const primaryHealthCheck = new route53.CfnHealthCheck(this, 'PrimaryHealth', {
healthCheckConfig: {
type: 'HTTPS',
resourcePath: '/health',
fullyQualifiedDomainName: 'us-east-1.api.example.com',
port: 443,
requestInterval: 30,
failureThreshold: 3,
// CloudWatch alarm integration
insufficientDataHealthStatus: 'Unhealthy',
measureLatency: true,
regions: ['us-east-1', 'us-west-1', 'eu-west-1'],
},
});
// Primary record with failover routing (failover policies use the L1 record set)
new route53.CfnRecordSet(this, 'PrimaryRecord', {
hostedZoneId: hostedZone.hostedZoneId,
name: `api.${hostedZone.zoneName}`,
type: 'A',
ttl: '60',
resourceRecords: ['1.2.3.4'],
setIdentifier: 'primary',
failover: 'PRIMARY',
healthCheckId: primaryHealthCheck.attrHealthCheckId,
});
// Secondary region record
const secondaryHealthCheck = new route53.CfnHealthCheck(this, 'SecondaryHealth', {
healthCheckConfig: {
type: 'HTTPS',
resourcePath: '/health',
fullyQualifiedDomainName: 'eu-west-1.api.example.com',
port: 443,
requestInterval: 30,
failureThreshold: 3,
},
});
new route53.CfnRecordSet(this, 'SecondaryRecord', {
hostedZoneId: hostedZone.hostedZoneId,
name: `api.${hostedZone.zoneName}`,
type: 'A',
ttl: '60',
resourceRecords: ['5.6.7.8'],
setIdentifier: 'secondary',
failover: 'SECONDARY',
healthCheckId: secondaryHealthCheck.attrHealthCheckId,
});
}
// Cross-region backup automation
private createBackupStrategy() {
const backupFunction = new lambda.Function(this, 'BackupFunction', {
runtime: lambda.Runtime.NODEJS_18_X,
handler: 'backup.handler',
code: lambda.Code.fromAsset('functions'),
timeout: cdk.Duration.minutes(15),
environment: {
PRIMARY_TABLE: 'links-table-us-east-1',
BACKUP_BUCKET: 'links-backup-bucket',
CROSS_REGION_BUCKET: 'links-backup-eu-west-1',
},
});
// Schedule daily backups
new events.Rule(this, 'BackupSchedule', {
schedule: events.Schedule.cron({
hour: '2',
minute: '0'
}),
targets: [new targets.LambdaFunction(backupFunction)],
});
// Point-in-time recovery monitoring
const recoveryAlarm = new cloudwatch.Alarm(this, 'RecoveryAlarm', {
metric: backupFunction.metricErrors(),
threshold: 1,
evaluationPeriods: 1,
});
// SNS notification for backup failures
const alertTopic = new sns.Topic(this, 'BackupAlerts');
recoveryAlarm.addAlarmAction(new cloudwatchActions.SnsAction(alertTopic));
}
}
The backup automation that saved us during the outage:
// functions/backup.ts - Automated disaster recovery backup
import { DynamoDBClient, ScanCommand } from '@aws-sdk/client-dynamodb';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { gzip } from 'zlib';
import { promisify } from 'util';
const gzipAsync = promisify(gzip);
export const handler = async (event: any) => {
const dynamoClient = new DynamoDBClient({ region: 'us-east-1' });
const s3Client = new S3Client({ region: 'us-east-1' });
const timestamp = new Date().toISOString().split('T')[0];
let lastEvaluatedKey;
let backupData = [];
try {
// Paginated scan of entire table
do {
const scanCommand = new ScanCommand({
TableName: process.env.PRIMARY_TABLE,
ExclusiveStartKey: lastEvaluatedKey,
Limit: 1000, // Process in chunks
});
const result = await dynamoClient.send(scanCommand);
if (result.Items) {
backupData.push(...result.Items);
}
lastEvaluatedKey = result.LastEvaluatedKey;
// Progress logging for large tables
console.log(`Backed up ${backupData.length} items...`);
} while (lastEvaluatedKey);
// Compress and upload backup
const compressed = await gzipAsync(JSON.stringify(backupData));
const uploadCommand = new PutObjectCommand({
Bucket: process.env.BACKUP_BUCKET,
Key: `daily-backups/${timestamp}/links-backup.json.gz`,
Body: compressed,
// Cross-region replication tags
Tagging: 'BackupType=Daily&Region=us-east-1&Replicate=true',
// Encryption for sensitive data
ServerSideEncryption: 'AES256',
});
await s3Client.send(uploadCommand);
// Cross-region copy for true disaster recovery
await copyToSecondaryRegion(compressed, timestamp);
return {
statusCode: 200,
body: JSON.stringify({
itemsBackedUp: backupData.length,
backupKey: `daily-backups/${timestamp}/links-backup.json.gz`,
timestamp,
}),
};
} catch (error) {
console.error('Backup failed:', error);
// Send alert to operations team
await sendAlert({
subject: 'Link Shortener Backup Failed',
message: `Backup failed at ${new Date().toISOString()}: ${error.message}`,
severity: 'HIGH',
});
throw error;
}
};
async function copyToSecondaryRegion(data: Buffer, timestamp: string) {
const secondaryS3 = new S3Client({ region: 'eu-west-1' });
return secondaryS3.send(new PutObjectCommand({
Bucket: process.env.CROSS_REGION_BUCKET,
Key: `daily-backups/${timestamp}/links-backup.json.gz`,
Body: data,
ServerSideEncryption: 'AES256',
}));
}
DR Reality: Route 53 health checks take 90-180 seconds to detect failures and trigger failover. In internet time, that's an eternity. Plan for it, and have manual override procedures ready.
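Our manual override is deliberately boring: a small script that inverts the primary health check so Route 53 treats the primary as unhealthy immediately, instead of waiting out the detection window. A sketch of that override (the health check ID is passed in by the operator):
// scripts/force-failover.ts - manual failover override (illustrative sketch)
import { Route53Client, UpdateHealthCheckCommand } from '@aws-sdk/client-route-53';

const client = new Route53Client({});

export async function forceFailover(healthCheckId: string, failedOver: boolean) {
  // Inverting the health check flips its reported status, so the PRIMARY record is skipped
  // and traffic shifts to the SECONDARY record without touching any DNS records
  await client.send(new UpdateHealthCheckCommand({
    HealthCheckId: healthCheckId,
    Inverted: failedOver,
  }));
  console.log(`Health check ${healthCheckId} inverted=${failedOver}`);
}

// forceFailover('<primary-health-check-id>', true) fails over; pass false to fail back.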
Long-term Maintenance & Technical Debt
Two years in, our "quick MVP" had accumulated significant technical debt. Here's how we managed it without breaking production:
// lib/maintenance-automation-stack.ts - Technical debt management
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as stepfunctions from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
export class MaintenanceAutomationStack extends cdk.Stack {
// Automated dependency updates
private createDependencyUpdatePipeline() {
const updateFunction = new lambda.Function(this, 'DependencyUpdater', {
runtime: lambda.Runtime.NODEJS_18_X,
handler: 'maintenance.updateDependencies',
code: lambda.Code.fromAsset('functions'),
timeout: cdk.Duration.minutes(5),
environment: {
GITHUB_TOKEN: 'your-github-token',
REPOSITORY: 'your-org/link-shortener',
SLACK_WEBHOOK: process.env.SLACK_WEBHOOK || '',
},
});
// Weekly dependency check
new events.Rule(this, 'WeeklyUpdates', {
schedule: events.Schedule.cron({
weekDay: '1', // Monday
hour: '9',
minute: '0',
}),
targets: [new targets.LambdaFunction(updateFunction)],
});
}
// Data cleanup automation
private createDataCleanupPipeline() {
// Step Function for safe data cleanup
const cleanupWorkflow = new stepfunctions.StateMachine(this, 'CleanupWorkflow', {
definition: stepfunctions.Chain
.start(new tasks.LambdaInvoke(this, 'IdentifyExpiredLinks', {
lambdaFunction: this.identifyExpiredLinksFunction,
}))
.next(new tasks.LambdaInvoke(this, 'CreateBackupSnapshot', {
lambdaFunction: this.createBackupFunction,
}))
.next(new tasks.LambdaInvoke(this, 'DeleteExpiredLinks', {
lambdaFunction: this.deleteExpiredLinksFunction,
}))
.next(new tasks.LambdaInvoke(this, 'VerifyCleanup', {
lambdaFunction: this.verifyCleanupFunction,
})),
timeout: cdk.Duration.hours(2),
});
// Monthly cleanup schedule
new events.Rule(this, 'MonthlyCleanup', {
schedule: events.Schedule.cron({
day: '1',
hour: '3',
minute: '0',
}),
targets: [new targets.SfnStateMachine(cleanupWorkflow)],
});
}
// Security audit automation
private createSecurityAuditPipeline() {
const auditFunction = new lambda.Function(this, 'SecurityAuditor', {
runtime: lambda.Runtime.NODEJS_18_X,
handler: 'security.auditSystem',
code: lambda.Code.fromAsset('functions'),
timeout: cdk.Duration.minutes(10),
environment: {
SECURITY_SCAN_BUCKET: 'security-audit-results',
COMPLIANCE_WEBHOOK: process.env.COMPLIANCE_WEBHOOK || '',
},
});
// Daily security checks
new events.Rule(this, 'DailySecurityAudit', {
schedule: events.Schedule.rate(cdk.Duration.days(1)),
targets: [new targets.LambdaFunction(auditFunction)],
});
}
}
The maintenance automation that kept us ahead of technical debt:
// functions/maintenance.ts - Automated maintenance tasks
import { Octokit } from '@octokit/rest';
import { execSync } from 'child_process';
import { writeFileSync, readFileSync } from 'fs';
export const updateDependencies = async (event: any) => {
const octokit = new Octokit({
auth: process.env.GITHUB_TOKEN,
});
try {
// Check for outdated packages
// npm outdated exits non-zero when updates exist, so read its JSON from the thrown error too
let outdated = '';
try {
outdated = execSync('npm outdated --json', { encoding: 'utf8' });
} catch (err: any) {
outdated = err.stdout?.toString() ?? '';
}
const outdatedPackages = JSON.parse(outdated || '{}');
if (Object.keys(outdatedPackages).length === 0) {
console.log('All dependencies are up to date');
return { statusCode: 200, body: 'No updates needed' };
}
// Create feature branch for updates
const branchName = `dependency-updates-${new Date().toISOString().split('T')[0]}`;
await octokit.rest.git.createRef({
owner: 'your-org',
repo: 'link-shortener',
ref: `refs/heads/${branchName}`,
sha: await getCurrentCommitSha(),
});
// Update package.json with compatible versions only
const packageJson = JSON.parse(readFileSync('package.json', 'utf8'));
let updatedCount = 0;
for (const [pkg, info] of Object.entries(outdatedPackages)) {
const pkgInfo = info as any;
// Only update patch and minor versions for stability
if (isCompatibleUpdate(pkgInfo.current, pkgInfo.latest)) {
if (packageJson.dependencies?.[pkg]) {
packageJson.dependencies[pkg] = `^${pkgInfo.latest}`;
updatedCount++;
}
if (packageJson.devDependencies?.[pkg]) {
packageJson.devDependencies[pkg] = `^${pkgInfo.latest}`;
updatedCount++;
}
}
}
if (updatedCount > 0) {
writeFileSync('package.json', JSON.stringify(packageJson, null, 2));
// Run tests to ensure compatibility
const testResult = execSync('npm test', { encoding: 'utf8' });
// Create pull request
await octokit.rest.pulls.create({
owner: 'your-org',
repo: 'link-shortener',
title: `Automated dependency updates (${updatedCount} packages)`,
head: branchName,
base: 'main',
body: createPRBody(outdatedPackages, updatedCount),
});
await notifySlack(`Created PR for ${updatedCount} dependency updates`);
}
return {
statusCode: 200,
body: JSON.stringify({ updatedPackages: updatedCount }),
};
} catch (error) {
console.error('Dependency update failed:', error);
await notifySlack(`❌ Dependency update failed: ${error.message}`);
throw error;
}
};
function isCompatibleUpdate(current: string, latest: string): boolean {
const [currentMajor, currentMinor] = current.split('.').map(Number);
const [latestMajor, latestMinor] = latest.split('.').map(Number);
// Only allow same major version updates
return currentMajor === latestMajor && latestMinor >= currentMinor;
}
async function notifySlack(message: string) {
if (!process.env.SLACK_WEBHOOK) return;
await fetch(process.env.SLACK_WEBHOOK, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ text: message }),
});
}
Team Processes & Operational Excellence
Running a global system taught us that technology is only half the battle. The other half is building team processes that scale:
// lib/operational-excellence-stack.ts - Observability and alerting
import * as cdk from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cloudwatchActions from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as chatbot from 'aws-cdk-lib/aws-chatbot';
export class OperationalExcellenceStack extends cdk.Stack {
// Comprehensive monitoring dashboard
private createOperationalDashboard() {
const dashboard = new cloudwatch.Dashboard(this, 'OperationalDashboard', {
dashboardName: 'LinkShortener-Operations',
widgets: [
// SLA monitoring
[
new cloudwatch.GraphWidget({
title: 'Response Time SLA (95th percentile)',
left: [
new cloudwatch.Metric({
namespace: 'AWS/Lambda',
metricName: 'Duration',
statistic: 'p95',
dimensionsMap: {
FunctionName: 'redirect-function',
},
}),
],
leftYAxis: { min: 0, max: 100 },
// SLA line at 50ms
leftAnnotations: [{
value: 50,
label: 'SLA Threshold',
color: cloudwatch.Color.RED,
}],
}),
new cloudwatch.SingleValueWidget({
title: 'Current Availability',
metrics: [
new cloudwatch.MathExpression({
expression: '100 - (errors / requests * 100)',
usingMetrics: {
errors: new cloudwatch.Metric({
namespace: 'AWS/Lambda',
metricName: 'Errors',
statistic: 'Sum',
}),
requests: new cloudwatch.Metric({
namespace: 'AWS/Lambda',
metricName: 'Invocations',
statistic: 'Sum',
}),
},
}),
],
}),
],
// Cost monitoring
[
new cloudwatch.GraphWidget({
title: 'Daily Cost Breakdown',
stacked: true,
left: [
new cloudwatch.Metric({
namespace: 'AWS/Billing',
metricName: 'EstimatedCharges',
statistic: 'Maximum',
dimensionsMap: {
Currency: 'USD',
ServiceName: 'AmazonDynamoDB',
},
}),
new cloudwatch.Metric({
namespace: 'AWS/Billing',
metricName: 'EstimatedCharges',
statistic: 'Maximum',
dimensionsMap: {
Currency: 'USD',
ServiceName: 'AWSLambda',
},
}),
],
}),
],
// Business metrics
[
new cloudwatch.GraphWidget({
title: 'Business Impact Metrics',
left: [
new cloudwatch.Metric({
namespace: 'LinkShortener/Business',
metricName: 'LinksCreated',
statistic: 'Sum',
}),
new cloudwatch.Metric({
namespace: 'LinkShortener/Business',
metricName: 'RedirectsServed',
statistic: 'Sum',
}),
],
}),
],
],
});
return dashboard;
}
// Intelligent alerting system
private createIntelligentAlerting() {
const criticalAlerts = new sns.Topic(this, 'CriticalAlerts');
const warningAlerts = new sns.Topic(this, 'WarningAlerts');
// P1: Service down
const serviceDownAlarm = new cloudwatch.Alarm(this, 'ServiceDownAlarm', {
alarmName: 'LinkShortener-ServiceDown-P1',
metric: new cloudwatch.Metric({
namespace: 'AWS/Lambda',
metricName: 'Errors',
statistic: 'Sum',
dimensionsMap: { FunctionName: 'redirect-function' },
}),
threshold: 10,
evaluationPeriods: 2,
datapointsToAlarm: 2,
treatMissingData: cloudwatch.TreatMissingData.BREACHING,
});
serviceDownAlarm.addAlarmAction(new cloudwatchActions.SnsAction(criticalAlerts));
// P2: Performance degradation
const performanceAlarm = new cloudwatch.Alarm(this, 'PerformanceDegradationAlarm', {
alarmName: 'LinkShortener-SlowResponse-P2',
metric: new cloudwatch.Metric({
namespace: 'AWS/Lambda',
metricName: 'Duration',
statistic: 'p95',
}),
threshold: 100, // 100ms P95
evaluationPeriods: 3,
datapointsToAlarm: 2,
});
performanceAlarm.addAlarmAction(new cloudwatchActions.SnsAction(warningAlerts));
// P3: Capacity planning
const capacityAlarm = new cloudwatch.Alarm(this, 'CapacityPlanningAlarm', {
alarmName: 'LinkShortener-HighLoad-P3',
metric: new cloudwatch.Metric({
namespace: 'AWS/DynamoDB',
metricName: 'ConsumedReadCapacityUnits',
statistic: 'Sum',
}),
threshold: 8000, // 80% of provisioned capacity
evaluationPeriods: 5,
datapointsToAlarm: 3,
});
capacityAlarm.addAlarmAction(new cloudwatchActions.SnsAction(warningAlerts));
// Slack integration for team notifications
new chatbot.SlackChannelConfiguration(this, 'SlackNotifications', {
slackChannelConfigurationName: 'linkshortener-alerts',
slackWorkspaceId: 'YOUR_WORKSPACE_ID',
slackChannelId: 'C01234567890',
notificationTopics: [criticalAlerts, warningAlerts],
guardrailPolicies: [iam.ManagedPolicy.fromAwsManagedPolicyName('CloudWatchReadOnlyAccess')],
});
}
}
The runbook automation that saved our weekends:
// functions/incident-response.ts - Automated incident response
export const autoIncidentResponse = async (event: any) => {
// The SNS message body is a JSON string containing the CloudWatch alarm payload
const alarmName = JSON.parse(event.Records[0].Sns.Message).AlarmName;
const severity = extractSeverity(alarmName);
console.log(`Processing ${severity} incident: ${alarmName}`);
// Automated remediation based on severity
switch (severity) {
case 'P1':
await handleCriticalIncident(event);
break;
case 'P2':
await handlePerformanceIssue(event);
break;
case 'P3':
await handleCapacityWarning(event);
break;
}
};
async function handleCriticalIncident(event: any) {
// 1. Create PagerDuty incident
await createPagerDutyIncident({
title: 'Link Shortener Service Down',
severity: 'critical',
service: 'link-shortener-prod',
});
// 2. Enable emergency read replicas
await enableEmergencyReadReplicas();
// 3. Switch to maintenance page
await updateMaintenancePage(true);
// 4. Start diagnostic data collection
await collectDiagnosticData();
// 5. Notify stakeholders
await notifyStakeholders('CRITICAL: Link shortener is experiencing downtime');
}
async function handlePerformanceIssue(event: any) {
// Auto-scale DynamoDB capacity
await scaleDynamoDBCapacity(1.5); // 50% increase
// Clear cache to remove potentially slow queries
await clearApplicationCache();
// Collect performance metrics
await collectPerformanceMetrics();
}
async function handleCapacityWarning(event: any) {
// Capacity planning automation
const projectedGrowth = await calculateGrowthTrend();
if (projectedGrowth > 0.8) { // 80% growth trend
await scheduleCapacityReview();
await notifyCapacityTeam(projectedGrowth);
}
}
Operational Excellence Lesson: Automation doesn't replace good judgment—it buys you time to use it. Our automated responses handle 80% of common issues, leaving humans to focus on the truly complex problems.
Capacity Planning & Growth Forecasting
The business question that keeps engineering leaders up at night: "Can our system handle Black Friday?" Here's how we built capacity planning into our architecture:
// functions/capacity-planning.ts - Growth forecasting and capacity planning
import { CloudWatchClient, GetMetricStatisticsCommand } from '@aws-sdk/client-cloudwatch';
import { DynamoDBClient, DescribeTableCommand } from '@aws-sdk/client-dynamodb';
interface CapacityProjection {
currentCapacity: number;
projectedDemand: number;
recommendedCapacity: number;
confidenceLevel: number;
timeframe: string;
}
export const generateCapacityForecast = async (event: any): Promise<CapacityProjection> => {
const cloudwatch = new CloudWatchClient({});
const dynamodb = new DynamoDBClient({});
// Analyze historical traffic patterns
const historicalData = await getHistoricalMetrics(cloudwatch, 90); // 90 days
const seasonalPatterns = analyzeSeasonalTrends(historicalData);
const growthTrend = calculateGrowthTrend(historicalData);
// Get current capacity settings
const currentCapacity = await getCurrentCapacity(dynamodb);
// Forecast future demand
const projection = projectDemand({
historicalData,
seasonalPatterns,
growthTrend,
currentCapacity,
timeframe: '30days',
});
// Generate actionable recommendations
const recommendations = generateRecommendations(projection);
// Create capacity planning report
await createCapacityReport({
projection,
recommendations,
timestamp: new Date().toISOString(),
});
return projection;
};
async function getHistoricalMetrics(client: CloudWatchClient, days: number) {
const endTime = new Date();
const startTime = new Date(endTime.getTime() - days * 24 * 60 * 60 * 1000);
const command = new GetMetricStatisticsCommand({
Namespace: 'AWS/DynamoDB',
MetricName: 'ConsumedReadCapacityUnits',
Dimensions: [
{ Name: 'TableName', Value: 'links-table' },
],
StartTime: startTime,
EndTime: endTime,
Period: 3600, // 1 hour periods
Statistics: ['Average', 'Maximum'],
});
const response = await client.send(command);
return response.Datapoints || [];
}
function analyzeSeasonalTrends(data: any[]) {
// Group by day of week and hour
const patterns = {
hourly: new Array(24).fill(0),
daily: new Array(7).fill(0),
monthly: new Array(12).fill(0),
};
data.forEach(point => {
const date = new Date(point.Timestamp);
const hour = date.getHours();
const day = date.getDay();
const month = date.getMonth();
patterns.hourly[hour] += point.Average;
patterns.daily[day] += point.Average;
patterns.monthly[month] += point.Average;
});
// Normalize patterns
return {
peakHour: patterns.hourly.indexOf(Math.max(...patterns.hourly)),
peakDay: patterns.daily.indexOf(Math.max(...patterns.daily)),
peakMonth: patterns.monthly.indexOf(Math.max(...patterns.monthly)),
variance: calculateVariance(patterns.hourly),
};
}
function projectDemand(config: any): CapacityProjection {
const {
historicalData,
seasonalPatterns,
growthTrend,
currentCapacity,
timeframe,
} = config;
// Linear regression for growth projection
const baselineGrowth = growthTrend.slope * 30; // 30-day projection
// Seasonal adjustment
const seasonalMultiplier = getSeasonalMultiplier(seasonalPatterns, timeframe);
// Business event adjustments (holiday sales, marketing campaigns)
const eventMultiplier = getBusinessEventMultiplier(timeframe);
const projectedDemand =
currentCapacity.average *
(1 + baselineGrowth) *
seasonalMultiplier *
eventMultiplier;
return {
currentCapacity: currentCapacity.provisioned,
projectedDemand: Math.ceil(projectedDemand),
recommendedCapacity: Math.ceil(projectedDemand * 1.2), // 20% buffer
confidenceLevel: calculateConfidence(growthTrend.r2),
timeframe,
};
}
function generateRecommendations(projection: CapacityProjection) {
const recommendations = [];
if (projection.projectedDemand > projection.currentCapacity * 0.8) {
recommendations.push({
type: 'SCALE_UP',
urgency: 'HIGH',
action: `Increase DynamoDB capacity to ${projection.recommendedCapacity} RCU`,
estimatedCost: calculateCostIncrease(projection),
});
}
if (projection.confidenceLevel < 0.7) {
recommendations.push({
type: 'MONITORING',
urgency: 'MEDIUM',
action: 'Increase monitoring frequency due to low confidence in projection',
estimatedCost: 0,
});
}
return recommendations;
}
Capacity Planning Reality: Our first forecast was off by 300% because we didn't account for viral marketing campaigns. Now we integrate business event calendars with our technical forecasting. Marketing launches and engineering capacity planning are no longer separate conversations.
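In practice that integration is unglamorous: the getBusinessEventMultiplier call above reads a shared calendar of upcoming campaigns and returns the largest expected traffic lift inside the forecast window. A minimal sketch, assuming a hand-maintained event list (the data shape and values are illustrative, and the signature is simplified to take a number of days):
// functions/business-events.ts - campaign-aware demand multiplier (illustrative sketch)
interface BusinessEvent {
  name: string;
  start: string; // ISO date
  end: string;   // ISO date
  trafficMultiplier: number; // expected lift vs. baseline, e.g. 4 for a 4x spike
}

// In practice this lives in a shared table or calendar that marketing owns
const upcomingEvents: BusinessEvent[] = [
  { name: 'Black Friday', start: '2025-11-28', end: '2025-12-01', trafficMultiplier: 4 },
  { name: 'Product launch', start: '2025-10-15', end: '2025-10-18', trafficMultiplier: 2.5 },
];

export function getBusinessEventMultiplier(timeframeDays: number, now = new Date()): number {
  const horizon = new Date(now.getTime() + timeframeDays * 24 * 60 * 60 * 1000);
  // Take the largest overlapping event so capacity covers the worst expected spike
  return upcomingEvents
    .filter(e => new Date(e.start) <= horizon && new Date(e.end) >= now)
    .reduce((max, e) => Math.max(max, e.trafficMultiplier), 1);
}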
Series Wrap-up: Lessons Learned
After five parts and thousands of lines of CDK code, here's what building a production-grade link shortener really taught us:
What We Got Right
- Infrastructure as Code from Day One: CDK saved us countless hours during scaling and disasters
- Observability Before Optimization: You can't improve what you can't measure
- Security by Design: Adding security later is 10x harder than building it in
- Multi-Region from the Start: Global users don't wait for your architecture to catch up
What We'd Do Differently
- Start with Sharding: Hot partitions are inevitable at scale—plan for them
- Invest in Operational Excellence Earlier: Good runbooks are worth their weight in gold
- Business Metrics from Day One: Technical metrics don't tell the business story
- Team Processes Evolve with Scale: What works for 3 engineers breaks with 30
The Real Cost of Scale
Our final monthly AWS bill for handling 50M redirects:
- DynamoDB Global Tables: $1,200 (3 regions, on-demand)
- Lambda: $180 (includes cross-region invocations)
- CloudFront: $45 (global distribution)
- Route 53: $25 (health checks and DNS)
- Monitoring & Alerts: $80 (CloudWatch, X-Ray)
- Data Transfer: $120 (cross-region replication)
- Total: $1,650/month for enterprise-grade global infrastructure
Cost Reality Check: That's about $0.000033 per redirect. The engineering time to build and maintain it? About 2.5 engineers full-time across setup, scaling, and maintenance. The business value? Immeasurable when your redirects are critical customer touchpoints.
Key Architectural Decisions and Their Long-term Impact
DynamoDB Global Tables vs Aurora Global Database: We chose DynamoDB for its predictable performance and pay-per-request billing. Two years later, with traffic spikes ranging from 1K to 100K redirects per minute, we're glad we did. Aurora would have required more capacity planning overhead.
Lambda vs ECS/Fargate: Lambda's cold starts were a concern initially, but provisioned concurrency solved that. The operational simplicity of not managing containers won out. We've had zero server maintenance issues because there are no servers.
CDK vs Terraform: CDK's TypeScript integration with our Lambda functions made refactoring across infrastructure and application code seamless. The type safety caught dozens of configuration errors before deployment.
Multi-Region Active-Active vs Active-Passive: Active-active was more complex to implement but eliminated the "failover test" problem. When us-east-1 went down, traffic seamlessly continued from other regions.
The Human Side of Scale
Technical scaling is just half the story. Here's what we learned about team scaling:
- Documentation Becomes Critical: When the original architect leaves, tribal knowledge leaves too
- On-Call Rotation Needs Structure: Burnout is real when your system spans timezones
- Cross-Training Is Investment: Every component needs at least two people who understand it
- Incident Reviews Create Learning: Blameless postmortems improved our architecture more than any planning session
Beyond Link Shorteners
These patterns apply to any high-traffic, low-latency service:
- Event-driven architecture scales better than request-response patterns
- Regional data locality beats global consistency for user-facing features
- Operational automation is the difference between a job and a career
- Business alignment turns infrastructure costs into business investments
Looking Forward
The link shortener that started as a weekend project now handles more traffic than some Fortune 500 websites. It's a reminder that with modern cloud services and infrastructure as code, small teams can build systems that would have required enterprise data centers just a decade ago.
The real lesson isn't about building link shorteners—it's about building systems that grow with your business, support your team, and survive the inevitable complexities of scale. Whether you're building a link shortener, an API gateway, or the next unicorn startup, these patterns will serve you well.
Final Thought: Architecture is about trade-offs, but operational excellence is about minimizing the consequences of those trade-offs. Build systems that fail gracefully, scale predictably, and can be maintained by humans under pressure. Your future self will thank you.
This concludes our 5-part journey from zero to production-scale link shortener. The complete source code with all CDK constructs, Lambda functions, and deployment scripts is available in the GitHub repository. Happy building!