From RFC to Production: What They Don't Tell You About Implementation

A Principal Engineer's honest take on the gap between beautiful RFC designs and messy production reality, featuring real-world lessons from implementing notification systems at scale

You know that feeling when you're reading through a beautifully crafted RFC, nodding along to the elegant architecture diagrams, and thinking "This is it, this is the design that will finally work perfectly"? Then six months later you're knee-deep in production issues, the timeline has doubled, and that pristine database schema looks like it went through a blender?

Welcome to the reality of turning RFCs into production systems. After 20 years of watching brilliant designs collide with organizational reality, I've learned that the gap between RFC and production isn't a bug – it's a feature of building complex systems with teams under business pressures.

Let me walk you through what happens when that 12-week notification system RFC meets the chaos of production, using experiences from multiple implementations. Spoiler alert: it took 8 months, not 12 weeks, and the final system looked nothing like the original design.

The RFC Honeymoon Phase#

Every RFC starts with optimism. The notification system RFC I'm thinking of was a masterpiece: clean architecture diagrams, comprehensive database schemas, phased rollout plans. It promised to solve every notification problem we'd ever had:

TypeScript
// The RFC's beautiful promise
interface NotificationSystemGoals {
  deliveryTime: '<100ms for in-app, <5s for email',
  throughput: '10,000+ notifications per second',
  uptime: '99.9% availability',
  timeline: '12 weeks with 2 developers',
  budget: '$120,000-180,000'
}

// What we shipped
interface ProductionReality {
  deliveryTime: '2-3s for in-app on good days, 30s+ during peaks',
  throughput: 'Started at 500/sec, took 6 months to reach 5,000/sec',
  uptime: '97% first quarter, 99% after year one',
  timeline: '8 months with 4 developers plus 2 contractors',
  budget: '$400,000+ and still counting maintenance costs'
}

The RFC looked bulletproof. We had thought of everything: rate limiting, deduplication, preference management, even quiet hours. The phased approach seemed conservative – surely 4 weeks for core infrastructure was enough?
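
For a sense of how thorough it felt, here is roughly the pre-send gate the RFC sketched out. This is a reconstruction from memory rather than the actual RFC code, and the in-memory dedupe map below stands in for the Redis check the design assumed:

TypeScript
// A reconstruction of the pre-send gate the RFC described (not the original code)
interface UserPreferences {
  enabledChannels: string[];
  timezone: string;       // IANA name, e.g. 'America/New_York'
  quietStartHour: number; // local hour, inclusive
  quietEndHour: number;   // local hour, exclusive
}

interface PendingNotification {
  channel: string;
  critical: boolean;
  dedupeKey: string; // e.g. userId + ':' + type + ':' + entityId
}

// In-memory stand-in for the Redis SETNX-style dedupe store the RFC assumed
const recentlySent = new Map<string, number>();

function shouldSend(n: PendingNotification, prefs: UserPreferences, now = new Date()): boolean {
  // Respect per-channel opt-outs
  if (!prefs.enabledChannels.includes(n.channel)) return false;

  // Quiet hours: suppress non-critical notifications during the user's local overnight window
  const localHour = Number(
    new Intl.DateTimeFormat('en-US', { hour: '2-digit', hourCycle: 'h23', timeZone: prefs.timezone }).format(now)
  );
  const crossesMidnight = prefs.quietStartHour > prefs.quietEndHour;
  const inQuietHours = crossesMidnight
    ? localHour >= prefs.quietStartHour || localHour < prefs.quietEndHour
    : localHour >= prefs.quietStartHour && localHour < prefs.quietEndHour;
  if (!n.critical && inQuietHours) return false;

  // Deduplication: drop anything with the same key seen in the last 10 minutes
  const lastSent = recentlySent.get(n.dedupeKey);
  if (lastSent !== undefined && now.getTime() - lastSent < 10 * 60 * 1000) return false;
  recentlySent.set(n.dedupeKey, now.getTime());
  return true;
}

On paper, a gate like that covers the obvious cases. In production, the interesting part turned out to be everything around it: where the preferences come from, how stale they're allowed to be, and what happens when the dedupe store is down.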

When Database Schemas Meet Reality#

The RFC's database schema was a thing of beauty. Clean, normalized, with proper foreign keys and constraints. Here's what the RFC proposed:

SQL
-- The RFC's pristine schema
CREATE TABLE notification_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id) ON DELETE CASCADE,
    notification_type VARCHAR(100) NOT NULL,
    template_id UUID REFERENCES notification_templates(id),
    data JSONB DEFAULT '{}',
    status VARCHAR(20) DEFAULT 'pending',
    sent_at TIMESTAMP,
    delivered_at TIMESTAMP,
    read_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW()
);

Three months into production, here's what that table looked like:

SQL
-- Production reality after 20+ migrations
CREATE TABLE notification_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID, -- Foreign key removed due to performance issues
    notification_type VARCHAR(100),
    notification_type_v2 VARCHAR(255), -- Migration in progress
    template_id UUID,
    template_id_v2 BIGINT, -- Different team used different ID type
    data JSONB DEFAULT '{}',
    data_compressed BYTEA, -- Added when JSONB got too large
    status VARCHAR(20) DEFAULT 'pending',
    status_v2 VARCHAR(50), -- More statuses than expected
    priority INTEGER DEFAULT 0, -- Not in RFC, critical for production
    retry_count INTEGER DEFAULT 0, -- Not in RFC, essential for debugging
    channel VARCHAR(50), -- Denormalized for query performance
    correlation_id UUID, -- Added for distributed tracing
    partition_key INTEGER, -- Added for sharding
    sent_at TIMESTAMP,
    delivered_at TIMESTAMP,
    read_at TIMESTAMP,
    failed_at TIMESTAMP, -- Not in RFC, very much needed
    expires_at TIMESTAMP, -- Not in RFC, prevented infinite growth
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW() -- Added after debugging nightmares
);

-- Plus 15 indexes we didn't anticipate
CREATE INDEX CONCURRENTLY idx_notification_events_user_created ON notification_events(user_id, created_at DESC) WHERE status != 'deleted';
CREATE INDEX CONCURRENTLY idx_notification_events_correlation ON notification_events(correlation_id) WHERE correlation_id IS NOT NULL;
-- ... and 13 more

Each of those schema changes represents a production incident, a performance crisis, or a feature request we couldn't have anticipated. The pristine RFC design met reality, and reality won.
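
The expires_at column, for instance, only exists because the table grew without bound until we bolted on a purge job. A minimal sketch of that kind of batched cleanup, assuming node-postgres (the batch size and the cleanup cadence are illustrative):

TypeScript
import { Pool } from 'pg';

const pool = new Pool(); // connection settings come from the usual PG* environment variables

// Delete expired notifications in small batches so the purge never holds long locks
// or turns into one enormous DELETE that fights with production traffic.
async function purgeExpiredNotifications(batchSize = 5000): Promise<number> {
  let totalDeleted = 0;
  let deleted = 0;
  do {
    const result = await pool.query(
      `DELETE FROM notification_events
        WHERE id IN (
          SELECT id FROM notification_events
           WHERE expires_at IS NOT NULL AND expires_at < NOW()
           LIMIT $1
        )`,
      [batchSize]
    );
    deleted = result.rowCount ?? 0;
    totalDeleted += deleted;
  } while (deleted === batchSize);
  return totalDeleted;
}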

The WebSocket Connection Management Disaster#

The RFC confidently stated we'd handle real-time notifications with WebSockets. The sample code looked so clean:

TypeScript
// RFC's WebSocket implementation
class NotificationWebSocketManager {
  private connections: Map<string, WebSocket> = new Map();
  
  async sendNotification(userId: string, notification: NotificationEvent) {
    const connection = this.connections.get(userId);
    if (connection && connection.readyState === WebSocket.OPEN) {
      connection.send(JSON.stringify({
        type: 'notification',
        data: notification
      }));
    }
  }
}

Six months later, after multiple production incidents including the infamous "50,000 zombie connections" disaster during a mobile app deployment gone wrong, here's what we had:

TypeScript
// Production reality with all the edge cases
class NotificationWebSocketManager {
  private connections: Map<string, Set<WebSocketConnection>> = new Map();
  private connectionMetadata: Map<string, ConnectionMetadata> = new Map();
  private healthChecks: Map<string, NodeJS.Timeout> = new Map();
  private rateLimiters: Map<string, RateLimiter> = new Map();
  private deadLetterQueue: Queue<FailedNotification>;
  private circuit: CircuitBreaker;
  
  async sendNotification(userId: string, notification: NotificationEvent) {
    // 200+ lines of defensive programming
    const connections = this.connections.get(userId);
    if (!connections || connections.size === 0) {
      await this.queueForLaterDelivery(userId, notification);
      return;
    }
    
    // Handle multiple connections per user (mobile + web + tablet)
    const results = await Promise.allSettled(
      Array.from(connections).map(async (conn) => {
        try {
          // Check connection health
          if (!this.isConnectionHealthy(conn)) {
            await this.reconnectOrEvict(conn);
            throw new Error('Unhealthy connection');
          }
          
          // Rate limiting per connection
          const limiter = this.getRateLimiter(conn.id);
          if (!await limiter.tryAcquire()) {
            await this.backpressure(conn, notification);
            return;
          }
          
          // Circuit breaker for cascading failures
          return await this.circuit.fire(async () => {
            // Message size validation (learned this the hard way)
            const message = this.serializeNotification(notification);
            if (message.length > MAX_MESSAGE_SIZE) {
              const chunks = this.chunkMessage(message);
              for (const chunk of chunks) {
                await this.sendChunk(conn, chunk);
              }
            } else {
              await this.sendMessage(conn, message);
            }
          });
        } catch (error) {
          await this.handleDeliveryFailure(conn, notification, error);
        }
      })
    );
    
    // Track delivery metrics
    await this.recordDeliveryMetrics(userId, notification, results);
  }
  
  // Plus 50+ other methods for handling edge cases
}

Every single one of those additions came from a production incident. The circuit breaker? Added after we took down our Redis cluster. The chunking logic? Discovered when a marketing notification with embedded images crashed mobile clients. The rate limiting? Learned that one during a notification storm that generated 100,000 support tickets.
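
The zombie-connection incident deserves its own footnote: the eventual fix was a plain ping/pong sweep the RFC never mentioned. A minimal sketch of the pattern using the ws library (the 30-second interval and the isAlive flag are the stock approach, not our exact code):

TypeScript
import WebSocket, { WebSocketServer } from 'ws';

// Track liveness per socket: a missed pong means the client is gone,
// even when the TCP connection still looks open to us.
interface TrackedSocket extends WebSocket {
  isAlive?: boolean;
}

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws: TrackedSocket) => {
  ws.isAlive = true;
  ws.on('pong', () => {
    ws.isAlive = true;
  });
});

// Sweep every 30 seconds: anything that never answered the previous ping gets terminated
const sweep = setInterval(() => {
  for (const ws of wss.clients as Set<TrackedSocket>) {
    if (ws.isAlive === false) {
      ws.terminate();
      continue;
    }
    ws.isAlive = false;
    ws.ping();
  }
}, 30_000);

wss.on('close', () => clearInterval(sweep));

Anything that misses a pong gets terminated instead of lingering as a half-open socket, which is exactly what those 50,000 zombie connections were doing after the botched mobile deployment.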

Timeline vs Reality: The 12-Week Delusion#

The RFC's phased approach seemed so reasonable:

  • Phase 1 (Weeks 1-4): Core Infrastructure
  • Phase 2 (Weeks 5-8): Advanced Features
  • Phase 3 (Weeks 9-12): Integration & Optimization

Here's what happened:

Weeks 1-4: The Infrastructure Surprise#

Instead of building core infrastructure, we spent three weeks just setting up environments and discovering that our "standard" database setup couldn't handle the write throughput. Week 4 was entirely consumed by a production incident in another system that pulled our team away.

Weeks 5-12: The Scope Creep Symphony#

Product management, seeing the early demos, got excited. "Can we add Slack notifications?" Sure, not too hard. "What about SMS for critical alerts?" Makes sense. "Oh, and we need to support scheduling notifications up to 90 days in advance for marketing campaigns." Wait, what?

TypeScript
// Original scope
const originalChannels = ['in_app', 'email', 'push'];

// Month 3 scope
const actualChannels = [
  'in_app', 
  'email', 
  'push', 
  'sms',          // Added week 6
  'slack',        // Added week 8
  'teams',        // Added week 10
  'webhook',      // Added week 11
  'discord',      // Added week 14 (yes, we were already late)
  'voice_call'    // Added week 20 (for critical security alerts)
];
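
The 90-day scheduling request alone implied a persistent scheduler the RFC never budgeted for. What we ended up with was a polling worker over a scheduled table; a simplified sketch assuming node-postgres (the table and column names are illustrative):

TypeScript
import { Pool } from 'pg';

const pool = new Pool();

// Poll for due notifications and hand each one to the normal delivery path.
// Row locks plus SKIP LOCKED let several workers poll the same table without double-sending.
async function drainDueNotifications(deliver: (row: { id: string }) => Promise<void>): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    const { rows } = await client.query(
      `SELECT id, user_id, notification_type, data
         FROM scheduled_notifications
        WHERE scheduled_for <= NOW() AND status = 'pending'
        ORDER BY scheduled_for
        LIMIT 100
        FOR UPDATE SKIP LOCKED`
    );
    for (const row of rows) {
      await deliver(row);
      await client.query(`UPDATE scheduled_notifications SET status = 'sent' WHERE id = $1`, [row.id]);
    }
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}

The transaction matters here: without it the row locks are released as soon as the SELECT finishes, and two workers can happily send the same campaign twice.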

Months 4-6: The Integration Hell#

Remember that clean API design in the RFC? It assumed all our services used the same authentication system. Plot twist: they didn't. We had three different auth systems (JWT, OAuth2, and a legacy session-based system), and notifications needed to work with all of them.

TypeScript
// RFC assumption
interface AuthContext {
  userId: string;
  token: string;
}

// Production reality
type AuthContext = 
  | { type: 'jwt'; userId: string; token: string; claims: JWTClaims }
  | { type: 'oauth2'; userId: string; accessToken: string; refreshToken: string; expiresAt: Date }
  | { type: 'legacy'; sessionId: string; userId?: string; cookieData: LegacyCookie }
  | { type: 'service_account'; serviceId: string; apiKey: string }
  | { type: 'anonymous'; temporaryId: string; ipAddress: string };

// Each auth type needed different handling, different rate limits, 
// different security checks, and different audit logging
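
Most of the integration work amounted to normalizing those shapes into a single identity before any notification logic runs. A simplified sketch of that boundary, with the legacy session lookup standing in for the real per-system clients:

TypeScript
// Simplified union: the real one also carried claims, cookies, and expiry
type SimplifiedAuthContext =
  | { type: 'jwt'; userId: string; token: string }
  | { type: 'oauth2'; userId: string; accessToken: string }
  | { type: 'legacy'; sessionId: string; userId?: string }
  | { type: 'service_account'; serviceId: string; apiKey: string }
  | { type: 'anonymous'; temporaryId: string };

interface ResolvedIdentity {
  subjectId: string;        // who the notification is about
  rateLimitBucket: string;  // service accounts and anonymous users get their own limits
  auditSource: SimplifiedAuthContext['type'];
}

// Exhaustive switch: under strict settings, a new auth type is a compile error until it's handled here
async function resolveIdentity(
  ctx: SimplifiedAuthContext,
  lookupLegacySession: (sessionId: string) => Promise<string>
): Promise<ResolvedIdentity> {
  switch (ctx.type) {
    case 'jwt':
    case 'oauth2':
      return { subjectId: ctx.userId, rateLimitBucket: `user:${ctx.userId}`, auditSource: ctx.type };
    case 'legacy': {
      const userId = ctx.userId ?? (await lookupLegacySession(ctx.sessionId));
      return { subjectId: userId, rateLimitBucket: `user:${userId}`, auditSource: 'legacy' };
    }
    case 'service_account':
      return { subjectId: ctx.serviceId, rateLimitBucket: `service:${ctx.serviceId}`, auditSource: 'service_account' };
    case 'anonymous':
      return { subjectId: ctx.temporaryId, rateLimitBucket: 'anonymous', auditSource: 'anonymous' };
  }
}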

Months 7-8: The Performance Reckoning#

The system was "working" but couldn't handle production load. The RFC promised 10,000 notifications per second. We were struggling with 500. The next two months were a blur of profiling, caching, query optimization, and architectural changes.

The biggest surprise? The bottleneck wasn't where we expected (database writes). It was template rendering. Our fancy personalization system was making 20+ API calls per notification to gather user context.
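
The fix was unglamorous: batch and cache the context lookups instead of fanning out per notification. A sketch of the shape of it, assuming the dataloader package (fetchUserContexts and renderTemplate are stand-ins for our real service clients):

TypeScript
import DataLoader from 'dataloader';

interface UserContext { userId: string; name: string; locale: string; plan: string; }

// Stand-ins for the real user-context service and template engine
declare function fetchUserContexts(userIds: string[]): Promise<UserContext[]>;
declare function renderTemplate(templateId: string, ctx: UserContext): string;

// One batched lookup per tick instead of 20+ point reads per notification
const userContextLoader = new DataLoader<string, UserContext>(async (userIds) => {
  const contexts = await fetchUserContexts([...userIds]); // single bulk call
  const byId = new Map<string, UserContext>();
  for (const c of contexts) byId.set(c.userId, c);
  // DataLoader requires results in the same order as the requested keys
  return userIds.map((id) => byId.get(id) ?? new Error(`no context for user ${id}`));
});

async function renderNotification(templateId: string, userId: string): Promise<string> {
  const ctx = await userContextLoader.load(userId); // coalesced with other lookups in the same batch window
  return renderTemplate(templateId, ctx);
}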

Team Dynamics: The Human Factor#

The RFC mentioned "2 developers for 12 weeks." Here's who worked on it:

  • 2 senior engineers (supposed to be full-time, averaged 60% due to production support)
  • 1 junior engineer (added month 2, spent month 3 learning the codebase)
  • 2 contractors (added month 4 for "quick wins," spent month 5 fixing their code)
  • 1 DevOps engineer (supposedly "consulting," became full-time by month 3)
  • 1 database expert (brought in month 5 for performance crisis)
  • Product manager (changed twice during the project)
  • 3 different engineering managers (reorg happened in month 6)

Each team change meant context loss, architecture debates, and rework. The contractor code seemed fine in code review but created technical debt we're still paying off. The reorg in month 6 almost killed the project when the new engineering manager wanted to "revisit the architecture."

The Monitoring Gap Nobody Talks About#

The RFC had a section on monitoring. It listed metrics like delivery rate, response time, and error rate. Sensible, right? Here's what we actually needed to monitor in production:

TypeScript
// RFC monitoring plan
const plannedMetrics = [
  'delivery_rate',
  'response_time', 
  'error_rate',
  'throughput'
];

// What we actually monitor
const productionMetrics = [
  // Basic metrics (from RFC)
  'delivery_rate_by_channel_by_priority_by_user_segment',
  'response_time_p50_p95_p99_p999',
  'error_rate_by_type_by_service_by_retry_count',
  
  // The metrics that actually matter
  'template_render_time_by_template_by_variables_count',
  'database_connection_pool_wait_time',
  'redis_operation_time_by_operation_type',
  'webhook_retry_backoff_effectiveness',
  'notification_staleness_at_delivery',
  'user_preference_cache_hit_rate',
  'deduplication_effectiveness_by_time_window',
  'rate_limit_rejection_by_reason',
  'circuit_breaker_state_transitions',
  'message_size_distribution_by_channel',
  'websocket_reconnection_storms',
  'push_token_invalidation_rate',
  'email_bounce_classification',
  'notification_feedback_loop_latency',
  'cost_per_notification_by_channel',
  'regulatory_compliance_audit_completeness',
  
  // The weird ones we needed after specific incidents
  'mobile_app_version_vs_notification_compatibility',
  'timezone_calculation_accuracy',
  'emoji_rendering_failures_by_client',
  'notification_delivery_during_database_failover',
  'memory_leak_in_template_cache',
  'thundering_herd_detection'
];

Each of those metrics exists because something went wrong, and we didn't see it coming until it was too late.
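
Every one of those eventually becomes a real instrument somewhere. The template-render metric, for example, looks roughly like this with a Prometheus-style client (prom-client here; the metric name and label set are illustrative):

TypeScript
import client from 'prom-client';

// Histogram so p50/p95/p99 come from one series, labeled by what we actually debug on
const templateRenderSeconds = new client.Histogram({
  name: 'notification_template_render_seconds',
  help: 'Time spent rendering a notification template',
  labelNames: ['template_id', 'channel'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

async function timedRender(templateId: string, channel: string, render: () => Promise<string>): Promise<string> {
  const end = templateRenderSeconds.startTimer({ template_id: templateId, channel });
  try {
    return await render();
  } finally {
    end(); // records the elapsed seconds even when rendering throws
  }
}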

Technical Debt: The Compound Interest of Shortcuts#

The RFC didn't mention technical debt. By month 8, here's what we were dealing with:

The Template System Frankenstein#

We started with a simple template system. By the time we hit production, we had three different template engines running simultaneously because different teams had different requirements, and we never had time to unify them.

TypeScript
// The technical debt we're still paying
class NotificationTemplateManager {
  private mustacheTemplates: Map<string, MustacheTemplate>;    // Original system
  private handlebarsTemplates: Map<string, HandlebarsTemplate>; // Added for marketing
  private reactEmailTemplates: Map<string, ReactEmailTemplate>; // Added for pretty emails
  
  async render(templateId: string, data: any): Promise<string> {
    // 150 lines of logic to figure out which template engine to use,
    // handle edge cases, maintain backwards compatibility,
    // and work around bugs we can't fix without breaking production
    
    // This comment has been here since month 4:
    // TODO: Unify template systems (estimated: 2 weeks)
    // Actual estimate after investigation: 3 months + migration plan
  }
}
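
The dispatch itself is not the hard part, which is partly why unifying the engines keeps losing the prioritization fight. Roughly, and assuming each stored template carries an explicit engine tag (our real system infers it in messier ways):

TypeScript
type TemplateEngine = 'mustache' | 'handlebars' | 'react-email';

interface StoredTemplate {
  id: string;
  engine: TemplateEngine; // in reality this gets inferred from naming conventions and template age
  source: string;
}

// Adapter stubs standing in for the real mustache/handlebars/react-email wrappers
declare function renderMustache(source: string, data: unknown): string;
declare function renderHandlebars(source: string, data: unknown): string;
declare function renderReactEmail(source: string, data: unknown): Promise<string>;

// One render path per engine; the adapters hide three libraries behind a single signature
const renderers: Record<TemplateEngine, (source: string, data: unknown) => Promise<string>> = {
  mustache: async (source, data) => renderMustache(source, data),
  handlebars: async (source, data) => renderHandlebars(source, data),
  'react-email': async (source, data) => renderReactEmail(source, data),
};

async function renderStoredTemplate(template: StoredTemplate, data: unknown): Promise<string> {
  const render = renderers[template.engine];
  if (!render) throw new Error(`Unknown template engine: ${template.engine}`);
  return render(template.source, data);
}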

The Migration That Never Ends#

Remember that beautiful database schema? We've been migrating to "v2" for six months. We're running both schemas in parallel, with a complex sync system that occasionally loses notifications.

SQL
-- The migration nightmare
BEGIN;
  -- Step 1 of 47 in the migration plan
  INSERT INTO notification_events_v2 
  SELECT 
    id,
    user_id,
    -- 50 lines of complex transformation logic
    CASE 
      WHEN notification_type IN ('old_type_1', 'old_type_2') THEN 'new_type_1'
      WHEN notification_type LIKE 'legacy_%' THEN REPLACE(notification_type, 'legacy_', 'classic_')
      -- 20 more WHEN clauses
    END as notification_type_v2,
    -- More transformations...
  FROM notification_events 
  WHERE created_at > NOW() - INTERVAL '1 hour'
    AND status != 'migrated'
    AND NOT EXISTS (
      SELECT 1 FROM notification_events_v2 
      WHERE notification_events_v2.id = notification_events.id
    );
  
  -- Update migration status
  UPDATE migration_status 
  SET last_run = NOW(), 
      records_migrated = records_migrated + row_count,
      estimated_completion = NOW() + (remaining_records / current_rate * INTERVAL '1 second')
  WHERE migration_name = 'notification_schema_v2';
  
  -- Check for conflicts
  -- Handle rollback scenarios
  -- Update monitoring metrics
  -- 100 more lines...
COMMIT;
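
Alongside that batch backfill, every new write goes to both tables, which is exactly where the "occasionally loses notifications" part comes from: the second write can fail after the first one succeeds. A simplified sketch of the dual-write, with the backfill acting as the safety net (the column names follow the example above; the rest is illustrative):

TypeScript
import { Pool } from 'pg';

const pool = new Pool();

// Dual-write during the migration: v1 stays the source of truth, v2 is best-effort,
// and the hourly backfill above is what catches anything the best-effort write drops.
async function recordNotification(event: { id: string; userId: string; type: string }): Promise<void> {
  await pool.query(
    `INSERT INTO notification_events (id, user_id, notification_type) VALUES ($1, $2, $3)`,
    [event.id, event.userId, event.type]
  );
  try {
    await pool.query(
      `INSERT INTO notification_events_v2 (id, user_id, notification_type_v2)
       VALUES ($1, $2, $3) ON CONFLICT (id) DO NOTHING`,
      [event.id, event.userId, event.type]
    );
  } catch (err) {
    // Don't fail the user-facing write; log it and let the backfill pick the row up later
    console.error('v2 write failed, deferring to backfill', event.id, err);
  }
}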

The Success Metrics That Didn't Matter#

The RFC defined clear success criteria: 99.9% uptime, <100ms delivery, 10,000 notifications per second. We eventually hit some of these numbers, but they turned out to be the wrong metrics.

What actually mattered:

  • User happiness: We hit a 99% delivery rate, but users hated the notifications because they were poorly timed
  • Developer productivity: Other teams couldn't integrate with our "clean" API without extensive hand-holding
  • Operational burden: The system required constant babysitting despite all our automation
  • Business value: Marketing couldn't use half the features because they were too complex

TypeScript
// What we optimized for (from RFC)
const technicalMetrics = {
  uptime: 99.9,
  deliveryTime: 95, // ms
  throughput: 10000, // per second
  errorRate: 0.1 // percent
};

// What actually mattered
const businessMetrics = {
  userNotificationDisableRate: 45, // percent - way too high
  developerIntegrationTime: 3, // weeks - should be hours
  supportTicketsPerWeek: 150, // related to notifications
  marketingCampaignSetupTime: 2, // days - should be minutes
  monthlyOperationalCost: 25000, // dollars - 5x the estimate
  engineersPagedPerWeek: 12 // times - unsustainable
};

Lessons Learned the Hard Way#

After implementing notification systems at multiple companies and watching RFCs collide with reality, here's what I've learned:

1. RFCs Are Hypotheses, Not Specifications#

Treat your RFC as a starting hypothesis. The moment it hits production, it becomes a living document that needs constant revision. We kept our RFC frozen as "the spec" for too long, causing endless confusion when reality diverged.

2. Budget for the Unknown Unknowns#

Whatever timeline and budget you have, double it, then add 50% for the things you don't know you don't know. That's not pessimism; it's pattern recognition from dozens of projects.

3. Design for Migration from Day One#

Every beautiful schema will need migration. Every clean API will need versioning. Every simple system will need backward compatibility. Build these capabilities in from the start, not as afterthoughts.
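
In practice that means things like stamping a schema version on every payload from the first release, so old data gets upgraded at the boundary instead of breaking consumers downstream. A minimal sketch of the pattern (the envelope shapes are illustrative):

TypeScript
// Every payload carries its schema version from day one (illustrative shapes)
interface NotificationEnvelopeV1 { schemaVersion: 1; userId: string; body: string; }
interface NotificationEnvelopeV2 { schemaVersion: 2; userId: string; body: string; channel: string; }
type NotificationEnvelope = NotificationEnvelopeV1 | NotificationEnvelopeV2;

// Upgrade old payloads at the boundary so the rest of the system only ever sees the latest shape
function upgradeEnvelope(envelope: NotificationEnvelope): NotificationEnvelopeV2 {
  switch (envelope.schemaVersion) {
    case 1:
      return { ...envelope, schemaVersion: 2, channel: 'email' }; // assumed default for pre-channel payloads
    case 2:
      return envelope;
  }
}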

4. The Edge Cases Are the Norm#

That edge case you're discussing in the RFC review? The one everyone says "we'll handle that if it comes up"? It will come up, probably in production, probably at the worst possible time. If you're discussing it, it's not an edge case.

5. Organizational Dynamics Trump Technical Excellence#

The best technical design will fail if it doesn't account for team dynamics, political realities, and organizational constraints. The contractor who joins in month 3 doesn't care about your beautiful architecture. The reorg in month 6 will challenge every decision.

6. Monitor What You'll Actually Debug#

Don't monitor what the RFC says. Monitor what you'll need during an incident when everything is on fire. That means business metrics, user experience metrics, and detailed operational metrics, not just technical stats.

The Path Forward: Bridging Design and Reality#

So how do we bridge the gap between RFC and production? After years of trying different approaches, here's what actually works:

Start with a Minimum Lovable Product#

Not minimum viable – minimum lovable. Build something small that users actually want to use, then iterate. Our notification system would have been better if we'd started with just email notifications that worked perfectly, rather than trying to build everything at once.

Design for Change, Not Perfection#

Your database schema will change. Your API will evolve. Your architecture will shift. Design systems that can evolve gracefully rather than trying to predict the perfect end state.

Invest in Developer Experience Early#

The easier your system is to integrate with and operate, the more successful it will be. We spent months optimizing performance when we should have been making the API easier to use.

Create Living Documentation#

That RFC shouldn't be a historical artifact. It should evolve with the system. We now maintain our RFCs as living documents, with sections for "Original Design," "Current Implementation," and "Lessons Learned."

Build Feedback Loops at Every Level#

From user feedback to operational metrics to developer experience surveys, build feedback loops into your process. The faster you can learn what's not working, the faster you can fix it.

Conclusion: Embracing the Mess#

After 20 years of building systems, I've learned to embrace the mess. That pristine RFC will get messy. Your beautiful architecture will grow warts. Your clean codebase will accumulate technical debt. This isn't failure – it's the natural evolution of systems that solve real problems for real users.

The gap between RFC and production isn't something to be eliminated; it's something to be managed. The best engineers aren't the ones who write perfect RFCs that predict every detail. They're the ones who can adapt when reality inevitably diverges from the plan.

Looking back at our notification system, it's not what we designed in the RFC. It's messier, more complex, and took nearly three times as long to build. But it's also more capable, more resilient, and solves problems we didn't even know existed when we wrote that RFC.

The next time you're writing an RFC, remember: you're not writing a specification, you're starting a conversation with reality. And reality always gets the last word.

Have you experienced the RFC-to-production gap? What lessons have you learned from watching beautiful designs meet messy reality? I'd love to hear your war stories and what you'd do differently next time.
