OpenTelemetry in React Native: How I Built Production Observability After 18 Months of Debugging Hell
From production crashes we couldn't debug to comprehensive monitoring that caught issues before users noticed. My journey through React Native observability with OpenTelemetry, Firebase, and enterprise APM solutions.
Six months into our React Native app launch, we were flying blind. Users complained about crashes we couldn't reproduce. Performance issues appeared randomly. Our biggest enterprise client threatened to leave because our app "felt slow" - but we had no data to prove otherwise.
Sound familiar? After 18 months of building comprehensive observability for a React Native app serving 200,000+ users, here's what I learned about production monitoring that actually works.
The Wake-Up Call: $50K Lost in One Weekend
March 2023. Our payment flow started failing silently for iOS users. We only found out when our biggest client called Monday morning - they'd lost $50,000 in transactions over the weekend.
The logs showed nothing. Crashlytics showed nothing. Flipper worked fine in development. We spent 14 hours debugging a race condition in our payment processing that affected only iOS 14.8 users with specific network conditions.
That incident taught me three things:
- You can't debug what you can't see
- Mobile debugging is different from web debugging
- Good observability pays for itself instantly
The next day, I started building a real monitoring system.
Why I Chose OpenTelemetry (After Trying Everything Else)
Before settling on OpenTelemetry, I worked through four other React Native monitoring options:
Firebase Performance Monitoring (2 months)
Pros: Easy setup, free tier, good basic metrics
Cons: Limited customization, no distributed tracing, vendor lock-in
Datadog RUM (3 months)
Pros: Rich dashboards, great alerting, real user monitoring
Cons: Expensive ($50/month per user), React Native support was buggy
New Relic Mobile (1 month)
Cons: Crashed our app during high traffic, poor React Native docs
Sentry Performance (2 weeks)
Cons: Missing crucial mobile-specific features we needed
OpenTelemetry solved all these problems:
- Vendor independence: Switch monitoring backends by swapping an exporter instead of re-instrumenting the app
- Standardized data: Same format for traces, metrics, logs
- Rich ecosystem: Exporters and collector integrations exist for most major backends
- Future-proof: Industry standard backed by CNCF
Most importantly: It actually worked in production.
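To make the vendor-independence point concrete, here is a minimal sketch of the idea in plain TypeScript. The `SpanRecord` and `Exporter` types and both exporter classes are hypothetical stand-ins invented for illustration; they are not the OpenTelemetry SDK interfaces, which carry far more detail.

```typescript
// Hypothetical stand-ins for illustration; the real SDK types are richer.
interface SpanRecord {
  name: string;
  durationMs: number;
}

interface Exporter {
  export(spans: SpanRecord[]): string[];
}

// Each backend formats the same span data its own way.
class ConsoleStyleExporter implements Exporter {
  export(spans: SpanRecord[]): string[] {
    return spans.map((s) => `console: ${s.name} took ${s.durationMs}ms`);
  }
}

class JsonStyleExporter implements Exporter {
  export(spans: SpanRecord[]): string[] {
    return spans.map((s) => JSON.stringify({ span: s.name, ms: s.durationMs }));
  }
}

type BackendName = 'console' | 'json';

// The only place that knows which vendor is in use.
function createExporter(backend: BackendName): Exporter {
  return backend === 'console' ? new ConsoleStyleExporter() : new JsonStyleExporter();
}

// Instrumentation code depends on the interface, never on a vendor class.
function recordCheckout(exporter: Exporter): string[] {
  return exporter.export([{ name: 'checkout', durationMs: 120 }]);
}
```

Changing backends here means changing the argument to `createExporter`; `recordCheckout` and every other call site stay untouched.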
The Architecture That Handles 2M+ Events Daily
Here's our production setup that processes over 2 million telemetry events daily:
The Setup That Actually Works in Production
After 18 months of iteration, here's the production-ready implementation:
Core OpenTelemetry Setup
// telemetry/provider.ts - The foundation that handles 2M events/day
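// Caution: @opentelemetry/sdk-node is built for Node.js. In a React Native runtime it may need
// polyfills, or the providers can be assembled directly from sdk-trace-base and sdk-metrics.
import { NodeSDK } from '@opentelemetry/sdk-node';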
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Platform } from 'react-native';
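import DeviceInfo from 'react-native-device-info';
// The Firebase exporter is defined in telemetry/firebase-integration.ts (shown later in this post)
import { ProductionFirebaseExporter as FirebaseExporter } from './firebase-integration';
// DatadogExporter is not defined in this post; import it from the exporter module your project uses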
interface TelemetryConfig {
environment: 'development' | 'staging' | 'production';
enabledExporters: string[];
samplingRate: number;
maxBatchSize: number;
exportInterval: number;
}
class ProductionTelemetryProvider {
private sdk: NodeSDK | null = null;
private isInitialized = false;
async initialize(config: TelemetryConfig) {
if (this.isInitialized) {
console.warn('Telemetry already initialized');
return;
}
try {
const deviceInfo = await this.getDeviceInfo();
const resource = new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'my-react-native-app',
[SemanticResourceAttributes.SERVICE_VERSION]: deviceInfo.appVersion,
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: config.environment,
// Mobile-specific attributes that saved debugging time
'mobile.platform': Platform.OS,
'mobile.platform.version': deviceInfo.systemVersion,
'device.model': deviceInfo.deviceId,
'device.manufacturer': deviceInfo.brand,
'app.build': deviceInfo.buildNumber,
'app.bundle_id': deviceInfo.bundleId,
// Network info helps debug connectivity issues
'network.carrier': deviceInfo.carrier,
'device.memory': deviceInfo.totalMemory,
});
// Multiple exporters for redundancy - learned this from production outages
const exporters = this.createExporters(config);
this.sdk = new NodeSDK({
resource,
spanProcessors: exporters.spanProcessors,
metricReader: new PeriodicExportingMetricReader({
exporter: exporters.metricExporter,
exportIntervalMillis: config.exportInterval,
}),
// Sampling strategy that survived Black Friday traffic
sampler: this.createAdaptiveSampler(config.samplingRate),
instrumentations: this.getInstrumentations(),
});
await this.sdk.start();
this.isInitialized = true;
console.log('Production telemetry initialized', {
environment: config.environment,
exporters: config.enabledExporters,
samplingRate: config.samplingRate,
});
} catch (error) {
console.error('Failed to initialize telemetry:', error);
// Don't crash the app if telemetry fails
}
}
private async getDeviceInfo() {
// Gather all device info in parallel for faster startup
const [
appVersion,
buildNumber,
bundleId,
deviceId,
brand,
systemVersion,
carrier,
totalMemory,
] = await Promise.all([
DeviceInfo.getVersion(),
DeviceInfo.getBuildNumber(),
DeviceInfo.getBundleId(),
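DeviceInfo.getDeviceId(), // model identifier (e.g. "iPhone7,2"); getUniqueId() would put a per-device ID into 'device.model'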
DeviceInfo.getBrand(),
DeviceInfo.getSystemVersion(),
DeviceInfo.getCarrier().catch(() => 'unknown'),
DeviceInfo.getTotalMemory().catch(() => 0),
]);
return {
appVersion,
buildNumber,
bundleId,
deviceId,
brand,
systemVersion,
carrier,
totalMemory,
};
}
private createExporters(config: TelemetryConfig) {
const spanProcessors: any[] = [];
let metricExporter: any = null;
// Primary exporter - Datadog for rich analytics
if (config.enabledExporters.includes('datadog')) {
const datadogExporter = new DatadogExporter({
apiKey: process.env.DATADOG_API_KEY!,
service: 'mobile-app',
env: config.environment,
});
spanProcessors.push(new BatchSpanProcessor(datadogExporter, {
maxExportBatchSize: config.maxBatchSize,
scheduledDelayMillis: config.exportInterval,
// Aggressive timeout to prevent memory buildup
exportTimeoutMillis: 10000,
}));
metricExporter = datadogExporter;
}
// Secondary exporter - Firebase for basic monitoring
if (config.enabledExporters.includes('firebase')) {
spanProcessors.push(new BatchSpanProcessor(new FirebaseExporter(), {
maxExportBatchSize: 50, // Smaller batches for Firebase
scheduledDelayMillis: 30000, // Less frequent for free tier
}));
}
return { spanProcessors, metricExporter };
}
private createAdaptiveSampler(baseRate: number) {
// Numeric values of SamplingDecision in @opentelemetry/sdk-trace-base:
// 0 = NOT_RECORD, 1 = RECORD (recorded but never exported), 2 = RECORD_AND_SAMPLED
const NOT_RECORD = 0;
const RECORD_AND_SAMPLED = 2;
const sampleAt = (rate: number) => ({
decision: Math.random() < Math.min(rate, 1) ? RECORD_AND_SAMPLED : NOT_RECORD,
});
// Custom sampler that varies the sampling rate by span name
return {
shouldSample: (context: any, traceId: string, spanName: string) => {
// Always sample errors
if (spanName.includes('error') || spanName.includes('crash')) {
return { decision: RECORD_AND_SAMPLED };
}
// Sample critical user flows at higher rate
if (spanName.includes('payment') || spanName.includes('login')) {
return sampleAt(baseRate * 2);
}
// Reduced sampling for high-frequency events
if (spanName.includes('scroll') || spanName.includes('animation')) {
return sampleAt(baseRate * 0.1);
}
return sampleAt(baseRate);
},
// The Sampler interface also requires a description
toString: () => `AdaptiveSampler{${baseRate}}`,
};
}
async shutdown() {
if (this.sdk && this.isInitialized) {
await this.sdk.shutdown();
this.isInitialized = false;
}
}
}
export const telemetryProvider = new ProductionTelemetryProvider();
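The name-based sampling rules in `createAdaptiveSampler` are easier to reason about, and to unit test, when the rate calculation is separated from the random draw. The following is a sketch under that assumption; `samplingRateFor` and `shouldKeep` are hypothetical helper names, and only the span-name rules are taken from the sampler above.

```typescript
// Hypothetical helpers; only the span-name rules come from the sampler above.
function samplingRateFor(spanName: string, baseRate: number): number {
  if (spanName.includes('error') || spanName.includes('crash')) {
    return 1; // always keep errors
  }
  let rate = baseRate;
  if (spanName.includes('payment') || spanName.includes('login')) {
    rate = baseRate * 2; // critical user flows
  } else if (spanName.includes('scroll') || spanName.includes('animation')) {
    rate = baseRate * 0.1; // high-frequency, low-value events
  }
  return Math.min(Math.max(rate, 0), 1); // clamp to a valid probability
}

// `roll` is a number in [0, 1); injecting it keeps the decision deterministic in tests.
function shouldKeep(spanName: string, baseRate: number, roll: number = Math.random()): boolean {
  return roll < samplingRateFor(spanName, baseRate);
}
```

With a base rate of 0.25, a payment span is kept half the time, while a base rate of 0.8 is clamped to 1 for the same span rather than silently exceeding a valid probability.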
React Native Performance Monitoring
This is the class that caught our payment flow bug:
// telemetry/performance-monitor.ts - The class that saved $50K
import { trace, metrics, context } from '@opentelemetry/api';
import perf from '@react-native-firebase/perf';
import { AppState, AppStateStatus } from 'react-native';
class ProductionPerformanceMonitor {
private tracer = trace.getTracer('app-performance', '1.0.0');
private meter = metrics.getMeter('app-metrics', '1.0.0');
// Metrics that actually matter in production
private screenLoadTime = this.meter.createHistogram('screen_load_duration', {
description: 'Time to load screens',
unit: 'ms',
});
private apiCallDuration = this.meter.createHistogram('api_call_duration', {
description: 'API response times by endpoint',
unit: 'ms',
});
private userJourneyCompletion = this.meter.createCounter('user_journey_completion', {
description: 'Completed user journeys',
});
private criticalErrors = this.meter.createCounter('critical_errors', {
description: 'Errors that affect core functionality',
});
constructor() {
this.setupAppStateTracking();
}
// Track screen loads with actual business impact
async measureScreenLoad<T>(
screenName: string,
loadFunction: () => Promise<T>,
isBusinessCritical = false
): Promise<T> {
const span = this.tracer.startSpan(`screen_load_${screenName}`);
const startTime = Date.now();
// Firebase trace for free monitoring
let firebaseTrace: any = null;
try {
firebaseTrace = perf().newTrace(`screen_${screenName}`);
firebaseTrace.start();
} catch (error) {
// Firebase can fail, don't crash the app
console.warn('Firebase trace failed:', error);
}
span.setAttributes({
'screen.name': screenName,
'screen.business_critical': isBusinessCritical,
'screen.timestamp': startTime,
});
try {
const result = await loadFunction();
const duration = Date.now() - startTime;
// Record metrics
this.screenLoadTime.record(duration, {
screen: screenName,
success: 'true',
critical: isBusinessCritical.toString(),
});
// Alert on slow critical screens
if (isBusinessCritical && duration > 3000) {
this.criticalErrors.add(1, {
type: 'slow_critical_screen',
screen: screenName,
duration: duration.toString(),
});
}
span.setAttributes({
'screen.load_duration': duration,
'screen.success': true,
});
span.setStatus({ code: 1 }); // OK
return result;
} catch (error: any) {
const duration = Date.now() - startTime;
this.screenLoadTime.record(duration, {
screen: screenName,
success: 'false',
error: error.name,
});
// Always alert on screen load failures
this.criticalErrors.add(1, {
type: 'screen_load_failure',
screen: screenName,
error: error.message,
});
span.recordException(error);
span.setStatus({ code: 2, message: error.message });
firebaseTrace?.putAttribute('error', 'true');
throw error;
} finally {
span.end();
firebaseTrace?.stop();
}
}
// API monitoring that caught our payment bug
async instrumentApiCall<T>(
endpoint: string,
method: string,
apiCall: () => Promise<T>,
businessContext?: {
userId?: string;
feature?: string;
monetaryValue?: number;
}
): Promise<T> {
const span = this.tracer.startSpan(`api_${method.toLowerCase()}_${this.sanitizeEndpoint(endpoint)}`);
const startTime = Date.now();
span.setAttributes({
'http.method': method,
'http.url': endpoint,
'api.business_context': JSON.stringify(businessContext || {}),
'api.timestamp': startTime,
});
try {
const result = await apiCall();
const duration = Date.now() - startTime;
this.apiCallDuration.record(duration, {
endpoint: this.sanitizeEndpoint(endpoint),
method,
status: 'success',
business_critical: businessContext?.monetaryValue ? 'true' : 'false',
});
// Alert on slow payment APIs
if (businessContext?.monetaryValue && duration > 5000) {
this.criticalErrors.add(1, {
type: 'slow_payment_api',
endpoint: this.sanitizeEndpoint(endpoint),
duration: duration.toString(),
value: businessContext.monetaryValue.toString(),
});
}
span.setAttributes({
'http.status_code': 200,
'http.response_time': duration,
'api.success': true,
});
return result;
} catch (error: any) {
const duration = Date.now() - startTime;
this.apiCallDuration.record(duration, {
endpoint: this.sanitizeEndpoint(endpoint),
method,
status: 'error',
error_type: error.name,
});
// Always alert on payment API failures
if (businessContext?.monetaryValue) {
this.criticalErrors.add(1, {
type: 'payment_api_failure',
endpoint: this.sanitizeEndpoint(endpoint),
error: error.message,
user_id: businessContext.userId || 'unknown',
value: businessContext.monetaryValue.toString(),
});
}
span.recordException(error);
span.setStatus({ code: 2, message: error.message }); // ERROR, as in measureScreenLoad
span.setAttributes({
'http.status_code': error.status || 500,
'error.name': error.name,
'error.message': error.message,
'api.success': false,
});
throw error;
} finally {
span.end();
}
}
// Track complete user journeys, not just individual actions.
// Open journey spans are kept by id so completeUserJourney can find and end them later.
// trace.getActiveSpan() cannot be used for this: the journey span is never active
// across the separate user interactions that make up a journey.
private journeys = new Map<string, { name: string; span: import('@opentelemetry/api').Span }>();
startUserJourney(journeyName: string, userId?: string): string {
const journeyId = `${journeyName}_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
const span = this.tracer.startSpan(`user_journey_${journeyName}`, {
attributes: {
'journey.name': journeyName,
'journey.id': journeyId,
'user.id': userId || 'anonymous',
'journey.start_time': Date.now(),
},
});
// Store the span for later steps
this.journeys.set(journeyId, { name: journeyName, span });
return journeyId;
}
completeUserJourney(journeyId: string, success: boolean, metadata?: Record<string, any>) {
const journey = this.journeys.get(journeyId);
if (!journey) {
return; // Unknown or already completed journey
}
const { name, span } = journey;
span.setAttributes({
'journey.completed': success,
'journey.end_time': Date.now(),
...metadata,
});
if (success) {
this.userJourneyCompletion.add(1, {
journey: name,
success: 'true',
});
} else {
this.criticalErrors.add(1, {
type: 'journey_failure',
journey: name,
step: metadata?.failedStep || 'unknown',
});
}
span.setStatus({
code: success ? 1 : 2,
message: success ? 'Journey completed' : 'Journey failed',
});
span.end();
this.journeys.delete(journeyId);
}
private sanitizeEndpoint(endpoint: string): string {
// Remove sensitive data from endpoints for metrics
return endpoint
.replace(/\/\d+/g, '/:id')
// Capture the separator so "&token=" is not rewritten to "?token="
.replace(/([?&])token=[^&]*/g, '$1token=***')
.replace(/([?&])api_key=[^&]*/g, '$1api_key=***');
}
private setupAppStateTracking() {
let backgroundTime = 0;
AppState.addEventListener('change', (nextAppState: AppStateStatus) => {
if (nextAppState === 'background') {
backgroundTime = Date.now();
// Force flush telemetry before backgrounding
this.flushTelemetry();
} else if (nextAppState === 'active' && backgroundTime > 0) {
const backgroundDuration = Date.now() - backgroundTime;
// Track app resume
const resumeSpan = this.tracer.startSpan('app_resume');
resumeSpan.setAttributes({
'app.background_duration': backgroundDuration,
'app.resume_time': Date.now(),
});
resumeSpan.end();
backgroundTime = 0;
}
});
}
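private async flushTelemetry() {
try {
// Force export of pending spans. The global provider is a proxy, so unwrap it
// and call forceFlush() only if the underlying SDK provider exposes it.
const provider: any = trace.getTracerProvider();
const delegate = provider.getDelegate?.() ?? provider;
await delegate.forceFlush?.();
} catch (error) {
console.warn('Failed to flush telemetry:', error);
}
}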
}
export const performanceMonitor = new ProductionPerformanceMonitor();
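Endpoint sanitization matters because the sanitized string becomes a metric label: unbounded IDs inflate cardinality and raw tokens leak secrets. Below is a standalone sketch of a slightly stricter variant than the method above. It is an illustration, not the code the app ships: it rewrites only path segments that are entirely numeric and keeps whichever separator preceded the secret.

```typescript
function sanitizeEndpoint(endpoint: string): string {
  return endpoint
    // Replace a path segment only when the whole segment is digits
    .replace(/\/\d+(?=\/|\?|$)/g, '/:id')
    // Mask secrets while keeping the original "?" or "&" separator
    .replace(/([?&])(token|api_key)=[^&]*/g, '$1$2=***');
}
```

For example, `/api/users/42/orders/7?page=2&token=abc` becomes `/api/users/:id/orders/:id?page=2&token=***`, while `/api/2fa/setup` is left unchanged because `2fa` is not a purely numeric segment.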
Navigation Tracking That Actually Helps
Standard navigation tracking is useless. This tracks what actually matters:
// telemetry/navigation-instrumentation.tsx - Navigation tracking that matters (.tsx because the container below uses JSX)
import { NavigationContainer, NavigationContainerRef } from '@react-navigation/native';
import { trace, metrics } from '@opentelemetry/api';
import React, { useRef, useCallback } from 'react';
const tracer = trace.getTracer('navigation', '1.0.0');
const meter = metrics.getMeter('navigation-metrics', '1.0.0');
// Metrics that help optimize user experience
const screenTransitionTime = meter.createHistogram('screen_transition_duration', {
description: 'Time between screen transitions',
unit: 'ms',
});
const navigationDropoff = meter.createCounter('navigation_dropoff', {
description: 'Users who drop off at specific screens',
});
const deepLinkUsage = meter.createCounter('deep_link_usage', {
description: 'Deep link navigation usage',
});
interface NavigationEvent {
from: string;
to: string;
params?: any;
timestamp: number;
userId?: string;
}
class NavigationTelemetry {
private navigationHistory: NavigationEvent[] = [];
private maxHistorySize = 50;
trackNavigation(event: NavigationEvent) {
// Add to history
this.navigationHistory.push(event);
if (this.navigationHistory.length > this.maxHistorySize) {
this.navigationHistory.shift();
}
// Create span for navigation
const span = tracer.startSpan('screen_navigation');
span.setAttributes({
'navigation.from': event.from,
'navigation.to': event.to,
'navigation.params': JSON.stringify(event.params || {}),
'navigation.timestamp': event.timestamp,
'user.id': event.userId || 'anonymous',
});
// Record metrics
if (this.navigationHistory.length > 1) {
const previousEvent = this.navigationHistory[this.navigationHistory.length - 2];
const transitionTime = event.timestamp - previousEvent.timestamp;
screenTransitionTime.record(transitionTime, {
from: event.from,
to: event.to,
});
// Track quick exits (user confusion indicator)
if (transitionTime < 2000) {
navigationDropoff.add(1, {
screen: event.from,
quick_exit: 'true',
time_spent: transitionTime.toString(),
});
}
}
// Track deep link usage
if (event.params && Object.keys(event.params).length > 0) {
deepLinkUsage.add(1, {
screen: event.to,
has_params: 'true',
});
}
span.end();
}
getNavigationPath(): string[] {
return this.navigationHistory.map(event => event.to);
}
analyzeFunnelDropoff(): Record<string, number> {
const dropoffRates: Record<string, number> = {};
for (let i = 0; i < this.navigationHistory.length - 1; i++) {
const current = this.navigationHistory[i];
const next = this.navigationHistory[i + 1];
const timeSpent = next.timestamp - current.timestamp;
if (timeSpent < 5000) { // Less than 5 seconds = potential confusion
dropoffRates[current.to] = (dropoffRates[current.to] || 0) + 1;
}
}
return dropoffRates;
}
}
const navigationTelemetry = new NavigationTelemetry();
export function createTelemetryNavigationContainer() {
return React.forwardRef<NavigationContainerRef<any>, any>((props, ref) => {
const navigationRef = useRef<NavigationContainerRef<any> | null>(null);
const routeNameRef = useRef<string>();
// Populate our own ref even when the caller forwards one; the callbacks below
// read the current route from navigationRef, so it must never stay null
const setRefs = useCallback(
(instance: NavigationContainerRef<any> | null) => {
navigationRef.current = instance;
if (typeof ref === 'function') {
ref(instance);
} else if (ref) {
ref.current = instance;
}
},
[ref]
);
const onReady = useCallback(() => {
const initialRoute = navigationRef.current?.getCurrentRoute();
routeNameRef.current = initialRoute?.name;
if (initialRoute?.name) {
navigationTelemetry.trackNavigation({
from: 'app_start',
to: initialRoute.name,
params: initialRoute.params,
timestamp: Date.now(),
});
}
props.onReady?.(); // Still run the caller's handler, if any
}, [props.onReady]);
const onStateChange = useCallback((state: any) => {
const previousRouteName = routeNameRef.current;
const currentRoute = navigationRef.current?.getCurrentRoute();
const currentRouteName = currentRoute?.name;
if (previousRouteName !== currentRouteName && currentRouteName) {
const now = Date.now();
navigationTelemetry.trackNavigation({
from: previousRouteName || 'unknown',
to: currentRouteName,
params: currentRoute?.params,
timestamp: now,
});
routeNameRef.current = currentRouteName;
}
props.onStateChange?.(state); // Still run the caller's handler, if any
}, [props.onStateChange]);
// Spread props first so the telemetry handlers are not overridden
return (
<NavigationContainer
{...props}
ref={setRefs}
onReady={onReady}
onStateChange={onStateChange}
/>
);
});
}
export { navigationTelemetry };
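To see what the quick-exit analysis produces, here is a small worked example of the same idea as `analyzeFunnelDropoff`, written as a pure function over a navigation history. The `NavEvent` type, the `countQuickExits` name, and the sample data are assumptions made for illustration.

```typescript
interface NavEvent {
  from: string;
  to: string;
  timestamp: number; // milliseconds
}

// Counts, per screen, how often the user left in under `thresholdMs`.
function countQuickExits(history: NavEvent[], thresholdMs = 5000): Record<string, number> {
  const counts: Record<string, number> = {};
  for (let i = 0; i < history.length - 1; i++) {
    const timeSpent = history[i + 1].timestamp - history[i].timestamp;
    if (timeSpent < thresholdMs) {
      const screen = history[i].to;
      counts[screen] = (counts[screen] ?? 0) + 1;
    }
  }
  return counts;
}

// Made-up session: 12s on Home, 2s on Cart, 1.5s on Payment, then back to Cart.
const sampleHistory: NavEvent[] = [
  { from: 'app_start', to: 'Home', timestamp: 0 },
  { from: 'Home', to: 'Cart', timestamp: 12000 },
  { from: 'Cart', to: 'Payment', timestamp: 14000 },
  { from: 'Payment', to: 'Cart', timestamp: 15500 },
];

const quickExits = countQuickExits(sampleHistory);
console.log(quickExits); // { Cart: 1, Payment: 1 }
```

The last screen in the history never counts, because there is no following event to measure its dwell time against.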
Error Tracking That Actually Catches Issues
Standard error tracking misses the context you need. This captures what you need to fix bugs:
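// telemetry/error-tracking.tsx - Error tracking that helps debugging (.tsx because the error boundary below uses JSX)
import React from 'react';
import { View, Text } from 'react-native';
import { trace, context } from '@opentelemetry/api';
import crashlytics from '@react-native-firebase/crashlytics';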
interface ErrorContext {
userId?: string;
screenName?: string;
userJourney?: string[];
networkState?: string;
memoryUsage?: number;
batteryLevel?: number;
businessContext?: {
feature?: string;
monetaryValue?: number;
customerTier?: string;
};
}
class ProductionErrorTracker {
private tracer = trace.getTracer('error-tracking', '1.0.0');
private errorCount = 0;
private recentErrors: Array<{ error: Error; context?: ErrorContext; timestamp: number }> = [];
captureError(error: Error, errorContext?: ErrorContext) {
const timestamp = Date.now();
this.errorCount++;
// Store recent errors for pattern analysis
this.recentErrors.push({ error, context: errorContext, timestamp });
if (this.recentErrors.length > 100) {
this.recentErrors.shift();
}
// Create comprehensive error span
const span = this.tracer.startSpan('error_occurred');
span.setAttributes({
'error.type': error.name,
'error.message': error.message,
'error.stack': this.sanitizeStack(error.stack || ''),
'error.timestamp': timestamp,
'error.sequence_number': this.errorCount,
// Device context
'device.memory_usage': errorContext?.memoryUsage || 0,
'device.battery_level': errorContext?.batteryLevel ?? 1, // ?? keeps a real reading of 0
'device.network_state': errorContext?.networkState || 'unknown',
// User context
'user.id': errorContext?.userId || 'anonymous',
'user.screen': errorContext?.screenName || 'unknown',
'user.journey': JSON.stringify(errorContext?.userJourney || []),
// Business context
'business.feature': errorContext?.businessContext?.feature || 'unknown',
'business.monetary_value': errorContext?.businessContext?.monetaryValue || 0,
'business.customer_tier': errorContext?.businessContext?.customerTier || 'unknown',
});
// Enhanced Firebase Crashlytics logging
try {
if (errorContext?.userId) {
crashlytics().setUserId(errorContext.userId);
}
// Set custom attributes for better filtering
crashlytics().setAttributes({
screen_name: errorContext?.screenName || 'unknown',
network_state: errorContext?.networkState || 'unknown',
business_feature: errorContext?.businessContext?.feature || 'unknown',
customer_tier: errorContext?.businessContext?.customerTier || 'unknown',
error_sequence: this.errorCount.toString(),
});
// Add breadcrumbs from user journey
if (errorContext?.userJourney) {
errorContext.userJourney.forEach((step, index) => {
crashlytics().log(`Journey step ${index + 1}: ${step}`);
});
}
crashlytics().recordError(error);
} catch (crashlyticsError) {
console.warn('Crashlytics logging failed:', crashlyticsError);
}
// Pattern detection
this.detectErrorPatterns();
// Add to current span context if available
const activeSpan = trace.getActiveSpan();
if (activeSpan) {
activeSpan.recordException(error);
activeSpan.setStatus({
code: 2, // ERROR
message: error.message,
});
}
span.end();
// Log for immediate debugging
console.error('Production error captured:', {
error: error.message,
context: errorContext,
sequence: this.errorCount,
});
}
// Detect error patterns that indicate systemic issues
private detectErrorPatterns() {
const recentWindow = Date.now() - 5 * 60 * 1000; // Last 5 minutes
const recentErrors = this.recentErrors.filter(e => e.timestamp > recentWindow);
if (recentErrors.length >= 5) {
// Check for error storm
const errorTypes = new Map<string, number>();
recentErrors.forEach(({ error }) => {
errorTypes.set(error.name, (errorTypes.get(error.name) || 0) + 1);
});
errorTypes.forEach((count, errorType) => {
if (count >= 3) {
this.reportErrorPattern('error_storm', {
error_type: errorType,
count: count.toString(),
time_window: '5_minutes',
});
}
});
}
// Check for user-specific issues
const userErrors = new Map<string, number>();
recentErrors.forEach(({ context }) => {
if (context?.userId) {
userErrors.set(context.userId, (userErrors.get(context.userId) || 0) + 1);
}
});
userErrors.forEach((count, userId) => {
if (count >= 3) {
this.reportErrorPattern('user_error_cluster', {
user_id: userId,
count: count.toString(),
});
}
});
}
private reportErrorPattern(patternType: string, attributes: Record<string, string>) {
const span = this.tracer.startSpan(`error_pattern_${patternType}`);
span.setAttributes({
'pattern.type': patternType,
'pattern.timestamp': Date.now(),
...attributes,
});
span.end();
console.warn(`Error pattern detected: ${patternType}`, attributes);
}
private sanitizeStack(stack: string): string {
// Remove sensitive information from stack traces
return stack
.replace(/token=[^&\s]*/g, 'token=***')
.replace(/apikey=[^&\s]*/g, 'apikey=***')
.replace(/password=[^&\s]*/g, 'password=***');
}
// Global error handlers that saved production
setupGlobalErrorHandling() {
// React Native JS errors
const originalHandler = ErrorUtils.getGlobalHandler();
ErrorUtils.setGlobalHandler((error, isFatal) => {
this.captureError(error, {
businessContext: { feature: 'global_js_error' },
});
// Don't prevent the original handler from running
originalHandler(error, isFatal);
});
// Unhandled promise rejections. The optional call is a no-op on runtimes
// where global.addEventListener is undefined.
global.addEventListener?.('unhandledrejection', (event: any) => {
this.captureError(
new Error(`Unhandled Promise Rejection: ${event.reason}`),
{
businessContext: { feature: 'unhandled_promise' },
}
);
});
console.log('Global error handlers installed');
}
// Business-specific error tracking
trackBusinessError(
errorType: 'payment_failure' | 'login_failure' | 'api_timeout' | 'feature_unavailable',
error: Error,
businessContext: {
userId?: string;
monetaryValue?: number;
customerTier?: string;
feature: string;
}
) {
this.captureError(error, {
businessContext,
screenName: 'business_operation',
});
// Immediate alerts for high-value errors
if (businessContext.monetaryValue && businessContext.monetaryValue > 100) {
console.error('HIGH VALUE ERROR:', {
type: errorType,
value: businessContext.monetaryValue,
customer: businessContext.customerTier,
user: businessContext.userId,
});
}
}
}
export const errorTracker = new ProductionErrorTracker();
// Error boundary that actually helps
export class TelemetryErrorBoundary extends React.Component<
{
children: React.ReactNode;
fallback?: React.ComponentType<{ error: Error; retry: () => void }>;
context?: Partial<ErrorContext>;
},
{ hasError: boolean; error?: Error }
> {
constructor(props: any) {
super(props);
this.state = { hasError: false };
}
static getDerivedStateFromError(error: Error) {
return { hasError: true, error };
}
componentDidCatch(error: Error, errorInfo: React.ErrorInfo) {
errorTracker.captureError(error, {
...this.props.context,
businessContext: {
feature: 'react_error_boundary',
},
});
}
render() {
if (this.state.hasError && this.state.error) {
if (this.props.fallback) {
return React.createElement(this.props.fallback, {
error: this.state.error,
retry: () => this.setState({ hasError: false, error: undefined })
});
}
return (
<View style={{ flex: 1, justifyContent: 'center', alignItems: 'center' }}>
<Text>Something went wrong. Please restart the app.</Text>
</View>
);
}
return this.props.children;
}
}
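The error-storm check in `detectErrorPatterns` boils down to counting error names inside a trailing time window. Here is a simplified sketch of that core as a pure function. `CapturedError` and `findErrorStorms` are hypothetical names, and unlike the class above it applies no minimum on the total number of recent errors.

```typescript
interface CapturedError {
  name: string;
  timestamp: number; // milliseconds
}

// Returns error names seen at least `minCount` times within the trailing window.
function findErrorStorms(
  errors: CapturedError[],
  now: number,
  windowMs = 5 * 60 * 1000,
  minCount = 3
): string[] {
  const counts = new Map<string, number>();
  errors.forEach((e) => {
    if (e.timestamp > now - windowMs) {
      counts.set(e.name, (counts.get(e.name) ?? 0) + 1);
    }
  });
  const storms: string[] = [];
  counts.forEach((count, name) => {
    if (count >= minCount) {
      storms.push(name);
    }
  });
  return storms;
}
```

Passing `now` in as a parameter, rather than calling `Date.now()` inside, makes the window boundary easy to test.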
The Firebase Integration That Doesn't Break
Firebase Performance Monitoring is great for getting started, but it needs careful integration:
// telemetry/firebase-integration.ts - Firebase integration that works
import perf from '@react-native-firebase/perf';
import { SpanExporter, ReadableSpan } from '@opentelemetry/sdk-trace-base';
import { ExportResult, ExportResultCode } from '@opentelemetry/core';
export class ProductionFirebaseExporter implements SpanExporter {
private activeTraces = new Map<string, any>();
private maxConcurrentTraces = 50; // Firebase has limits
export(spans: ReadableSpan[], resultCallback: (result: ExportResult) => void): void {
try {
// Process spans in chunks to avoid overwhelming Firebase
const chunks = this.chunkArray(spans, 10);
chunks.forEach((chunk, index) => {
setTimeout(() => {
chunk.forEach(span => this.processSpan(span));
}, index * 100); // Stagger processing
});
resultCallback({ code: ExportResultCode.SUCCESS });
} catch (error) {
console.error('Firebase export error:', error);
resultCallback({ code: ExportResultCode.FAILED });
}
}
private async processSpan(span: ReadableSpan) {
const { name, duration, attributes, status } = span;
// Skip spans that Firebase doesn't handle well
if (this.shouldSkipSpan(name, attributes)) {
return;
}
// Clean up trace name for Firebase
const traceName = this.cleanTraceName(name);
// Manage concurrent traces to avoid Firebase limits
if (this.activeTraces.size >= this.maxConcurrentTraces) {
console.warn('Too many active Firebase traces, skipping:', traceName);
return;
}
try {
const trace = perf().newTrace(traceName);
this.activeTraces.set(traceName, trace);
// Add attributes (Firebase has limits on these too)
this.addSafeAttributes(trace, attributes);
// Add business metrics
this.addBusinessMetrics(trace, attributes);
// Simulate trace timing
trace.start();
setTimeout(() => {
try {
if (status?.code === 2) { // ERROR
trace.putAttribute('error', 'true');
trace.putMetric('error_count', 1);
}
trace.stop();
this.activeTraces.delete(traceName);
} catch (stopError) {
console.warn('Firebase trace stop failed:', stopError);
}
}, Math.min(duration[0] * 1000 + duration[1] / 1e6, 60000)); // duration is HrTime [seconds, nanoseconds]; max 60s trace
} catch (error) {
console.warn('Firebase trace creation failed:', error);
this.activeTraces.delete(traceName);
}
}
private shouldSkipSpan(name: string, attributes: any): boolean {
// Skip high-frequency, low-value spans
if (name.includes('scroll') || name.includes('animation')) {
return true;
}
// Skip internal telemetry spans
if (name.includes('telemetry') || name.includes('metric')) {
return true;
}
// Skip spans that carry none of the timing attributes set elsewhere in this post
if (
!attributes['duration'] &&
!attributes['http.response_time'] &&
!attributes['screen.load_duration']
) {
return true;
}
return false;
}
private cleanTraceName(name: string): string {
// Firebase has strict naming requirements
return name
.replace(/[^a-zA-Z0-9_]/g, '_')
.substring(0, 100) // Firebase limit
.toLowerCase();
}
private addSafeAttributes(trace: any, attributes: any) {
const safeAttributes: Record<string, string> = {};
let attributeCount = 0;
const maxAttributes = 5; // Firebase Performance allows at most 5 custom attributes per trace
// Prioritize business-relevant attributes
const priorities = [
'user.id',
'screen.name',
'http.status_code',
'business.feature',
'error.type',
];
priorities.forEach(key => {
if (attributes[key] && attributeCount < maxAttributes) {
safeAttributes[key.replace('.', '_')] = String(attributes[key]).substring(0, 100);
attributeCount++;
}
});
// Add remaining attributes until limit
Object.entries(attributes).forEach(([key, value]) => {
if (!priorities.includes(key) && attributeCount < maxAttributes) {
const safeKey = key.replace(/[^a-zA-Z0-9_]/g, '_');
safeAttributes[safeKey] = String(value).substring(0, 100);
attributeCount++;
}
});
// Set attributes on trace
Object.entries(safeAttributes).forEach(([key, value]) => {
try {
trace.putAttribute(key, value);
} catch (error) {
console.warn(`Failed to set Firebase attribute ${key}:`, error);
}
});
}
private addBusinessMetrics(trace: any, attributes: any) {
// Add metrics that matter for business monitoring
try {
if (attributes['http.status_code']) {
trace.putMetric('http_status', Number(attributes['http.status_code']));
}
if (attributes['http.response_time']) { // Same key that instrumentApiCall sets
trace.putMetric('response_time_ms', Number(attributes['http.response_time']));
}
if (attributes['business.monetary_value']) {
trace.putMetric('monetary_value', Number(attributes['business.monetary_value']));
}
if (attributes['screen.load_duration']) {
trace.putMetric('load_time_ms', Number(attributes['screen.load_duration']));
}
} catch (error) {
console.warn('Failed to add Firebase metrics:', error);
}
}
private chunkArray<T>(array: T[], chunkSize: number): T[][] {
const chunks: T[][] = [];
for (let i = 0; i < array.length; i += chunkSize) {
chunks.push(array.slice(i, i + chunkSize));
}
return chunks;
}
async shutdown(): Promise<void> {
// Clean up any remaining traces
this.activeTraces.forEach(trace => {
try {
trace.stop();
} catch (error) {
console.warn('Error stopping Firebase trace during shutdown:', error);
}
});
this.activeTraces.clear();
}
}
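Because only a handful of attributes fit on each Firebase trace, the order in which attributes are chosen decides what you can filter by later. The sketch below isolates that prioritization as a pure function. `pickTraceAttributes` is a hypothetical name, the priority list is copied from `addSafeAttributes`, and one deliberate difference is that this version skips only `null` and `undefined` values, so a legitimate `0` is kept.

```typescript
const PRIORITY_KEYS = ['user.id', 'screen.name', 'http.status_code', 'business.feature', 'error.type'];

function pickTraceAttributes(
  attributes: Record<string, unknown>,
  maxAttributes = 5
): Record<string, string> {
  const present = (key: string) => attributes[key] !== undefined && attributes[key] !== null;
  // Priority keys first, then everything else in insertion order
  const ordered = [
    ...PRIORITY_KEYS.filter(present),
    ...Object.keys(attributes).filter((key) => !PRIORITY_KEYS.includes(key) && present(key)),
  ];
  const picked: Record<string, string> = {};
  for (const key of ordered.slice(0, maxAttributes)) {
    // Same key and value clean-up as the exporter: safe characters, 100-char values
    const safeKey = key.replace(/[^a-zA-Z0-9_]/g, '_');
    picked[safeKey] = String(attributes[key]).substring(0, 100);
  }
  return picked;
}
```

Given seven attributes that include `user.id` and `screen.name`, those two are always kept and the remaining three slots go to the other attributes in the order they were set.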
Real Usage Patterns That Actually Help
Here's how I use the telemetry system in actual app code:
Screen Component Tracking
// In a real screen component
import React, { useEffect, useState } from 'react';
import { performanceMonitor } from '../telemetry/performance-monitor';
import { errorTracker } from '../telemetry/error-tracking';
export function PaymentScreen({ route }: any) {
const [loading, setLoading] = useState(true);
const [paymentData, setPaymentData] = useState<any>(null);
// Start user journey tracking once per mount. Keeping the id in component state
// makes it visible to handlePaymentSubmit below as well as to loadPaymentScreen.
const [journeyId] = useState(() =>
performanceMonitor.startUserJourney('payment_flow', route.params?.userId)
);
useEffect(() => {
loadPaymentScreen();
}, []);
const loadPaymentScreen = async () => {
try {
// Measure screen load with business context
const data = await performanceMonitor.measureScreenLoad(
'payment_screen',
async () => {
// Load payment methods
const methods = await performanceMonitor.instrumentApiCall(
'/api/payment-methods',
'GET',
() => api.getPaymentMethods(),
{
userId: route.params?.userId,
feature: 'payment_methods',
monetaryValue: route.params?.totalAmount,
}
);
// Load user preferences
const preferences = await api.getUserPreferences();
return { methods, preferences };
},
true // This is business critical
);
setPaymentData(data);
setLoading(false);
} catch (error) {
errorTracker.trackBusinessError('payment_failure', error as Error, {
userId: route.params?.userId,
monetaryValue: route.params?.totalAmount,
customerTier: route.params?.customerTier,
feature: 'payment_screen_load',
});
setLoading(false);
}
};
const handlePaymentSubmit = async (paymentDetails: any) => {
try {
const result = await performanceMonitor.instrumentApiCall(
'/api/process-payment',
'POST',
() => api.processPayment(paymentDetails),
{
userId: route.params?.userId,
feature: 'payment_processing',
monetaryValue: route.params?.totalAmount,
}
);
// Complete journey successfully
performanceMonitor.completeUserJourney(journeyId, true, {
paymentMethod: paymentDetails.method,
amount: route.params?.totalAmount,
});
// Navigate to success
navigation.navigate('PaymentSuccess', { transactionId: result.id });
} catch (error) {
// Complete journey with failure
performanceMonitor.completeUserJourney(journeyId, false, {
failedStep: 'payment_processing',
error: error.message,
});
errorTracker.trackBusinessError('payment_failure', error as Error, {
userId: route.params?.userId,
monetaryValue: route.params?.totalAmount,
customerTier: route.params?.customerTier,
feature: 'payment_processing',
});
}
};
if (loading) {
return <LoadingSpinner />;
}
return (
<PaymentForm
data={paymentData}
onSubmit={handlePaymentSubmit}
/>
);
}
The Monitoring Setup That Prevented Outages#
After implementing this system, here's what we monitor in production:
Datadog Dashboard Configuration#
// The dashboard that saved us from multiple incidents
export const productionDashboards = {
  "mobile_app_health": {
    "title": "Mobile App Health - Production",
    "widgets": [
      {
        "title": "Critical Business Errors",
        "type": "timeseries",
        "queries": [
          {
            "query": "sum:custom.critical_errors{*} by {error_type}",
            "display_type": "bars"
          }
        ],
        "alert_threshold": 5 // Alert if more than 5 critical errors in 5 min
      },
      {
        "title": "Payment API Response Times",
        "type": "timeseries",
        "queries": [
          {
            "query": "avg:custom.api_call_duration{endpoint:payment*} by {endpoint}",
            "display_type": "line"
          }
        ],
        "alert_threshold": 5000 // Alert if payment APIs exceed 5s
      },
      {
        "title": "Screen Load Performance",
        "type": "heatmap",
        "queries": [
          {
            "query": "custom.screen_load_duration{business_critical:true}"
          }
        ]
      },
      {
        "title": "User Journey Completion Rate",
        "type": "query_value",
        "queries": [
          {
            "query": "sum:custom.user_journey_completion{success:true} / sum:custom.user_journey_completion{*} * 100"
          }
        ]
      },
      {
        "title": "App Crashes by Device",
        "type": "toplist",
        "queries": [
          {
            "query": "sum:custom.critical_errors{type:crash} by {device_model}"
          }
        ]
      }
    ]
  }
};
Alerts That Actually Work#
// Alerts that wake me up for real issues, not noise
export const productionAlerts = {
  "payment_failure_spike": {
    "name": "Payment API Failure Spike",
    "query": "sum(last_5m):sum:custom.critical_errors{type:payment_api_failure} > 3",
    "message": "@slack-payments @pagerduty-critical",
    "priority": "P1",
    "escalation": "immediate"
  },
  "user_journey_drop": {
    "name": "User Journey Completion Drop",
    "query": "avg(last_15m):sum:custom.user_journey_completion{success:true} / sum:custom.user_journey_completion{*} < 0.8",
    "message": "@slack-product @email-team",
    "priority": "P2",
    "escalation": "15_minutes"
  },
  "critical_screen_slow": {
    "name": "Critical Screen Load Time",
    "query": "avg(last_10m):avg:custom.screen_load_duration{business_critical:true} > 5000",
    "message": "@slack-engineering",
    "priority": "P2",
    "escalation": "30_minutes"
  }
};
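To make a condition like "more than 3 payment failures in the last 5 minutes" concrete, here is a minimal sketch of the same rolling-window threshold in TypeScript. It only illustrates the semantics of the alert. It is not how Datadog evaluates monitors internally, and `SlidingWindowCounter` and `shouldAlert` are hypothetical names, not part of any SDK.

```typescript
// Hypothetical helper: counts events inside a rolling time window.
class SlidingWindowCounter {
  private timestamps: number[] = [];
  private windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Record one event that happened at `now` (milliseconds).
  record(now: number): void {
    this.timestamps.push(now);
  }

  // Number of events that fall inside (now - windowMs, now].
  count(now: number): number {
    const cutoff = now - this.windowMs;
    // Drop events that have aged out of the window.
    this.timestamps = this.timestamps.filter(t => t > cutoff);
    return this.timestamps.length;
  }
}

// Fires only when the count is strictly greater than the threshold,
// mirroring the "> 3" comparison in the alert query.
function shouldAlert(counter: SlidingWindowCounter, now: number, threshold: number): boolean {
  return counter.count(now) > threshold;
}
```

A strict "greater than" comparison means exactly three failures in the window stays quiet, which is the kind of detail worth checking against real traffic before wiring an alert to a pager.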
Performance Impact and Optimization#
After 18 months of production use, here are the real performance numbers:
Resource Usage#
- CPU overhead: 2-3% average (measured with Xcode Instruments)
- Memory overhead: 15-20MB (mostly trace buffering)
- Battery impact: Negligible (less than 1% daily drain)
- Network usage: 50-100KB per day per user
Cost Analysis (Monthly)#
- Datadog: $400/month (100M spans, 50GB logs)
- Firebase: $0 (within free tier limits)
- AWS infrastructure: $50/month (OTEL collector)
- Development time saved: 40+ hours/month
- ROI: 10x (debugging efficiency + prevented outages)
Optimization Strategies That Worked#
// Smart sampling that reduced costs by 60%
class AdaptiveSampler {
  // Per-span error rates; these must be updated elsewhere as spans complete (not shown here)
  private errorRate = new Map<string, number>();
  private criticalSessions = new Set<string>();
  shouldSample(spanName: string, attributes: any): boolean {
    // Always sample errors and critical business flows
    if (spanName.includes('error') || attributes['business.monetary_value']) {
      return true;
    }
    // Sample premium-tier users at a higher rate
    if (attributes['user.tier'] === 'premium') {
      return Math.random() < 0.5; // 50% sampling
    }
    // Adaptive sampling based on error rates
    const errorRate = this.errorRate.get(spanName) || 0;
    if (errorRate > 0.05) { // More than 5 percent errors
      return Math.random() < 0.8; // Increase sampling
    }
    // Default sampling
    return Math.random() < 0.1; // 10% base rate
  }
}
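The sampler reads per-span error rates from a map, but the snippet does not show how that map gets filled. One possible approach, assuming an exponential moving average is good enough for your traffic, is sketched below. `ErrorRateTracker`, `recordOutcome`, and the `alpha` smoothing factor are hypothetical names for illustration, not the original implementation.

```typescript
// Hypothetical companion to a sampler: tracks a per-span error rate as an
// exponential moving average (EMA), so recent outcomes weigh more than old ones.
class ErrorRateTracker {
  private rates = new Map<string, number>();
  private alpha: number;

  // alpha close to 1 reacts quickly; alpha close to 0 smooths more.
  constructor(alpha: number = 0.1) {
    this.alpha = alpha;
  }

  // Call once per finished span with whether it ended in an error.
  recordOutcome(spanName: string, failed: boolean): void {
    const previous = this.rates.get(spanName) ?? 0;
    const sample = failed ? 1 : 0;
    this.rates.set(spanName, previous + this.alpha * (sample - previous));
  }

  // Current estimate in the range [0, 1]; unknown spans report 0.
  getRate(spanName: string): number {
    return this.rates.get(spanName) ?? 0;
  }
}
```

An EMA needs only one number per span name, which keeps memory flat on the device. The trade-off is that the estimate lags behind sudden changes by an amount that depends on `alpha`.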
The Results: Observability That Actually Helped#
Issues Caught Before Users Noticed#
- iOS 15.4 Network Bug: Caught API timeouts specific to iOS 15.4 WiFi users 2 days before major rollout
- Memory Leak in Image Caching: Detected 20% RAM usage increase before user complaints
- Payment Race Condition: Found 0.3% payment failures on fast networks using journey tracking
- Android Battery Drain: Identified background process causing 15% battery drain on Samsung devices
Business Impact#
- Faster Issue Resolution: Average debugging time dropped from 6 hours to 45 minutes
- Proactive Fixes: 60% of issues fixed before user reports
- Customer Satisfaction: App store rating improved from 3.2 to 4.6
- Revenue Protection: Prevented an estimated $1.1M+ in lost transactions
Developer Happiness#
- No More Blind Debugging: Context-rich error reports with user journey
- Confidence in Deployments: Comprehensive monitoring catches regressions quickly
- Data-Driven Decisions: Performance budgets backed by real metrics
Hard-Learned Lessons#
1. Start Simple, Evolve Gradually#
Don't try to monitor everything on day one. Start with:
- Critical business flows (payments, login, core features)
- Error tracking with context
- Performance monitoring for key screens
- Basic user journey tracking
2. Context Is Everything#
Raw metrics without context are hard to act on. Always include:
- User context (ID, session, journey)
- Business context (feature, monetary value, customer tier)
- Technical context (device, network, app version)
- Error context (what the user was doing)
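As a rough illustration of the four context groups above, the sketch below flattens a nested context object into dotted, snake_case attribute keys such as `business.monetary_value`. The `TelemetryContext` shape and `buildAttributes` function are hypothetical; the field names are assumptions chosen to mirror the list, so adapt them to whatever your telemetry layer expects.

```typescript
type AttributeValue = string | number | boolean;

// Hypothetical context shape mirroring the four groups in the list.
interface TelemetryContext {
  user?: { id?: string; sessionId?: string; journey?: string };
  business?: { feature?: string; monetaryValue?: number; customerTier?: string };
  technical?: { device?: string; network?: string; appVersion?: string };
  error?: { lastAction?: string };
}

// Convert camelCase field names to snake_case (monetaryValue -> monetary_value).
function toSnakeCase(key: string): string {
  return key.replace(/[A-Z]/g, c => '_' + c.toLowerCase());
}

// Flatten the nested context into "group.field" keys, skipping missing values
// so spans never carry undefined or null attributes.
function buildAttributes(context: TelemetryContext): Record<string, AttributeValue> {
  const attributes: Record<string, AttributeValue> = {};
  for (const [group, fields] of Object.entries(context)) {
    if (!fields) continue;
    for (const [key, value] of Object.entries(fields)) {
      if (value !== undefined && value !== null) {
        attributes[`${group}.${toSnakeCase(key)}`] = value as AttributeValue;
      }
    }
  }
  return attributes;
}
```

Building attributes in one place keeps key names consistent across spans, which matters once dashboards and alerts start filtering on them.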
3. Sampling Strategy Matters#
- Critical flows: 100% sampling
- Business features: 50% sampling
- UI interactions: 10% sampling
- Background tasks: 1% sampling
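The tiers above can be expressed as a small lookup table. This is a minimal sketch, assuming each span is tagged with one of four categories; `SpanCategory`, `SAMPLING_RATES`, and `shouldSampleByCategory` are hypothetical names. The random source is injectable so the decision can be tested deterministically.

```typescript
type SpanCategory = 'critical_flow' | 'business_feature' | 'ui_interaction' | 'background_task';

// Sampling probability per category, taken from the tiers in the list.
const SAMPLING_RATES: Record<SpanCategory, number> = {
  critical_flow: 1.0,     // 100%
  business_feature: 0.5,  // 50%
  ui_interaction: 0.1,    // 10%
  background_task: 0.01,  // 1%
};

// Returns true when the span should be kept. `random` must return a value
// in [0, 1); it defaults to Math.random and can be replaced in tests.
function shouldSampleByCategory(
  category: SpanCategory,
  random: () => number = Math.random
): boolean {
  return random() < SAMPLING_RATES[category];
}
```

Because `Math.random()` never returns 1, a rate of `1.0` always samples, so critical flows are never dropped by this check.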
4. Alerts Should Wake You Up#
Only alert on things that require immediate action:
- Payment processing failures
- Crash rate spikes
- Critical business flow completion drops
- Security-related events
5. Multiple Exporters = Reliability#
Don't rely on a single monitoring provider:
- Primary: Datadog (rich analytics)
- Secondary: Elastic APM (cost control)
- Backup: Firebase (always works)
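One way to get this redundancy is a fan-out wrapper that hands each batch to every exporter and tolerates individual failures. The sketch below uses simplified stand-in interfaces with a callback-style `export` method. These are hypothetical types for illustration, not the actual OpenTelemetry exporter interfaces, which carry more fields.

```typescript
// Simplified stand-ins; all names here are hypothetical.
interface SpanRecord { name: string; durationMs: number }
type ExportCallback = (ok: boolean) => void;
interface SimpleExporter {
  name: string;
  export(spans: SpanRecord[], done: ExportCallback): void;
}

class FanOutExporter {
  private exporters: SimpleExporter[];

  constructor(exporters: SimpleExporter[]) {
    this.exporters = exporters;
  }

  // Sends the batch to every exporter, then calls `done` with the names of
  // the exporters that accepted it. Callers can treat a non-empty list as success.
  export(spans: SpanRecord[], done: (succeeded: string[]) => void): void {
    const succeeded: string[] = [];
    let pending = this.exporters.length;
    if (pending === 0) {
      done(succeeded);
      return;
    }
    for (const exporter of this.exporters) {
      let settled = false;
      const finish = (ok: boolean) => {
        if (settled) return; // ignore duplicate callbacks
        settled = true;
        if (ok) succeeded.push(exporter.name);
        pending -= 1;
        if (pending === 0) done(succeeded);
      };
      try {
        exporter.export(spans, finish);
      } catch {
        finish(false); // a throwing exporter counts as a failure
      }
    }
  }
}
```

Each exporter is isolated by its own `try`/`catch` and `settled` flag, so a provider that throws or calls back twice cannot prevent the others from receiving the batch.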
Getting Started: The 7-Day Implementation Plan#
Day 1-2: Foundation#
- Set up OpenTelemetry provider
- Add basic error tracking
- Implement global error handlers
Day 3-4: Performance Monitoring#
- Add screen load tracking
- Implement API call instrumentation
- Set up navigation tracking
Day 5-6: Business Metrics#
- Track user journeys
- Add custom business events
- Set up critical flow monitoring
Day 7: Production Deployment#
- Configure sampling rates
- Set up alerts
- Create monitoring dashboards
Final Thoughts: Observability as a Product Feature#
After 18 months of building production observability, I've learned that monitoring isn't just a "nice to have" - it's a competitive advantage.
The ability to quickly debug issues, prevent outages, and optimize user experience based on real data has transformed how our team ships features. We went from reactive debugging to proactive optimization.
The initial investment (2 weeks of development + $500/month in tools) pays for itself within the first major issue it helps you solve quickly.
Your users won't thank you for good observability, but they'll definitely complain when you don't have it. Start building yours today.