Documentation as Infrastructure: Scaling Knowledge Across Engineering Teams
Documentation debt kills organizations faster than technical debt. A comprehensive guide to treating documentation as critical infrastructure and scaling knowledge across engineering teams.
When Missing Documentation Cost Us More Than We Expected#
You know that sinking feeling when you realize the person who understood a critical system just left? I've been there, and honestly, it's one of those lessons you hope to learn from someone else's experience rather than your own.
A few years back, our team faced what turned out to be an expensive learning moment. Three senior engineers moved on within six months—normal career progression, nothing dramatic. We did all the "right" things: handovers, knowledge transfer sessions, documentation sprints. But when our payment system hit a snag during our biggest sales weekend, we discovered the gap between documented procedures and deep system understanding.
The recovery took longer than anyone wanted—about 18 hours of stressed engineers, worried executives, and customers wondering what was happening. The revenue impact was significant, but honestly, the bigger hit was realizing how fragile our knowledge architecture was.
That experience shifted my thinking: Documentation isn't just about writing things down. It's about building knowledge systems that can outlive any individual engineer—including ourselves.
The Documentation Patterns I Keep Running Into#
Working across different teams, I've noticed some recurring challenges that honestly make me wonder if we're all making the same mistakes:
Level 1: The Wiki Graveyard
- 10,000 pages in Confluence
- 90% outdated or irrelevant
- Search returns 847 results for "authentication"
- Nobody knows which one is current
Level 2: README Roulette
- Every repository has different documentation standards
- Quality varies from excellent to non-existent
- New engineers play guessing games about which README to trust
Level 3: Slack Knowledge
- Critical architectural decisions buried in #general
- "Remember that conversation about the database migration?" No, nobody does
- Institutional knowledge trapped in private DMs
Level 4: Hero Documentation
- One person knows everything about the billing system
- They're overloaded with questions
- When they leave, knowledge walks out the door
Level 5: Meeting Minutes Maze
- Important decisions scattered across hundreds of Google Docs
- No consistent format or structure
- Finding the rationale for a design choice requires archaeological skills
If any of these sound familiar, you're not alone. What I've learned is that this usually isn't about which tools we're using—it's about how we think about information architecture.
Documentation Debt: The Silent Organization Killer#
We spend a lot of time discussing technical debt, but I've found documentation debt can be even trickier to spot. Technical debt usually shows up in slower deployments or harder maintenance. Documentation debt shows up when teams start second-guessing decisions they made six months ago because no one remembers the reasoning.
From what I've observed across different teams, the costs tend to show up in these areas:
interface DocumentationDebtCost {
  // Immediate costs
  onboardingTime: '6 weeks → 2 weeks with proper docs';
  dailyInterruptions: '40 Slack questions → 5 questions';
  duplicatedWork: '3 teams solving same problem unknowingly';

  // Hidden costs
  badDecisions: 'Repeating past mistakes';
  analysisParalysis: 'Afraid to change undocumented systems';
  talentLoss: 'Senior engineers become human documentation';

  // Crisis costs
  productionIncidents: '60% caused by knowledge gaps';
  auditFailures: 'Cannot prove compliance decisions';
  acquisitionIntegration: '18 months instead of 6';
}
What I've noticed is that teams who view documentation as "extra work" often end up spending more time later explaining, re-explaining, and re-discovering the same information.
A Documentation Approach That's Worked for Me#
Through various experiments (some successful, others... educational), I've settled on a three-layer approach that seems to scale reasonably well:
Layer 1: Decision Architecture (The Why)#
This is where you capture the reasoning behind choices. Not what you built, but why you built it that way.
/docs
  /decisions     # ADRs - architecture decisions made
  /proposals     # RFCs - future changes being considered
  /discussions   # RFDs - open problems being explored
Template approach I've found helpful:
Mini-RFC (1-2 pages):
- Single team impact
- Reversible decisions
- 1-week timeline
Standard RFC (5-10 pages):
- Multi-team impact
- Significant investment
- 2-4 week timeline
Strategic RFC (10+ pages):
- Company-wide impact
- Major architectural changes
- 6+ week timeline
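The three tiers above can be written down as a small decision helper so that authors do not have to argue about which template applies. This is a minimal sketch: the `RfcScope` shape and the `chooseRfcTier` name are hypothetical, and routing a hard-to-reverse single-team change to the standard tier is my own assumption rather than a rule from the list.

```typescript
// Hypothetical description of a proposed change; field names are illustrative.
interface RfcScope {
  teamsImpacted: number; // how many teams the change touches
  companyWide: boolean;  // true for org-wide or major architectural changes
  reversible: boolean;   // can the decision be undone cheaply?
}

type RfcTier = 'mini' | 'standard' | 'strategic';

interface RfcTemplate {
  tier: RfcTier;
  pages: string;
  timeline: string;
}

// Page counts and timelines are copied from the template list above.
const TEMPLATES: Record<RfcTier, RfcTemplate> = {
  mini: { tier: 'mini', pages: '1-2', timeline: '1 week' },
  standard: { tier: 'standard', pages: '5-10', timeline: '2-4 weeks' },
  strategic: { tier: 'strategic', pages: '10+', timeline: '6+ weeks' },
};

function chooseRfcTier(scope: RfcScope): RfcTemplate {
  if (scope.companyWide) return TEMPLATES.strategic;
  // Assumption: multi-team or hard-to-reverse changes get the standard tier.
  if (scope.teamsImpacted > 1 || !scope.reversible) return TEMPLATES.standard;
  return TEMPLATES.mini;
}
```

A single-team, reversible change resolves to the mini template, which keeps the lightweight path the default.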
Layer 2: System Documentation (The What)#
This describes your current reality. What exists, how it connects, who owns it.
/systems
  /service-catalog   # What services exist, who owns them
  /architecture      # How systems connect and communicate
  /runbooks          # How to operate and troubleshoot
  /dependencies      # What depends on what
Something I learned the hard way: This layer works best when it's mostly automated. Hand-written system docs seem to become outdated the moment you finish writing them.
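As one illustration of what "mostly automated" can mean, the catalog page can be rendered from structured metadata instead of being edited by hand. The sketch below assumes a hypothetical `ServiceInfo` record (for example, collected from a manifest in each repository); neither the shape nor the function name comes from a real tool.

```typescript
// Hypothetical service metadata, e.g. gathered from a manifest in each repo.
interface ServiceInfo {
  name: string;
  owner: string;
  dependsOn: string[];
}

// Render one Markdown table row per service, sorted by name, so the catalog
// page can be regenerated from data on every build.
function renderServiceCatalog(services: ServiceInfo[]): string {
  const header = ['| Service | Owner | Depends on |', '|---|---|---|'];
  const rows = services
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.name.localeCompare(b.name))
    .map((s) => {
      const deps = s.dependsOn.length > 0 ? s.dependsOn.join(', ') : 'none';
      return `| ${s.name} | ${s.owner} | ${deps} |`;
    });
  return header.concat(rows).join('\n');
}
```

Because the page is derived output, a stale entry points to stale metadata, which is usually easier to detect and fix at the source.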
Layer 3: Process Documentation (The How)#
This captures your cultural DNA. How you work, how you make decisions, how you handle incidents.
/processes
/engineering # How we design, build, and review
/oncall # How we respond to incidents
/releases # How we deploy and rollback
/hiring # How we evaluate and onboard
What I've found helpful: Process docs seem to work better with concrete examples rather than just abstract guidelines. People (myself included) tend to learn better from "here's what we actually did" rather than "here's what we should do."
The Amazon vs Google Documentation Philosophy#
I've looked at how some larger organizations handle this, and there seem to be two main approaches that come up often:
Amazon's Narrative Approach#
6-page written narratives instead of PowerPoint presentations:
- Forces complete thinking before meetings
- Creates artifact of the decision process
- "Study hall" format ensures everyone actually reads
Structure I've tried to adapt (with mixed results):
- Executive Summary (1 page)
- Context and Problem (1 page)
- Proposed Solution (2 pages)
- Alternatives Considered (1 page)
- Implementation Plan (1 page)
- Appendix (unlimited)
Google's Design Doc Culture#
Collaborative technical documents with peer review:
- Emphasis on trade-offs and alternatives
- System context diagrams
- Async collaboration through comments
Key elements:
- Context and Scope - What are we solving?
- Goals and Non-Goals - What success looks like
- Design - How we'll solve it
- Alternatives - What we considered and rejected
- Cross-cutting Concerns - Security, performance, monitoring
What I've been experimenting with: Taking Amazon's "force yourself to think it through" approach and combining it with Google's collaborative review culture. Your mileage may vary—different team dynamics seem to respond better to different approaches.
Documentation as Code: The Technical Implementation#
Treat documentation like any other critical infrastructure:
# .github/workflows/docs.yml
name: Documentation Infrastructure
on:
  pull_request:
    paths: ['docs/**', 'adr/**', 'rfcs/**']
jobs:
  validate-documentation:
    runs-on: ubuntu-latest
    steps:
      - name: Validate RFC format
        run: |
          # Check required sections exist
          # Validate YAML frontmatter
          # Ensure decision status is valid
      - name: Check broken links
        run: |
          # Scan for dead internal links
          # Verify external links return 200
          # Flag links to deprecated services
      - name: Generate architecture diagrams
        run: |
          # Auto-generate from PlantUML source
          # Update system dependency graphs
          # Create visual service maps
      - name: Update search index
        run: |
          # Index new content for searchability
          # Tag documents with metadata
          # Update recommendation engine
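The "Validate RFC format" step is left as comments in the workflow. One way to fill it in is a small validator along these lines. It is a sketch under stated assumptions: the required section names and the list of valid statuses are conventions I made up for illustration, and the frontmatter is read with a simple regular expression rather than a real YAML parser.

```typescript
// Assumed conventions (not from any standard): RFCs are Markdown files with a
// `status:` line in the frontmatter and these second-level headings.
const REQUIRED_SECTIONS = ['Context', 'Decision', 'Alternatives'];
const VALID_STATUSES = ['draft', 'review', 'approved', 'rejected'];

function validateRfc(markdown: string): string[] {
  const errors: string[] = [];

  // Frontmatter is the block between the first two `---` lines.
  const frontmatter = /^---\n([\s\S]*?)\n---/.exec(markdown);
  if (frontmatter === null) {
    errors.push('missing frontmatter');
  } else {
    const status = /^status:\s*(\S+)\s*$/m.exec(frontmatter[1]);
    if (status === null) {
      errors.push('missing status');
    } else if (VALID_STATUSES.indexOf(status[1].toLowerCase()) === -1) {
      errors.push(`invalid status: ${status[1]}`);
    }
  }

  // Collect "## Heading" lines and compare them against the required list.
  const headings = markdown
    .split('\n')
    .filter((line) => line.indexOf('## ') === 0)
    .map((line) => line.slice(3).trim());
  for (const section of REQUIRED_SECTIONS) {
    if (headings.indexOf(section) === -1) {
      errors.push(`missing section: ${section}`);
    }
  }
  return errors;
}
```

A CI step could run this over every changed RFC file and fail the job when the returned list is non-empty.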
Tool stack I've had reasonable success with:
- MkDocs Material - Beautiful, searchable documentation sites
- PlantUML/Mermaid - Version-controlled architecture diagrams
- ADR-tools - Command-line decision record management
- GitHub Actions - Automated validation and publishing
The DACI Framework for Documentation Decisions#
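For any significant technical decision, I use the DACI framework (Driver, Approver, Contributors, Informed), which originated at Intuit and is now widely associated with Atlassian's team playbook, to keep roles in the documentation process clear: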
# RFC-042: Database Migration Strategy
## DACI Matrix
- **Driver:** Database Team Lead
  - Responsible for gathering input and driving to decision
  - Owns the timeline and process
- **Approver:** VP Engineering
  - Makes the final call
  - Accountable for the outcome
- **Contributors:** Backend Teams, SRE, Security, Data Engineering
  - Provide input and expertise
  - Will be impacted by the decision
- **Informed:** All Engineering, Product, Finance
  - Need to know the outcome
  - May need to adjust their plans
## Decision Timeline
- **Week 1:** Stakeholder interviews and requirements gathering
- **Week 2:** Technical evaluation and proof of concepts
- **Week 3:** Cost analysis and migration planning
- **Week 4:** Final decision and communication
I've found this framework helps avoid the "too many cooks" situation while still making sure people feel heard. Though honestly, getting the balance right takes some trial and error.
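If RFC metadata lives next to the document, the DACI matrix can be captured as data and checked automatically. The sketch below is illustrative only: the `DaciMatrix` type and `validateDaci` function are hypothetical names, and the checks are one reasonable reading of the framework, not an official definition.

```typescript
// A DACI matrix as data. Driver and approver are single strings on purpose:
// the type itself rules out "two approvers".
interface DaciMatrix {
  driver: string;
  approver: string;
  contributors: string[];
  informed: string[];
}

function validateDaci(matrix: DaciMatrix): string[] {
  const problems: string[] = [];
  if (matrix.driver.trim() === '') problems.push('driver is required');
  if (matrix.approver.trim() === '') problems.push('approver is required');
  if (matrix.contributors.length === 0) {
    problems.push('at least one contributor is expected');
  }
  // Anyone already contributing does not also need to be on the informed list.
  for (const name of matrix.informed) {
    if (matrix.contributors.indexOf(name) !== -1) {
      problems.push(`${name} is listed as both contributor and informed`);
    }
  }
  return problems;
}

// The matrix from RFC-042 above.
const rfc042: DaciMatrix = {
  driver: 'Database Team Lead',
  approver: 'VP Engineering',
  contributors: ['Backend Teams', 'SRE', 'Security', 'Data Engineering'],
  informed: ['All Engineering', 'Product', 'Finance'],
};
```

Modeling the approver as a single value is the design choice that matters here: it makes "who makes the final call" unambiguous by construction.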
Scaling Documentation Culture: The Champion Network#
From what I've seen, documentation culture can't really be mandated from above—it seems to work better when it grows more naturally. But you can definitely create conditions that make it more likely to take root.
The Documentation Champion Approach#
I've had some success with having one "Documentation Champion" per team (usually works out to about 5-8 engineers per champion):
Responsibilities:
- Facilitate RFC reviews within their team
- Ensure new systems come with proper documentation
- Identify knowledge gaps and outdated information
- Coach team members on documentation standards
Time commitment: ~2 hours per week
Rotation: Every 6 months to prevent burnout
Documentation Metrics That Actually Matter#
I've noticed many teams track things that don't necessarily correlate with documentation health. Here's what I've found more useful to measure:
interface DocumentationHealth {
  // Leading indicators (predict future problems)
  rfcParticipation: number;      // % engineers participating in RFC reviews
  docUpdateFrequency: number;    // Average days since last update
  knowledgeDistribution: number; // % of systems with >1 expert

  // Lagging indicators (measure current state)
  onboardingVelocity: number;    // Days from hire to first commit
  crossTeamQuestions: number;    // Questions requiring cross-team knowledge

  // Quality indicators (measure documentation value)
  documentRelevance: number;     // % of docs accessed in last 90 days
  linkHealth: number;            // % of internal links that work
  searchSuccess: number;         // % of searches that find answers
}
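Two of these fields, `docUpdateFrequency` and `documentRelevance`, are straightforward to compute once you have per-document timestamps. The following is a rough sketch that assumes a hypothetical `DocRecord` export from your docs site or analytics; the record shape and function names are mine, not part of any tool.

```typescript
// Hypothetical per-document record, e.g. exported from a docs site's analytics.
interface DocRecord {
  path: string;
  lastUpdated: Date;
  lastAccessed: Date | null; // null if the page was never viewed
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function daysBetween(earlier: Date, later: Date): number {
  return (later.getTime() - earlier.getTime()) / MS_PER_DAY;
}

// Computes docUpdateFrequency (average days since last update) and
// documentRelevance (% of docs accessed in the last 90 days).
function summarizeDocs(docs: DocRecord[], now: Date) {
  if (docs.length === 0) {
    return { docUpdateFrequency: 0, documentRelevance: 0 };
  }
  const totalAge = docs.reduce(
    (sum, doc) => sum + daysBetween(doc.lastUpdated, now),
    0,
  );
  const recentlyAccessed = docs.filter(
    (doc) => doc.lastAccessed !== null && daysBetween(doc.lastAccessed, now) <= 90,
  ).length;
  return {
    docUpdateFrequency: totalAge / docs.length,
    documentRelevance: (recentlyAccessed / docs.length) * 100,
  };
}
```

Passing `now` in explicitly keeps the function deterministic, which makes it easy to test and to rerun against historical snapshots.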
Monthly review questions:
- Which knowledge gaps caused delays this month?
- What questions were asked multiple times?
- Which documents are becoming stale?
- Where are people going outside our documentation system?
Times When Good Documentation Really Made a Difference#
When Documentation Saved Our Weekend#
During our biggest shopping weekend, our database migration hit a snag halfway through. The engineer who knew the rollback process best was enjoying a well-deserved vacation on the other side of the world.
Honestly, without the detailed runbooks (which we'd been diligent about testing and updating), we would have been scrambling for hours. Instead, the on-call team could follow the documented recovery process and get things back on track relatively quickly.
The business impact could have been significant, but more importantly for me, the team felt confident they could handle the situation even without the original expert available.
An Acquisition That Went Surprisingly Smoothly#
We acquired a team of about 50 engineers. Having been through acquisitions before, I was bracing for the usual 12-18 month integration slog of trying to understand their systems and practices.
What made this different was their engineering lead's approach to documentation. They had solid RFC and ADR practices, design docs for their major systems, and most importantly, the reasoning behind their architectural decisions was captured and accessible.
The integration still took effort—acquisitions always do—but it was months rather than the typical year-plus timeline. Their engineers could get productive on our systems much faster because we could understand theirs.
It really highlighted for me how much documentation quality can impact business outcomes beyond just day-to-day engineering productivity.
When Auditors Actually Complimented Our Documentation#
During a SOC2 Type II audit, the auditors wanted to understand our architectural decisions, particularly around data handling and access controls.
Instead of the usual scramble to reconstruct decision rationale, we had a few years' worth of ADRs documenting security-related architectural choices. The reasoning, alternatives we'd considered, and how we'd verified implementation were all there.
The audit process went much more smoothly than I'd expected. What really struck me was when one of the auditors mentioned that our documentation approach gave them confidence in our security practices at the architectural level.
It was one of those moments where you realize that good documentation practices have benefits beyond just internal team efficiency.
The Documentation ROI Calculator I Built#
After having the "documentation is overhead" conversation one too many times, I put together a rough calculator to think through the numbers:
function calculateDocumentationROI(teamSize: number, avgSalary: number) {
  const engineerHourlyCost = avgSalary / (52 * 40); // ~$144/hour for $300k engineer

  // Monthly time savings per engineer (conservative estimates)
  const monthlySavings = {
    fasterOnboarding: 15,     // hours saved vs tribal knowledge
    reducedInterruptions: 10, // hours not spent answering questions
    betterDebugging: 12,      // hours saved with proper runbooks
    fasterDecisions: 8,       // hours saved vs meetings/research
    avoidedRework: 6,         // hours saved vs repeating mistakes
  };

  // 51 hours per engineer per month with the estimates above
  const totalMonthlySavings = Object.values(monthlySavings)
    .reduce((sum, hours) => sum + hours, 0);
  const annualSavings = totalMonthlySavings * engineerHourlyCost * teamSize * 12;

  // Documentation investment: 4 hours per engineer per month
  const documentationCost = 4 * engineerHourlyCost * teamSize * 12;

  return {
    annualSavings,
    documentationCost,
    netBenefit: annualSavings - documentationCost,
    roi: ((annualSavings - documentationCost) / documentationCost) * 100
  };
}

// Example: 30-person team at $300k average salary
// Annual savings: ~$2.65M
// Documentation cost: ~$208k
// Net benefit: ~$2.44M
// ROI: 1175%
Obviously, these numbers are rough estimates and your situation might be quite different. But it's been helpful for me to think about documentation investment in terms of time saved rather than time spent.
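One thing the calculator makes visible: hourly cost and team size appear in both the savings and the cost, so they cancel out of the ROI ratio. What remains is hours saved versus hours invested. The helper below is a small sketch of that observation (the `roiPercent` name is mine), useful for checking how sensitive the result is to the savings estimates.

```typescript
// ROI depends only on the ratio of hours saved to hours invested, because
// hourly cost and team size multiply both the savings and the cost.
function roiPercent(hoursSavedPerMonth: number, hoursInvestedPerMonth: number): number {
  if (hoursInvestedPerMonth <= 0) {
    throw new Error('hoursInvestedPerMonth must be positive');
  }
  return ((hoursSavedPerMonth - hoursInvestedPerMonth) / hoursInvestedPerMonth) * 100;
}

// The listed estimates sum to 51 hours saved for 4 hours invested.
const estimatedRoi = roiPercent(51, 4); // 1175

// Even if real savings were a tenth of the estimate, the investment
// would still come out ahead (roughly 27.5%).
const pessimisticRoi = roiPercent(5.1, 4);
```

Framed this way, the break-even point is simply one hour saved for every hour spent writing.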
Documentation Tools: Which One for What?#
Having tried various tools over the years, I've developed some opinions about what works well in different situations. Your team's needs might be different, but here's what I've observed:
Confluence: The Enterprise Classic#
When it works:
- Jira integration is critical
- Corporate compliance requires it
- Non-technical stakeholders need access
How to use it properly:
/spaces
  /ENG            # Engineering space
    /RFC          # Templated page tree for RFCs
    /ADR          # Date-based ADR archive
    /Runbooks     # Categorized operation docs
  /PRODUCT        # Product space (for PRDs)
Pro tip: Add dates to Confluence page titles, for example [2024-01-22] Database Migration RFC. Search is still unreliable, but at least you get chronological ordering.
Anti-patterns:
- Putting everything in one space (search hell)
- Not using templates (inconsistent formats)
- Never retiring old pages (archive or label them instead of leaving them live)
Notion: Modern and Flexible#
When it shines:
- You want to use database views
- RFC tracking in Kanban boards
- Rich media and embeds for documentation
Database-based setup:
// RFC Database structure
// Person and MultiSelect stand in for Notion's property types;
// they are not defined here.
interface NotionRFC {
  title: string;
  status: 'Draft' | 'Review' | 'Approved' | 'Rejected';
  author: Person;
  reviewers: Person[];
  impactedTeams: MultiSelect;
  decisionDate: Date;
  tags: MultiSelect;
}
Strengths:
- Different views (Table, Board, Timeline, Calendar)
- Rich template system
- AI integration (automatic summarization)
- Version history and collaboration
GitBook: Developer-First Approach#
Where it excels:
- Open source projects
- API documentation
- Version-controlled documentation
Git integration:
# .gitbook.yaml
root: ./docs/
structure:
  readme: README.md
  summary: SUMMARY.md
redirects:
  previous/page: new-folder/new-page.md
Advantages:
- GitHub/GitLab sync
- Markdown native
- Can go through code review
- Different versions per branch
Obsidian: Knowledge Graph Approach#
When to use:
- Building interconnected knowledge networks
- Personal knowledge management
- Zettelkasten methodology
Enterprise usage:
[[2024-01-22-database-migration]]
Related: [[postgres-best-practices]] | [[migration-checklist]]
Tags: #rfc #database #approved
Power of graph view: Visually shows which systems are related to each other.
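The graph view is built from the `[[wikilinks]]` inside each note, and the same structure is easy to extract yourself, for example to find orphaned notes. Below is a minimal sketch; the function names are hypothetical, and it only handles plain links, `[[target|alias]]`, and `[[target#heading]]` forms.

```typescript
// Extract [[wikilink]] targets from a note. Aliases ([[target|label]]) and
// heading anchors ([[target#section]]) are reduced to the target note name.
function extractWikilinks(note: string): string[] {
  const pattern = /\[\[([^\]|#]+)(?:[|#][^\]]*)?\]\]/g;
  const targets: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(note)) !== null) {
    const target = match[1].trim();
    if (targets.indexOf(target) === -1) targets.push(target); // de-duplicate
  }
  return targets;
}

// Build an adjacency list: note name -> notes it links to.
function buildLinkGraph(notes: Record<string, string>): Record<string, string[]> {
  const graph: Record<string, string[]> = {};
  for (const name of Object.keys(notes)) {
    graph[name] = extractWikilinks(notes[name]);
  }
  return graph;
}
```

Tags such as `#rfc` are ignored here on purpose, since they are not wrapped in double brackets.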
SharePoint/Teams Wiki: Microsoft Ecosystem#
When it's mandatory:
- Organizations using Microsoft 365
- Security policies block 3rd party tools
- IT department won't allow anything else
Best practices:
/sites/Engineering
  /Shared Documents
    /Architecture
      /ADR
        /2024
          01-use-kubernetes.md
          02-migrate-to-postgres.md
    /Processes
      /RFC-Template.docx
Survival tactics:
- Don't use OneNote as a wiki (search disaster)
- Use checkout/checkin for version control
- Set up approval workflows with Power Automate
GitHub/GitLab Wiki: Code-Adjacent Documentation#
Ideal usage:
- Repository-specific documentation
- Contributing guidelines
- Development setup
Structure:
.wiki/
  Home.md
  Architecture/
    Decision-Records.md
    System-Overview.md
  Operations/
    Deployment.md
    Rollback.md
Backstage: Developer Portal#
For enterprise scale:
- Service catalog
- API documentation
- Tech radar
- Cost tracking
catalog-info.yaml:
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  description: Handles payment processing
  annotations:
    docs: https://docs.internal/payment
    pagerduty: PD123
spec:
  type: service
  owner: platform-team
  lifecycle: production
Tool Selection Matrix#
| Use Case | First Choice | Alternative | Avoid |
|---|---|---|---|
| Engineering RFCs | GitHub + MkDocs | GitBook | SharePoint |
| Product Documentation | Notion | Confluence | Word Docs |
| API Docs | GitBook | Backstage | Wiki |
| Runbooks | MkDocs | Confluence | OneNote |
| Knowledge Base | Obsidian | Notion | Folders |
| Service Catalog | Backstage | Custom | Excel |
Migration Strategy#
From Confluence to MkDocs:
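# 1. Export the Confluence space
#    (confluence-export is a placeholder; substitute whatever export tool you use)
confluence-export --space ENG --format markdown
# 2. Transform to MkDocs structure (transform_confluence.py is your own script)
python transform_confluence.py --input export/ --output docs/
# 3. Set up redirects for old URLs (mkdocs-redirects plugin)
# mkdocs.yml
plugins:
  - redirects:
      redirect_maps:
        'old-page.md': 'new-structure/page.md'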
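With hundreds of pages, writing the redirect map by hand gets error-prone, so it can be generated from a list of page moves. The sketch below is an assumption-heavy illustration: `PageMove`, `slugify`, and `renderRedirectMaps` are hypothetical names, and the slug rule has to match whatever filenames your export step actually produces.

```typescript
// Hypothetical mapping from an exported Confluence page title to its new path.
interface PageMove {
  oldTitle: string;
  newPath: string; // relative to docs/, e.g. 'decisions/db-migration.md'
}

// Turn a page title into a flat Markdown filename. The slug rule is an
// assumption; adjust it to match your export tool's output.
function slugify(title: string): string {
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse anything non-alphanumeric
    .replace(/^-+|-+$/g, '');    // trim leading and trailing dashes
  return slug + '.md';
}

// Emit the plugins section for mkdocs.yml as indented YAML text.
function renderRedirectMaps(moves: PageMove[]): string {
  const entries = moves.map(
    (move) => `        '${slugify(move.oldTitle)}': '${move.newPath}'`,
  );
  return ['plugins:', '  - redirects:', '      redirect_maps:']
    .concat(entries)
    .join('\n');
}
```

Generating the map from the same move list that drives the transform step keeps redirects and file locations from drifting apart.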
Hybrid Approach (Most Common in Reality)#
Most organizations use multiple tools:
documentation_stack:
  decisions:
    tool: GitHub + ADR-tools
    reason: "Version control and code review"
  product_specs:
    tool: Notion
    reason: "Easy for PMs, rich formats"
  runbooks:
    tool: Confluence
    reason: "On-call engineers are familiar"
  api_docs:
    tool: GitBook
    reason: "Auto-sync with OpenAPI specs"
  knowledge_base:
    tool: Obsidian
    reason: "Connected knowledge graph"
Something I've learned: It's really helpful to be clear about where different types of documentation live. When someone asks "Where's the RFC?" there should ideally be one obvious answer, not a treasure hunt across multiple systems.
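One lightweight way to enforce "one obvious answer" is to encode the mapping once and have bots, templates, and onboarding docs read from it. A sketch, using the document types and tools from the hybrid stack above (the `DocType` and `whereDoesItLive` names are hypothetical):

```typescript
// One home per document type, mirroring the hybrid stack above.
type DocType = 'decisions' | 'product_specs' | 'runbooks' | 'api_docs' | 'knowledge_base';

// Record<DocType, string> makes the compiler reject a missing or extra entry,
// so every document type has exactly one home.
const DOC_HOMES: Record<DocType, string> = {
  decisions: 'GitHub + ADR-tools',
  product_specs: 'Notion',
  runbooks: 'Confluence',
  api_docs: 'GitBook',
  knowledge_base: 'Obsidian',
};

function whereDoesItLive(docType: DocType): string {
  return DOC_HOMES[docType];
}
```

Adding a new document type to `DocType` then fails type-checking until someone decides where it lives, which is the conversation you want to force.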
An Implementation Approach That's Worked for Me#
Phase 1: Foundation (Months 1-2)#
Week 1-2: Infrastructure Setup
- Deploy MkDocs with search
- Create RFC/ADR templates
- Set up automated validation pipeline
- Establish document approval workflow
Week 3-4: Champion Training
- Select documentation champions
- Train on templates and processes
- Set up regular review cadence
- Create feedback mechanisms
Week 5-8: Pilot Team
- Choose 1-2 teams for pilot
- Migrate critical knowledge
- Run first RFC reviews
- Gather feedback and iterate
Phase 2: Adoption (Months 3-6)#
Month 3: Mandate and Standards
- Require RFCs for architectural changes
- No new services without documentation
- Weekly RFC review meetings
- Documentation review in code review
Month 4-5: Knowledge Migration
- Audit existing critical knowledge
- Prioritize based on risk and impact
- Systematic migration to new format
- Retire old documentation systems
Month 6: Culture Integration
- Documentation goals in performance reviews
- Recognition for good documentation
- Documentation debt in planning
- Cross-team RFC participation
Phase 3: Optimization (Months 7-12)#
Month 7-9: Automation
- Auto-generate system documentation
- Intelligent document recommendations
- Broken link detection and fixing
- Search analytics and improvement
Month 10-12: Scaling
- Roll out to entire engineering organization
- Advanced analytics and metrics
- Integration with other systems (Slack, JIRA, etc.)
- Continuous improvement processes
Documentation Principles I've Come to Value#
Through various experiences (some more painful than others), I've settled on a few principles that seem to guide good documentation decisions:
1. Documentation as Time Investment, Not Time Cost#
I've found that time spent on solid documentation tends to pay back in multiples. When someone writes a clear ADR, it often prevents the team from having the same architectural debate multiple times over the following months.
2. Consistency Usually Trumps Creativity#
I've learned that consistent templates and processes tend to scale better than letting everyone find their own approach. When documents follow similar patterns, it's much easier for people to find information across different teams and projects.
3. Context Often Matters More Than Implementation Details#
Code shows you what's happening, comments explain how, but decision documents capture why. I've found that the "why" is usually what survives refactoring, migrations, and rewrites—it's the institutional memory that's hardest to reconstruct later.
4. Updated Documents Beat Perfect Documents#
I'd rather have a decent document that gets updated regularly than a perfect document that becomes stale. Building processes that make it easy to keep things current seems more valuable than trying to get everything right the first time.
5. Focus on Usage, Not Creation#
Instead of counting documents written, I've found it more useful to look at outcomes: how quickly new team members get productive, whether people can find answers to common questions, how often we're re-explaining the same concepts. The goal is making knowledge accessible, not just creating more content.
Your Next Steps: Start Small, Think Systems#
You don't need to overhaul everything at once. I'd suggest starting with something small but visible:
This Week:
- Pick one critical system that caused recent confusion
- Write a simple 1-page ADR explaining one architectural decision
- Share it in your team channel and ask for feedback
This Month:
- Create a basic RFC template for your team
- Set up a simple documentation site (even a GitHub wiki works)
- Establish a weekly 30-minute "documentation review" in your team meeting
This Quarter:
- Train 2-3 documentation champions
- Require RFCs for all significant changes
- Measure onboarding time and cross-team questions
- Calculate your documentation ROI
Documentation as Competitive Advantage#
What I've come to appreciate is that documentation isn't just about preserving knowledge—it's about building organizational capabilities that can grow beyond any individual contributor.
Competitors might copy features or even hire key people, but the institutional knowledge, decision context, and ability to bring new team members up to speed quickly—that's much harder to replicate.
I think of good documentation as a form of technical leverage that compounds over time. It's one of the things that can help a team evolve from a collection of individual contributors into a learning organization.
Whether documentation investment makes sense depends on your situation, but I've found the teams that invest in it tend to move faster and make better decisions over time.
If this resonates with your experience, maybe start with one small experiment and see how it goes. Your future self—and your future teammates—might appreciate the effort.