Building a Scalable GitHub Actions Platform for a Large-Scale Microservices Architecture
A practical guide to building an org-level shared GitHub Actions platform covering architecture decisions, security governance, adoption strategy, and the 7 biggest mistakes we made along the way.
Abstract
When CI/CD pipelines grow organically across dozens of repositories, you end up with duplicated YAML, inconsistent security practices, and a constant stream of support requests. This post documents how we built an organization-level shared GitHub Actions platform for a large e-commerce platform running around 20 microservices across multiple teams. We cover the architecture decisions, security governance model, adoption strategy, and the concrete metrics that resulted: build times dropping from ~45 minutes to ~12 minutes, a 70% reduction in CI-related support tickets, and 85%+ adoption within six months. We also share the 7 biggest mistakes we made, because those taught us more than the things that went right.
Introduction
GitHub Actions is deceptively simple to get started with. Copy a YAML file, add a few steps, and you have a working pipeline. The problem is that this simplicity does not scale. Once you have dozens of repositories for different microservices, each maintained by different teams with their own flavor of build, test, and deploy workflows, you end up with a maintenance burden that quietly consumes engineering capacity.
We found ourselves in exactly this situation: hundreds of workflow files with subtle differences, inconsistent security practices, build times that varied wildly, and a growing backlog of CI-related support tickets. The platform engineering team was spending more time answering "how do I do X in GitHub Actions?" questions than building platform capabilities.
This post documents how we designed, built, and rolled out an org-level shared actions platform. The goal is not to prescribe a single correct approach but to share what worked, what did not, and the trade-offs behind every decision.
Why We Needed an Org-Level Shared Actions Platform
The symptoms were clear before we even started measuring:
- Duplication everywhere: The average workflow file was ~500 lines of YAML, with roughly 80% of it identical across repositories. Teams copy-pasted from each other and diverged over time.
- Inconsistent security posture: Some repos pinned action versions by SHA, others used @latest. Some configured OIDC for AWS, others still used long-lived access keys stored as secrets.
- Slow builds: Average build time was around 45 minutes. Teams had added steps over time without considering caching, parallelism, or runner selection.
- Support burden: The platform team received roughly 30 CI-related tickets per week, mostly about configuration, debugging failures, and "works on my machine" issues.
- Onboarding friction: New projects took days to set up CI/CD because there was no standard template, and the tribal knowledge lived in Slack threads.
We needed a platform that would give teams a "golden path" for CI/CD while preserving the flexibility to customize when necessary.
Note: A "golden path" is a well-supported, opinionated default. Teams can deviate, but the supported path should cover 80%+ of use cases with minimal configuration.
Architecture Decisions & Trade-offs
Every architecture decision involved trade-offs. Here is how we evaluated the major ones.
Composite Actions vs. Reusable Workflows vs. Workflow Templates
This was the first and most consequential decision. GitHub Actions offers three mechanisms for sharing CI/CD logic, and they serve different purposes: composite actions bundle multiple steps into a single step that a job invokes; reusable workflows define whole jobs that other workflows call via workflow_call; and workflow templates are starter files copied into a repository at creation time, with no ongoing link back to the source.
Our decision: We use all three, each for its purpose:
- Composite actions for reusable building blocks (setup Node.js with caching, run linting, build Docker images)
- Reusable workflows for standardized pipelines (build-test-deploy for a Node.js service, deploy-to-ECS)
- Workflow templates for bootstrapping new repositories with a sensible starting configuration
The key insight: composite actions compose well. We build reusable workflows from composite actions, so the workflow itself is thin orchestration logic while the actions contain the implementation.
Monorepo vs. Multi-Repo for Shared Actions
Our decision: Monorepo. The discoverability and cross-cutting change benefits outweigh the blast radius concern, especially when combined with strict branch protection and automated testing. We mitigate the blast radius by releasing individual actions with independent semver tags.
Repository Structure
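The original structure listing is not reproduced here, so the following is a sketch of a plausible layout for the shared-actions monorepo, consistent with the per-action semver tags described above (all directory and file names are illustrative):

```text
shared-actions/
├── actions/
│   ├── setup-node/
│   │   ├── action.yml        # composite action definition
│   │   └── README.md
│   ├── docker-build/
│   │   └── action.yml
│   └── deploy-ecs/
│       └── action.yml
├── .github/
│   └── workflows/
│       ├── node-service.yml  # reusable workflow (workflow_call)
│       ├── test-actions.yml  # CI for the actions themselves
│       └── release.yml       # tags changed actions on merge to main
└── CODEOWNERS
```

Note that org-level workflow templates live separately, in the organization's .github repository, rather than in this monorepo.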
Versioning Strategy
Versioning is where security and developer experience collide. We use a layered approach:
Our policy:
- External third-party actions: SHA pinning required. No exceptions. Dependabot handles update PRs.
- Our own shared actions: Semver tags for production, major tags for development environments.
- Never @main: Even for internal actions, referencing a branch directly is not permitted in production workflows.
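Illustratively, the policy translates into uses: references like these (the organization, action paths, and SHA are placeholders, not real pins):

```yaml
steps:
  # External action: pin the full 40-character commit SHA (placeholder shown),
  # keeping the tag in a trailing comment so Dependabot PRs stay readable
  - uses: actions/checkout@0123456789abcdef0123456789abcdef01234567 # v4 (placeholder SHA)

  # Internal shared action: exact semver tag in production workflows
  - uses: our-org/shared-actions/actions/setup-node@v1.2.0

  # Internal shared action: floating major tag, development environments only
  - uses: our-org/shared-actions/actions/setup-node@v1

  # Never permitted in production: branch references
  # - uses: our-org/shared-actions/actions/setup-node@main
```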
Warning: Using @main or @latest for third-party actions is a supply chain attack vector. A compromised upstream repository can inject malicious code into every workflow that references it. Always pin by SHA for external actions.
Self-Hosted vs. GitHub-Hosted Runners
Our decision: Hybrid. GitHub-hosted larger runners for most workloads, self-hosted runners in our VPC for jobs that need private network access (integration tests against staging databases, deployments to private ECS clusters). We use ephemeral self-hosted runners on ECS Fargate to avoid the stale-environment problem.
Implementation Deep Dive
Composite Action Example: Node.js Setup with Caching
This action replaces roughly 30 lines of duplicated YAML across repositories with a single step:
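The full action definition is not reproduced here; a minimal sketch of such a composite action (the inputs and defaults are illustrative, not the team's actual file) might look like:

```yaml
# actions/setup-node/action.yml -- illustrative sketch
name: setup-node
description: Install Node.js and cache npm dependencies
inputs:
  node-version:
    description: Node.js version to install
    required: false
    default: '20'
runs:
  using: composite
  steps:
    - name: Install Node.js with built-in npm caching
      uses: actions/setup-node@v4   # pin by SHA in real workflows
      with:
        node-version: ${{ inputs.node-version }}
        cache: npm
    - name: Install dependencies
      run: npm ci
      shell: bash   # composite run steps must declare a shell explicitly
```

A consumer then replaces its hand-rolled install-and-cache steps with a single `uses:` reference to this action.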
Reusable Workflow: Node.js Service Pipeline
This is the "golden path" workflow for Node.js services. It composes multiple shared actions and reduces per-repo pipeline YAML from ~500 lines to ~50:
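The full workflow is not shown here; a compressed sketch of its shape (job names, inputs, and shared-action paths are illustrative assumptions) is:

```yaml
# .github/workflows/node-service.yml in shared-actions -- illustrative sketch
on:
  workflow_call:
    inputs:
      node-version:
        type: string
        default: '20'
      deploy-environment:
        type: string
        required: true

permissions: {}

jobs:
  build-test:
    runs-on: ubuntu-latest
    permissions:
      contents: read
    steps:
      - uses: actions/checkout@v4                            # pin by SHA in practice
      - uses: our-org/shared-actions/actions/setup-node@v1   # hypothetical path
        with:
          node-version: ${{ inputs.node-version }}
      - run: npm test

  deploy:
    needs: build-test
    runs-on: ubuntu-latest
    environment: ${{ inputs.deploy-environment }}
    permissions:
      id-token: write   # OIDC token for AWS, no stored credentials
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: our-org/shared-actions/actions/deploy-ecs@v1   # hypothetical path
```

The workflow itself stays thin orchestration; the implementation detail lives in the composite actions it calls.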
What a Consumer Repository Looks Like
This is the entire CI/CD configuration for a typical Node.js service. Compare this to the 500-line files we started with:
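As an illustration of the shape (the organization, workflow path, and inputs are hypothetical), a consumer's entire CI/CD file reduces to a thin call into the reusable workflow:

```yaml
# .github/workflows/ci.yml in a consumer repository -- illustrative
name: CI
on:
  push:
    branches: [main]
  pull_request:

permissions: {}

jobs:
  pipeline:
    uses: our-org/shared-actions/.github/workflows/node-service.yml@v1
    with:
      node-version: '20'
      deploy-environment: staging
    secrets: inherit
    permissions:
      id-token: write
      contents: read
```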
That is roughly 20 lines of YAML. The team gets build caching, security scanning, OIDC-based AWS authentication, and a standardized deploy process without configuring any of it.
Automated Release Pipeline
We use a release workflow in the shared-actions monorepo that creates semver tags for individual actions when changes are merged to main:
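A sketch of such a release job (the tag scheme, paths, and shell logic are assumptions about how this could work, not the team's actual pipeline):

```yaml
# .github/workflows/release.yml in shared-actions -- illustrative sketch
name: Release changed actions
on:
  push:
    branches: [main]

permissions:
  contents: write   # required to push tags

jobs:
  tag:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, needed to diff and list existing tags
      - name: Tag each action whose directory changed
        run: |
          for action in $(git diff --name-only HEAD^ HEAD | grep '^actions/' | cut -d/ -f2 | sort -u); do
            # Hypothetical per-action tags like setup-node/v1.2.3:
            # bump the patch version of the latest existing tag
            latest=$(git tag -l "${action}/v*" --sort=-v:refname | head -n1)
            if [ -z "$latest" ]; then
              next="${action}/v1.0.0"
            else
              next="${action}/v$(echo "${latest#*/v}" | awk -F. '{print $1"."$2"."$3+1}')"
            fi
            git tag "$next" && git push origin "$next"
          done
```

A real pipeline would also advance the floating major tag (for example setup-node/v1) alongside the exact semver tag.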
Security & Governance Layer
Security at scale is not optional. We enforce it through multiple layers so that individual teams do not need to think about it.
OIDC for AWS Authentication
Long-lived AWS credentials stored as GitHub secrets are a liability. We replaced all of them with OIDC federation, scoped to specific repositories and environments:
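The repository and environment restriction lives in the IAM role's trust policy. A sketch, using the standard GitHub OIDC claim format (the account ID and repository name are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::111122223333:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
        "token.actions.githubusercontent.com:sub": "repo:our-org/checkout-service:environment:production"
      }
    }
  }]
}
```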
The sub claim condition is critical. It restricts which repositories and environments can assume the role. A repository named our-org/random-fork cannot assume production roles, even if it somehow obtained the workflow configuration.
Supply Chain Security
We implemented a multi-layer supply chain security strategy:
Key configurations:
- StepSecurity Harden Runner: Monitors outbound network calls during workflow execution. If a compromised action tries to exfiltrate secrets to an unknown endpoint, it gets flagged.
- Dependabot for actions: Automatically creates PRs when pinned action SHAs have newer versions. This keeps us up to date without sacrificing security.
- OpenSSF Scorecard: Runs weekly on the shared-actions repo to surface security weaknesses in our own practices.
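In workflow terms, the Harden Runner agent runs as the first step of a job, before any code is checked out; a sketch (the allowed endpoints are illustrative, and the version tag stands in for a SHA pin):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Harden the runner (restrict outbound network traffic)
        uses: step-security/harden-runner@v2   # pin by SHA in practice
        with:
          egress-policy: block
          allowed-endpoints: >
            github.com:443
            registry.npmjs.org:443
      - uses: actions/checkout@v4
```

Starting with egress-policy: audit and tightening to block once the endpoint list stabilizes avoids breaking builds during rollout.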
CODEOWNERS and Branch Protection
The shared-actions repository has strict governance:
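A CODEOWNERS file along these lines routes every change through the right reviewers (the team and path names are illustrative):

```text
# CODEOWNERS in shared-actions -- illustrative
*                       @our-org/platform-team
/actions/mobile-build/  @our-org/mobile-team @our-org/platform-team
/.github/workflows/     @our-org/platform-team
```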
Branch protection rules:
- Require 2 approving reviews (at least 1 from platform team)
- Require status checks to pass (all action tests must succeed)
- Require signed commits
- No force pushes, no deletion of main
- Dismiss stale reviews on new pushes
Minimal Token Permissions
Every workflow starts with the most restrictive permissions and explicitly opts in to what it needs:
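In practice that means an empty permissions map at the workflow level and explicit per-job grants, along these lines:

```yaml
permissions: {}   # workflow level: grant nothing by default

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read          # only what checkout needs
    steps:
      - uses: actions/checkout@v4

  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write         # OIDC federation with AWS
    steps:
      - uses: actions/checkout@v4
```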
Tip: Set permissions: {} at the workflow level and then grant only what each job needs. This follows the principle of least privilege and makes the security posture auditable at a glance.
Adoption & Measurement Strategy
Building a platform is the easy part. Getting multiple teams across a large microservices organization to use it is the real challenge.
Inner Source Contribution Model
We explicitly chose an inner source model over a top-down mandate. The platform team maintains core actions, but any engineer can contribute:
The contribution process:
- RFC issue: Describe the problem and proposed action. Platform team provides feedback on scope, naming, and existing overlap.
- Implementation: Contributor opens a PR with the action, tests, and documentation.
- Review: Platform team reviews for consistency, security, and composability. CODEOWNERS ensures the right people review.
- Release: Merged PRs trigger automated releases with proper semver tags.
- Announcement: New actions are announced in the engineering Slack channel with a usage example.
This model was critical for adoption. When the mobile team contributed a mobile-build action, their peers adopted it far more readily than if the platform team had built it.
Migration Playbook
We created a structured migration guide. The key was not forcing teams to migrate everything at once:
- Phase 1: Replace credential management with OIDC (security win, no workflow changes needed)
- Phase 2: Adopt setup-node or setup-python composite actions (easy swap, immediate caching benefits)
- Phase 3: Move to reusable workflows for standard service pipelines
- Phase 4: Adopt repository rulesets for security scanning
Each phase was independently valuable, which meant teams could migrate incrementally.
DORA Metrics Dashboard
We track the core DORA metrics plus platform-specific KPIs:
Note: These improvements did not come solely from the shared actions platform. Caching, runner optimization, and parallelism contributed significantly. The platform made it easy to adopt all these optimizations consistently.
Lessons Learned & The 7 Biggest Mistakes
These are the mistakes that cost us the most time. Each one is something we would do differently if starting over.
Mistake 1: Building Too Much Before Getting Feedback
We spent weeks building a comprehensive set of shared actions before any team used them. When we finally shipped, the abstractions did not match how teams actually structured their projects. We had to rewrite several actions after real usage revealed incorrect assumptions.
What works instead: Ship the smallest useful action first. We should have started with setup-node alone, gotten 5 teams using it, and then expanded.
Mistake 2: Overly Abstract Reusable Workflows
Our first reusable workflows tried to handle every possible configuration through inputs. The node-service.yml workflow had 23 inputs. Teams found it harder to understand than writing their own YAML.
What works instead: Fewer inputs, more opinionated defaults. Our current workflows have 4-6 inputs. If a team needs significantly different behavior, they compose from our actions rather than parameterizing the workflow.
Mistake 3: Ignoring Workflow Debugging Experience
When a reusable workflow fails, the error appears in the calling workflow's logs, but the actual steps are in the reusable workflow's definition. This confused teams during debugging, especially when they could not see the intermediate steps clearly.
What works instead: Add verbose logging to composite actions with clear step names. Use ::group:: and ::endgroup:: log commands to create collapsible sections. Include the shared action version in the log output so debugging can identify exactly which version is running.
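Inside a composite action's run steps, that grouping and version stamping might look like this (the action name and version string are illustrative):

```yaml
runs:
  using: composite
  steps:
    - name: Install dependencies
      shell: bash
      run: |
        # Collapsible log section; the header names the action and its version
        echo "::group::npm ci (shared-actions/setup-node v1.2.0)"
        npm ci
        echo "::endgroup::"
```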
Mistake 4: No Breaking Change Policy
We shipped a v2 of setup-node that changed the caching strategy without realizing it would break repositories with non-standard node_modules locations. This caused failures across 15 repos simultaneously.
What works instead: Semantic versioning with a documented breaking change policy. Major version bumps require a migration guide and a two-week deprecation notice. We now run an automated compatibility check that tests new action versions against a sample of consumer repositories before releasing.
Mistake 5: Underestimating Runner Costs
We initially defaulted all jobs to ubuntu-latest-16core runners for speed. The GitHub Actions bill grew much faster than anticipated. Not every job benefits from larger runners; dependency installation is often network-bound, not CPU-bound.
What works instead: Default to standard runners and opt in to larger runners per-job with documented justification. We profile new actions to determine whether larger runners actually improve build times before recommending them.
Mistake 6: Making Security Annoying Instead of Invisible
Our first security scanning implementation added 8 minutes to every pipeline and produced noisy reports with false positives. Teams started adding if: false conditions to skip the security steps, which defeated the entire purpose.
What works instead: Security scanning should be fast and have low false-positive rates. We moved to incremental scanning (only scan changed files on PRs, full scan on main), tuned the rulesets to eliminate persistent false positives, and got scanning time under 90 seconds. Adoption went from ~40% to 95% once it stopped being a bottleneck.
Mistake 7: No Deprecation Path for Old Patterns
When we released the shared platform, we did not have a plan for removing the old workflow files from repositories. Some repos ran both old and new pipelines for months, wasting compute and creating confusion about which results to trust.
What works instead: Create a migration CLI tool that can detect old patterns, generate migration PRs, and track migration progress across the organization. We built a simple script that opens automated PRs to remove deprecated workflow files once the new pipeline is confirmed working.
Results, Metrics & Future Roadmap
Quantified Outcomes
After six months of incremental rollout:
- 85% adoption rate: 34 of 40 repositories migrated to shared actions. The remaining 6 have legitimate reasons for custom pipelines (specialized hardware, non-standard build systems).
- Build time reduction: Average dropped from ~45 minutes to ~12 minutes, primarily through standardized caching, parallelized test execution, and right-sized runners.
- 70% reduction in CI support tickets: From ~30 to ~9 per week. The remaining tickets are mostly about genuinely novel requirements rather than "how do I configure caching."
- Pipeline YAML reduction: From ~500 lines per repository to ~50 lines. This is the metric teams feel most directly because it reduces their cognitive load.
- Security posture: 100% of active repositories use OIDC for AWS authentication. Zero long-lived AWS credentials in GitHub secrets.
Architecture Overview
Future Roadmap
We are investing in three areas:
- Dynamic pipeline generation: Instead of static YAML, generate workflow configurations based on repository metadata (language, deployment target, compliance requirements). This could further reduce per-repo configuration to near-zero.
- Ephemeral environment per PR: Using the shared deploy action to spin up a preview environment for every pull request, with automatic cleanup after merge.
- Cost attribution: Tagging GitHub Actions minutes by team, service, and workflow type to give engineering managers visibility into their CI/CD spend and help identify optimization opportunities.
Starting This Journey
For teams considering a similar effort, here is the sequence that worked for us:
- Start with one high-value action (caching or security scanning) and get 3-5 teams using it.
- Measure before and after: build times, support tickets, adoption rate. Numbers drive organizational buy-in.
- Invest in the contribution model early. If only the platform team can modify shared actions, you have created a bottleneck.
- Security should be invisible, not an obstacle. If teams work around your security controls, the controls are failing.
- Plan for deprecation from day one. Every v1 will eventually become a v2, and you need a path to get there.
The shared actions platform has been one of the highest-leverage investments our platform engineering team has made. The upfront effort was significant, but the compounding returns in developer productivity, security consistency, and operational reliability have more than justified it.
References
- GitHub Actions Reusable Workflows - Official documentation on creating and consuming reusable workflows across repositories
- GitHub Actions Composite Actions - Guide to building composite actions that bundle multiple steps
- GitHub Actions Security Hardening - Comprehensive security best practices for GitHub Actions workflows
- GitHub OIDC for Cloud Providers - Configuring OpenID Connect for keyless cloud authentication
- StepSecurity Harden Runner - Runtime security agent that monitors and controls outbound traffic from GitHub Actions
- OpenSSF Scorecard - Automated tool for assessing open source project security health
- DORA Metrics - The four key metrics for measuring software delivery performance
- GitHub Repository Rulesets - Enforcing organization-wide workflow and merge requirements via repository rulesets
- GitHub Actions Larger Runners - Documentation on configuring and using larger GitHub-hosted runners
- GitGuardian GitHub Actions Security Cheat Sheet - Comprehensive checklist for securing GitHub Actions pipelines
- GitHub Actions Workflow Syntax - Complete reference for workflow YAML syntax including permissions and concurrency
- InnerSource Commons - Patterns and practices for applying open source methodologies within organizations
- GitHub Actions Caching - Strategies for caching dependencies to reduce build times