Splitting TechDocs Out of Our Monolithic Backstage Deployment
By Miklos Kiss • February 26th, 2026
At Roadie, we operate Backstage at a significant scale. Each customer receives a fully isolated, single-tenant Backstage deployment running in its own Kubernetes namespace. This architecture gives customers strong security boundaries, predictable isolation, and the freedom to customize their instance without affecting others.
But this model also introduces operational complexity. Architectural decisions that are harmless at a small scale can cause problems when every tenant runs its own Backstage stack. One such decision was deploying TechDocs as part of the same backend service as everything else.
In this article, we'll talk about why our original approach stopped scaling, how we redesigned it, and what improved when we split TechDocs out of the monolithic backend.
Our Original Architecture
In our original setup, each Roadie tenant ran a complete Backstage application composed of all frontend and backend plugins bundled together. On the backend side, everything was executed within a single Node.js process.
This meant that for each customer, a dedicated Kubernetes namespace was created to ensure isolation, a single Backstage backend pod was deployed into that namespace, and all backend plugins like Catalog, Scaffolder, TechDocs, and Auth were loaded into the same backend service.
From an architectural standpoint, this resulted in a classic monolith. Every backend plugin shared the same runtime, memory space, CPU limits, and lifecycle. This design was simple to operate and reason about early on, and it served us well for a long time. But as customer usage patterns evolved, issues began to appear.
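Concretely, a monolithic composition looks roughly like a standard Backstage backend entry point, with every plugin added to a single process. A simplified sketch using Backstage's backend system, not our exact code:

```typescript
import { createBackend } from '@backstage/backend-defaults';

const backend = createBackend();

// All backend plugins share one Node.js process, and therefore one
// CPU/memory budget and one failure domain.
backend.add(import('@backstage/plugin-catalog-backend'));
backend.add(import('@backstage/plugin-scaffolder-backend'));
backend.add(import('@backstage/plugin-techdocs-backend'));
backend.add(import('@backstage/plugin-auth-backend'));

backend.start();
```

A spike in any one of these plugins is invisible to Kubernetes as anything other than "the backend pod is busy".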

The Problem: Resource Contention
As more customers adopted TechDocs and began publishing larger documentation sites, we started receiving alerts that were difficult to explain at first glance.
These incidents typically involved brief periods during which the Backstage backend became unavailable, CPU usage exceeded configured limits, and Kubernetes restarted backend pods due to resource exhaustion. What made this particularly challenging was that the failures were intermittent and tenant-specific: many tenants were completely unaffected, while a handful experienced repeated disruptions, which made the root cause harder to pinpoint.
After analyzing metrics, logs, and pod-level behavior, a pattern emerged. In every affected case, the TechDocs backend plugin was consuming a disproportionate amount of CPU and memory.
Why TechDocs Was the Culprit
We noticed that issues only occurred for tenants with particularly heavy TechDocs usage. This included large documentation sites with many pages and assets, repositories containing multiple documentation sets, and frequent rebuilds triggered by ongoing documentation updates.
This behavior is expected when you look at what TechDocs does under the hood. The backend is responsible for fetching documentation files, rendering Markdown content, and running documentation generators like MkDocs. These tasks are inherently resource-intensive, especially during large or frequent builds.
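As a reference point, this build pipeline is driven by TechDocs configuration along these lines (illustrative values; the bucket name is hypothetical and publisher settings vary by installation):

```yaml
techdocs:
  builder: 'local'      # generate docs on the backend itself
  generator:
    runIn: 'docker'     # run MkDocs in a container ('local' runs it in-process)
  publisher:
    type: 'awsS3'       # where generated sites are stored
    awsS3:
      bucketName: 'example-techdocs-bucket'  # hypothetical
```

With `builder: 'local'`, every build consumes CPU and memory on whichever service hosts the techdocs plugin, which is exactly why co-locating it with the rest of the backend hurts.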
When TechDocs runs in the same process as the rest of the Backstage backend, resource spikes are not contained. CPU saturation or memory pressure caused by documentation builds directly impacts unrelated functionality, including catalog ingestion, scaffolder workflows, and authentication. The monolithic design had become a liability.
The Decision: Split TechDocs Out
To restore stability and regain operational control, we decided to extract TechDocs from the monolithic backend and deploy it as a separate Backstage backend application. The main reason for this was isolation. We wanted TechDocs to operate independently so that its workload characteristics would not interfere with the rest of the system. At the same time, we wanted to avoid breaking existing APIs or introducing fragile custom integrations.
Our TechDocs requirements were simple:
- It needed to be independently deployable so it could evolve on its own schedule.
- It needed to be discoverable by the core backend without hardcoded configuration.
- It needed to scale independently, based on documentation workload rather than overall backend traffic.
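The discoverability requirement is the interesting one. Backstage's discovery service resolves a plugin ID to a base URL, so callers never hardcode where a plugin lives; overriding a single entry is enough to move that plugin to its own deployment. A minimal, self-contained sketch of that routing idea, with hypothetical in-cluster service names:

```typescript
// Sketch of per-plugin discovery routing (hypothetical names and URLs).
type Discovery = { getBaseUrl(pluginId: string): Promise<string> };

function createSplitDiscovery(
  coreBaseUrl: string,
  overrides: Record<string, string>,
): Discovery {
  return {
    async getBaseUrl(pluginId: string): Promise<string> {
      // Overridden plugins route to their dedicated service;
      // everything else defaults to the core backend.
      return overrides[pluginId] ?? `${coreBaseUrl}/api/${pluginId}`;
    },
  };
}

// Hypothetical service names within a single tenant namespace:
const discovery = createSplitDiscovery('http://backstage-core:7007', {
  techdocs: 'http://backstage-techdocs:7007/api/techdocs',
});

discovery.getBaseUrl('catalog').then(console.log);  // http://backstage-core:7007/api/catalog
discovery.getBaseUrl('techdocs').then(console.log); // http://backstage-techdocs:7007/api/techdocs
```

Because resolution happens at call time, the core backend needs no code changes to find TechDocs at its new address.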
The New Architecture
In the new design, each tenant runs two distinct Backstage backend services instead of one.
The first is the core Backstage backend. This service is responsible for handling catalog ingestion, Tech Insights, authentication, and other core APIs.
The second is a dedicated TechDocs backend. This service runs only the TechDocs-related functionality and handles documentation builds and rendering.
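The TechDocs service is itself just another Backstage backend with a much smaller plugin surface. Roughly, as a simplified sketch rather than our exact composition:

```typescript
import { createBackend } from '@backstage/backend-defaults';

const backend = createBackend();
// Only TechDocs functionality is loaded into this process, so build
// spikes are confined to this pod's resource limits.
backend.add(import('@backstage/plugin-techdocs-backend'));
backend.start();
```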
The two services communicate using Backstage’s built-in discovery mechanism.
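Backstage's discovery configuration supports per-plugin endpoint overrides in app-config, which is enough to point only the techdocs plugin at the dedicated service while every other plugin resolves to the core backend. The hostname here is illustrative:

```yaml
discovery:
  endpoints:
    - target: http://backstage-techdocs:7007/api/techdocs  # hypothetical service name
      plugins: [techdocs]
```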

Results
Cleaner Cluster Allocations
By running TechDocs in its own pod, we gained fine-grained control over its resource profile. CPU and memory limits are now explicitly tuned for documentation workloads, and scaling rules can be applied only where documentation usage justifies it.
This prevents overprovisioning the core backend while still allowing TechDocs to scale aggressively when needed.
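In practice this means the two Deployments carry different resource profiles. A sketch of what a TechDocs-specific container spec might look like, with all values hypothetical and tuned per tenant:

```yaml
# Kubernetes container resources for the dedicated TechDocs backend
resources:
  requests:
    cpu: 250m       # modest baseline when no builds are running
    memory: 512Mi
  limits:
    cpu: "2"        # headroom for MkDocs builds
    memory: 2Gi
```

The core backend keeps its own, flatter profile, and a burst of documentation builds can no longer evict or throttle it.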
Improved Stability
Isolating TechDocs eliminated an entire class of failures. Documentation builds no longer put pressure on unrelated backend functionality. Catalog ingestion, scaffolder executions, authentication flows, and core API availability remain stable even during peak documentation activity.
For customers, this translates directly into fewer outages and a more predictable Backstage experience.
Easier Debugging and Operations
From an operational perspective, separating TechDocs clarified boundaries. Resource spikes are now immediately attributable to the correct service. Logs are easier to interpret, and incidents can be diagnosed and mitigated quickly.
This separation also simplifies future tuning and capacity planning. With TechDocs isolated, we can reason about its resource usage independently from the rest of the backend and make decisions based on real workload characteristics rather than worst-case assumptions. CPU and memory requests can be adjusted specifically for documentation builds, and autoscaling policies can be tuned around build frequency, repository size, and peak documentation activity. This also makes forecasting easier: growth in documentation usage no longer forces us to overprovision the core backend. Instead, we can scale and optimize each service independently, reducing wasted capacity while maintaining predictable performance.
What We Gained Overall
Splitting TechDocs out of the monolith forced us to formalize a clear boundary between core platform responsibilities and workload-specific plugins, which in turn improved how we think about backend composition overall. What started as a targeted stability fix turned out to have broader implications for how we structure the backend.
As a result, we now have a repeatable pattern for extracting heavy backend plugins into independent services when their resource profiles or failure modes warrant it. Backend deployments are slimmer, responsibilities are better defined, and each service can be sized and scaled according to the work it actually performs rather than the worst-case behavior of a single plugin. This makes the system easier to reason about both during normal operation and when something goes wrong.
Takeaways
TechDocs was the most obvious candidate for separation due to its workload profile, but this architecture opens the door to further modularization where it makes sense. Backstage provides the primitives needed to support this kind of design. At scale, using them becomes less of an optimization and more of a requirement.
If you're running Backstage in a multi-tenant or high-scale environment and are seeing similar symptoms, it's worth examining your heaviest backend plugins and questioning whether they belong in the same process as the rest of Backstage.