Roadie BTS: Running Backstage at Scale
By Jian Reis • February 20th, 2025
At Roadie, we know that platform teams rely on their Internal Developer Platform (IDP) to be always available, performant, and stable. When an IDP goes down, developer productivity suffers. That’s why we’ve been working behind the scenes on a series of improvements to make Roadie Backstage implementations more reliable and resilient, especially at scale. These aren’t necessarily flashy new features, but they make a huge difference in ensuring everything runs smoothly.
Our engineering team tackled a few key problem areas - memory leaks that were leading to crashes, liveness probes that were restarting services unnecessarily, and background jobs that were sometimes slowing things down. Let’s dig into what was happening, how we fixed it, and why it matters.
The Challenge: Unstable Backends, Memory Leaks, and Unnecessary Restarts
Like any complex platform, Backstage and Roadie’s multi-tenant setup requires ongoing maintenance to stay performant. Over time, we started noticing some recurring issues. Some customers were seeing intermittent backend errors. Our monitoring showed that some instances were restarting more than they should, and memory usage in certain environments was creeping up in a way that suggested something wasn’t being cleaned up properly.
Diagnosing the Issues: A Deep Dive Into Backstage Stability
The first step was figuring out exactly what was happening. We leaned heavily on our monitoring tools, using Grafana dashboards, CloudWatch logs, and Prometheus metrics to track memory usage, CPU performance, and garbage collection behavior. We also took a lot of heap dumps—essentially snapshots of what was in memory at a given time—to compare them over time and spot patterns.
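If you want to capture comparable heap dumps from a plain Node.js backend yourself, the built-in v8 module can write snapshots on demand (you can also start Node with the --heapsnapshot-signal flag and trigger one with a Unix signal). A minimal sketch:

import { writeHeapSnapshot } from 'node:v8';

// Writes a .heapsnapshot file to the process's working directory.
// Load two snapshots taken hours or days apart into Chrome DevTools'
// Memory tab and use the comparison view to spot objects that only grow.
const file = writeHeapSnapshot();
console.log(`Heap snapshot written to ${file}`);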
Some issues were straightforward and fixed quickly, but the more nuanced memory leaks and stability concerns required deeper investigation. These problems took time to surface and were harder to track down, requiring careful analysis across days or even weeks of data. Debugging them meant running long-term comparisons of heap dumps, identifying small but persistent changes, and testing fixes incrementally to ensure they actually solved the issue without introducing new problems.
Fixing Memory Leaks
One of the biggest culprits turned out to be a module-level constant that was continually appended to with the same value. Instead of replacing old data, the service kept adding to the same array on every processing run, so memory usage grew without bound until the instance crashed.
import { Entity } from '@backstage/catalog-model';
import { CatalogProcessor } from '@backstage/plugin-catalog-node';
import { LocationSpec } from '@backstage/plugin-catalog-common';

// Module-level array that is never cleared between processing runs
const A_CONSTANT: string[] = [];

export class CustomProcessor implements CatalogProcessor {
  getProcessorName(): string {
    return 'CustomProcessor';
  }
  async preProcessEntity(
    entity: Entity,
    location: LocationSpec,
  ): Promise<Entity> {
    // The following causes a memory leak
    A_CONSTANT.push('blah');
    return entity;
  }
}
Additionally, Express request handlers were starting un-awaited promises, which in certain edge cases could remain unresolved, holding onto memory and preventing proper cleanup.
import { Router } from 'express';

const router = Router();

const makeLargeObject = async () => {
  // assign large object and sleep.
};

router.get('/consume-memory', async (_, res) => {
  makeLargeObject(); // un-awaited promises can cause memory leaks
  return res.status(200).send();
});
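The fix for this class of leak is simply to await (or otherwise explicitly handle) the promise so its lifetime is tied to the request. A minimal sketch of the corrected handler:

router.get('/consume-memory', async (_, res) => {
  // Awaiting keeps the promise's lifetime bound to the request instead of
  // leaving it dangling after the response has been sent.
  await makeLargeObject();
  return res.status(200).send();
});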
Backstage’s default in-memory cache behavior also contributed to the problem. Expired cache items were only removed when accessed, meaning if something wasn’t retrieved after expiring, it would just sit there, taking up space indefinitely. To fix all of this, we cleaned up how constants were stored, replaced the problematic setTimeout loops with Backstage’s built-in scheduler, made sure all promises were handled correctly, and implemented a scheduled cache cleanup process to prevent memory bloat.
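As an illustration of the scheduled cleanup approach, Backstage's scheduler service can run a periodic eviction task along these lines. The module and task names here are hypothetical and the eviction logic is only a placeholder, not our actual implementation:

import { coreServices, createBackendModule } from '@backstage/backend-plugin-api';

export const cacheCleanupModule = createBackendModule({
  pluginId: 'catalog',
  moduleId: 'cache-cleanup',
  register(env) {
    env.registerInit({
      deps: {
        scheduler: coreServices.scheduler,
        logger: coreServices.logger,
      },
      async init({ scheduler, logger }) {
        await scheduler.scheduleTask({
          id: 'cache-cleanup',
          frequency: { minutes: 30 },
          timeout: { minutes: 5 },
          fn: async () => {
            logger.info('Evicting expired in-memory cache entries');
            // Walk the in-memory store and delete entries whose TTL has
            // elapsed (illustrative placeholder).
          },
        });
      },
    });
  },
});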
Tuning Kubernetes Liveness Probes
Liveness probes are a great feature in Kubernetes, but they need to be carefully configured to reflect the actual health of a service. Previously, our probes were reporting success too early in the startup cycle, so broken or partially initialized processes were being marked as healthy even though they weren't ready to serve traffic. To fix this, we separated out readiness and liveness probes, ensuring that Kubernetes only considered a service healthy once it had completed its startup process correctly.
Worker Separation for a More Reliable API
Backstage, by default, runs background jobs in the same process as the API. That means if something in the background is taking a while - say, a Tech Insights check - it could impact API performance. We saw cases where this led to slow response times or, in extreme cases, crashes. The fix here was relatively straightforward in theory: move background jobs into separate worker processes so they no longer affect the API’s stability. Now, even if something in the background is taking longer than expected, it won’t slow down the user-facing parts of the platform.
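One way to achieve this split with the new Backstage backend system is to build a second backend package that hosts only the background-heavy plugins and deploy it as its own process. This sketch assumes a hypothetical packages/backend-worker entry point and that the plugin in question supports the new backend system; it's illustrative rather than our exact multi-tenant setup:

// packages/backend-worker/src/index.ts (hypothetical worker entry point)
import { createBackend } from '@backstage/backend-defaults';

const backend = createBackend();

// Only background-heavy plugins are wired in here; the main backend
// package keeps the user-facing APIs and omits these.
backend.add(import('@backstage/plugin-tech-insights-backend'));

backend.start();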
The Impact: A More Stable and Resilient Roadie
The results of these changes were immediate. One of the most noticeable improvements was the end of our Sunday night crashes. Previously, during the week, our continuous updates to customer Backstage instances meant that services were restarted regularly, which kept memory leaks from accumulating. But over the weekend, with no updates happening, leaked memory would build up unchecked, predictably leading to crashes by Sunday night or Monday morning. Now, with these fixes in place, memory usage remains stable throughout the week, and the weekend crashes have disappeared.
Backend restarts are also happening much less frequently, and overall system stability has improved across the board. API performance is more predictable, memory usage is under control, and customers are seeing fewer errors and disruptions.
For platform teams using Roadie, this means a smoother experience with fewer unexpected issues. If you’re running an IDP, you shouldn’t have to worry about whether it’s up and running—it should just work. And that’s what these improvements were all about.
Lessons Learned & What’s Next
One of the biggest takeaways from this work is just how important it is to keep a close eye on certain key metrics. If Kubernetes is restarting your pods more than usual, something’s probably wrong. If Backstage’s processing queue is growing, that’s a sign that background tasks aren’t keeping up. And if global constants aren’t being handled properly, they can lead to slow, creeping performance issues that only become obvious after weeks or months.
We also learned a few best practices along the way. Freezing global constants with Object.freeze helps prevent unintended modifications. Handling all promise rejections properly ensures that errors don't silently accumulate in memory. And separating background jobs from API processes is a simple but powerful way to improve overall system stability.
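For example (with an illustrative constant name), freezing turns accidental mutation of shared module-level data into an immediate, visible error rather than a slow leak:

// ES modules run in strict mode, so mutating a frozen array throws a
// TypeError instead of silently growing memory over time.
const ALLOWED_KINDS = Object.freeze(['Component', 'API', 'System']);

// ALLOWED_KINDS.push('Domain'); // throws a TypeError at runtime (and fails type-checking)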
Looking ahead, we’re continuing to refine our observability and monitoring, making sure that stability issues can be detected even earlier. We’re also exploring ways to improve how Backstage handles large-scale processing workloads so that performance remains smooth even as usage grows.
Roadie: Making Backstage Better for Platform Teams
The goal at Roadie is to make sure platform teams get the best possible experience with Backstage. That means not just adding new features, but constantly refining and improving what’s already there. These stability fixes are part of that ongoing effort, ensuring that Backstage is as reliable and hassle-free as possible for teams who depend on it.
If you’re using Roadie today, these improvements are already live and working for you. If you’re thinking about adopting Backstage and want to ensure you’re getting a battle-tested, highly available implementation, we’d love to chat.