Scaling Backstage

By Jussi Hallila, August 30th, 2024

There are multiple challenges that arise when the volume of data in a Backstage catalog grows to thousands or tens of thousands of entities, ranging from performance to ease of use. We’ll explore these in this article and suggest possible ways around them for your own Backstage deployments.

The Backstage developer portal is an excellent tool for platform teams, as well as engineers, to keep a handle on their software, maintain compliance statuses and spin up new services. Unfortunately, open source Backstage is known for its difficult setup and overall cumbersome maintainability.

This pain is often made worse when the catalog within a Backstage instance gets large. Below are a few high-level pointers that we have come across during our journey to support bigger engineering organizations. Over the last year we’ve scaled our installation to support organizations with multiple tens of thousands of entities. We’ll dig deeper into individual topics, based on interest, in further articles.


Let us know about your optimization problems or questions in either the Roadie or Backstage Discord channels!

Handling the Catalog data

We have written more extensively about catalog performance and how to improve that in a separate blog post.

When developing on top of Backstage, you are always building on the foundation of solid catalog data. This makes the CatalogAPI usually the most used API on both the backend and frontend of the application. It may be that individual entities in the system grow large (looking at you, API specs) or that there is just a large quantity of them (looking at you, automatically ingested AWS resources). Therefore it is important to retrieve only the fields that are actually displayed to the user and to limit the number of entities being fetched.

The default CatalogClient has the option to retrieve only relevant fields through the API. Do use it. Also make sure to use pagination where possible and hit the correct endpoints with your catalog client. Retrieving less data is always going to be cheaper than retrieving more data. It really makes a big difference whether you end up JSON parsing and stringifying the biggest API docs in the world multiple times, along with the rest of the catalog data, or running inefficient string manipulations over it just for sorting purposes.
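As a minimal sketch, assuming the standard `@backstage/catalog-client` and the paginated `queryEntities` endpoint available in newer releases, a narrow field list plus cursor-based paging keeps each response small. The field list and the Component filter here are illustrative; ask only for what your UI actually renders.

```typescript
import { CatalogApi } from '@backstage/catalog-client';

// Fetch only the fields the UI needs, one page at a time.
export async function listComponentRefs(catalog: CatalogApi, token?: string) {
  const fields = ['kind', 'metadata.namespace', 'metadata.name', 'spec.owner'];
  const refs: string[] = [];
  let cursor: string | undefined;

  do {
    const response = await catalog.queryEntities(
      cursor
        ? { cursor, limit: 500, fields }
        : { filter: { kind: 'Component' }, limit: 500, fields },
      { token },
    );
    refs.push(
      ...response.items.map(
        e => `component:${e.metadata.namespace ?? 'default'}/${e.metadata.name}`,
      ),
    );
    cursor = response.pageInfo.nextCursor;
  } while (cursor);

  return refs;
}
```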

In some cases we at Roadie have needed to introduce our own endpoints and catalog queries to improve performance in larger catalogs. These endpoints can be as simple as pointed subset queries directly against the database tables to identify only the needed entities, or endpoints that return partial entity data in preformatted response shapes. Being able to run use-case-specific queries for just the relevant data, and to lean on the performance usually available in the database layer, makes a big difference at times.
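As a rough illustration of such a pointed query, here is a sketch that assumes read access to the catalog database and the default schema, where entity refs live in the refresh_state table. Table and column names vary between Backstage versions, so treat this as an example of the idea rather than a supported API.

```typescript
import { Knex } from 'knex';

// A pointed, read-only query that returns just the component refs,
// skipping the full entity JSON entirely. Assumes the default catalog
// schema at the time of writing (refresh_state.entity_ref).
export async function listComponentEntityRefs(knex: Knex): Promise<string[]> {
  const rows: Array<{ entity_ref: string }> = await knex('refresh_state')
    .where('entity_ref', 'like', 'component:%')
    .select('entity_ref');

  return rows.map(row => row.entity_ref);
}
```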

Processors vs. Providers

In the early days of Backstage, the way to ingest entities into the catalog was to use Processors to retrieve data from third-party sources. This approach is still a remnant within the product, and is (unfortunately) still used by some integrations. The main purpose of processors nowadays is to enhance entities, but the same caveats around their usage still apply.

The maintainers of the project introduced Providers to Backstage at a later stage. Providers give you more maneuverability to schedule and modify the payloads that you are sending to the catalog. Being able to chunk the ingested entities into smaller buckets, having the ability to schedule the intervals with more (or less) granularity, and having better visibility into the internals of the catalog are big benefits when tweaking catalog ingestion to work optimally.
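As a minimal sketch of what a scheduled provider can look like, here is one built against the EntityProvider interface from `@backstage/plugin-catalog-node` and the new backend’s scheduler. The `fetchFromSource` callback stands in for whatever system of record you ingest from, and the entities it returns are assumed to already carry the `backstage.io/managed-by-location` annotations the catalog requires.

```typescript
import { Entity } from '@backstage/catalog-model';
import { SchedulerServiceTaskRunner } from '@backstage/backend-plugin-api';
import {
  EntityProvider,
  EntityProviderConnection,
} from '@backstage/plugin-catalog-node';

export class MyResourceEntityProvider implements EntityProvider {
  private connection?: EntityProviderConnection;

  constructor(
    private readonly taskRunner: SchedulerServiceTaskRunner,
    // Hypothetical: reads entities from your own source of truth.
    private readonly fetchFromSource: () => Promise<Entity[]>,
  ) {}

  getProviderName(): string {
    return 'my-resource-entity-provider';
  }

  async connect(connection: EntityProviderConnection): Promise<void> {
    this.connection = connection;
    // The scheduler decides how often this runs; the interval lives in one
    // place instead of being baked into a processor.
    await this.taskRunner.run({
      id: this.getProviderName(),
      fn: async () => this.refresh(),
    });
  }

  private async refresh(): Promise<void> {
    if (!this.connection) {
      return;
    }
    const entities = await this.fetchFromSource();

    // A full mutation scoped to this provider; the catalog computes the
    // delta, and payloads stay bounded to what this provider owns.
    await this.connection.applyMutation({
      type: 'full',
      entities: entities.map(entity => ({ entity })),
    });
  }
}
```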

In many cases, though, the problem may still remain. Providers may need to emit locations or other intermediate data before it is finally stored in the system as a full entity, and in those cases the entity may need to go through the processing pipeline again.

When you encounter issues that may be related to this approach, make sure that your processors are nimble beasts and are definitely not blocking the event loop. Small milliseconds make a difference here, because the system is both processing a lot of data and doing it multiple times. User experience may also suffer when processing times get long or entities that users expect to see immediately are stuck behind a large processing queue.
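To make that concrete, here is a sketch of a processor that stays cheap per entity: it decorates entities from an in-memory lookup that is refreshed elsewhere, rather than doing a network round trip or heavy synchronous work inside the processing loop. The owning-team map and the `example.com/owning-team` annotation are hypothetical; `github.com/project-slug` is the standard annotation.

```typescript
import { Entity } from '@backstage/catalog-model';
import { LocationSpec } from '@backstage/plugin-catalog-common';
import {
  CatalogProcessor,
  CatalogProcessorCache,
  CatalogProcessorEmit,
} from '@backstage/plugin-catalog-node';

export class TeamAnnotationProcessor implements CatalogProcessor {
  // Hypothetical lookup (repo slug -> owning team), refreshed out of band.
  constructor(private readonly ownerByRepo: ReadonlyMap<string, string>) {}

  getProcessorName(): string {
    return 'TeamAnnotationProcessor';
  }

  async preProcessEntity(
    entity: Entity,
    _location: LocationSpec,
    _emit: CatalogProcessorEmit,
    _originLocation: LocationSpec,
    _cache: CatalogProcessorCache,
  ): Promise<Entity> {
    const repo = entity.metadata.annotations?.['github.com/project-slug'];
    const owner = repo ? this.ownerByRepo.get(repo) : undefined;
    if (!owner) {
      return entity; // cheap no-op for the common case
    }
    return {
      ...entity,
      metadata: {
        ...entity.metadata,
        annotations: {
          ...entity.metadata.annotations,
          'example.com/owning-team': owner, // hypothetical annotation
        },
      },
    };
  }
}
```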

At Roadie we have taken an even more performant alternative approach for a few specific entity kinds that we natively support and ingest. We wanted more fine-grained control to serve our larger users better, so we created an alternative, self-contained processing module that handles the processing for those specific use cases.

Scaffolder

By default the Scaffolder within a Backstage project runs in the same process as the rest of the application. This is by design, but there is an escape hatch that can be used to externalize it from the main codebase. Backstage is built as a modular monolith and can in theory be spread out into multiple services.

There is a fair amount of work involved in achieving that, but the payoff is usually there. The decision to make here is to identify the tradeoffs your company is willing to accept. Is it OK that only a single scaffolder run can execute at a time? Does it matter if the rest of the Backstage application begins to show signs of slowness while other processes are running?

The Scaffolder, and larger Tech Insights installations, take a lot of CPU cycles from the underlying hardware, which may negatively interfere with the user experience. Blocking the event loop is the big no-no in the Node.js world when it comes to performance. If you are running heavy tasks within the same process that you are using to serve your users, you may encounter bad times. If bad times appear, consider externalizing some of the chunkier pieces of your instance. These may include the Scaffolder, Tech Insights, Search indexing, Cost Insights and the catalog processing loop.
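Short of splitting out a whole service, one hedge against event-loop blocking is to push CPU-heavy computations into Node.js worker threads. A sketch, where `./heavy-task.js` is a hypothetical compiled worker script that reads `workerData` and posts its result back:

```typescript
import { Worker } from 'node:worker_threads';
import path from 'node:path';

// Runs a CPU-heavy computation (say, scoring Tech Insights facts) in a
// worker thread so it cannot block the backend's event loop.
export function runInWorker<T>(payload: unknown): Promise<T> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(path.resolve(__dirname, 'heavy-task.js'), {
      workerData: payload,
    });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', code => {
      if (code !== 0) {
        reject(new Error(`Worker exited with code ${code}`));
      }
    });
  });
}
```

This keeps the main process free to answer catalog and frontend requests while the heavy lifting happens elsewhere.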

Roadie has extracted the larger, more resource-hungry processes into ephemeral standalone processes to avoid eating up the event loop cycles of the main application. In our case these run in AWS, where we host your Roadie instances, as ECS tasks or Lambda functions, depending on the use case. With the new backend system now fully out for the Backstage project, it should be much easier to spin supporting services out into their own processes and leave the catalog to do what it does best: serving entities to users in a performant manner.
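With the new backend system, a split deployment is mostly a matter of creating a second backend package that only adds the plugins you want to run separately. A minimal sketch of a standalone scaffolder service (the package path is hypothetical, and your app-config still needs discovery and auth pointing the services at each other):

```typescript
// packages/backend-scaffolder/src/index.ts - a hypothetical second backend
import { createBackend } from '@backstage/backend-defaults';

const backend = createBackend();

// Only the scaffolder runs here; template executions burn CPU in this
// process instead of the one serving the catalog and the frontend.
backend.add(import('@backstage/plugin-scaffolder-backend'));

backend.start();
```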

Perceived Frontend Performance

Of course, performance only matters to users if they can feel it, and in standard Backstage installations that feeling lives in the frontend layer of the application. Does your catalog load fast? Do you have a ton of frontend plugins installed, inflating your bundle sizes? Do you need to rebuild your TechDocs every time you navigate to the docs page?

For the frontend resources, there are multiple well-known performance tricks that can be applied in the build process and in the hosting solution you use to serve your frontend app. In the end, the Backstage frontend is a single-page application, with all the known benefits and caveats. All the data displayed needs to be retrieved from somewhere before it enters the JavaScript runtime to be rendered on the screen.

In most cases, getting the data we want to display means API calls. For some data that is OK, like getting cheap values from fast endpoints, but for other data the round trip to the server is not worth it. You can embed relevant information into the index.html that is either served from the backend (in newer Backstage installations) or pre-built during the deployment process. You can also use localStorage to your advantage; in fact, Backstage already uses it for some of its data, though unfortunately not really for caching purposes.
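As a small sketch of the localStorage idea, a tiny TTL cache lets the frontend skip round trips for values that are fine to show a few minutes stale. The key names and TTL are up to you:

```typescript
// A minimal localStorage cache with a time-to-live.
type CachedValue<T> = { value: T; expiresAt: number };

export function readCached<T>(key: string): T | undefined {
  const raw = window.localStorage.getItem(key);
  if (!raw) return undefined;
  try {
    const cached = JSON.parse(raw) as CachedValue<T>;
    if (Date.now() > cached.expiresAt) {
      window.localStorage.removeItem(key);
      return undefined;
    }
    return cached.value;
  } catch {
    return undefined;
  }
}

export function writeCached<T>(key: string, value: T, ttlMs: number): void {
  const cached: CachedValue<T> = { value, expiresAt: Date.now() + ttlMs };
  window.localStorage.setItem(key, JSON.stringify(cached));
}
```

Check readCached before firing the API call, and write the response back with a TTL that matches how stale the data is allowed to be.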

YAMLs and scaling maintainability

The canonical approach recommended by the CNCF open source maintainers of Backstage is to use catalog manifest files, usually called catalog-info.yaml, within code repositories to store entity data for the catalog. In a large number of cases this is the wrong approach. You may be able to keep domain and system entities up to date easily, since they are small in number and change rarely. For other kinds and types of entities we have seen, across multiple customers, that maintaining those YAML files and keeping them up to date in an engineering organization is difficult.

A better approach in many cases is to automate the ingestion of at least the initial entity information from a more robust source of truth. In the end, trusting humans to update a random file in their repository just for the sake of updating it seems unlikely to succeed 100% of the time.

In a modern engineering organization there are multiple good sources to use as the canonical starting point for your entity data. For users and groups you have your Oktas and Azure Entra IDs. For repositories you have your GitHub APIs. For components, APIs and resources you have your running instances, your exposed OpenAPI endpoints and your Kubernetes or cloud provider APIs.

Ingesting the relevant data automatically from these sources allows you to trust that the information is up to date and correctly mirrors the software that is actually being developed within your organization and running in your environments.
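Backstage already ships provider modules for several of these sources. As one example, a sketch of wiring the GitHub org module into a new-backend-system backend, so users and groups are ingested straight from the organization instead of from hand-maintained YAML (the org and schedule themselves live in app-config):

```typescript
// packages/backend/src/index.ts
import { createBackend } from '@backstage/backend-defaults';

const backend = createBackend();

backend.add(import('@backstage/plugin-catalog-backend'));
// Ingests users and groups directly from your GitHub organization.
backend.add(import('@backstage/plugin-catalog-backend-module-github-org'));

backend.start();
```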

That is, in the end, the purpose of a developer portal: a map of the software you are providing to your customers, not a map of the software you have at some point written down in a well-formatted text file.

Rate limits

There are downsides to automating your catalog ingestion as well. Backstage relies heavily on integrations with third-party APIs, and this has implications for how up to date it can keep the catalog information. Being so reliant on other services, while wanting to be the single pane of glass that displays their information, means that you need to be aware of the limitations of this setup.

Backstage is by default a pull-based system which contacts third parties using API tokens or other authentication information and retrieves the relevant data. For data fetched just in time by plugins, this isn’t a massive issue since the actual concurrent user count is relatively small; even for bigger clients we don’t usually see morning rushes in the high three digits. Retrieving data from third parties at runtime and displaying it on the familiar frontend thus works well.

On the other hand, Backstage also stores data internally: data that it gathers automatically from third parties and uses to generate insights or enhance entities. These processing loops usually run on a schedule and try to slurp in as much as they can. Herein lies the problem: this is where rate limits enter the picture.

Monitoring rate limits against different systems is extremely important and helps you identify when you are getting close to the edge where the downstream service gets angry. Backstage offers a good set of monitoring primitives to expose metrics from your providers; you can, for example, set up OpenTelemetry to gather the information you need. You can expose rate limit information either by querying it periodically on a loop or by embedding it directly into your fetch client implementations. The former gives you the ability to manually tweak your calling schedules to accommodate your integrations; the latter may let you automatically slow down the calling loop to stay within the limits.
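A sketch of the latter approach, assuming a GitHub-style API that reports x-ratelimit-remaining and x-ratelimit-reset headers; the threshold of 50 remaining calls is an arbitrary illustration, and your provider’s header names may differ:

```typescript
// A fetch wrapper that backs off before the rate limit is exhausted.
export async function rateLimitedFetch(
  url: string,
  init?: RequestInit,
): Promise<Response> {
  const response = await fetch(url, init);

  const remainingHeader = response.headers.get('x-ratelimit-remaining');
  const resetHeader = response.headers.get('x-ratelimit-reset');

  if (remainingHeader !== null && resetHeader !== null) {
    const remaining = Number(remainingHeader);
    const resetAtMs = Number(resetHeader) * 1000; // header is epoch seconds

    if (remaining < 50) {
      // Slow the ingestion loop down instead of running into 429s.
      const waitMs = Math.max(0, resetAtMs - Date.now());
      await new Promise(resolve => setTimeout(resolve, waitMs));
    }
  }

  return response;
}
```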

Let us know if you have encountered any other aspects or approaches that have helped you scale your Backstage instance and work effectively within your organization.
