How to Define Engineering Standards
By Sam Nixon • September 13th, 2024

At Roadie we don't claim to be experts in writing software standards. We've done it ourselves, both at Roadie and previously at other startups, scaleups and large companies, but we don't claim to be world-beaters at it. What we can claim is that we've seen many, many companies go through the journey of creating standards and then applying them, and we have seen what works.
Why standardise?
Defining and adopting engineering standards is essential for organizations as they scale.
Teams involved in platform initiatives are normally the first intrepid explorers of this territory in larger organizations.
Building tooling for multiple different teams and departments requires consistency and a coherent set of practices. Only then can teams coordinate, share and build scalable, maintainable, and secure software together.
Without standardization, answering basic questions becomes impossible and progress is painfully slow, a problem that often hits home at times of peak stress. How can you know which teams operate publicly facing software that has critical vulnerabilities if only half of the teams are using a dependency scanning tool? How can you roll out a new security requirement when engineering teams are each using one of 5 different security tools?
Defining Engineering Standards
Engineering standards are formal guidelines that outline how code should be written, how systems should be designed, and how processes should be executed. These standards ensure that all engineers are working toward the same quality benchmarks.
That means you need consensus between teams about exactly what should be in those standards. Even before you decide which standards to focus on, it's worth having a plan for how you will agree on them together.
Strategies for Agreeing on Engineering Standards
Getting a team to agree on a set of engineering standards can be challenging but is crucial for their success. Here are some strategies to help facilitate agreement:
- Start with your own SDLC - if you have one - and/or Industry Best Practices:
- Software Development Lifecycle and Production Readiness documents often effectively contain a lot of standards recommendations. They’re extremely useful as an input into a formal set of software standards and the two should be synced closely together.
- Use industry standards like OWASP for security or WCAG for accessibility as a baseline. This helps reduce subjective debate by relying on well-known benchmarks. For example, it’s hard to argue that secure logging isn’t important when it shows up prominently in the OWASP table.
- Collaborate across teams and functions:
- Involve engineers, product managers, and operations teams in defining the standards.
- Run workshops where everyone can voice their opinions, then converge on a decision.
- Appoint a champion:
- Usually this is a member of engineering leadership who is responsible for driving this process and the eventual rollout.
- Occasionally this can be a group though, like an Architecture guild.
- Gradually implement:
- Start small, either:
- A minimal set of standards and build from there as the team gets more comfortable. For example, you can start by enforcing code formatting standards and then gradually add performance or security checks.
- A full set of standards with only a few initial checks so that the team can get comfortable with the whole suite of standards
- Either approach allows teams to give feedback and lets the standards evolve before they are fully enforced
- Regularly review and update:
- An engineering standards document should never stand still.
- Once the standards are set, encode them into your systems.
- Hold regular reviews (e.g., fortnightly or monthly) to track each team's progress
- Create a regular cadence for reviewing the standards themselves, based on team feedback and new technology trends, and recommit to them. Once a year is often enough here.
- As part of these reviews, use data to demonstrate the value of each standard (e.g., reduced production errors, improved system uptime).
Common Standards
To make this concept concrete, let's consider some common standards teams might define (nb: this is necessarily a non-exhaustive list):
- Logging and Monitoring: Log levels, message formats, error tracking, use of a centralized tool, use of alerts.
- Security: Authentication, authorization, encryption, secure coding practices, and dependency vulnerability management.
- Performance: Response times, load management, and scalability.
- Reliability: Redundancy strategies, failure handling, backup and recovery.
- Code Quality: Style guides, review processes, formatting rules, and readability standards.
- Documentation: API documentation, code comments, and README files.
- Testing: Code coverage, test automation, and test environment standards.
- Version Control: Branching strategies, commit message guidelines, and pull request processes.
- Deployment: Continuous Integration/Continuous Deployment (CI/CD) pipelines, rollback procedures, and environment configurations.
- Accessibility: WCAG 2.2 guidelines, color contrast, text-to-speech, keyboard navigation
You don’t want to over-elaborate at this point. Aim to end up with ~8-10 different areas to focus on.
Nice vocabulary to use when setting standards
- Must: Used to define mandatory items, e.g. "A service must use a logger."
- Should: Used to define items which are reasonably expected to exist. If a service chooses not to adopt this standard, the expectation is that it justifies why not.
- May / Could / Will: Used to define items which are more aspirational, or for services that are considered mature.
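If you intend to encode your standards into tooling later, it can help to capture this vocabulary in a machine-readable form from the start. Below is a minimal sketch in TypeScript; the type names and example entries are illustrative, not part of any particular tool.

```typescript
// Illustrative sketch: the level names mirror the vocabulary above.
type StandardLevel = "must" | "should" | "may";

interface Standard {
  area: string;         // e.g. "Logging", "Availability"
  level: StandardLevel; // how strongly the standard is enforced
  statement: string;    // the human-readable rule
}

// Hypothetical example entries, for illustration only.
const standards: Standard[] = [
  { area: "Logging", level: "must", statement: "A service must use a logger." },
  { area: "Availability", level: "should", statement: "Basic SLIs should be defined and used to calculate SLO targets." },
  { area: "Scalability", level: "may", statement: "Owners of a service may run regular scaling exercises." },
];
```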
An example: AcmeCorp.com
Let’s imagine AcmeCorp.com are a well-known platform selling books, clothes, food, laptops, paddling pools and power tools around the world. They’re an anything store, if you will.
Availability and reliability are key to their business, so they spend a lot of time thinking about how to measure and improve that for the software they build.
Previously, teams would simply assert that their service was reliable, performant, secure, etc., but aside from anecdotally looking at the past weeks/months/years to validate that assertion, it was hard to prove or disprove. It was also hard to compare across services.
To help with that, a cross-functional group at AcmeCorp.com have agreed a series of standards that they believe will ensure their services stand up to considerable load during peak periods, and that if outages or incidents do happen, they’ll recover quickly.
| Area | Standard |
|---|---|
| Monitoring | Health checks for critical components must be defined and an ideal state determined. |
| | Service state must be constantly observed and recorded, and dashboards should be created to show this data. |
| | Monitoring should have metrics that describe how effective a service is. These metrics are available and easily viewable on a dashboard. |
| | Events should be exported and/or sampled and collected in addition to other metrics. |
| Availability | Service availability must be determined programmatically. |
| | Expected and unexpected behavior for a given service must be defined in tests and alerts. |
| | Basic SLIs should be defined and used to calculate SLO targets. This should include recording the number of good events / total number of events. |
| | SLOs should be actively measured, calculated, and displayed in a dashboard. |
| | Error budgets may be established, along with a policy outlining what to do when a service runs out of budget. |
| | SLOs (and the error budget policy where appropriate) should be documented in a prominent location where teams and stakeholders can easily review them. |
| Logging | Logs must show startup, shutdown, and errors. |
| | Logs must have a rotation and retention policy defined. |
| | Logs from all hosts must be sent to a centralized location. |
| | The logging pipeline must be resilient to transient failures and should be fully recoverable when ingestion returns to a healthy state. |
| Alerting | Basic health checks must be attached to alerts when failing. |
| | A dashboard must display all alerts currently firing. |
| | The body of any alert must contain the information needed to diagnose and fix the problem. |
| | An official on-call rotation for high-priority alerts must be configured and activated. |
| | High-priority alerts should be tuned such that they don’t fire outside of business hours unless necessary. If resolution of an issue can wait until business hours, it should not page the on-call engineer. |
| | High-priority alerts should be triggered only for urgent, actionable issues that require a human’s intervention. |
| Scalability | Operating manuals for service scaling must be up to date and consumable by newly onboarded or tangentially-familiar engineers. |
| | Service must handle unexpected increases in load without manual effort, up to a known threshold. |
| | Unexpected increases and decreases in load must be handled automatically. |
| | Unexpected increases in load above a known threshold may be handled automatically. |
| | Owners of a service may run regular scaling exercises to test scaling assumptions. |
| | Service may be able to deprioritize features and load when needed. |
| Resiliency and Recovery | Run books must exist that outline the steps for recovering from a loss of capacity. |
| | Owners should have conducted testing on outages to validate recovery run books and quantify performance degradation. |
| | Owners should demonstrate that manual recovery is possible with minimal performance degradation (within an established threshold). |
| | Owners may demonstrate that automated recovery is possible with minimal performance degradation (within an established threshold). |
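To make the SLI and error budget arithmetic in the Availability rows concrete, here is a minimal sketch. The traffic figures are invented for illustration; the calculation itself is simply good events divided by total events, compared against an SLO target.

```typescript
// SLI = good events / total events, over a measurement window.
function availabilitySli(goodEvents: number, totalEvents: number): number {
  return totalEvents === 0 ? 1 : goodEvents / totalEvents;
}

// Example: 999,240 successful requests out of 1,000,000 in the window.
const sli = availabilitySli(999_240, 1_000_000); // 0.99924

const sloTarget = 0.999; // 99.9% availability SLO

// The error budget is the unreliability the SLO still allows.
const errorBudget = 1 - sloTarget;              // 0.001
const budgetConsumed = (1 - sli) / errorBudget; // 0.76 -> 76% of the budget used

console.log({ sli, sloTarget, budgetConsumed });
```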
Breaking Standards Down Into Scorecards and Checks
Once the standards are defined, they need to be actionable, measurable, and concisely grouped so that teams can understand them. This is where checks and scorecards come into play.
Scorecards
Scorecards allow teams to measure in only a few data points how well they are adhering to their engineering standards across a project or organization.
Scorecards should flow naturally from your standards and be fairly simple to define. Name them in a way that aligns with those standards and makes each one comprehensible as a bucket of actionable checks.
For example, `Security` is a good, simple, easy-to-understand name. If you wanted to create levels for your scorecards to have some sense of progression, you could say `Security - Level 1`. `Secure Coding Standards` might be a good option if you wanted to go to a more granular level with your scorecards.

Try and end up with ~10.
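However you name them, a scorecard boils down to a named bucket of checks. A minimal sketch of that shape in TypeScript (the field names are illustrative, not any particular product's schema):

```typescript
// A check is a verifiable condition that evaluates to pass/fail for a service.
interface Check {
  name: string;                                 // e.g. "Node version should be >18"
  run: (serviceId: string) => Promise<boolean>; // true = passing
}

// A scorecard groups related checks under a standards area.
interface Scorecard {
  name: string; // e.g. "Security" or "Security - Level 1"
  checks: Check[];
}
```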
Checks
For each scorecard/standard, you need to break it down into one or more specific checks. A “check” is a verifiable condition that can be automated or manually enforced.
Just like Scorecards, Checks should be named things that are comprehensible but crucially should also be actionable.
For example:
- `Node version should be >18` is a clear true/false statement about what the expectation is for a given service that uses Node.js. It is also clear what needs to be done in order to pass that check.
- Similarly, `CODEOWNERS should be enabled` draws an even more direct line to what needs to be done for a given service to pass a check.
At Roadie we use our Tech Insights plugin to build these checks - the backend for which we also open sourced.
Whether you’re using Roadie, OSS or hand-rolling these checks, it’s important to have an idea of what a computationally enforceable check would look like for each of your standards.
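As a sketch of what a computationally enforceable check can look like, here is one way the `Node version should be >18` check could be evaluated against a service's package.json. This is purely illustrative and not how Tech Insights implements it; the file layout and the use of the `engines` field are assumptions about the service being checked.

```typescript
import { readFile } from "node:fs/promises";

// Passes if the repository declares a Node.js engine constraint of 18 or higher.
async function checkNodeVersion(repoPath: string): Promise<boolean> {
  const pkg = JSON.parse(await readFile(`${repoPath}/package.json`, "utf8"));
  const engines: string | undefined = pkg.engines?.node; // e.g. ">=18.0.0"
  if (!engines) return false; // no declared engine means the check fails

  const match = engines.match(/(\d+)/); // take the first major version number
  return match !== null && Number(match[1]) >= 18;
}

// Usage: checkNodeVersion("./my-service").then(passing => console.log({ passing }));
```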
Automating and visualising standards
Last but not least, you need some way to repeatedly check standards are being adhered to.
This usually comes in two forms:
- You can programmatically check to see whether documentation exists. This is often in the form of a runbook at a given path, i.e. `/docs/runbooks/recovery`
- A third party tool is used to capture data that can then be interrogated programmatically. For example, this can be something as simple as an SLO existing in Datadog for a given service.
Both data sources can be used to confirm at scale whether or not a given service is correctly adhering to a given standard.
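For the first form, the check can be as small as confirming a file exists at the agreed path. A minimal sketch, using the `/docs/runbooks/recovery` path from the example above (everything else is hypothetical):

```typescript
import { access } from "node:fs/promises";

// Passes if the service's repository contains a runbook at the agreed location.
async function hasRecoveryRunbook(repoPath: string): Promise<boolean> {
  try {
    await access(`${repoPath}/docs/runbooks/recovery`);
    return true;
  } catch {
    return false; // missing runbook -> the check fails
  }
}
```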
Teams no longer have to assert compliance; they simply need to ensure that the evidence proving they comply is surfaced.
Many scorecarding solutions do exactly this for you, without the need for teams to individually wire up different systems. For example, Roadie customers use the Tech Insights plugin to provide standardised, automated checks across their entire software catalog with minimal or no intervention from individual teams.
Returning to AcmeCorp
Using the example of AcmeCorp.com again, let’s take one of their areas and turn it into a Scorecard with a series of checks. They use Datadog for their dashboards and Sentry for their logging, so both can provide sources of truth for their checks.
| Scorecard | Underlying Standard | Checks | Data source |
|---|---|---|---|
| Monitoring | Health checks for critical components must be defined and an ideal state determined. | Service has >1 health check defined | Repo file that contains health check test results |
| | Health checks for critical components must be defined and an ideal state determined. | Health check test results return correct status codes | Repo file that contains health check test results |
| | Service state must be constantly observed and recorded and dashboards should be created to show this data. | Service has a Datadog dashboard to record service health | Datadog |
| | Monitoring should have metrics that describe how effective a service is. These metrics are available and easily viewable on a dashboard. | >1 service metric is defined in Datadog | Datadog |
| | Monitoring should have metrics that describe how effective a service is. These metrics are available and easily viewable on a dashboard. | Datadog metric monitor is configured | Datadog |
| | Events should be exported and/or sampled and collected in addition to other metrics | >1 event has been sent to Sentry in the last day | Sentry |
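For the Datadog-backed checks, the data can be pulled from Datadog's public API. A rough sketch of the "Datadog metric monitor is configured" check, assuming monitors are tagged with a `service:<name>` tag and that API keys are available as environment variables (both are assumptions, not part of the AcmeCorp example):

```typescript
// Passes if at least one Datadog monitor is tagged with this service's name.
async function hasMetricMonitor(serviceName: string): Promise<boolean> {
  const url = new URL("https://api.datadoghq.com/api/v1/monitor");
  url.searchParams.set("monitor_tags", `service:${serviceName}`);

  const response = await fetch(url, {
    headers: {
      "DD-API-KEY": process.env.DD_API_KEY ?? "",
      "DD-APPLICATION-KEY": process.env.DD_APP_KEY ?? "",
    },
  });
  if (!response.ok) return false; // treat API errors as a failing check

  const monitors: unknown[] = await response.json();
  return Array.isArray(monitors) && monitors.length > 0;
}
```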
Making a fix easy to implement
The final stage of implementing engineering standards is to make adherence as simple and easy as possible. If a failing check or scorecard is hard to fix, teams will take longer to resolve it, the rate at which they resolve failures will be lower, and standards will ultimately suffer.
Ask yourself:
- How many steps does it take to bring a service into compliance for a given standard? How can that be reduced?
- How can common or shared systems be leveraged to help bring multiple teams into compliance? Using the AcmeCorp example above, a templated Datadog dashboard could help all teams skip design and production steps when setting up a monitoring dashboard.
- How can the overall cycle time from error to resolution be reduced? Can common fixes be added to shared repositories or How To guides be written to help teams?
Where possible, implement quick-fix options.
At Roadie we use the Backstage scaffolder to automate many of the fixes for our scorecards. To take a simple example, one of our engineering standards is that branch protection must be enabled on any repository we create. If a service is linked to a repository without branch protection it fails that check. To resolve it, we have a 3-second Scaffolder template that can modify the GitHub settings associated with the repository. The only thing the team needs to do is look at the check and click a button.
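You don't need to copy our exact setup to get the same effect: under the hood, a fix like this amounts to a single call to GitHub's branch protection API. A hedged sketch using Octokit (the owner, repo, branch and protection settings below are illustrative, not our actual template):

```typescript
import { Octokit } from "@octokit/rest";

// One-click fix: enable branch protection on the default branch.
// The protection settings below are illustrative; tune them to your own standard.
async function enableBranchProtection(owner: string, repo: string): Promise<void> {
  const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

  await octokit.rest.repos.updateBranchProtection({
    owner,
    repo,
    branch: "main",
    required_status_checks: null,
    enforce_admins: true,
    required_pull_request_reviews: { required_approving_review_count: 1 },
    restrictions: null,
  });
}
```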
Conclusion
Defining engineering standards is critical for ensuring that software systems are built and maintained with quality, security, and scalability in mind. Breaking these standards down into checks and scorecards allows teams to monitor compliance and ensure continuous improvement. By following a collaborative approach to defining standards and making them measurable, teams can streamline their development processes and produce more reliable software.
Whatever you are focusing on, the key to success lies in making standards actionable, measurable, and adaptable.