Back in 2018, we published a post about how we do on-call at Monzo. We've come a long way since then. And as more customers join and we offer more services, it’s increasingly important that we’re able to keep our systems running smoothly at all times.
But how do we do that in a human way, one that doesn't put our engineers in uncomfortable situations they're not ready for or risk burning people out?
In the past year we've made a few changes designed to create a supportive environment for our on-call engineers while still helping us respond to incidents effectively.
These changes have helped make the rotas so popular we’ve ended up with a waiting list for engineers who want to join! So we'll share them here for other organisations looking to design human-centred on-call processes.
The first line of defence
As before, our first line of defence is still made up of a primary and shadow on-caller. These are the folks who get paged when something on our platform breaks, when one of our internal tools malfunctions, or when we’re not entirely sure what the problem is. For example, our customer operations team will escalate to them if they’re seeing some uncharacteristic behaviour with payments.
Primary on-callers are usually experienced engineers who are familiar with leading and coordinating incidents that demand varying technical expertise.
Shadow on-callers are folks who are new to being on-call (either at Monzo or otherwise) and wish to learn more by participating in incidents and observing the primary on-callers. We know on-call can feel daunting so this rota gives them an opportunity to see incidents first-hand, ask questions and gradually increase their confidence handling on-call tasks. The ultimate goal is for a shadow on-caller to get comfortable enough to graduate to the primary rota.
Every engineer can join the rota, to build their skills and share the workload
As was the case previously, our on-call rotas are staffed by paid volunteers across a variety of engineering teams (more on that later).
The last time we opened slots on our shadow rota, we had more than double the number of volunteers for the slots we needed to fill. We’re going to write a separate blog post about how we run the shadow rota and what has made it successful.
We have two rotas, to make sure we distribute the work fairly
We have two rotas for scheduling purposes.
In-hours: 10am-6pm Monday to Friday. This rota is staffed with members of the Platform team. We’re planning to expand to other engineering teams soon. The main job of the engineers staffing this rota is to find the right team to hand an issue to if that team isn’t already involved.
Out-of-hours: everything else (6pm-10am Monday to Friday and weekends)
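The split between the two rotas comes down to a simple time check. Here is a minimal sketch in Python, assuming all times are local and ignoring public holidays. The function name `rota_for` is made up for illustration and isn't part of any Monzo tooling.

```python
from datetime import datetime, time

IN_HOURS_START = time(10, 0)  # 10am
IN_HOURS_END = time(18, 0)    # 6pm


def rota_for(moment: datetime) -> str:
    """Return which rota covers the given local time."""
    is_weekday = moment.weekday() < 5  # Monday is 0, Sunday is 6
    in_window = IN_HOURS_START <= moment.time() < IN_HOURS_END
    return "in-hours" if is_weekday and in_window else "out-of-hours"
```

Treating 6pm itself as out-of-hours is a choice made for this sketch, so that the two rotas never overlap at the handover.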
Both the primary and shadow rotas are staffed with 8 volunteer on-callers each, and on-callers must acknowledge a page within 15 minutes. Based on the data from PagerDuty, all of the pages are acknowledged in under 5 minutes!
On-call can be demanding, so we pay engineers when they're on-call
We believe it’s important to not only recognise, but also compensate people for the disruption being on-call outside of work hours can cause. As a result, all out-of-hours on-call at Monzo is paid.
Both primary and shadow on-call engineers get £500 per week. We don’t pay per incident as we don’t want to incentivise the wrong metrics.
Introducing a second line of defence: the specialists
As we've grown very rapidly, so have the systems that serve our customers. It’s been challenging for our primary on-callers to have context on every single aspect of our system, so we’ve introduced a second layer. This is where the specialist on-callers come in.
Our specialist on-callers serve as the second line of defence and bring much-needed domain expertise to incidents. This in turn helps lighten the load on our primary on-callers, who can instead focus on leading and coordinating our response.
Every engineer in their team can join the rota, to build their skills and share the workload
Our specialist rota is also staffed by volunteers, but we approach scheduling in a different way to the primary on-call rota. We have a rota for each specialist area, for example payments.
Since the goal is to provide domain expertise, the rota is staffed by members of a particular team. The payments specialist rota is staffed by the members of the payments team.
Specialist shifts run for a full week, from 6pm on a Friday to 6pm the next Friday.
Engineers on specialist rotas agree to acknowledge a page within 1 hour.
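To make the Friday-to-Friday handover concrete, here is a small sketch that works out which week-long shift a given moment falls into. It assumes local times, and the name `shift_window` is hypothetical rather than taken from our scheduling setup.

```python
from datetime import datetime, timedelta

HANDOVER_WEEKDAY = 4  # Friday (Monday is 0)
HANDOVER_HOUR = 18    # 6pm


def shift_window(moment: datetime) -> tuple[datetime, datetime]:
    """Return the (start, end) of the week-long shift containing `moment`."""
    days_since_friday = (moment.weekday() - HANDOVER_WEEKDAY) % 7
    start = (moment - timedelta(days=days_since_friday)).replace(
        hour=HANDOVER_HOUR, minute=0, second=0, microsecond=0
    )
    if start > moment:
        # It is Friday but before 6pm, so the current shift began last week.
        start -= timedelta(days=7)
    return start, start + timedelta(days=7)
```

A page at 3am on a Wednesday, for instance, falls in the shift that started at 6pm the previous Friday.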
On-call can be demanding, so we pay engineers when they're on-call
We pay everyone on the specialist rota £300 per week as the probability of getting paged is much lower than being on the primary rota.
Bringing in the specialists
The primary on-caller can bring in a specialist whenever they need support. For example, an alert about a database node being down is fired and the primary and shadow on-callers are paged. They acknowledge the page and start working to mitigate the problem with the help of runbooks.
If, after a little while, they feel out of their depth, they escalate to the specialist for Cassandra (our main storage backend). All of this is handled by tooling we built in-house and integrated with Slack. The specialist then jumps in and works with them to mitigate the impact.
In such cases, the primary and shadow on-callers are typically responsible for leading the incident and coordinating overall response: making sure the relevant parts of the business and stakeholders are kept in the loop, the different bits of the investigation are not being duplicated across engineers, etc. This lets the specialist on-caller focus completely on mitigation and technical investigation.
The specialists can also page another team if, for example, the incident impacts multiple different systems each requiring their own domain knowledge.
The in-house escalation tool, called Response, is a Django app, and we’ve open sourced the code so you can try it out too.
Routing domain-specific pages directly to the specialist
It’s very important for us to make sure that our engineers are well-rested and on-call doesn’t have a long lasting detrimental impact on their lives. Part of that is making sure the first line on-callers are only paged when we need them. In some cases when we can identify a domain-specific issue, we can route pages straight to specialists.
For example, if a cron job related to ordering of debit cards fails at 3am, we page the on-call engineer for the payments team directly (this is configured in the alert itself) rather than waking up the primary and shadow on-callers first and having them escalate to the payments specialist.
Currently, we have this in place for alerts for which it is straightforward to define team ownership. This isn’t true for all alerts within our system as of now and a lot of incidents still wake up the first line on-callers who then escalate to the relevant team after doing an initial investigation. But we expect to keep improving this in time.
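The routing rule described above can be sketched in a few lines of Python. This is an illustration only: the `Alert` class, the `owning_team` field and the `page_targets` function are hypothetical names, not our actual alert configuration or part of Response.

```python
from dataclasses import dataclass
from typing import Optional

FIRST_LINE = ["primary", "shadow"]


@dataclass
class Alert:
    name: str
    owning_team: Optional[str] = None  # set only when team ownership is clear


def page_targets(alert: Alert) -> list[str]:
    """Return who gets paged first for an alert."""
    if alert.owning_team:
        # Ownership is known, so skip the first line and page the specialist.
        return [f"{alert.owning_team}-specialist"]
    # No clear owner: the first line investigates and escalates if needed.
    return list(FIRST_LINE)
```

With this shape, an alert such as `Alert("card-ordering-cron-failed", owning_team="payments")` goes straight to the payments specialist, while an alert with no owner still reaches the primary and shadow on-callers.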
Over time we expect the responsibilities between the first line and specialist on-callers to reverse: almost everything should route directly to the team that owns the system or is best placed to handle the issue, and the first line rotation becomes a safety net for incidents and alerts that we can't easily determine a team for.
This is our north star that we’re actively working towards. But it’ll take us some more time to get there because:
People answering the page need to learn how to deal with incidents. First line on-callers know how to do this, but specialists are currently there for domain knowledge only. We need to make sure they’re well-trained to lead incidents and communicate effectively with different parts of the business.
We need to train folks in our customer operations teams how to identify the right team to page.
Our systems aren't yet mature enough to be able to isolate the specific team when an alert fires. For example, a failing microservice might cause an alert to fire several levels higher in the call graph.
Benefits of the specialist rota
The specialist rota has helped to make sure the people on the primary and shadow rotas don’t feel overwhelmed with the amount of knowledge and context they need to have to be an effective on-call engineer.
We don’t limit the number of rotas people can be part of. You can be on the primary or shadow rota as well as your team’s specialist rota. This lets people from other teams who aren’t part of the primary or shadow rotas get a sense of on-call with much more relaxed constraints, and helps break knowledge silos within the team.
Human-centred on-call
All of these initiatives are designed to create a supportive environment for our on-call engineers, and to avoid burning people out or putting them in uncomfortable situations before they’re ready.
As well as providing many routes for engineers to tell us how they’re finding the experience of being on call, we encourage people to take time in lieu if they’ve been paged out-of-hours or over the weekend. More often than not, other people volunteer to take over the pager when they realise someone’s had a particularly rough week on-call. These small things make all the difference sometimes.
We’ve learnt a lot while going through this process. We continue to iterate, and we’ll write more posts to share as much as we can with the wider community. We’d love to hear how you approach on-call, and if you try out the Response app, we’d love to hear how you get on!