Split oncalls by platform, not project

A product team that ships across several platforms should split its oncall by platform, rather than run a single rotation.

Oncall rotations are important: they mean there’s always someone around to fix problems as they happen, and they usually keep things fair by ensuring each person is oncall for an equal amount of time.

When a new team ships its product and sets up oncall, there’s often a tendency to create a single rotation. The discussion goes something like this: “we only have 3 backend engineers, 2 iOS engineers, and 2 Android engineers. If we split, we’ll be oncall every 2 weeks”. That is true, for the two mobile platforms at least, but realistically it’s not as simple as that.

Imagine a serious problem happens on iOS while a backend engineer is oncall, or a backend problem happens while an iOS engineer is oncall. More often than not, the oncall engineer messages someone from the relevant platform: the most senior person, the person they know best, or just whoever appears to be awake and active. Failing that, the entire team might receive a message asking who can help, disturbing many people. Worst case, somebody else has to be woken up, and we now have two people awake in the middle of the night rather than one. The result is that everyone is in effect “oncall” all the time, and it is never clear who should be responding to problems. I don’t think this is a fair approach to take.
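To put rough numbers on this trade-off, here is a minimal sketch using the team sizes from the example above. It assumes one-week shifts, equal turns for everyone, and that a problem is no more likely to land in one person's week than another's; the function names are purely illustrative. It prints how often each person is oncall under each scheme, and the share of weeks in a single shared rotation where a given platform's problem would land on someone from a different platform.

```python
from fractions import Fraction

# Team sizes from the example above; one-week shifts are an assumption.
TEAM = {"backend": 3, "ios": 2, "android": 2}

def weeks_per_cycle(rotation_size: int) -> int:
    """Each person is oncall one week in every `rotation_size` weeks."""
    return rotation_size

def mismatch_rate(platform: str, team: dict[str, int]) -> Fraction:
    """Share of weeks in a single shared rotation where the oncall engineer
    is NOT from `platform`, so a problem there likely needs a second person."""
    total = sum(team.values())
    return Fraction(total - team[platform], total)

single_cadence = weeks_per_cycle(sum(TEAM.values()))
split_cadence = {p: weeks_per_cycle(n) for p, n in TEAM.items()}
mismatch = {p: mismatch_rate(p, TEAM) for p in TEAM}

print(f"single rotation: oncall 1 week in {single_cadence}")
for p in TEAM:
    print(f"{p}: split = 1 week in {split_cadence[p]}, "
          f"shared-rotation mismatch = {mismatch[p]}")
```

Under these assumptions the single rotation puts each person oncall one week in seven, but an iOS or Android problem lands on someone from another platform in five weeks out of seven, and a backend problem in four out of seven.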

Another common argument is “my platform will have a very busy oncall, and the others will be quiet”. Unfortunately that might be true as well, but a single rotation doesn’t fix it, for the same reasons described above. Without a clear system for who handles a platform’s problems in a particular week, in most cases those tasks still find their way back to an engineer from that platform. Things may look like they’re working well for a while, when in fact one member of a platform is taking the vast majority of the load, which again is unfair.

To play devil’s advocate, I don’t think splitting by platform is ideal if the team really is too small. The ideal would be for everyone to know every platform well enough to fix most problems themselves, passing things on to someone with more knowledge only when absolutely needed. This feels more doable in smaller companies and startups, but harder in big ones, where engineers typically only have time to work on one (maybe two) platforms.

There are some other solutions. Firstly, the team could be grown to reduce how often each member is oncall, although smaller teams usually work more efficiently. Secondly, the oncall could be shared with another team, so that the combined engineers support two or more products. This can work, but it’s usually too hard for everyone to have sufficient knowledge across that much surface area. Finally, an oncall could be created purely for triage. That person checks that it’s a real problem, works out which platform it’s on, and passes it on to a second, hidden oncall for that platform. While this can work, it’s pretty unwieldy to manage, and isn’t much fun for the person only triaging.

I’ve tried all of these approaches, and I believe splitting by platform is the best answer. If nothing else it keeps things fair, and reduces how many people are woken up in the middle of the night. Now to decide whether it starts on a Monday or a Friday…