Tuesday, January 23, 2024

Why the on-call?

Almost every software engineer at a consumer product company goes on-call.

At some companies it's mandatory; at others it's voluntary. Some companies pay an extra allowance for being on-call, while others consider it part of the job.


Last month, I went on-call again after joining a new company. It has quite a comprehensive on-call onboarding process. I shadowed the primary members (the on-call person) for almost 4 weeks before graduating to primary on-call myself.


I was nervous. Heck! I was scared. I work with the Platform team and the sheer number of components our team manages can be overwhelming. As part of my team onboarding, I went through all the documents and CodeLabs that help you get a feel for the system, but nothing can prepare you for an actual on-call.


Luckily the first day was very uneventful. No alerts :). I mostly spent my time re-reading the playbooks that we have for common scenarios. 


On the second day, someone reached out to me with a query related to infrastructure provisioning. Honestly, I was clueless, as I had not worked on that component at my current company. The first thing I did was search for similar queries in our Slack threads and GitHub issues. As it turned out, two engineers had faced the same issue in the last 6 months. I spent close to 2 hours reading the past issues, PRs, and Slack threads to understand the problem in detail.


What happened?


First, I learned about a completely new part of the system that I would have never interacted with in my daily tasks.

Second, I found out the points of contact and other resources to look for in case of such issues.

Third, it increased my confidence to explore and understand a completely new issue.


The "Good" on-call


A good on-call is one where you work with the systems closely enough to understand their weaknesses and areas of improvement. Most organizations (at least the ones I have worked with) have many sub-teams within the Platform division.

One of the major benefits of a good on-call is the ability to touch cross-team systems. For example, during one incident, a faulty HAProxy config was causing requests to fail on their way to the Kubernetes cluster.


Usually, I care about the traffic only once it reaches the boundary of the cluster network. But I got on a call with the network team, raised a PR for the HAProxy config, changed the Route53 entry, and updated the Istio DestinationRule, all within 1 hour.

In another instance, the Kafka consumers in some pods were not able to reach the Kafka endpoint. I spent almost 2 days working with the Data Platform team and not only fixed the issue but also ended up learning many other details about Kafka configurations, client optimization, etc.



The "Bad" on-call


This one is so obvious. 30 non-actionable alerts. People asking you to join random calls all day. Firefighting similar issues again and again without fixing the underlying root cause. Too much focus on "operational excellence" and not enough on engineering health. I have been part of bad on-call processes, and by the end of the week I was so frustrated that I used to take 1-2 days off just to calm down.



How to improve your company's on-call culture?


Pay engineers extra for being on-call!!! Waking up at 2 AM to respond to a PagerDuty alert is not fun. Not for me, not for anyone else. Recognising that waking up at 2 AM is not "part of the job" and compensating engineers for it is the first step in motivating your team to be on-call.


The second way to improve on-call culture is to regularly separate actionable alerts from non-actionable ones. When teams get started, they usually set alerts for everything, and most of them end up becoming noise. Such alerts are called non-actionable because there is not much you can do when they fire. You can either reconfigure the threshold or just remove them. I mean, who needs an alert when your service's traffic increases from 650k to 700k requests per second? Is your system not scalable enough? Maybe it's time to revisit the architecture.
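As a rough illustration of that filtering, here is a minimal Python sketch. It assumes you can export a history of alert firings together with a flag saying whether the on-call engineer actually had to do something. The `AlertEvent` type, the `find_noisy_alerts` function, the alert names, and the thresholds are all hypothetical and not taken from any particular alerting tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertEvent:
    name: str            # hypothetical alert name
    action_taken: bool   # did the on-call engineer have to do anything?


def find_noisy_alerts(events, min_fires=5, max_action_ratio=0.2):
    """Return names of alerts that fired often but rarely needed action.

    min_fires and max_action_ratio are illustrative defaults, not
    recommendations; tune them to your own alert volume.
    """
    fires = defaultdict(int)
    actions = defaultdict(int)
    for event in events:
        fires[event.name] += 1
        if event.action_taken:
            actions[event.name] += 1

    noisy = []
    for name, count in fires.items():
        ratio = actions[name] / count
        if count >= min_fires and ratio <= max_action_ratio:
            noisy.append(name)
    return sorted(noisy)


if __name__ == "__main__":
    history = (
        [AlertEvent("rps_above_650k", False)] * 10
        + [AlertEvent("disk_almost_full", True)] * 6
    )
    # Candidates for a new threshold or for removal.
    print(find_noisy_alerts(history))
```

Anything this flags is only a candidate for review. Whether to retune the threshold or delete the alert is still a call for the team that owns it.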


The third important way to improve on-call culture is to enable other teams to be self-sufficient. This is especially relevant to Platform teams, where every service depends on the underlying Platform. Having playbooks for the most common use cases allows product teams to resolve the majority of issues themselves without involving multiple teams.

Disclaimer: All the views/opinions on this blog are personal and do not represent those of my employer(s), past or present.
