Culture is not just a feel-good thing. Most reliability leaders actively invest in culture because, as research shows year after year, it significantly improves the overall performance of their teams. The latest DORA State of DevOps Report, a Google Cloud initiative, is titled “Culture is everything” and highlights that teams with a good culture perform 30% better than others.
Now, what does ‘culture’ mean in the reliability context? In this article, you’ll get insights into what a generative on-call culture looks like and tips on how to reinforce it in your organization.
What is On-Call Culture and Why It Matters
All your friends are going on a fun weekend trip, but you can’t go because, well, you’re on call. If you end up not getting paged, you’ll feel frustrated about missing the trip. But getting an alert isn’t a very fun prospect either.
Being on call is a delicate subject because it disrupts the personal lives of your team. It means taking the workplace into your household, bringing it to brunch with your children, and sleeping with it next to your partner. That’s why caring about on-call culture is so important.
An on-call culture will form within your team, even if you don’t do anything about it. But what kind of culture are you working with? Will it drive results and keep your responders happy?
Westrum’s typology of organizational culture, a standard framework in industry research, recognizes three types of culture:
- Pathological (power-oriented): In this kind of culture, information is restricted or distorted before it is shared. Leaders prioritize their power over other teams. And if something goes wrong, someone will be blamed as a scapegoat.
- Bureaucratic (rule-oriented): Information and collaboration are controlled by policies, procedures, and formal structures. Blame is distributed according to roles and criteria.
- Generative (performance-oriented): Information flows freely and is available across the organization. Leaders prioritize outcomes over personal results and reflect on mistakes as a team.
Unsurprisingly, teams with a generative culture perform significantly better than pathological or bureaucratic ones: up to 30% better, according to Google. That sounds ideal, but it is not easy to foster a culture in which teams meet deadlines and adhere to security and compliance requirements while still pursuing their creativity and feeling psychologically safe.
What can you do, then, to improve the on-call culture in your team? We’ve compiled a few tips below from our conversations with SRE leaders in our Humans of Reliability series.
Key Elements of a Successful On-Call Culture
Improving the on-call culture in your team is not about pizza parties or TED talks. You’ll need to build an environment that invites people to perform at their best and communicate their challenges. Culture is not abstract; it’s made up of the ways you choose to communicate and operate.
1. Clear Communication
Westrum’s typology classifies organizational culture primarily by how information flows. Cultivating a generative on-call culture therefore starts with examining how your communication workflows function today.
Communication is not only at the core of culture. As Zameel Syed, VP of SRE at Aircall, argues, it is also at the center of incident response. Making sure your tools, processes, and even the general vibe make it easy to share insights with people at all levels of your organization is essential for both a generative culture and an effective incident response practice.
Best Practices for Clearer On-Call Communication:
- Centralize communication on a single channel: When you’re on call and get an alert, you don’t want to figure out where to report what you’re doing or request backup. Many teams choose to set up their incident response workflow in Slack (through Rootly) so there’s always a clear communication channel.
- Establish clear protocols and codify them in a platform: Incidents are high-stress events that often involve checking a plethora of systems and parameters. If each person has to figure out the steps every time something pops up, errors are bound to happen. Instead, you can use an incident management tool to define runbooks per incident type, severity, and team. Tools like Rootly can guide your responder through the actions they need to perform in each case, without overwhelming them with information that doesn’t apply to their incident.
- Automate as much as possible: Ultimately, the responder wants to resolve the incident as soon as possible so they can go back to sleep. Having your on-call person fill in forms or write summaries is not the best use of their time. Require as few fields as possible and offload summaries to GenAI so you can keep as many people in the loop as needed without burdening your on-call team.
- Schedule frequent check-ins: Perhaps Adam is upset because he had to skip his mom’s birthday due to an incident last weekend. It may sound trivial while you read this blog post, but emotional strain can lead to significant performance loss and eventually attrition if left unchecked. It’s best to keep in close contact with your responders, especially after a busy on-call period.
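To make the runbook idea above a bit more concrete, here is a minimal sketch of codified protocols: a lookup keyed by team, incident type, and severity, with a generic fallback. The team names, incident types, severities, and steps are all made-up placeholders, and this is not how Rootly or any other tool stores runbooks.

```python
# Hypothetical runbook registry keyed by (team, incident_type, severity).
# Every key and step below is an illustrative placeholder.
RUNBOOKS = {
    ("payments", "database", "sev1"): [
        "Page the secondary responder",
        "Fail over to the replica",
        "Post a status update in the incident channel",
    ],
    ("payments", "database", "sev3"): [
        "Check replication lag",
        "Open a follow-up ticket",
    ],
}

# Generic steps used when no specific runbook has been written yet.
DEFAULT_RUNBOOK = [
    "Acknowledge the page",
    "Announce the incident in the incident channel",
    "Assess severity",
]


def runbook_for(team, incident_type, severity):
    """Return the steps for this incident, or the generic fallback."""
    return RUNBOOKS.get((team, incident_type, severity), DEFAULT_RUNBOOK)


for step in runbook_for("payments", "database", "sev1"):
    print(f"- {step}")
```

The point of the fallback is that a responder always gets some guidance, even for a combination nobody anticipated.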
2. Fair Scheduling
Scheduling on-call rotations is one of the trickiest parts of running on-call because it can create tension within your team. Balancing shifts is delicate and must take both team preferences and business needs into account. To hit your SLOs, you’ll likely have to set up on-call rotations with 24/7 coverage.
A few common scheduling patterns, like follow-the-sun rotations and round-robin escalation policies, can help reduce the burden your on-call team takes on during non-business hours.
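As a rough illustration of those two patterns, the sketch below builds a simple round-robin rotation and a follow-the-sun lookup. The responder names, regions, and UTC hour windows are assumptions made up for the example; real scheduling tools handle time zones, overrides, and holidays that this ignores.

```python
from datetime import date, timedelta
from itertools import cycle


def round_robin(responders, start, shifts, shift_days=7):
    """Assign consecutive shifts to responders in a fixed, repeating order."""
    order = cycle(responders)
    return [
        (start + timedelta(days=i * shift_days), next(order))
        for i in range(shifts)
    ]


# Follow-the-sun: each region covers the UTC hours that fall in its daytime.
# The regions and windows here are hypothetical.
REGION_WINDOWS = {
    "APAC": range(0, 8),    # 00:00-07:59 UTC
    "EMEA": range(8, 16),   # 08:00-15:59 UTC
    "AMER": range(16, 24),  # 16:00-23:59 UTC
}


def region_on_call(utc_hour):
    """Return the region responsible for the given UTC hour (0-23)."""
    for region, hours in REGION_WINDOWS.items():
        if utc_hour in hours:
            return region
    raise ValueError(f"hour out of range: {utc_hour}")


schedule = round_robin(["ana", "ben", "chi"], date(2024, 1, 1), shifts=4)
for shift_start, responder in schedule:
    print(shift_start.isoformat(), responder)
```

With follow-the-sun, nobody is paged in the middle of their night as long as each region has enough responders to cover its own window.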
But even once you have a perfect schedule set up, distributing on-call rotations fairly is not a static affair. For example, you’ll need to update shifts after someone has gone through a particularly tiring one. Fair schedules also imply making on-call easier for your responders by keeping alert fatigue at bay.
Best Practices for Fairer On-Call Schedules:
- Rotate on-call duties fairly: Nights, weekends, and holidays are tougher shifts to cover than 9-to-5 on a weekday. Thus, a fair distribution of on-call time doesn’t mean an arithmetic division of the available shifts among your responders. As much as possible, try to accommodate their needs and context, which may include family circumstances or religious observance.
- Allow flexibility and coverage exchanges: Even if you manage to schedule the perfect rotations, your responders are humans and may run into, well, life. Modern on-call scheduling tools like Rootly On-Call let your responders ask colleagues for coverage in a few clicks, for example when they forgot about a dental appointment that falls during their shift.
- Fine-tune your alerting sources: Responders can easily develop alert fatigue after getting paged several times during their sleep. Finding the balance between not missing important alerts and not overwhelming your responders is an ever-evolving process for all SRE teams. Review your alert-to-incident ratio regularly.
- Establish per-shift limits: Some organizations set a hard limit on the number of incidents that a single person responds to during a shift, commonly a maximum of two. The reasoning is that addressing an incident is a considerable task, and after two, there is a good chance that a larger issue is affecting the system, one that requires broader coordination.
- Provide adequate rest periods: Day-to-day on-call shifts should always be followed by enough time off-call so your responders can recover. Make special arrangements to grant time off or care packages to people who had especially difficult shifts.
- Ensure backup support: It is common to have primary and secondary schedules. Your primary on-call responder may get into an accident, for example. Or, maybe a challenging incident pops up and the primary on-call responder needs help to tackle it.
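A few of the practices above can be expressed as small checks. The sketch below weighs shifts by how tough they are, computes an alert-to-incident ratio, and flags when a responder has hit the per-shift incident limit. The weights are invented for illustration; pick values your team agrees on.

```python
from collections import defaultdict

# Hypothetical burden weights: tougher shifts count for more than a weekday
# daytime shift, so "fair" is not a plain head count of shifts.
SHIFT_WEIGHTS = {
    "weekday_day": 1.0,
    "weekday_night": 1.5,
    "weekend": 2.0,
    "holiday": 3.0,
}

MAX_INCIDENTS_PER_SHIFT = 2  # the two-incident limit mentioned above


def weighted_load(assignments):
    """Sum shift weights per responder from (responder, shift_kind) pairs."""
    load = defaultdict(float)
    for responder, kind in assignments:
        load[responder] += SHIFT_WEIGHTS[kind]
    return dict(load)


def alert_to_incident_ratio(alerts, incidents):
    """Alerts fired per real incident; a rising value hints at noisy alerting."""
    if incidents == 0:
        return float("inf") if alerts else 0.0
    return alerts / incidents


def should_hand_off_next_incident(incidents_handled):
    """True once a responder has reached the per-shift incident limit."""
    return incidents_handled >= MAX_INCIDENTS_PER_SHIFT


load = weighted_load([
    ("ana", "weekend"),
    ("ana", "weekday_day"),
    ("ben", "holiday"),
])
print(load)
```

In this example Ana covers two shifts and Ben only one, yet their weighted load is equal, which is the kind of non-arithmetic fairness the list describes.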
3. Fair Compensation
A labor law requirement in some regions and a strategic choice for many organizations, compensating responders for on-call shifts has become more commonplace.
Given that on-call is a real workload placed on responders, it is reasonable for them to expect some form of compensation. There are different ways to compensate on-call responders, and not all of the options are cash.
- Additional pay or stipends: The simplest and most common compensation option is adding a fixed amount as compensation for being on call to the employment contract of responders. This additional pay is fixed no matter how much time a responder spends on call or resolving incidents throughout the year.
- On-call vs active resolution rates: Perhaps one of the most popular approaches to on-call compensation is paying a fraction of the equivalent salary ‘hourly rate’ for being on call, and a higher ‘hourly rate’ when the responder is actively engaged in solving an incident outside of business hours. Make sure your incident response tool helps you keep track of these times, like Rootly On-Call does, or you’ll end up dealing with ugly spreadsheets and drama.
- Compensatory time off: Whether at the discretion of the manager or by counting the time spent resolving incidents, some teams opt to offer time off instead of a monetary incentive. Some responders prefer this to extra cash.
- Different rates per shift: A more elaborate cash-based approach is to pay a specific amount per shift type, for example one rate for a weekday night shift and a higher one for a weekend night shift. This is difficult to implement and track, but for some teams it brings a sense of fairness to the effort they contribute while on call.
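To show how the rate-based options above translate into numbers, here is a small sketch of a standby-versus-active calculation and a flat per-shift calculation. Every rate, fraction, and multiplier is a made-up assumption, not a recommendation or an industry figure.

```python
from dataclasses import dataclass


@dataclass
class OnCallRates:
    hourly_rate: float               # equivalent salary hourly rate
    standby_fraction: float = 0.25   # hypothetical share paid while on call
    active_multiplier: float = 1.5   # hypothetical rate while resolving


def on_call_pay(rates, standby_hours, active_hours):
    """Pay for one period; standby_hours must exclude the active hours."""
    standby = standby_hours * rates.hourly_rate * rates.standby_fraction
    active = active_hours * rates.hourly_rate * rates.active_multiplier
    return round(standby + active, 2)


# Flat amounts per shift type (hypothetical values).
SHIFT_RATES = {"weekday_night": 50.0, "weekend_night": 90.0}


def per_shift_pay(shifts):
    """Total pay for a list of shift types under the flat per-shift model."""
    return sum(SHIFT_RATES[kind] for kind in shifts)


rates = OnCallRates(hourly_rate=40.0)
print(on_call_pay(rates, standby_hours=60, active_hours=3))
print(per_shift_pay(["weekday_night", "weekday_night", "weekend_night"]))
```

Either model only works if standby and active time are tracked reliably, which is why the tracking point above matters.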
4. Adequate Incident Response Tools
A good on-call and incident response solution is invisible. Your responders don’t think about it or wonder whether the tool is acting weird or has a glitch. Modern on-call and incident response tools like Rootly are intuitive and act within your existing workflows.
Best Practices for On-Call Tools:
- Choose a tool that reduces cognitive load: The last thing a responder needs is to learn the intricacies of yet another tool. Legacy tools like PagerDuty often require full-blown training to use, but wasting mental cycles on an on-call tool is not ideal for any SRE team. Instead, go for tools that are easy to use but that have advanced options available when you need them.
- Streamline on-call and incident response workflows: Traditionally, an alerting solution like PagerDuty or Opsgenie sends you a push notification at 3 a.m. and leaves you in the dark. Modern tools like Rootly instead page you with relevant information and let you start the incident resolution process right away.
5. Comprehensive Training
Being on call and resolving incidents are skills that need to be learned. In a generative culture, everyone is encouraged to learn more about the field and to challenge the team’s existing way of working.
Best Practices for On-Call Training:
- Schedule regular shadow rotations: Shadowing is not only for beginners. Quarterly on-call shadowing can help teammates learn from each other when addressing an incident and more easily see areas of improvement in the incident response process.
- Provide learning resources: The field of SRE is constantly evolving. We’re lucky to work in a field with many smart people who are willing to share their insights through books and other resources. We’ve compiled a list of the best SRE resources in 2024 so far, in case you want to check it out.
- Engage in simulated tests: It may feel silly at first, but the idea is to test whether the process you drafted who knows when still works with the way your organization runs today. Perhaps you’ll notice that the scenarios you outlined can no longer be solved as you envisioned, or that they relied on a role that no longer exists.
- Provide a learning time budget: Sure, your organization likely offers a symbolic monetary budget to buy books and courses. But if you really want your teams to get better by learning, you’ll have to give them the time to do it.
6. Supportive Environment
Incident resolution is a touchy subject. You may mitigate an issue today, but ultimately you’ll want to find the root cause, which can often be traced to a specific change and its author. Blaming an individual for an incident is a huge attack on everyone’s psychological safety. Generative on-call cultures don’t point fingers; they reflect as a team on how to prevent the issue from happening again.
However, a supportive environment goes way deeper than a blameless incident retrospective.
Best Practices to Develop a Supportive Environment:
- Recognize the efforts your responders make: Being on call implies renouncing part of their rest, and even if it’s somehow compensated, it’s still a big deal on a personal level. Do not take their efforts for granted, and make sure they know how much you and the company value their contributions.
- Encourage feedback: Make an effort to be an open person to whom any of your responders can come and complain about anything, from the shifts they got this month to the faulty alert deduplication a colleague implemented.
- Provide mental health support: Talk with HR to ensure your responders have the resources they need, like access to quality mental health professionals.
Being On-Call Doesn’t Have to Suck
Building a good on-call culture can help your team perform at its best and improve your overall reliability together. It’s important to put special focus on the human aspect of being on call, acknowledging that what you’re asking from your team is significant.
Provide all the tools and setup they need to excel while on call, to make it as easy as possible for them. Hundreds of companies like LinkedIn, Canva, and NVIDIA trust Rootly to support their on-call and incident response culture. Book a demo with one of our reliability experts to learn how Rootly can help you drive a generative culture.