Podcast: Break Things on Purpose with Gremlin | Building Rootly with JJ Tang

Our co-founder JJ reflects on building the fastest-growing incident management platform and the surprising lessons learned along the way.

JJ Tang

April 24, 2022
9 min read
What Does AIOps Mean for SREs? It’s Complicated.

AIOps can bring some value to SREs, but it’s important to maintain a healthy perspective on its limitations.

JJ Tang

March 11, 2022
4 min read
Importance of Good Incident Communication

From the initial alert, through the incident itself, to the post-incident follow-up, great communication is the key to effective incident response.

JJ Tang

February 4, 2022
6 min read
A Primer on the History and Evolution of Incident Management to Today

Many of the concepts SREs take for granted about incident management originated with efforts to fight fires in California in the 1970s.

JJ Tang

January 21, 2022
4 min read
A Site Reliability Engineer’s Guide to the Holiday Season

SREs face special challenges during the holidays. Here’s how to manage them.

JJ Tang

December 17, 2021
4 min read
Who Needs Site Reliability Engineers (SREs)?

Although every company can benefit from SREs, some need SREs more than others.

JJ Tang

December 3, 2021
4 min read
History of SRE: Why Google Invented the SRE Role

A history of Site Reliability Engineering from its origins at Google in 2003 to the present.

JJ Tang

November 19, 2021
5 min read
SLA vs. SLO vs. SLI: Understanding the Similarities and Differences

An explanation of the meaning of SLA, SLO and SLI, and how SREs should use each concept to manage reliability.

JJ Tang

November 5, 2021
4 min read
An Introduction to Incident Response Roles

Learn about the key roles within an incident response team, as well as optional incident roles you may not have thought about.

JJ Tang

October 22, 2021
5 min read
What SREs Can Learn from Facebook’s Largest Outage

An SRE’s analysis of the October 2021 Facebook outage.

JJ Tang

October 8, 2021
5 min read
What is an SRE?

A comprehensive definition of SREs and Site Reliability Engineering, including what SREs do and what makes SREs different from other roles.

JJ Tang

September 9, 2021
5 min read
Making Your On-call and Incident Management Program Stick

Maintaining your incident management practice is as important as creating it. Find out what you can do to keep your engineering organization strong and consistent year over year.

JJ Tang

August 20, 2021
5 min read