"Our goal is to make it easy for employees to come in and run an incident without needing deep technical knowledge about the system. Rootly has made this easier by allowing us to automate a lot of the “hand-holding" someone needs when they’re first navigating an incident."
Chris Inch
Director of Engineering
Wealthsimple is one of Canada’s fastest growing and most trusted financial services companies offering a full suite of simple, sophisticated financial products across managed investing, do-it-yourself trading, cryptocurrency, tax filing, spending and saving. With offices in Toronto and New York City, and a large remote workforce of over 1,000 employees, Wealthsimple is on a mission to help everyone achieve financial freedom.
Technologies: Ruby, JavaScript, React, Java, GraphQL, PostgreSQL
DevOps: GitHub, Docker, Datadog, PagerDuty, AWS
Business Tools: Slack, Jira
Early in 2021, there were some notable events in the financial world – like the rise of meme stocks – which resulted in high-traffic events that put a lot of stress on our systems. We knew we needed to double down on our incident management to ensure we’re always reliable and available to our clients, especially during major market events. We were (and still are) using PagerDuty for alerting and paging people, but we required a super-powerful, feature-rich solution for managing incidents. A lot of incident management was taking place in ad hoc Slack channels, and we wanted a strong source of truth for every incident, and potential incident, that we encountered.
We started building some automation around our manual processes, but knowing we had to formalize an automation approach, we found ourselves in the classic build vs. buy dilemma. We considered continuing to build out our internal tools, but ultimately realized it would require significant time and investment to reach the standard that we wanted, and even more time to maintain it.
We required a fairly detailed matrix of features, so we explored a few potential options. We considered leveraging PagerDuty even further, or using the incident response features of other tooling we already had, like Datadog, but those didn’t meet our needs. We started focusing more on dedicated incident management solutions, like Rootly, FireHydrant, Incident.io, and a few more. In the end, Rootly was the option that best met our needs while providing the flexibility we needed as a high-growth company.
We’re continuously growing and innovating, and we want our incident management tooling and process to grow with us, while still serving our immediate-term needs. Rootly offered that flexibility from the get-go. We also wanted to refine our overall approach to incident management, and we had several conversations about this with Rootly very early on. We want to be highly effective in how we respond to incidents, while also prioritizing the culture around incident response, and Rootly has really been able to support us with that in a positive way. The team is responsive and knowledgeable in the domain, so we can tap into that expertise to help guide our processes as we evolve.
For the long-term success of the company and the individuals who work here, we put employee wellness at the forefront of our approach. It would be really easy to tap the same capable people across the company as our ongoing incident responders, given their repeated ability to resolve incidents quickly. However, we know that’s not fulfilling for our employees, nor does it support their growth and development.
Our goal is to make it easy for employees to come in and run an incident without needing deep technical knowledge about the system. Rootly has made this easier by allowing us to automate a lot of the “hand-holding” someone needs when they’re first navigating an incident. Incidents are a great learning opportunity and collaborative experience, so knowing Rootly automation is there to guide and prompt the team throughout offers us a lot of confidence.
Prior to Rootly, we relied on a lot of training materials to set incident commanders up for success. As we implemented Rootly, we were able to take that time-consuming training and build it right into our incident response process. Instead of sitting through hours of training videos and Google Meets, now our new commanders can learn right in Slack. We set up Rootly’s /incident test command to create our own built-in interactive tutorial to guide people through the process of running an incident. We also encourage everyone to do a test incident when their on-call shift starts so they can feel confident and ready if a real incident happens. We love that Rootly’s flexible configuration makes it possible for us to get creative like this.
Because Rootly makes it so easy to get set up as an incident commander now, we’re able to have a large team of about 60 or 70 people available, and give them very short shifts once every couple of months. This really minimizes the burden of being on call and supports employee wellbeing.
Alongside incident commanders, our Client Success team is always involved in any incident that impacts our customers. Having a tool that increases visibility and automatically reminds us to do things like update the status page (which, thanks to Rootly, we can do right in Slack) helps ensure the needs of our frontline teams are being prioritized. When there are demanding or complex technical problems, it can be burdensome for incident commanders and subject matter experts to keep non-technical responders updated. Client Success is the team responsible for keeping our most important stakeholders, our customers, informed, so we want to do everything we can to help them manage those conversations with confidence, and Rootly’s automations help us do that.