In this episode of Humans of Reliability, Sylvain talks with Dileshni Jayasingha, VP of Technology at commonsku, about building incident management at a company with strong uptime but no formal process. Dileshni shares how commonsku introduced incident tooling, practiced communication internally before going public, and used AI and observability to empower non-engineering teams while cutting down unplanned engineering work.
Key Topics Discussed
- Introducing formal incident management at a mature, profitable SaaS company
- Why commonsku rolled out an internal status page before going external
- Adapting incident updates and postmortems for non-technical audiences
- Using AI to improve incident communication
- Giving customer success teams access to observability and production data
- Reducing unplanned engineering work through better tooling and trust
- Guardrails, reviews, and enabling non-engineers to contribute safely
Key Quotes
“Incident management is a muscle. You need to build it before you actually need it.”
“If you can’t explain a massive pull request, you probably shouldn’t be asking someone to review it.”
In Dileshni’s Words
You recently joined commonsku. What did you find when it came to incident management?
I started at commonsku about eight months ago. It’s an order and workflow management platform in the promotional products industry, which was new for me — I came from the incident management world.
commonsku has been around for almost 14 years. It’s a successful, profitable company with really good uptime — about three nines of availability. But there wasn’t a formal incident management process in place. They hadn’t really planned for the worst-case scenario where you’d want everything written down and clearly defined. That’s one of the main things I’ve been working on since joining.
What did incident response look like before you introduced formal tooling?
We had standard observability and monitoring tools, but capturing incidents, paging people, and doing root cause analysis was all manual.
If something went wrong, someone would call around or try to reach senior engineers directly. It worked, but it was very manual — and that’s where having a proper incident management platform makes a big difference.
What was the first step you took to change this?
The first thing I did was explain why we need a formal incident management process — not because things are bad, but to prepare for the worst.
We’re a very customer-centric company, so communication during incidents really matters. Customers shouldn’t be left wondering what’s going on.
We also made observability tooling accessible beyond engineering. Before, only engineers had access. I wanted anyone in the company to be able to check what was happening if a customer raised a concern. Then we provided education — for engineers, customer success, and anyone interested.
Why did you start with an internal status page before going public?
I wanted teams — not just engineering, but customer success as well — to build the muscle of communicating clearly and consistently during incidents.
Coming from PagerDuty, I was used to very technical updates. But this industry is different. When I wrote our first status update, our CEO told me it was too technical. That was really valuable feedback.
The internal status page gave us a safe place to practice cadence, tone, and templates before exposing that communication publicly.
How did you adapt postmortems for non-technical customers?
After our first significant incident, we shared a public status page and followed up with a post-incident summary.
That’s when we realized the write-up needed a brand voice and less technical detail. My senior manager and I iterated on it several times until we felt it clearly explained what happened, what we’d change, and how we’d prevent it in the future — without overwhelming customers.
Now we have a baseline template we can reuse going forward.
How are you using AI in this process?
We actually use ChatGPT to help with post-incident communication.
One really useful approach is defining personas. Some customers want deep technical detail — CPU issues, automation changes, specifics. Others just want the high-level story. Using AI helps us tailor messaging for different audiences, and we can reuse those prompts over time.
Why didn’t you automate status page updates right away?
When you’re early in formal incident management, manual updates are important.
You need to understand what information is ready to share, what’s still unclear, and how incidents evolve. Starting with automation right away would make me nervous.
The ideal state is automation, but only after teams build the habit and judgment to know what should be communicated and when.
Why did you give observability access to customer success teams?
Customer success is on the front line with customers. Access to tools like Datadog gives them autonomy.
Instead of engineering becoming a bottleneck, CS can answer many questions themselves. We’ve seen cases where a support rep investigated a login issue, drilled into Datadog logs, and replied to the customer without needing engineering at all. That’s a win for everyone.
How did customer success react to getting access to these tools?
They were really excited and ramped up quickly.
We gave them read access to Datadog and even to our read replica database. We showed them how to use tools like Claude to help write queries. They embraced it.
Framing it as upskilling rather than extra work made a big difference, especially in today’s job market.
How did you roll this out in practice?
We ran hands-on workshops tied directly to real support tickets.
First, we showed basic Datadog views — latency, error rates, traffic. Then we took actual customer issues and worked through them together using the tools.
We did the same with database access: real tickets, real queries. I also shared learning resources — videos, courses, documentation — so people could choose what worked best for them.
How did engineering feel about this change?
They were very supportive.
They’re happy not to be interrupted for routine queries anymore. Engineers even started documenting common queries to help others. Overall, it’s been surprisingly positive across the board.
What’s the long-term vision for this approach?
The end state is non-engineers feeling confident enough to make small pull requests.
Every team has a backlog of small UI or workflow changes that never get prioritized. If non-devs can safely make those changes — with reviews and guardrails — that’s powerful.
We’re slowly working toward that.
What guardrails make this safe?
Small changes, strong test coverage, and fast rollbacks.
We continuously ship small changes so rollback is easy. We rely on automated tests to catch issues. And a human still reviews code before it reaches production — that’s not going away.
Guardrails matter more than gatekeeping.
Final thought on code reviews and large PRs?
If you have a massive pull request and can’t explain it clearly, you probably shouldn’t be asking someone else to review it.