
Ensuring reliability: SLOs, on-call process, and postmortems

Posted on Jun 7

Hello there, my name is Pavel Pritchin, and I'm CTO at Dodo Engineering, part of Dodo Brands. My previous role was Head of SRE, and since then the reliability of our IT system, Dodo IS, has been one of my responsibilities. Today I'd like to share the practices that help us ensure the stability of our system, along with some templates that anyone can use at their company.

Dodo Brands is a franchise business, and the IT system developed by the Dodo Engineering team is provided as software as a service to partners. The head company covers the cost of development and maintains the stability of the Dodo information system (Dodo IS). To ensure system reliability, we introduced the Service Level concept and set Service Level Objectives (SLOs). There are also processes in place to maintain reliability.

Our SLOs:

Different teams can add their own custom stability goals. For example, a team may have a target value for release frequency.

It's not enough to set an objective; it must also be maintained. For example, the SLO for errors has an overall target of 99.9% for the entire system.

Dodo IS' services status dashboard:

The critical services have established SLOs, typically 99.99%. Each service has an owner team whose task is to monitor and fix stability issues. The screenshot shows a summary dashboard that lets the service owner team record actions to fix problems with their service. Even fixing minor deviations from the target value helps maintain the overall stability of Dodo IS.
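To make the arithmetic behind an error-based SLO concrete, here is a minimal sketch: availability is the share of successful requests over a reporting window, and the error budget is whatever the target leaves over. The numbers are placeholders, and this is only an illustration, not the code behind our dashboard.

```python
# Illustrative only: the arithmetic behind an error-based SLO,
# not the actual code behind the Dodo IS dashboard.

SLO_TARGET = 0.999           # 99.9% overall target for errors
total_requests = 10_000_000  # placeholder numbers for one reporting window
failed_requests = 7_200

availability = 1 - failed_requests / total_requests    # 0.99928
error_budget = (1 - SLO_TARGET) * total_requests       # 10,000 allowed failures
budget_spent = failed_requests / error_budget          # 0.72 -> 72% of the budget used

print(f"availability: {availability:.5f}")
print(f"error budget spent: {budget_spent:.0%}")
print("SLO met" if availability >= SLO_TARGET else "SLO violated")
```

A service owner team watching the dashboard is essentially watching how fast this budget is being spent.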
To maintain the service level, it's not enough to analyze problems after the fact. It's also necessary to respond quickly to service failures. This is where the on-call process comes in.

Every development team is responsible for the services in its domain. Duty rotations include separate shifts during work hours, non-working hours, and weekends or holidays. For critical services, we implemented 24/7 duty shifts.

All system services are divided into three levels of criticality, called pools: A, B, and C. Outside of business hours, we have an escalation system for services in pools A and B; pool C includes all the remaining services. Each pool has its own target MTTA (mean time to acknowledge), SLO, and compensation coefficients. We watch every service and are ready to fix any issue.

Each on-call engineer goes through workshops and other training and has deep knowledge of the services they work with every day. In case of a service failure, the on-call engineer should start looking into the problem within 5 minutes, and the incident management pipeline then handles the situation. For their work on incidents and for being on call for critical services, on-call engineers receive various compensations.

Compensations and target metrics for on-call engineers in different on-call pools:
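To make the pool structure more concrete, here is a minimal sketch of how pools and their targets could be represented in code. The class and the numbers below are placeholders for illustration; they are not our actual MTTA targets or compensation coefficients.

```python
# Illustrative sketch only: placeholder values, not Dodo's actual targets.
from dataclasses import dataclass


@dataclass
class OnCallPool:
    name: str
    target_mtta_minutes: int        # how fast the on-call engineer must start looking into a failure
    target_slo: float               # availability target for services in this pool
    compensation_coefficient: float
    escalation_outside_hours: bool  # pools A and B escalate outside business hours


POOLS = {
    "A": OnCallPool("A", target_mtta_minutes=5, target_slo=0.9999,
                    compensation_coefficient=2.0, escalation_outside_hours=True),
    "B": OnCallPool("B", target_mtta_minutes=5, target_slo=0.9999,
                    compensation_coefficient=1.5, escalation_outside_hours=True),
    "C": OnCallPool("C", target_mtta_minutes=30, target_slo=0.999,
                    compensation_coefficient=1.0, escalation_outside_hours=False),
}


def needs_night_escalation(pool_name: str) -> bool:
    """Only pools A and B are escalated to on-call engineers outside business hours."""
    return POOLS[pool_name].escalation_outside_hours
```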
Let's see how it works at night or on weekends (the right-hand part of the scheme). The monitoring signal comes to the 1st support line. The 1st line decides whether the failure is critical or whether it's just minor fluctuations or a flapping alert. If the issue is severe, the 1st line escalates it to one of the 8 on-call engineers for pools A and B. The on-call engineer can call others on duty at that moment, or reach people in another pool if there are problems in related systems and services. We also developed an escalation system for cases when no one answers, so we can still find engineers to fix the issue as quickly as possible.

Incident management with the on-call process:

But once again: it's not enough to just find a capable engineer, find the reason for the incident, and fix the problem. If we don't analyze it and don't create and follow a plan to eliminate the root cause, the incident may happen again and again, bringing new problems to our business. That's why we use postmortems.

After every incident, a postmortem review takes place. The practice of postmortems is used to identify the root cause of the problem. It's essential to identify systemic issues in the architecture and design of services and fix them. Without this, we cannot maintain the Service Level, because the number of problems will increase over time.

One of the main difficulties in working with postmortems is conducting a qualitative analysis of what happened. For this, the structure of the template is essential. The template should lead us from describing the facts to concrete decisions. Helping questions like "What helped us during the incident," "What went wrong," or "What went like clockwork" should push for deep insights. You also need general information common to all failures: date, downtime duration, business loss in money, and affected services. This general information allows you to do a meta-analysis and look at trends and tendencies.

Here is the postmortem template we use at Dodo after every incident: https://www.notion.so/dodobrands/Delivery-driver-app-doesn-t-work-Network-failure-2c315f993e324dddb9c37cd41ae1d291?pvs=4

You can also learn the best practices of postmortem review from the authors of the practice in Google's SRE book, just as we did: https://sre.google/sre-book/postmortem-culture/
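For illustration, here is a rough sketch of the structure the template captures, with the general information and helping questions mentioned above as fields. It is not the actual template; use the Notion link above for that.

```python
# Rough illustration of the postmortem structure described above,
# not the actual Dodo template (see the Notion link for that).
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Postmortem:
    # General information, common to all failures - this is what makes
    # meta-analysis of trends and tendencies possible.
    incident_date: date
    downtime: timedelta
    business_loss_money: float
    affected_services: list[str]

    # Helping questions that push towards deep insights.
    what_went_wrong: list[str] = field(default_factory=list)
    what_helped_us: list[str] = field(default_factory=list)
    what_went_like_clockwork: list[str] = field(default_factory=list)

    # The point of the exercise: concrete decisions against the root cause.
    root_cause: str = ""
    action_items: list[str] = field(default_factory=list)


# Placeholder example, not real incident data.
pm = Postmortem(
    incident_date=date(2023, 6, 1),
    downtime=timedelta(minutes=40),
    business_loss_money=0.0,
    affected_services=["delivery-driver-app"],
)
```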



As a result, stability support works as a combination of these processes:

Thus, we can guarantee the specified level of stability for Dodo IS.

If you have any questions about how we work with SLOs, on-call, and postmortems, feel free to contact me in the comments or directly: we at Dodo Engineering are always happy to share our experience!

To discover more about Dodo IS and top QSR innovations at Dodo Brands, follow us on LinkedIn, dev.to, and Medium!
