Submitted URL: https://bit.ly/3PaK0e4
Effective URL: https://www.squadcast.com/sre-best-practices/runbook-template?utm_source=LinkedIn-group&utm_medium=comment
Submission: On March 07 via manual from PL — Scanned from PL
Chapter 5: RUNBOOK TEMPLATE: BEST PRACTICES & EXAMPLE
December 27, 2022 | 15 min

Chapters
Introduction: SRE Best Practices
Chapter 1: SLA vs SLO
Chapter 2: Reliability vs Availability
Chapter 3: DevOps vs SRE
Chapter 4: O11y
Chapter 5: Runbook Template
Chapter 6: Microservices Security
Chapter 7: On-Call Rotation
Chapter 8: Canary Deployment
Chapter 9: Golden Signals
Chapter 10: SRE Tools
Chapter 11: Runbook Automation

Fundamentally, a runbook is a set of instructions that, when followed precisely, result in a system producing a specific outcome or reaching a desired state. For example, a runbook can define a process to restore a network device to a working state.

As modern IT infrastructures continue to grow in complexity and scale, triaging potential incidents becomes more and more time-consuming. Runbooks help reduce mean time to resolve (MTTR) by providing engineers with a proven recovery path, and automation helps scale the benefits.

A platform-agnostic runbook template provides process stability and reliability, and an automation strategy can provide the confidence and repeatability needed to recover quickly. This article takes a deep dive into runbook templates and helps you bring some order to the chaos of a potential disaster scenario.

RUNBOOK TEMPLATE BASICS

There are some basic details you should include in a well-structured runbook template. The goal is to be complete but concise. That means a runbook should give readers all the detail and context needed to complete the task without overwhelming them or overloading the document with unnecessary details that can become confusing.
The table below details the key components of a quality runbook template.

Runbook template components

Runbook component | Description | Example
Task ID | This is usually a reference and a link to the ticket created in the organization's project management system or incident board (Jira, Asana, Trello). It tells the reader where to search for more information and where to log any details pertaining to the runbook's execution. | INC-101
Task Name | A quick description of the task (2 to 3 words). | Employee Offboarding
Task Description | A longer description of the task. This doesn't need to go into too much detail and should not specify how the task should be performed at a technical level. | Employee has been dismissed for misconduct and needs to be removed from all relevant systems.
Task Details | Steps required to execute this task. This is the core of the runbook. Each step should be outlined in a simple format: describe the required action, the reason for the action, and, if required, how to validate and/or troubleshoot the step. | Step 1. Power on the machine. Step 2. Input credentials. … Step n. Power off the machine.
Team executing this task | Team responsible for this task. | DevOps
Task Owner | Team member responsible for executing the task or coordinating the team. | Alice@example.com
Time to complete this task | Particularly useful when performing a task which will affect production systems. An expected value should be provided, along with an actual value once the action has been completed. | Estimated time: 10 - 20 minutes. Started: 11/11/22 11:00:00. Completed: 11/11/22 11:11:00
Status | A status provides all stakeholders insight into the issue or task in question. | ASSIGNED, IN_PROGRESS, BLOCKED, or COMPLETE

TRIGGERING A RUNBOOK

The first iteration, or the first few iterations of a runbook, will likely be triggered by a manual process.
For example, tasks to recover a website that has crashed or offboard an employee for HR should be performed by a human before being automated.

As the process improves, the runbook may be triggered via an API or ticketing system. Cloud monitoring solutions like Amazon CloudWatch are good examples of services that can detect issues in a production system, highlight them using interactive graphs, and even trigger an automated response. As the runbook evolves, the automated response can start to handle some of the responsibilities of the engineer in charge of executing the runbook, and may eventually automate the process in its entirety.

Of course, a monitoring solution need not be tied to a particular technology or provider. Custom monitoring solutions require more effort but can be just as effective. These solutions can be as basic as hooking up an out-of-the-box graphing solution like Grafana to a MySQL database or as complex as a Python script that builds an entire secondary region architecture and tweets when it is complete.

A RUNBOOK EXAMPLE

As an example, we'll use the case of an employee whose contract has been terminated for misconduct. The company has outlined the steps IT should take once they receive the email notifying them of the termination. This set of steps is essentially a runbook. The job of the IT team is to document this process and provide instructions clear enough to produce a repeatable and reliable result.

Task ID | ACME-INC-108
Task Name | Employee Offboarding - Elmer Fudd
Task Description | Employee has been dismissed for misconduct. Any active credentials need to be revoked, the user needs to be offboarded from all internal systems, and recent activity needs to be reviewed.
Task Details | For full details, see the instructions below.
• Disable user account from the Acme management portal
• Remove from GitHub
• Revoke AWS keys
• Download activity logs from the Acme management portal
• Download activity logs from AWS
• Store activity logs
• Audit activity log
Team executing this task | DevSecOps
Task Owner | joe.bloggs@acme.com
Time to complete this task | Estimated time: 40 - 60 minutes. Started: 01/11/22 14:20:00. Completed: TBD
Status | IN_PROGRESS

1. Disable their user account for the internal system - The former employee has access to the internal sales and marketing system. Their credentials should be expired and/or their account deleted so they can no longer access confidential information.
2. Disable their GitHub account - The former employee is part of the company's GitHub organization. They should be removed from the organization as soon as possible so they can no longer access intellectual property like source code.
3. Disable their AWS keys - The former employee has access to the AWS system because they required database access from time to time. Their AWS keys should be revoked so that they can no longer access the AWS infrastructure.
4. Download their usage logs from both AWS and the internal system - To ensure no malicious actions were carried out in their final days, the company would like insight into what the former employee was doing during that period. This includes AWS CloudTrail logs, which show their activity on AWS, and the activity logs from the internal system, which show what data they accessed or modified before leaving.
5. Store their usage logs in S3 - The data gathered in the previous step should be stored in S3 so it can be easily reviewed and any findings from the review can be validated at a later date.
6. Investigate/audit usage logs - Finally, once the data has been stored in S3, it should be reviewed for malicious or suspicious activity.
This could include accessing or modifying resources not usually associated with the employee's role. Unusual log-in times can also indicate suspicious activity.

Given these requirements, the IT team is charged with going through the process in detail and documenting the actions required to accomplish each objective. Their deliverable is a well-documented, minimal set of easily reproducible steps to be added to the Task Details section of the runbook.

AUTOMATING THE RUNBOOK

This simple example of a runbook requirement may seem trivial, but even a small mistake in executing the actions could lead to disastrous results. There's a reason the phrase "I'm only human" is so common: humans make mistakes. That inevitability should be taken into account when creating runbook steps. Screenshots or diagrams accompanying complex instructions can help, but automating the task is ultimately the most reliable way to ensure a predictable result.

Let's go step-by-step through the example above and see how such a process could be automated using a script or set of scripts. The task details would then become a lot simpler: they would point the reader to the script(s) to run, explain how to run them, and advise how to validate their success.

1. Disable their user account for the internal system - Many modern web applications expose a REST API that simple scripts can invoke to trigger most of the actions available through the frontend user interface, and potentially more. The start of our automated solution would involve a call to an API to disable the user account.
2. Disable their GitHub account - GitHub is an example of a web application with such an API. Similar to step one, our script can make a call to the GitHub API to remove the user from the company organization.
3. Disable their AWS keys - Automation is a huge part of the AWS ecosystem. AWS provides an API that can be used through software development kits (SDKs) written in multiple languages, as well as a command line interface (CLI) that can perform almost any action offered by the various AWS services. We can use the API or CLI in our script to revoke the user's keys programmatically.
4. Download their usage logs from both AWS and the internal system - This step is simply two more API calls. First, we can query AWS CloudTrail to download the AWS user logs. Then we can invoke our internal system's API to download any relevant user activity.
5. Store their usage logs in S3 - Again, this is a simple API call. The AWS SDK and the AWS CLI allow you to copy files to and from Amazon Simple Storage Service (S3).
6. Investigate/audit usage logs - This can be done in a variety of ways, but a simple script that searches the logs for certain words or patterns can quickly detect unusual activity. As the runbook evolves, this script may also evolve and do things such as link into custom machine learning services that can learn and detect suspicious patterns.

[Figure: A script that can automate our example runbook.]

RECOMMENDATIONS FOR DESIGNING A RUNBOOK TEMPLATE

Of course, our runbook is not perfect and may take time to reach an ideal state. In fact, it may never reach a final state and simply continue to adapt and evolve. Below are some runbook template recommendations that can help you get the most out of your runbooks.

Don't try to automate everything on day one
Attempting to script every step from day one can lead to confusion and even mistakes. It's important to perform the task manually at least once to fully understand and explain the process being automated.

Document clearly
A picture truly paints a thousand words.
Use screenshots and diagrams so that a reader can follow along with the process and be confident that everything is executing as expected.

Remember to validate
Once the runbook has been followed, you should validate that the system is in the desired state. In some cases, this can be a single check. In other cases, validation may be necessary on a step-by-step basis. Validation steps should be included with the runbook steps.

Know how much automation is too much
Think about the consequences of automating the runbook. Sometimes it may be wise to require some manual intervention, even if it's just to trigger the automation process. For example, a temporary network blip may trigger a response that spins up a production infrastructure in a secondary region and switches all production traffic to that region. In a time-consuming, expensive, and customer-impacting case like this, it may make sense for a human to first decide whether the blip is cause for concern or whether service has returned to normal in an acceptable time frame.

CONCLUSION

Runbooks are invaluable to a growing enterprise. As a solution grows, it is inevitable that things will sometimes go wrong. Using a quality runbook template can bring order to the chaos of solution engineering. By following a familiar structure, the runbook reader can put aside the stress of reinventing the wheel, overengineering a solution, or preparing a business-ready document. With a runbook in place, all they need to do is follow the steps.

Further, a structured format opens up the possibility of process automation. As recommended steps become more refined and reliable, they become easier to automate, either via third-party solutions or custom scripts. Some companies have begun to realize the importance of establishing this structure quickly and have built runbook solutions that provide out-of-the-box runbook templates, which can save organizations months of trial and error.
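To make the structure described in this article concrete, here is a minimal Python sketch of a runbook template whose steps run in order, with a validation check after each one and a status that follows the ASSIGNED, IN_PROGRESS, BLOCKED, COMPLETE values from the template. It is an illustration under stated assumptions, not the script the article had in mind: the `Runbook` and `Step` classes, the `accounts` dictionary, and the two step functions are hypothetical stand-ins, and a real version would replace them with calls to the internal portal, GitHub, and AWS.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Status(Enum):
    ASSIGNED = "ASSIGNED"
    IN_PROGRESS = "IN_PROGRESS"
    BLOCKED = "BLOCKED"
    COMPLETE = "COMPLETE"


@dataclass
class Step:
    description: str              # what the step does
    action: Callable[[], None]    # performs the step
    validate: Callable[[], bool]  # confirms the desired state was reached


@dataclass
class Runbook:
    task_id: str
    task_name: str
    task_description: str
    team: str
    owner: str
    steps: List[Step] = field(default_factory=list)
    status: Status = Status.ASSIGNED
    completed_steps: List[str] = field(default_factory=list)

    def run(self) -> Status:
        """Run steps in order; stop and mark BLOCKED on the first failed validation."""
        self.status = Status.IN_PROGRESS
        for step in self.steps:
            step.action()
            if not step.validate():
                self.status = Status.BLOCKED
                return self.status
            self.completed_steps.append(step.description)
        self.status = Status.COMPLETE
        return self.status


# Hypothetical in-memory stand-in for the real systems (portal, GitHub).
accounts: Dict[str, Dict[str, bool]] = {
    "elmer.fudd": {"portal_active": True, "github_member": True},
}


def disable_portal_account() -> None:
    # Assumption: a real step would call the internal portal's API here.
    accounts["elmer.fudd"]["portal_active"] = False


def remove_from_github() -> None:
    # Assumption: a real step would call the GitHub API here.
    accounts["elmer.fudd"]["github_member"] = False


runbook = Runbook(
    task_id="ACME-INC-108",
    task_name="Employee Offboarding - Elmer Fudd",
    task_description="Revoke credentials and offboard the user from internal systems.",
    team="DevSecOps",
    owner="joe.bloggs@acme.com",
    steps=[
        Step("Disable user account in the Acme management portal",
             disable_portal_account,
             lambda: not accounts["elmer.fudd"]["portal_active"]),
        Step("Remove from GitHub",
             remove_from_github,
             lambda: not accounts["elmer.fudd"]["github_member"]),
    ],
)

final_status = runbook.run()
print(final_status.value)  # prints "COMPLETE"
```

Pairing every action with its own validation follows the "Remember to validate" recommendation: if a check fails, the run stops with a BLOCKED status and `completed_steps` shows how far it got, which gives a human a clear point at which to step in.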
Produced in partnership with Inbound Square
Copyright © Squadcast Inc. 2017-2024