How They SRE

A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)

Introduction

How They SRE is a curated knowledge repository of the SRE best practices, tools, techniques, and culture adopted by leading technology and tech-savvy organizations.

Many organizations regularly share their best practices, tools, and techniques, and offer insight into their engineering culture, on public platforms such as engineering blogs, conferences, and meetups. This repository curates content from those sources.

Note to readers: Some of the articles, posts, videos, tools, and techniques in this list were published before 2015. Use such material with caution, as more recent advances in technology and practice may offer better alternatives and perspectives.

Topics

Site Reliability Engineering
Hiring and Building SRE teams
SRE Culture
DevOps
Monitoring & Observability
Alerting
Incident Response & Post-Mortem
On-Call
Testing in Production
Chaos Engineering
Automation
Performance

Organizations

Achievers

Blog Posts

Enter the Abattoir - Building 'à la carte' gitops tooling
Scaling Production Globally — The service mesh facelift (Part-1)
Scaling Production Globally - Solving observability problems for developers (Part-2)
Load Testing Kubernetes: Building a Framework (Part-1)
Load Testing Kubernetes: Resolving bottlenecks and improving performance (Part-2)

Airbnb

Blog Posts

Automated Incident Management Through Slack
Detecting Vulnerabilities With Vulnture
Alerting Framework at Airbnb
When The Cloud Gets Dark — How Amazon’s Outage Affected Airbnb
Intelligent Automation Platform: Empowering Conversational AI and Beyond at Airbnb
Production Secret Management at Airbnb
Automating Data Protection at Scale, Part 1
Automating Data Protection at Scale, Part 2
Automating Data Protection at Scale, Part 3

Algolia

Blog Posts

May 30 SSL incident
A Journey Into SRE

Alibaba Cloud

Blog Posts

Why Are the Top Internet Companies Choosing SRE over Traditional O&M?
Architecture and Practices of Bilibili's Real-time Platform

Asana

Blog Posts

How Asana uses Asana: Security incident response
How Asana ships stable web application releases
Analysis of recent downtime & what we’re doing to prevent future incidents
Developer environment: Achieving reliability by making it fast to reset

ASOS

Blog Posts

Playing the blame-less game
A day in the life of… Cat S (Head of Reliability Engineering)
An AKS Performance Journey: Part 1 — Sizing Everything Up
An AKS Performance Journey: Part 2 — Networking It Out
Cyber Security @ ASOS.com
Security Operations 24x7
The skills we look for in Cyber Security Incident Response

Atlassian

Blog Posts

Best practices for change management in the age of DevOps
Automated testing: 5 lessons from Atlassian’s Kubernetes team on testing infrastructure as code
How to export Kubernetes events for observability and alerting
Incident Postmortem Template

Back Market

Blog Posts

How Back Market SREs prepared for Black Friday

Baidu

Videos

Anomaly Detection on Golden Signals
NetRadar: Monitoring the Datacenter Network
Let the Chaos Begin—SRE Chaos Engineering Meets Cybersecurity

Basecamp

Blog Posts

Inside a CODE RED: Network Edition
Three Basecamp outages. One week. What happened?
Basecamp 2 and Basecamp 3 search outage report
Reducing Incident Escalations at Basecamp

Books

Shape Up

Bloomberg

Videos

Capacity Planning and Performance Enhancement with Page Reference Sampling
Why SREs can't afford to NOT do Chaos Engineering
Tracing Real-Time Distributed Systems
The Bloomberg Story: Building SRE Teams in an "Immeasurable" Organisation
Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest

Booking.com

Blog Posts

How Reliability and Product Teams Collaborate at Booking.com
Incidents, fixes, and the day after
Troubleshooting: A journey into the unknown

Videos

SLOs for Data-Intensive Services
Benefits of Taking the Less Traveled Road with Containers Infrastructure

Capital One

Blog Posts

Automate Application Monitoring with Slack
Automate AWS Infrastructure with Boto 3: AWS Health Check
Active-Active Shared-Nothing Database Architecture
The 3 R’s of SREs: Resiliency, Recovery & Reliability
5 Steps to Getting Your App Chaos Ready
4 Real-World Scenarios That Read Like Chaos Engineering Experiments
Embrace the Chaos … Engineering
3 Lessons Learned From Implementing Chaos Engineering at Enterprise
A Deep Dive Into Seamless Blue/Green Deployment Using AWS CodeDeploy
Secure Docker Containers Require Secure Applications
4 Steps for Pairing the Cloud and DevOps to Improve Resiliency
Container Ready Applications with Twelve-Factor App and Microservices Architecture
Deploying with Confidence — Minimize Risk, Maximize Resiliency With Canary Deployments on AWS
Architecting for Resiliency
Continuous Chaos — Introducing Chaos Engineering into DevOps Practices
The Mon-ifesto Part 1: Metrics

Major incidents & analysis reports

Information on the Capital One Cyber Incident
A Case Study of the Capital One Data Breach

Videos

Banking on Continuous Delivery - Capital One
Continuous Chaos in DevOps - Capital One
DevOps at Capital One: Focusing on Pipeline and Measurement
Automating the Management of the Operational Health of Cloud Accounts at Scale

Coinbase

Blog Posts

Open Sourcing Coinbase’s Secure Deployment Pipeline

DAZN

Blog Posts

Site Reliability at DAZN

DBS

Blog Posts

Presenting at iThome’s SRE Conference: Our DBS SRE Transformation Journey Thus Far
Debunking the seven most popular Site Reliability Engineering myths
How To Use SRE To Cultivate A Blameless Culture In The Workplace
Site Reliability Engineering at DBS Bank
Automating Configuration Management at Scale
How DBS dispelled the myths of Chaos Engineering
Double, Double Toil and Trouble

Videos

SREcon Conversations Asia/Pacific with Koon Seng Lim, DBS

DeepSource

Blog Posts

Redis diskless replication: What, how, why and the caveats
How to setup Vault with Kubernetes
Breaking down zero downtime deployments in Kubernetes

Dream11

Blog Posts

Deployment At Scale: Story Behind Dream11’s In-House Blue-Green Deployment Platform ‘OneClick’.
Enhancing security and trust with AWS WAFv2
Lessons learned from running GraphQL at scale
Break circuits, save Kong 🦍
Finding Order in Chaos: How We Automated Performance Testing with Torque
Maintaining hyper-sonic releases at Dream11
To Scale In Or Scale Out? Here’s How We Scale at Dream11
Building Scalable Real Time Analytics, Alerting and Anomaly Detection Architecture at Dream11

Dropbox

Blog Posts

Dropbox Engineering Career Framework - Reliability Engineer (SRE)
Atlas: Our journey from a Python monolith to a managed platform
Monitoring server applications with Vortex
Athena: Our automated build health management system

Videos

Service Discovery Challenges at Scale

eBay

Blog Posts

Resiliency and Disaster Recovery with Kafka
SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue
SRE Case Study: Mysterious Traffic Imbalance
Zero Downtime, Instant Deployment and Rollback

Video

Madaari: Ordering for the Monkeys

Epic Games

Video

AWS re:Invent 2018: Epic Games Uses AWS to Deliver Fortnite to 200 Million Players

Etsy

Blog Posts

Improving the Deployment Experience of a Ten-Year Old Application
How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020
Your brain on progress
Etsy’s Debriefing Facilitation Guide for Blameless Postmortems
Opsweekly: Measuring on-call experience with alert classification
Demystifying Site Outages
Blameless PostMortems and a Just Culture
Measure Anything, Measure Everything

Videos

Velocity 09: John Allspaw and Paul Hammond, "10+ Deploys Pe
Migrating a Monolith to the Cloud

Expedia

Blog Posts

Automating Performance Standards
Error Budget Policy - Part 1 - Adoption at Expedia Group
Error Budget Policy - Part 2 - Practices at Expedia Group
Using Fault-Injection to Improve our new Runtime Platform’s Reliability
Learning from Incidents at Expedia Group
Improving Vrbo Homepage Loading Experience
Troubleshooting 502 errors: ECS Checklist
Getting Started with Elasticsearch
All about ISTIO-PROXY 5xx Issues
Autoscaling in Kubernetes: Why doesn’t the Horizontal Pod Autoscaler work for me?
How to Keep Your Kubernetes Deployments Balanced Across Multiple zones
Are Your Dropwizard Latency Metrics Misleading You?
The Cost of 100% Reliability
Creating Monitoring Dashboards
Using Bash for DevOps

Fastly

Videos

SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager
Resilience Engineering Mythbusting

Getaround

Blog Posts

How we handle incidents at Getaround
Evolution Of Our Continuous Delivery Process

GitHub

Blog Posts

How GitHub uses GitHub Actions and Actions larger runners to build and test GitHub.com
The GitHub Security Lab’s journey to disclosing 500 CVEs in open source projects
CodeQL team uses AI to power vulnerability detection in code
Addressing GitHub’s recent availability issues
Building organization-wide governance and re-use for CI/CD and automation with GitHub Actions
Enabling branch deployments through IssueOps with GitHub Actions
Using ChatOps to help Actions on-call engineers
Partitioning GitHub’s relational databases to handle scale
Increasing developer happiness with GitHub code scanning
Why (and how) GitHub is adopting OpenTelemetry
Improving large monorepo performance on GitHub
Deployment reliability at GitHub
Improving how we deploy GitHub
Building On-Call Culture at GitHub
Reducing flaky builds by 18x
The evolving role of operations in DevOps
Getting started with DevOps automation
MySQL High Availability at GitHub

Major incidents & analysis reports

GitHub Availability Report: August 2023
GitHub Availability Report: July 2023
GitHub Availability Report: June 2023
GitHub Availability Report: May 2023
GitHub Availability Report: April 2023
GitHub Availability Report: March 2023
GitHub Availability Report: February 2023
GitHub Availability Report: January 2023
GitHub Availability Report: December 2022
GitHub Availability Report: November 2022
GitHub Availability Report: October 2022
GitHub Availability Report: September 2022
GitHub Availability Report: August 2022
GitHub Availability Report: July 2022
GitHub Availability Report: June 2022
GitHub Availability Report: May 2022
GitHub Availability Report: April 2022
GitHub Availability Report: March 2022
GitHub Availability Report: February 2022
GitHub Availability Report: January 2022
GitHub Availability Report: December 2021
GitHub Availability Report: November 2021
GitHub Availability Report: October 2021
GitHub Availability Report: September 2021
GitHub Availability Report: August 2021
GitHub Availability Report: July 2021
GitHub Availability Report: June 2021
GitHub Availability Report: May 2021
GitHub Availability Report: April 2021
GitHub Availability Report: March 2021
GitHub Availability Report: February 2021
GitHub Availability Report: January 2021
GitHub Availability Report: December 2020
GitHub Availability Report: November 2020
GitHub Availability Report: August 2020
GitHub Availability Report: July 2020
Introducing the GitHub Availability Report
February service disruptions post-incident analysis
October 21 post-incident analysis
February 28th DDoS Incident Report
Incident Report: Inadvertent Private Repository Disclosure

Videos

One on One SRE

GitLab

Blog Posts

This SRE attempted to roll out an HAProxy config change. You won't believe what happened next...
My week shadowing a GitLab Site Reliability Engineer
Update: Elasticsearch lessons learnt for Advanced Global Search
Lessons in iteration from a new team in infrastructure
How we optimized infrastructure spend at GitLab
How we scaled async workload processing at GitLab.com using Sidekiq
Inside GitLab: How we release software patches
What tracking down missing TCP Keepalives taught me about Docker, Golang, and GitLab
How we used delayed replication for disaster recovery with PostgreSQL

GoCardless

Blog Posts

Deploying Software at GoCardless: Open-Sourcing our “Getting Started” Tutorial
How we compress Pub/Sub messages and more, saving a load of money
Fear-free PostgreSQL migrations for Rails
Observability at GoCardless: a tale of API performance improvement
Debugging the PostgreSQL query planner
Zero-downtime Postgres migrations - the hard parts
In search of performance - how we shaved 200ms off every POST request

Major incidents & analysis reports

Incident review: Service outage on 25 October 2020, Vault TLS expiry
Incident review: API and Dashboard outage on 10 October 2017

GoDaddy

Blog Posts

Kubernetes Gated Deployments
Kubernetes External Secrets
Kubernetes - A Practical Introduction for Application Developers
An Intuitive Node.js Client for the Kubernetes API

Gojek

Blog Posts

Introducing Skynet: Infrastructure as Code for Gojek
Scaling Our Geo-Search Service For 10x Load
Why We Swear by the RCA
How We Upgrade Kubernetes on GKE
How We Monitor Apache Airflow in Production

Goldman Sachs

Blog Posts

Observability at Scale
Enabling Highly Available Trino Clusters at Goldman Sachs
Infrastructure and the Command Chain Pattern
Mobile CICD with EC2 macOS
Announcing CatchIT - Source Code Secret Scanner
Building Platforms for Data Engineering

Google

Blog Posts

Pitfalls and Patterns in Microservice Dependency Management
SRE Practices & Processes
Google site reliability using Go
Three months, 30x demand: How we scaled Google Meet during COVID-19
SRE Classroom: Distributed PubSub
How SRE teams are organized, and how to get started

Videos

What's the Difference Between DevOps and SRE? with Seth Vargo and Liz Fong-Jones of Google
‘Risk and Error Budgets’ with Seth Vargo and Liz Fong-Jones of Google
‘Pragmatic Automation’ with Max Luebbe of GCP
Must Watch! - Google SRE YouTube Playlist
Squish Level Objectives: How SRE can Help Align Technical Work to User Benefit
Implementing Distributed Consensus
The SRE I Aspire to Be
SRE Classroom, Or, How to Design a Reliable Distributed System in 3 Hours
Zero Touch Prod: Towards Safer and More Secure Production Environments
All of Our ML Ideas Are Bad (and We Should Feel Bad)
The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It
Deploying SRE Training Best Practices to Production: How We SRE'ed Our SRE Education Program
Bigtable: A Journey from Binary to Service and the Lessons Learned along the Way
Practical Instrumentation for Observability
What Is ML Ops: Solutions and Best Practices for DevOps of Production ML Services
Unified Reporting of Service Reliability
How to Trade off Server Utilization and Tail Latency
Keeping the Balance: Internet-Scale Loadbalancing Demystified
From Black Box to a Known Quantity: How to Build Predictable, Reliable ML-based Services
Mindfulness in SRE: Monitoring and Alerting for One's Self
Pragmatic Automation
Sublinear Scaling in Practice: The 1k SRE Project
Strategies to Edit Production Data
The Curse of SRE Autonomy and How to Manage It
Scaling SRE Organizations: The Journey from 1 to Many Teams
SRE Classroom - How to Design a Distributed System in 3 Hours
Using PRDs and User Journeys to Design User-Friendly Tools
How Google SRE and Developers Work Together
SREcon21 - Experiments for SRE

Grab

Blog Posts

Our Journey to Continuous Delivery at Grab (Part 1)
Our Journey to Continuous Delivery at Grab (Part 2)
Designing Resilient Systems: Circuit Breakers or Retries? (Part 1)
Designing Resilient Systems: Circuit Breakers or Retries? (Part 2)
Designing Resilient Systems Beyond Retries (Part 3): Architecture Patterns and Chaos Engineering
Orchestrating Chaos using Grab's Experimentation Platform
How We Designed the Quotas Microservice to Prevent Resource Abuse
How We Scaled Our Cache and Got a Good Night's Sleep

Grammarly

Blog Posts

Scaling AWS Infrastructure to Support Multiple Regions
Security Operations in an AWS Environment

Gusto

Blog Posts

Service Level Objectives for On-call Peace of Mind
Debugging Sidekiq Poison Pills

Halodoc

Blog Posts

Site Reliability Engineering for Native mobile apps

Heroku

Blog Posts

The Adventures of Rendezvous in Heroku’s New Architecture
Incident Response at Heroku

IBM

Blog Posts

What is Site Reliability Engineering (SRE)?
AIOps tools and solutions

Indeed

Blog Posts

Indeed SRE: An Inside Look
Being Just Reliable Enough
Automating Indeed’s Release Process
‘Sloth, a Tool for Inducing Network Failures’ with Preetha Appan of Indeed.com

Videos

Are We Getting Better Yet? Progress Toward Safer Operations

Khan Academy

Blog Posts

How Khan Academy Successfully Handled 2.5x Traffic in a Week
Evolving our content infrastructure

LinkedIn

Blog Posts

Rethinking site capacity projections with Capacity Analyzer
Insights into a Product SRE team at LinkedIn
Hiring SREs at LinkedIn
Open source update: School of SRE
Fixing Linux filesystem performance regressions
Production testing with dark canaries
Smart alerts in ThirdEye, LinkedIn’s real-time monitoring platform
Iris mobile: An open source, mobile interface for incident management
LinkedOut: A Request-Level Failure Injection Framework
Eliminating toil with fully automated load testing
The Makeup of Successful Geographically-Distributed SRE Teams: Part 1
The Makeup of Successful Geographically-Distributed SRE Teams: Part 2
Project STAR*: Streamlining Our On-Call Process
Automating Your Oncall: Open Sourcing Fossor and Ascii Etch
Resilience Engineering at LinkedIn with Project Waterbear
Hiring SREs at LinkedIn, 2017
Open Sourcing Iris and Oncall
Building the SRE Culture at LinkedIn
Failure is Not an Option
MTTD and MTTR Are Key
What Gets Measured Gets Fixed

Videos

Growing the Site Reliability Team at LinkedIn: Hiring is Hard -- Greg Leffler
9 Years of Failure: How Racing Crappy Cars Made Me a Better SRE
Weathering the Storm: How Early Warnings Save the Farm
Unconference: Unsolved Problems in SRE
Leading without Managing: Becoming an SRE Technical Leader
Why Does (My) Monitoring Suck?
Traffic Forecasting and Stress Testing Infrastructure
Collective Mindfulness for Better Decisions in SRE
TCP—Architecture, Enhancements, and Tuning
Over 600 Million Members and Hundreds of Micro Services: How We Scaled Our Monitoring System to Keep up
Understanding Business Metrics Can Make You a Better SRE
Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way
Differences in SRE Implementations across Companies

Tools

On-Call

Loggi

Blog Posts

The Release Manager model
SRE Teams #8: Loggi

Loveholidays

Blog Posts

Dynamic alert routing with Prometheus and Alertmanager
Making loveholidays 18% faster with HTTP/3
Enforcing best practice on self-serve infrastructure with Terraform, Atlantis and Policy As Code
The 5 principles that helped scale loveholidays
Realtime Fastly logs with Grafana Loki for under $1 a day

Macquarie

Blog Posts

Our DevSecOps journey with Golang
Pipeline Configuration as Code with Kotlin
DevOps and Segregation of Duties
Macquarie embraces DevOps
Scaling a Kubernetes Platform across the Enterprise

Mattermost

Blog Posts

Monitoring Cloud Environments at Scale with Prometheus and Thanos
How We Use Sloth to do SLO Monitoring and Alerting with Prometheus

Meituan (美团)

Blog Posts

The development and practice of SRE in the cloud (云端的SRE发展与实践)

Mercari

Blog Posts

Who Watches the Watchmen? Keeping an Eye on Our Monitoring Systems
What the Microservices SRE Team are doing as SRE Evangelists
What it’s like to work as an embedded microservices SRE
The Merpay SRE Team: Past and future
Embedded SRE at Mercari
What the SRE team wants to achieve with the development team
DevSecOps: What Is It and Why Is It Gaining Momentum in the Industry?
How do we share troubleshooting skills
Datadog Dashboard at Scale w / Terraform

Meta

Blog Posts

Improving Meta’s SLO workflows with data annotations
SLICK: Adopting SLOs for improved reliability
More details about the October 4 outage
Update about the October 4th outage

Videos

A Customer Service Approach to SRE
How (Not) to Scale a Project: A Post-Mortem
Releasing the World's Largest Python Site Every 7 Minutes
Using ML to Automate Dynamic Error Categorization

Microsoft

Videos

‘SLI & Reliability Deep-Dive’ with David N. Blank-Edelman of Microsoft
‘Ironies of Automation: A Comedy in Three Parts’ with Tanner Lund of Microsoft
Sustainable Software Engineering & SREs
Study on Human Factors and Team Culture to Improve Pager Fatigue
Prioritizing Trust While Creating Applications
Building Resilience: How to Learn More from Incidents
A Tale of Two Postmortems: A Human Factors View
Availability—Thinking beyond 9s
Ironies of Automation: A Comedy in Three Parts
The Ops in Serverless

Miro

Blog Posts

Prometheus High Availability and Fault Tolerance strategy, long term storage with VictoriaMetrics
Managing hundreds of servers for load testing: Autoscaling, custom monitoring, DevOps culture
Reliable load testing with regards to unexpected nuances

Monzo

Blog Posts

Autoscaling Monzo: How we optimise our platform to be just the right size
How we’ve evolved on-call at Monzo
How we respond to incidents
How we monitor Monzo

Videos

Eventually Consistent Service Discovery

Tools

Response

Netflix

Blog Posts

Achieving observability in async workflows
Building Netflix’s Distributed Tracing Infrastructure
Lessons from Building Observability Tools at Netflix
Edgar: Solving Mysteries Faster with Observability
Telltale: Netflix Application Monitoring Simplified
Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix
Introducing Dispatch
Applying Netflix DevOps Patterns to Windows
ChAP: Chaos Automation Platform
Starting the Avalanche
Netflix Chaos Monkey Upgraded
Chaos Engineering Upgraded
Automated Failure Testing
From Chaos to Control — Testing the resiliency of Netflix’s Content Discovery Platform
Introducing Atlas: Netflix’s Primary Telemetry Platform
FIT: Failure Injection Testing
Announcing Security Monkey — AWS Security Configuration Monitoring and Analysis
Lessons Netflix Learned from the AWS Outage
Scryer: Netflix’s Predictive Auto Scaling Engine

Major incidents & analysis reports

Post-mortem of October 22, 2012 AWS degradation

Videos

AWS re:Invent 2019: A day in the life of a Netflix engineer (NFX202)
When /bin/sh Attacks: Revisiting "Automate All the Things"
How Did Things Go Right? Learning More from Incidents
Monitoring and Tracing @Netflix Streaming Data Infrastructure
Real user performance monitoring at Netflix scale ‐ Martin Spier
AWS re:Invent 2017 - Nora Jones Describes Why We Need More Chaos - Chaos Engineering, That Is
AWS re:Invent 2017: Performing Chaos at Netflix Scale (DEV334)
Netflix: Multi-Regional Resiliency and Amazon Route 53
Designing Services for Resilience: Netflix Lessons
South Bay SRE Meetup - Netflix Cloud Performance Team
AWS re:Invent 2017: A Day in the Life of a Netflix Engineer III (ARC209)
How Netflix Uses Kinesis Streams to Monitor Applications and Analyze Billions of Traffic Flows
Mastering Chaos - A Netflix Guide to Microservices
AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global Architecture (ARC204)
SREcon 2016 - Netflix: 190 Countries and 5 CORE SREs
From Sys Admin to Netflix SRE
Application Resilience Engineering and Operations at Netflix with Hystrix
Injecting Failure at Netflix
LISA13 - How Netflix Embraces Failure to Improve Resilience and Maximize Availability
Incident Management at Netflix Velocity

Podcasts

Ryan Kitchens on Learning from Incidents at Netflix, the Role of SRE, and Sociotechnical Systems

Tools

Dispatch

New Relic

Blog Posts

Defining Modern Software Roles: SREs at New Relic
10 Things Everybody Needs to Know About Site Reliability Engineering (SRE)
What Tools Do Site Reliability Engineers Use?
A Day in the Life of a New Relic SRE
7 Habits of Highly Successful Site Reliability Engineers
Adopting the practice of SRE
Using modern observability to establish a data-driven culture

Nubank

Blog Posts

How we deal with technical incidents
How we do On-Call Rotations at Nubank
How we scale our data platform efficiently and reliably
Why We Killed Our End-to-End Test Suite
Automatic retraining for machine learning models: tips and lessons learned

OpenAI

Blog Posts

March 20 ChatGPT outage: Here’s what happened
OpenAI SRE and scaling explained easy.
Scaling Kubernetes to 2,500 nodes
Scaling Kubernetes to 7,500 nodes
Scaling AI Infrastructure at OpenAI

PayPal

Blog Posts

Triggered: Incident #1234 (incident process needs fixing)
Implementing Observability in a Service Mesh
PostgreSQL at Scale: Database Schema Changes Without Downtime
Scaling GraphQL at PayPal

Videos

SREcon Conversations Asia/Pacific with Karthikeyan Selvaraj and Rajesh Ramachandran, PayPal
SRE Then vs SRE Now: A Balancing Act between Reflexes and Intuitive Instincts at PayPal
Detecting Service Degradation and Failures at Scale through Distributed Log Processing
Operating Elasticsearch with Ease at Scale
Ensuring Site Reliability through Security Controls

Picnic

Blog Posts

Micrometer and the Modern Observability Stack
Monitoring and Observability at Picnic

Pinterest

Blog Posts

Ensuring High Availability of Ads Realtime Streaming Services
Improving efficiency and reducing runtime using S3 read optimization
Scaling Kubernetes with Assurance at Pinterest
What we learned from an iOS app OOMs incident
How we designed our Continuous Integration System to be more than 50% Faster
Simplifying web deploys
Upgrading Pinterest operational metrics
Distributed tracing at Pinterest with new open source tools
Auto scaling Pinterest

Videos

Building Actionable Code Ownership
Evolution of Observability Tools at Pinterest
Automating OS/Platform Upgrades for Service Owners

Postman

Blog Posts

Learn how your Kubernetes clusters respond to failure using Gremlin and Grafana

Prezi

Blog Posts

How to avoid global outage — Seamlessly migrating DaemonSet labels
In search of speed — debugging Elasticsearch performance
Prometheus at Prezi: replacing 10 years of anti-patterns

Red Hat

Blog Posts

From Ops to SRE: Evolution of the OpenShift Dedicated Team
5 Agile Practices Every SRE Team Should Adopt
7 Best Practices for Writing Kubernetes Operators: An SRE Perspective

Riot Games

Blog Posts

THE LEGENDS OF RUNETERRA CI/CD PIPELINE
STRATEGIES FOR WORKING IN UNCERTAIN SYSTEMS
IMPROVING THE DEVELOPER EXPERIENCE FOR OPERATING SERVICES
SCALABILITY AND LOAD TESTING FOR VALORANT
LEVERAGING GOLANG FOR GAME DEVELOPMENT AND OPERATIONS
CONTROLLED CHAOS WITH FAULT INJECTION TESTING
DOWN THE RABBIT HOLE OF PERFORMANCE MONITORING
PROFILING: THE CASE OF THE MISSING MILLISECONDS
PROFILING: REAL WORLD PERFORMANCE IN LEAGUE
PROFILING: OPTIMISATION
PROFILING: MEASUREMENT AND ANALYSIS
RUNNING ONLINE SERVICES AT RIOT: PART I
RUNNING ONLINE SERVICES AT RIOT: PART II
RUNNING ONLINE SERVICES AT RIOT: PART III
RUNNING ONLINE SERVICES AT RIOT: PART III: PART DEUX
RUNNING ONLINE SERVICES AT RIOT: PART IV
RUNNING ONLINE SERVICES AT RIOT: PART V
THE EVOLUTION OF SECURITY AT RIOT
RUNNING AN AUTOMATED TEST PIPELINE FOR THE LEAGUE CLIENT UPDATE
AUTOMATED TESTING FOR LEAGUE OF LEGENDS

Salesforce

Blog Posts

Looking at the Kubernetes Control Plane for Multi-Tenancy
Optimizing EKS networking for scale
Zero Downtime Node Patching in a Kubernetes Cluster
How, Not Why: An Alternative to the Five Whys for Post-Mortems
A Generic Sidecar Injector for Kubernetes
Implementation of a monitoring strategy for products based on microservices
10 Steps to Develop an Incident Response Plan You’ll ACTUALLY Use
Our Journey to a Near Perfect Log Pipeline
Optimizing Performance with Web Workers
Take A Moment To Refocus

Schibsted Media

Blog Posts

Reliability engineering for some of top 10 sites in Scandinavia

Scribd

Blog Posts

Learning from incidents: getting Sidekiq ready to serve a billion jobs
A testimonial for using PagerDuty at Scribd
Assigning pager duty to developers

Shopify

Blog Posts

Resiliency Planning for High-Traffic Events
Capacity Planning at Scale
Using DNS Traffic Management to Add Resiliency to Shopify’s Services
Four Steps to Creating Effective Game Day Tests
Implementing ChatOps into our Incident Management Procedure
StatsD at Shopify

Videos

Network Monitor: A Tale of ACKnowledging an Observability Gap
Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures
Advanced Napkin Math: Estimating System Performance from First Principles

Sky Betting and Gaming

Blog Posts

It’s Just a Monitoring Change
“What's the worst that could happen?”: A worked example of how we deal with live incidents
Rising from the Ashes
Crash! Bang! Wallop! Practice makes perfect
Performance Left Right and Center

Slack

Blog Posts

Slack’s Incident on 2-22-22
Infrastructure Observability for Changing the Spend Curve
Slack’s Outage on January 4th 2021
A Terrible, Horrible, No-Good, Very Bad Day at Slack
Deploys at Slack
Disasterpiece Theater: Slack’s process for approachable Chaos Engineering

Videos

Slack at the Edge
What Breaks Our Systems: A Taxonomy of Black Swans

Slalom Build

Blog Posts

How to Implement Service Level Objectives in New Relic APM
Beginners Guide to DevOps: How to Make It into the Industry
GitHub Actions: Beyond CI/CD
Why isn’t all test automation run on the pipeline?
The Many Shapes of Site Reliability Engineering
How to build a secure by default Kubernetes cluster with a basic CI/CD pipeline on AWS
Secret Management Architectures: Finding the balance between security and complexity
Detecting Malicious Requests with Keras & Tensorflow
The Lego Monolith — A Monolith Microservice Proof of Concept
Managing Secrets Using Hashicorp Vault
Packaging Spring Boot Applications for Deployment on Kubernetes
Immutable Infrastructure and Continuous Delivery in the Cloud

SoundCloud

Blog Posts

How to Successfully Hand Over Systems
Building a Healthy On-Call Culture
Alerting on SLOs like Pros
Hands-Off Deployment with Canary
Prometheus has come of age – a reflection on the development of an open-source project
Prometheus: Monitoring at SoundCloud
What I Learned in One Year as an SRE Trainee
Tests Under the Magnifying Lens

Spotify

Blog Posts

Matt Clarke: Senior Backend Infrastructure Engineer
Designing a Better Kubernetes Experience for Developers
Techbytes: What The Industry Misses About Incidents and What You Can Do
Automated Incident Response Infrastructure in GCP

Videos

Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance

Squarespace

Blog Posts

Under the Hood: Ensuring Site Reliability

Videos

Pushing through Friction
How to SRE When Everything's Already on Fire
Case Study: Implementing SLOs for a New Service
Creating a Code Review Culture

Stack Overflow

Blog Posts

“This should never happen. If it does, call the developers.”
Infrastructure as code: Create and configure infrastructure elements in seconds
Fulfilling the promise of CI/CD
A deeper dive into our May 2019 security incident
Guest Post - Failing over without falling over
How We Built Our Blog
Stack Overflow Frees Up Engineering Time with Netlify

Videos

Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation, and Discipline

Strava

Blog Posts

Scaling Club Leaderboard Infrastructure for Millions of Users
Distributed Tracing at Strava

Stripe

Blog Posts

Fast and flexible observability with canonical log lines
Fast builds, secure builds. Choose two.
Introducing Veneur: high performance and global aggregation for Datadog

Videos

How Stripe Invests in Technical Infrastructure
The AWS Billing Machine and Optimizing Cloud Costs

Target

Blog Posts

Ɔhaos Ǝnginǝǝring @ Target - Part 2
Ɔhaos Ǝnginǝǝring @ Target - Part 1
GoAlert - Your Future Open Source, On-Call Notification Product

Teads

Blog Posts

Scaling your on-duty team

Tinder

Blog Posts

The Ultimate Load Test
How We Improved Our Performance Using ElasticSearch Plugins: Part 1
How We Improved Our Performance Using ElasticSearch Plugins: Part 2
Tinder’s move to Kubernetes

Tokopedia

Blog Posts

Benefits of benchmarking with Go
Simulating Customized Chaos in Golang using Toxiproxy
How Tokopedia Rank Millions of Products in Search Page

Trivago

Blog Posts

How To Get Fooled By Metrics

Twilio

Blog Posts

Twilio SRE Gameday Template

Twitter

Blog Posts

Logging at Twitter: Updated
Deleting data distributed throughout your microservices architecture
Deterministic Aperture: A distributed, load balancing algorithm
MetricsDB: TimeSeries Database for storing metrics at Twitter
The Infrastructure Behind Twitter: Scale
The infrastructure behind Twitter: efficiency and optimization

Uber

Blog Posts

Founding Uber SRE
Disaster Recovery for Multi-Region Kafka at Uber
Engineering Failover Handling in Uber’s Mobile Networking Infrastructure
Optimizing Observability with Jaeger, M3, and XYS at Uber

Videos

A Tale of Two Rotations: Building a Humane & Effective On-Call
Testing in Production at Scale
‘A History of SRE at Uber’ with Rick Boone of Uber

Udemy

Blog Posts

Blameless Incident Reviews at Udemy
How Udemy does Build Engineering

upGrad

Blog Posts

Web Performance and Related Stories — upgrad.com
Beginner’s guide to web analytics
iOS Continuous Deployment with Bitbucket, Jenkins and Fastlane at UpGrad

VGW

Blog Posts

The SRE Incident Response game

Videos

Level Up Your Incident Response With Gameplay

Wikimedia Foundation

Videos

Testing Encyclopedias in Production
What Happens When You Type en.wikipedia.org?

Wix

Blog Posts

How We Improved Website Performance by Evolving Our Infrastructure
Wix Inbox Journey: 3 Approaches for Zero Downtime Database Migration
Moving Velo to Multiple Container Sites: The Why, The How and The Lessons Learned
Making Order in CI/CD Mess

Yelp

Blog Posts

The process: Implementing Yelp’s failover strategy

Videos

Yelp - What I Wish I Knew before Going On-Call

Zalando

Blog Posts

Tracing SRE’s journey in Zalando - Part I
Tracing SRE’s journey in Zalando - Part II
Tracing SRE’s journey in Zalando - Part III

Zerodha

Blog Posts

Infrastructure monitoring with Prometheus at Zerodha

Zomato

Blog Posts

Huddle Diaries – DevOps and Data Platform

SRECon Mix Playlist

Videos

Adobe - The Good, the Bad and the Ugly: The 3 Learnings of an SRE
Amdocs - SREs at Telecom and Media Industry: Bridging between Legacy and Cloud Native Apps
Amazon - Confessions of a Systems Engineer: Learning from My 20+ Years of Failure
Alaska Airlines - Capacity Prediction in External Services
BuzzFeed - Optimizing for Learning
BT - Challenges of Starting an SRE Team from Scratch in an Enterprise
Cloudflare - Support Operations Engineering: Scaling Developer Products to the Millions
Hudson River Trading - Fixing On-Call When Nobody Thinks It's (Too) Broken
IBM - Why Automating Everything Adds to Your Toil
Genesys - The Smallest Possible SRE Team
G-Research - My Life as a Solo SRE
Grafana Labs - SRE in the Third Age
Kenna Security - Building a Scalable Monitoring System
Lightstep - Building Service Ownership Using Documentation, Telemetry, and a Chance to Make Things Better
MessageBird - Autopsy of a MySQL Automation Disaster
Netlify - Perks and Pitfalls of Building a Remote First Team
ReactiveOps - Zero to SRE
Salesforce - Incident Response in Unfamiliar Sociotechnical Systems: One Incident Commander's Challenges Supporting Inter-organizational Anomaly Response in the Age of COVID-19
Sprax - From Nothing to SRE: Practical Guidance on Implementing SRE in Smaller Organisations
The New York Times - SRE by Influence, Not Authority: How the New York Times Prepares for Large-Scale Events
Twitter - Hiring Great SREs
United States Digital Service - Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value
Unity Technologies - Being Reasonable about SRE
Udemy - How to Do SRE When You Have No SRE
Vanguard - Cloudy with a Chance of Chaos
WeWork - Learning from Learnings: Anatomy of Three Incidents
Zendesk - Latency and Availability Error Budgets Done Right at Scale

Resources

Books

New! Enterprise Roadmap to SRE
Building Secure & Reliable Systems | Read free online version hosted by Google
Site Reliability Engineering | Read free online version hosted by Google
The Site Reliability Workbook from Google | Read free online version hosted by Google
Training Site Reliability Engineers | Read free online version hosted by Google
97 Things Every SRE Should Know | Complimentary Copy from Nginx
SLO Adoption and Usage in Site Reliability Engineering
Practical Site Reliability Engineering
Implementing Service Level Objectives
Chaos Engineering
Seeking SRE
Security Chaos Engineering
Chaos Engineering Observability
Database Reliability Engineering
What Is SRE?
Database Reliability Engineering: What, Why, and How?
Observability Engineering
Chaos Engineering: Site reliability through controlled disruption
Incident Metrics in SRE | Read free online version hosted by Google
Engineering Reliable Mobile Applications
Monitoring the SRE Golden Signals
Site Reliability Engineering: Philosophies, habits, and tools for SRE success | Portable version
97 Things Every Cloud Engineer Should Know
Real-World SRE
Hands-on Site Reliability Engineering

Events

SRECon Past Events
ChaosConf
SLOConf
- SLOConf 2021 Playlist
cdCon
- cdCon 2021 Playlist
- cdCon 2020 Playlist

Other Resources

Awesome Lists

Awesome SRE
Awesome Site Reliability Engineering Tools
Awesome Chaos Engineering
Awesome Monitoring
Awesome Observability
Awesome MLOps
ML-Ops.org

SRE Resources from various organizations

Google SRE Page
Google SRE Classroom
Google Cloud SRE Page
Microsoft SRE Page
School of SRE from LinkedIn
Stripe Increment Magazine Issue 16 on Reliability
AWS Observability Recipes
Awesome Sysadmin

Incidents & postmortems

The Verica Open Incident Database
Postmortem Templates
Incident Review and Postmortem Best Practices

Newsletters

SRE Weekly Newsletter
Chaos Engineering Newsletter
DevOps Weekly Newsletter

Credits

Inspired by Howtheytest from Abhijeet Vaikar
The list of organizations is drawn from my other repo, awesome-engineering
Banner image: Cartoon vector created by vectorjuice - www.freepik.com

Other How They... repos

Howtheytest
Howtheydevops
Howtheyaws

Contribute

Contributions welcome! Read the contribution guidelines first.

License

To the extent possible under law, Unmesh Gundecha has waived all copyright and related or neighboring rights to this work.

If you decide to use this anywhere, please give credit to @upgundecha on Twitter. Also, if you like my work, check out my other projects on GitHub.
