Description
The Site Reliability Engineering Practitioner® (SREP) Certification course introduces methods for scaling services reliably and cost-effectively within an organization. This training for SRE Practitioners delves into strategies aimed at enhancing agility, fostering cross-functional collaboration, and ensuring transparency regarding service health. The course emphasizes the principles of resilience through design, automation, and closed-loop remediation processes.
Training Objectives
- Successfully implement a flourishing SRE culture in your organisation.
- Manage the organisational impact of introducing SRE.
- Build security and resilience by design in a distributed, zero-trust environment.
- Prepare for the DevOps Institute SRE Practitioner certification exam.
- Participate in unique exercises designed to apply concepts.
- Get sample documents, templates, tools, and techniques.
- Access additional value-added resources and communities.
- Continue learning and face new challenges with after-course one-on-one instructor coaching.
Course Outline
- Module 1: SRE Anti-Patterns
  - Rebranding Ops or DevOps or Dev as SRE
  - Users notice an issue before you do
  - Measuring until my Edge
  - False positives are worse than no alerts
  - Configuration management trap for snowflakes
  - The Dogpile: Mob incident response
  - Point fixing
  - Production Readiness Gatekeeper
  - Fail-Safe really?
- Module 2: SLO is a Proxy for Customer Happiness
  - Define SLIs that meaningfully measure the reliability of a service from a user’s perspective
  - Defining system boundaries in a distributed ecosystem for defining correct SLIs
  - Use error budgets to help your team have better discussions and make better data-driven decisions
  - Overall, reliability is only as good as the weakest link on your service graph
  - Error thresholds when 3rd party services are used
- Module 3: Building Secure and Reliable Systems
  - SRE and their role in Building Secure and Reliable systems
  - Design for Changing Architecture
  - Fault-tolerant Design
  - Design for Security
  - Design for Resiliency
  - Design for Scalability
  - Design for Performance
  - Design for Reliability
  - Ensuring Data Security and Privacy
- Module 4: Full-Stack Observability
  - Modern Apps are Complex & Unpredictable
  - Slow is the new down
  - Pillars of Observability
  - Implementing Synthetic and End-user monitoring
  - Observability driven development
  - Distributed Tracing
  - What happens to monitoring?
  - Instrumenting using Libraries and Agents
- Module 5: Platform Engineering and AIOps
  - Taking a Platform Centric View solves Organisational scalability challenges such as fragmentation, inconsistency, and unpredictability
  - How do you use AIOps to improve resiliency?
  - How can DataOps help you in the journey?
  - A simple recipe to implement AIOps
  - Indicative measurement of AIOps
- Module 6: SRE & Incident Response Management
  - SRE Key Responsibilities towards incident response
  - DevOps & SRE and ITIL
  - OODA and SRE Incident Response
  - Closed Loop Remediation and the Advantages
  - Swarming – Food for Thought
  - AI/ML for better incident management
- Module 7: Chaos Engineering
  - Navigating Complexity
  - Chaos Engineering Defined
  - Quick Facts about Chaos Engineering
  - Chaos Monkey Origin Story
  - Who is adopting Chaos Engineering?
  - Myths of Chaos
  - Chaos Engineering Experiments
  - GameDay Exercises
  - Security Chaos Engineering
  - Chaos Engineering Resources
- Module 8: SRE is the Purest form of DevOps
  - Key Principles of SRE
  - SREs help increase reliability across the product spectrum
  - Metrics for Success
  - Selection of Target areas
  - SRE Execution Model
  - Cultural and Behavioral Skills are key
  - SRE Case study





