Devops for Growth
For Product Owners/Product Managers and Scrum Teams: Growth Hacking, Devops, Agile, Lean for IT, Lean Startup, customer centric, software quality...
Curated by Mickael Ruau

Current selected tag: 'SRE - Site Reliability Engineering'
Scooped by Mickael Ruau
Scoop.it!

Why you should hire DevOps enablers, not experts

Learning organizations smoothly morph as they adapt to new challenges—and they unlearn existing ways of working when those become limitations.
Mickael Ruau's insight:

We are regularly asked if we know any DevOps or site reliability engineering (SRE) experts available for hire. Our answer is, invariably, "Not really." It's a tough market out there.

DevOps and SRE (for large-scale software, at least) are critical approaches for success in modern software delivery and operations, as widely demonstrated every year in the State of DevOps report or the array of presentations at the DevOps Enterprise Summit.

But if you think you can achieve DevOps by hiring "DevOps experts," you are missing some contextual awareness. What exactly are you trying to improve in the first place?

 

If your software delivery is slow because of work you're handing off among multiple teams with diverse schedules and priorities, will a new hire really help?

 

We're not suggesting that you avoid hiring people with diverse skills and backgrounds; they can bring in valuable new perspectives and approaches.

But conventional hiring based on expertise alone is ineffective and prevents organizations from developing the "learning muscles" that can help teams traverse the latest trends (DevOps, SRE, etc.) to their benefit at the right time, and in the right context.

 

Hiring experts for every need is like engaging in palliative care for organizational health. Preventive care would be to incorporate the necessary team structures and interactions—as well as a focus on people growth and sufficient slack—to effectively take in process, technology, and business changes.

 

Learning organizations smoothly morph as they adapt to new challenges, and they unlearn existing ways of working when they become limitations rather than enablers.


Starting an SRE Team? Stay Away From Uptime - DZone Performance

A good SRE engineer will tell you your service is never down. A great SRE engineer will tell you that’s not what you should be measuring. In fact, they’ll tell you their job is customer service.

Site Reliability Engineering (SRE) has grown immensely popular, with many of the world’s largest tech companies, such as Netflix, LinkedIn, and Airbnb, employing SRE teams to keep their systems reliable and scalable.

Along the way, the SRE has become one of the most sought-after engineering roles in tech.

The role is traditionally understood as ensuring that services are reliable and unbroken, but reliability and uptime aren’t perfect metrics. Perhaps what organizations should be asking themselves is what their customers think of their service.

Wandering down to your engineering department and asking your SRE team about customer satisfaction is a good place to start.

Their answer just might surprise you.

DevOps in practice: how to reduce Mean Time To Repair

I often work with teams, agile or not, whose attention naturally goes to delivering new features. Yet in this race to produce value and stay competitive, teams have to make a choice. They regularly find themselves in the difficult situation of having to…

Infrastructure Observability for Changing the Spend Curve

A deep dive on how we crafted an order of magnitude change in our spend (10x reduction compared to baseline growth) over the last two years with iterative understanding and changes in Slack’s Continuous Integration (CI) infrastructure.

SRE @ Criteo, Why What Who How ?

At large-scale web companies, people no longer talk about PROD but about Site Reliability Engineering. Why such a change, what lies behind the terminology, who are the actors in this shift, and how can you embrace the movement? An overview of the state of the art and lessons learned from this transformation at Criteo.

The InfoQ eMag: Effective Software Delivery with Data-Driven Decision Making

This eMag on Data-Driven Decision Making provides an overview of how the three main activities in software delivery can be supported by data-driven decision making to increase the effectiveness, efficiency and service reliability of a software delivery organization.
Mickael Ruau's insight:

 

The articles in this eMag come from the InfoQ series on Data-Driven Decision Making, in which Vladyslav Ukis shared his experiences from Siemens Healthineers, a large-scale distributed software delivery organization consisting of 16 software delivery teams located in three countries.

Each of the articles highlights an area where data-driven decision making can be applied:

  • In Product Management, Hypotheses can be used to steer the effectiveness of product decisions.
  • In Development, Continuous Delivery Indicators can be used to steer the efficiency of the development process.
  • In Operations, SRE’s SLIs and SLOs can be used to steer the reliability of services in production.

Free download

 

 

How to Build an SRE Team with a Growth Mindset

In this blog post, we’ll cover what a growth mindset is and why it helps your SRE team, how to hire for a growth mindset, how to develop people into SREs with a growth mindset, and how a blameless culture empowers a growth mindset.

DevOps, SRE, GitOps, Observability: My take on some current-ish buzzwords

Blog posts about “What is DevOps” are a dime a dozen. I find myself repeating my 0.8 cent version of this, and other buzzwords that people knock aroun

Love DevOps? Wait until you meet SRE

When responding to an incident, communication templates are invaluable. Get the templates our teams use, plus more examples for common incidents.

Chapter 9 | Measuring Success in SRE | SRE Guide

Key takeaways for measuring program success in site reliability engineering, including service level indicators (SLI), service level objectives (SLO), and service level agreements (SLA).
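
As a rough illustration of how an SLI and an SLO relate, here is a minimal sketch. It assumes the SLI is a ratio of good events to total events over a window, which is one common way to define it but not necessarily the one the guide uses. The `SliWindow` structure, the function names, and the sample counts are hypothetical. The SLA, being a contractual commitment, is not modeled.

```python
from dataclasses import dataclass


@dataclass
class SliWindow:
    """Event counts for one measurement window (hypothetical structure)."""
    good_events: int   # e.g. requests served successfully and fast enough
    total_events: int  # all valid requests seen in the window


def sli(window: SliWindow) -> float:
    """Return the SLI as the fraction of good events (1.0 when there is no traffic)."""
    if window.total_events == 0:
        return 1.0
    return window.good_events / window.total_events


def meets_slo(window: SliWindow, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli(window) >= slo_target


# Made-up numbers for illustration only.
window = SliWindow(good_events=99_950, total_events=100_000)
print(f"SLI = {sli(window):.4f}")
print("SLO of 99.9% met:", meets_slo(window, slo_target=0.999))
```

Treating an empty window as fully compliant is a design choice made here for simplicity; a real system might prefer to report "no data" instead.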

Forty-Six Percent Of Google's SRE Principles Apply Directly To Your Enterprise — What About The Rest?

When Google published its Site Reliability Engineering (SRE) book — a detailed look at how it keeps production systems running — Forrester started getting a lot of questions. “Should I do this in my enterprise IT shop?” “I’m no unicorn — can I even do these things?” And perhaps most important: “What parts of the book are relevant?” To …
Mickael Ruau's insight:

To sum up the findings:

 

  • Forty-six percent of the principles in the book work out of the box — they’re sound advice for any IT organization. This includes creating SLOs (service level objectives) that augment SLAs (service level agreements), implementing error budgets, and monitoring the four “golden signals” (latency, traffic, errors, and saturation). Do these today. Your customers will thank you.
  • Fifty percent of the principles are good advice — but you’ll need to tweak them for your enterprise. This includes balancing tickets between operations and development, writing your own APIs to automate processes, and bringing down production systems to test resiliency. This isn’t bad advice per se, but your mileage may vary if you don’t alter them for your enterprise.
  • There’s a small number — 4% — that you should not execute. This mostly had to do with load balancing, which is not an invalid approach, but Google has some geographical architecture challenges that your enterprise probably does not.

In the end, we recommend applying most of the concepts with some tweaking. Focus on the service delivery, feature velocity, and automation concepts in the book. Focus less on the architecture sections, as Google’s challenges likely don’t mirror your own.
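
The error budget mentioned in the first finding can be made concrete with a little arithmetic. The sketch below assumes a request-based SLO, so the budget is simply the share of requests allowed to fail in a window. The function names and figures are illustrative and are not taken from the Forrester report or the Google book.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    return 1.0 - bad_events / error_budget(slo_target, total_events)


# Hypothetical month: 1,000,000 requests against a 99.9% SLO, 250 failures so far.
budget = error_budget(0.999, 1_000_000)
left = budget_remaining(0.999, 1_000_000, bad_events=250)
print(f"budget: about {budget:.0f} failed requests")
print(f"remaining: about {left:.0%}")
```

A team could use the remaining fraction as a release gate, for example slowing feature rollouts once it approaches zero; that policy is an assumption here, not something the summary above prescribes.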


How to Monitor the SRE Golden Signals - Faun

Site Reliability Engineering (SRE) is very popular lately, including the “Golden Signals” that you should be monitoring, but HOW do you actually get these data? This is a guide.
Mickael Ruau's insight:

There are three common lists or methodologies:

  • From the Google SRE book: Latency, Traffic, Errors, and Saturation
  • USE Method (from Brendan Gregg): Utilization, Saturation, and Errors
  • RED Method (from Tom Wilkie): Rate, Errors, and Duration

You can see the overlap, and as Baron Schwartz notes in his Monitoring & Observability with USE and RED blog, each method varies in focus. He suggests USE is about resources with an internal view, while RED is about requests, real work, and thus an external view (from the service consumer’s point of view). They are obviously related, and also complementary, as every service consumes resources to do work.

For our purposes, we’ll focus on a simple superset of five signals:

  • Rate — Request rate, in requests/sec
  • Errors — Error rate, in errors/sec
  • Latency — Response time, including queue/wait time, in milliseconds.
  • Saturation — How overloaded something is, which is related to utilization but more directly measured by things like queue depth (or sometimes concurrency). As a queue measurement, this becomes non-zero when you are saturated, often not much before. Usually a counter.
  • Utilization — How busy the resource or system is. Usually expressed 0–100% and most useful for predictions (as Saturation is probably more useful). Note we are not using the Utilization Law to get this (~Rate x Service Time / Workers), but instead looking for more familiar direct measurements.
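
To make the five signals above concrete, here is a minimal sketch that derives them from one observation window. Everything in it is an assumption for illustration: the `Request` record, the choice of mean latency and of peak queue depth as the saturation measure, and the sample data. The `utilization_law` helper only restates the approximation quoted above (Rate x Service Time / Workers) for comparison with the direct measurement.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    latency_ms: float  # response time, including queue/wait time
    failed: bool


def five_signals(requests, window_s, queue_depths, busy_worker_s, workers):
    """Summarize one observation window into the five signals."""
    return {
        "rate_rps": len(requests) / window_s,
        "errors_per_s": sum(r.failed for r in requests) / window_s,
        "latency_ms": mean(r.latency_ms for r in requests) if requests else 0.0,
        # Saturation taken as the peak sampled queue depth (an assumption).
        "saturation_queue_depth": max(queue_depths, default=0),
        # Direct measurement: busy worker-seconds over available worker-seconds.
        "utilization_pct": 100.0 * busy_worker_s / (workers * window_s),
    }


def utilization_law(rate_rps, service_time_s, workers):
    """Estimated utilization (0..1) from rate x service time / workers."""
    return rate_rps * service_time_s / workers


# Made-up sample: four requests observed over a two-second window.
requests = [Request(20.0, False), Request(40.0, False),
            Request(60.0, True), Request(80.0, False)]
signals = five_signals(requests, window_s=2.0, queue_depths=[0, 0, 3, 1],
                       busy_worker_s=3.0, workers=2)
for name, value in signals.items():
    print(f"{name}: {value}")
```

In practice a latency percentile is often more informative than the mean; the mean is used here only to keep the sketch short.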

What is "observability"?

Get a head start on answering the question, "what is observability?" with these articles
Mickael Ruau's insight:

Start here

Distributed Systems Observability – Cindy Sridharan

What the hell is observability? How is it any different than monitoring? Is it just the “devops” vs “sysadmin” debate all over again? This article answers these and so much more.

Monitoring Isn’t Observability – Baron Schwartz

This quote from the article sums it up quite well: Monitoring tells you whether a system is working, observability lets you ask why it isn’t working.

Monitoring, Analytics, Diagnostics, Observability, and Root Cause Analysis – Baron Schwartz

Now that we’ve introduced some nuance into our world, our terminology is getting overloaded. This post sets out some definitions of what everything means and how they differ.

How do I make my applications more observable?

Monitoring and Observability with USE and RED – Baron Schwartz

USE and RED are two methods for deciding what to instrument and why. This article walks you through their meaning and usage.

Hierarchical Observability with RED – Baron Schwartz

One of the best parts about the RED Method comes when you instrument all of your services to emit the same data: it becomes soooo much easier to spot the troublesome service in a microservice/distributed architecture.

Best Practices for Observability – Charity Majors

The list of best practices at the end is worthwhile reading.

How to Monitor the SRE Golden Signals – Steve Mushero

Another method for deciding what to monitor is the Four Golden Signals, popularized by the Site Reliability Engineering book. This article series by Steve Mushero walks you through what the signals mean and how to gather them.
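
The point made above about the RED Method, that services emitting the same data make the troublesome one easy to spot, can be sketched as follows. The `RedSample` record, the service names, and the numbers are all hypothetical, and ranking by error ratio is just one possible way to pick out a problem service.

```python
from dataclasses import dataclass


@dataclass
class RedSample:
    service: str
    rate_rps: float      # Rate: requests per second
    errors_per_s: float  # Errors: failed requests per second
    duration_ms: float   # Duration: e.g. a latency percentile


def error_ratio(sample: RedSample) -> float:
    """Share of requests that fail; 0.0 for a service with no traffic."""
    if sample.rate_rps == 0:
        return 0.0
    return sample.errors_per_s / sample.rate_rps


def worst_by_error_ratio(samples):
    """Return the sample whose service has the highest error ratio."""
    return max(samples, key=error_ratio)


def slowest(samples):
    """Return the sample whose service has the highest duration."""
    return max(samples, key=lambda s: s.duration_ms)


# Made-up data for three services that all report the same three metrics.
samples = [
    RedSample("checkout", rate_rps=120.0, errors_per_s=0.6, duration_ms=180.0),
    RedSample("search", rate_rps=400.0, errors_per_s=1.0, duration_ms=95.0),
    RedSample("auth", rate_rps=80.0, errors_per_s=4.0, duration_ms=60.0),
]
print("highest error ratio:", worst_by_error_ratio(samples).service)
print("slowest:", slowest(samples).service)
```

Because every service reports the same fields, the comparison needs no per-service logic, which is the benefit the article attributes to uniform instrumentation.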

Who is actually doing observability?

There are a number of teams out there doing observability-like things, and some of the larger, more engineering-focused companies have more mature Observability teams that are focused on providing expertise and a platform to other teams.

Twitter

Observability at Twitter

As far back as 2013, Twitter was one of the first companies to work toward solving the problem of monitoring high-scale distributed systems. For more details on their architecture (from 2016), see these posts: Observability at Twitter: technical overview, part I and Observability at Twitter: technical overview, part II

 


How to Analyze Contributing Factors - DZone DevOps

SRE advocates addressing problems blamelessly. When something goes wrong, don't try to determine who is at fault. Instead, look for systemic causes.
Mickael Ruau's insight:

SRE advocates addressing problems blamelessly. When something goes wrong, don't try to determine who is at fault. Instead, look for systemic causes. Adopting this approach has many benefits, from the practical to the cultural. Your system will become more resilient as you learn from each failure. Your team will also feel safer when they don't fear blame, leading to more initiative and innovation.

Learning everything you can from incidents is a challenge. Understanding the benefits and best practices of analyzing contributing factors can help. In this blog post, we'll look at:

  • A definition for root cause analysis
  • A definition for contributing factor analysis
  • How to choose between RCAs and contributing factor analysis
  • Best practices for contributing factor analyses
  • How to incorporate learning from analyses back into development

Observing and Understanding Failures: SRE Apprentices

My name is Tammy Bryant Butow. One of the cool things that I wanted to share was actually a program we created to help new SREs learn all of the skills they needed to observe and understand failures in production.

What SREs Can Learn From Facebook’s Largest Outage - DZone DevOps

The October 2021 Facebook outage is a lesson in how even expertly planned systems can sometimes fail, despite having multiple layers of reliability built-in.
Mickael Ruau's insight:

 

What Happened: A “Cascade of Errors”

The outage wasn’t the result of one simple mistake or oversight. It was instead a “cascade of errors” that bred a critical disruption, as the New York Times put it.

That cascade started when an engineer ran a command that was supposed to assess capacity for Facebook’s data centers. For reasons that Facebook hasn’t fully explained (maybe it was a typo, a mistake that has caused more than one serious incident in the past, but we’re just guessing), the command disrupted the backbone network that connects Facebook’s data centers. Facebook’s DNS servers also became unreachable.

An auditing tool was supposed to have detected and blocked the errant command, but Facebook said that a “bug” prevented the tool from catching the issue.


Telling SREs, DevOps, SysAdmins, and Cloud engineers apart

There is confusion between DevOps, SRE (Site Reliability Engineer), SysAdmin, and Cloud Engineer. We clarify these roles in this article.

SLOs: the unsuspected power of metrics | OCTO Talks !

When you operate a product or set up an infrastructure, it is natural to ask, “Is my application working properly?” There are usually two common responses:

  • Set up monitoring that shows how the application is behaving
  • Set up an alerting system to be notified when something malfunctions

However, we rarely ask whether the alerts we have set up are relevant in our context (e.g. a container restart), or whether the metrics surfaced by a prefabricated dashboard give us the information we will actually need to identify a malfunction.

Mickael Ruau's insight:

This article sets out to present how Site Reliability Engineers, or SREs (a term defined in Google's Site Reliability Engineering book series), approach their applications' metrics. We will see how they set factual objectives on those metrics to determine whether the application really delivers the expected quality of service, and how they go beyond simply visualizing it. These principles are then illustrated with Keptn, a fairly young solution on the market that nonetheless shows some of the possibilities this approach opens up.

SLI, SLO, SLA… SL-what?

The first questions we tend to ask are: Where do I start? What are the symptoms of a malfunction in my application? How do I tell that my application is working correctly? The SRE approach holds that it is impossible to manage a service correctly without understanding the behaviors that matter for that service. That requires being able to measure and evaluate them, which is what ultimately lets us deliver a level of quality that meets end users' expectations.


SRE: The Cloud Native Approach to Operations e-book

What is SRE—Site Reliability Engineering—and how can it help companies both maintain reliability and innovate quickly? This e-book explains how it works.

10+ Great Books For Aspiring DevOps & SRE Engineers | by Aymen El Amri | FAUN

BooksForDevOps is simply “The Product Hunt of Modern IT Books” and yes you can submit your favorite book or apply to feature a book you wrote ! I am the curator of this collection’s website and the…

Understanding functional fixedness and how it influences behavior - 2020

BetterHelp offers private, affordable online counseling when you need it, from licensed, board-accredited therapists. Get help, you deserve to be happy!
Mickael Ruau's insight:

 

It is illuminating to see how Duncker viewed "problem solving".

The problem-solving process, as developed by Duncker

  1. If a goal cannot be reached immediately through obvious or habitual actions, it becomes a problem. In Duncker's words: "A problem arises when a living creature has a goal but does not know how this goal is to be reached. Whenever one cannot go from the given situation to the desired situation simply by action, then there has to be recourse to thinking. (By action we here understand the performance of obvious operations.)"
  2. Problem solving proceeds in phases, each phase being a reformulation of the problem. Duncker describes this step as follows: "... the solution of a new problem typically takes place in successive phases which (with the exception of the first phase) have, in retrospect, the character of a solution and (with the exception of the last phase), in prospect, that of a problem."
  3. The point or function of a solution is also what defines it as a "solution". "The functional value of a solution is indispensable for the understanding of its being a solution. It is exactly what is called the sense, the principle, or the point of the solution."
  4. Defining the principle of the solution is, in general, the first step of the solving process. "The final form of an individual solution is, in general, not reached by a single step from the original setting of the problem; on the contrary, the principle, the functional value of the solution, typically arises first, and the final form of the solution in question develops only as this principle becomes successively more and more concrete."

jdumars/agileops: The Agile Operations methodology

The Agile Operations methodology. Contribute to jdumars/agileops development by creating an account on GitHub.
Mickael Ruau's insight:

Note: This is the culmination of years of work managing and optimizing the practice of technical operations groups/DevOps at scale. These are proven tactics and techniques that can be applied across any technical value delivery organization of any size to increase efficiency, satisfaction, and enterprise agility. While I had hoped to write a book on this eventually (see the outline for what that would have looked like), I do not have the time to do so, and yet these topics are extremely relevant especially as the cloud native revolution takes hold. This is not a replacement of DevOps, but instead the overarching framework that DevOps is a part of.


Monitoring Your Data Center Like a Google SRE

As Google SREs can attest, good DevOps monitoring systems need to be more than do-it-yourself toolkits—they need to provide real intelligence.

The Why, How, and What of Metrics and Observability

We have a varied tech stack ranging from the many services that power the cloud, from hardware to virtualization software. But with many moving pieces comes a need for observability.
Mickael Ruau's insight:

In addition to monitoring our services, we also monitor our infrastructure. As a former member of the team that maintained our container clusters, I noticed enormous benefits when leveraging the USE method: utilization, saturation, and errors. Coined by Brendan Gregg, the USE method allows one to solve “80% of server issues with 5% of the effort”.

Let us take a look at how we leveraged these metrics to monitor our Kubernetes clusters.


Resilience First: SRE and the Four Golden Signals of Monitoring - Dzone Whitepaper


Site reliability engineering (SRE) and the four golden signals of monitoring create visibility into application and infrastructure health. Learn about the golden signals and see how SRE teams are using them to improve observability and service resilience.
