Devops for Growth

Current selected tag: 'SRE - Site Reliability Engineering'

For Product Owners/Product Managers and Scrum Teams: Growth Hacking, Devops, Agile, Lean for IT, Lean Startup, customer centric, software quality...

Curated by Mickael Ruau

Scooped by Mickael Ruau

From techbeacon.com - January 28, 2022 1:03 AM

Mickael Ruau's insight:

We are regularly asked if we know any DevOps or site reliability engineering (SRE) experts available for hire. Our answer is, invariably, "Not really." It's a tough market out there.

DevOps and SRE (for large-scale software, at least) are critical approaches for success in modern software delivery and operations, as widely demonstrated every year in the State of DevOps report or the array of presentations at the DevOps Enterprise Summit.

But if you think you can achieve DevOps by hiring "DevOps experts," you are missing some contextual awareness. What exactly are you trying to improve in the first place?

If your software delivery is slow because of work you're handing off among multiple teams with diverse schedules and priorities, will a new hire really help?

We're not suggesting that you avoid hiring people with diverse skills and backgrounds; bringing in new perspectives and approaches can be quite valuable.

But conventional hiring based on expertise alone is ineffective and prevents organizations from developing the "learning muscles" that can help teams traverse the latest trends (DevOps, SRE, etc.) to their benefit at the right time, and in the right context.

Hiring experts for every need is like engaging in palliative care for organizational health. Preventive care would be to incorporate the necessary team structures and interactions—as well as a focus on people growth and sufficient slack—to effectively take in process, technology, and business changes.

Learning organizations smoothly morph as they adapt to new challenges, and they unlearn existing ways of working when they become limitations rather than enablers.

From dzone.com - December 10, 2021 6:09 AM

From blog.operaepartners.fr - November 25, 2021 12:12 AM

From slack.engineering - October 14, 2021 8:17 AM

From www.infoq.com - August 5, 2021 8:03 AM

From www.infoq.com - June 5, 2021 1:16 AM

Mickael Ruau's insight:

This eMag on Data-Driven Decision Making provides an overview of how the three main activities in software delivery can be supported by data-driven decision making to increase the effectiveness, efficiency and service reliability of a software delivery organization.

The articles in this eMag come from the InfoQ series on Data-Driven Decision Making, where Vladyslav Ukis shared his experiences from Siemens Healthineers, a large-scale distributed software delivery organization consisting of 16 software delivery teams located in three countries.

Each of the articles highlights an area where data-driven decision making can be applied:

In Product Management, Hypotheses can be used to steer the effectiveness of product decisions.
In Development, Continuous Delivery Indicators can be used to steer the efficiency of the development process.
In Operations, SRE’s SLIs and SLOs can be used to steer the reliability of services in production.

From www.blameless.com - March 15, 2021 9:05 AM

From infrastructure-as-code.com - October 29, 2020 2:19 AM

From www.atlassian.com - July 17, 2020 11:47 AM

From victorops.com - April 14, 2020 2:40 AM

From go.forrester.com - April 7, 2020 2:29 AM

Mickael Ruau's insight:

To sum up the findings:

Forty-six percent of the principles in the book work out of the box — they’re sound advice for any IT organization. This includes creating SLOs (service level objectives) that augment SLAs (service level agreements), implementing error budgets, and monitoring the four “golden signals” (latency, traffic, errors, and saturation). Do these today. Your customers will thank you.
Fifty percent of the principles are good advice — but you’ll need to tweak them for your enterprise. This includes balancing tickets between operations and development, writing your own APIs to automate processes, and bringing down production systems to test resiliency. This isn’t bad advice per se, but your mileage may vary if you don’t alter them for your enterprise.
There’s a small number — 4% — that you should not execute. This mostly had to do with load balancing, which is not an invalid approach, but Google has some geographical architecture challenges that your enterprise probably does not.

In the end, we recommend applying most of the concepts with some tweaking. Focus on the service delivery, feature velocity, and automation concepts in the book. Focus less on the architecture sections, as Google’s challenges likely don’t mirror your own.
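The error budget mentioned in the findings above comes down to simple arithmetic. The sketch below is illustrative only: the 99.9% target, the 30-day window, and the function names are assumptions chosen for the example, not figures from the Forrester analysis.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO and 10 minutes of downtime so far.
    print(f"budget: {error_budget_minutes(0.999):.1f} min")
    print(f"remaining: {budget_remaining(0.999, 10.0):.0%}")
```

A negative result from `budget_remaining` is the usual signal to slow feature releases until reliability recovers.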

From medium.com - April 2, 2020 3:13 AM

Mickael Ruau's insight:

There are three common lists or methodologies:

From the Google SRE book: Latency, Traffic, Errors, and Saturation
USE Method (from Brendan Gregg): Utilization, Saturation, and Errors
RED Method (from Tom Wilkie): Rate, Errors, and Duration

You can see the overlap, and as Baron Schwartz notes in his Monitoring & Observability with USE and RED blog, each method varies in focus. He suggests USE is about resources with an internal view, while RED is about requests, real work, and thus an external view (from the service consumer’s point of view). They are obviously related, and also complementary, as every service consumes resources to do work.

For our purposes, we’ll focus on a simple superset of five signals:

Rate — Request rate, in requests/sec
Errors — Error rate, in errors/sec
Latency — Response time, including queue/wait time, in milliseconds.
Saturation — How overloaded something is, which is related to utilization but more directly measured by things like queue depth (or sometimes concurrency). As a queue measurement, this becomes non-zero when you are saturated, often not much before. Usually a counter.
Utilization — How busy the resource or system is. Usually expressed 0–100% and most useful for predictions (as Saturation is probably more useful). Note we are not using the Utilization Law to get this (~Rate x Service Time / Workers), but instead looking for more familiar direct measurements.
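As a rough illustration of the five signals listed above, the sketch below collapses one observation window into a single record. Everything in it is an assumption made for the example: the `Request` and `Signals` types, the choice of the 95th percentile for latency, and the idea that queue depth and busy percentage arrive as periodic samples. Utilization is taken from direct samples rather than from the Utilization Law, in line with the note above.

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Request:
    latency_ms: float  # response time, including queue/wait time
    ok: bool           # False when the request ended in an error


@dataclass(frozen=True)
class Signals:
    rate_per_s: float
    errors_per_s: float
    latency_p95_ms: float
    saturation: float       # mean sampled queue depth
    utilization_pct: float  # mean sampled busy percentage, 0-100


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests: Sequence[Request], queue_depths: Sequence[float],
              busy_pcts: Sequence[float], window_s: float) -> Signals:
    """Collapse one observation window into the five signals."""
    failures = sum(1 for r in requests if not r.ok)
    latencies = [r.latency_ms for r in requests]
    return Signals(
        rate_per_s=len(requests) / window_s,
        errors_per_s=failures / window_s,
        latency_p95_ms=percentile(latencies, 95) if latencies else 0.0,
        saturation=mean(queue_depths) if queue_depths else 0.0,
        utilization_pct=mean(busy_pcts) if busy_pcts else 0.0,
    )
```

Keeping all five in one record makes it easy to emit the same shape from every service, whatever the underlying collector is.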

From monitoring.love - April 1, 2020 3:02 AM

Mickael Ruau's insight:

Start here

Distributed Systems Observability – Cindy Sridharan

What the hell is observability? How is it any different than monitoring? Is it just the “devops” vs “sysadmin” debate all over again? This article answers these and so much more.

Monitoring Isn’t Observability – Baron Schwartz

This quote from the article sums it up quite well: Monitoring tells you whether a system is working, observability lets you ask why it isn’t working.

Monitoring, Analytics, Diagnostics, Observability, and Root Cause Analysis – Baron Schwartz

Now that we’ve introduced some nuance into our world, our terminology is getting overloaded. This post sets out some definitions of what everything means and how they differ.

How do I make my applications more observable?

Monitoring and Observability with USE and RED – Baron Schwartz

USE and RED are two methods for deciding what to instrument and why. This article walks you through their meaning and usage.

Hierarchical Observability with RED – Baron Schwartz

One of the best parts about the RED Method comes when you instrument all of your services to emit the same data: it becomes soooo much easier to spot the troublesome service in a microservice/distributed architecture.
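To see why uniform RED data helps, consider the sketch below: once every service reports the same three numbers, finding the outlier is a single loop. The service names, thresholds, and the `Red` type are hypothetical and are not taken from the linked article.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Red:
    rate: float         # requests per second
    errors: float       # failed requests per second
    duration_ms: float  # e.g. a latency percentile for the window

    @property
    def error_ratio(self) -> float:
        return self.errors / self.rate if self.rate else 0.0


def troublesome(services: Dict[str, Red], max_error_ratio: float = 0.01,
                max_duration_ms: float = 500.0) -> List[str]:
    """Names of services breaching either threshold, worst error ratio first."""
    flagged = [
        name for name, red in services.items()
        if red.error_ratio > max_error_ratio or red.duration_ms > max_duration_ms
    ]
    return sorted(flagged, key=lambda n: services[n].error_ratio, reverse=True)
```

The same function works at any level of a service hierarchy, since a parent's RED numbers can be built from its children's.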

Best Practices for Observability – Charity Majors

The list of best practices at the end is worthwhile reading.

How to Monitor the SRE Golden Signals – Steve Mushero

Another method for deciding what to monitor is the Four Golden Signals, popularized by the Site Reliability Engineering book. This article series by Steve Mushero walks you through what the signals mean and how to gather them.

Who is actually doing observability?

There are a number of teams out there doing observability-like things, and some of the larger, more engineering-focused companies have more mature Observability teams that are focused on providing expertise and a platform to other teams.

Twitter

Observability at Twitter

This post dates from 2013; Twitter was one of the first companies to work toward solving the problem of monitoring high-scale, distributed systems. For more details on their architecture (from 2016), see these posts: Observability at Twitter: technical overview, part I, and Observability at Twitter: technical overview, part II.

From dzone.com - December 13, 2021 1:05 AM

Mickael Ruau's insight:

SRE advocates addressing problems blamelessly. When something goes wrong, don't try to determine who is at fault. Instead, look for systemic causes. Adopting this approach has many benefits, from the practical to the cultural. Your system will become more resilient as you learn from each failure. Your team will also feel safer when they don't fear blame, leading to more initiative and innovation.

Learning everything you can from incidents is a challenge. Understanding the benefits and best practices of analyzing contributing factors can help. In this blog post, we'll look at:

A definition for root cause analysis
A definition for contributing factor analysis
How to choose between RCAs and contributing factor analysis
Best practices for contributing factor analyses
How to incorporate learning from analyses back into development

From www.infoq.com - December 8, 2021 1:37 AM

From dzone.com - October 25, 2021 2:11 AM

Mickael Ruau's insight:

What Happened: A “Cascade of Errors”

The outage wasn’t the result of one simple mistake or oversight. It was instead a “cascade of errors” that bred a critical disruption, as the New York Times put it.

That cascade started when an engineer ran a command that was supposed to assess capacity for Facebook’s data centers. For reasons that Facebook hasn’t fully explained (maybe it was a typo, a mistake that has caused more than one serious incident in the past, but we’re just guessing), the command disrupted the backbone network that connects Facebook’s data centers. Facebook’s DNS servers also became unreachable.

An auditing tool was supposed to have detected and blocked the errant command, but Facebook said that a “bug” prevented the tool from catching the issue.

From www.padok.fr - October 8, 2021 8:24 AM

From blog.octo.com - June 24, 2021 1:13 AM

Mickael Ruau's insight:

This article aims to show how Site Reliability Engineers, or SREs (a term defined in Google's Site Reliability Engineering book series), approach their applications' metrics. We will see how they set factual objectives on those metrics to determine whether the application really delivers the expected quality of service, and how they go beyond simply visualizing it. These principles are then illustrated through Keptn, a fairly young solution on the market that nonetheless shows some of the possibilities this approach opens up.

SLI, SLO, SLA… SL-what?

The first questions that come up are often: Where do I start? What are the symptoms of a malfunction in my application? How do I tell that my application is working correctly? The SRE approach holds that a service cannot be managed correctly without understanding which behaviors matter for that service. That requires being able to measure and evaluate them, which is what ultimately makes it possible to deliver a level of quality that meets end users' expectations.
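As a minimal sketch of the distinction between an indicator and an objective, the code below measures an availability SLI as the share of good events and compares it against an SLO target. The function names and the event counts are assumptions for illustration; this is not how Keptn or any specific tool evaluates objectives.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: the measured share of events that met the quality criterion."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO: the objective the measured SLI is compared against."""
    return sli >= slo_target


if __name__ == "__main__":
    # Hypothetical window: 9,985 successful requests out of 10,000.
    sli = availability_sli(9_985, 10_000)
    print(f"SLI = {sli:.4f}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```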

From info.container-solutions.com - April 30, 2021 5:24 AM

From medium.com - November 11, 2020 11:08 AM

From fr.lifehackk.com - October 9, 2020 12:14 PM

Mickael Ruau's insight:

It is instructive to see how Duncker viewed "problem solving".

The problem-solving process, as developed by Duncker

If a goal cannot be reached immediately through obvious or habitual actions, it becomes a problem. In Duncker's words: "A problem arises when a living creature has a goal but does not know how this goal is to be reached. Whenever one cannot go from the given situation to the desired situation simply by action, then there has to be recourse to thinking. (By action we here understand the performance of obvious operations.)"
Problem solving proceeds in phases, each phase being a reformulation of the problem. Duncker describes this as follows: "... the solution of a new problem typically takes place in successive phases which (with the exception of the first phase) have, in retrospect, the character of a solution and (with the exception of the last phase), in prospect, that of a problem."
The point, or function, of a solution is also what defines it as a solution. "The functional value of a solution is indispensable for the understanding of its being a solution. It is exactly what is called the sense, the principle or the point of the solution."
Defining the principle of the solution is, in general, the first step of the solving process. "The final form of an individual solution is, in general, not reached by a single step from the original setting of the problem; on the contrary, the principle, the functional value of the solution, typically arises first, and the final form of the solution in question develops only as this principle becomes successively more and more concrete."

From github.com - April 24, 2020 2:42 AM

Mickael Ruau's insight:

Note: This is the culmination of years of work managing and optimizing the practice of technical operations groups/DevOps at scale. These are proven tactics and techniques that can be applied across any technical value delivery organization of any size to increase efficiency, satisfaction, and enterprise agility. While I had hoped to write a book on this eventually (see the outline for what that would have looked like), I do not have the time to do so, and yet these topics are extremely relevant especially as the cloud native revolution takes hold. This is not a replacement of DevOps, but instead the overarching framework that DevOps is a part of.

From devops.com - April 8, 2020 2:33 AM

From blog.digitalocean.com - April 6, 2020 2:31 AM

Mickael Ruau's insight:

In addition to monitoring our services, we also monitor our infrastructure. As a former member of the team that maintained our container clusters, I noticed enormous benefits when leveraging the USE method: utilization, saturation, and errors. Coined by Brendan Gregg, the USE method allows one to solve “80% of server issues with 5% of the effort”.

Let us take a look at how we leveraged these metrics to monitor our Kubernetes clusters.
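The USE method amounts to asking the same three questions of every resource. The sketch below shows that loop under stated assumptions: the `ResourceSample` type, the 80% utilization limit, and the sample values are hypothetical, and no real Kubernetes or node-exporter API is called.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ResourceSample:
    name: str
    utilization_pct: float  # how busy the resource is, 0-100
    saturation: float       # queued or waiting work, e.g. run-queue length
    errors: int             # error events seen in the sampling window


def use_findings(samples: Iterable[ResourceSample],
                 utilization_limit_pct: float = 80.0) -> List[str]:
    """Walk each resource and report errors, saturation, then utilization."""
    findings: List[str] = []
    for s in samples:
        if s.errors > 0:
            findings.append(f"{s.name}: {s.errors} errors")
        if s.saturation > 0:
            findings.append(f"{s.name}: saturated (queue {s.saturation:g})")
        if s.utilization_pct >= utilization_limit_pct:
            findings.append(f"{s.name}: utilization {s.utilization_pct:.0f}%")
    return findings
```

Errors are checked first here because they are usually the quickest of the three to interpret; that ordering is a choice made for this example.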

From dzone.com - April 2, 2020 2:43 AM

Why you should hire DevOps enablers, not experts

Starting an SRE Team? Stay Away From Uptime - DZone Performance

DevOps en pratique : comment réduire le Mean Time To Repair –

Infrastructure Observability for Changing the Spend Curve

SRE @ Criteo, Why What Who How ?

The InfoQ eMag: Effective Software Delivery with Data-Driven Decision Making

How to Build an SRE Team with a Growth Mindset

DevOps, SRE, GitOps, Observability: My take on some current-ish buzzwords

Love DevOps? Wait until you meet SRE

Chapter 9 | Measuring Success in SRE | SRE Guide

Forty-Six Percent Of Google's SRE Principles Apply Directly To Your Enterprise — What About The Rest?

How to Monitor the SRE Golden Signals - Faun

What is "observability"?

How to Analyze Contributing Factors - DZone DevOps

Observing and Understanding Failures: SRE Apprentices

What SREs Can Learn From Facebook’s Largest Outage - DZone DevOps

Différencier les SRE, DevOps, SysAdmin et Cloud engineers

SLO : la puissance insoupçonnée des métriques | OCTO Talks !

SRE: The Cloud Native Approach to Operations e-book

10+ Great Books For Aspiring DevOps & SRE Engineers | by Aymen El Amri | FAUN

Comprendre la fixité fonctionnelle et comment elle influence le comportement - 2020

jdumars/agileops: The Agile Operations methodology

Monitoring Your Data Center Like a Google SRE

The Why, How, and What of Metrics and Observability

Resilience First: SRE and the Four Golden Signals of Monitoring - Dzone Whitepaper