Devops for Growth
For Product Owners/Product Managers and Scrum Teams: Growth Hacking, Devops, Agile, Lean for IT, Lean Startup, customer centric, software quality...
Curated by Mickael Ruau
Selected tag: 'chaos engineering'

Putting degraded mode into practice: the example of a sterilization service

IT tools are increasingly present in sterilization units. Their use simplifies and secures the process by ensuring traceability of the various steps. To cope with a malfunction, a degraded-mode procedure must be written so that the activity can be maintained. This work presents a "simulation" exercise of our degraded-mode procedure. The first step was to formalize, in a procedure, how the activity would be organized in the event of an IT failure. Then, without the staff being informed, the exercise was carried out. At the end of the exercise, a debriefing with all participants made it possible to apply the necessary corrections. Simulating a failure of our IT tool highlighted certain difficulties in applying the initial procedure. Some elements proved hard to implement under real conditions. Staff feedback therefore led to a further update of the document. The exercise made it possible to verify that our organization in the event of an IT failure was well known and correctly applied. The positive feedback from staff confirms the value of practising this type of simulation regularly.

Gremlin Releases State of Chaos Engineering 2021 Report

Gremlin released their State of Chaos Engineering 2021 report based on a community survey and their own product data. The key findings include a positive correlation between running chaos engineering...

Chaos Engineering in French — How to convince your boss to take the plunge into…

You could start by announcing that you are going to break everything in production and that it will be fun! That is the heavy-handed ("Gros Sabots") approach. Everyone will see you coming from a mile away, but apart from making noise…

Chaos Engineering: the history, principles, and practice


Principles of Chaos Engineering

Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.

Mickael Ruau's insight:

ADVANCED PRINCIPLES

The following principles describe an ideal application of Chaos Engineering, applied to the processes of experimentation described above.  The degree to which these principles are pursued strongly correlates to the confidence we can have in a distributed system at scale.
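
To make the idea of experimenting around a steady state a little more concrete, here is a minimal sketch of a steady-state hypothesis expressed as an automated check. It is my own illustration (not taken from the Principles site), and the metric source and threshold are placeholders:

```python
# A steady-state hypothesis expressed as a testable check. fetch_success_rate()
# is a placeholder for a query against whatever monitoring backend you use.

def fetch_success_rate(window_minutes: int = 5) -> float:
    """Placeholder: fraction of successful requests over the last window."""
    return 0.999  # a real implementation would query your metrics system

STEADY_STATE_THRESHOLD = 0.995  # hypothesis: at least 99.5% of requests succeed

def steady_state_holds() -> bool:
    """An experiment only starts, and only passes, while this stays true."""
    return fetch_success_rate() >= STEADY_STATE_THRESHOLD

if __name__ == "__main__":
    print("steady state holds:", steady_state_holds())
```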


Chaos engineering


This article is part of our "Chaos Engineering" series.

The context

Today's application architectures are moving further and further away from monolithic structures, towards architectures based on the composition of services and structured as distributed systems, notably through the use of microservices.

Applications built on these architectures provide functionality that comes from the interaction of their components and from the proper collaboration of all of those components.

These architectures can comprise hundreds of components, which brings application-management challenges increasingly tied to distributed systems, and such systems acquire properties similar to those of complex systems.

Mickael Ruau's insight:

 

A complex system can be defined in several ways depending on the lens through which it is observed; we will keep the following essential properties, useful for what follows:

  • it is made up of a large number of elements interacting with one another, simultaneously
  • the behavior of a complex system is very hard to model, even with perfect knowledge of each of its elements; the behavior is emergent, because it arises from the interactions between the components
  • the action of one component can affect its own state, the state of other components and, by propagation, the overall state of the system
  • knowing part of the system is not enough to determine the overall state of the system

The functionality an application provides through the interaction of components inside a complex system is therefore systemic, and depends on the correct operation and coordination of the various components.

Unfortunately, the slightest fault can have serious consequences for the functionality of these systems (see the properties above), and it is very difficult, if not impossible, to model all of the consequences that can emerge from the failure of one component (cascading failures, bottlenecks) and/or from the orchestration of several components ("retry storms").
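
To make the notion of a retry storm concrete, here is a small, purely illustrative simulation (not from the article): when a dependency starts failing and every caller retries immediately, the struggling dependency receives a multiple of the nominal load.

```python
import random

# Illustrative only: N clients each retry up to R times against a dependency
# that is temporarily failing, so the dependency sees far more calls than
# the original demand and degrades even further.

def calls_received(clients: int, retries: int, failure_rate: float) -> int:
    """Return how many calls the dependency receives in one round."""
    calls = 0
    for _ in range(clients):
        for _attempt in range(1 + retries):
            calls += 1
            if random.random() > failure_rate:  # call succeeded, stop retrying
                break
    return calls

if __name__ == "__main__":
    random.seed(0)
    # 1000 clients, 3 immediate retries, dependency failing 90% of the time:
    # the dependency now receives roughly 3-4x the nominal 1000 calls.
    print(calls_received(clients=1000, retries=3, failure_rate=0.9))
```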

Existing tests (unit, integration, technical) verify that components work correctly in isolation or in simple integrations, but they remain very limited for testing the robustness of a complex system at real scale, because they are deterministic.

Using an environment other than the real one can also introduce biases that distort the observations and the ability to transpose them to reality.

Finally, we will assume that observing the real system has no significant impact on the behavior of that system.


Chaos Monkey — Wikipédia


The Chaos Monkey concept was invented in 2011 by Netflix to test the resilience of its IT infrastructure. The purpose of this tool is to simulate failures in the real environment and verify that the IT system keeps working.
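
As a rough sketch of the idea only (this is not Netflix's actual implementation), a Chaos-Monkey-style tool boils down to periodically picking a random instance from a group and terminating it; terminate_instance below is a stand-in for a real cloud provider API call:

```python
import random

def terminate_instance(instance_id: str) -> None:
    """Placeholder for a real cloud API call (e.g. your provider's SDK)."""
    print(f"terminating {instance_id}")

def unleash_monkey(instances: list[str], probability: float = 0.2) -> None:
    """With some probability, terminate one randomly chosen instance of the group."""
    if instances and random.random() < probability:
        terminate_instance(random.choice(instances))

if __name__ == "__main__":
    # Typically run on a schedule, during business hours, so people can react.
    unleash_monkey(["web-1", "web-2", "web-3"])
```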


OUI Talk | A year of sharing about Chaos Engineering at @OUI.sncf

Every month, an average of 16 million unique visitors use the French site OUI.sncf, and you turn out in huge numbers for special operations such as ticket-sale openings! To prevent the slightest hiccup on our site and offer you the best browsing experience, we apply the discipline of Chaos Engineering.

Chaos Engineering, or the ultimate stress test for applications and infrastructure


Beyond giving developers a better understanding of the work of operations staff, this approach has allowed OUI.sncf to strengthen the resilience of its production infrastructure at several levels. "We reproduced an outage that had occurred five years earlier, an outage we nicknamed Irma," notes Benjamin Gakic. "Five years ago it took us an hour to detect and resolve it. This year it was handled in under 10 minutes. In terms of the business volume at stake, that is very significant." Another illustration of the improved resilience of the site's production concerns its dependence on a partner considered unstable. "With every incident at this provider, we were losing sessions. Putting a circuit breaker in place allowed us to reduce the impact of that instability by a factor of 40. Again, that is very significant," summarizes the dependability expert.
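
The circuit breaker mentioned above is a generic resilience pattern rather than anything specific to OUI.sncf; the sketch below is my own minimal illustration of it: after too many consecutive failures the breaker "opens" and calls to the unstable partner fail fast instead of piling up, then a trial call is allowed after a cool-down period.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open after N consecutive failures,
    then half-open (one trial call) after a cool-down delay."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker opened, or None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0  # a success closes the breaker again
        return result
```

Wrapping calls to the flaky partner in breaker.call(...) is what turns a provider incident into fast, local failures rather than a pile-up of lost sessions.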

Chaos Monkey is now gradually becoming part of the site's routine. While everyone was warned before the minions' first raid on the Voyages-SNCF information system, that is no longer the case today: "We now run a Chaos Monkey every quarter, and there is no longer any difference between a production incident and a Chaos Monkey," says Christophe Rochefolle. "Operations staff are no longer told that a Chaos Monkey is in progress. There is, however, a trace showing that the Chaos Monkey has been launched. But we do not run them on days when we take in 20 million euros of revenue!"


Chaos engineering

Building confidence in system behavior through experiments
Mickael Ruau's insight:

This is the full text of "Chaos Engineering," an ebook by Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri. The ebook is also available for download.


Voyages-sncf bets on chaos engineering to put its infrastructure to the test

Conceptualisée à l'origine par Netflix, la méthode consiste à introduire régulièrement des pannes volontaires dans les systèmes informatiques pour tester et valider leur robustesse.

The InfoQ eMag - Real World Chaos Engineering

Creating a successful chaos practice isn’t purely an engineering problem. As with many aspects of cloud native computing, it requires buy-in across the organisation. In this eMag we’ve pulled together a variety of case studies to show mechanisms by which you can do so, even in tightly regulated industries where you might face considerable opposition.

What to Make of SRE's Golden Signals

The golden signals of SRE and monitoring are essential for any team looking to build reliable services and improve system visibility. SRE teams use the golden signals for basic service and infrastructure monitoring and alerting, then improve from there.
Mickael Ruau's insight:

Proactive SRE Goes Past the Golden Signals

While monitoring the golden signals is a great start to understanding incidents in your service, SRE teams of the future are proactively learning more about their system through numerous additional techniques. By running organized tests in both staging and production, SRE teams can actively learn about their systems and use the information to build reliability into their services.

  • Chaos Engineering: Chaos engineering is a discipline used by teams to experiment on their systems to proactively detect failure points or potential weaknesses. By actively injecting chaos into your service, you can see exactly how the system responds to different circumstances.

  • Game Days: While chaos engineering is geared toward understanding your system, game days can be used to understand your people. Game days are used to test the resiliency of your team when it comes to incident response and remediation. You can use the learnings from game days to develop more efficient processes or determine the need for new tools that make people more efficient.

  • Synthetic Monitoring: The use of synthetic monitoring allows teams to create artificial users and simulate user behavior through a service. You can determine specific artificial behavior flows in order to learn more about how your system responds under pressure. Synthetic monitoring is an excellent method for granularly testing and determining the reliability of specific services within your greater system.
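
As a small illustration of the last point, a synthetic check is essentially a scripted artificial user: hit an endpoint on a schedule and record status and latency. The sketch below is generic, and the URL is a placeholder:

```python
import time
import urllib.request

def synthetic_check(url: str, timeout: float = 5.0) -> dict:
    """Run one artificial user request and record its status and latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except Exception:
        status = None  # treat network errors as a failed check
    return {"url": url, "status": status, "latency_s": time.monotonic() - start}

if __name__ == "__main__":
    # In practice the URL would exercise a key user journey, not just a homepage.
    print(synthetic_check("https://example.com/"))
```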

 

SRE’s golden signals need to be monitored by any team looking to visibly measure the health of a system. But, knowing the health and general reliability of a system is far different from taking actions to improve a system’s reliability. In today’s ecosystem of highly distributed systems and rapid deployment, SRE teams have their work cut out for them. But, the golden signals of monitoring and SRE can help you achieve a healthy starting point from which you can constantly improve to become more proactive with SRE.


Chaos Engineering: managing complexity by breaking things

Chaos Engineering is a software engineering methodology or philosophy that asserts the importance of stress testing your software infrastructure. First developed by Netflix in 2011, it has grown to prominence in more recent years with the wider adoption of cloud and microservices.
Mickael Ruau's insight:

The principles of Chaos Engineering are documented here. This is effectively its ‘manifesto’. There’s a lot in there worth reading, but here are the 5 principles that any sort of testing or experimentation should aspire to:

  • Base your testing hypothesis on steady state behavior. Consider your infrastructure holistically, making individual parts work is important but not the priority.
  • Simulate a variety of real-world events. This could be hardware or software failures, or simply external changes like spikes in traffic. What’s important is that they’re all unpredictable.
  • Test in production. Your tests should be authentic.
  • Automate! Testing can be laborious and require a lot of manual work. Make use of automation tools to run many different tests without taking up too much of your time.
  • Don’t cause unnecessary pain. While it’s important that your stress-tests are authentic, the impact must be contained and minimized by the engineer.
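
A minimal sketch of how those five points can fit together in an automated experiment; every function here is a hypothetical placeholder rather than the API of any real chaos tool:

```python
# Hypothetical experiment runner: steady-state hypothesis, a real-world-style
# fault, automation, and a contained blast radius with guaranteed cleanup.

def steady_state_ok() -> bool:
    """Placeholder: query monitoring and compare against the hypothesis."""
    return True

def inject_fault(target: str) -> None:
    """Placeholder: e.g. add latency to, or kill, one instance of `target`."""
    print(f"injecting fault into {target}")

def rollback(target: str) -> None:
    """Placeholder: stop the fault injection immediately."""
    print(f"rolling back fault on {target}")

def run_experiment(target: str) -> bool:
    if not steady_state_ok():        # never start from an already degraded system
        return False
    inject_fault(target)             # blast radius limited to a single target
    try:
        healthy = steady_state_ok()  # did the hypothesis survive the fault?
    finally:
        rollback(target)             # always clean up, even if the check fails
    return healthy

if __name__ == "__main__":
    print("hypothesis held:", run_experiment("checkout-canary"))
```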

ChAP: Chaos Automation Platform – Netflix TechBlog

We are excited to announce ChAP, the newest member of our chaos tooling family! Chaos Monkey and Chaos Kong ensure our resilience to instance and regional failures, but threats to availability can also come from disruptions at the microservice level. FIT was built to inject microservice-level failure in production, and ChAP was built to overcome the limitations of FIT so we can increase the safety, cadence, and breadth of experimentation.

At a high level, the platform interrogates the deployment pipeline for a user-specified service. It then launches experiment and control clusters of that service, and routes a small amount of traffic to each. A specified FIT scenario is applied to the experimental group, and the results of the experiment are reported to the service owner.
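
The heart of that experiment/control design is a canary-style comparison of the same metric across the two small clusters. The sketch below is only a schematic illustration of that idea, with made-up numbers, not ChAP's actual logic:

```python
def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)

def looks_degraded(control: list[float], experiment: list[float],
                   tolerance: float = 0.05) -> bool:
    """Flag the experiment cluster if its average success rate falls more than
    `tolerance` below the control cluster's average."""
    return mean(experiment) < mean(control) - tolerance

if __name__ == "__main__":
    # Per-minute success rates sampled from each cluster (illustrative values).
    control_samples = [0.999, 0.998, 0.999]
    experiment_samples = [0.93, 0.91, 0.94]  # the injected fault is hurting users
    if looks_degraded(control_samples, experiment_samples):
        print("abort the experiment and report the regression to the service owner")
```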

Mickael Ruau's insight:

With ChAP, we have safely identified mistuned retry policies, CPU-intensive fallbacks, and unexpected interactions between circuit breakers and load balancers.

Learn more about Chaos

 
“Chaos Engineering,” authored by the Netflix Chaos Team.

We wrote the book on Chaos Engineering, available for free for a limited time from O’Reilly.

Aaron Blohowiak spoke at Velocity 2017 San Jose, on the topic of Precision Chaos.

Nora Jones also presented a talk at Velocity San Jose about our experiences with adoption of chaos tools.

Join the Chaos Community Google group to participate in the discussion, keep up to date on the evolution of the industry, and get announcements about Chaos Community Day.


Chaos Engineering as a tool for validating observability

How Chaos Engineering allowed OUI.sncf to put its monitoring to the test and assess its level of observability.

The InfoQ eMag: Chaos Engineering

This eMag will inspire you to dig deeper into your systems, question your mental models, and use chaos engineering to build confidence in your system’s behaviors under turbulent conditions.
Mickael Ruau's insight:

At Netflix, we’ve been embracing chaos engineering since Chaos Monkey was born in 2011. It has gone through several iterations and tools that eventually evolved into the Failure Injection Testing (FIT) platform and, ultimately, ChAP (a platform for safely automating and running chaos experiments in production) through the efforts of many amazing engineers. We’ve taken the opportunity to outline why this has been so beneficial for the business in a separate IEEE article titled “The Business Case for Chaos Engineering” and a free e-book from O’Reilly here.


Inside Azure Search: Chaos Engineering | Blog | Microsoft Azure

As systems scale, we expect nodes to fail ungracefully in random and unexpected ways, networks to experience sudden partitions, and messages to be dropped at any time. Azure Search uses chaos engineering to help solve this problem.
Mickael Ruau's insight:

Chaos Engineering in Action

To illustrate how this works, here’s a recent example of a failure that was driven from extreme chaos to low chaos using this model.

  • Extreme chaos: Initial discovery. A service emitted an unexpected low-priority alert. Upon further investigation, it ended up being a downstream error signifying that at least one required background task was not running. Initial classification put this error at extreme chaos: it left the cluster in an unknown state and didn’t alert correctly.
  • High chaos: Mitigation. At this point, we were not able to automate the failure since we were not aware of the root cause. Instead, we worked to drive the failure down to a level of high chaos. We identified a manual fix that worked, but impacted availability for services without replicas. We tuned our alerting to the correct level so that the engineer on call could perform this manual fix any time the error occurred again. Two unlucky engineers were woken up by high-priority alerts to do so before the failure was fixed.
  • Automation. Once we were sure that our customer services were safe, we focused our efforts on reproducing the error. The root cause ended up being unexpected stalls when making external calls that were impacting unrelated components. Finding this required the addition of fault injection to cause artificial latency in the component making the calls.
  • Low chaos: Fix and verification. After the root cause was identified, the fix was straightforward. We decoupled the component experiencing latency in calls from the rest of the system so that any stalls would only affect that component. Some redundancy was introduced into this component so that its operation was no longer impacted by latency, or even isolated stalls, only prolonged and repeated stalls (a much rarer occurrence). We were able to use the automated chaos operation to prove that the original failure was now handled smoothly. The failure that used to wake up on-call engineers to perform a potentially availability-impacting fix could now be downgraded to a low chaos failure that our system could recover from with no noise at all. At this point, the automated failure could be handed off to the Search Chaos Monkey to run regularly as a low chaos operation and continually verify our system’s ability to handle this once-serious error.
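
The "artificial latency" step is worth making concrete. One generic way to do it in application code (a sketch of the technique, not Azure Search's implementation) is a wrapper that randomly stalls outbound calls so that the rest of the system can be observed under that stress:

```python
import functools
import random
import time

def with_injected_latency(max_delay_s: float = 2.0, probability: float = 0.1):
    """Decorator that randomly stalls a call, reproducing a slow external dependency."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(random.uniform(0, max_delay_s))  # simulated stall
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@with_injected_latency(max_delay_s=1.0, probability=0.5)
def call_external_service() -> str:
    """Placeholder for the component that makes external calls."""
    return "ok"
```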

Chaos Engineering and the Cloud

At Azure Search, chaos engineering has proven to be a very useful model to follow when developing a reliable and fault tolerant cloud service.  Our Search Chaos Monkey has been instrumental in providing a deterministic framework for finding exceptional failures and driving them to resolution as low-impact errors with planned, automated solutions.


Chaos engineering - Wikipedia


In software development, a given software system's ability to tolerate failures while still ensuring adequate quality of service (often generalized as resiliency) is typically specified as a requirement. However, development teams often fail to meet this requirement due to factors such as short deadlines or lack of knowledge of the field.


Chaos Engineering: what is it?

How close is your system to the edge of the precipice, beyond which it sinks into chaos? That is the question the discipline of Chaos Engineering tries to answer.

dastergon/awesome-chaos-engineering: A curated list of Chaos Engineering resources.

A curated list of Chaos Engineering resources. Contribute to dastergon/awesome-chaos-engineering development by creating an account on GitHub.