Devops for Growth
For Product Owners/Product Managers and Scrum Teams: Growth Hacking, Devops, Agile, Lean for IT, Lean Startup, customer centric, software quality...
Curated by Mickael Ruau
Selected tag: 'chaos engineering'

Putting degraded mode into practice: the example of a sterilization service

IT tools are increasingly present in sterilization units. Their use simplifies and secures the process by ensuring traceability of the various steps. To cope with a malfunction, a degraded-mode procedure must be written so that the activity can be maintained. This work presents a "simulation" exercise of our degraded-mode procedure. The first step was to formalize, in a procedure, how the activity would be organized in the event of an IT failure. Then, without the staff being informed, the exercise was carried out. At the end of the exercise, a debriefing with all participants made it possible to apply the necessary corrections. Simulating a failure of our IT tool highlighted certain difficulties in applying the initial procedure. Some elements proved hard to implement under real conditions. Staff feedback therefore led to a further update of the document. The exercise made it possible to verify that our organization in the event of an IT failure was well known and correctly applied. The positive feedback from staff confirms the value of practising this type of simulation regularly.

Gremlin Releases State of Chaos Engineering 2021 Report

Gremlin released their State of Chaos Engineering 2021 report based on a community survey and their own product data. The key findings include a positive correlation between running chaos engineering...

Chaos Engineering in French — How to convince your boss to take the plunge into…

You could start by announcing that you are going to break everything in production and that it will be fun! That is the heavy-handed ("Gros Sabots") approach. Everyone will see you coming from a mile away, but apart from making noise…

Chaos Engineering: the history, principles, and practice


Principles of Chaos Engineering

Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.

Mickael Ruau's insight:

ADVANCED PRINCIPLES

The following principles describe an ideal application of Chaos Engineering, applied to the processes of experimentation described above.  The degree to which these principles are pursued strongly correlates to the confidence we can have in a distributed system at scale.
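
To make the idea of experimenting around a steady state a little more concrete, here is a minimal sketch of a steady-state hypothesis expressed as an automated check. It is my own illustration (not taken from the Principles site), and the metric source and threshold are placeholders:

```python
# A steady-state hypothesis expressed as a testable check. fetch_success_rate()
# is a placeholder for a query against whatever monitoring backend you use.

def fetch_success_rate(window_minutes: int = 5) -> float:
    """Placeholder: fraction of successful requests over the last window."""
    return 0.999  # a real implementation would query your metrics system

STEADY_STATE_THRESHOLD = 0.995  # hypothesis: at least 99.5% of requests succeed

def steady_state_holds() -> bool:
    """An experiment only starts, and only passes, while this stays true."""
    return fetch_success_rate() >= STEADY_STATE_THRESHOLD

if __name__ == "__main__":
    print("steady state holds:", steady_state_holds())
```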


Chaos engineering


This article is part of our "Chaos Engineering" series.

The context

Today's application architectures are moving further and further away from monolithic structures, towards architectures based on the composition of services and structured as distributed systems, notably through the use of microservices.

Applications built on these architectures provide functionality that comes from the interaction of their components and from the proper collaboration of all of those components.

These architectures can comprise hundreds of components, which brings application-management challenges increasingly tied to distributed systems, and such systems acquire properties similar to those of complex systems.

Mickael Ruau's insight:

 

A complex system can be defined in several ways depending on the lens through which it is observed; we will keep the following essential properties, useful for what follows:

  • it is made up of a large number of elements interacting with one another, simultaneously
  • the behavior of a complex system is very hard to model, even with perfect knowledge of each of its elements; the behavior is emergent, because it arises from the interactions between the components
  • the action of one component can affect its own state, the state of other components and, by propagation, the overall state of the system
  • knowing part of the system is not enough to determine the overall state of the system

The functionality an application provides through the interaction of components inside a complex system is therefore systemic, and depends on the correct operation and coordination of the various components.

Unfortunately, the slightest fault can have serious consequences for the functionality of these systems (see the properties above), and it is very difficult, if not impossible, to model all of the consequences that can emerge from the failure of one component (cascading failures, bottlenecks) and/or from the orchestration of several components ("retry storms").
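
To make the notion of a retry storm concrete, here is a small, purely illustrative simulation (not from the article): when a dependency starts failing and every caller retries immediately, the struggling dependency receives a multiple of the nominal load.

```python
import random

# Illustrative only: N clients each retry up to R times against a dependency
# that is temporarily failing, so the dependency sees far more calls than
# the original demand and degrades even further.

def calls_received(clients: int, retries: int, failure_rate: float) -> int:
    """Return how many calls the dependency receives in one round."""
    calls = 0
    for _ in range(clients):
        for _attempt in range(1 + retries):
            calls += 1
            if random.random() > failure_rate:  # call succeeded, stop retrying
                break
    return calls

if __name__ == "__main__":
    random.seed(0)
    # 1000 clients, 3 immediate retries, dependency failing 90% of the time:
    # the dependency now receives roughly 3-4x the nominal 1000 calls.
    print(calls_received(clients=1000, retries=3, failure_rate=0.9))
```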

Existing tests (unit, integration, technical) verify that components work correctly in isolation or in simple integrations, but they remain very limited for testing the robustness of a complex system at real scale, because they are deterministic.

Using an environment other than the real one can also introduce biases that distort the observations and the ability to transpose them to reality.

Finally, we will assume that observing the real system has no significant impact on the behavior of that system.


Chaos Monkey — Wikipédia


The Chaos Monkey concept was invented in 2011 by Netflix to test the resilience of its IT infrastructure. The purpose of this tool is to simulate failures in the real environment and verify that the IT system keeps working.
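
As a rough sketch of the idea only (this is not Netflix's actual implementation), a Chaos-Monkey-style tool boils down to periodically picking a random instance from a group and terminating it; terminate_instance below is a stand-in for a real cloud provider API call:

```python
import random

def terminate_instance(instance_id: str) -> None:
    """Placeholder for a real cloud API call (e.g. your provider's SDK)."""
    print(f"terminating {instance_id}")

def unleash_monkey(instances: list[str], probability: float = 0.2) -> None:
    """With some probability, terminate one randomly chosen instance of the group."""
    if instances and random.random() < probability:
        terminate_instance(random.choice(instances))

if __name__ == "__main__":
    # Typically run on a schedule, during business hours, so people can react.
    unleash_monkey(["web-1", "web-2", "web-3"])
```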


OUI Talk | A year of sharing about Chaos Engineering at @OUI.sncf

Every month, an average of 16 million unique visitors use the French site OUI.sncf, and you turn out in huge numbers for special operations such as ticket-sale openings! To prevent the slightest hiccup on our site and offer you the best browsing experience, we apply the discipline of Chaos Engineering.

Chaos Engineering, or the ultimate stress test for applications and infrastructure


Beyond giving developers a better understanding of the work of operations staff, this approach has allowed OUI.sncf to strengthen the resilience of its production infrastructure at several levels. "We reproduced an outage that had occurred five years earlier, an outage we nicknamed Irma," notes Benjamin Gakic. "Five years ago it took us an hour to detect and resolve it. This year it was handled in under 10 minutes. In terms of the business volume at stake, that is very significant." Another illustration of the improved resilience of the site's production concerns its dependence on a partner considered unstable. "With every incident at this provider, we were losing sessions. Putting a circuit breaker in place allowed us to reduce the impact of that instability by a factor of 40. Again, that is very significant," summarizes the dependability expert.
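
The circuit breaker mentioned above is a generic resilience pattern rather than anything specific to OUI.sncf; the sketch below is my own minimal illustration of it: after too many consecutive failures the breaker "opens" and calls to the unstable partner fail fast instead of piling up, then a trial call is allowed after a cool-down period.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open after N consecutive failures,
    then half-open (one trial call) after a cool-down delay."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker opened, or None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0  # a success closes the breaker again
        return result
```

Wrapping calls to the flaky partner in breaker.call(...) is what turns a provider incident into fast, local failures rather than a pile-up of lost sessions.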

Chaos Monkey is now gradually becoming part of the site's routine. While everyone was warned before the minions' first raid on the Voyages-SNCF information system, that is no longer the case today: "We now run a Chaos Monkey every quarter, and there is no longer any difference between a production incident and a Chaos Monkey," says Christophe Rochefolle. "Operations staff are no longer told that a Chaos Monkey is in progress. There is, however, a trace showing that the Chaos Monkey has been launched. But we do not run them on days when we take in 20 million euros of revenue!"


Chaos engineering

Building confidence in system behavior through experiments
Mickael Ruau's insight:

This is the full text of "Chaos Engineering," an ebook by Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri. The ebook is also available for download.


Voyages-sncf bets on chaos engineering to put its infrastructure to the test

Conceptualisée à l'origine par Netflix, la méthode consiste à introduire régulièrement des pannes volontaires dans les systèmes informatiques pour tester et valider leur robustesse.

The InfoQ eMag - Real World Chaos Engineering

Creating a successful chaos practice isn’t purely an engineering problem. As with many aspects of cloud native computing, it requires buy-in across the organisation. In this eMag we’ve pulled together a variety of case studies to show mechanisms by which you can do so, even in tightly regulated industries where you might face considerable opposition.

What to Make of SRE's Golden Signals

The golden signals of SRE and monitoring are essential for any team looking to build reliable services and improve system visibility. SRE teams use the golden signals for basic service and infrastructure monitoring and alerting, then improve from there.
Mickael Ruau's insight:

Proactive SRE Goes Past the Golden Signals

While monitoring the golden signals is a great start to understanding incidents in your service, SRE teams of the future are proactively learning more about their system through numerous additional techniques. By running organized tests in both staging and production, SRE teams can actively learn about their systems and use the information to build reliability into their services.

  • Chaos Engineering: Chaos engineering is a discipline used by teams to experiment on their systems to proactively detect failure points or potential weaknesses. By actively injecting chaos into your service, you can see exactly how the system responds to different circumstances.

  • Game Days: While chaos engineering is geared toward understanding your system, game days can be used to understand your people. Game days are used to test the resiliency of your team when it comes to incident response and remediation. You can use the learnings from game days to develop more efficient processes or determine the need for new tools that make people more efficient.

  • Synthetic Monitoring: The use of synthetic monitoring allows teams to create artificial users and simulate user behavior through a service. You can determine specific artificial behavior flows in order to learn more about how your system responds under pressure. Synthetic monitoring is an excellent method for granularly testing and determining the reliability of specific services within your greater system.
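
As a small illustration of the last point, a synthetic check is essentially a scripted artificial user: hit an endpoint on a schedule and record status and latency. The sketch below is generic, and the URL is a placeholder:

```python
import time
import urllib.request

def synthetic_check(url: str, timeout: float = 5.0) -> dict:
    """Run one artificial user request and record its status and latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except Exception:
        status = None  # treat network errors as a failed check
    return {"url": url, "status": status, "latency_s": time.monotonic() - start}

if __name__ == "__main__":
    # In practice the URL would exercise a key user journey, not just a homepage.
    print(synthetic_check("https://example.com/"))
```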

 

SRE’s golden signals need to be monitored by any team looking to visibly measure the health of a system. But, knowing the health and general reliability of a system is far different from taking actions to improve a system’s reliability. In today’s ecosystem of highly distributed systems and rapid deployment, SRE teams have their work cut out for them. But, the golden signals of monitoring and SRE can help you achieve a healthy starting point from which you can constantly improve to become more proactive with SRE.


Chaos Engineering: managing complexity by breaking things

Chaos Engineering is a software engineering methodology or philosophy that asserts the importance of stress testing your software infrastructure. First developed by Netflix in 2011, it has grown to prominence in more recent years with the wider adoption of cloud and microservices.
Mickael Ruau's insight:

The principles of Chaos Engineering are documented here. This is effectively its ‘manifesto’. There’s a lot in there worth reading, but here are the 5 principles that any sort of testing or experimentation should aspire to:

  • Base your testing hypothesis on steady state behavior. Consider your infrastructure holistically, making individual parts work is important but not the priority.
  • Simulate a variety of real-world events. This could be hardware or software failures, or simply external changes like spikes in traffic. What’s important is that they’re all unpredictable.
  • Test in production. Your tests should be authentic.
  • Automate! Testing can be laborious and require a lot of manual work. Make use of automation tools to run many different tests without taking up too much of your time.
  • Don’t cause unnecessary pain. While it’s important that your stress-tests are authentic, the impact must be contained and minimized by the engineer.
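
A minimal sketch of how those five points can fit together in an automated experiment; every function here is a hypothetical placeholder rather than the API of any real chaos tool:

```python
# Hypothetical experiment runner: steady-state hypothesis, a real-world-style
# fault, automation, and a contained blast radius with guaranteed cleanup.

def steady_state_ok() -> bool:
    """Placeholder: query monitoring and compare against the hypothesis."""
    return True

def inject_fault(target: str) -> None:
    """Placeholder: e.g. add latency to, or kill, one instance of `target`."""
    print(f"injecting fault into {target}")

def rollback(target: str) -> None:
    """Placeholder: stop the fault injection immediately."""
    print(f"rolling back fault on {target}")

def run_experiment(target: str) -> bool:
    if not steady_state_ok():        # never start from an already degraded system
        return False
    inject_fault(target)             # blast radius limited to a single target
    try:
        healthy = steady_state_ok()  # did the hypothesis survive the fault?
    finally:
        rollback(target)             # always clean up, even if the check fails
    return healthy

if __name__ == "__main__":
    print("hypothesis held:", run_experiment("checkout-canary"))
```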

ChAP: Chaos Automation Platform – Netflix TechBlog

We are excited to announce ChAP, the newest member of our chaos tooling family! Chaos Monkey and Chaos Kong ensure our resilience to instance and regional failures, but threats to availability can also come from disruptions at the microservice level. FIT was built to inject microservice-level failure in production, and ChAP was built to overcome the limitations of FIT so we can increase the safety, cadence, and breadth of experimentation.

At a high level, the platform interrogates the deployment pipeline for a user-specified service. It then launches experiment and control clusters of that service, and routes a small amount of traffic to each. A specified FIT scenario is applied to the experimental group, and the results of the experiment are reported to the service owner.
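
The heart of that experiment/control design is a canary-style comparison of the same metric across the two small clusters. The sketch below is only a schematic illustration of that idea, with made-up numbers, not ChAP's actual logic:

```python
def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)

def looks_degraded(control: list[float], experiment: list[float],
                   tolerance: float = 0.05) -> bool:
    """Flag the experiment cluster if its average success rate falls more than
    `tolerance` below the control cluster's average."""
    return mean(experiment) < mean(control) - tolerance

if __name__ == "__main__":
    # Per-minute success rates sampled from each cluster (illustrative values).
    control_samples = [0.999, 0.998, 0.999]
    experiment_samples = [0.93, 0.91, 0.94]  # the injected fault is hurting users
    if looks_degraded(control_samples, experiment_samples):
        print("abort the experiment and report the regression to the service owner")
```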

Mickael Ruau's insight:

With ChAP, we have safely identified mistuned retry policies, CPU-intensive fallbacks, and unexpected interactions between circuit breakers and load balancers.

Learn more about Chaos

 
“Chaos Engineering,” authored by the Netflix Chaos Team.

We wrote the book on Chaos Engineering, available for free for a limited time from O’Reilly.

Aaron Blohowiak spoke at Velocity 2017 San Jose, on the topic of Precision Chaos.

Nora Jones also presented a talk at Velocity San Jose about our experiences with adoption of chaos tools.

Join the Chaos Community Google group to participate in the discussion, keep up to date on the evolution of the industry, and get announcements about Chaos Community Day.


Chaos Engineering as a tool for validating observability

How Chaos Engineering allowed OUI.sncf to put its monitoring to the test and assess its level of observability.

The InfoQ eMag: Chaos Engineering

This eMag will inspire you to dig deeper into your systems, question your mental models, and use chaos engineering to build confidence in your system’s behaviors under turbulent conditions.
Mickael Ruau's insight:

At Netflix, we’ve been embracing chaos engineering since Chaos Monkey was born in 2011. It has gone through several iterations and tools that eventually evolved into the Failure Injection Testing (FIT) platform and, ultimately, ChAP (a platform for safely automating and running chaos experiments in production) through the efforts of many amazing engineers. We’ve taken the opportunity to outline why this has been so beneficial for the business in a separate IEEE article titled “The Business Case for Chaos Engineering” and a free e-book from O’Reilly here.


Inside Azure Search: Chaos Engineering | Blog | Microsoft Azure

As systems scale, we expect nodes to fail ungracefully in random and unexpected ways, networks to experience sudden partitions, and messages to be dropped at any time. Azure Search uses chaos engineering to help solve this problem.
Mickael Ruau's insight:

Chaos Engineering in Action

To illustrate how this works, here’s a recent example of a failure that was driven from extreme chaos to low chaos using this model.

  • Extreme chaos: Initial discovery. A service emitted an unexpected low-priority alert. Upon further investigation, it ended up being a downstream error signifying that at least one required background task was not running. Initial classification put this error at extreme chaos: it left the cluster in an unknown state and didn’t alert correctly.
  • High chaos: Mitigation. At this point, we were not able to automate the failure since we were not aware of the root cause. Instead, we worked to drive the failure down to a level of high chaos. We identified a manual fix that worked, but impacted availability for services without replicas. We tuned our alerting to the correct level so that the engineer on call could perform this manual fix any time the error occurred again. Two unlucky engineers were woken up by high-priority alerts to do so before the failure was fixed.
  • Automation. Once we were sure that our customer services were safe, we focused our efforts on reproducing the error. The root cause ended up being unexpected stalls when making external calls that were impacting unrelated components. Finding this required the addition of fault injection to cause artificial latency in the component making the calls.
  • Low chaos: Fix and verification. After the root cause was identified, the fix was straightforward. We decoupled the component experiencing latency in calls from the rest of the system so that any stalls would only affect that component. Some redundancy was introduced into this component so that its operation was no longer impacted by latency, or even isolated stalls, only prolonged and repeated stalls (a much rarer occurrence). We were able to use the automated chaos operation to prove that the original failure was now handled smoothly. The failure that used to wake up on-call engineers to perform a potentially availability-impacting fix could now be downgraded to a low chaos failure that our system could recover from with no noise at all. At this point, the automated failure could be handed off to the Search Chaos Monkey to run regularly as a low chaos operation and continually verify our system’s ability to handle this once-serious error.
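
The "artificial latency" step is worth making concrete. One generic way to do it in application code (a sketch of the technique, not Azure Search's implementation) is a wrapper that randomly stalls outbound calls so that the rest of the system can be observed under that stress:

```python
import functools
import random
import time

def with_injected_latency(max_delay_s: float = 2.0, probability: float = 0.1):
    """Decorator that randomly stalls a call, reproducing a slow external dependency."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(random.uniform(0, max_delay_s))  # simulated stall
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@with_injected_latency(max_delay_s=1.0, probability=0.5)
def call_external_service() -> str:
    """Placeholder for the component that makes external calls."""
    return "ok"
```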

Chaos Engineering and the Cloud

At Azure Search, chaos engineering has proven to be a very useful model to follow when developing a reliable and fault tolerant cloud service.  Our Search Chaos Monkey has been instrumental in providing a deterministic framework for finding exceptional failures and driving them to resolution as low-impact errors with planned, automated solutions.


Chaos engineering - Wikipedia


In software development, a given software system's ability to tolerate failures while still ensuring adequate quality of service (often generalized as resiliency) is typically specified as a requirement. However, development teams often fail to meet this requirement due to factors such as short deadlines or lack of knowledge of the field.


Chaos Engineering: what is it?

How close is your system to the edge of the precipice, beyond which it sinks into chaos? That is the question the discipline of Chaos Engineering tries to answer.

dastergon/awesome-chaos-engineering: A curated list of Chaos Engineering resources.

A curated list of Chaos Engineering resources. Contribute to dastergon/awesome-chaos-engineering development by creating an account on GitHub.