Resilience Engineering and How It Can Change DevOps for the Better

Grigor Khachatryan

Introduction

When a software development team first begins the journey of developing a piece of software, what can and will go wrong is barely discussed. Rather, the goals are the primary driver, and so they should be. But behind the confidence of success, every team member knows that something at some point will go wrong and hopes it will not derail the whole project.

What if this impending sense of dread were used in a productive manner? Could the knowledge that something will go wrong be used to make the end product more resilient and, overall, a far superior product? Resilience engineering attempts to answer these questions positively.

We sat down with Grigor Khachatryan, a software development and infrastructure engineering veteran who has seen many development trends come and go and has extensive experience solving large-scale infrastructure dilemmas in growth-stage startups, and asked him what makes resilience engineering different from past trends and why it has staying power as a methodology.

The Lowdown

Rather than the adoption of a technology or framework, resilience engineering has come to be a shift in the culture of software development. Seeing how Google's Site Reliability Engineering (SRE) drastically improved user experiences on the Internet, software developers were quick to take on some of the lessons learned by Google while making them fit DevOps ideas better. Perhaps the biggest difference between SRE and resilience engineering is that the former relies on reacting appropriately when things break down, while the latter focuses on developing long-term response strategies. Grigor likes to look at resilience engineering in the following light:

"Resilience Engineering is all about building systems that can adapt and automatically take the best course of action when common issues occur. Any inadequacies found through testing are ironed out before the system can become truly resilient."

With the development of resilience engineering practices came the realisation that relying on frameworks can prevent large corporations, particularly those reliant on cloud computing, from suffering extended periods of downtime. Grigor notes of cloud computing:

"Scaling up and growing a business requires expanding servers and other technologies designed to handle large amounts of data. However, with ultra-high internet speeds and massive amounts of data generated by websites, extracting the right data can take a long time and a lot of money.

That's why more and more businesses choose to migrate their websites, data analytics, and other business details onto cloud services. These services are designed to allow fast data analytics and results with automation features designed to speed things up."

Cloud computing's emphasis on scaling and automation meant that resilience engineering frameworks would fit right at home. Preventing excessive time and money spent on data analytics, for example, requires a framework built to handle whatever can be thrown at it. But this raises the question: what should such a framework consist of to realise this goal?

Resilient Frameworks

For such a framework to meet the needs of modern development, at least three factors must be met: does it establish habits and decision trees, is it data driven, and can it engineer around reproducible incidents? It is wise, then, to look at each of these factors in isolation.

Establish Habits and Decision Trees

When things break, and they will, a solution needs to be found as a matter of priority. But what if the problem has been encountered in the past and an adequate solution was developed? Could that response be turned into a standardised method for dealing with similar problems in the future? By creating repeatable solutions to problems, developers can rely on what has worked in the past, which greatly removes the fear of failure from the equation. If such a solution is repeatable, those looking to solve future problems can focus on the details and tweak the solution to best meet current needs, as in the sketch below.
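To make this concrete, here is a minimal sketch of what encoding a past incident response as a repeatable decision tree might look like. The symptoms, checks, and remediation steps are illustrative assumptions, not drawn from any particular framework Grigor mentions:

```python
# Hypothetical sketch: a past incident response encoded as a small,
# repeatable decision tree. All symptom names and remediation steps
# below are illustrative assumptions.

RUNBOOK = {
    "high_latency": {
        "check": "Is the database connection pool exhausted?",
        "yes": "Recycle the pool and raise its maximum size by 25%.",
        "no": "Escalate to the on-call database engineer.",
    },
    "disk_full": {
        "check": "Are rotated logs older than 7 days present?",
        "yes": "Delete old rotated logs and re-check disk usage.",
        "no": "Provision additional storage and alert capacity planning.",
    },
}

def resolve(symptom: str, check_passed: bool) -> str:
    """Return the standardised next step for a known symptom."""
    entry = RUNBOOK.get(symptom)
    if entry is None:
        # An unknown symptom becomes a new entry once it is solved.
        return "Unknown symptom: open a new incident and document the fix."
    return entry["yes"] if check_passed else entry["no"]

if __name__ == "__main__":
    print(resolve("high_latency", check_passed=True))
```

Because the decision tree is just data, a team can tweak individual branches to fit the current incident without abandoning the proven structure around them.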

Data Driven

Resilience engineering relies on data; what's more, it relies on being able to access relevant data when needed. This ultimately means that data across the development cycle and delivery chain needs to be captured. This is important because, if an issue arises at a certain point in the development cycle or delivery chain, the solution can be found by rolling development back to the point where the issue was introduced.
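One simple way to picture this is a structured, timestamped event emitted at every stage of the delivery chain, so the trail can be queried when something breaks. The stage names and fields below are illustrative assumptions, not a specific tool's schema:

```python
# Hypothetical sketch: capture one structured event per stage of the
# delivery chain so an incident can be traced back to where it began.

import json
import time

def record_stage(stage: str, status: str, **details) -> dict:
    """Emit a structured, timestamped event for one pipeline stage."""
    event = {"stage": stage, "status": status, "timestamp": time.time(), **details}
    print(json.dumps(event))  # in practice, ship this to a log/metrics store
    return event

# A build passing through the chain leaves a queryable trail:
record_stage("build", "ok", commit="abc123")
record_stage("test", "ok", passed=412, failed=0)
record_stage("deploy", "failed", region="eu-west-1", error="timeout")
```

With events like these captured for every run, finding the point to roll back to becomes a query rather than guesswork.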

Engineering Around Reproducible Incidents

This factor is the one most favoured by developers, as it shows how resilience engineering comes together to build a product with strong foundations. Central to this factor is the idea that what is learned from previous incidents can be carried forward to drastically improve future products. Adding data and decision trees to this process means that solutions can be automated. This helps incidents look manageable rather than disastrous, as a team will have a playbook to look back on to quickly resolve an incident. The incident itself can then be turned into something that is reproducible, and an automated solution can be developed, as the following sketch illustrates.
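As a hedged example, suppose a past incident was a transient upstream timeout. The sketch below reproduces that failure on demand and pairs it with an automated remediation (retry with exponential backoff); the exception and service names are invented for illustration:

```python
# Hypothetical sketch: a previously seen incident (a transient upstream
# timeout) made reproducible, with an automated remediation attached.

import time

class TransientTimeout(Exception):
    pass

def with_retries(fn, attempts: int = 3, base_delay: float = 0.1):
    """Automated remediation: retry a flaky call with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientTimeout:
            if attempt == attempts - 1:
                raise  # retries exhausted: escalate rather than mask it
            time.sleep(base_delay * (2 ** attempt))

def make_flaky_service(failures: int):
    """Reproduce the incident: fail a fixed number of times, then succeed."""
    state = {"remaining": failures}
    def call():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientTimeout("upstream timed out")
        return "ok"
    return call

# The incident is now reproducible on demand, and the playbook is code:
assert with_retries(make_flaky_service(failures=2)) == "ok"
```

Once a failure can be replayed like this in a test suite, the automated response can be verified on every build instead of being rediscovered during the next outage.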

Conclusion

Resilience engineering is far from a trend. Rather, it is an idea that can rapidly change a team's development culture to conquer unforeseen challenges with confidence. Grigor notes that adopting such a mindset along with cloud-based technologies was more than advantageous for those he worked with, stating:

"It has solved our scalability and resilience problems by automatically deploying, scaling, and managing our containerized applications. Now, if one component fails, others continue to work so our clients aren't affected during downtime."
