Resilience Is an R&D Problem, Not Just an SRE Problem
Imagine that you’re at your company’s all-hands meeting and one of the sellers is proudly ringing the office gong to celebrate closing a big deal with a client who’s on the other side of the world. It’s a big deal because it’s a major project. Their logo is going to look sleek on your website, and you are finally breaking into a new region of the world. But two months after the project kicks off, the situation isn’t looking as rosy. There have been myriad issues, and now the client has been informed of a critical issue with your application, which is preventing them from providing service to their clients. What a mess!
But that’s OK. Everyone knows that you can’t develop code without bugs; it’s simply a part of life. After accepting this fact, you need to learn how to deal with it by building processes that better enable you to handle these issues.
For this reason, your company has a team of talented site reliability engineers (SREs) who build scalable, highly reliable software systems that minimize the impact of bugs. They handle customer issues, spend time on call, and step in when manual intervention is needed. They are your front-line defense when it comes to battling client bugs.
But it’s critical to remember that no matter how amazing your SREs are, fixing a bug early in the development process costs significantly less than fixing it after release. One study reported that a bug found during the implementation stage is approximately six times more expensive to fix than one identified during the design phase, and up to 100 times more expensive if it is found during the maintenance phase.
For that reason, it is important to understand and accept that resilience, meaning how well equipped you are to deal with issues as they come up, is a research and development (R&D) problem, not just an SRE problem. After all, prevention is better than cure.
So how do you go about creating resilient software? Start by taking a look at your R&D team. They are your core. If SREs are your front-line defense, R&D are the ones who ensure that the product is strong, robust, and durable. They are your immune system. R&D is the team that has to ensure, and continuously verify, that every issue that can be dealt with has been dealt with before the product reaches the client.
To ensure that happens, here are a few things you should focus on:
- Identify issues during the development phase, using various tools and techniques such as architecture risk analysis.
- Ensure that you set up and follow a structured and rigorous code review process. Having your peers go through your code provides another set of eyes to ensure that nothing is missed.
- Before releasing the software, conduct penetration tests to identify security issues, and verify that bugs you had previously identified have been resolved.
- Set up automated tests, and track and monitor the performance of your code so that you catch problems before the client experiences them.
- Ensure you have set up the infrastructure needed to enable you to debug your code once it hits production.
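As a rough illustration of the last two points, the sketch below pairs a small automated test with logging that records enough context to debug a failure once the code is in production. It is only one possible shape: the `apply_discount` function, the `orders` logger name, and the business rule are hypothetical, invented for this example.

```python
import logging

# Hypothetical logger name; in a real service this would match your module or component.
logger = logging.getLogger("orders")


def apply_discount(price: float, percent: float) -> float:
    """Return price reduced by percent (a made-up business rule for illustration)."""
    if not 0 <= percent <= 100:
        # Record the inputs so the failure can be diagnosed from production logs,
        # then fail loudly instead of returning a wrong price.
        logger.error("invalid discount: price=%s percent=%s", price, percent)
        raise ValueError(f"percent must be in [0, 100], got {percent}")
    result = round(price * (1 - percent / 100), 2)
    logger.debug("discount applied: price=%s percent=%s result=%s", price, percent, result)
    return result


def test_apply_discount() -> None:
    """Automated check that runs on every change, before the client sees the code."""
    assert apply_discount(200.0, 25) == 150.0
    assert apply_discount(80.0, 0) == 80.0
    try:
        apply_discount(80.0, 120)
    except ValueError:
        pass  # expected: out-of-range input is rejected
    else:
        raise AssertionError("expected ValueError for percent > 100")


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    test_apply_discount()
    print("all checks passed")
```

In practice a test like this would run in your continuous integration pipeline, and the log lines would feed whatever monitoring your team already uses.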
All of these will ultimately affect the bottom line. Creating robust code early on in the development process minimizes the number of bugs, and any bugs that are identified will be easier and relatively inexpensive to fix. Robust and resilient code in production ensures that the SREs in your company can focus on working with the clients and fixing client-specific issues instead of going back to R&D to address core issues in the code.
Company-wide resilience is critical to ensure that you can deal with the unexpected. There are many ways in which you can strengthen your resilience, but it’s critical to remember that it’s not just something that those who are working directly with your customers have to worry about. The more R&D gets involved in ensuring that there is resilience company-wide, the more you will enjoy being part of a smoothly operating machine.