An Engineer’s Dilemma
Working with Rookout customers, I have noticed a significant pattern in how they describe engineering routines in the days before our software became a part of their daily workflow. It shows up in various engineering tasks such as developing new features, reproducing and fixing bugs, or even just documenting the existing system and how to best utilize it. It is also consistent across industries and tech stacks.
I want to take this opportunity to share this pattern with you, one which I find to be much in line with my own experience not only as an engineer but also as an engineering manager.
Facing the Paradox
As an engineering team working on an existing code base, your first and foremost source of truth is the code itself. After all, software documentation is notoriously difficult to maintain and is predictably out of date when you need it most. This is even more pronounced now that work once done by hand is automated and captured in the code itself through modern practices such as Infrastructure as Code and Database as Code.
That being said, it’s important to remember that reading source code only tells half the story of what’s happening in the software as it’s running (and running it locally doesn’t help all that much, though that’s an entirely different blog post). This is where other data sources come into play, most notably whatever Observability and Monitoring tools are in place, such as logging, tracing, metrics, and BI.
Unfortunately, more often than not, engineers lack the data required to design and execute their day-to-day assignments to the best of their ability. Still, getting more data requires writing more code, getting it integrated, and deploying it to the relevant environment, all of which can be just as expensive (and sometimes as risky) as doing their assigned tasks in the first place. This brings us to the Engineer’s Dilemma:
Do I develop the task ahead of me with the information I already have, or do I develop a feature that will get me more data?
The Road to Understandability
Reading this, you might be wondering: what are the missing pieces of information that all those engineers can’t get without writing more code? Well, here are a few of the most notable examples we are seeing:
- User Behavior – engineers (and Product Managers!) are interested in knowing how the system is utilized in real-world scenarios. While APM tools often provide basic metrics, such as which APIs are called more often than the others, they provide little insight into what arguments are passed in, which users are using which parts of the system, or dozens of other questions that provide a better understanding of the application.
- Real Data – engineers need to know what data is flowing through the system and where it’s stored. This is even more important with dynamic languages, NoSQL databases, and unstructured data, where even the types of data might be hard to infer from reading the source code. By seeing examples of real data in various parts of the code, engineers can gain a better understanding of the application.
- Dependencies – engineers are always wondering how the services integrated with their own software behave in different scenarios. Software applications are becoming ever more interdependent, both internally with the move from SOA (Service-Oriented Architecture) to Microservices and externally with many new SaaS offerings. By observing the interactions of those incoming and outgoing APIs, engineers can gain a better understanding of the application.
- Complexity – engineers are tasked with staying on top of their evolving codebase. As software scales, it is constantly repurposed and retrofitted to meet new requirements, causing the codebase to grow in size and complexity. New engineers who join the team might not be as knowledgeable about the codebase itself. As code becomes more complicated, debugging it offers unique insights into its behavior, allowing engineers a better understanding of the application.
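To make the Real Data point concrete, here is a minimal Python sketch of the kind of throwaway helper an engineer might write to see what a payload actually looks like at runtime. The function name `describe_shape` and the sample payload are hypothetical, invented purely for illustration; the helper reports only types, never values.

```python
def describe_shape(value, depth=0, max_depth=3):
    """Summarize the structure of a value using type names only (no real values)."""
    if depth >= max_depth:
        return type(value).__name__
    if isinstance(value, dict):
        # Keep the keys, replace each value with its own shape.
        return {key: describe_shape(item, depth + 1, max_depth)
                for key, item in value.items()}
    if isinstance(value, list):
        # Collect the distinct element shapes, preserving first-seen order.
        shapes = []
        for item in value:
            shape = describe_shape(item, depth + 1, max_depth)
            if shape not in shapes:
                shapes.append(shape)
        return shapes
    return type(value).__name__


# Hypothetical payload, standing in for a document read from a NoSQL store.
sample = {"id": 7, "tags": ["a", "b"], "owner": {"name": "x", "age": None}}
print(describe_shape(sample))
```

Even a helper this small still has to be written, reviewed, and deployed before it can answer the question, which is exactly the cost discussed in the sections that follow.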
Do The Job
The most straightforward approach for engineers when working is to do the job in front of them with the information they already have. Naturally, performing a task while lacking critical information is hardly the best way to go forward.
The classic example is when developers are attempting to resolve a bug. They find that the lack of data means they have little ability to pinpoint the root cause and have to quite literally change code at random in the hopes it may fix the bug. Worse, when lacking the data to understand and/or reproduce the bug, the team has no way to verify the bug has even been fixed.
Yet, even when developing a new feature, a lack of understanding of the existing code base and how it’s being used presents a hurdle for engineers. Extra time and effort are spent on handling potential “what ifs” that might not even be relevant, all the while failing to address real issues that will inevitably arise in production. Overall, this leads to features that are more expensive and slower to develop, and that have higher failure rates when rolled out.
Get More Data
Alternatively, engineers can dive into the rabbit hole, chasing those missing pieces of data. Once such a missing piece of data has been identified, they then have to develop an entirely new feature, one that will collect the data they need.
While this is (usually) a relatively simple feature, it is a feature nonetheless. One has to figure out where in the code to collect the data, how to process it, and where to send it, be it a new log line, a new alert, or a new metric. The new feature has to be integrated into the software’s mainline and verified to be working properly, and the new version has to pass regression tests. Last but not least, the new version has to be approved and deployed through whatever organizational processes are in place, and such changes carry their own set of risks.
Unfortunately, even after going through this process, engineers may find that they failed to get the piece of data they were looking for, or that the new piece of data doesn’t provide as much clarity as they were hoping for, and they might have to endure the whole process again.
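As a rough illustration of what such a "simple" data-collection feature looks like, here is a minimal Python sketch. Everything in it is hypothetical: the `process_order` function, the `checkout` logger name, the bucket boundaries, and the in-process `Counter` standing in for whatever metrics client a real system would use.

```python
import logging
from collections import Counter

logger = logging.getLogger("checkout")  # hypothetical logger name

# Stand-in for a real metrics client; assumed here only to keep the sketch runnable.
order_size_buckets = Counter()


def bucket_for(item_count: int) -> str:
    """Process the raw value into a coarse bucket before reporting it."""
    if item_count <= 1:
        return "single"
    if item_count <= 5:
        return "small"
    return "large"


def process_order(user_id: str, items: list) -> int:
    # --- New data-collection code: decide where to collect, how to process,
    # --- and where to send (a metric and a log line).
    bucket = bucket_for(len(items))
    order_size_buckets[bucket] += 1
    logger.info("process_order user=%s item_count=%d bucket=%s",
                user_id, len(items), bucket)

    # --- Existing business logic (placeholder for this sketch).
    return len(items)
```

The added lines are few, but each one is production code: it can raise, slow down a hot path, or leak sensitive data into logs, which is why it goes through the same review, testing, and deployment steps as any other change.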
Over and over, we have heard software engineers and architects lamenting that this process is so cumbersome and expensive in their own organizations that individual contributors prefer to skip it and act on whatever little data they already have.
The Real Purpose of Observability
I’m sure at this point you are asking yourself: how can this be? Organizations spend a fortune on the aforementioned Observability and Monitoring tools. How can these tools fail to fix these problems?
Well, the truth of the matter is that those tools were never meant to solve those problems. The main use cases for those tools are to:
- Minimize Service Disruption: the most basic use-case is detecting service disruptions as fast as possible, alerting the relevant (on-call) personnel, and aiding them in understanding the root cause and restoring service.
- Optimize Production Performance: the more advanced use-case is detecting performance bottlenecks and anomalies, as well as providing operations and engineering teams insights into why and where they are occurring.
- Auditing and Logging: another big use-case that provides long-term outputs of the system for consumption by customer-facing representatives such as technical support, as well as storing those logs for security and compliance purposes.
There’s a very good reason the use-cases above have been prioritized and solved by these tools. The financial incentives for solving those problems are very clear and ROI calculations tend to be very straightforward. At the same time, these use-cases have little to do with the day-to-day work of the majority of the engineering workforce.
Solving the Dilemma
I have seen this pattern come up time after time in every organization we have worked with. Day after day, engineers make suboptimal choices based on poor information because of the sheer difficulty of collecting additional data to educate themselves. Besides causing deep individual frustration, this has a big impact on software development velocity and quality.
That’s the very reason I founded Rookout. We strive to empower engineers to collect the data they need on the fly without compromising correctness, performance, availability, security, or compliance. Reach out to learn more about the huge difference this can make for you.