The 5 Approaches to Production Debugging
A genius friend of mine used to say: “Just do bugless-oriented programming. When you code, if you see you’re about to write a bug, simply don’t. It saves a lot of time.” As bugless-oriented programming remains a concept for gods, we mere mortals had better find practical approaches that actually work.
Software is eating the world, reaching ever greater scale, speed, and complexity, and software development is constantly racing to evolve in answer to the increasing demand.
In the past few years, we’ve seen the rise of DevOps: a collection of methodologies and tools for developing software with greater agility and at greater scale. Highlights of this surge include CI/CD, containers, microservices, auto-scaling, resource orchestration, and serverless.
With each technological step forward, and each layer added, visibility into code execution becomes harder and harder to achieve. We are left with limited visibility at scale, and slow response times when we need to understand and debug production environments.
Facing these immense challenges, the question is clear: what are our options for approaching production debugging?
Don’t worry be happy
(gathering debt)
The first approach, while rather common, isn’t much of an approach at all, but rather a naive, happy-go-lucky mindset that simply chooses to ignore the entire issue.
It is often a result of inexperience. Developers going down this path look at a PoC, an MVP, or a simple working project and decide that if it has worked so far, there’s no issue. They’ll worry about it if and when it breaks… The pain starts the moment they try to scale, and the results of the approach translate directly into technical debt.
Ultra-orthodox testing
(if you test it, it will run)
The testing-as-a-solution approach is a popular leader, plainly stating: “Test everything!”
The mindset, spearheaded by the followers of the TDD cult, believes in leaving no room for doubt in production. Everything is simulated and tested in comprehensive test suites and staging environments, and more proactive tools such as Chaos Monkey or Gremlin are harnessed to stress test the system under extreme conditions.
The approach has a lot of merits: it can reduce the bug-potential-surface, and it can detect potential problems earlier, before they become costly. But due to the nature of things, and mainly human nature, it cannot remove the potential for bugs and problems completely. In addition, the approach usually carries high costs in the form of heavy (and sometimes slow) R&D cycles, heavy CI/CD infrastructure work, and heavy testing requirements (e.g. runtime, resources).
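To make the chaos-style testing mentioned above a little more concrete, here is a minimal sketch of fault injection in a test. It is not how Chaos Monkey or Gremlin work internally; the names `FlakyDependency` and `fetch_with_retry` are hypothetical and exist only for this illustration. The idea is simply to wrap a dependency so it fails some of the time, then check that the calling code copes.

```python
import random


class FlakyDependency:
    """Hypothetical wrapper that makes a fraction of calls fail on purpose."""

    def __init__(self, func, failure_rate, seed=0):
        self._func = func
        self._failure_rate = failure_rate
        # A seeded generator keeps the injected failures reproducible.
        self._rng = random.Random(seed)

    def __call__(self, *args, **kwargs):
        # random() returns a float in [0.0, 1.0), so a rate of 0.0 never
        # fails and a rate of 1.0 always fails.
        if self._rng.random() < self._failure_rate:
            raise ConnectionError("injected fault")
        return self._func(*args, **kwargs)


def fetch_with_retry(fetch, key, attempts=3):
    """Code under test: retry a flaky call a fixed number of times."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(key)
        except ConnectionError as err:
            last_error = err
    raise last_error


if __name__ == "__main__":
    healthy = FlakyDependency(str.upper, failure_rate=0.0)
    print(fetch_with_retry(healthy, "order-1"))

    broken = FlakyDependency(str.upper, failure_rate=1.0)
    try:
        fetch_with_retry(broken, "order-1")
    except ConnectionError as err:
        print("gave up after retries:", err)
```

Even a toy like this shows the cost side of the approach: every injected failure mode is another scenario someone has to think of, write, and maintain.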
PokéMonitor
(gotta catch’em all!)
In a similar fashion to the testing approach, the logging and monitoring approach calls for its disciples to catch and collect everything: “Plan and write all the log lines you’d ever want beforehand”, “Deploy all the monitoring agents and SDKs from day one, and save all their data as long as you can”. The approach has a lot of value when things go wrong and require fixes: the more you collect, the higher the chance you’ll have the information you need to understand and resolve the incident. In addition, collecting everything can have surprising value when combined with analysis methods (e.g. anomaly detection), which can produce unexpected insights. Yet the method has three key disadvantages:
a. Predictions:
Since it’s impossible to literally collect every piece of data all the time, developers are required to predict the future (which data will be needed) and to prioritize collection. This is an extremely challenging task even for the most experienced, made even more complex by the conflict between the developer’s current mindset (design and create) and the one that will be needed later (debug and monitor).
b. Signal to noise ratio:
Even when you succeed in collecting most of the data, you have to overcome the dark-data problem (data that is collected but never used) as well as false positives. It’s quite a daunting task to process large amounts of data and find the relevant pieces for each case in time.
c. High infrastructure and maintenance costs:
Efficient data collection is a software engineering challenge in itself, especially if you’re aspiring to collect everything.
Costs include heavy use of compute resources (e.g. CPU, memory, networking, storage) and a high dependency on 3rd party monitoring solutions (agents and SDKs), which brings service/license costs, maintenance costs, and, worst of all, the cost of increasing the bug-potential-surface by including 3rd party solutions and code.
The collect-all approach works well with, and is often adopted alongside, the test-everything approach (both share a puristic way of thinking and zealous followers).
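The prioritization and cost pressures described above often end up encoded in the logging setup itself. As a rough sketch (one possible tactic, not something the collect-all approach prescribes), the filter below keeps every warning and error but only a sample of lower-level records. It uses Python’s standard `logging` module; the class name `SampleBelowWarning` and the `every` parameter are made up for this example.

```python
import logging


class SampleBelowWarning(logging.Filter):
    """Keep all WARNING+ records, but only 1 in `every` lower-level records."""

    def __init__(self, every):
        super().__init__()
        self._every = every
        self._seen = 0

    def filter(self, record):
        # Records at WARNING and above are always kept.
        if record.levelno >= logging.WARNING:
            return True
        # For DEBUG/INFO, keep the 1st, (every+1)th, (2*every+1)th, ... record.
        self._seen += 1
        return (self._seen - 1) % self._every == 0


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(SampleBelowWarning(every=100))

    logger = logging.getLogger("checkout")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)

    for i in range(250):
        logger.info("processed item %d", i)  # only a sample reaches the handler
    logger.warning("payment provider is slow")  # always reaches the handler
```

Note that the sampling rate is itself a prediction about which data will matter later, which is exactly the difficulty described in point (a).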
Move fast
(and break things)
In sharp contrast to the puristic collect-everything and test-everything approaches, this approach tries to address production debugging by increasing the speed of software updates. It basically says: “It’s ok if things aren’t perfect with the software, as long as we can deliver new software quickly enough to respond to issues.”
This approach usually invests less in infrastructure and testing, in favour of tighter monitoring and alerting. More crucially, it relies heavily on strong CI/CD capabilities and on minimizing R&D and deployment cycles.
This approach has clear benefits for general R&D speed, response speed, and cost reduction. Yet it suffers from two painful disadvantages: one obvious, and one deeply hidden.
a. Obvious disadvantage – the quality-for-speed trade-off:
While gaining speed, the approach pays for it in overall R&D and infrastructure software quality.
b. Hidden disadvantage – coupling debugging cycles with R&D and deployment cycles:
In non-production environments, and in production for simpler projects, debugging is a simple, straightforward process: “connect, inspect, understand, debug”. With this approach, however, we are required to entangle debugging with other, more complex and often asynchronous processes, namely general R&D and general deployment, creating a heavy, slow, complex, synchronous system.
Agile Data-layer
(visibility set free)
The data-layer approach is the newest, building on top of existing agile / DevOps techniques, including the move-fast approach. The approach focuses on decoupling the data layer from the rest of the application, reaching a state where the needed visibility into production can be obtained on demand, preferably with as little effect on, and dependency on, other aspects of the software as possible.
This approach has several clear benefits – primarily agility and stability – as it provides greater visibility faster and without risking other elements. In addition it untangles the R&D/Deployment cycle and debugging cycles – freeing up R&D and management resources.
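A very small taste of the idea, using nothing but Python’s standard `logging` module: the sketch below raises the verbosity of selected loggers in a running process and later restores it, with no code change and no redeploy. This is a deliberately simple stand-in and not how Rookout or any other agile data-layer product works; the function names and the `payments.gateway` logger are hypothetical, and how the overrides reach the process (admin endpoint, config service, signal) is left open.

```python
import logging


def apply_debug_overrides(overrides):
    """Apply {logger_name: level_name} at runtime and return the old levels."""
    previous = {}
    for name, level_name in overrides.items():
        logger = logging.getLogger(name)
        previous[name] = logger.level
        logger.setLevel(level_name)  # setLevel accepts names such as "DEBUG"
    return previous


def restore_levels(previous):
    """Undo apply_debug_overrides once the investigation is over."""
    for name, level in previous.items():
        logging.getLogger(name).setLevel(level)


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    log = logging.getLogger("payments.gateway")

    log.debug("not shown: DEBUG is below the configured WARNING level")

    saved = apply_debug_overrides({"payments.gateway": "DEBUG"})
    log.debug("shown: visibility was switched on while the process is running")

    restore_levels(saved)
    log.debug("not shown again: the previous level is back in place")
```

The limitation is telling: this only exposes log lines someone already wrote. The data-layer approach described here aims further, at getting visibility that was not planned in advance.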
The main disadvantage of this approach is the need to design the data layer for agility, or at least to use an agile data-layer solution (such as Rookout).
In the end, the ever-increasing challenges of software development can’t be tackled by a single approach, and a combination of all five approaches listed above is required to reach real and effective production debugging. Already, when reviewing organizations today, we find multiple combinations of approaches, yet more often than not organizations over-focus on one or two specific approaches.
Moving forward, organizations will need to find ways to combine all approaches in smart and case-specific ways, reaching a better understanding of their software and obtaining true agility in production debugging.