It’s Time We All Stop Cursing The Darkness
To paraphrase a popular Chinese proverb: “It is better to light a log line than to curse the darkness”. And yet it’s 2021, and many of us are still lost in the dark, struggling to find a match, praying for that one missing log line that would have pointed us in the right direction. How did things get this bad?
Since time immemorial, we have relied on logging as the first trustworthy, intuitive, and straightforward debugging method. A developer writes a tricky line of code, or a snippet that should never be executed, or a function whose execution should be trivial. And yet, when the application runs, things don't go as expected. So the developer adds another line of code that prints a message before, after, or during the tricky execution, specifying exactly what happened, providing context, and indicating the severity of the event. This is the bread and butter of log-based debugging.
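To make that bread and butter concrete, here is a minimal sketch using Python's standard logging module; the function, logger, and field names are hypothetical:

```python
import logging

logger = logging.getLogger("checkout")

def apply_coupon(order, coupon):
    # Context before the tricky execution
    logger.debug("applying coupon %s to order %s", coupon.code, order.id)

    if coupon.amount > order.total:
        # A branch that "should never be executed" -- exactly why we log it
        logger.warning("coupon %s exceeds order total %s", coupon.code, order.total)

    order.total -= coupon.amount

    # Confirmation after the fact, at a severity matching its importance
    logger.info("order %s total is now %s", order.id, order.total)
```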
As the application grows and evolves, more developers join the party. A complex logging pipeline is built, and when the application runs in a staging or test environment, each developer is bombarded with the DEBUG logs written by their peers. The DevOps or Infrastructure or Observability team is called into action, and significant R&D effort and infrastructure cost are invested in ensuring that the logs are well formatted, that the correct level of severity and detail is used, and that when a problem happens we can find the WARNING or ERROR logs we need, along with the relevant stack trace, rather than drowning in a sea of nearly identical "@@@ I am here and it is now @@@" debug logs masquerading as INFO logs.
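Much of that effort boils down to configuration along these lines; this is an illustrative sketch (the peer module name is invented), not a prescription:

```python
import logging

# Consistent, parseable formatting so the pipeline can index every record
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)-8s %(name)s %(message)s",
)

# Quiet a chatty peer module without touching its code:
# its DEBUG and INFO records are dropped, WARNING and above still flow
logging.getLogger("inventory.sync").setLevel(logging.WARNING)
```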
As the application gets shipped to production, or deployed in a customer environment, the complexity rises again. We now need to consider the performance impact of printing logs, as well as the money we spend on shipping, storing, indexing, and extracting them. When the product is young and growing, and the cost of losing a transaction or failing to fix an issue is too high, we take a cost hit: we log everything. Quite literally, everything.
However, if things work as planned, we eventually reach a level of maturity and growth that lets us cut down on logging. We invest several development sprints in removing unnecessary log lines or reducing their severity. Application code that was once fresh and unknown, covered with detailed log lines from top to bottom, is now mature and familiar, and the only log lines that get printed are the ones indicating that something, somewhere, went extremely and unexpectedly wrong. And when that happens? We use those logs as a flashlight, pointing at the root cause of the problem and adding color and insight to parts of the code that would otherwise have been in black and white, or altogether invisible.
And if this were indeed a children's story, this is where it would end. We all lived happily ever after, having found the balance between logging too much and too little, like Goldilocks and the three bears. Everything is just right, and we can have our coffee and take a break. But real-life software engineering is very rarely a fairytale.
In real life, we sometimes have too many logs, even in a mature and well-maintained application. When a problem occurs, we can't find the needle in the haystack. Logging costs are still a challenge, and each newly added log line has to be optimized for performance, cost, and readability. And in real life, more often than not, we don't have enough logs. Maybe we confidently print INFO logs, but the extra level of detail we need is hidden in a DEBUG or TRACE level log line. Or maybe we decided to print only WARNING or ERROR logs because the application is running at scale, and we can't even see the hints provided by a well-placed INFO log line.
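That second failure mode is painfully easy to reproduce. In this illustrative snippet (logger name and messages invented), the hints we need never make it out of the process:

```python
import logging

# Running at scale, so only WARNING and above are emitted
logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("payments")

log.debug("request payload: %s", {"amount": 42, "retries": 2})  # filtered out
log.info("retrying payment, attempt 2 of 3")                    # filtered out
log.warning("payment failed twice, giving up")                  # all we ever see
```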
So the pendulum keeps swinging: we add code, change the verbosity of specific lines, restart the app with a different log level. And repeat. When a developer troubleshooting a specific service only needs DEBUG level for that service, for a specific account, for a specific deployment, they end up turning on DEBUG logs for everyone, setting the entire app ablaze when a candle would have been enough. When a DevOps engineer worries about rising costs, they reduce logging to ERROR for everyone, leaving everyone blind and afraid, without the light they need to see where they are heading. Cursing the darkness, or cursing the brightness of the sun, without the means to find a dynamic and flexible balance between the two.
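The candle is technically possible. Here is one hedged sketch of what precise targeting could look like with Python's standard logging filters; the service and account identifiers are hypothetical:

```python
import logging

class AccountFilter(logging.Filter):
    """Let DEBUG records through only for the one account under investigation."""

    def __init__(self, account_id):
        super().__init__()
        self.account_id = account_id

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True  # never suppress INFO and above
        # Assumes callers attach account_id via the `extra` kwarg
        return getattr(record, "account_id", None) == self.account_id

handler = logging.StreamHandler()
handler.addFilter(AccountFilter("acct-42"))

log = logging.getLogger("billing")  # one service's logger, not the whole app
log.setLevel(logging.DEBUG)
log.addHandler(handler)

log.debug("recomputing invoice", extra={"account_id": "acct-42"})  # printed
log.debug("recomputing invoice", extra={"account_id": "acct-7"})   # dropped
```

It works, but it has to be wired in ahead of time, per logger and per handler, which is exactly why so few teams ever get this precise in practice.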
We find ourselves reiterating our original question: how did things get this bad? And why haven't they gotten better since? Looking for the root cause, we stumbled upon the following sentiments, believed to be true by many of our customers and peers:
- Logging still costs too much. Logging everything is not feasible: it is too expensive and requires too much engineering effort.
- Not logging at all is not an option. When things break down, you need to be able to quickly understand what happened and why, with as much context as possible.
- There is no efficient and safe way to switch from "dark mode" to "light mode" and back. Changing the log level requires a restart, or a code change, or both (for a glimpse of a restart-free alternative, see the sketch after this list). And going "full light" may be even riskier, as it could "break" the app rather than help us fix it.
- Even if there were an efficient way to switch to "light mode", it would just leave us blinded and grasping in the dark. Rather than setting everything on fire or lighting up a sun, we need something precise: a way to switch on logs only in the area we are investigating, printing them only for a specific service, account, or deployment.
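To be fair, restart-free switching is not science fiction. One common pattern on Unix systems is to flip the level on a signal, sketched below in Python (the choice of SIGUSR1 is an assumption, not a prescription):

```python
import logging
import signal

def toggle_debug(signum, frame):
    """Flip the root logger between 'dark mode' and 'light mode' at runtime."""
    root = logging.getLogger()
    new_level = logging.INFO if root.level == logging.DEBUG else logging.DEBUG
    root.setLevel(new_level)
    logging.warning("log level switched to %s", logging.getLevelName(new_level))

# After this, `kill -USR1 <pid>` toggles verbosity without a restart
signal.signal(signal.SIGUSR1, toggle_debug)
```

Notice, though, that this flips the switch for the entire process, which is exactly the fourth sentiment above: even with restart-free switching, precision is still missing.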
What if, instead of setting the whole building on fire, we could point a very effective searchlight at exactly the point of the problem? What if switching logs from WARNING to INFO to DEBUG and back was as easy as switching between Dark Mode and Light Mode in our favorite app? What if we could switch logging on temporarily, focused on just the service we are trying to debug? What if we could easily and dynamically balance the need for fresh, context-rich data with the need to keep performance overhead and log storage costs at a minimum?
Happily, we do have some ideas up our sleeve. And yes, they'll save you all that frustration, pain, and insane logging costs. Stay tuned to find out next week just what this magical offering is 😉