DevOps often don’t understand logging

My job involves writing software: working on bug fixes, adding new features and generally making the software better. That could mean easier to use, so less training time for users. It could mean the software is faster, so our users can do more of their other work. It could mean safer, so we cause less frustration and upset to the general public. This all fits into the end goal we call “delivering value”. Value is an incredibly loose term, not necessarily related to money, although it commonly is. It can also simply be called “improvements to the product”. It’s not a science, but we identify pain points and try to smooth them out.

Businesses should try to utilise data in all their decision making and move away from gut-based decision making, because the latter is significantly flawed. I can name dozens of examples from my previous experience where assumptions essentially wasted money and introduced avoidable technical debt and other complexities. As one example, at the place I currently work someone was moving all the Mongo database backups to use a new Mongo replica instead of master, on the assumption that the backups were slowing down the production applications. That turned out to be a waste of two months, since it never had an impact on application speed. In another example, the business asks for dozens of reports, each more meaningless than the last, unless truly challenged. Maybe look at an actual report once and decide if it’s useful before I code it into the application and have to support it forever. It’s always best practice to use data to prove your beliefs, and numerous companies exist to help other companies understand their own data better. In short, we should use the data we have to identify and assign value to work when we prioritise it, instead of just guessing at what will improve the product.

Data warehousing is a very old discipline used by many companies. You collect ad-hoc, unprocessed data from across your business and then practise combining it in different ways to try to understand your customers and objectives in new ways. I personally see my application logs as a huge data warehousing effort. So my boss and I will discuss a problem, like how long it takes to do some task in the system, and we’ll start looking at our logs and our database. Maybe the number of “edits” users make to a page indicates how many mistakes other users are making. Perhaps comparing two URLs allows us to see how long a mistake goes unnoticed. Perhaps if we quantify this mistake rate we can prove our work yields improvements by measuring how many fewer edits are made after the change. We can measure it before and after in order to prove our work is of some demonstrable value. One thing we do in our department is count the number of emails to our support bucket and ask ourselves which changes will reduce that expensive and annoying correspondence the most. However, I don’t know what metric or check is going to be useful until I am looking at the JIRA tickets on my backlog. It could be the distance between log lines, it could be URLs, it could be times and it could be a mixture. It’s incredibly situational.
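To make the before-and-after idea concrete, here is a minimal sketch in Python. Everything about it is an assumption for illustration: the log layout, the "/pages/<id>/edit" URL, the sample lines and the release date are all made up, and a real log would need its own pattern. It just counts edit requests per day on either side of a change.

```python
import re
from datetime import datetime

# Assumed (hypothetical) log layout: "<ISO timestamp> user=<name> <METHOD> <url>"
EDIT_PATTERN = re.compile(r"^(\S+) user=\S+ POST /pages/\d+/edit$")


def edit_rate_per_day(lines, start, end):
    """Average number of edit requests per day in the window [start, end)."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("window must span at least one whole day")
    count = 0
    for line in lines:
        match = EDIT_PATTERN.match(line)
        if match is None:
            continue  # not an edit request, ignore it
        timestamp = datetime.fromisoformat(match.group(1))
        if start <= timestamp < end:
            count += 1
    return count / days


# Made-up sample data, standing in for a real log file.
SAMPLE_LOG = [
    "2023-03-01T09:00:00 user=alice POST /pages/12/edit",
    "2023-03-01T09:05:00 user=bob GET /pages/12",
    "2023-03-01T10:00:00 user=bob POST /pages/12/edit",
    "2023-03-02T11:00:00 user=carol POST /pages/7/edit",
    "2023-03-02T15:30:00 user=alice POST /pages/7/edit",
    "2023-03-03T09:00:00 user=alice POST /pages/12/edit",
    "2023-03-04T14:00:00 user=bob GET /pages/7",
]

release = datetime(2023, 3, 3)  # hypothetical date the change went live
before = edit_rate_per_day(SAMPLE_LOG, datetime(2023, 3, 1), release)
after = edit_rate_per_day(SAMPLE_LOG, release, datetime(2023, 3, 5))
print(f"edits/day before: {before:.1f}, after: {after:.1f}")
```

The point is not this particular metric but that the whole experiment is a regex and a loop over data that already exists, so it costs minutes rather than a sprint.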

Perhaps you think the work of attaching costs and values or parsing logs is for the product owners or managers. I would argue it’s a shared responsibility across all levels, and we should challenge work and enrich requests with real stats rather than blindly implement meaningless change for a pay cheque.

In order to enrich JIRA tickets with provable estimates and data, I specifically need access to an ad-hoc, dynamic tool where I can make meaning out of unstructured data with no upfront planning. I can do this from the logs with Splunk. Splunk allows me to perform a free-form, regex-like search over my logs and then draw graphs from them and derive averages, maximums, trends and deviations. However, if I need either to define a fixed parsing pipeline to turn ad-hoc logs into structured JSON data, or to add triggers to my code for sysdig, this immediately means I cannot evaluate any historic data. It also means I have to do upfront and expensive development work to find out if another piece of work is worth doing. That is expensive in terms of time, effort and efficiency, especially since it’s not a science and the result could be meaningless. I need to be able to experiment very cheaply (i.e. a regex or a SQL query), and writing data to sysdig manually is not cheap. It means waiting for two weeks to find out the answer to my question, assuming two weeks’ data is even enough to make an informed decision. It’s better to have a tool that runs like dogshit but answers business questions on demand with no upfront planning than to have a tool that draws graphs from extracted data but requires forethought when configuring it.
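To show what I mean by a cheap experiment, here is a rough Python sketch of the same kind of question asked of raw, unparsed log text. This is not Splunk's query language, and the log lines, message wording and the summarise helper are all hypothetical. The only thing that changes between one question and the next is the regex.

```python
import re
import statistics

# Made-up free-text log lines; nothing here was written with parsing in mind.
RAW_LOG = """\
2023-03-01 09:00:01 INFO Report 'sales' generated in 420 ms for user 17
2023-03-01 09:02:10 WARN Cache miss for key user:17
2023-03-01 09:05:43 INFO Report 'stock' generated in 1310 ms for user 4
2023-03-01 09:07:12 INFO Report 'sales' generated in 380 ms for user 9
2023-03-01 09:09:30 ERROR Mail server timeout
2023-03-01 09:11:02 INFO Report 'stock' generated in 1290 ms for user 17
"""


def summarise(pattern, text):
    """Apply an ad-hoc regex with one numeric capture group and summarise the values."""
    values = [float(v) for v in re.findall(pattern, text)]
    if not values:
        return None  # the pattern matched nothing
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "max": max(values),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


# Two different questions, answered from the same historic data with no re-ingestion.
all_reports = summarise(r"generated in (\d+) ms", RAW_LOG)
sales_only = summarise(r"Report 'sales' generated in (\d+) ms", RAW_LOG)
print(all_reports)
print(sales_only)
```

Because the pattern is applied at query time, the second question costs nothing extra and works on logs written long before anyone thought to ask it. A fixed parsing pipeline would only have the fields somebody chose to extract in advance.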

People who think Kibana and logs are useful for finding errors but should only keep data short-term, and people who think Kibana should only be fed parsed, structured JSON, are ignoring the enormous amounts of useful information that would make them better developers. I hate to generalise, but at every company I go to I find that DevOps members tend to overlap with the former group. Kibana and Splunk have similar-looking UIs, but since one opens a world of business intelligence and the other doesn’t, that’s where the similarities end. I also advise you to keep logs forever, as you may want to do “year-on-year” analysis of growth and things like that later.