Monday, 3 June 2013

How to use logging levels

As the title says - what to log, and at which level? 

Note: I'll be giving examples far more complex than my personal project. Part of my day-to-day activity involves working with logs, so my knowledge and experience on this subject goes beyond what I'm coding. I've rarely taken the time to organize my thoughts in writing, so this is my first attempt at correcting that omission.

TRACE

I'm using TRACE for two situations:
  • Log records that I won't miss in production - e.g., a ctor that has an initializer list and an empty body.
  • Operations that could cause "log spamming" (no relation) - e.g., when I set up the asio io_service loop to call libsh2 functions, these are repeatedly called, many times per second, until their task is over. The last thing I want in my logs is this kind of spam, unless I really need it, while investigating an issue.
TRACE statements can be #defined out of the code, so the first point is important.

DEBUG

Here, I decided to apply some caution; I created two DEBUG levels: DEBUG and DEBUGPRD.

DEBUG
Again, this level applies to stuff I won't miss on a production log. A very obvious example: move and copy ctors.

DEBUG_PRD
At this level is stuff that I won't regularly want on my production log, but that I may find occasionally useful, while investigating an issue I can't replicate in QA. E.g., the point at which a RAII class releases its resource.

And here ends the selective code exclusion of my logging setup. From here on, as far as exclusion goes, it's all or nothing.

INFORMATION

Business as usual.
  • We invoked some other platform and got back a success.
  • We recorded a purchase on our DB, with no error.
  • Someone invoked us to validate a token and it was valid.
This is the level where we start logging mainly functionality, rather than technical details - e.g., we log that we're running an SSH command, and we're not interested in the details of the process - what classes are being created, moved, etc.

Some may ask "Why is this in a production log?". Well, suppose you get a ticket about a customer transaction on your platform, that was erroneously approved; the customer's balance was inferior to the transaction amount and, as a result, the customer went in the red. Now, you need to find out what went wrong and how to make sure it doesn't happen again. You try to replicate it in QA, with no success. You set up a test in production (the horror!) and, lo and behold, you can't replicate the issue.

Well, no biggie. Since you're logging at INFORMATION level, before you do all this, you check the log for the request/response to the system that validated the customer's balance, to see where was the point of failure - if you got an - incorrect - OK on the customer balance validation, the analysis must be performed on the validation platform; if you got a NO-GO and your application went ahead anyway, then the investigation is on your app's end.

Without this knowledge, you don't know on which system to focus the effort, and this is where logging at the INFORMATION level can be invaluable.

NOTICE

As a general rule of thumb, I'd say: "Just like INFORMATION, but for business issues". Taking the example above, we validated the customer's balance, and it was negative; or the customer tried to login using the wrong password. These are not hard errors, but rather perfectly acceptable situations. Still, we take notice.

Why? Because, by doing so, we can then mine the logs for this information.

Let's assume, as is often the case, our system was built with "aggressive" goals; meaning, neither enough budget, nor time. Under this scenario, anything that's not "core functionality" will fall prey to the usual corner-cutting; and monitoring isn't usually considered "core functionality".

Which is a shame, because monitoring, when properly implemented (i.e., when the application/system is designed to either implement monitoring, or to facilitate its ad-hoc implementation), is the most valuable prevention tool I know.

And this is what a well-implemented NOTICE level gives us: a way to easily mine a log file for things that may require our attention (provided we have the resources to spare said attention, obviously) - we have a customer that has had 10 failed balance validations in the last hour; or we have a login with 12 failed attempts in the last hour (and, since we block the login for 15 minutes every 3 failures, we know someone has been "constantly" trying to login). We can fire an automated contact to our customer, saying "We have just detected such and such, and we'd like to know if you're having any difficulty and if we can be of assistance".

All this is easier to detect because we're logging events at their optimal level.

WARNING

Here, the balance begins shifting back a bit to the technical side of the pendulum. Adding to the business issues (e.g., a client's balance decreased over 20% in a single transaction), we can also log things like resource usage (are we getting uncomfortably close to any limit) or performance data (we have invoked platform X; the expected response time is 1 sec and it took 15 secs).

Some of this can be complemented with external monitoring processes - e.g., keeping watch of CPU/memory/disk usage. Meaning, the application itself doesn't have to include everything and the kitchen sink - after all, in a high throughput app, we may not want to burden the component responsible for running the transactions with the task of verifying its own performance. So, if this task was not budgeted for the app, it can be built later on by sweeping the logs.

ERROR

I'll start by saying something that should be obvious, but often isn't: Log only errors at this level. Namely:

  • Business errors. As stated above, a failed balance validation because of a negative balance is not an error; a failed balance validation because the validation platform says the account doesn't exist, is. Thus, in this context, "business errors" means "conditions which are not covered by any business rule, be it success or failure"; a negative balance is a failure, but it's covered by a business rule, so it's not an error.

  • Technical errors. The balance validation platform is unavailable. The DB update failed because it couldn't acquire a lock. The DB insert failed because of a failed check constraint. This last one may not be a technical error - if the UI doesn't validate the constraint and passes any input to the DB, I'd call this a business error, because the DB's check constraint is actually validating the input (actually, I'd call it a design error, unless someone could justify the enormous burden of maintaining the check both in the DB and the UI).

CRITICAL

From here on, the pendulum has swung all the way back to technical issues.

That said, there are business issues that can indicate a critical situation - if every customer balance validation for the last 5 minutes is always returning negative balance, I'd like to have an alarm ring, thank you very much; similarly, if every login in the past 5 minutes has failed, let's have some of that alarm ring, and be quick about it.

However, I think this analysis should be performed by a process created specifically for this, and not by the application itself. The application has a "login" operation, and this operation shouldn't have to know that its failed login is one among dozens that are failing at that same moment.

As such, I can't think of many single business issues that could be classified as critical. An example could be a special high-visibility promotion we're having on our online store, and we want anything at all related to this promotion to bubble to the surface immediately. So, we could, via some configuration, say that any issue related to this promotion ID (or product ID, or whatever) should be considered critical. Still, while I find this idea valid, I don't think it belongs in the logging framework.

On the technical issue side, what is critical? The balance validation platform has been unavailable for 5 minutes. The DB has been unavailable for 5 minutes. Our user repository has been unavailable for 5 minutes. Or they're available, but we're getting times way above the expected performance for 10 minutes. Or our resource usage has gone way beyond "uncomfortably close to the limit".

Again, this requires either monitoring processes, or that the application be developed to "look at the big picture". Here, I also tend towards removing these checks from the application to monitoring processes; but, and this may be just bias, implementing these checks in the application itself doesn't strike me as odd.

FATAL

Ah... what do we put here?

Application startup gives me a few ideas - can't create the DB connection pool; or can create the pool but the DB is unavailable. During startup, anything that's fundamental for the task(s) the application performs should be available, otherwise, we might as well shut down.

When the application is already in flight, I can think of a few issues, but mainly concerned with running out of resources - memory, file system, sockets, threads; and even these may nor merit a shut down - e.g., sockets may be freed, in which case the application may resume its normal work and the incident could be just a momentary blip on the radar.

tl;dr - The importance of choosing the correct logging level comes from how it can simplify (and, in come cases, actually make possible) the task of mining the logs for useful data; "useful" in the sense that it gives you an image of the application's status and allows you to take action based on said status.

No comments:

Post a Comment