Wednesday, 14 August 2013

What's in a log line?

Before we dive into my suggestion of format/content for a log line, I'd like to introduce an essential rule, that should be followed whenever possible:

1 log record == 1 line

Yes, I know, sometimes it's difficult (stack trace, I'm looking at you), and even if you somehow manage to place all the exception data (minus the stack trace) on the first line, it could still be useless (e.g., nested exceptions only appear after the first exception's stack trace, and those may actually contain the most relevant error data).

Still, engineering is about trade-offs, not perfection, so the above rule still holds. Whenever possible, make sure you output it all in one line.

Header

Now, then, what should such a line contain? We'll present the first fields, and group them in a logical entity, which we'll call header.

Timestamp
My recommended format is "YYYY-MM-DD HH:MM:SS.MMM", meaning: Year, month, day, hours, minutes, seconds, milliseconds (or a higher resolution, if you need it) - e.g., "2013-08-08 08:08:08.444".

I prefer this format because, no matter how you deal with it, it's automatically sorted - not only as a date, but also as a string, and as a number (if you remove the separators and concatenate date+time).

Session ID
A session begins when a user first accesses your application. At this point, you give that user an ID. If you log this ID, you can later retrace the user's steps - e.g., while investigating an incident. This ID should be unique within a reasonable time frame (e.g., there should be no repeated values in a 48-hour interval).

Obviously, a session has a timeout, so if the user lets the session expire, this continuum is broken.

One thing to remember is that everything should have a session ID - not only a user that interacts with your application via an UI, but also any process that calls services your application exposes. Any incoming request that has no session ID should be assigned one.

Request ID
During a session, a user will perform several operations - e.g., access the shop's home; see product A's details; see product B's details; see product B's reviews; etc. These operations translate into requests to the application. Each of these incoming requests should get its own ID.

Since these IDs will be generated in the context of a session, that's the context where they must be unique - each incoming request in a particular session should have its own request ID, but two sessions could have similar request IDs. The most simplistic implementation would be a counter that gets incremented for each request in a session.

An operation begins when the remote end sends a request to your application and ends when the application finishes its response to that request; this, in turn, can launch other requests/responses - e.g., when the user accesses the shop's home, the application may request data from the product repository, and check which products are on discount. All these subsequent requests/responses should be logged with the same request ID.

We should have a standard way of logging the beginning and end of each operation. These lines should be simple, and this is what I suggest:

timestamp [session ID] [request ID] [operation name] [logging level] BEGIN_END_B
...
timestamp [session ID] [request ID] [operation name] [logging level] BEGIN_END_E [result tag] [return code] [message]

Using a similar identifier for both lines (begin and end) simplifies the task of isolating these lines, which is invaluable, say, to quickly assess whether 1K failures is an urgent issue (1K failures in 1K invocations) or not (1K failures in 100K invocations). The use of suffixes ("_B" and "_E") gives us further filtering options.

The code and message may pose problems for systems that return a train of such codes and messages. Once again, it's a matter of trade-offs - ideally, each particular case should be dealt with individually, in order to log, for each case, the most relevant code/message; if that's not possible/practical, just pick one.

Operation name
Identifies the operation the user invoked - e.g., "access_shop_home", or "product_details", or "product_reviews".

We use the request ID when we need to retrace a particular invocation, and the operation name when we want a global view of what's going on with an operation, since it allows us to isolate all the lines (but, especially, the BEGIN/END lines) regarding that operation.

Logging level
The usual - TRACE; DEBUG, INFO, etc. Not much to add here.

The line, so far

So, taking as an example our BEGIN_END lines above, those lines could look like this:

2013-08-11 12:32:04.248 [21EC2020-3AEA-1069-A2DD-08002B30309D] [1] [product_details] [INFO] BEGIN_END_B
...
2013-08-11 12:32:06.842 [21EC2020-3AEA-1069-A2DD-08002B30309D] [1] [product_details] [INFO] BEGIN_END_E [SUCCESS_PD1] [0] [success]


By using just this data (i.e., ignoring the rest of the line), we may quickly assess our app's behaviour, based on a few important metrics - total invocations vs. total successes/failures, or operation response time.

Notice the success tag, "SUCCESS_PD1". Once again, we can search the log for success on this particular operation ("PD1" standing for product_details), or just for success ("SUCCESS_").

The return code and message are both optional - e.g,, when we return a page to the user, there's no result code or message.

In case of error, the result tag should be different for technical errors and failed business rules. Why? Because a technical error (e.g., database unavailable) requires an investigation by the support team, whereas a failed business rule (e.g., out of stock) is usually part of the app's normal functioning.

Detail

Now, we get to the detailed part of the log line. The following data should be logged in a structured format, and my format suggestion is JSON.

I'll start by saying that the field names don't have to be fully descriptive, provided there's adequate documentation. E.g., we can easily tell a name or an address, by looking at it. So, instead of something like

{ "customer name": "blah blah blah blah", "contact number": 12345678, "vendor name": "bleh bleh bleh bleh", "customer address": "blih Av, etc, etc, etc", "billing address": "bloh Park, etc, etc, etc" }

we could have simple tags

{ "cn1": "blah blah blah blah", "cn2": 12345678, "vn1": "bleh bleh bleh bleh", "ca1": "blih Av, etc, etc, etc", "ba1": "bloh Park, etc, etc, etc" }

While we can't easily tell what is cn2, as long as we have this documented, no problem. This could even be used across different applications, so that cn1 would always be the tag for "customer name".

System information

The first detailed section should contain what we'll call, in a somewhat loose fashion, system information. It should be signaled by an identifier (e.g., "BGN_SYS_").

As to what it actually contains, there's two types of data we may put in this section:
  • Actual system information, such as the remote IP or the invocation endpoint, either remote (when we're invoking another system), or local (when someone invokes one of our interfaces).
  • Metadata. E.g., is it an incoming call, or outgoing?
As we said above, when a user performs an operation on the application, it may launch several requests/responses. Everytime our app interacts with a remote system, we should log the beginning and the end of that interaction, much like we logged the beginning and end of the operation itself.

Operation data

All that's missing is the operation data. Here, we have two options:
  1. List the data (invocation arguments/return values) using the log's structured format (e.g., JSON).
  2. List the data (invocation arguments/return values) in the original protocol format (e.g., SOAP).
Ideally, we should log both. However, this may not be possible - e.g., because of system performance or log size.

If we need to decide which one to choose, we must answer some questions:
  • How easy is it to replicate the exact invocation (e.g., SOAP envelope) from the arguments?
  • Are there any interfaces where we routinely need to supply the exact invocation (e.g., we request an analysis from the interface's support team, and they require the exact invocation)?
If we choose option #1 (which should be on level INFO), we should implement option #2 on DEBUG.

However, for interfaces that return several return codes for a single invocation, it's probably better if we log the whole return structure. For interfaces that return long lists, these should only be logged on DEBUG.

A simple example

Now, let's look at an example of how it would look like.

2013-08-11 12:32:04.248 [21EC2020-3AEA-1069-A2DD-08002B30309D] [7] [product_details] [INFO] BEGIN_END_B BGN_SYS_ { "i_o": "INCOMING_PD1" } BGN_DATA_ { "PID1": 123 }

The user asked our app to show him the details of the product with the ID 123. This is not his first request, so the app gave him the request ID 7. It's an incoming request, for the product_details operation, which we've tagged in our app as PD1. The tag BEGIN_END_B marks this line as the beginning of the operation.

2013-08-11 12:32:04.512 [21EC2020-3AEA-1069-A2DD-08002B30309D] [7] [product_details] [INFO] BGN_SYS_ { "i_o": "OUTGOING_GPH1", "tag": "GPH1_BEG_END_B", "endpoint": "PRODUCT_DB", "method": "get_product_header" } BGN_DATA_ { "PID1": 123 }

The app executed the SP get_product_header on the PRODUCT_DB database. It's an outgoing request, and we've tagged this SP as GPH1. The tag GPH1_BEG_END_B marks this line as the beginning of the request.

2013-08-11 12:32:04.758 [21EC2020-3AEA-1069-A2DD-08002B30309D] [7] [product_details] [INFO] BGN_SYS_ { "i_o": "OUTGOING_GPH1", "tag": "GPH1_BEG_END_E", "endpoint": "PRODUCT_DB", "method": "get_product_header", "result": "SUCCESS_GPH1" } BGN_DATA_ { "PIID1": 7718876, "PV1": 201308123 }

The SP replied with the product's internal ID (PIID1) and version (PV1). The tag GPH1_BEG_END_E marks this line as the end of the request, and the tag SUCCESS_GPH1 tells us there were no errors.

2013-08-11 12:32:04.796 [21EC2020-3AEA-1069-A2DD-08002B30309D] [7] [product_details] [INFO] BGN_SYS_ { "i_o": "OUTGOING_GPD1", "tag": "GPD1_BEG_END_B", "endpoint": "http://remotehost.my.domain/remote_system/remote_interface", "method": "get_product_details" } BGN_DATA_ { "PIID1": 7718876, "PV1": 201308123 }

The app invoked a remote service, via http, to get the product details. It's an outgoing request, and we've tagged this service as GPD1. The tag GPD1_BEG_END_B marks this line as the beginning of the request.

2013-08-11 12:32:05.904 [21EC2020-3AEA-1069-A2DD-08002B30309D] [7] [product_details] [INFO] BGN_SYS_ { "i_o": "OUTGOING_GPD1", "tag": "GPD1_BEG_END_E", "endpoint": "http://remotehost.my.domain/remote_system/remote_interface", "method": "get_product_details", "result": "SUCCESS_GPD1", "code": 0, "message": "Request processed successfully" } BGN_DATA_ { "return": "<SOAP RESPONSE WITH A TRAIN OF RESULT CODES AND OTHER RESPONSE DATA>" }

The remote service replied successfully, as shown by the tag SUCCESS_GPD1, and returned a SOAP response, which we logged in its original format. The tag GPD1_BEG_END_E marks this line as the end of the request.

2013-08-11 12:32:06.842 [21EC2020-3AEA-1069-A2DD-08002B30309D] [7] [product_details] [INFO] BEGIN_END_E [SUCCESS_PD1] BEGIN_SYS_ { "i_o": "INCOMING_PD1" }

The app finished its response to the operation (the tag BEGIN_END_E marks this line as the end of the operation), i.e., the incoming request to show the details of a product. No errors occurred, hence the tag SUCCESS_PD1. There's no return code/message, since none was necessary.

Next, our user decided to purchase this product.

2013-08-11 12:35:04.248 [21EC2020-3AEA-1069-A2DD-08002B30309D] [8] [product_purchase] [INFO] BEGIN_END_B BGN_SYS_ { "i_o": "INCOMING_PP1" } BGN_DATA_ { "PID1": 123, "PIID1": 7718876, "PV1": 201308123 }

The user began the purchase process for the product with the ID 123. The app incremented the session's request ID counter, and gave him the request ID 8. It's an incoming request, for the product_purchase operation, which we've tagged in our app as PP1. The tag BEGIN_END_B marks this line as the beginning of the operation. We've cached the product's internal ID (PIID1) and version (PV1) so we don't have to retrieve it again from the DB.

2013-08-11 12:35:04.796 [21EC2020-3AEA-1069-A2DD-08002B30309D] [8] [product_purchase] [INFO] BGN_SYS_ { "i_o": "OUTGOING_GPA1", "tag": "GPA1_BEG_END_B", "endpoint": "http://remotehost.my.domain/remote_system/remote_interface", "method": "get_product_availability" } BGN_DATA_ { "PIID1": 7718876, "PV1": 201308123 }

The app invoked a remote service, via http, to determine product availability. It's an outgoing request, and we've tagged this service as GPA1. The tag GPA1_BEG_END_B marks this line as the beginning of the request.

2013-08-11 12:35:05.904 [21EC2020-3AEA-1069-A2DD-08002B30309D] [8] [product_purchase] [NOTICE] BGN_SYS_ { "i_o": "OUTGOING_GPA1", "tag": "GPA1_BEG_END_E", "endpoint": "http://remotehost.my.domain/remote_system/remote_interface", "method": "get_product_availability", "result": "ERR_BIZRULE_GPA1", "code": 8, "message": "Out of stock" } BGN_DATA_ { "return": "<SOAP RESPONSE WITH A TRAIN OF RESULT CODES AND OTHER RESPONSE DATA>" }

The remote service replied with a business rule failure, as shown by the tag ERR_BIZRULE_GPA1 (and logged it at NOTICE level). Our product is out of stock. The tag GPA1_BEG_END_E marks this line as the end of the request.

2013-08-11 12:35:06.342 [21EC2020-3AEA-1069-A2DD-08002B30309D] [8] [product_purchase] [NOTICE] BEGIN_END_E [SUCCESS_PP1, ERR_BIZRULE_PP1] [0] [This product is unavailable. Would you like to be alerted when it becomes available?] BEGIN_SYS_ { "i_o": "INCOMING_PP1" }

The app finished its response to the operation (the tag BEGIN_END_E marks this line as the end of the operation), i.e., the incoming request to show the details of a product. No technical errors occurred, hence the tag SUCCESS_PP1. However, we had a business error, due to the "out of stock" situation, so we signal it with the tag ERR_BIZRULE_PP1 (once again at NOTICE level, although here I admit I'm not quite so sure about it). We're not returning an error code, so we log 0, but we're presenting a message to the user, so we log this message (within practical limits).