Musings on Good Error Logging within Backend SystemsJuly 20, 2015
Let me start off by saying that good error logging can be pretty tricky to get right in any environment; for this post though, I’d like to talk about some of the challenges faced when these errors occur on server side applications.
Challenge: The errors being logged contain no useful information
Event logs where the entries only contain information like ‘Null Reference Exception’ and ‘Error’ are of little use to developers trying to debug an application. The more information they have the faster they’ll be able to develop fixes.
Solution: Make the error logs as verbose as possible
(Using C# as an example) C# exceptions provide several mechanisms which let you encode useful data into the exception which can then be put into the logs to enable better debugging such as:
Create your own exception types
For example, on Zoomph we have feeds which are IDed by an integer, if code requests a feed which can’t be found (read doesn’t exist) we throw a ‘FeedNotFoundException’. This has several benefits such as allowing calling code to easily differentiate types of errors as well as making the error logs more meaningful.
Creating custom exception types also has an added benefit of being able to add custom meta data to the exception classes. For instance in the aforementioned ‘FeedNotFoundException’ we have a field which is the ID of the feed which was requested but not found which can be useful in debugging. One word of caution on this though is to be careful about which data gets included in logs. If you’re writing code to handle errors occurring during users’ logging into your site it would be a good idea to log the username which was used, not the password though.
Use inner exceptions to your advantage
In C# exceptions can wrap other exceptions which allows for exceptions which travel up the call stack to have meaningful data added at various abstraction layers. This is useful since an error at one abstraction level such as a disk IO error makes sense in a data access layer but is rather meaningless in a presentation layer which only cares that needed data couldn’t be fetched.
Challenge: Bugs in one area cause errors to pop up in a different place
Let’s say there is a bug in system A which causes an object to be created with a missing (null) field. Now let us suppose this object is passed into system B which then emits a Null Reference Exception somewhere deep within its bowels; although the errors are popping up in system B the actual bug is in system A. These kinds of bugs can be extremely difficult to track down, especially in distributed systems.
Solution: Make use of assertions and sanity checks
In our above example, system A is emitting malformed data and system B is accepting malformed data. To prevent this place assertions/sanity checks at the boundaries which will ‘fail loudly’ if malformed data is detected.
Challenge: The sheer volume of log entries being generated is impossible to make sense of
When systems fail (and under normal operation sometimes) they can generate enormous amounts of logging data which no human could ever hope to parse through.
Solution: Make use of a taxonomy to organize the errors and some sort of search/analysis engine to make sense of them
Here on Zoomph we not only use Elastic Search to perform real time analytics for our customers, we also use it to make sense of logs. Having the ability to instantly sift through huge sets of logs as well as look at aggregates such as error rates is immensely helpful when digging through millions of logs entries trying to track down that single entry which begat the whole situation.
The above list is by no means exhaustive but has proven to be a good starting point for us on Zoomph to help keep our infrastructure under control.
Please feel free to share your thoughts/comments or suggestions in the comment box below!