Sunday, 7 December 2014

A simple class to read files

Yes, I'm olde enough to know "simple" is a misnomer.
I have a new addition to my bluesy github repo.
Part of what I do at work involves reading through user-un-friendly log files. How user-un-friendly? Just to give a small example, the cases where a "log record" is on a single line are the exception (I wish it was just the Exceptions), and there's usually no easy way to correlate the beginning/end of operations with a single grep. This means I spend a lot of time creating scripts/small programs to process those log files.
I'm not an enthusiast of shell scripting. Oh, I can persuade it to do some complex stuff, yes, but it always seems to obey me reluctantly, and I'm left with the nagging feeling that, somehow, it's laughing behind my back. Not to mention that the resulting scripts are very... spaghetti-like. I find it hard to think modularly when writing shell scripts. I had a similar issue with Perl. It's still there with Ruby, but being modular with Ruby is a lot easier for me.
However, I don't have this difficulty with C++. Or Java, or C#, or Delphi/Lazarus, for that matter. It must be some sort of mental block, because while I couldn't create a modular design in shell scripting to save my life, that comes as second nature when I'm working in C++.
So, when I have to create some kind of custom tool to process these files, most of the time I turn to C++. Thanks to C++1y/Boost, I don't take much longer to write it than I would with any other tool/language, and I definitely appreciate that when I need to change something a few weeks/months later, I can find my bearings much more quickly.
And, after creating a few very similar programs, all driven by a file-reading loop, I figured I had enough use-cases to create...


A class... actually, a class template, that reads lines from a file. Surprising, heh?
template <typename LineMatcher = SimpleLineMatcher,
    typename LineCounter = SimpleLineCounter<unsigned long>>
class FileLineReader : private LineMatcher, private LineCounter
The concept is simple - read a line, which becomes the current line, and give the caller a way to get it (or copy it). Then, we add some nuggets of convenience:
  • Keep a counter of read lines. Implemented by LineCounter.
  • Supply matching operations that not only perform matching on the current line, but also allow skipping lines based on matching/non-matching. The matching is implemented by LineMatcher, which FileLineReader then uses to implement the skipping.

Why inheritance, instead of composition? Because there will be cases where LineMatcher and LineCounter have no state, and a data member of an empty type still takes up space, which is a bit of a waste (yes, a tiny little bit of a waste). Can this be abused? Absolutely, but you know - protect against Murphy, not Machiavelli.
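To make the space argument concrete, here is a minimal sketch comparing the two layouts. EmptyMatcher, NoOpCounter, ReaderWithMembers and ReaderWithBases are hypothetical stand-ins invented for this illustration, not the classes from the repo.

```cpp
#include <string>
#include <type_traits>

// Hypothetical stateless policies, standing in for a matcher and a counter.
struct EmptyMatcher {
    bool Match(const std::string& line, const std::string& text) const {
        return line.find(text) != std::string::npos;
    }
};

struct NoOpCounter {
    void Increment() {}
};

// Composition: each empty member still needs its own address,
// so it occupies at least one byte (plus any padding).
class ReaderWithMembers {
    EmptyMatcher matcher_;
    NoOpCounter counter_;
    std::string current_line_;
};

// Private inheritance: the compiler is allowed to give empty bases
// no storage at all (the empty base optimization).
class ReaderWithBases : private EmptyMatcher, private NoOpCounter {
    std::string current_line_;
};

static_assert(std::is_empty<EmptyMatcher>::value, "policy has no state");
static_assert(std::is_empty<NoOpCounter>::value, "policy has no state");
```

On the usual compilers, sizeof(ReaderWithBases) comes out no larger than sizeof(ReaderWithMembers), and typically smaller. The exact numbers depend on the compiler and the standard library, so they are not quoted here.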

Skipping lines 

The first line skipping functions I introduced were SkipMatchingLines(), which skips lines while there's a match, and SkipLinesUntilMatch(), which skips lines until it finds a match.
These functions share a similar trait - their stop condition is met after reading the line that triggers the stopping condition. Suppose we have this file:

[2014-01-01 00:00:00.000] match-1 This is line 0
[2014-01-01 00:00:00.100] match-1 This is line 1
[2014-01-01 00:00:00.200] match-1 This is line 2
[2014-01-01 00:00:00.300] match-2 This is line 3
[2014-01-01 00:00:00.400] match-2 This is line 4
Something like this

FileLineReader<> flr{kFileName};
// Process lines
With SkipMatchingLines() skipping the "match-1" lines, this can only stop when the line "... line 3" is read, because there's no way we can perform a match against a line that hasn't been read yet. This means that when we reach "// Process lines", the current line will be the first line to process, so we should "process-then-read" (do-while), rather than the more intuitive "read-then-process" (while).
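The "process-then-read" shape can be sketched as follows. TinyLineReader is a hypothetical stand-in that reads from any std::istream and matches by plain substring search; its member names and signatures are assumptions made for this example and may not match the real FileLineReader.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the reader; not the real FileLineReader.
class TinyLineReader {
public:
    explicit TinyLineReader(std::istream& in) : in_(in) {}

    // Reads the next line into the current line; false at end of input.
    bool ReadLine() { return static_cast<bool>(std::getline(in_, line_)); }

    // Reads lines while they contain `text`. The first non-matching line
    // is left as the current line. Returns false if input ran out first.
    bool SkipMatchingLines(const std::string& text) {
        while (ReadLine()) {
            if (line_.find(text) == std::string::npos) return true;
        }
        return false;
    }

    const std::string& CurrentLine() const { return line_; }

private:
    std::istream& in_;
    std::string line_;
};

// Collects every line left after skipping the leading "match-1" block.
std::vector<std::string> LinesAfterSkip(std::istream& in) {
    TinyLineReader reader(in);
    std::vector<std::string> processed;
    if (reader.SkipMatchingLines("match-1")) {
        do {  // the current line is already the first one to process
            processed.push_back(reader.CurrentLine());
        } while (reader.ReadLine());
    }
    return processed;
}
```

A plain while (reader.ReadLine()) loop in the same spot would silently drop the first "match-2" line, because the skip has already consumed it.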

I've entertained the notion of using tellg()/seekg() to rewind the file (IOW, un-read the line), but I didn't even get started, after reading about how it behaves in text mode. So, I'll stick with "process-then-read", for now.

Another common scenario I encounter is skipping a certain number of lines, usually a header. So, I've added this:
void SkipNumberLines(unsigned int number_lines);

Because we're skipping lines independently of any matching, it didn't make sense to implement this in LineMatcher; so, I've implemented it straight in FileLineReader. I'm not completely happy with this solution, but I figured it's better than no solution.

Skipping dilemmas

I also thought it would make sense to keep semantic coherence between all the line-skipping functions. So, SkipNumberLines() should stop on the first line to process, not on the last line skipped (even though it could, because we're skipping a known number of lines), just like the other functions.
No biggie, we just perform an extra read - so, for SkipNumberLines(3), we actually read 4 lines. Lovely, everything's coherent, little ponies are happy, and fairies are spreading pixie dust and sneezes throughout the realm.
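A minimal sketch of that "skip N, then read one more" behaviour, written as a free function over std::istream. The name SkipNumberLinesCoherent and its signature are assumptions for illustration; the real function is a FileLineReader member.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Skips `number_lines` lines and then reads one more, leaving that extra
// line in `current_line`, so the caller lands on the first line to process.
// Returns false if the input ran out before that.
bool SkipNumberLinesCoherent(std::istream& in, unsigned int number_lines,
                             std::string& current_line) {
    // Note the <= : number_lines + 1 reads in total.
    for (unsigned int i = 0; i <= number_lines; ++i) {
        if (!std::getline(in, current_line)) return false;
    }
    return true;
}
```

With a three-line header, a call with number_lines set to 3 performs four reads and leaves the first data line in current_line, ready for a do-while loop.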
The thing is... nobody expects the Spanish Inquisition... their chief weapon is surprise and SkipNumberLines(0)... their two chief weapons are surprise, an almost fanatical devotion to the Pope, and SkipNumberLines(0)... OK, among their weaponry we can find elements as diverse as... blah, blah, blah, and SkipNumberLines(0).

Yep, SkipNumberLines(0).

Let's say you have a function that calculates how many lines to skip, based on, e.g., an input argument - it could be the type of file being processed, or the phase of the moon; then, you could use it like this:

You're aware CalculateNumberLinesToSkip() could return 0 (you could, say, be processing a file without a header). And I'd venture a guess that you would probably expect that SkipNumberLines(0) would skip, you know, a number of lines kinda equal to... zero. As in, more than -1, but definitely less than 1.

Which left me with two alternatives:
  1. Treat 0 as a special case. Which would force upon the caller the "specialness" of this case.
  2. Indulge in a fanatical devotion to coherence - read one line on SkipNumberLines(0), and consequences be damned.
  3. Forget coherence, and accept that even though they're all line-skipping functions, they can have different semantics/post-conditions.

OK, enough Spanish Inquisition jokes.
I've settled for alternative 3. I'm not entirely sure it's the best option, and as I get more use out of this class, I expect to find new scenarios and patterns, which will provide me with more data to revisit this design later. But, for now, it seems to be the safest choice.
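One possible reading of alternative 3, sketched over std::istream: the skip reads exactly the requested number of lines, so 0 reads nothing, and the caller goes back to "read-then-process". SkipNumberLinesExact and ProcessFile are hypothetical names, and the body of CalculateNumberLinesToSkip() is invented purely for the example.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads exactly `number_lines` lines; with 0 it reads nothing at all.
// Afterwards `current_line` holds the last skipped line (if any).
bool SkipNumberLinesExact(std::istream& in, unsigned int number_lines,
                          std::string& current_line) {
    for (unsigned int i = 0; i < number_lines; ++i) {
        if (!std::getline(in, current_line)) return false;
    }
    return true;
}

// Invented body: assume a two-line header when there is one.
unsigned int CalculateNumberLinesToSkip(bool has_header) {
    return has_header ? 2u : 0u;
}

// Because the skip stops on the last skipped line, the plain
// read-then-process while loop is the right shape here.
std::vector<std::string> ProcessFile(std::istream& in, bool has_header) {
    std::vector<std::string> processed;
    std::string line;
    if (SkipNumberLinesExact(in, CalculateNumberLinesToSkip(has_header), line)) {
        while (std::getline(in, line)) processed.push_back(line);
    }
    return processed;
}
```

The price is that the caller has to remember which loop shape goes with which skipping function: while after skipping by count, do-while after skipping by match.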

And there you have it. Quite modest, yes, but it has been saving me some boilerplate code in these last few weeks.

I don't know yet how it will evolve. I can feel an itch with regards to a text mode rewind, but I'll have to dive into streams and buffers much more deeply than I feel like, at the moment.

And right now I have other challenges awaiting... impatiently, I might add. A crash course in some obscure aspects of WebLogic administration to better diagnose socket problems without resorting to strace.

