Saturday, 27 December 2014

'Tis the Season...

... to be jolly, and grateful, and hopeful.
 

On the personal front...

...this year has been very positive, the only clouds on the horizon being some health issues. If I had to choose the most important points of 2014, I'd go with these:
 
  • The level of complicity and understanding between me and my wife has increased tremendously. I believe we achieved this by undergoing a similar change, which takes us right to the next point...
  • We both began going at life with a more relaxed attitude. It's a daily exercise to keep from slipping, but it's quite worth it. Obviously, it occasionally slips, but not only have we gotten better at identifying these slips, both on ourselves and on each other, we are also quicker to defuse these situations, often by sharing a good laugh at ourselves.
The kids are working their way through college, on what are the first steps of their own journey, their own adventure. It's time they take full control of the pen and start writing their story; that's their part. Our part is hoping everything they've taken in through the years will serve them well in getting their bearings as they set out. More than hopeful, we're confident it will.
 
I'm a somewhat spiritual guy, although I don't often stop to think about it. However, looking at 2014, I feel blessed; I already knew I'd found someone more understanding of my failings than I ever deserved, but this year took it to a new height. If I have so much to write about in the following point, it's because I was fortunate enough to find someone as understanding as my wife.
 
If you don't care much for mildly technical stuff, you can jump straight to the end of the post.
 

On the professional front...

...this has been a year where I continued a trend that began in mid-2013, namely, generalization.
 
When it all began in 2011, I had picked up C#. Then, because I felt I wasn't learning anything other than the language, I added C++. Actually, my plan was to add C++, but I ended up switching to C++, and C# was left behind. The learning experience was a lot more intense, not just about the language, but also about the whole system - processor, OS, environment, external libs, debugging, assembly, and a whole lot of etc.

Then, I accidentally set out on a task for which I found little help on the web. So, for much of what I was doing, I was on my own. While I did manage to get a working result (which is still a top-hitter on google.com and bing.com), I wasn't totally happy with it, and I suspected I'd have been even less happy if my knowledge about what I was doing wasn't so lacking. So, I stepped back and went back to basics. And I quickly learned that, indeed, the potential for "improvement" (i.e., correction) in what I'd done was much bigger than I had anticipated.
 
Then, as I began taking on more technical tasks at work, the generalization began - certificates, network, DNS, managing Linux, setting up environments according to specific constraints, managing web servers, managing Weblogic server, picking up new languages (e.g., Ruby), revisiting familiar languages (e.g., Java).
 
At the same time, I've taken the learning experience provided by C++ deeper - assembly debugging, system traces, building gcc from source and installing it without touching the system gcc, doing the same with perl. And I've also begun getting my feet wet with Javascript (bootstrap and query.js) and Android development.
 
And before I noticed, I had not only changed my course, but I was quite satisfied with that change. This is where I see myself going, becoming a generalist. I love solving problems, and you need a widespread knowledge to do it; you don't always find the cause of a problem at the same level, and I don't like getting stuck for lack of knowledge. I also don't like getting stuck for lack of system access, but that's the way things are when you work in a large-ish corporation.
 
So, here are my New Year's resolutions, a.k.a., goals for 2015:
  • Redesign my libssh2 + asio components, incorporating what I've learned in the meantime. And hoping that two years from now I may look at what I've done, say "What was I thinking??!!", and repeat this goal again, as I keep learning more.
  • Pick up a functional language, probably Haskell or Erlang. It's about time I get my feet wet on functional programming. I love what I've learned about generic programming in C++ (actually, what I'm still learning), but it's time to add something more to the mix.
  • Continue my exposure to Javascript and Android.
  • Deepen my knowledge of systems/network administration.
  • Increase my knowledge of low-level tracing/debugging. It's time to begin some serious experiments with loading up core dumps and getting my bearings, and getting more mileage out of stuff like Windows Performance Analyzer.
Too ambitious, you say? Yes, I know. I won't manage to do all of this? Probably not. Which is a good thing, otherwise 2016 would be a very boring year.
 

At the end of the day...

 
 
...our family wishes you and your loved ones a Merry Christmas and a Happy 2015, filled with love, joy, and peace.
 
 

Sunday, 7 December 2014

A simple class to read files

Yes, I'm olde enough to know "simple" is a misnomer.
 
I have a new addition to my bluesy github repo.
 
Part of what I do at work involves reading through user-un-friendly log files. How user-un-friendly? Just to give a small example, the cases where a "log record" is on a single line are the exception (I wish it was just the Exceptions), and there's usually no easy way to correlate the beginning/end of operations with a single grep. This means I spend a lot of time creating scripts/small programs to process those log files.
 
I'm not an enthusiast of shell scripting. Oh, I can persuade it to do some complex stuff, yes, but it always seems to obey me reluctantly, and I'm left with the nagging feeling that, somehow, it's laughing behind my back. Not to mention that the resulting scripts are very... spaghetti-like. I find it hard to think modularly when writing shell scripts. I had a similar issue with Perl. It's still there with Ruby, but being modular with Ruby is a lot easier for me.
 
However, I don't have this difficulty with C++. Or Java, or C#, or Delphi/Lazarus, for that matter. It must be some sort of mental block, because while I couldn't create a modular design in shell scripting to save my life, that comes as second nature when I'm working in C++.
 
So, when I have to create some kind of custom tool to process these files, most of the time I turn to C++. Thanks to C++1y/Boost, I don't take much longer to write it than I would with any other tool/language, and I definitely appreciate that when I need to change something a few weeks/months later, I can find my bearings much more quickly.
 
And, after creating a few very similar programs, all driven by a file-reading loop, I figured I had enough use-cases to create...
 

FileLineReader

A class... actually, a class template, that reads lines from a file. Surprising, heh?
 
template <typename LineMatcher = SimpleLineMatcher,
    typename LineCounter = SimpleLineCounter<unsigned long>>
class FileLineReader : private LineMatcher, private LineCounter
 
The concept is simple - read a line, which becomes the current line, and give the caller a way to get it (or copy it). Then, we add some nuggets of convenience:
  • Keep a counter of read lines. Implemented by LineCounter.
  • Supply matching operations that not only perform matching on the current line, but also allow skipping lines based on matching/non-matching. The matching is implemented by LineMatcher, which is then used by FileLineReader to implement the skipping.

Why inheritance, instead of composition? Because there will be cases where LineMatcher and LineCounter have no state, and a data member is a bit of a waste (yes, a tiny little bit of a waste). Can this be abused? Absolutely, but you know - protect against Murphy, not Machiavelli.
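To put the "tiny little bit" in perspective, here is a minimal sketch comparing the two layouts. StatelessMatcher, StatelessCounter, ComposedReader and InheritedReader are made-up names for illustration, not the classes in the repo, and the exact sizes depend on the compiler (some compilers need extra help to optimize away more than one empty base).

```cpp
#include <cstddef>
#include <string>

// Hypothetical stateless policies, for illustration only.
struct StatelessMatcher
{
    bool Matches(std::string const& line, std::string const& text) const
    {
        return line.find(text) != std::string::npos;
    }
};

struct StatelessCounter
{
    void Increment() {}
};

// Composition: each empty member still needs its own address,
// so it takes up at least one byte (plus any padding).
struct ComposedReader
{
    std::string current_line;
    StatelessMatcher matcher;
    StatelessCounter counter;
};

// Private inheritance: the compiler is allowed to give empty bases
// no storage at all (empty base optimization).
struct InheritedReader : private StatelessMatcher, private StatelessCounter
{
    std::string current_line;
};

// Compare these on your own compiler; no particular numbers are promised here.
constexpr std::size_t kComposedSize = sizeof(ComposedReader);
constexpr std::size_t kInheritedSize = sizeof(InheritedReader);
```

Printing kComposedSize and kInheritedSize on each toolchain shows how much (or how little) is actually saved.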
 

Skipping lines 

The first line skipping functions I introduced were SkipMatchingLines(), which skips lines while there's a match, and SkipLinesUntilMatch(), which skips lines until it finds a match.
 
These functions share a trait - they can only detect their stop condition after reading the line that triggers it. Suppose we have this file:

[2014-01-01 00:00:00.000] match-1 This is line 0
[2014-01-01 00:00:00.100] match-1 This is line 1
[2014-01-01 00:00:00.200] match-1 This is line 2
[2014-01-01 00:00:00.300] match-2 This is line 3
[2014-01-01 00:00:00.400] match-2 This is line 4
 
Something like this

FileLineReader<> flr{kFileName};
flr.SkipMatchingLines("match-1");
// Process lines
 
can only stop when the line "... line 3" is read, because there's no way we can perform a match against a line that hasn't been read yet. This means that when we reach "// Process lines", the current line will be the first line to process, so we should "process-then-read" (do-while), rather than the more intuitive "read-then-process" (while).

I've entertained the notion of using tellg()/seekg() to rewind the file (IOW, un-read the line), but I didn't even get started, after reading about how it behaves in text mode. So, I'll stick with "process-then-read", for now.

Another common scenario I encounter is skipping a certain number of lines, usually a header. So, I've added this:
   
void SkipNumberLines(unsigned int number_lines);

Because we're skipping lines independently of any matching, it didn't make sense to implement this in LineMatcher; so, I've implemented it straight in FileLineReader. I'm not completely happy with this solution, but I figured it's better than no solution.
 

Skipping dilemmas

I also thought it would make sense to keep semantic coherence between all the line-skipping functions. So, SkipNumberLines() should stop on the first line to process, not on the last line skipped (even though it could, because we're skipping a known number of lines), just like the other functions.
 
No biggie, we just perform an extra read - so, for SkipNumberLines(3), we actually read 4 lines. Lovely, everything's coherent, little ponies are happy, and fairies are spreading pixie dust and sneezes throughout the realm.
 
The thing is... nobody expects the Spanish Inquisition... their chief weapon is surprise and SkipNumberLines(0)... their two chief weapons are surprise, an almost fanatical devotion to the Pope, and SkipNumberLines(0)... OK, among their weaponry we can find elements as diverse as... blah, blah, blah, and SkipNumberLines(0).

Yep, SkipNumberLines(0).

Let's say you have a function that calculates how many lines to skip, based on, e.g., an input argument - it could be the type of file being processed, or the phase of the moon; then, you could use it like this:

flr.SkipNumberLines(
    CalculateNumberLinesToSkip(some_relevant_argument));
 
You're aware CalculateNumberLinesToSkip() could return 0 (you could, say, be processing a file without a header). And I'd venture a guess that you would probably expect that SkipNumberLines(0) would skip, you know, a number of lines kinda equal to... zero. As in, more than -1, but definitely less than 1.

Which left me with two alternatives:
  1. Treat 0 as a special case. Which would force upon the caller the "specialness" of this case.
  2. Indulge in a fanatical devotion to coherence - read one line on SkipNumberLines(0), and consequences be damned.
  3. Forget coherence, and accept that even though they're all line-skipping functions, they can have different semantics/post-conditions.

OK, enough Spanish Inquisition jokes.
 
I've settled for alternative 3. I'm not entirely sure it's the best option, and as I get more use out of this class, I expect to find new scenarios and patterns, which will provide me with more data to revisit this design later. But, for now, it seems to be the safest choice.

And there you have it. Quite modest, yes, but it has been saving me some boilerplate code in these last few weeks.

I don't know yet how it will evolve. I can feel an itch with regard to a text mode rewind, but I'll have to dive into streams and buffers much more deeply than I feel like, at the moment.

And right now I have other challenges awaiting... impatiently, I might add. A crash course in some obscure aspects of Weblogic administration to better diagnose socket problems without resorting to strace.

 

Sunday, 14 September 2014

Performance puzzle(d)

So, back to programming.
 
Every now and then, I pick up a programming puzzle. The goal is not only to solve the puzzle, but also to learn more about the performance of my solutions.
 
A few days ago, I decided to take a shot at making a Cracker Barrel solver. Simple stuff, just brute-force your way through solving the puzzle. First, I started with a standard 15-hole board. Then, I moved on to a configurable board (but never smaller than 15 holes). Then, I implemented getting the board setup (including the list of all possible moves) just from the number of holes. Then, I decided that this list would be kept in a std::array, which meant getting this information at compile-time, which gave me an excuse for a little bit of basic recursive template meta-programming lite.
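For a taste of what that kind of recursive template looks like, here is a minimal sketch that derives the number of rows of a triangular board from the number of holes at compile time. RowsForHoles is a hypothetical name and this is not the solver's actual code; it only assumes a triangular board, where r rows hold r * (r + 1) / 2 holes.

```cpp
#include <cstddef>

// Recursive case: keep adding rows until the triangular number
// Rows * (Rows + 1) / 2 reaches the requested number of holes.
template <std::size_t Holes, std::size_t Rows = 1,
    bool Done = (Rows * (Rows + 1) / 2 >= Holes)>
struct RowsForHoles
{
    static constexpr std::size_t value =
        RowsForHoles<Holes, Rows + 1>::value;
};

// Base case: the recursion stops here, and we check that the board
// really is a triangle.
template <std::size_t Holes, std::size_t Rows>
struct RowsForHoles<Holes, Rows, true>
{
    static_assert(Rows * (Rows + 1) / 2 == Holes,
        "Holes must be a triangular number (15, 21, 28, ...)");
    static constexpr std::size_t value = Rows;
};

// The three board sizes mentioned in this post.
static_assert(RowsForHoles<15>::value == 5, "15 holes make 5 rows");
static_assert(RowsForHoles<21>::value == 6, "21 holes make 6 rows");
static_assert(RowsForHoles<28>::value == 7, "28 holes make 7 rows");
```

Since the result is a compile-time constant, it can be used wherever one is required, e.g., as a std::array size.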

Yes, I've forced feature-creep on myself. Ah, and no design optimization - e.g., I didn't use STL's bitset<N> (or even C bit-twiddling) for the board, and I've created a good-ole struct for the moves, with three integers.

I've built it with GCC (mingw32/Qt Creator) and MSVC 2013. For a board with 15 holes, both were immediate. When I moved to 21 holes, the GCC exe took a few seconds, but MSVC took almost... 1 minute?! With 28 holes, Qt Creator takes forever, and so does MSVC.

The fact that this takes forever doesn't surprise me. As I said, I've made no optimizations. However, the difference between GCC and MSVC puzzled me. So, I turned to Windows Performance Analyzer (WPA) to figure out what was going on.

Now, I had a good guess about what was going on - I was sure that, compiler optimizations notwithstanding, there was probably a lot of copying going on. There certainly are a lot of vectors getting constructed, and I was expecting to find my main culprit along those lines.

So, I fired up WPA, loaded the trace file, selected the CPU Usage (Sampled) graph, changed it to Display table only, opened the View Editor and set the Stack (call stack) to Visible.

And this was flagged as the most time-consuming operation of all:

std::_Equal_range<GameMove const *,GameMove,int,
    bool (__cdecl*)(GameMove const &,GameMove const &)

Sometimes, life proves more interesting than anticipated. However, this time it was just me not paying attention.

When I say I've made no optimizations, I didn't even go for the most important optimization of all - an intelligent algorithm. This means that when I calculate all the valid moves after each move, I go through the entire board, including spots for which no move is possible. And I'm using sorted vectors/arrays and equal_range() to perform this search.
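To make the lookup concrete, here is a minimal sketch of searching a sorted list of moves with equal_range() and a plain comparison function. The member names (from, over, to), CompareByFrom and MovesFrom are assumptions for illustration; the solver's real struct and comparison may look different.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical move: the peg in `from` jumps over `over` and lands on `to`.
struct GameMove
{
    int from;
    int over;
    int to;
};

// Orders moves by starting hole only, so all the moves that start
// on the same hole end up next to each other.
bool CompareByFrom(GameMove const& lhs, GameMove const& rhs)
{
    return lhs.from < rhs.from;
}

// Returns every move that starts on `hole`.
// Precondition: `sorted_moves` is sorted with CompareByFrom.
std::vector<GameMove> MovesFrom(std::vector<GameMove> const& sorted_moves,
    int hole)
{
    GameMove const probe{hole, 0, 0};
    auto const range = std::equal_range(sorted_moves.begin(),
        sorted_moves.end(), probe, CompareByFrom);

    return std::vector<GameMove>(range.first, range.second);
}
```

Each lookup is cheap on its own; the cost comes from repeating it for every hole on the board after every single move, including the holes where no move is possible.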

So, while the result was surprising, it shouldn't really have been.

Next step - find a more intelligent way to get the list of valid moves.
 

Saturday, 30 August 2014

Extremely destructive

Off-topic, today. Quite so, actually. If you came here to read about programming, you may want to give this one a miss.
 
There's this light comedy French film called "Le Placard" ("The Closet" in English, "Sai do Armário" in Portuguese). Quite enjoyable, BTW, recommended.
 
The IMDB summary says it all:
A man spreads the rumor of his fake homosexuality with the aid of his neighbor, to prevent his imminent firing at his work.
The man's elderly neighbour is actually gay, and at some point mentions the irony of him being fired for being gay back in the day, when he was young, and how these days what happens is the exact opposite, because of anti-discrimination laws/fear of lawsuits/whatever. Yes, I know it's an over-simplification, but it's a light comedy, not a documentary.
 
However, it's no secret that, if we go back a few decades, being gay could land you in trouble at the workplace (and land you out of the workplace). It was a totally unfair practice, where your professional performance would be ignored and you'd be judged exclusively on (selected parts of) your personal profile. And if you have any doubt on how unfair and tragic this was, just read Alan Turing's bio, and think on how many "anonymous" people suffered similar fates.
 
Fast-forward to the 21st century, and things are looking a lot better... or so I thought. Until I witnessed the whole Brendan Eich debacle.
 
Speaking solely for myself, the only thing it has achieved was to make me more cautious in providing my support to a cause I've always considered worthwhile (and still do) - equal rights for everyone. That's the way I've always voted, and that's the way I've always seen myself voting. But now that I've witnessed how the previously-persecuted can do the exact same thing to others with such abandon and glee, I'm left wondering - if I'm ever called to vote on any subject-related matter, what will I do? As the guy sang, "I don't know the answer".

You see, the thing is... I don't like mobs, and I'm not particularly amenable to the argument "We're a mob, but it's for a good cause, a just cause". There is no "good cause" in mob mentality, only rabid persecution.
 
However, this incident was months ago, why write about it now? Well, because now we have "the mobs" up in arms again, this time in the "gaming industry".
 
I'm an on/off gamer. I'm a Steam customer, and a very good GOG customer (if there's one thing I've learned, it's to vote with my wallet and reward those that actually care about their customers). These days, I sometimes go for months without playing a game. Before I got back into programming, I could spend all my free time playing games. Now, gaming is relegated to 3rd place, behind programming and guitar-playing (I'm excluding Real Life, which means family, friends and work, obviously).
 
I've never had any problem in calling myself a gamer, even as it stopped being part of what I did. And then, slowly but surely, the word "gamer" began to take on very specific meanings, all of them negative. It didn't bother me much. It's like reading "All Portuguese are lazy bastards, living off everyone else's money" - this says more about the speaker than about the Portuguese people.
 
Anyway, now it all came to a climax with this last brouhaha. I'm sure you can find the... "relevant information" (for lack of more appropriate words) and links. I'll just link to the best summation I've seen of this whole deal:
So many interesting conversations and things to challenge. But you can't have a discussion during a riot. It's a shame.
Indeed. What you have here is a) a few people who should be detained for online harassment/threats; b) some people on both sides who have nothing to say except indulge in some feces-throwing (tribalism at its best); c) some people (fortunately, more numerous), again on both sides, who are raising up important issues that should be discussed (but, as Shamus put it, not "during a riot"); and d), (going out on a limb here) I'm willing to bet you have an even greater number of people who are just looking at this whole thing from the sideline, thinking "God, what a waste".
 
Like the "Eich incident", this has just contributed to make me more skeptical towards a cause I would definitely lend my unconditional support to - namely, equal treatment, be it of gender or any other differentiating aspect.
 
Because, once again, I don't care what is moving a persecuting mob. At the end of the day, it's still a persecuting mob.
 
Actually, I'm just about to finish, so I believe it's a good time to let Godwin in to make an argumentum ad Nazium (Nauziem, maybe?) - I see absolutely no difference between a mob that's going to indiscriminately shoot Jews, calling them all sub-humans, and a mob that's going to indiscriminately shoot Germans, calling them all Nazis. I, for one, am glad we had the "Nuremberg trials", not the "Nuremberg massacres".
 
So, if all you want is to stoop down to the lowest level of what you have elected as your "opposing tribe", I'll just stand by the sideline, watching, and taking notes. Those are notes I'll reread and evaluate the next time I'm called to vote, either with my ballot or my wallet.
 

Sunday, 24 August 2014

Productivity - Cakes and lies

On my "C++ and Ruby - Side by side" post, I mentioned having a natural affinity towards a programming language. Today, I'll elaborate on that, using something I've worked on recently to illustrate.
 
Important note: Nothing I write below is an indictment/endorsement of this or that language. What I'm going for is the exact opposite - whenever you read a book/article about a language (any language) that promises greater productivity just by switching to that language, you should be suspicious.
 

The problem

I've never liked the way IDEs (in my case, Visual Studio/Qt Creator) define their project structures, and I've been trying to set up my own structure, with the goal of supporting multiple tools.
 
Unfortunately, the tools prefer their own particular structure arrangements, and aren't very cooperative with anything that strays from their comfort zone.
 
So, in the end, I've settled for a two-step procedure - I create the projects using the IDEs and then I run a process I created to move the project to my structure. The moving part is easy, and I could do it manually. The tinkering with the project settings files is more tricky, and is what ended up giving me the final push to look at automating the whole thing.
 
I've had three attempts at creating this process.
 

Perl

The first version of this process was a Perl script. Why Perl? Well, because... scripting language... well-suited to small hacks... higher productivity... y'know?
 
It was easy to create a linear script. However, a few weeks later, when I wanted to introduce some changes, it was also easy to get lost navigating around said script, trying to get a grasp of what I had done.
 
So, I went through several redesign/refactor iterations, trying to move from a linear script to something a bit more structured.
 
I finally settled on a version that lasted several months, with Modern::Perl and Moose as foundations. As I used this script, I became aware of several weaknesses in my original design. So, a few weeks ago, I decided to review my design, and proceeded to change the script. And, again, I was lost. Even more so than in the linear script, in fact.
 
I've had to review all the scripts/modules I created in order to understand again what I had done at that time (docs? Come on, it's a simple script to create some directories, copy a few files, edit a couple of those, and init git). And, since I had to do that, I've decided to have another go at it, but this time in...
 

Ruby

Why Ruby? Well, because... scripting language... well-suited to small hacks... higher productivity... y'know?
 
It began well. I removed some syntactic quirks, especially where OO-ness was concerned. Strange as it may seem, a clean language presents a greater potential for a clear design. Maybe it's just me, maybe I'm easily distracted by these syntactic quirks, which I admit should not be quite so important.
 
Long story short, it began well, but... I started having growing difficulties implementing my design. I finally decided to try...
 

C++

Yes, predictable, I know...
 
Why? Because I've decided that maybe "lower productivity" was worth a shot.
 
And it went smoothly. Not "easily", not "simply"; but definitely smoothly. I've actually finished the process with the design I wanted. I can actually navigate around the code, easily finding what I'm looking for.
 
What was I aiming at? Here, look at my "main()":
 
void Run(int argc, char *argv[])
{
    AppOptions<ConfigProjectOptions>
        ao{argc, argv, "Opções ProjectConfig"};
 
    if (ao.HaveShownHelp())
    {
        return;
    }
 
    ConfigProjectOptions const& opt = ao.GetOptions();
    string project_name = opt.GetProjectName();
 
    // All objects are validated on construction.
    ProjectDirectory prj_dir(PROJ_PRJ, project_name, 
        STRUCT_PRJ, SHOULD_NOT_EXIST);
    ProjectDirectory bld_dir(PROJ_BLD, project_name, 
        STRUCT_BLD, SHOULD_NOT_EXIST);
    ProjectDirectory stg_dir(PROJ_STG, project_name, 
        STRUCT_STG, SHOULD_EXIST);
 
    Project<QtcStgValidator, QtcCopier, QtcProjectConfigUpdater>
        qtc_project{project_name, prj_dir, bld_dir, stg_dir};
    Project<MsvcStgValidator, MsvcCopier, MsvcProjectConfigUpdater>
        msvc_project(project_name, prj_dir, bld_dir, stg_dir);
 
    // Everything is valid, get user confirmation.
    if (!UserConfirms(project_name))
    {
        return;
    }
 
    prj_dir.CreateStructure();
    bld_dir.CreateStructure();
 
    if (opt.WantGit())
    {
        ConfigureGit(prj_dir.GetProjectHomeDir());
    }
 
    if (opt.IsQtcProject())
    {
        qtc_project.Copy();
    }
 
    if (opt.IsMsvcProject())
    {
        msvc_project.Copy();
    }
}

This is what I was after all along, but was unable to achieve either with Perl or Ruby. It's as clean as it gets, with the main classes clearly identified, based on the operations that I need to do, and with support classes implementing policies that actually take care of the different ways things are done.
 
The code itself is quite simple (this is a trivial program, after all), and you can find it here.
 
I probably could have achieved the same with Ruby, but this is what I'm talking about when I say "natural affinity". With C++, this code structure flowed naturally; with Ruby, not so much.
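For readers who haven't met policy classes before, here is a minimal sketch of the shape used above: one main class template, with the tool-specific behaviour supplied as template arguments. Everything in it is hypothetical (the Fake* policies, their member functions, the log vector standing in for real file operations); it is not the code from the repo.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical policies. The real ones would check directories, copy
// files and edit IDE settings; these just record what they'd do.
struct FakeValidator
{
    bool IsValid(std::string const& project_name) const
    {
        return !project_name.empty();
    }
};

struct FakeCopier
{
    void CopyFiles(std::string const& project_name,
        std::vector<std::string>& log) const
    {
        log.push_back("copy " + project_name);
    }
};

struct FakeConfigUpdater
{
    void UpdateConfig(std::string const& project_name,
        std::vector<std::string>& log) const
    {
        log.push_back("update " + project_name);
    }
};

// The main class only knows the sequence of operations;
// the policies know how each tool wants them done.
template <typename Validator, typename Copier, typename ConfigUpdater>
class Project : private Validator, private Copier, private ConfigUpdater
{
public:
    // Validated on construction, so an existing object is always usable.
    explicit Project(std::string name) : name_{std::move(name)}
    {
        if (!Validator::IsValid(name_))
        {
            throw std::invalid_argument{"invalid project"};
        }
    }

    void Copy(std::vector<std::string>& log) const
    {
        Copier::CopyFiles(name_, log);
        ConfigUpdater::UpdateConfig(name_, log);
    }

private:
    std::string name_;
};
```

Supporting another tool then means writing three small policy classes, without touching Project itself.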
 

What's all this about, then?

I'm repeating myself, but I'll say it anyway.
 
Don't trust productivity promises at face value, especially for quick hacks/trivial programs, where everyone says shell/scripting languages are the best choice. Sometimes, the cake is actually a lie.
 
If all you want is to get a count of a particular string/regex in a log file, grep is the way to go. But say you need to do some manipulation on the results - e.g., the customer finds out he doesn't need a simple total count, but rather a list of totals for each key (e.g., customer ID). Suddenly, you're reading man pages/docs and searching the web, finding awk/perl/whatever "solutions" that don't quite give you what you want; so, you fiddle with those solutions and read some more man pages/docs.
 
And, then, you look at the clock, see how much time has passed and say - if I had fired up MS Access, I'd probably have written a little VBA, finished this and moved on (yes, "MS Access" and "VBA" are just examples, not endorsements).
 
At the end of the day, just because it's the best option for someone else, doesn't mean it's the best option for you.
 

Saturday, 16 August 2014

Still here, still going

I'm back, after another long absence.
 
I went on vacation, which meant the weeks before were spent on the pre-vacation rush, where you work like crazy to leave things in a state that can then be managed by the rest of the team.
 
After a few days winding down, I started coding again, mainly tackling code puzzles in C++. I've also been refactoring (and redesigning) the small program I was going to use for the "next post" I promised on my last post (no, it's not forgotten). And what else?
 

Building GCC

I'll kick this off by saying: Hats off to the folks behind GCC! Well done! I've built it a couple of times, with different configurations, on a CentOS VM, and it was totally effortless, just fire-and-forget. This time, instead of downloading and building each prerequisite by itself, I've decided to use every automatism available. So, I've reread the docs more carefully, and found this (which I had missed the first time around):
Alternatively, after extracting the GCC source archive, simply run the ./contrib/download_prerequisites script in the GCC source directory. That will download the support libraries and create symlinks, causing them to be built automatically as part of the GCC build process. Set GRAPHITE_LOOP_OPT=yes in the script if you want to build GCC with the Graphite loop optimizations.

GRAPHITE_LOOP_OPT=yes is the default, but it doesn't hurt to check it before running the script.
 
I've also noted that it is now possible for all prerequisites to be built together with GCC. A few months back, when I first did it, this was possible only for GMP, MPFR, and MPC (or so the docs said, at the time).
 

Linux distros

After playing around with a CentOS VM, I've decided I needed something else.
 
What did I need?
 
Something more "bleeding edge", that gave me simple access to more up-to-date software. Simple as in "no need to set up every repo known to man" and, at the same time, as in "no need to manually configure every damn piece of software on the system". After some research, I chose Fedora 20.
 
Why did I need it?
 
I'm going to start taking a closer look at some open-source software and, while I've become quite comfortable with building OSS on Windows, I'd rather just install it from a repository.
 
Couldn't I get by with CentOS?
 
Not really. I've decided to use FileZilla as a pilot for this, and build it. Even after adding some non-official repositories on CentOS (e.g., EPEL), it was still a pain to get all the necessary packages in the required versions. On Fedora? I was running make in a matter of minutes.
 
I may have to go for a memory upgrade, since I didn't design my system specs with virtualization in mind. But, for now, I'll make do without it.
 
I did have to install a different desktop. Not only did I find GNOME Shell (Fedora's default) non-intuitive, but it was also resource-consuming. Fedora responded a lot slower than CentOS, and both VMs had the same characteristics. I switched to MATE, and it's much more responsive.
 
I understand the need for a unified desktop experience across all devices, and I accept it brings an advantage both to the (average) user and the developer. However, not only am I perfectly capable of dealing with different paradigms on different devices, I actually prefer it that way. AFAIC, it makes sense that different devices require different experiences. On a desktop, GNOME Shell doesn't make sense for me; just like Unity, on Ubuntu, didn't; same for Windows 8 (although, to its credit, MS is making corrections). But with Linux we can, at least, switch desktops.
 
Anyway...
 
I expect to be absent again for a few weeks, since I'm going to enter the post-vacation rush, where you work like crazy to pick up everything that was left behind while on vacation.
 

Sunday, 6 July 2014

C++ and Ruby - Side by side

Yes, I know, choosing this blog's name wasn't my finest moment.
 
A couple of weeks ago I had to create a process to cross-validate files. Each line in those files is identified by a key (which can be repeated), and we have to sum all the values for each key in each file and compare those sums.
 
So, I whipped up a Ruby script in a couple of hours, and it took me a few more hours to refine it. As a side note, the "Refinement To Do" list has been growing, but my priorities lie elsewhere, so this will stay at "good enough" for now.
 
Then, when I got home, I decided to replicate that Ruby script in C++. I didn't go for changing/improving the design (although I did end up changing one detail), just replicating it.
 
And I was pleasantly surprised.
  • It also took me a couple of hours.
  • The C++ code was not noticeably "larger" than its Ruby counterpart.
 
Yes, you read it right. Not only was my "C++ productivity" on par with my "Ruby productivity", but also the resulting code was very similar, both in size and in complexity.
 
Let's take a look at it, shall we?
 
Important: The Ruby code, as presented in this post, may not be legal Ruby code, i.e., if copy/pasted on a script, may cause errors. My concern here was readability, not correctness.
 

Paperwork, please

While the Ruby script needed no "paperwork", the C++ program needed some.

This was the only non-standard header I included:
 
#include <boost/algorithm/string.hpp>
using boost::split;
using boost::is_any_of;
using boost::replace_first_copy;

And these were the definitions I had in C++:
 
using Total = double;
Total const VALUE_DIFF = 0.0009;
Total const ZERO_DIFF = 0.0001;
 
using CardEntries = map<int, Total>;
using CardEntry = pair<int, Total>;
using SplitContainer = vector<string>;
 
#define NOOP
 

Support structures

We created a structure to gather the data necessary to process the files. Here it is in Ruby.
 
class TrafficFile
  attr_reader(:description, :filename,
    :key_position, :value_position, :calc_value)
 
  def initialize(description, filename, key_position,
    value_position = 0, &calc_value
  )
    @description = description
    @filename = filename
    @key_position = key_position
    @value_position = value_position
    @calc_value = calc_value
  end
end

The key position is the number of the field in the file that contains the key. The value position is the number of the field that contains the value to add. Why do we also have a code block to calculate the value? Because it turned out that one of the validations had to be performed by adding three fields, instead of just one, so I decided to add the code block.
 
It would've been easier to turn value_position into an array, but I wanted to experiment with code blocks. So, I did what I usually do in these situations - I set myself a deadline (in this case, 30 minutes); if I don't have something ready when the time comes, I switch to my alternative - usually simpler - design.
 
And here is its counterpart in C++. I've eliminated the value_position/calc_value dichotomy and decided to use lambdas for every case.
 
template <typename T>
struct TrafficFile
{
    TrafficFile(string desc, string fn, short kp, T cv) :
        description{desc}, filename{fn}, key_position{kp},
        calc_value{cv} {}
 
    string description;
    string filename;
    short key_position;
    T calc_value;
}; 
 
Note: Yes, I'm passing objects by value (e.g., the strings above). Whenever I don't foresee an actual benefit in passing by reference, I'll stick to pass-by-value.
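
A related idiom, for what it's worth: when a constructor takes its arguments by value and then stores them, the parameters can be moved into the members, so a temporary argument is never copied. Here's a minimal sketch of that idea, not the code from the program above; `Labelled` and its members are made-up names for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical type, for illustration only: it takes its strings by value
// and moves them into the members ("sink" parameters).
struct Labelled
{
    Labelled(std::string desc, std::string fn) :
        description{std::move(desc)}, filename{std::move(fn)} {}

    std::string description;
    std::string filename;
};
```

With this, something like `Labelled{"Resumo", prefix + "_rsm.csv"}` builds each string once and moves it into place; passing a named string costs one copy plus one move, and the caller's string is left untouched.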
 

Little Helpers

The filenames have a date, in the format MMYYYY. We extract this from the name of one particular file, which serves as an index to all the other files.
 
Here's the Ruby method.
 
def get_file_date(filename)
  return filename.split('_')[8].split('.')[0]
end
 
And here's the equivalent in C++, which is pretty much... well, equivalent. It's not a one-liner, but other than that, it's exactly the same - two splits.
 
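string GetFileDate(string filename)
{
    SplitContainer c;
    split(c, filename, is_any_of("_"));
    // Copy the field first, so we don't split an element of c into c itself
    string const dated_field = c[8];
    split(c, dated_field, is_any_of("."));
    return c[0];
}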
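
For comparison, the same two splits can be done with the standard library alone. This is a minimal sketch, not what the program uses: `NthField` and `FileDateOf` are made-up names, and the filename in the note below is invented just to have nine underscore-separated fields.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the n-th (0-based) field of text,
// using sep as the separator. Throws if that field cannot be read.
std::string NthField(std::string const& text, char sep, std::size_t n)
{
    std::istringstream in{text};
    std::string field;
    for (std::size_t i = 0; i <= n; ++i)
    {
        if (!std::getline(in, field, sep))
        {
            throw std::out_of_range{"not enough fields"};
        }
    }
    return field;
}

// Same idea as the two splits above: field 8 by '_', then field 0 by '.'.
std::string FileDateOf(std::string const& filename)
{
    return NthField(NthField(filename, '_', 8), '.', 0);
}
```

Given a hypothetical name such as `a_b_c_d_e_f_g_h_062014.csv`, `FileDateOf` yields `062014`. When the requested field is missing, this version throws instead of relying on `c[8]` being there.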
 

Main 

Now, the entry point. This is where we have the most differences between Ruby and C++, because we're always passing a lambda to calculate the total in the C++ version. Other than that, it's similar.
 
We're creating an object that contains all the data to describe each file being validated and to gather the data for that validation.
 
Here it is in Ruby.
 
file_date = get_file_date(ARGV[0])
 
tel_resumo =
  TrafficFile.new("Telemóvel (resumo)",
    "#{file_date}_tlm.csv", 3, 15)
resumo =
  TrafficFile.new("Resumo", "#{file_date}_rsm.csv", 8, 13)
compare(tel_resumo, resumo)
 
detalhe =
  TrafficFile.new("Detalhe", "#{file_date}_det.csv", 4, 22)
tel_detalhe =
  TrafficFile.new("Telemóvel (detalhe)",
    "#{file_date}_tlm.csv", 3) do |fields|
    fields[4].gsub(',', '.').to_f() +
    fields[8].gsub(',', '.').to_f() +
    fields[31].gsub(',', '.').to_f()
  end

compare(detalhe, tel_detalhe)

 
And here it is in C++.
 
int main(int argc, char *argv[])
{
    string file_date = GetFileDate(string{argv[1]});
 
    auto cv_tr = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[15], ",", "."));};
    TrafficFile<decltype(cv_tr)> 
        tel_resumo{"Telemóvel (resumo)", 
        file_date + "_tlm.csv", 3, cv_tr};
 
    auto cv_r = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[13], ",", "."));};
    TrafficFile<decltype(cv_r)> 
        resumo{"Resumo", file_date + "_rsm.csv", 8, cv_r};
 
    Compare(tel_resumo, resumo);
 
    auto cv_d = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[22], ",", "."));};
    TrafficFile<decltype(cv_d)> 
        detalhe{"Detalhe", file_date + "_det.csv", 4, cv_d};
 
    auto cv_td = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[4], ",", "."))
            + stod(replace_first_copy(c[8], ",", "."))
            + stod(replace_first_copy(c[31], ",", "."));};
    TrafficFile<decltype(cv_td)>
        tel_detalhe{"Telemóvel (detalhe)", 
        file_date + "_tlm.csv", 3, cv_td};
 
    Compare(detalhe, tel_detalhe);
}
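
One way to avoid spelling out `decltype` for every lambda is a small factory function that lets the compiler deduce the type. The sketch below is an assumption about how that could look, not part of the original program: `MakeTrafficFile` is a made-up name, and `TrafficFile` is redeclared here (as a plain aggregate) only to keep the example self-contained.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using SplitContainer = std::vector<std::string>;

// Redeclared for the sake of the example; same members as the struct above.
template <typename T>
struct TrafficFile
{
    std::string description;
    std::string filename;
    short key_position;
    T calc_value;
};

// Hypothetical factory: T is deduced from the lambda passed as cv,
// so the caller never has to name the lambda's type.
template <typename T>
TrafficFile<T> MakeTrafficFile(std::string desc, std::string fn,
    short kp, T cv)
{
    return TrafficFile<T>{std::move(desc), std::move(fn), kp,
        std::move(cv)};
}
```

With something like this, a declaration would shrink to `auto resumo = MakeTrafficFile("Resumo", file_date + "_rsm.csv", 8, cv_r);`, and the lambda could even be written inline as the last argument.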
 

Validate

The validation is performed by comparing files in pairs. For each file, we load the (key, total) pairs into a container, and then we compare the totals for each key. Since there is no guarantee that every key is present in both files, when we find a key that exists in both containers, we remove that key from both containers.

This is the function that performs that comparison. We output every key whose totals differ between the two files.
 
In Ruby.
 
def compare(first, second)
  first_data =
    get_unified_totals(first.filename, first.key_position,
      first.value_position, &first.calc_value)
  second_data =
    get_unified_totals(second.filename, second.key_position,
      second.value_position, &second.calc_value)
 
  first_data.each() do |key, value|
    if second_data.has_key?(key)
      if (value - second_data[key]).abs() > 0.0009
        puts("ERRO! #{key} tem valores incoerentes: #{value}" +
          " e #{second_data[key]}")
      end

      first_data.delete(key)
      second_data.delete(key)
    end
  end
 
  check_remaining(first_data)
  check_remaining(second_data)
end

In C++. 
 

template <typename T1, typename T2>
void Compare(T1 first, T2 second)
{
    CardEntries first_data = 
        GetUnifiedTotals(first.filename, first.key_position, 
            first.calc_value);
    CardEntries second_data =
        GetUnifiedTotals(second.filename, second.key_position, 
            second.calc_value);
 
    for (auto it = first_data.cbegin(); 
        it != first_data.cend(); NOOP )
    {
        auto f = second_data.find(it->first);

        if (f != second_data.end())
        {
            if (fabs(it->second - f->second) > VALUE_DIFF)
            {
                cout << "ERRO! " << it->first 
                    << " tem valores incoerentes: "
                    << it->second << " e " << f->second << " (" 
                    << fabs(it->second - f->second)
                    << ")" << endl;
            }
 

            first_data.erase(it++);
            second_data.erase(f);
        }
        else
        {
            ++it;
        }
    }
 
    CheckRemaining(first_data);
    CheckRemaining(second_data);
} 
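
The `erase(it++)` idiom above works fine. Since C++11, `map::erase` also returns the iterator following the removed element, so the same loop can be written without the post-increment. Here's a minimal sketch of just that part, with the made-up name `RemoveCommonKeys` and the comparison itself left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

using Totals = std::map<int, double>;

// Hypothetical helper: removes every key present in both maps and
// returns how many keys were removed.
std::size_t RemoveCommonKeys(Totals& first, Totals& second)
{
    std::size_t removed = 0;
    for (auto it = first.begin(); it != first.end(); )
    {
        auto f = second.find(it->first);
        if (f != second.end())
        {
            // erase() hands back the next valid iterator
            it = first.erase(it);
            second.erase(f);
            ++removed;
        }
        else
        {
            ++it;
        }
    }
    return removed;
}
```

Either form leaves each container holding only the keys that the other one never had.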
 
Since we remove all keys that exist in both containers, in the end, each container will have only the keys that didn't exist in the other container. We then use another function to validate these keys, which should all have a total of 0.
 
Here's Ruby.
 
def check_remaining(data_set)
  data_set.each() do |key, value|
    if (value - 0).abs() > 0.0001
      puts("AVISO! #{key} tem valor: #{value}")
    end
  end
end
 
Here's C++. Once again, note the code similarity, how the C++ code isn't any more complex than the Ruby code.
 
void CheckRemaining(CardEntries const& data_set)
{
    for (auto& item : data_set)
    {
        if (fabs(item.second - 0) > ZERO_DIFF)
        {
            cout << "AVISO! " << item.first 
                << " tem valor: " << item.second << endl;
        }
    }
}
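
The reason for comparing against a small tolerance, rather than using `==`, is that sums of floating-point values accumulate rounding error, so two totals that should match may differ in the last few digits. Here's a minimal sketch of that comparison pulled out into a helper; `NearlyEqual` is a made-up name, and the tolerance is whatever the caller passes in.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: true when a and b differ by at most tolerance.
bool NearlyEqual(double a, double b, double tolerance)
{
    return std::fabs(a - b) <= tolerance;
}
```

With it, the check in `Compare` would read `if (!NearlyEqual(it->second, f->second, VALUE_DIFF))`, and the one in `CheckRemaining` would compare against 0 with `ZERO_DIFF`.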
 

Adding up

This is the function that reads a file and creates a container with the key and its total, which is the sum of all the values for all the occurrences of the key. The file is a CSV, but since I'm on Ruby 1.8.7, I preferred not to use Ruby's CSV lib(s); after all, I just needed to split the line on a ";", so I did it "manually".
 
def get_unified_totals(filename, key_position, value_position = 0)
  totals = Hash.new(0)
 
  File.open(filename) do |file|
    # skip header
    file.gets()
   
    while line = file.gets()
      a = line.split(';')
      if block_given?()
        totals[a[key_position]] += yield(a)
      else
        totals[a[key_position]] +=
          a[value_position].gsub(',', '.').to_f()
      end
    end
  end
 
  return totals
end
 
The C++ version is a bit simpler, because we always pass a lambda.
 
template <typename T>
CardEntries GetUnifiedTotals(string filename, short key_position,
    T calc_value)
{
    ifstream inFile{filename};
 
    string line;
    getline(inFile, line, inFile.widen('\n')); // Skip header
 
    CardEntries totals;
    SplitContainer c;
    while(getline(inFile, line, inFile.widen('\n')))
    {
        split(c, line, is_any_of(";"));
        totals[stoi(c[key_position])] += calc_value(c);
    }
 
    return totals;
} 
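
Because the function above reads straight from a file, trying it out needs real files on disk. A variant that takes any input stream can be fed from a string instead. The following is only a sketch under a few assumptions: `SumByKey`, `SplitLine` and `ParseDecimalComma` are made-up names, it uses a fixed value position instead of a lambda, it relies on the default "C" locale so that `stod` expects a decimal point, and `std::replace` swaps every comma rather than just the first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Converts a decimal-comma field such as "12,50" to a double.
double ParseDecimalComma(std::string field)
{
    std::replace(field.begin(), field.end(), ',', '.');
    return std::stod(field);
}

// Splits a line on a single-character separator.
std::vector<std::string> SplitLine(std::string const& line, char sep)
{
    std::vector<std::string> fields;
    std::istringstream in{line};
    std::string field;
    while (std::getline(in, field, sep))
    {
        fields.push_back(field);
    }
    return fields;
}

// Skips the header line, then sums the value field for each key.
std::map<int, double> SumByKey(std::istream& in,
    std::size_t key_position, std::size_t value_position)
{
    std::string line;
    std::getline(in, line); // Skip header

    std::map<int, double> totals;
    while (std::getline(in, line))
    {
        auto const fields = SplitLine(line, ';');
        totals[std::stoi(fields.at(key_position))] +=
            ParseDecimalComma(fields.at(value_position));
    }
    return totals;
}
```

An `std::ifstream` can be passed to `SumByKey` just as well as an `std::istringstream`, so the same code would serve both the real run and a quick check with a few hand-written lines.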
 

TL;DR

When you have a task to do, just pick the language you're most familiar/comfortable with, unless you have a very good reason not to. The much-famed "productivity gap" may not be as large as is usually advertised.
 
One thing I read often is that a good programmer is productive/competent/idiomatic with any language. I don't doubt it, although I do struggle a bit to accept some quirks (e.g., Ruby's "no return" policy).
 
However, I believe we all have a natural affinity towards some language(s)/style(s); that's our comfort zone, and that's where we'll have a tendency to be most productive. I'll write a bit more about this in my next post.