Sunday 14 September 2014

Performance puzzle(d)

So, back to programming.
 
Every now and then, I pick up a programming puzzle. The goal is not only to solve the puzzle, but also the learn more about the performance of my solutions.
 
A few days ago, I've decided to take a shot at making a Cracker Barrel solver. Simple stuff, just brute-force your way through solving the puzzle. First, I've started with a standard 15-hole board. Then, I've moved on to a configurable board (but never smaller than 15 holes). Then, I've implemented getting the board setup (including the list of all possible moves) just from the number of holes. Then, I've decided that this list would be kept in an std::array, which meant getting this information at compile-time, which gave an excuse for a little bit of basic recursive template meta-programming lite.

Yes, I've forced feature-creep on myself. Ah, and no design optimization - e.g, I didn't use STL's bitset<N> (or even C bit-twiddling) for the board, and I've created a good-ole struct for the moves, with three integers.

I've built it with GCC (mingw32/Qt Creator) and MSVC 2013. For a board with 15 holes, both were immediate. As I've moved to 21 holes, the GCC exe took a few seconds, but MSVC took almost... 1 minute?! With 28 holes, Qt Creator takes forever, and so does MSVC.

The fact that this takes forever doesn't surprise me. As I said, I've made no optimizations. However, the difference between GCC and MSVC puzzled me. So, I turned to Windows Performance Analyzer (WPA) to figure out what was going on.

Now, I had a good guess about what was going on - I was sure that, compiler optimizations notwithstanding, there was probably a lot of copying going on. There certainly is a lot of vectors getting constructed, and I was expecting to find my main culprit along those lines.

So, I fired up WPA, loaded the trace file, selected the CPU Usage (Sampled) graph, changed it to Display table only, opened the View Editor and set the Stack (call stack) to Visible.

And this was flagged as the most time-consuming operation of all:

std::_Equal_range<GameMove const *,GameMove,int,
    bool (__cdecl*)(GameMove const &,GameMove const &)

Sometimes, life proves more interesting than anticipated. However, this time it was just me not paying attention.

When I say I've made no optimizations, I didn't even go for the most important optimization of all - an intelligent algorithm. This means that when I calculate all the valid moves after each move, I go through the entire board, including spots for which no move is possible. And I'm using sorted vectors/arrays and equal_range() to perform this search.

So, while it the result was surprising, it shouldn't really have been.

Next step - find a more intelligent way to get the list of valid moves.