Tuesday 30 October 2012

Exception and Stack Trace

I'm using boost::exception. I like the usage pattern they suggest in their docs, I like the concept of having exception types as semantic tags to which I "append" arbitrary error info. In short, I find it brilliant.

However, I've hit a little problem - stack trace. I'd like something similar to Java's and C#'s stack traces, but I'm not ready to dive into backtrace or RtlCaptureStackBackTrace. So, I've decided to create a poor man's stack trace. On the downside, it'll trace only my own code; on the upside, I hope it will be less problematic than a real stack trace (this opinion is not based on actual experience, but rather on what I've found online about it - e.g., this).

I didn't actually consider many alternatives for this one - I needed a structure of arbitrary size that could hold BOOST_CURRENT_FUNCTION, __FILE__ and __LINE__. I toyed with the idea of placing a std::vector in my exceptions, but I quickly realized that wasn't the way to go.

Why? Suppose we've got a function, LowLevelFunction() that throws a low_level_exception, which has a "stack trace" vector (STV) where we place the error location. This is caught further up the chain by HighLevelFunction(), which adds some more error info, adds its own location to the STV, and rethrows to its caller, HigherLevelFunction(). Now, HigherLevelFunction() was called by HighestLevelFunction(), which doesn't really care about low_level_exceptions; it won't touch anything other than a high_level_exception, except to send it directly to a logger (however, it does want the information contained in low_level_exception). So, HigherLevelFunction() throws a high_level_exception, which has its own STV. Now what do we do? Copy low_level_exception STV entries to high_level_exception? Add HigherLevelFunction()'s location to low_level_exception's STV? Use both STVs? There's no good solution, because the idea isn't sound.

However, as we said, HighestLevelFunction() wants all the relevant info contained in low_level_exception. And all this info must end up on a high_level_exception, since that's what it expects to catch. And the best solution I've found for this is nested exceptions, which is business-as-usual in Java and C#. And, thankfully, also in boost::exception.

We start with this typedef (from here):
typedef boost::error_info<struct tag_nested_exception, boost::exception_ptr>
    nested_exception;
And that means we can now do something like this:
void LowLevelFunction()
{
...
    BOOST_THROW_EXCEPTION(low_level_exception() << error_id(id) << 
        error_message(msg));
}

void HighLevelFunction()
{
    try
    {
        ...
    }
    catch (low_level_exception&)
    {
        BOOST_THROW_EXCEPTION(low_level_exception() << 
            resource_name(res_name) << 
            nested_exception(current_exception()));
    }
}

void HigherLevelFunction()
{
    try
    {
        ...
    }
    catch (low_level_exception&)
    {
        BOOST_THROW_EXCEPTION(high_level_exception() << 
            operation_value(oper_val) << 
            nested_exception(current_exception()));
    }
}

void HighestLevelFunction()
{
    try
    {
        ...
    }
    catch (high_level_exception& hle)
    {
        // Finally, do something useful with all this info.
        // Hope the logger's still working...
    }
}
So, on the downside, we create two instances of low_level_exception, where we could've used just one. I can live with that, until evidence shows me it's a bad idea.

Yes, I know, there will be times when this may only make matters worse - e.g., if we catch an std::bad_alloc because we ran out of memory, creating exceptions is not the way to go. However, in a situation like this, I'd say our system is most likely headed for a fresh new (re)start, and there's little we can do to change its course.
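As for actually getting at the whole chain in HighestLevelFunction(): boost::exception lets you pull the nested exception_ptr back out with get_error_info<nested_exception> and rethrow it. The same unwrapping shape can be sketched stdlib-only with C++11's std::throw_with_nested/std::rethrow_if_nested - all names below are illustrative stand-ins, not the boost-based code above:

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Toy stand-ins for the exception types in the post.
struct low_level_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct high_level_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

void LowLevelFunction() {
    throw low_level_exception("resource unavailable");
}

void HigherLevelFunction() {
    try {
        LowLevelFunction();
    } catch (low_level_exception&) {
        // Wrap the in-flight exception instead of copying its info over.
        std::throw_with_nested(high_level_exception("operation failed"));
    }
}

// Recursively flatten the what() strings of a nested exception chain.
std::string Flatten(const std::exception& e) {
    std::string out = e.what();
    try {
        std::rethrow_if_nested(e);
    } catch (const std::exception& nested) {
        out += " <- " + Flatten(nested);
    }
    return out;
}

// Only catches high_level_exception, yet still gets at the low-level
// info through the nested chain.
std::string HighestLevelFunction() {
    try {
        HigherLevelFunction();
    } catch (high_level_exception& hle) {
        return Flatten(hle);
    }
    return "";
}
```

The recursion in Flatten() is what diagnostic_information() does for us in the boost version, formatting aside.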

Next on the list: Take a look at the implementation of diagnostic_information(). I like the completeness of its output, but not so much its formatting, especially when dealing with nested exceptions.

On a side note, I've loved finding out about current_exception(). Finally, I can do something useful here:
SomeImportantClass::~SomeImportantClass()
{
    try
    {
        // Cleanup. May throw
    }
    catch (...)
    {
        // I'd like to log this, but what exactly did we catch?
        // current_exception() to the rescue
    }
}
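For the curious: the standard std::current_exception() (C++11) works the same way as the boost one, so the dtor idea can be sketched stdlib-only. Cleanup() and the logging target are hypothetical stand-ins:

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for cleanup work that may throw.
void Cleanup() { throw std::runtime_error("close failed"); }

// Describe whatever is in flight inside a catch (...) block.
// boost::current_exception()/boost::diagnostic_information() give a
// richer version of the same idea.
std::string DescribeCurrentException() {
    std::exception_ptr p = std::current_exception();
    if (!p) return "no exception";
    try {
        std::rethrow_exception(p);
    } catch (const std::exception& e) {
        return e.what();
    } catch (...) {
        return "unknown exception";
    }
}

struct SomeImportantClass {
    explicit SomeImportantClass(std::string& log) : log_(log) {}
    ~SomeImportantClass() {
        try {
            Cleanup();
        } catch (...) {
            // Don't let the dtor throw; just record what we caught.
            log_ += DescribeCurrentException();
        }
    }
    std::string& log_;  // stands in for the logger
};
```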
Oh, and I've been having some fun with perl, regex, and log files. I'll be diving into Boost.Regex soon, so I'm getting used to it in a more "dynamic" environment.

Friday 19 October 2012

The light at the end of the tunnel - Design is jelling

Here is a first version of the AsyncLoop functions. As I've said before, this functionality is mostly intact from our first working example.

What I particularly like about this solution is that we're separating SSH details from loop control. Our loop only cares about each iteration's task state - has it finished, or is it still going? This also means we don't need an object to drive these loops; free functions are enough.

The functions have different names because of this. I prefer having two different names to worrying about casts every time I want to use these functions.

This new design has also revealed a problem that had escaped my attention. I've mostly been using AsyncLoopSocket() as my preferred loop driver, and some operations require starting both a read and a write on the socket. That, in turn, gives rise to handlers being called after the work is done, i.e., after we call sock.get_io_service().stop(). Since this doesn't actually clear io_service's queue, outstanding handlers will be called on the io_service's next reset()/run() cycle.

And this becomes a problem when we have the life cycle spread out across several classes. Let's take a look at, say, SessionConnection. As I explained in my previous post, this class handles the SSH session connection to a remote host (handshake/authentication and disconnect). We would like to use the socket as our driver for the disconnect loop. However, this loop runs on the dtor, and requires both a read and a write.

So, it's possible we'll have one or more events from this loop still on the queue after SessionConnection's dtor has finished execution. That means there's a risk of our io_service's next reset()/run() cycle calling a method on an instance of SessionConnection whose dtor has finished execution. "There's a risk" means "it probably won't happen most of the time, and when it does, it'll create some mysterious hard-to-debug problem". Which I'd rather avoid.
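The hazard can be shown with a toy model of the relevant behavior - this is not asio, just a minimal stdlib queue that mimics the fact that stop() halts the loop without clearing outstanding handlers:

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>

// Toy model: stop() halts run(), but does NOT clear outstanding
// handlers; they survive and fire on the next reset()/run() cycle.
class ToyIoService {
public:
    void post(std::function<void()> h) { queue_.push_back(std::move(h)); }
    void stop() { stopped_ = true; }
    void reset() { stopped_ = false; }
    void run() {
        while (!stopped_ && !queue_.empty()) {
            std::function<void()> h = std::move(queue_.front());
            queue_.pop_front();
            h();
        }
    }
    std::size_t outstanding() const { return queue_.size(); }
private:
    std::deque<std::function<void()>> queue_;
    bool stopped_ = false;
};

// A read and a write both complete; the first handler stops the loop,
// so the second survives into the next reset()/run() cycle - possibly
// after the object it targets has already been destroyed.
std::string Demo() {
    ToyIoService io;
    std::string log;
    io.post([&] { log += "read "; io.stop(); });  // work done, stop loop
    io.post([&] { log += "stale-write "; });      // still queued!
    io.run();    // runs only the first handler
    io.reset();
    io.run();    // the stale handler fires here
    return log;
}
```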

Even though I already had an alternative, I took a look at asio's source (despite my current semi-illiteracy when it comes to templates). The queue is cleared in io_service::~io_service(). This deletes a pointer to service_registry, so I took a look at its destructor:

service_registry::~service_registry()
{
  // Shutdown all services. This must be done in a separate loop before the
  // services are destroyed since the destructors of user-defined handler
  // objects may try to access other service objects.
  boost::asio::io_service::service* service = first_service_;
  while (service)
  {
    service->shutdown_service();
    service = service->next_;
  }

  // Destroy all services.
  while (first_service_)
  {
    boost::asio::io_service::service* next_service = first_service_->next_;
    destroy(first_service_);
    first_service_ = next_service;
  }
}

The first comment seems to hint that shutdown_service() doesn't clear outstanding handlers in the queue, and that this is done in destroy(), which deletes the service* it receives as a parameter (here, first_service_). So, for a socket (and considering IOCP), we've got this service type to look for: typedef detail::win_iocp_socket_service<Protocol> service_impl_type.

I dug deeper into asio's source, and soon I was looking at code like this, in win_iocp_socket_service_base.ipp:

if (FARPROC cancel_io_ex_ptr = ::GetProcAddress(
        ::GetModuleHandleA("KERNEL32"), "CancelIoEx"))
  {
    // The version of Windows supports cancellation from any thread.
    typedef BOOL (WINAPI* cancel_io_ex_t)(HANDLE, LPOVERLAPPED);
    cancel_io_ex_t cancel_io_ex = (cancel_io_ex_t)cancel_io_ex_ptr;
    socket_type sock = impl.socket_;
    HANDLE sock_as_handle = reinterpret_cast<HANDLE>(sock);
    etc...

And that's when I've decided to stop. Yes, maybe there is a right way to do what I want buried somewhere in asio, but this is not where I want to invest my time right now. One of the reasons for using asio is precisely to keep some distance from OS details. I have an alternative, and I'll try it first: I'll use AsyncLoopTimer() for all operations on the dtors, since this doesn't fire two events.

I've also completed my first draft for SSHCommand's design:
  • SSHCommand will contain an instance of ChannelLocal (further work is required on these names, obviously), which does nothing in its ctor and frees LIBSSH2_CHANNEL's local resources in its dtor, calling libssh2_channel_free().
  • The method that implements command execution will have a block with a local instance of ChannelRemote, that opens an SSH channel on ctor, calling libssh2_channel_open_session(), and closes it on dtor, calling libssh2_channel_close(). In this block we'll execute the command, via libssh2_channel_exec(), and read its result, via libssh2_channel_read(). When we exit this block, i.e., after the channel is closed, we'll call libssh2_channel_get_exit_status() and libssh2_channel_get_exit_signal(). I expect it'll be something like this:
SSHCommand::Execute()
{
    {
        // ChannelLocal WILL NEED THE LIBSSH2 CHANNEL
        // TO CALL libssh2_channel_free() ON DTOR,
        ChannelRemote cr(channelLocal);
        // AND WE'LL NEED IT, AFTER cr IS DESTROYED,
        // TO GET THE EXIT STATUS/SIGNAL
        channel = cr.GetChannel();
        
        ExecuteCommand();
        GetResult();
    }
    GetExitInfo();
}

SSHCommand's design is more complex because the life cycle involved is not quite symmetric. We've got open vs. close and free, and we can only get the exit status/signal after close. So, even though I've considered (again) placing all functionality on its ctor/dtor, I've decided against it, mainly to keep all dtors simple.
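The asymmetry can be made concrete with stand-ins that just record the order of operations - the fake_* functions below are hypothetical stubs for the libssh2 calls, not the real API:

```cpp
#include <string>

// Stub channel; the log records the order in which things happen.
struct FakeChannel { std::string* log; };

void fake_channel_close(FakeChannel& c)   { *c.log += "close "; }
void fake_channel_free(FakeChannel& c)    { *c.log += "free "; }
void fake_get_exit_status(FakeChannel& c) { *c.log += "exit-status "; }

// Frees local resources in its dtor (the post's ChannelLocal).
class ChannelLocal {
public:
    explicit ChannelLocal(FakeChannel& c) : channel_(c) {}
    ~ChannelLocal() { fake_channel_free(channel_); }
private:
    FakeChannel& channel_;
};

// Closes the remote channel in its dtor (the post's ChannelRemote).
class ChannelRemote {
public:
    explicit ChannelRemote(FakeChannel& c) : channel_(c) {}
    ~ChannelRemote() { fake_channel_close(channel_); }
private:
    FakeChannel& channel_;
};

// Mirrors SSHCommand::Execute(): exit status is read after close
// (inner block ends) but before free (outer object's dtor).
void Execute(std::string& log) {
    FakeChannel ch{&log};
    ChannelLocal local(ch);
    {
        ChannelRemote remote(ch);
        log += "exec read ";       // command execution + result read
    }                              // close happens here
    fake_get_exit_status(ch);      // only legal after close
}                                  // free happens here
```

The inner block is what lets the two halves of the lifecycle stay RAII while still leaving a window, between close and free, to collect the exit status/signal.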

A final note: I believe io_service needs more in terms of queue management. We should be able to clear the queue, to query how many waiting events it contains, and to set a limit on its size. I can even see scenarios where we could traverse all its elements, performing some action. If this couldn't be implemented without a performance penalty, it could be designed as a compile-time option, i.e., not active by default, but if you really need it, you activate it and rebuild.

Friday 12 October 2012

Re-re-re-etc-design

Progress has slowed down a bit. First, because work shifted into overdrive, with an impact on my free time. Second, because of this, namely: "Designing an interface takes longer than whipping together a concrete class which fulfills that interface". Getting libssh2 and boost::asio to work was, it turns out, the easy part. Creating a robust class out of this is turning out to be a bit more difficult.

This is the 5th time I've started writing this post. Every time I begin, I find myself having to think deeper about my work (so I can explain it properly), which leads me to an insight, usually followed by some testing, a redesign, and some more testing. Then, I let it settle for a while, and start writing a post about it. Rinse and repeat.

So...

After reviewing SSHSession vis-à-vis items from Scott Meyers's "Effective C++" and "More Effective C++", I've noticed some problems to correct, which led to a redesign; which led to more problems; which was useful to remind me that, despite all the progress I've made, I'm still a beginner (let's just say I've created more than my share of source files with names like "test_*.cpp" these last couple of weeks).

The main goals behind the redesign are:
  • Simplifying SSHSession's ctor and dtor, by implementing RAII on these resources: socket, session instance (resources created by libssh2 on the local machine), and session connection (actual ssh session established on the remote host).
  • Making SSHSession more robust to failure.

The life cycle of a libssh2 session goes through these stages:
  1. Initialization (acquires local resources)
  2. Handshake/Authentication (establishes an ssh session on the remote host)
  3. Disconnect (terminates ssh session on the remote host)
  4. Free (releases local resources; if necessary, also performs step 3).
Before step 1, we have to open a socket, which we pass to libssh2 in step 2. From then on, we don't need to pass the socket anymore, as libssh2 will store a reference to it. After step 4 (and not before), we can close the socket.
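The required nesting falls out naturally from declaration order, since dtors run in reverse. A sketch with stand-in classes that only record the order of the stages (all names and the log are illustrative, not the real design):

```cpp
#include <string>

// Stage 0: the socket must bracket everything else.
class Socket {
public:
    explicit Socket(std::string& log) : log_(log) { log_ += "open-socket "; }
    ~Socket() { log_ += "close-socket "; }
private:
    std::string& log_;
};

// Stages 1 and 4: local resources (the post's SessionHandle).
class SessionHandle {
public:
    explicit SessionHandle(std::string& log) : log_(log) { log_ += "init "; }
    ~SessionHandle() { log_ += "free "; }
private:
    std::string& log_;
};

// Stages 2 and 3: the remote session (the post's SessionConnection).
class SessionConnection {
public:
    explicit SessionConnection(std::string& log) : log_(log) { log_ += "handshake-auth "; }
    ~SessionConnection() { log_ += "disconnect "; }
private:
    std::string& log_;
};

// Declaration order gives the required nesting: dtors run in reverse,
// so the socket stays open until after free, as libssh2 requires.
void RunSession(std::string& log) {
    Socket socket(log);
    SessionHandle handle(log);
    SessionConnection connection(log);
}
```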

So, after a few iterations, I arrived at this:
  • LoopAsync() and LoopTimer() (see here) will be free functions. Each will have a new parameter added - a socket/timer reference, upon which they'll act.
  • SSHSession will have a nested class, SessionConnection. This class is responsible for ssh handshake/authentication (on ctor) and ssh disconnect (on dtor), thus implementing RAII for the session connection (steps 2 and 3, above).
  • SessionConnection will have a nested class SessionHandle, that will encapsulate a LIBSSH2_SESSION *. This class is responsible for calling libssh2 functions that acquire (in its ctor) and free (in its dtor) local resources (steps 1 and 4, above).
I'm still on the fence about how to best handle the socket. I'm considering two alternatives:
  • SessionConnection will also be responsible for the socket, creating it in its ctor and destroying it in its dtor. This causes an order dependency in the dtor - the socket must remain open until SessionHandle's dtor exits, as LIBSSH2_SESSION attempts a cleanup on the remote server, if necessary, in the case of an abnormal termination.
OR
  • SessionHandle will contain the socket, since it may need it in the dtor, as LIBSSH2_SESSION attempts a cleanup on the remote server, if necessary. This will probably mean a class rename, since "Handle" implies something simpler, such as a thin wrapper around a resource.
Right now, I'm more inclined towards the first alternative, in spite of the order dependency.

Both cases will require a method to create the socket, as asio's tcp::socket has no ctor that accepts a host name, only an IP address (via tcp::endpoint). So, whoever ends up owning the socket will also have a function that creates it and returns it via std::move.