Friday 19 October 2012

The light at the end of the tunnel - Design is jelling

Here is a first version of the AsyncLoop functions. As I've said before, this functionality is mostly intact from our first working example.

What I particularly like about this solution is that we're separating SSH details from loop control. Our loop only cares about each iteration's task state - has it finished, or is it still going? This also means we don't need an object to drive these loops; free functions are enough.

This is also why the functions have different names - AsyncLoopSocket() and AsyncLoopTimer() - rather than being overloads: I prefer having two different names to worrying about casts every time I want to use these functions.
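Roughly, this is the shape I have in mind for both (just a sketch - the task signature and all the details here are illustration only): the task runs one step of a libssh2 operation and returns true once that operation has completed, false while it would still block.

#include <boost/asio.hpp>
#include <boost/function.hpp>

typedef boost::function<bool()> Task;   // true == this iteration's work is done

// Drive the loop off the socket: run the task and, if it isn't done yet,
// wait until the socket has activity before trying again.
void AsyncLoopSocket(boost::asio::ip::tcp::socket& sock, Task task)
{
    if (task())
    {
        sock.get_io_service().stop();
        return;
    }

    // (the real loop may also need a null_buffers() write wait, depending on
    // what libssh2 says it's blocked on - that's where the extra handlers
    // discussed below come from)
    sock.async_read_some(boost::asio::null_buffers(),
        [&sock, task](boost::system::error_code const&, std::size_t)
        {
            AsyncLoopSocket(sock, task);
        });
}

// Drive the loop off a timer: same idea, but only one handler is ever
// outstanding per iteration.
void AsyncLoopTimer(boost::asio::deadline_timer& timer, Task task)
{
    if (task())
    {
        timer.get_io_service().stop();
        return;
    }

    timer.expires_from_now(boost::posix_time::milliseconds(10));
    timer.async_wait(
        [&timer, task](boost::system::error_code const&)
        {
            AsyncLoopTimer(timer, task);
        });
}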

This new design has also revealed a problem that had escaped my attention. I've mostly been using AsyncLoopSocket() as my preferred loop driver, and some operations require starting both a read and a write on the socket. That, in turn, can leave handlers to be invoked after the work is done, i.e., after we call sock.get_io_service().stop(). Since stop() doesn't actually clear io_service's queue, outstanding handlers will be called on that io_service's next reset()/run() cycle.
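Here's a tiny, self-contained illustration of that behaviour (nothing SSH-related, just two posted handlers):

#include <boost/asio.hpp>
#include <iostream>

int main()
{
    boost::asio::io_service io;

    // The first handler stops the io_service; the second is already
    // queued behind it.
    io.post([&io] { std::cout << "first handler\n"; io.stop(); });
    io.post([]    { std::cout << "second handler\n"; });

    io.run();    // prints "first handler", then returns because of stop()

    io.reset();
    io.run();    // only now is "second handler" printed
}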

And this becomes a problem when we have the life cycle spread out across several classes. Let's take a look at, say, SessionConnection. As I explained in my previous post, this class handles the SSH session connection to a remote host (handshake/authentication and disconnect). We would like to use the socket as our driver for the disconnect loop. However, this loop runs in the dtor, and requires both a read and a write.

So, it's possible we'll have one or more events from this loop still on the queue after SessionConnection's dtor has finished execution. That means there's a risk of our io_service's next reset()/run() cycle calling a method on an instance of SessionConnection that no longer exists. "There's a risk" means "it probably won't happen most of the time, and when it does, it'll create some mysterious hard-to-debug problem". Which I'd rather avoid.

Even though I already had an alternative, I took a look at asio's source (despite my current semi-illiteracy when it comes to templates). The queue is cleared in io_service::~io_service(), which deletes a pointer to service_registry, so I took a look at its destructor:

service_registry::~service_registry()
{
  // Shutdown all services. This must be done in a separate loop before the
  // services are destroyed since the destructors of user-defined handler
  // objects may try to access other service objects.
  boost::asio::io_service::service* service = first_service_;
  while (service)
  {
    service->shutdown_service();
    service = service->next_;
  }

  // Destroy all services.
  while (first_service_)
  {
    boost::asio::io_service::service* next_service = first_service_->next_;
    destroy(first_service_);
    first_service_ = next_service;
  }
}

The first comment seems to hint that shutdown_service() doesn't clear outstanding handlers in the queue, and that this is done in destroy(), which deletes first_service_, a service* it receives as a parameter. So, for a socket (and considering IOCP), this is the service type we're looking for: typedef detail::win_iocp_socket_service<Protocol> service_impl_type.

I dug deeper into asio's source, and soon I was looking at code like this, in win_iocp_socket_service_base.ipp:

if (FARPROC cancel_io_ex_ptr = ::GetProcAddress(
      ::GetModuleHandleA("KERNEL32"), "CancelIoEx"))
{
  // The version of Windows supports cancellation from any thread.
  typedef BOOL (WINAPI* cancel_io_ex_t)(HANDLE, LPOVERLAPPED);
  cancel_io_ex_t cancel_io_ex = (cancel_io_ex_t)cancel_io_ex_ptr;
  socket_type sock = impl.socket_;
  HANDLE sock_as_handle = reinterpret_cast<HANDLE>(sock);
  etc...

And that's when I decided to stop. Yes, maybe there is a right way to do what I want buried somewhere in asio, but this is not where I want to invest my time right now. One of the reasons for using asio is precisely to keep some distance from OS details. I have an alternative, and I'll try it first: I'll use AsyncLoopTimer() for all operations in the dtors, since it doesn't fire two events.
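As a sketch of what that looks like for SessionConnection's disconnect (the member names and the single libssh2_session_disconnect() call are illustration only - the real sequence has more steps):

SessionConnection::~SessionConnection()
{
    // One non-blocking step of the disconnect per iteration; we're done
    // when libssh2 stops asking us to come back later.
    AsyncLoopTimer(timer, [this]()
    {
        int rc = libssh2_session_disconnect(session, "Normal shutdown");
        return rc != LIBSSH2_ERROR_EAGAIN;
    });

    ioService.reset();
    ioService.run();
}

Since the timer only ever has one outstanding handler per iteration, there's nothing left on the queue once the loop stops the io_service.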

I've also completed my first draft for SSHCommand's design:
  • SSHCommand will contain an instance of ChannelLocal (further work is required on these names, obviously), that does nothing on ctor and frees the LIBSSH2_CHANNEL's local resources on dtor, calling libssh2_channel_free().
  • The method that implements command execution will have a block with a local instance of ChannelRemote, that opens an SSH channel on ctor, calling libssh2_channel_open_session(), and closes it on dtor, calling libssh2_channel_close(). In this block we'll execute the command, via libssh2_channel_exec(), and read its result, via libssh2_channel_read(). When we exit this block, i.e., after the channel is closed, we'll call libssh2_channel_get_exit_status() and libssh2_channel_get_exit_signal(). I expect it'll be something like this:
void SSHCommand::Execute()
{
    {
        // ChannelLocal WILL NEED THE LIBSSH2 CHANNEL
        // TO CALL libssh2_channel_free() ON DTOR,
        ChannelRemote cr(channelLocal);
        // AND WE'LL NEED IT, AFTER cr IS DESTROYED,
        // TO GET THE EXIT STATUS/SIGNAL
        channel = cr.GetChannel();
        
        ExecuteCommand();
        GetResult();
    }
    GetExitInfo();
}

SSHCommand's design is more complex because the life cycle involved is not quite symmetric: we've got open on one side versus close and free on the other, and we can only get the exit status/signal after the close. So, even though I've considered (again) placing all the functionality on ctors/dtors, I've decided against it, mainly to keep all dtors simple.
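For reference, this is roughly how I picture the two wrappers (the constructor parameters differ slightly from the Execute() sketch above - here I'm passing the session in explicitly just to keep the sketch self-contained - and the EAGAIN looping is left out):

#include <libssh2.h>

// Owns the channel's local resources; does nothing on ctor, frees them on dtor.
class ChannelLocal
{
public:
    ChannelLocal() : channel(nullptr) {}
    ~ChannelLocal() { if (channel) libssh2_channel_free(channel); }

    void SetChannel(LIBSSH2_CHANNEL* c) { channel = c; }
    LIBSSH2_CHANNEL* GetChannel() const { return channel; }

private:
    LIBSSH2_CHANNEL* channel;
};

// Opens the channel on ctor and closes it on dtor, handing the pointer to
// ChannelLocal so the local resources outlive the close.
class ChannelRemote
{
public:
    ChannelRemote(LIBSSH2_SESSION* session, ChannelLocal& local)
        : channel(libssh2_channel_open_session(session))
    {
        local.SetChannel(channel);
    }

    ~ChannelRemote() { if (channel) libssh2_channel_close(channel); }

    LIBSSH2_CHANNEL* GetChannel() const { return channel; }

private:
    LIBSSH2_CHANNEL* channel;
};

Each dtor ends up with a single libssh2 call, which is exactly the "keep all dtors simple" goal.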

A final note: I believe io_service needs more in terms of queue management. We should be able to clear the queue, query how many waiting events it contains, and set a limit on its size. I can even see scenarios where we could traverse all its elements, performing some action. If this couldn't be implemented without a performance penalty, then it could be designed as a compile-time option, i.e., not active by default, but if you really need it, just activate it and rebuild.
