Thursday 27 September 2012

libssh2/asio - Redesign and test wrap-up

I'm in the midst of a design change in my libssh2 + asio solution.

I'll be having an SSHSession class, which will, in turn, create SSHCommand() (or SSHExec, I haven't decided yet) instances to run the commands on the remote server.

I've also finished my idle connection tests. How did they go?

I started by defining ClientAliveInterval and ClientAliveCountMax in sshd_config. Sure enough, the idle connection was killed a lot sooner: 80 seconds were enough to get a LIBSSH2_ERROR_SOCKET_SEND error. And on the sshd side, I got a "Timeout, client not responding" message.

In these cases, the socket still reported as open, i.e., socket::is_open() returned true. That should be correct; otherwise libssh2 would probably have returned LIBSSH2_ERROR_SOCKET_DISCONNECT. It also confirmed that this timeout applies to the SSH session only, i.e., the TCP connection is still alive.

Next test. I undid this change and enabled keepalive on the socket. That broke the previous "record" of approx. 1.5 hours of idleness: an idle connection "survived" for 6 hours. At that point I killed the app; I don't think it's really necessary to test the 12-hour limit. I then checked the configuration on my Linux host: by default, a TCP connection must be idle for 2 hours (tcp_keepalive_time) before keepalive kicks in, after which probes are sent at intervals of 75 seconds (tcp_keepalive_intvl), with 9 attempts (tcp_keepalive_probes). This matches our previous testing, i.e., our idle connection lasted for 1.5 hours, but not for 3 hours. And when we enabled keepalive on our socket, it lasted for 6 hours, until we killed it.

For the final test, I changed the keepalive settings on the Linux host and rebooted. I checked the values, just to make sure my changes held: 5 minutes (300 seconds) for tcp_keepalive_time, 30 seconds for tcp_keepalive_intvl, and 3 for tcp_keepalive_probes. This time I didn't enable the socket's keepalive. So, I expected the connection to be dropped after about 6.5 minutes (300 + 3 × 30 = 390 seconds). That's definitely not what happened. It behaved as if I hadn't changed the keepalive settings and as if I had enabled the socket's keepalive.
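The arithmetic behind that expectation, as a small sketch. The function name is mine, and the formula assumes keepalive is actually active on the socket and that no probe is ever answered; it gives the time the kernel needs to give up on a silent peer.

```cpp
// Worst-case seconds between the last traffic on a connection and the
// kernel declaring the peer dead: the idle period, then `probes` probes
// spaced `interval_secs` apart, none of them answered.
constexpr long keepalive_detection_secs(long idle_secs, long interval_secs, long probes)
{
    return idle_secs + interval_secs * probes;
}

// Values from the tests described above.
constexpr long kAdjusted = keepalive_detection_secs(300, 30, 3);   // 390 s, i.e. 6.5 minutes
constexpr long kDefaults = keepalive_detection_secs(7200, 75, 9);  // 7875 s, i.e. 2 h 11 min 15 s
```

Note that this only describes when a dead peer would be detected. It says nothing about a peer that keeps answering the probes.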

The usual googling produced no help, so I've decided to leave it at that. I'll be logging the timestamp for each connection's creation and invalidation, so I'm sure some field data will give me a clearer picture of what to expect. Also, this app may be used in conditions where lost connections happen often. So, what I actually need is to put together a strategy for dealing with lost connections. While it's annoying that I couldn't get to the bottom of this, I feel a need to move on and create something. It's true these last few months have been a fantastic learning experience, but I've been stuck here for too long.

Finally, a note about cancelling requests. From what I've understood, I can use io_service::stop() or socket::cancel(), the latter being less problematic than the former. However, since I expect this app to be used on Windows XP, I don't really want to have to deal with this (from Boost.Asio's docs): "It can appear to complete without error, but the request to cancel the unfinished operations may be silently ignored by the operating system. Whether it works or not seems to depend on the drivers that are installed".

This means I expect I'll "roll my own" when it comes to work status, i.e., running or cancelled. I'm also implementing state management; one of the problems I had when integrating libssh2 + asio is that outstanding requests remained in the queue even after calling io_service::stop(). Since I then called io_service::reset() to prepare for another run, those outstanding requests would still get processed, but out of turn. So, that meant, e.g., attempting to close a channel (whose pointer I had already nullptr'd) at the same time I was opening a new channel; or attempting to execute a command a second time while reading the output of its first (and, in normal conditions, only) execution.
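One possible way to keep stale handlers from running out of turn is to tag each one with a "generation" number and skip it if the work was cancelled in the meantime. The sketch below is only an illustration of that idea, not my final design: GuardedQueue is a made-up stand-in for the handler queue that the io_service owns in the real app, and it uses nothing but the standard library.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for the queue of outstanding handlers.
class GuardedQueue {
public:
    // Queue a handler, tagged with the generation that is current right now.
    void post(std::function<void()> handler)
    {
        const unsigned long tag = generation_;
        queue_.push_back([this, tag, handler = std::move(handler)] {
            if (tag == generation_)
                handler();      // still belongs to the current run
            else
                ++dropped_;     // stale: it was queued before cancel()
        });
    }

    // "Cancel": leave the queue untouched, but invalidate everything in it.
    void cancel() { ++generation_; }

    // Drain the queue, much as a fresh run() would after a reset().
    void run()
    {
        while (!queue_.empty()) {
            auto next = std::move(queue_.front());
            queue_.pop_front();
            next();
        }
    }

    // Number of stale handlers that were skipped.
    std::size_t dropped() const { return dropped_; }

private:
    std::deque<std::function<void()>> queue_;
    unsigned long generation_ = 0;
    std::size_t dropped_ = 0;
};
```

With this, a "close channel" handler queued before cancel() is silently skipped when the queue is drained, while an "open channel" handler queued afterwards runs normally. The same check could live at the top of each real asio completion handler.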

I've also finished work on integrating the SSH session life cycle correctly with asio. One thing I tested was which libssh2 functions needed both asio read and write; only libssh2_userauth_password() and libssh2_session_disconnect() required both an async_read_some() and an async_write_some().
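As a sketch of how that decision could be made at run time rather than per function: assuming the session is in non-blocking mode, when a libssh2 call returns LIBSSH2_ERROR_EAGAIN, libssh2_session_block_directions() returns a bitmask saying whether libssh2 is waiting to read, to write, or both. The constants below mirror LIBSSH2_SESSION_BLOCK_INBOUND and LIBSSH2_SESSION_BLOCK_OUTBOUND from libssh2.h and are redefined only so the snippet compiles without the library; Waits and waits_for are names I made up.

```cpp
// Assumed to mirror LIBSSH2_SESSION_BLOCK_INBOUND / _OUTBOUND in libssh2.h;
// real code should use the library's own macros instead.
constexpr int kBlockInbound  = 0x0001;
constexpr int kBlockOutbound = 0x0002;

// Which asio waits to queue before retrying the libssh2 call.
struct Waits {
    bool read;   // queue an async_read_some()
    bool write;  // queue an async_write_some()
};

// Hypothetical helper: translate the direction bitmask into the waits needed.
Waits waits_for(int directions)
{
    return Waits{ (directions & kBlockInbound) != 0,
                  (directions & kBlockOutbound) != 0 };
}
```

A mask with both bits set would correspond to the cases above where both a read and a write were needed.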
