Saturday 29 September 2012

SSHWrapper/TestSSH redesign - An observation

Sometimes, we're so smart, we miss the obvious.

I'm redesigning my SSHWrapper/TestSSH (yes, it's in that stage of its career where it changes its artistic name every now and then), my C++ wrapper around libssh2 using boost asio.

I've finished work on SSHSession, which provides SSH session (uncanny, heh?) functionality. Another class will provide SSH Channel/command execution functionality.

I was working on a text to post here, detailing how SSHSession worked, when I had an idea for an addition to SSHSession. You see, it connects in the constructor and disconnects in the destructor, but something could go wrong during disconnect.

So, I added to its public interface a method that forcibly terminates a connection, controlled by a flag, so that it could only be invoked if the normal disconnection process failed. This forced terminate would be based on the premise that the disconnect had failed on the SSH level and would attempt a TCP (i.e., socket) disconnect, leaving the server to clean up any SSH mess on its side.

So, it would flow like this:
  1. SSHSession goes out of scope on the client code, thus invoking the destructor.
  2. Destructor performs SSH cleanup.
  3. SSH cleanup fails. We set a flag that will allow the client to invoke SSHSession's forced disconnection.

I coded it enthusiastically, thinking how cool it was that I was making my class more robust and user-friendly. And I was updating the text at the same time. And then, just before posting it here, I decided to run a test on this brilliant new idea of mine. Not that it needed testing, of course, being such a brilliant idea.

Once again, I wrote the test code and updated the text. And then, the obvious hit me!

Despite my best intentions, the destructor finishes successfully and the object is, well... destroyed. The client code isn't even aware that something failed. And even if it were, there would be no object upon which to invoke my Wonder-Terminate-Forced anymore. Because, you know, that's one of the expected consequences of a destructor finishing its execution successfully.

And the point is... writing the post helped me visualize what I was doing, and helped me get to the error in my design sooner. I'll have to do this more often - write a post after a coding session, even if I end up not publishing it. It forces me to think about what I've done, and to review it.

And that's always a good thing.

Thursday 27 September 2012

libssh2/asio - Redesign and test wrap-up

I'm in the midst of a design change in my libssh2 + asio solution.

I'll be having an SSHSession class, which will, in turn, create SSHCommand (or SSHExec, I haven't decided yet) instances to run the commands on the remote server.

And I've also finished my idle connection tests. How did it go?

I've started by defining ClientAliveInterval and ClientAliveCountMax in sshd_config. Sure enough, the idle connection was killed a lot sooner - 80 seconds were enough to get a LIBSSH2_ERROR_SOCKET_SEND error. And on the sshd side, I got a "Timeout, client not responding" message.

In these cases, the socket was reported as open, i.e., socket::is_open() returned true. Which should be correct; otherwise, libssh2 would probably have returned LIBSSH2_ERROR_SOCKET_DISCONNECT. And this confirmed that this timeout applies to the SSH session only, i.e., the TCP connection is still alive.

Next test. I undid this change, and activated the socket's keepalive. And we've broken the previous "record" of approx. 1.5 hours of idleness. An idle connection has "survived" for 6 hours. At that time, I killed the app; I don't think it's really necessary to test the 12-hour limit. Checking the configuration on my Linux host, the default value for a TCP connection was 2 hours. After that, keepalive kicks in, at default intervals of 75 seconds, with 9 attempts. This matches our previous testing, i.e., our idle connection lasted for 1.5 hours, but not for 3 hours. And when we enabled keepalive on our socket, it lasted for 6 hours, until we killed it.

For the final test, I changed the keepalive settings for the Linux host and rebooted. I checked the values, just to make sure my changes held: 5 minutes (300 seconds) for tcp_keepalive_time, 30 seconds for tcp_keepalive_intvl, and 3 for tcp_keepalive_probes; and I didn't set the socket's keepalive. So, I expected that the connection would be dropped after some 7ish minutes. That's definitely not what happened. It behaved as if I hadn't changed the keepalive settings and as if I had enabled the socket's keepalive.

The usual googling produced no help, so I've decided to keep it at that. I'll be logging the timestamp for each connection's creation and invalidation, so I'm sure some field data will give me a clearer picture of what to expect. Also, this app may be used in conditions where lost connections happen often. So, what I actually need is to put together a strategy for dealing with lost connections. While it's annoying that I couldn't get to the bottom of this, I feel a need to move on and create something. It's true these last few months have been a fantastic learning experience, but I've been stopped here for too long.

Finally, a note about cancelling requests. From what I've understood, I can use io_service::stop() or socket::cancel(), the latter being less problematic than the former. However, since I expect this app to be used in Windows XP, I don't really want to have to deal with this (from boost asio's docs): "It can appear to complete without error, but the request to cancel the unfinished operations may be silently ignored by the operating system. Whether it works or not seems to depend on the drivers that are installed".

This means I expect I'll "roll my own", when it comes to work status - running or cancelled. I'm also implementing state management; one of the problems I had when integrating libssh2 + asio is that outstanding requests remained in queue even after calling io_service::stop(). Since I then called io_service::reset() to prepare for another run, those outstanding requests would still get processed, but out of turn. So, that meant, e.g., attempting to close a channel (whose pointer I had already nullptr'd) at the same time I was opening a new channel; or attempting to execute a command a second time while reading the output of its first (and, in normal conditions, only) execution.

I've also finished work on integrating the SSH session life cycle correctly with asio. One thing I tested was which libssh2 functions needed both asio read and write; only libssh2_userauth_password() and libssh2_session_disconnect() required both an async_read_some() and an async_write_some().

Saturday 22 September 2012

Mingw And The Mystery of the Missing Console

So, you create a new GUI project in Qt Creator and then decide, during development, to get some console output; say, you're testing some code and you're not exactly sure how it will work, and you haven't even built a GUI yet, so it's easier to get some couts going.

Well, you may or may not comment out gui references in your QT variable, in your .pro file. But you'd better not forget to add CONFIG += console. What happens if you don't? Well, let's assume you have something like this:

#include <iostream>
using std::cout;
using std::endl;

int main(int argc, char *argv[])
{
    cout << "start" << endl;
}

If you run it in the IDE, all is well. If you open a DOS window and run it from there, you get nothing. No error, no output, nothing. You may wonder if anything is going wrong before execution hits main(), so you fire up gdb, set a breakpoint in main() and run it. It stops on your breakpoint, steps perfectly through your cout, but still nothing happens.

And, throughout all of this, not a single warning. It sets my eyes on fire with warnings if I have a variable that I'm not using (yet), but it doesn't seem to have the insight to tell me "You're using cout, but you didn't specify -subsystem,console; if you're expecting to see any output, boy, are you in for some Interesting Times".
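For the record, the fix mentioned above is a one-liner in the .pro file (a sketch; whether you also drop the gui module depends on your project):

```qmake
# QT -= gui        # optional: drop GUI modules while you're testing
CONFIG += console  # without this, mingw links without -subsystem,console
                   # and cout output silently goes nowhere
```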

Anyway...

Testing on libssh2 + asio is still going on.

Also, I've just installed VS 2012 Express. I've already built libssh2 with it (and OpenSSL + zlib). It was easier building it with mingw, so there goes my theory that everything is easier to build on Windows if you have VC++.

Sunday 9 September 2012

libssh2/boost::asio progresses

One of my goals is to have a pool of open SSH sessions. Since authentication is measured in seconds, I thought it would be a good idea to avoid repeating it as much as possible; so, I open the session once and reuse it. I'm still working on the design, but these sessions will probably be stored in a multimap, the key being host + port (usually, 22) + user. I'll be using a multimap because you can have more than one operation running on the same host/port/user.

Of course, having a pool means we'll have SSH sessions sitting idle in the pool, until we use them. And idle connections are always a source of problems. Speaking of which...

I've been testing last post's libssh2/asio example. While it runs fine if we run a command and exit, in a scenario where we want to reuse the same SSH session to run several commands (i.e., open/close several channels), it doesn't work. So far, I've been concentrating on the channel life cycle, as this is where most of the action (and, naturally, problems) happens.

So, in all operations concerning channels, I've eliminated the async_read_some(), keeping only the async_write_some(). I've run two tests on this:
  • One, where a command is executed every 10 seconds. This ran flawlessly for about 20 minutes, until I CTRL+C'd it.
  • The other was similar, but the interval doubled between executions; so, the first interval was 10 seconds; the second was 20 seconds; and so on. This one crashed with an exception after being idle for nearly 3 hours. So, it remained idle and valid for 1.5 hours. The error was LIBSSH2_ERROR_SOCKET_SEND, which seems to indicate the socket was closed due to inactivity.

Testing will continue, naturally. I've added code to check the socket's status when an exception occurs (basically, check socket::is_open(), to see if the asio socket is aware that the connection was terminated), and I'll enable the socket's keepalive option, to see if it makes a difference.

I'm using an Ubuntu guest on VirtualBox as remote server. Checking its sshd_config, it has TCPKeepAlive on, and it's not using ClientAliveInterval. I'll have to test these settings, too.

Still a lot of work ahead, but so far, so good.

Further along on the horizon:
- Multithreading (I'm not going to mess with this until I get single-threading working correctly).
- Recovering from a dead connection. It's bound to happen, so I better be ready to deal with it.