Fix FAFT synchronization problems

When running many FAFT tests the autotest server copies some code
(client/cros/faft_client.py) to the DUT and starts it up. The file
name is a bit misleading, this code happens to be a server providing
access to functions FAFT requires to run on the DUT.

To make things worse, the autotest server can not talk directly to the
server running on the DUT, because Chrome OS locks all ports but
22. To work around this an ssh port forwarder process is also started
on the DUT, such that the autotest server can connect to a certain
port on the local host and gain access to the FAFT server on the DUT.

The problem is that FAFT suffers from occasional connection problems
causing random test failures. Some investigation has shown there are
two distinct issues:

- sometimes the connection gets rejected by the DUT, and then no
  matter how many times the server retries, the connection does not
  get established.

- sometimes an unexpected exception happens when trying to connect

This change addresses both of the problems.

Failure to connect once the first attempt did not succeed was caused
by the fact that the ssh tunnel created for port forwarding stayed up
while the server on the DUT was restarting. The port forwarding tunnel
needs to be reopened after the app restarts. This change makes sure
that it is happening when trying to reconnect to the DUT.

The unexpected exception turned out to be httplib.BadStatusLine - it
is raised when the server reports some unrecognizable HTTP
status. This seems to be happening when the server on the DUT is
accessed soon after restart, and only once, the following attempt of
the same access succeeds. This change adds the new exception to the
list of exceptions causing an attempt to reconnect.

Apparently there is a method implementing this functionality (graceful
restarts of the server on the DUT) in

server.hosts.site_host.py:SiteHost:xmlrpc_connect()

servo_test.py might need to be refactor to use that implementation. It
is still good to submit this fix to provide a baseline to verify the
refactoring fix, when and if it happens.

BUG=chrome-os-partner:15610
TEST=manual

  run the firmware_FAFTSetup test as follows 10 times in a
  row. Observe it succeed all 10 runs. It used to fail once in three
  or four runs on average.

    ./server/autoserv -m 192.168.1.4 --ssh-port 22 \
       -s server/site_tests/firmware_FAFTSetup/control -r /tmp/autot \
       --args 'servo_host=localhost servo_port=9999';
     rm -rf /tmp/autot/*

Change-Id: I875e2cc2a3fed15e4f7733cf9dbc3b84e255b06f
Signed-off-by: Vadim Bendebury <vbendeb@chromium.org>
Reviewed-on: https://gerrit.chromium.org/gerrit/39929
Reviewed-by: Tom Wai-Hong Tam <waihong@chromium.org>
diff --git a/server/cros/servo_test.py b/server/cros/servo_test.py
index 06f9631..9fcbff5 100644
--- a/server/cros/servo_test.py
+++ b/server/cros/servo_test.py
@@ -2,12 +2,7 @@
 # Use of this source code is governed by a BSD-style license that can be
 # found in the LICENSE file.
 
-import logging
-import os
-import socket
-import subprocess
-import time
-import xmlrpclib
+import httplib, logging, os, socket, subprocess, sys, time, xmlrpclib
 
 from autotest_lib.client.common_lib import error
 from autotest_lib.server import autotest, test
@@ -172,13 +167,15 @@
         """
         assert info['used'], \
             'Remote %s dependency not installed.' % info['ref_name']
-        if not info['ssh_tunnel'] or info['ssh_tunnel'].poll() is not None:
-            self._launch_ssh_tunnel(info)
-        assert info['ssh_tunnel'] and info['ssh_tunnel'].poll() is None, \
-            'The SSH tunnel is not up.'
+
+        if info['ssh_tunnel']:
+            info['ssh_tunnel'].terminate()
+            info['ssh_tunnel'] = None
 
         # Launch RPC server remotely.
         self._kill_remote_process(info)
+        self._launch_ssh_tunnel(info)
+
         logging.info('Client command: %s', info['remote_command'])
         log_file = info.get('remote_log_file', '/dev/null')
         logging.info("Logging to %s", log_file)
@@ -206,11 +203,16 @@
                 polling_rpc = getattr(remote_object, info['polling_rpc'])
                 polling_rpc()
                 succeed = True
-            except (socket.error, xmlrpclib.ProtocolError) as e:
+            except (socket.error,
+                    xmlrpclib.ProtocolError,
+                    httplib.BadStatusLine) as e:
                 logging.info('caught exception %s', e)
                 # The client RPC server may not come online fast enough. Retry.
                 timeout -= 1
                 rpc_error = e
+            except:
+                logging.error('Unexpected error: %s', sys.exc_info()[0])
+                raise
 
         if not succeed:
             if isinstance(rpc_error, xmlrpclib.ProtocolError):
@@ -300,6 +302,8 @@
                 'ssh -N -n -q %s -L %s:localhost:%s root@%s' %
                 (info['ssh_config'], info['port'], info['port'],
                 self._client.ip)], shell=True)
+            assert info['ssh_tunnel'].poll() is None, \
+                'The SSH tunnel on port %d is not up.' % info['port']
 
 
     def _kill_remote_process(self, info):
@@ -322,3 +326,4 @@
             if info['ssh_tunnel'] and info['ssh_tunnel'].poll() is None:
                 info['ssh_tunnel'].terminate()
             self._kill_remote_process(info)
+            info['ssh_tunnel'] = None