Fix FAFT synchronization problems When running many FAFT tests the autotest server copies some code (client/cros/faft_client.py) to the DUT and starts it up. The file name is a bit misleading, this code happens to be a server providing access to functions FAFT requires to run on the DUT. To make things worse, the autotest server can not talk directly to the server running on the DUT, because Chrome OS locks all ports but 22. To work around this an ssh port forwarder process is also started on the DUT, such that the autotest server can connect to a certain port on the local host and gain access to the FAFT server on the DUT. The problem is that FAFT suffers from occasional connection problems causing random test failures. Some investigation has shown there are two distinct issues: - sometimes the connection gets rejected by the DUT, and then no matter how many times the server retries, the connection does not get established. - sometimes an unexpected exception happens when trying to connect This change addresses both of the problems. Failure to connect once the first attempt did not succeed was caused by the fact that the ssh tunnel created for port forwarding stayed up while the server on the DUT was restarting. The port forwarding tunnel needs to be reopened after the app restarts. This change makes sure that it is happening when trying to reconnect to the DUT. The unexpected exception turned out to be httplib.BadStatusLine - it is raised when the server reports some unrecognizable HTTP status. This seems to be happening when the server on the DUT is accessed soon after restart, and only once, the following attempt of the same access succeeds. This change adds the new exception to the list of exceptions causing an attempt to reconnect. Apparently there is a method implementing this functionality (graceful restarts of the server on the DUT) in server.hosts.site_host.py:SiteHost:xmlrpc_connect() servo_test.py might need to be refactor to use that implementation. It is still good to submit this fix to provide a baseline to verify the refactoring fix, when and if it happens. BUG=chrome-os-partner:15610 TEST=manual run the firmware_FAFTSetup test as follows 10 times in a row. Observe it succeed all 10 runs. It used to fail once in three or four runs on average. ./server/autoserv -m 192.168.1.4 --ssh-port 22 \ -s server/site_tests/firmware_FAFTSetup/control -r /tmp/autot \ --args 'servo_host=localhost servo_port=9999'; rm -rf /tmp/autot/* Change-Id: I875e2cc2a3fed15e4f7733cf9dbc3b84e255b06f Signed-off-by: Vadim Bendebury <vbendeb@chromium.org> Reviewed-on: https://gerrit.chromium.org/gerrit/39929 Reviewed-by: Tom Wai-Hong Tam <waihong@chromium.org>

commit: 692341e5e4e1206db81215170231980579f5bc41 [log] [tgz]
author: Vadim Bendebury <vbendeb@chromium.org> Tue Dec 18 15:24:38 2012 -0800
committer: Gerrit <chrome-bot@google.com> Wed Dec 19 15:12:07 2012 -0800
tree: e482173349eed118997433ae7d3ba2d8038ef457
parent: 5b8d925de2875a868e330927645330e40da28713 [diff] [blame]
diff --git a/server/cros/servo_test.py b/server/cros/servo_test.py
index 06f9631..9fcbff5 100644
--- a/server/cros/servo_test.py
+++ b/server/cros/servo_test.py

@@ -2,12 +2,7 @@
 # Use of this source code is governed by a BSD-style license that can be
 # found in the LICENSE file.
 
-import logging
-import os
-import socket
-import subprocess
-import time
-import xmlrpclib
+import httplib, logging, os, socket, subprocess, sys, time, xmlrpclib
 
 from autotest_lib.client.common_lib import error
 from autotest_lib.server import autotest, test
@@ -172,13 +167,15 @@
         """
         assert info['used'], \
             'Remote %s dependency not installed.' % info['ref_name']
-        if not info['ssh_tunnel'] or info['ssh_tunnel'].poll() is not None:
-            self._launch_ssh_tunnel(info)
-        assert info['ssh_tunnel'] and info['ssh_tunnel'].poll() is None, \
-            'The SSH tunnel is not up.'
+
+        if info['ssh_tunnel']:
+            info['ssh_tunnel'].terminate()
+            info['ssh_tunnel'] = None
 
         # Launch RPC server remotely.
         self._kill_remote_process(info)
+        self._launch_ssh_tunnel(info)
+
         logging.info('Client command: %s', info['remote_command'])
         log_file = info.get('remote_log_file', '/dev/null')
         logging.info("Logging to %s", log_file)
@@ -206,11 +203,16 @@
                 polling_rpc = getattr(remote_object, info['polling_rpc'])
                 polling_rpc()
                 succeed = True
-            except (socket.error, xmlrpclib.ProtocolError) as e:
+            except (socket.error,
+                    xmlrpclib.ProtocolError,
+                    httplib.BadStatusLine) as e:
                 logging.info('caught exception %s', e)
                 # The client RPC server may not come online fast enough. Retry.
                 timeout -= 1
                 rpc_error = e
+            except:
+                logging.error('Unexpected error: %s', sys.exc_info()[0])
+                raise
 
         if not succeed:
             if isinstance(rpc_error, xmlrpclib.ProtocolError):
@@ -300,6 +302,8 @@
                 'ssh -N -n -q %s -L %s:localhost:%s root@%s' %
                 (info['ssh_config'], info['port'], info['port'],
                 self._client.ip)], shell=True)
+            assert info['ssh_tunnel'].poll() is None, \
+                'The SSH tunnel on port %d is not up.' % info['port']
 
 
     def _kill_remote_process(self, info):
@@ -322,3 +326,4 @@
             if info['ssh_tunnel'] and info['ssh_tunnel'].poll() is None:
                 info['ssh_tunnel'].terminate()
             self._kill_remote_process(info)
+            info['ssh_tunnel'] = None
commit	692341e5e4e1206db81215170231980579f5bc41	[log] [tgz]
author	Vadim Bendebury <vbendeb@chromium.org>	Tue Dec 18 15:24:38 2012 -0800
committer	Gerrit <chrome-bot@google.com>	Wed Dec 19 15:12:07 2012 -0800
tree	e482173349eed118997433ae7d3ba2d8038ef457
parent	5b8d925de2875a868e330927645330e40da28713 [diff] [blame]