When a drone fails to initialize, let the scheduler die. We used to try to carry on gracefully, but that turns out to be unsafe. If the initialization failure is due to a network condition and the drone is actually still up, Autoserv processes will continue to run on that drone, but the scheduler will be unable to detect or stop them. The scheduler may then launch a second Autoserv against a machine that already has one running.
This is unfortunate, since it means the whole scheduler goes down whenever any one drone does. But until we can implement a more sophisticated solution, failing fast is the conservative thing to do.
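In code terms, the change is just deleting the try/except around drone initialization so that error.AutoservError propagates and kills the scheduler. Here is a minimal standalone sketch of the new behavior; AutoservError and FlakyDrone below are toy stand-ins for illustration, not the real autotest classes:

    class AutoservError(Exception):
        """Toy stand-in for autotest's error.AutoservError."""


    class FlakyDrone(object):
        """Toy drone whose initialize() call fails, e.g. due to a
        network condition, while the host itself may still be up."""
        def __init__(self, hostname):
            self.hostname = hostname

        def call(self, method, *args):
            raise AutoservError('cannot reach %s' % self.hostname)


    def initialize_drones(drone_hostnames, base_results_dir):
        # New behavior: no try/except. A failed initialize propagates
        # and takes the scheduler down, instead of silently dropping a
        # drone that may still be running Autoserv processes we cannot
        # detect or stop.
        drones = []
        for hostname in drone_hostnames:
            drone = FlakyDrone(hostname)
            drone.call('initialize', base_results_dir)
            drones.append(drone)
        return drones


    if __name__ == '__main__':
        # Raises AutoservError and terminates the process: the
        # conservative behavior this change is after.
        initialize_drones(['drone1'], '/usr/local/autotest/results')

Under the old behavior, the except block swallowed this error, emailed a warning, and carried on with the drone removed from the pool.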
Signed-off-by: Steve Howard <showard@google.com>
git-svn-id: http://test.kernel.org/svn/autotest/trunk@3381 592f7852-d20e-0410-864c-8624ca9c26a4
diff --git a/scheduler/drone_manager.py b/scheduler/drone_manager.py
index 5c915c4..dbaa75d 100644
--- a/scheduler/drone_manager.py
+++ b/scheduler/drone_manager.py
@@ -114,15 +114,8 @@
             base_results_dir, drone_utility._TEMPORARY_DIRECTORY))
 
         for hostname in drone_hostnames:
-            try:
-                drone = self._add_drone(hostname)
-                drone.call('initialize', base_results_dir)
-            except error.AutoservError:
-                warning = 'Drone %s failed to initialize:\n%s' % (
-                    hostname, traceback.format_exc())
-                email_manager.manager.enqueue_notify_email(
-                    'Drone failed to initialize', warning)
-                self._remove_drone(hostname)
+            drone = self._add_drone(hostname)
+            drone.call('initialize', base_results_dir)
 
         if not self._drones:
             # all drones failed to initialize
@@ -130,7 +123,7 @@
 
         self.refresh_drone_configs()
 
-        logging.info('Using results repository on %s',
+        logging.info('Using results repository on %s',
                      results_repository_hostname)
         self._results_drone = drones.get_drone(results_repository_hostname)
         # don't initialize() the results drone - we don't want to clear out any