[autotest] Queue calls in drone after drone refresh.

Drone refresh is not thread-safe. The refresh starts at the beginning of
the tick, a couple of other operations follow, and the scheduler then
waits for the refresh to finish. When the refresh starts, it executes all
queued calls in each drone using drone_utils. After drone_utils finishes
processing the calls, the scheduler empties the queued calls in the drones.

That means any call added between the start and the completion of the
drone refresh is removed without ever being executed.
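
The lost-call sequence can be reproduced with a minimal single-threaded
sketch. FakeDrone and its methods are hypothetical stand-ins, not the
autotest drone classes; only the ordering of the steps follows the
description above.

```python
class FakeDrone(object):
    """Hypothetical stand-in for a drone that holds a queue of calls."""

    def __init__(self):
        self._calls = []

    def queue_call(self, method):
        self._calls.append(method)

    def start_refresh(self):
        # The refresh takes a snapshot of the queued calls to execute.
        return list(self._calls)

    def clear_call_queue(self):
        # The scheduler empties the whole queue once the refresh is done.
        self._calls = []


drone = FakeDrone()
drone.queue_call('refresh')
in_flight = drone.start_refresh()                # refresh starts
drone.queue_call('cleanup_orphaned_containers')  # queued mid-refresh
drone.clear_call_queue()                         # refresh finishes

# The mid-refresh call was never handed to the refresh and is now gone.
lost = 'cleanup_orphaned_containers' not in in_flight
```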

This CL moves the cleanup call to after the drone refresh and adds a
comment about potential future issues. A better fix would address the root
cause: for example, add a tracker to each drone's call queue so that, after
the drone refresh is done, only the calls processed within that refresh are
cleared. crbug.com/484715 is filed to track this issue.
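
One possible shape for the tracker idea is sketched below. This is not
what this CL implements; TrackedCallQueue and its method names are
hypothetical, and a real version would also need a lock around the queue
because the refresh runs in a different thread.

```python
class TrackedCallQueue(object):
    """Hypothetical call queue that remembers what a refresh picked up."""

    def __init__(self):
        self._calls = []
        self._in_flight = 0

    def queue_call(self, method):
        self._calls.append(method)

    def start_refresh(self):
        # Record how many calls this refresh is responsible for.
        self._in_flight = len(self._calls)
        return list(self._calls[:self._in_flight])

    def finish_refresh(self):
        # Drop only the calls handed to the refresh; keep later arrivals.
        del self._calls[:self._in_flight]
        self._in_flight = 0

    def pending(self):
        return list(self._calls)


queue = TrackedCallQueue()
queue.queue_call('refresh')
processed = queue.start_refresh()
queue.queue_call('cleanup_orphaned_containers')  # arrives mid-refresh
queue.finish_refresh()
remaining = queue.pending()
```

With this bookkeeping, the call queued during the refresh stays in the
queue and is picked up by the next refresh instead of being discarded.
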
BUG=chromium:484039
TEST=local scheduler run, make sure lxc_cleanup is kicked off and finished.
Change-Id: I1bb3229a3da578299949a00af25b3d4674eeed4b
Reviewed-on: https://chromium-review.googlesource.com/269255
Trybot-Ready: Dan Shi <dshi@chromium.org>
Tested-by: Dan Shi <dshi@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Simran Basi <sbasi@chromium.org>
Commit-Queue: Dan Shi <dshi@chromium.org>
diff --git a/scheduler/drone_manager.py b/scheduler/drone_manager.py
index e74cb72..d4e95d5 100644
--- a/scheduler/drone_manager.py
+++ b/scheduler/drone_manager.py
@@ -307,19 +307,10 @@
def cleanup_orphaned_containers(self):
"""Queue cleanup_orphaned_containers call at each drone.
"""
- drones = list(self.get_drones())
- for drone in drones:
- logging.info('Queue cleanup_orphaned_containers at %s', drone)
+ for drone in self._drones.values():
+ logging.info('Queue cleanup_orphaned_containers at %s',
+ drone.hostname)
drone.queue_call('cleanup_orphaned_containers')
- with self._timer.get_client('cleanup_orphaned_containers'):
- # Each task will start a new process of lxc_cleanup in drone and
- # exit, the wait time is about 2-3 seconds at most. If this call
- # does not wait, the drone refresh may have a race condition when
- # it tries to process all queued calls in a different thread. The
- # race condition will lead to scheduler crash. Therefore, the tasks
- # queued here will be waited for finishing. Considering it will
- # only be called once a day, the overhead should be minimum.
- self._refresh_task_queue.execute(drones, wait=True)
def _get_drone_for_process(self, process):