Abort child process groups with SIGKILL.

Currently we unnecessarily sleep for 12 seconds every time we kill a
process. Here's how it works:
- Send SIGCONT to all processes.
- If any are still running, wait 6 seconds.
- Send SIGTERM to all processes.
- If any are still running, wait 6 seconds.
- Send SIGKILL to all processes.

There are several problems with the above algorithm:
- SIGCONT doesn't cause processes to exit, so waiting 6 seconds after
  that is pointless.
- After sending SIGTERM, we immediately check whether any of the
  processes are still present. This doesn't give children enough time
  to actually clean up.
- We sleep for 6 seconds unconditionally, without considering that the
  processes might exit early.

Instead of doing this, I've updated it to send SIGKILL to the entire
process group, with no sleep statements.

BUG=chromium:432191
DEPLOY=scheduler
TEST=Set up local scheduler, verified timeouts still work.

Change-Id: Ie41d10d0605851df61dd07789d0d4afffd9eef01
Reviewed-on: https://chromium-review.googlesource.com/230323
Reviewed-by: Mike Frysinger <vapier@chromium.org>
Reviewed-by: Prashanth B <beeps@chromium.org>
Commit-Queue: David James <davidjames@chromium.org>
Tested-by: David James <davidjames@chromium.org>
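
For illustration only, here is a minimal standalone sketch of the approach the commit message describes: signalling a whole process group with SIGKILL by passing a negative pid, with no sleeps in between. It is written for Python 3 and is not the code in this change; the helper names `spawn_in_new_group` and `kill_process_group` are hypothetical and do not exist in autotest.

```python
import os
import signal
import subprocess
import sys


def spawn_in_new_group(seconds):
    """Start a sleeping child that leads its own process group.

    Hypothetical helper for this sketch. With start_new_session=True the
    child calls setsid(), so its pid is also its process group id.
    """
    return subprocess.Popen(
        [sys.executable, '-c', 'import time; time.sleep(%d)' % seconds],
        start_new_session=True)


def kill_process_group(pgid):
    """Send SIGKILL to every process in the group `pgid`.

    Returns True if the signal was sent, False if no such group exists.
    """
    try:
        # A negative pid makes kill(2) signal the whole process group.
        os.kill(-pgid, signal.SIGKILL)
        return True
    except ProcessLookupError:
        # The group may have exited before we could signal it.
        return False
```

A caller would typically spawn the child with `spawn_in_new_group(60)`, call `kill_process_group(child.pid)`, and then `child.wait()` to reap it. Because SIGKILL cannot be caught or ignored, there is nothing to wait for between signals, which is why the sleeps can be dropped.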

commit: f77198b6b71c701b719bf28fc122c56116bc3a98
author: David James <davidjames@google.com> Mon Nov 17 17:23:12 2014 -0800
committer: chrome-internal-fetch <chrome-internal-fetch@google.com> Sat Dec 06 01:11:34 2014 +0000
tree: a231adf76f51cf63439bb4c4c66ade0bca417d00
parent: 5240742bbdeeb9773968d2922a6b9e06abad1604
diff --git a/client/common_lib/site_utils.py b/client/common_lib/site_utils.py
index fe612b2..352ee43 100644
--- a/client/common_lib/site_utils.py
+++ b/client/common_lib/site_utils.py

@@ -290,13 +290,13 @@
                 # The process may have died from a previous signal before we
                 # could kill it.
                 pass
+        if sig == signal.SIGKILL:
+            return sig_count
         pid_list = [pid for pid in pid_list if base_utils.pid_is_alive(pid)]
         if not pid_list:
             break
         time.sleep(CHECK_PID_IS_ALIVE_TIMEOUT)
     failed_list = []
-    if signal.SIGKILL in signal_queue:
-        return sig_count
     for pid in pid_list:
         if base_utils.pid_is_alive(pid):
             failed_list.append('Could not kill %d for process name: %s.' % pid,

diff --git a/scheduler/drone_utility.py b/scheduler/drone_utility.py
index 489c358..dfb4905 100755
--- a/scheduler/drone_utility.py
+++ b/scheduler/drone_utility.py

@@ -226,12 +226,11 @@
         kill_proc_key = 'kill_processes'
         stats.Gauge(_STATS_KEY).send('%s.%s' % (kill_proc_key, 'net'),
                                      len(process_list))
-        signal_queue = (signal.SIGCONT, signal.SIGTERM, signal.SIGKILL)
         try:
             logging.info('List of process to be killed: %s', process_list)
             sig_counts = utils.nuke_pids(
-                            [process.pid for process in process_list],
-                            signal_queue=signal_queue)
+                            [-process.pid for process in process_list],
+                            signal_queue=(signal.SIGKILL,))
             for name, count in sig_counts.iteritems():
                 stats.Gauge(_STATS_KEY).send('%s.%s' % (kill_proc_key, name),
                                              count)