Autotest: reboot DUTs when they are moved from shard to master.
A special REPAIR task is triggered to force a reboot of a shard's DUTs
when the shard is deleted, in order to make sure the DUTs are still in
Ready status and their testing logs are owned by the master DB. However,
the REPAIR job cannot guarantee a reboot within a short time.
This CL instead triggers a reboot test with the highest priority on all
DUTs that will be moved from the shard to the master. The procedure for
deleting a shard is:
1. Lock all unlocked DUTs of this shard.
2. Delete any shard information in the master DB.
3. Trigger a reboot test with the highest priority, to make sure that this
test runs first once the DUTs are unlocked.
4. Unlock these DUTs.
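The four steps above can be sketched in plain Python. This is a minimal
simulation of the sequencing argument, not the real Autotest code: the
`Host`, `Scheduler`, `delete_shard` names and the `'SUPER'` priority value
are hypothetical stand-ins for the actual models and RPCs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Host:
    """Hypothetical stand-in for the Autotest Host model."""
    hostname: str
    shard: Optional[str] = None
    locked: bool = False


@dataclass
class Scheduler:
    """Hypothetical stand-in for the job-creation RPC."""
    jobs: List[dict] = field(default_factory=list)

    def create_job(self, name, priority, hosts):
        self.jobs.append({'name': name, 'priority': priority, 'hosts': hosts})


def delete_shard(shard, hosts, scheduler):
    # 1. Lock every currently-unlocked host on the shard so nothing new
    #    can be scheduled on them while shard state is torn down.
    to_lock = [h for h in hosts if h.shard == shard and not h.locked]
    for h in to_lock:
        h.locked = True
    # 2. Remove the shard association from the hosts.
    for h in hosts:
        if h.shard == shard:
            h.shard = None
    # 3. Queue the reboot job at the highest priority while the hosts are
    #    still locked, so it is first in line the moment they unlock.
    if to_lock:
        scheduler.create_job('reboot_dut_for_shard_deletion', 'SUPER',
                             [h.hostname for h in to_lock])
    # 4. Unlock only the hosts this procedure locked; hosts that were
    #    already locked by someone else stay locked.
    for h in to_lock:
        h.locked = False
```

The lock-before-queue ordering is the whole point: because the DUTs are
locked when the reboot job is created, no other pending job can grab them
first, so the reboot always runs ahead of the queued tests.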
BUG=chromium:499865
TEST=Configured a cbf master and a cbf shard. Scheduled several tasks on the
master, one running, the others pending. Ran 'atest shard delete ***' on the
master to make sure:
* The DUTs belonging to the shard are locked.
* The shard is deleted.
* A reboot test with the highest priority is triggered.
* The DUTs are unlocked.
* The reboot test is run first; all other pending tasks remain queued.
* After the reboot test finishes, the other pending tasks continue
to run on the master.
Ran site_rpc_interface_unittest locally.
Change-Id: I2b348e520c0f67bec5b4b1c89c75ad41e86c72a2
Reviewed-on: https://chromium-review.googlesource.com/334434
Commit-Ready: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Fang Deng <fdeng@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
diff --git a/frontend/afe/site_rpc_interface.py b/frontend/afe/site_rpc_interface.py
index 797ca06..72fbfae 100644
--- a/frontend/afe/site_rpc_interface.py
+++ b/frontend/afe/site_rpc_interface.py
@@ -859,10 +859,16 @@
"""Delete a shard and reclaim all resources from it.
This claims back all assigned hosts from the shard. To ensure all DUTs are
- in a sane state, a Repair task is scheduled for them. This reboots the DUTs
- and therefore clears all running processes that might be left.
+ in a sane state, a reboot test with the highest priority is scheduled for
+ them. This reboots the DUTs, after which the remaining tasks continue to
+ run on the master's drones.
- The shard_id of jobs of that shard will be set to None.
+ The procedure for deleting a shard:
+ * Lock all unlocked hosts on that shard.
+ * Remove shard information.
+ * Assign a reboot task with highest priority to these hosts.
+ * Unlock these hosts; the reboot tasks then run ahead of all other
+ tasks.
The status of jobs that haven't been reported to be finished yet, will be
lost. The master scheduler will pick up the jobs and execute them.
@@ -870,31 +876,38 @@
@param hostname: Hostname of the shard to delete.
"""
shard = rpc_utils.retrieve_shard(shard_hostname=hostname)
+ hostnames_to_lock = [h.hostname for h in
+ models.Host.objects.filter(shard=shard, locked=False)]
# TODO(beeps): Power off shard
+ # For ChromeOS hosts, a reboot test with the highest priority is added to
+ # the DUT. After a reboot it should be guaranteed that no processes from
+ # prior tests that were run by a shard are still running on it.
- # For ChromeOS hosts, repair reboots the DUT.
- # Repair will excalate through multiple repair steps and will verify the
- # success after each of them. Anyway, it will always run at least the first
- # one, which includes a reboot.
- # After a reboot we can be sure no processes from prior tests that were run
- # by a shard are still running on the DUT.
- # Important: Don't just set the status to Repair Failed, as that would run
- # Verify first, before doing any repair measures. Verify would probably
- # succeed, so this wouldn't change anything on the DUT.
- for host in models.Host.objects.filter(shard=shard):
- models.SpecialTask.objects.create(
- task=models.SpecialTask.Task.REPAIR,
- host=host,
- requested_by=models.User.current_user())
+ # Lock all unlocked hosts.
+ dicts = {'locked': True, 'lock_time': datetime.datetime.now()}
+ models.Host.objects.filter(hostname__in=hostnames_to_lock).update(**dicts)
+
+ # Remove shard information.
models.Host.objects.filter(shard=shard).update(shard=None)
-
models.Job.objects.filter(shard=shard).update(shard=None)
-
shard.labels.clear()
-
shard.delete()
+ # Assign a reboot task with highest priority: Super.
+ t = models.Test.objects.get(name='platform_BootPerfServer:shard')
+ c = utils.read_file(os.path.join(common.autotest_dir, t.path))
+ if hostnames_to_lock:
+ rpc_utils.create_job_common(
+ 'reboot_dut_for_shard_deletion',
+ priority=priorities.Priority.SUPER,
+ control_type='Server',
+ control_file=c, hosts=hostnames_to_lock)
+
+ # Unlock these shard-related hosts.
+ dicts = {'locked': False, 'lock_time': None}
+ models.Host.objects.filter(hostname__in=hostnames_to_lock).update(**dicts)
+
def get_servers(hostname=None, role=None, status=None):
"""Get a list of servers with matching role and status.