Autotest: reboot DUTs when they are moved from shard to master.
A special REPAIR task is triggered to force a reboot of a shard's DUTs
when the shard is deleted, in order to make sure the DUTs are still in
Ready status and their testing logs are owned by the master DB. However,
the REPAIR job cannot guarantee a reboot within a short time.
This CL instead triggers a reboot test with the highest priority on all
DUTs that will be moved from the shard to the master. The procedure for
deleting a shard is:
1. Lock all unlocked DUTs of this shard.
2. Delete any shard information in the master DB.
3. Trigger a reboot test with the highest priority, to make sure that this
test runs first once the DUTs are unlocked.
4. Unlock these DUTs.
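The four steps above can be sketched in plain Python. This is a minimal
simulation of the sequencing argument, not the real Autotest code: the
`Host`, `Scheduler`, `delete_shard` names and the `'SUPER'` priority value
are hypothetical stand-ins for the actual models and RPCs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Host:
    """Hypothetical stand-in for the Autotest Host model."""
    hostname: str
    shard: Optional[str] = None
    locked: bool = False


@dataclass
class Scheduler:
    """Hypothetical stand-in for the job-creation RPC."""
    jobs: List[dict] = field(default_factory=list)

    def create_job(self, name, priority, hosts):
        self.jobs.append({'name': name, 'priority': priority, 'hosts': hosts})


def delete_shard(shard, hosts, scheduler):
    # 1. Lock every currently-unlocked host on the shard so nothing new
    #    can be scheduled on them while shard state is torn down.
    to_lock = [h for h in hosts if h.shard == shard and not h.locked]
    for h in to_lock:
        h.locked = True
    # 2. Remove the shard association from the hosts.
    for h in hosts:
        if h.shard == shard:
            h.shard = None
    # 3. Queue the reboot job at the highest priority while the hosts are
    #    still locked, so it is first in line the moment they unlock.
    if to_lock:
        scheduler.create_job('reboot_dut_for_shard_deletion', 'SUPER',
                             [h.hostname for h in to_lock])
    # 4. Unlock only the hosts this procedure locked; hosts that were
    #    already locked by someone else stay locked.
    for h in to_lock:
        h.locked = False
```

The lock-before-queue ordering is the whole point: because the DUTs are
locked when the reboot job is created, no other pending job can grab them
first, so the reboot always runs ahead of the queued tests.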
BUG=chromium:499865
TEST=Configured a cbf master and a cbf shard. Scheduled several tasks on the
master, one running, the others pending. Ran 'atest shard delete ***' on the
master to make sure:
* The DUTs belonging to the shard are locked.
* The shard is deleted.
* A reboot test with the highest priority is triggered.
* The DUTs are unlocked.
* The reboot test is run first; all other pending tasks remain queued.
* After the reboot test finishes, the other pending tasks continue
to run on the master.
Ran site_rpc_interface_unittest locally.
Change-Id: I2b348e520c0f67bec5b4b1c89c75ad41e86c72a2
Reviewed-on: https://chromium-review.googlesource.com/334434
Commit-Ready: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Fang Deng <fdeng@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
diff --git a/frontend/afe/site_rpc_interface.py b/frontend/afe/site_rpc_interface.py
index 797ca06..72fbfae 100644
--- a/frontend/afe/site_rpc_interface.py
+++ b/frontend/afe/site_rpc_interface.py
@@ -859,10 +859,16 @@
"""Delete a shard and reclaim all resources from it.
This claims back all assigned hosts from the shard. To ensure all DUTs are
- in a sane state, a Repair task is scheduled for them. This reboots the DUTs
- and therefore clears all running processes that might be left.
+ in a sane state, a reboot test with the highest priority is scheduled for
+ them. This reboots the DUTs, after which the remaining tasks continue to
+ run on the master's drones.
- The shard_id of jobs of that shard will be set to None.
+ The procedure for deleting a shard:
+ * Lock all unlocked hosts on that shard.
+ * Remove shard information.
+ * Assign a reboot task with highest priority to these hosts.
+ * Unlock these hosts; the reboot tasks then run ahead of all other
+ tasks.
The status of jobs that haven't been reported to be finished yet, will be
lost. The master scheduler will pick up the jobs and execute them.
@@ -870,31 +876,38 @@
@param hostname: Hostname of the shard to delete.
"""
shard = rpc_utils.retrieve_shard(shard_hostname=hostname)
+ hostnames_to_lock = [h.hostname for h in
+ models.Host.objects.filter(shard=shard, locked=False)]
# TODO(beeps): Power off shard
+ # For ChromeOS hosts, a reboot test with the highest priority is added to
+ # the DUT. After a reboot it should be guaranteed that no processes from
+ # prior tests that were run by a shard are still running on it.
- # For ChromeOS hosts, repair reboots the DUT.
- # Repair will excalate through multiple repair steps and will verify the
- # success after each of them. Anyway, it will always run at least the first
- # one, which includes a reboot.
- # After a reboot we can be sure no processes from prior tests that were run
- # by a shard are still running on the DUT.
- # Important: Don't just set the status to Repair Failed, as that would run
- # Verify first, before doing any repair measures. Verify would probably
- # succeed, so this wouldn't change anything on the DUT.
- for host in models.Host.objects.filter(shard=shard):
- models.SpecialTask.objects.create(
- task=models.SpecialTask.Task.REPAIR,
- host=host,
- requested_by=models.User.current_user())
+ # Lock all unlocked hosts.
+ dicts = {'locked': True, 'lock_time': datetime.datetime.now()}
+ models.Host.objects.filter(hostname__in=hostnames_to_lock).update(**dicts)
+
+ # Remove shard information.
models.Host.objects.filter(shard=shard).update(shard=None)
-
models.Job.objects.filter(shard=shard).update(shard=None)
-
shard.labels.clear()
-
shard.delete()
+ # Assign a reboot task with highest priority: Super.
+ t = models.Test.objects.get(name='platform_BootPerfServer:shard')
+ c = utils.read_file(os.path.join(common.autotest_dir, t.path))
+ if hostnames_to_lock:
+ rpc_utils.create_job_common(
+ 'reboot_dut_for_shard_deletion',
+ priority=priorities.Priority.SUPER,
+ control_type='Server',
+ control_file=c, hosts=hostnames_to_lock)
+
+ # Unlock these shard-related hosts.
+ dicts = {'locked': False, 'lock_time': None}
+ models.Host.objects.filter(hostname__in=hostnames_to_lock).update(**dicts)
+
def get_servers(hostname=None, role=None, status=None):
"""Get a list of servers with matching role and status.