#pylint: disable-msg=C0111

"""
Prejob tasks.

Prejob tasks _usually_ run before a job and verify the state of a machine.
Cleanup and repair are exceptions: cleanup can run after a job too, while
repair will run any time the host needs a repair, which could be pre or post
job. Most of the work specific to this module is achieved through the prolog
and epilog of each task.

All prejob tasks must have a host, though they may not have an HQE. If a
prejob task has an HQE, it will activate the HQE through its on_pending
method on successful completion. A row in afe_special_tasks with values:
    host=C1, unlocked, is_active=0, is_complete=0, type=Verify
will indicate to the scheduler that it needs to schedule a new special task
of type=Verify against the C1 host. While the special task is running,
the scheduler only monitors it through the Agent, and its is_active bit is 1.
Once a special task finishes, we set its is_active=0, is_complete=1 and
success bits, so the scheduler ignores it.
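The bit lifecycle above can be sketched with plain dicts (an illustration
only: the helper names and dict rows here are hypothetical, while the real
rows are Django ORM objects in the afe_special_tasks table):

```python
# Illustrative sketch of the special-task bit lifecycle described above.
# The dict keys mirror the afe_special_tasks columns; the helper names
# are hypothetical.

def new_special_task(host, task_type):
    # As created: the scheduler sees is_active=0, is_complete=0 and will
    # schedule the task against the host.
    return {'host': host, 'task': task_type,
            'is_active': 0, 'is_complete': 0, 'success': 0}

def start(task):
    # While running, the scheduler only monitors the task via its Agent.
    task['is_active'] = 1

def finish(task, success):
    # Once finished, the bits tell the scheduler to ignore this row.
    task['is_active'] = 0
    task['is_complete'] = 1
    task['success'] = 1 if success else 0

task = new_special_task('C1', 'Verify')
start(task)
finish(task, success=True)
```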
HQE.on_pending:
    Host, HQE -> Pending, Starting
    This status is acted upon in the scheduler, to assign an AgentTask.
PreJobTask:
    epilog:
        failure:
            requeue hqe
            repair the host
Children PreJobTasks:
    prolog:
        set Host, HQE status
    epilog:
        success:
            on_pending
        failure:
            repair through PreJobTask
            set Host, HQE status

Failing a prejob task affects both the Host and the HQE, as follows:

- Host: a prejob failure will result in a Repair job getting queued against
the host, if we haven't already tried repairing it more than the
max_repair_limit. When this happens, the host will remain in whatever status
the prejob task left it in, till the Repair job puts it into 'Repairing'. This
way the host_scheduler won't pick bad hosts and assign them to jobs.

If we have already tried repairing the host too many times, the PreJobTask
will flip the host to 'RepairFailed' in its epilog, and it will remain in
this state till it is recovered and reverified.

- HQE: is either requeued or failed. Requeuing the HQE involves putting it
in the Queued state and setting its host_id to None, so it gets a new host
in the next scheduler tick. Failing the HQE results in either a Parsing
or Archiving postjob task, and an eventual Failed status for the HQE.
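The two HQE outcomes above can be sketched as follows (an illustration
only: the class and method names are hypothetical stand-ins for the
scheduler's HostQueueEntry, and the postjob Parsing/Archiving steps are
collapsed into a single transition):

```python
# Illustrative sketch of the requeue-vs-fail outcomes described above.

class FakeHQE(object):
    def __init__(self, host_id):
        self.status = 'Running'
        self.host_id = host_id

    def requeue(self):
        # Back to Queued with no host, so the next scheduler tick can
        # pick a fresh host for this entry.
        self.status = 'Queued'
        self.host_id = None

    def fail(self):
        # A failed HQE really goes through a Parsing or Archiving
        # postjob task before ending up Failed; one step here.
        self.status = 'Failed'

requeued = FakeHQE(host_id=1)
requeued.requeue()

failed = FakeHQE(host_id=2)
failed.fail()
```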
"""

import logging

from autotest_lib.client.common_lib import host_protections
from autotest_lib.frontend.afe import models
from autotest_lib.scheduler import agent_task, scheduler_config
from autotest_lib.server import autoserv_utils
from autotest_lib.server.cros import provision

class PreJobTask(agent_task.SpecialAgentTask):
    def epilog(self):
        super(PreJobTask, self).epilog()

        if self.success:
            return

        if self.host.protection == host_protections.Protection.DO_NOT_VERIFY:
            # effectively ignore failure for these hosts
            self.success = True
            return

        if self.queue_entry:
            # If we requeue a HQE, we should cancel any remaining pre-job
            # tasks against this host, otherwise we'll be left in a state
            # where a queued HQE has special tasks to run against a host.
            models.SpecialTask.objects.filter(
                    queue_entry__id=self.queue_entry.id,
                    host__id=self.host.id,
                    is_complete=0).update(is_complete=1, success=0)

            previous_provisions = models.SpecialTask.objects.filter(
                    task=models.SpecialTask.Task.PROVISION,
                    queue_entry_id=self.queue_entry.id).count()
            if (previous_provisions >
                    scheduler_config.config.max_provision_retries):
                self._actually_fail_queue_entry()
                # This abort will mark the aborted bit on the HQE itself, to
                # signify that we're killing it.  Technically it also will do
                # the recursive aborting of all child jobs, but that shouldn't
                # matter here, as only suites have children, and those are
                # hostless and thus don't have provisioning.
                # TODO(milleral) http://crbug.com/188217
                # However, we can't actually do this yet, as if we set the
                # abort bit the FinalReparseTask will set the status of the
                # HQE to ABORTED, which then means that we don't show the
                # status in run_suite.  So in the meantime, don't mark the
                # HQE as aborted.
                # queue_entry.abort()
            else:
                # requeue() must come after handling provision retries, since
                # _actually_fail_queue_entry needs an execution subdir.
                # We also don't want to requeue if we hit the provision retry
                # limit, since then we overwrite the PARSING state of the HQE.
                self.queue_entry.requeue()

            # Limit the repair on a host when a prejob task fails, e.g.
            # reset, verify, etc.  The number of repair jobs is limited to
            # the specific HQE and host.
            previous_repairs = models.SpecialTask.objects.filter(
                    task=models.SpecialTask.Task.REPAIR,
                    queue_entry_id=self.queue_entry.id,
                    host_id=self.queue_entry.host_id).count()
            if previous_repairs >= scheduler_config.config.max_repair_limit:
                self.host.set_status(models.Host.Status.REPAIR_FAILED)
                self._fail_queue_entry()
                return

            queue_entry = models.HostQueueEntry.objects.get(
                    id=self.queue_entry.id)
        else:
            queue_entry = None

        models.SpecialTask.objects.create(
                host=models.Host.objects.get(id=self.host.id),
                task=models.SpecialTask.Task.REPAIR,
                queue_entry=queue_entry,
                requested_by=self.task.requested_by)


    def _should_pending(self):
        """
        Decide if we should call the host queue entry's on_pending method.
        We should if:
        1) There exists an associated host queue entry.
        2) The current special task completed successfully.
        3) There do not exist any more special tasks to be run before the
           host queue entry starts.

        @returns: True if we should call pending, false if not.

        """
        if not self.queue_entry or not self.success:
            return False

        # We know if this is the last one when we create it, so we could add
        # another column to the database to keep track of this information,
        # but I expect the overhead of querying here to be minimal.
        queue_entry = models.HostQueueEntry.objects.get(id=self.queue_entry.id)
        queued = models.SpecialTask.objects.filter(
                host__id=self.host.id, is_active=False,
                is_complete=False, queue_entry=queue_entry)
        queued = queued.exclude(id=self.task.id)
        return queued.count() == 0


class VerifyTask(PreJobTask):
    TASK_TYPE = models.SpecialTask.Task.VERIFY


    def __init__(self, task):
        args = ['-v']
        if task.queue_entry:
            args.extend(self._generate_autoserv_label_args(task))
        super(VerifyTask, self).__init__(task, args)
        self._set_ids(host=self.host, queue_entries=[self.queue_entry])


    def prolog(self):
        super(VerifyTask, self).prolog()

        logging.info("starting verify on %s", self.host.hostname)
        if self.queue_entry:
            self.queue_entry.set_status(models.HostQueueEntry.Status.VERIFYING)
        self.host.set_status(models.Host.Status.VERIFYING)

        # Delete any queued manual reverifies for this host.  One verify will
        # do and there's no need to keep records of other requests.
        self.remove_special_tasks(models.SpecialTask.Task.VERIFY,
                                  keep_last_one=True)


    def epilog(self):
        super(VerifyTask, self).epilog()
        if self.success:
            if self._should_pending():
                self.queue_entry.on_pending()
            else:
                self.host.set_status(models.Host.Status.READY)


class CleanupTask(PreJobTask):
    # note this can also run post-job, but when it does, it's running
    # standalone against the host (not related to the job), so it's not
    # considered a PostJobTask

    TASK_TYPE = models.SpecialTask.Task.CLEANUP


    def __init__(self, task, recover_run_monitor=None):
        args = ['--cleanup']
        if task.queue_entry:
            args.extend(self._generate_autoserv_label_args(task))
        super(CleanupTask, self).__init__(task, args)
        self._set_ids(host=self.host, queue_entries=[self.queue_entry])


    def prolog(self):
        super(CleanupTask, self).prolog()
        logging.info("starting cleanup task for host: %s", self.host.hostname)
        self.host.set_status(models.Host.Status.CLEANING)
        if self.queue_entry:
            self.queue_entry.set_status(models.HostQueueEntry.Status.CLEANING)


    def _finish_epilog(self):
        if not self.queue_entry or not self.success:
            return

        do_not_verify_protection = host_protections.Protection.DO_NOT_VERIFY
        should_run_verify = (
                self.queue_entry.job.run_verify
                and self.host.protection != do_not_verify_protection)
        if should_run_verify:
            entry = models.HostQueueEntry.objects.get(id=self.queue_entry.id)
            models.SpecialTask.objects.create(
                    host=models.Host.objects.get(id=self.host.id),
                    queue_entry=entry,
                    task=models.SpecialTask.Task.VERIFY)
        else:
            if self._should_pending():
                self.queue_entry.on_pending()


    def epilog(self):
        super(CleanupTask, self).epilog()

        if self.success:
            self.host.update_field('dirty', 0)
            self.host.set_status(models.Host.Status.READY)

        self._finish_epilog()


class ResetTask(PreJobTask):
    """Task to reset a DUT, including cleanup and verify."""
    # note this can also run post-job, but when it does, it's running
    # standalone against the host (not related to the job), so it's not
    # considered a PostJobTask

    TASK_TYPE = models.SpecialTask.Task.RESET


    def __init__(self, task, recover_run_monitor=None):
        args = ['--reset']
        if task.queue_entry:
            args.extend(self._generate_autoserv_label_args(task))
        super(ResetTask, self).__init__(task, args)
        self._set_ids(host=self.host, queue_entries=[self.queue_entry])


    def prolog(self):
        super(ResetTask, self).prolog()
        logging.info('starting reset task for host: %s',
                     self.host.hostname)
        self.host.set_status(models.Host.Status.RESETTING)
        if self.queue_entry:
            self.queue_entry.set_status(models.HostQueueEntry.Status.RESETTING)

        # Delete any queued cleanups for this host.
        self.remove_special_tasks(models.SpecialTask.Task.CLEANUP,
                                  keep_last_one=False)

        # Delete any queued reverifies for this host.
        self.remove_special_tasks(models.SpecialTask.Task.VERIFY,
                                  keep_last_one=False)

        # Only one reset is needed.
        self.remove_special_tasks(models.SpecialTask.Task.RESET,
                                  keep_last_one=True)


    def epilog(self):
        super(ResetTask, self).epilog()

        if self.success:
            self.host.update_field('dirty', 0)

            if self._should_pending():
                self.queue_entry.on_pending()
            else:
                self.host.set_status(models.Host.Status.READY)


class ProvisionTask(PreJobTask):
    TASK_TYPE = models.SpecialTask.Task.PROVISION

    def __init__(self, task):
        # Provisioning requires that we be associated with a job/queue entry
        assert task.queue_entry, "No HQE associated with provision task!"
        # task.queue_entry is an afe model HostQueueEntry object.
        # self.queue_entry is a scheduler models HostQueueEntry object, but
        # it gets constructed and assigned in __init__, so it's not available
        # yet.  Therefore, we're stuck pulling labels off of the afe model
        # so that we can pass the --provision args into the __init__ call.
        labels = {x.name for x in task.queue_entry.job.labels}
        _, provisionable = provision.Provision.partition(labels)
        extra_command_args = ['--provision',
                             '--job-labels', ','.join(provisionable)]
        super(ProvisionTask, self).__init__(task, extra_command_args)
        self._set_ids(host=self.host, queue_entries=[self.queue_entry])


    def _command_line(self):
        # If we give queue_entry to _autoserv_command_line, then it will
        # append -c for this invocation if the queue_entry is a client side
        # test.  We don't want that, as it messes with provisioning, so we
        # just drop it from the arguments here.
        # Note that we also don't verify job_repo_url as provisioning tasks
        # are required to stage whatever content we need, and the job itself
        # will force autotest to be staged if it isn't already.
        return autoserv_utils._autoserv_command_line(self.host.hostname,
                                                     self._extra_command_args,
                                                     in_lab=True)


    def prolog(self):
        super(ProvisionTask, self).prolog()
        # TODO: add a check for a previous provision task and abort if one
        # exists.
        logging.info("starting provision task for host: %s",
                     self.host.hostname)
        self.queue_entry.set_status(
                models.HostQueueEntry.Status.PROVISIONING)
        self.host.set_status(models.Host.Status.PROVISIONING)


    def epilog(self):
        super(ProvisionTask, self).epilog()

        # If we were not successful in provisioning the machine, leave the
        # DUT in whatever status was set in the PreJobTask's epilog.  If
        # this task was successful, the host status will get set
        # appropriately as a fallout of the HQE's on_pending.  If we don't
        # call on_pending, it can only be because:
        #   1. This task was not successful:
        #       a. Another repair is queued: this repair job will set the
        #          host status, and it will remain in 'Provisioning' till
        #          then.
        #       b. We have hit the max_repair_limit: in which case the host
        #          status is set to 'RepairFailed' in the epilog of
        #          PreJobTask.
        #   2. The task was successful, but there are other special tasks:
        #      Those special tasks will set the host status appropriately.
        if self._should_pending():
            self.queue_entry.on_pending()


class RepairTask(agent_task.SpecialAgentTask):
    TASK_TYPE = models.SpecialTask.Task.REPAIR


    def __init__(self, task):
        """\
        queue_entry: queue entry to mark failed if this repair fails.
        """
        protection = host_protections.Protection.get_string(
                task.host.protection)
        # normalize the protection name
        protection = host_protections.Protection.get_attr_name(protection)

        args = ['-R', '--host-protection', protection]
        if task.queue_entry:
            args.extend(self._generate_autoserv_label_args(task))

        super(RepairTask, self).__init__(task, args)

        # *don't* include the queue entry in IDs -- if the queue entry is
        # aborted, we want to leave the repair task running
        self._set_ids(host=self.host)


    def prolog(self):
        super(RepairTask, self).prolog()
        logging.info("repair_task starting")
        self.host.set_status(models.Host.Status.REPAIRING)


    def epilog(self):
        super(RepairTask, self).epilog()

        if self.success:
            self.host.set_status(models.Host.Status.READY)
        else:
            self.host.set_status(models.Host.Status.REPAIR_FAILED)
            if self.queue_entry:
                self._fail_queue_entry()