# pylint: disable=missing-docstring

""" This is the module for everything related to the AgentTask.

The AgentTask imposes an interface through which the scheduler can monitor
a process; examples of such processes include Verify, Cleanup and the Queue
Tasks that run the tests. The scheduler itself only understands Agents.
Agents:
    The Agent is the bridge between the scheduler and the AgentTask. The
    scheduler's tick has a method called handle_agents, which calls the
    tick of each agent in the Dispatcher's queue. This leads to the Agent
    polling its AgentTask. The scheduler will keep polling a task through
    the associated Agent until the Agent is removed from the dispatcher.

    At a high level:
        agents finished = tasks done
            agent polls till finished
                task polls till done
                    task sets done
            agent is removed from dispatcher
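
    A minimal sketch of that polling relationship, for illustration only
    (the real Agent and Dispatcher live in the scheduler, not this module;
    the names below are simplified stand-ins, not the actual implementation):

        class Agent(object):
            def __init__(self, agent_task):
                self.task = agent_task

            def tick(self):
                # Called once per scheduler tick via handle_agents().
                self.task.poll()

            def is_done(self):
                return self.task.is_done()

        def handle_agents(agents):
            # Poll every agent; drop agents whose task has finished.
            for agent in list(agents):
                agent.tick()
                if agent.is_done():
                    agents.remove(agent)
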
AgentTasks:
    Basic AgentTasks are created when an HQE (host queue entry) changes state.
    Examples of these are the QueueTask, which is created when an HQE goes into
    the Starting state, and the FinalReparseTask, which is created when the HQE
    goes into Parsing.
SpecialAgentTasks:
    Unlike AgentTasks, SpecialAgentTasks are only created when a row is inserted
    in the afe_special_tasks table. All PrejobTasks are SpecialAgentTasks.

monitor_db.get_agent_task_for_special_task/get_agent_task_for_queue_entry map
an AgentTask to an Agent, which the scheduler understands. From this point
onward, the scheduler manages the task through the Agent's interface, as
follows.
At a high level (a condensed sketch of these steps follows the outline):
    task poll
        start
            prolog
        tick till we get an exit code
            finished(exit==0)
                done=True
                epilog
                    cleanup
                        set is_active, is_complete, success (checked in scheduler)

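The same flow, written as a condensed sketch of the methods AgentTask actually
defines below (simplified; monitor setup and error handling omitted):

    def poll(self):
        if not self.started:
            self.start()              # prolog() + run(); sets self.started
        if not self.done:
            self.tick()               # check the autoserv exit code

    def tick(self):
        exit_code = self.monitor.exit_code()
        if exit_code is None:
            return                    # still running; check again next tick
        self.finished(exit_code == 0)

    def finished(self, success):
        self.done = True
        self.success = success
        self.epilog()                 # epilog() ends by calling cleanup()
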
The first special task for an HQE is usually Reset.
- poll: The first poll will start the task; polls thereafter will call the
        task's tick method. A started task will have the started bit set.
- start: Call prolog, run the process and set the start bit.
    - prolog: Usually where one puts any model state changes that happen before
              the actual task. Different per task. Examples of things that might
              happen in a prolog:
                  - set the state of the Host/HQE (to something like Resetting)
                  - delete any unwanted queued special tasks
                  - register a pidfile
                  - set the is_active bit on the special task
    - run:
        - create a PidfileRunMonitor
        - pass the autoserv command, working directory etc. to the drone
          manager. This will start the actual autoserv process.
    - set the start bit: so subsequent polls do not 'start' again

- tick: For as long as a started task's done bit is not set, a poll will lead
        to a tick. The tick monitors the pid file of the autoserv process
        running on the drone through the PidfileRunMonitor created in prolog.
        If the autoserv process has finished we call finished with True/False
        depending on the autoserv exit code.

        - finished: sets the done and success values, then calls epilog. The
                    done bit is important because the Agent polls this bit to
                    measure the success or failure of its task.

          - epilog: Is generally where we set the status of the Host/HQE again,
                    requeue any other task that needs to run after this one
                    and perform cleanup. Just like the prolog, this step is
                    different per task.

              - cleanup: Sets the is_active, is_complete and success
                         states on the task's model. Also uses the
                         drone_manager to:
                             unregister the pidfile
                             copy results of the task
                         (Note this is not to be confused with the
                          special task called Cleanup.)

                        The actions we take in the epilog are based on the
                        success/failure of the autoserv process set in cleanup,
                        e.g. if Reset failed we will enqueue a Repair, but if
                        all is well the epilog will just return. Prejob task
                        epilogs also have an on_pending method that changes
                        the status of the HQE to pending/starting, which gets
                        picked up in the scheduler.

By this point the is_done flag is set, which results in the Agent noticing that
the task has finished and unregistering it from the dispatcher.

Class hierarchy:
AgentTask
 |--->SpecialAgentTask (prejob_task.py)
      |--->RepairTask
      |--->PreJobTask
           |--->Verify, Cleanup, Reset, Provision

 |--->AbstractQueueTask (monitor_db.py)
      |--->QueueTask
      |--->HostlessQueueTask

 |--->PostJobTask (postjob_task.py)
      |--->GatherLogsTask
      |--->SelfThrottledPostJobTask
           |--->FinalReparseTask

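For illustration only, a minimal concrete AgentTask must supply the members
this module declares as "must be overridden" (_command_line, owner_username
and _working_directory). The class and values below are a hypothetical sketch,
not a real task in this code base:

    class ExampleTask(AgentTask):
        def _command_line(self):
            # Command for the drone manager to run (real tasks build an
            # autoserv invocation, e.g. via autoserv_utils).
            return ['/bin/true']

        @property
        def owner_username(self):
            return 'example-user'     # hypothetical owner

        def _working_directory(self):
            return '/tmp/example_task'

prolog() and epilog() may additionally be overridden to update Host/HQE state
before and after the process runs, as described above.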
107"""
108
109import logging
110import os
beeps5e2bb4a2013-10-28 11:26:45 -0700111import time
Allen Lie8adf592017-02-06 17:28:28 -0800112import urllib
113
114import common
beeps5e2bb4a2013-10-28 11:26:45 -0700115
Dan Shie6803232016-01-19 17:30:20 -0800116from autotest_lib.client.common_lib import global_config
Dan Shi80f7c532015-08-25 10:23:14 -0700117from autotest_lib.client.common_lib import utils
beeps5e2bb4a2013-10-28 11:26:45 -0700118from autotest_lib.frontend.afe import models
Allen Lie8adf592017-02-06 17:28:28 -0800119from autotest_lib.scheduler import drone_manager
120from autotest_lib.scheduler import email_manager
121from autotest_lib.scheduler import pidfile_monitor
beepscc9fc702013-12-02 12:45:38 -0800122from autotest_lib.scheduler import rdb_lib
Allen Lie8adf592017-02-06 17:28:28 -0800123from autotest_lib.scheduler import scheduler_lib
beeps5e2bb4a2013-10-28 11:26:45 -0700124from autotest_lib.scheduler import scheduler_models
125from autotest_lib.server import autoserv_utils
Dan Shi114e1722016-01-10 18:12:53 -0800126from autotest_lib.server import system_utils
beeps5e2bb4a2013-10-28 11:26:45 -0700127
Dan Shi5e2efb72017-02-07 11:40:23 -0800128try:
129 from chromite.lib import metrics
130except ImportError:
131 metrics = utils.metrics_mock
132
133
Dan Shie6803232016-01-19 17:30:20 -0800134CONFIG = global_config.global_config
beeps5e2bb4a2013-10-28 11:26:45 -0700135AUTOSERV_NICE_LEVEL = 10
136
Dan Shie6803232016-01-19 17:30:20 -0800137ENABLE_DRONE_IN_RESTRICTED_SUBNET = CONFIG.get_config_value(
138 'CROS', 'enable_drone_in_restricted_subnet', type=bool,
139 default=False)
140
beeps5e2bb4a2013-10-28 11:26:45 -0700141
class AgentTask(object):
    class _NullMonitor(object):
        pidfile_id = None

        def has_process(self):
            return True


    def __init__(self, log_file_name=None):
        """
        @param log_file_name: (optional) name of file to log command output to
        """
        self._drone_manager = drone_manager.instance()
        self.done = False
        self.started = False
        self.success = None
        self.aborted = False
        self.monitor = None
        self.queue_entry_ids = []
        self.host_ids = []
        # A map between host id and hostname.
        self.hostnames = {}
        self._log_file_name = log_file_name


    def _set_ids(self, host=None, queue_entries=None):
        if queue_entries and queue_entries != [None]:
            self.host_ids = []
            self.queue_entry_ids = []
            self.hostnames = {}
            for entry in queue_entries:
                if entry.host is not None:
                    self.host_ids.append(entry.host.id)
                    self.queue_entry_ids.append(entry.id)
                    self.hostnames[entry.host.id] = entry.host.hostname
                else:
                    logging.debug(
                            'No host is found for host_queue_entry_id: %r',
                            entry.id)
                    raise scheduler_lib.NoHostIdError(
                            'Failed to schedule a job whose '
                            'host_queue_entry_id=%r due to no host_id.'
                            % entry.id)
        else:
            assert host
            self.host_ids = [host.id]
            self.hostnames = {host.id: host.hostname}


    def poll(self):
        if not self.started:
            self.start()
        if not self.done:
            self.tick()


    def tick(self):
        assert self.monitor
        exit_code = self.monitor.exit_code()
        if exit_code is None:
            return

        success = (exit_code == 0)
        self.finished(success)


    def is_done(self):
        return self.done


    def finished(self, success):
        if self.done:
            assert self.started
            return
        self.started = True
        self.done = True
        self.success = success
        self.epilog()


    def prolog(self):
        """
        To be overridden.
        """
        assert not self.monitor
        self.register_necessary_pidfiles()


    def _log_file(self):
        if not self._log_file_name:
            return None
        return os.path.join(self._working_directory(), self._log_file_name)


    def cleanup(self):
        log_file = self._log_file()
        if self.monitor and log_file:
            self.monitor.try_copy_to_results_repository(log_file)


    def epilog(self):
        """
        To be overridden.
        """
        self.cleanup()
        logging.info("%s finished with success=%s", type(self).__name__,
                     self.success)


    def start(self):
        if not self.started:
            self.prolog()
            self.run()

        self.started = True


    def abort(self):
        if self.monitor:
            self.monitor.kill()
        self.done = True
        self.aborted = True
        self.cleanup()


    def _get_consistent_execution_path(self, execution_entries):
        first_execution_path = execution_entries[0].execution_path()
        for execution_entry in execution_entries[1:]:
            assert execution_entry.execution_path() == first_execution_path, (
                    '%s (%s) != %s (%s)' % (execution_entry.execution_path(),
                                            execution_entry,
                                            first_execution_path,
                                            execution_entries[0]))
        return first_execution_path


    def _copy_results(self, execution_entries, use_monitor=None):
        """
        @param execution_entries: list of objects with execution_path() method
        """
        if use_monitor is not None and not use_monitor.has_process():
            return

        assert len(execution_entries) > 0
        if use_monitor is None:
            assert self.monitor
            use_monitor = self.monitor
        assert use_monitor.has_process()
        execution_path = self._get_consistent_execution_path(execution_entries)
        results_path = execution_path + '/'
        use_monitor.try_copy_to_results_repository(results_path)


    def _parse_results(self, queue_entries):
        for queue_entry in queue_entries:
            queue_entry.set_status(models.HostQueueEntry.Status.PARSING)


    def _command_line(self):
        """
        Return the command line to run. Must be overridden.
        """
        raise NotImplementedError


    @property
    def num_processes(self):
        """
        Return the number of processes forked by this AgentTask's process.
        It may only be approximate. To be overridden if necessary.
        """
        return 1


    def _paired_with_monitor(self):
        """
        If this AgentTask's process must run on the same machine as some
        previous process, this method should be overridden to return a
        PidfileRunMonitor for that process.
        """
        return self._NullMonitor()


    @property
    def owner_username(self):
        """
        Return login of user responsible for this task. May be None. Must be
        overridden.
        """
        raise NotImplementedError


    def _working_directory(self):
        """
        Return the directory where this AgentTask's process executes.
        Must be overridden.
        """
        raise NotImplementedError


    def _pidfile_name(self):
        """
        Return the name of the pidfile this AgentTask's process uses. To be
        overridden if necessary.
        """
        return drone_manager.AUTOSERV_PID_FILE


    def _check_paired_results_exist(self):
        if not self._paired_with_monitor().has_process():
            metrics.Counter(
                'chromeos/autotest/errors/scheduler/no_paired_results'
            ).increment()
            self.finished(False)
            return False
        return True


    def _create_monitor(self):
        assert not self.monitor
        self.monitor = pidfile_monitor.PidfileRunMonitor()


    def run(self):
        if not self._check_paired_results_exist():
            return

        self._create_monitor()
        self.monitor.run(
                self._command_line(), self._working_directory(),
                num_processes=self.num_processes,
                nice_level=AUTOSERV_NICE_LEVEL, log_file=self._log_file(),
                pidfile_name=self._pidfile_name(),
                paired_with_pidfile=self._paired_with_monitor().pidfile_id,
                username=self.owner_username,
                drone_hostnames_allowed=self.get_drone_hostnames_allowed())


    def get_drone_hostnames_allowed(
            self, restricted_subnets=utils.RESTRICTED_SUBNETS,
            enable_drone_in_subnet=ENABLE_DRONE_IN_RESTRICTED_SUBNET):
        """Return the set of drone hostnames allowed to run this task.

        Returns None when any drone may run the task; otherwise returns a set
        of drone hostnames filtered by restricted-subnet membership and, when
        drone sets are enabled, by the job's or user's drone set. An empty set
        means no drone can run the task.
        """
        filtered_drones = None
        has_unrestricted_host = False
        if (self.hostnames and restricted_subnets and enable_drone_in_subnet):
            for hostname in self.hostnames.values():
                subnet = utils.get_restricted_subnet(hostname,
                                                     restricted_subnets)

                # Return an empty set if the list of hosts spans both
                # restricted and unrestricted subnets. No drone can work in
                # such a case.
                if ((not subnet and filtered_drones is not None) or
                    (subnet and has_unrestricted_host)):
                    logging.error('The test has some DUT in restricted subnet, '
                                  'but some in unrestricted subnet. Therefore, '
                                  'no drone is available to run the test.')
                    return set()

                if not subnet:
                    has_unrestricted_host = True
                    continue

                server_ip_map = system_utils.DroneCache.get_drone_ip_map()
                filtered_drones_for_host = set(
                        utils.get_servers_in_same_subnet(
                                subnet[0], subnet[1],
                                server_ip_map=server_ip_map))
                logging.info('DUT %s is in restricted subnet, drone can only '
                             'be chosen from %s', hostname,
                             filtered_drones_for_host)
                if filtered_drones is None:
                    filtered_drones = filtered_drones_for_host
                else:
                    filtered_drones = set.intersection(
                            filtered_drones, filtered_drones_for_host)

                # If filtered_drones is an empty set, that means no drone is
                # allowed to run the task. This is different from None, which
                # means all drones are allowed.
                if filtered_drones == set():
                    logging.error('DUT(s) is in restricted subnet, but no '
                                  'drone is available to run the test.')
                    return filtered_drones

        # If no host is in a restricted subnet, use the unrestricted drones
        # only.
        if (filtered_drones is None and restricted_subnets and
            enable_drone_in_subnet):
            filtered_drones = set(
                    system_utils.DroneCache.get_unrestricted_drones(
                            restricted_subnets=restricted_subnets))

        if not models.DroneSet.drone_sets_enabled():
            return filtered_drones

        hqes = models.HostQueueEntry.objects.filter(id__in=self.queue_entry_ids)
        if not hqes:
            # Only special tasks could be missing host queue entries.
            assert isinstance(self, SpecialAgentTask)
            return self._user_or_global_default_drone_set(
                    self.task, self.task.requested_by)

        job_ids = hqes.values_list('job', flat=True).distinct()
        assert job_ids.count() == 1, ("AgentTask's queue entries "
                                      "span multiple jobs")

        job = models.Job.objects.get(id=job_ids[0])
        drone_set = job.drone_set
        if not drone_set:
            return self._user_or_global_default_drone_set(job, job.user())

        if filtered_drones:
            return set.intersection(filtered_drones,
                                    drone_set.get_drone_hostnames())
        else:
            return drone_set.get_drone_hostnames()


    def _user_or_global_default_drone_set(self, obj_with_owner, user):
        """
        Returns the user's default drone set, if present.

        Otherwise, returns the global default drone set.
        """
        default_hostnames = models.DroneSet.get_default().get_drone_hostnames()
        if not user:
            logging.warning('%s had no owner; using default drone set',
                            obj_with_owner)
            return default_hostnames
        if not user.drone_set:
            logging.warning('User %s has no default drone set, using global '
                            'default', user.login)
            return default_hostnames
        return user.drone_set.get_drone_hostnames()


    def register_necessary_pidfiles(self):
        pidfile_id = self._drone_manager.get_pidfile_id_from(
                self._working_directory(), self._pidfile_name())
        self._drone_manager.register_pidfile(pidfile_id)

        paired_pidfile_id = self._paired_with_monitor().pidfile_id
        if paired_pidfile_id:
            self._drone_manager.register_pidfile(paired_pidfile_id)


    def recover(self):
        if not self._check_paired_results_exist():
            return

        self._create_monitor()
        self.monitor.attach_to_existing_process(
                self._working_directory(), pidfile_name=self._pidfile_name(),
                num_processes=self.num_processes)
        if not self.monitor.has_process():
            # No process to recover; wait to be started normally.
            self.monitor = None
            return

        self.started = True
        logging.info('Recovering process %s for %s at %s',
                     self.monitor.get_process(), type(self).__name__,
                     self._working_directory())


    def _check_queue_entry_statuses(self, queue_entries, allowed_hqe_statuses,
                                    allowed_host_statuses=None):
        class_name = self.__class__.__name__
        for entry in queue_entries:
            if entry.status not in allowed_hqe_statuses:
                # In the original code, we raise an exception here. In an
                # effort to prevent downtime we will instead abort the job and
                # send out an email notifying us this has occurred.
                error_message = ('%s attempting to start entry with invalid '
                                 'status %s: %s. Aborting Job: %s.'
                                 % (class_name, entry.status, entry,
                                    entry.job))
                logging.error(error_message)
                email_manager.manager.enqueue_notify_email(
                        'Job Aborted - Invalid Host Queue Entry Status',
                        error_message)
                entry.job.request_abort()
            invalid_host_status = (
                    allowed_host_statuses is not None
                    and entry.host.status not in allowed_host_statuses)
            if invalid_host_status:
                # In the original code, we raise an exception here. In an
                # effort to prevent downtime we will instead abort the job and
                # send out an email notifying us this has occurred.
                error_message = ('%s attempting to start on queue entry with '
                                 'invalid host status %s: %s. Aborting Job: %s'
                                 % (class_name, entry.host.status, entry,
                                    entry.job))
                logging.error(error_message)
                email_manager.manager.enqueue_notify_email(
                        'Job Aborted - Invalid Host Status', error_message)
                entry.job.request_abort()


class TaskWithJobKeyvals(object):
    """AgentTask mixin providing functionality to help with job keyval files."""
    _KEYVAL_FILE = 'keyval'

    def _format_keyval(self, key, value):
        return '%s=%s' % (key, value)


    def _keyval_path(self):
        """Subclasses must override this."""
        raise NotImplementedError


    def _write_keyval_after_job(self, field, value):
        assert self.monitor
        if not self.monitor.has_process():
            return
        self._drone_manager.write_lines_to_file(
                self._keyval_path(), [self._format_keyval(field, value)],
                paired_with_process=self.monitor.get_process())


    def _job_queued_keyval(self, job):
        return 'job_queued', int(time.mktime(job.created_on.timetuple()))


    def _write_job_finished(self):
        self._write_keyval_after_job("job_finished", int(time.time()))


    def _write_keyvals_before_job_helper(self, keyval_dict, keyval_path):
        keyval_contents = '\n'.join(self._format_keyval(key, value)
                                    for key, value in keyval_dict.iteritems())
        # always end with a newline to allow additional keyvals to be written
        keyval_contents += '\n'
        self._drone_manager.attach_file_to_execution(self._working_directory(),
                                                     keyval_contents,
                                                     file_path=keyval_path)


    def _write_keyvals_before_job(self, keyval_dict):
        self._write_keyvals_before_job_helper(keyval_dict, self._keyval_path())


    def _write_host_keyvals(self, host):
        keyval_path = os.path.join(self._working_directory(), 'host_keyvals',
                                   host.hostname)
        platform, all_labels = host.platform_and_labels()
        all_labels = [urllib.quote(label) for label in all_labels]
        keyval_dict = dict(platform=platform, labels=','.join(all_labels))
        self._write_keyvals_before_job_helper(keyval_dict, keyval_path)


class SpecialAgentTask(AgentTask, TaskWithJobKeyvals):
    """
    Subclass for AgentTasks that correspond to a SpecialTask entry in the DB.
    """

    TASK_TYPE = None
    host = None
    queue_entry = None
    _COUNT_METRIC = 'chromeos/autotest/scheduler/special_task_count'
    _DUT_METRIC = 'chromeos/autotest/scheduler/special_task_by_dut'
    _DURATION_METRIC = 'chromeos/autotest/scheduler/special_task_durations'


    def __init__(self, task, extra_command_args):
        super(SpecialAgentTask, self).__init__()

        assert self.TASK_TYPE is not None, 'self.TASK_TYPE must be overridden'

        self.host = rdb_lib.get_hosts([task.host.id])[0]
        self.host.dbg_str = 'Task: %s' % str(task)
        self.queue_entry = None
        if task.queue_entry:
            self.queue_entry = scheduler_models.HostQueueEntry(
                    id=task.queue_entry.id)
            self.host.dbg_str += self.queue_entry.get_dbg_str()

        # This is of type SpecialTask (as defined in frontend/afe/models.py).
        self.task = task
        self._extra_command_args = extra_command_args
        self.host.metadata = self.get_metadata()
        self._milestone = ''


    def get_metadata(self):
        """Get a dictionary that contains task information.

        The return value is a dictionary that includes task information like
        id, name and related job information. The value will be stored in the
        metadata database.

        @return: A dictionary containing the task id, name and related job id.
                 If some attributes cannot be accessed, an empty dictionary
                 will be returned and an error will be logged.
        """
        try:
            metadata = {'task_id': self.task.id, 'task_name': self.task.task,
                        'hostname': self.task.host.hostname}
            if self.task.queue_entry:
                job = self.task.queue_entry.job
                metadata.update(
                        scheduler_models.get_job_metadata(job))
            return metadata
        except AttributeError as e:
            logging.error('Task has missing attribute: %s', e)
            return {}


    def _keyval_path(self):
        return os.path.join(self._working_directory(), self._KEYVAL_FILE)


    def _command_line(self):
        return autoserv_utils._autoserv_command_line(
                self.host.hostname,
                self._extra_command_args,
                queue_entry=self.queue_entry,
                in_lab=True)


    def _working_directory(self):
        return self.task.execution_path()


    @property
    def owner_username(self):
        if self.task.requested_by:
            return self.task.requested_by.login
        return None


    def prolog(self):
        super(SpecialAgentTask, self).prolog()
        self.task.activate()
        self._write_host_keyvals(self.host)


    def _fail_queue_entry(self):
        assert self.queue_entry

        if self.queue_entry.meta_host:
            return  # don't fail metahost entries, they'll be reassigned

        self.queue_entry.update_from_database()
        if self.queue_entry.status != models.HostQueueEntry.Status.QUEUED:
            return  # entry has been aborted

        self._actually_fail_queue_entry()


    def epilog(self):
        super(SpecialAgentTask, self).epilog()
        self._emit_special_task_status_metric()


    def _emit_special_task_status_metric(self):
        """Increments an accumulator associated with this special task."""
        fields = {'type': self.TASK_TYPE,
                  'success': bool(self.success),
                  'board': str(self.host.board),
                  'milestone': self._milestone}
        metrics.Counter(self._COUNT_METRIC).increment(fields=fields)

        if self.task.time_finished and self.task.time_started:
            duration = (self.task.time_finished -
                        self.task.time_started).total_seconds()
            metrics.SecondsDistribution(self._DURATION_METRIC).add(
                    duration, fields=fields)

        dut_fields = {
            'type': self.TASK_TYPE,
            'success': bool(self.success),
            'board': str(self.host.board),
            'dut_host_name': self.host.hostname
        }
        metrics.Counter(self._DUT_METRIC).increment(fields=dut_fields)

    # TODO(milleral): http://crbug.com/268607
    # All this used to be a part of _fail_queue_entry. The exact semantics of
    # when one should and should not be failing a queue entry need to be
    # worked out, because provisioning has placed us in a case where we want
    # to fail a queue entry that could be requeued, which makes us fail the
    # two above if statements, and thus _fail_queue_entry() would exit early
    # and have no effect. What's left here with _actually_fail_queue_entry is
    # a hack to be able to bypass the checks and unconditionally execute the
    # code.
    def _actually_fail_queue_entry(self):
        self.queue_entry.set_execution_subdir()
        queued_key, queued_time = self._job_queued_keyval(
                self.queue_entry.job)
        self._write_keyval_after_job(queued_key, queued_time)
        self._write_job_finished()

        # Copy results logs into the normal place for job results.
        self.monitor.try_copy_results_on_drone(
                source_path=self._working_directory() + '/',
                destination_path=self.queue_entry.execution_path() + '/')

        pidfile_id = self._drone_manager.get_pidfile_id_from(
                self.queue_entry.execution_path(),
                pidfile_name=drone_manager.AUTOSERV_PID_FILE)
        self._drone_manager.register_pidfile(pidfile_id)

        # TODO(ayatane): This should obey
        # self.queue_entry.job.parse_failed_repair, but nothing seems to set
        # self.queue_entry.job.parse_failed_repair?  Check git blame.
        self._parse_results([self.queue_entry])

        # Also fail all other special tasks that have not yet run for this HQE.
        pending_tasks = models.SpecialTask.objects.filter(
                queue_entry__id=self.queue_entry.id,
                is_complete=0)
        for task in pending_tasks:
            task.finish(False)


    def cleanup(self):
        super(SpecialAgentTask, self).cleanup()

        # We will consider an aborted task to be "Failed".
        self.task.finish(bool(self.success))

        if self.monitor:
            if self.monitor.has_process():
                self._copy_results([self.task])
            if self.monitor.pidfile_id is not None:
                self._drone_manager.unregister_pidfile(self.monitor.pidfile_id)


    def remove_special_tasks(self, special_task_to_remove, keep_last_one=False):
        """Remove queued special tasks of a given type, optionally keeping the
        last one.

        @param special_task_to_remove: type of special task to be removed,
                e.g., models.SpecialTask.Task.VERIFY.
        @param keep_last_one: True to keep the last special task if its type
                is the same as special_task_to_remove.
        """
        queued_special_tasks = models.SpecialTask.objects.filter(
                host__id=self.host.id,
                task=special_task_to_remove,
                is_active=False, is_complete=False, queue_entry=None)
        if keep_last_one:
            queued_special_tasks = queued_special_tasks.exclude(id=self.task.id)
        queued_special_tasks.delete()


    def _generate_autoserv_label_args(self, task):
        """
        @param task: An instance of afe model's SpecialTask.
        @returns: The list of arguments to pass to autoserv to tell it what
                  the labels of a job are.
        """
        labels = {x.name for x in task.queue_entry.job.labels}
        return ['--job-labels', ','.join(labels)]