- e60e44e Special tasks show "Failed" as their status instead of "Completed" if by showard · 15 years ago
- 1b0ffc3 Address shutil.copy() failure when running a scheduler instance without by showard · 15 years ago
- 7ca9e01 Remove the synch_job_start_timeout_minutes scheduler "feature" as it is by showard · 15 years ago
- a21b949 Added functional test for recovering jobs with atomic hosts, with HQEs by showard · 15 years ago
- 65db393 * impose prioritization on SpecialTasks based on task type: Repair, then Cleanup, then Verify. remove prioritization of STs with queue entry over those without. this leads to more sane ordering of execution in certain unusual contexts -- the added functional test cases illustrate a few (in some cases, it's not just more sane, it eliminates bugs as well). by showard · 15 years ago
- 7b2d7cb We never considered the handling of DO_NOT_VERIFY hosts in certain situations. This adds handling of those cases to the scheduler and adds tests to the scheduler functional test. by showard · 15 years ago
- 4a60479 add a bunch of tests to the scheduler functional test to cover pre- and post-job cleanup, including failure cases by showard · 15 years ago
- 37757f3 Change "unrecovered active host queue entries" to be a more accurate by showard · 15 years ago
- ac5b000 * get rid of the code to create the drone temp dir in drones.py. This used to be necessary because we needed that directory just to run drone_utility (so we could put the pickle file there). But now we use stdin, so we don't need this anymore. (drone_utility still initializes the temp dir for its own use.) by showard · 15 years ago
- 202343e On the results drone, execute code from the results dir. by showard · 15 years ago
- 2aafd90 Need to get the drone temporary directory under the results dir as well. Added unit tests to check this and to check the behavior of attach_file_to_execution, which was being affected by this bug (but wasn't actually buggy itself). by showard · 15 years ago
- c75fded Fix the drone results dir computation. I forgot that the results don't just go under the drone_installation_directory, they go under "results" in there. by showard · 15 years ago
- 093a068 Added string stdin support to utils.BgJob and all its users that give it by jadmanski · 15 years ago
- 8375ce0 Fix unindexable object error raised on the error path within by showard · 15 years ago
- 42d4498 Use drone_installation_dir for all activities on drones, including results dirs and temp dirs. Previously it would use the drone_installation_dir for executing drone_utility, but would use the scheduler results dir for everything else. by showard · 15 years ago
- 786da9a Escalate to a SIGKILL in DroneUtility.kill_process() if the SIGTERM didn't work by showard · 15 years ago
- b890045 In scheduler recovery, allow Running HQEs with no process. The tick code already handles them fine (by re-executing Autoserv), but the recovery code was explicitly disallowing them. With this change, it turns out there's only one status that's not allowed to go unrecovered -- Verifying -- so I changed the code to reflect that and I made the failure conditions more accurate. by showard · 15 years ago
- 5682407 Added more logging, and fixed logging in HostQueueEntry.set_status() by showard · 15 years ago
- 0db3d43 Recheck queue entry status in Dispatcher._get_unassigned_entries() by showard · 15 years ago
- d201482 When a delayed call task finishes waiting for extra hosts to enter by showard · 15 years ago
- dae680a Ignore microsecond differences in datetimes when checking existing in by showard · 15 years ago
- e55955f Rewrite a conditional that was very confusing to me. by showard · 15 years ago
- f85a0b7 Explicitly release pidfiles after we're done with them. This does it in a kind of lazy way, but it should work just fine. Also extended the new scheduler functional test with a few more cases and added a test to check pidfile release under these various cases. In the process, I changed how some of the code works to allow the tests to more cleanly express their intentions. by showard · 15 years ago
- 34ab099 beginnings of a new scheduler functional test. this aims to test the entire monitor_db.py file holistically, made possible by the fact that monitor_db.py is already isolated from all direct system access through drone_manager (this was a necessary separation for distributed scheduling). by mocking out the entire drone_manager, as well as other major dependencies (email manager, global config), and filling a test database, we can allow the dispatcher to execute normally and allow it to interact with all the other code in monitor_db. at the end, we can check the state of the database and the drone_manager, and (probably most importantly, given the usual failure mode of the scheduler) we can ensure no exceptions get raised from monitor_db. by showard · 15 years ago
- 8d3dbca Make the maximum number of refreshes before forgetting a pidfile by showard · 15 years ago
- ec6a3b9 Make the pidfile timeout in the scheduler configurable. Raise the by showard · 15 years ago
- 0c5c18d Changed error message to be more useful by showard · 15 years ago
- d791dcb Give all scheduler launched child processes a mark and check for by showard · 15 years ago
- 828fc4c Make assertion in _choose_group_to_run non-fatal and log an error by showard · 15 years ago
- b6a186f Email notification currently relies on an MTA installed by showard · 15 years ago
- db50276 Write host keyvals for all verify/cleanup/repair tasks. by showard · 15 years ago
- 775300b Cleanups on hosts marked DO_NOT_VERIFY should continue to run as if they by showard · 15 years ago
- dabf6cf It is okay for hosts to have multiple atomic group labels so long as all by showard · 15 years ago
- b593fa8 Prevent email_manager from hiding exceptions when sending email fails. by showard · 15 years ago
- 8cc058f Make scheduler more stateless. Agents are now scheduled only by the by showard · 15 years ago
- 8de3713 Renamed process_is_alive to program_is_alive. by showard · 15 years ago
- cdaeae8 Fixed bug where scheduler would crash if the autoserv process is lost by showard · 15 years ago
- 4ac4754 Don't mark HQEs as Failed before the GatherLogsTask and the by showard · 15 years ago
- 6631273 Make a bunch of stuff executable by mbligh · 15 years ago
- 549afad Added pid file checks to monitor_db and monitor_db_babysitter, so that by showard · 15 years ago
- 70a294f Don't expect aborted "Pending" entries to be recovered. They'll be immediately picked up by _find_aborting() so they don't need to be recovered. by showard · 15 years ago
- 58721a8 One-off fix to address the issue where a scheduler shutdown immediately by showard · 15 years ago
- 3739978 Instrument the drone manager to allow debugging why it lost track of by showard · 15 years ago
- 6bba3d1 Don't assert if we were unable to load the pidfile in num_tests_failed. by showard · 15 years ago
- e8e3707 Treat unrecoverable host queue entries as a fatal error. Their existence by showard · 15 years ago
- 6d1c143 Fix scheduler's handling of jobs when the PID file can't be found. by showard · 15 years ago
- 708b352 Do not go through a DelayedCallTask on atomic group jobs when all Hosts by showard · 15 years ago
- 9b6ec50 Turn an assertion into a more useful error message. by showard · 15 years ago
- 1ef218d This is the result of a batch reindent.py across our tree. by mbligh · 15 years ago
- 5fa9e11 By default, only warn when orphaned autoservs are found by mbligh · 15 years ago
- 6fbdb80 Change print msg to logging.error(msg) so that we actually get the error in the scheduler log about the scheduler not being enabled. by mbligh · 15 years ago
- c6a5687 Remove an assertion error that was preventing recovered atomic group by showard · 15 years ago
- f4a2e50 log aborts in the scheduler more explicitly by showard · 15 years ago
- a5288b4 Upgrade from Django 0.96 to Django 1.0.2. by showard · 15 years ago
- b000a8d Added logging and email code to help track down a bug (asynchronous jobs are by showard · 15 years ago
- 6af73ad "Recover" HQEs in "Starting" status by requeuing them. This is what it used to do, but it was lost in the new recovery code. This restores legacy behavior until we implement proper recovery. by showard · 15 years ago
- 6878e8b Never kill processes in scheduler recovery. Instead, consider it an error if any unrecovered orphan process exists. Since we recover special tasks now, we should recover all processes, so if we find any extra, that means something went wrong and it's not safe to continue. by showard · 15 years ago
- a640b2d Fix scheduler bug with aborting a pre-job task. Scheduler was by showard · 15 years ago
- 8ac6f2a When a SpecialAgentTask is passed an existing SpecialTask, set the _working_directory upon object construction. It was previously set in prolog(), but recovery agents don't run prolog() and still need _working_directory sometimes (e.g. when a RepairTask fails). by showard · 15 years ago
- 381341a Enter the mock objects created in AgentTasksTest of monitor_db_unittest by showard · 15 years ago
- cfd4a7e With the new SpecialTask recovery code, a RepairTask can be passed a queue entry that was previously requeued. So make sure the task leaves the HQE alone in that case. by showard · 15 years ago
- 1ae7308 svn propset svn:executable on scheduler/monitor_db_babysitter by mbligh · 15 years ago
- b6681aa SpecialAgentTasks can be aborted if they're tied to a job that gets aborted while they're active. In that case, we still need to update the SpecialTask entry to mark it as complete. by showard · 15 years ago
- 4460ee8 When a drone fails to initialize, let the scheduler die. We used to try to carry on gracefully, but it turns out this really isn't safe. If the drone failure is due to a network condition, and the drone is actually still up, then Autoserv processes will continue to run on that drone, but the scheduler will be unable to detect or stop these processes. This can lead to dual Autoserv processes running against the same machine. by showard · 15 years ago
- ed2afea make SpecialTasks recoverable. this involves quite a few changes. by showard · 15 years ago
- bf9695d Add a call to _drop_old_pidfiles(). This method has been present since the beginning but nothing ever called it, so long-running schedulers just keep accumulating pidfiles to check for, which makes them gradually get slower...and slower.... by showard · 15 years ago
- 6157c63 Make the scheduler robust to finding a HostQueueEntry with more than one by showard · 15 years ago
- 2fe3f1d Enter all Verify/Cleanup/Repair tasks into the special_tasks table. Also by showard · 15 years ago
- e7d9c60 Make the job executiontag available in both the server and client side job by mbligh · 15 years ago
- e9c6936 Pass --verbose flag for verify/repair/cleanup. Since we currently log these via piped console output, we want verbose output. by showard · 15 years ago
- b562645 ensure hosts get cleaned up even in the rare but possible case that a QueueTask finds no process at all by showard · 15 years ago
- 7c8ea99 Not all distros put a symlink in for the python version. However by mbligh · 15 years ago
- 2924b0a Ensure one-time-hosts aren't in the Everyone ACL, and make the scheduler ignore this. by showard · 15 years ago
- e39ebe9 temporary fix for bug in scheduling when at capacity. if no drone has capacity, pick the one with the least load. by showard · 15 years ago
- cbe6f94 add a log message to the scheduler that's useful for debugging atomic groups by showard · 15 years ago
- af8b4ca Fix _atomic_and_has_started() to check *only* for states that are a by showard · 15 years ago
- 08356c1 Do not call .set_host if the host is already set. by showard · 15 years ago
- 043c62a Ensure all entry points get the import-time logging logic executed before other autotest imports. by showard · 15 years ago
- 136e6dc Make scheduler and babysitter use the new logging_manager system. by showard · 15 years ago
- 6d7b2ff Redesign the reverify hosts feature. Host status is no longer changed by showard · 15 years ago
- 7718256 Have the scheduler wait a configurable amount of time before starting by showard · 15 years ago
- f098ebd convert a few straggling print statements in the scheduler code to logging calls by showard · 15 years ago
- 5613c66 Add an option to global config to disable the scheduler so it isn't accidentally started on drones. by showard · 15 years ago
- 5debf85 Add logging info for drones so we know what drone drone_utility is running on. This will help identify slow drones and also keep track of where we are spending time. by showard · 15 years ago
- a64e52a Change behavior of Force Reverify: no longer executes cleanup before. by showard · 15 years ago
- 01a5167 Have the scheduler check for and sometimes clean up various DB inconsistencies. by showard · 15 years ago
- 184a5e8 make AgentTasksTest inherit from BaseSchedulerTest. it didn't use to, since it didn't have any DB dependencies, but the recent introduction of SpecialTasks has changed that, so we need AgentTasksTest to set up the DB now like everything else. It doesn't increase the unit test runtime too drastically. by showard · 15 years ago
- 844960a make the readonly connection fall back to the regular Django connection when running in the scheduler. this is really important, because otherwise the readonly connection is not autocommit and bad, bad things could happen, though i'm not sure exactly what existing problems there might have been. we used to do this only for testing, but since we do it in another context here, i renamed the method to be more generic and appropriate. by showard · 15 years ago
- b6d1662 fix JobManager.get_status_counts, which was returning incorrect counts in some cases when jobs were aborted. the problem was that it's possible for a complete entry to have aborted set or not and have the same full status, which was violating an assumption of the method. by showard · 15 years ago
- 5add1c8 Make recovered tasks correctly handle being aborted before being started. Unlike other tasks, recovered tasks are effectively "started" as soon as they're created, since they're recovering a previously started task. So implement that properly so that when they're aborted, they do all the necessary killing and cleanup stuff. by showard · 15 years ago
- 29caa4b Explicitly catch SystemExit so we don't stack trace when we exit with sys.exit by showard · 15 years ago
- 54c1ea9 Sort hosts when choosing them for use in an atomic group and when by showard · 15 years ago
- 1ff7b2e Add ability to reverify a host from the Host List. by showard · 15 years ago
- 83d41dd Update debug_scheduler logging config to use INFO instead of debug. by showard · 15 years ago
- a9435c0 Fix recurring run code to reflect recent changes to rpc_utils.create_new_job(). by showard · 15 years ago
- ebc0fb7 Add an extra check for existence of Autoserv results in GatherLogsTask -- in certain recovery cases this can be false, previously leading to an exception. by showard · 15 years ago
- 12f3e32 Add job maximum runtime, a new per-job timeout that counts time since the job actually started. by showard · 15 years ago
- 2d7c8bd Fix scheduler unittest for parser's new -P flag by mbligh · 15 years ago
- 9e93640 Add post-parse site hooks (parse -P to trigger, default = off) by mbligh · 15 years ago
- a1e74b3 Add job option for whether or not to parse failed repair results as part of a job, with a default value in global_config. Since the number of options associated with a job is getting out of hand, I packaged them up into a dict in the RPC entry point and passed them around that way from then on. by showard · 15 years ago
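
For readers unfamiliar with the SpecialTask ordering described in commit 65db393 above (Repair, then Cleanup, then Verify, with no preference for tasks that have a queue entry), here is a minimal sketch of that rule. It is not the scheduler's actual code: `TASK_PRIORITY`, `order_special_tasks`, and the dict-based task records are hypothetical names chosen only for illustration.

```python
# Hypothetical sketch; none of these names come from monitor_db.py.
# Lower number means the task type runs earlier.
TASK_PRIORITY = {"Repair": 0, "Cleanup": 1, "Verify": 2}


def order_special_tasks(tasks):
    """Return tasks sorted by type priority only.

    Whether a task has a queue entry is deliberately ignored.
    sorted() is stable, so tasks of the same type keep their
    original relative order.
    """
    return sorted(tasks, key=lambda task: TASK_PRIORITY[task["task"]])


pending = [
    {"id": 1, "task": "Verify", "queue_entry": 7},
    {"id": 2, "task": "Cleanup", "queue_entry": None},
    {"id": 3, "task": "Repair", "queue_entry": None},
    {"id": 4, "task": "Verify", "queue_entry": None},
]

ordered_ids = [task["id"] for task in order_special_tasks(pending)]
print(ordered_ids)  # [3, 2, 1, 4]
```

Because the sort key looks only at the task type, the Verify task with a queue entry (id 1) gets no boost over the Repair and Cleanup tasks that have none.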