RPM Servers: Fix stuck threads issue.

This CL fixes the issue of the RPM servers running out of file
descriptors due to stuck sockets. I debugged it down to the following
sequence:

* RPM dispatcher kicks off a thread for each rpm_controller.
* Thread for rpm_controller calls set_power, something goes wrong and
  an exception is thrown. The thread is stuck.
* RPM dispatcher gets new request for that RPM and sends it to that
  stuck thread.
* Client times out, job logs error and continues.
* The drone -> frontend connection goes into CLOSE_WAIT and is stuck.
  The client closed it, but the frontend thread is still waiting to
  hear from the dispatcher. This uses up a file descriptor.
* The frontend -> dispatcher connection is stuck in ESTABLISHED.
  Another file descriptor is used up.
* Over time (roughly a few weeks to a month) we run out of file
  descriptors and can no longer open new sockets, causing all calls to
  the infrastructure to fail until it is manually restarted.

A side effect is that all calls to the affected RPM via our
infrastructure fail. Sadly, this happens silently and is only visible
in the logs of the autoserv jobs that timed out when calling the RPM
infrastructure.

To address this, I added a catch for all exceptions raised while trying
to change power state. The caught exception is emailed out to the team.
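
For illustration only, a minimal sketch of the catch-all pattern
described above. This is not the code in this change: the queue-based
worker loop, controller_worker, and notify_team are hypothetical names
assumed for the example, and notify_team just prints where the real
code would send an email.

```python
import queue
import threading
import traceback


def notify_team(subject, body):
    # Hypothetical stand-in for the real error-email helper.
    print(subject)
    print(body)


def controller_worker(requests, set_power_state, notify=notify_team):
    """Serve power requests for one RPM until a None sentinel arrives."""
    while True:
        request = requests.get()
        if request is None:
            break
        hostname, new_state, reply = request
        try:
            ok = bool(set_power_state(hostname, new_state))
        except Exception:
            # Catch everything so the thread keeps serving requests and
            # the caller still gets an answer instead of hanging.
            notify('RPM set_power_state failed for %s' % hostname,
                   traceback.format_exc())
            ok = False
        reply.put(ok)
```

The point of the sketch is that a reply is always sent, even on
failure, so the caller's socket is not left waiting on a worker that
will never answer.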

Also updated the error emails to go to chromeos-lab-errors@google.com.

BUG=chromium:243567
TEST=Added an explicit raise to set_power_state, which recreated the
conditions we see on the live server. Then applied my fix and verified
we no longer get stuck or use up file descriptors as before.

Change-Id: I69bf68564fcfbda6c387faa74202c7c4b9bbcdef
Reviewed-on: https://gerrit.chromium.org/gerrit/66608
Reviewed-by: Scott Zawalski <scottz@chromium.org>
Commit-Queue: Simran Basi <sbasi@chromium.org>
Reviewed-by: Simran Basi <sbasi@chromium.org>
Tested-by: Simran Basi <sbasi@chromium.org>
2 files changed