shill: Service: Track DHCP failures and inform ShouldUseMinimalDHCPConfig()

The default DHCP configuration used by ChromeOS requests a
large number of options in order to support features like
Web Proxy Auto-Discovery.  Unfortunately, this practice
causes problems in some network topologies.  In some
networks, a large number of option values in the
DHCP response can cause MTU problems with proxies or other
network elements.  This may prevent the reply from being
forwarded back to the client.

Although the problem may not lie with the outgoing request,
it may be possible to mitigate this issue by modifying these
requests.  This CL provides a means for the Service to track
such failures and infer from a series of experiments whether
they are likely to be due to an MTU issue.  To do this, the
Device informs the selected Service of each DHCP success and
failure, and queries the Service before starting each DHCP
session to ask whether it should request an extensive or a
minimal set of DHCP options from the server.

In order to detect and respond to such issues, this CL
maintains state in the Service about how DHCP has been
performing.  If there has been a spate of recent DHCP
failures, we should suspect that this may be due to the
number of options we are requesting from the DHCP server.

We should confirm that this is in fact the issue by testing
whether a request for a smaller response succeeds.  If this
request succeeds we can consider the hunch confirmed and the
client should switch to using small requests for some period
of time.  If our test request fails, we can assume for now
that the problem isn't related to DHCP response size and
return to the default behavior.

If we have confirmed our hunch earlier and the time period
expires, we should try again to use a more comprehensive
DHCP request.  If this succeeds, we can assume either the
network infrastructure has repaired itself or that the
previous hunch was in error, and return to the "not
detected" state.  If it fails, we should confirm whether
this is the same problem as before by re-testing with a
small DHCP request.  Since the "confirmed" state did
not pay attention to DHCP failures, it's possible that they
have been failing across the board as of late.  If indeed
both large and small DHCP responses fail to reach us, we
should put off our re-test until we start receiving DHCP
replies again.

The state machine implemented for DHCP failures is
illustrated in the diagram below (events not shown do not
cause a state change):

   [ Not Detected (send full request) ] <------------
         |                  ^                       |
         |                  |                       |
      n * failure        failure                    |
         |                  |                       |
         V                  |                       |
   [ Suspected (send minimal request) ]             |
                       |                            |
                    success                         |
                       |                            |
                       V                            |
   [ Confirmed (send minimal request) ]             |
         ^             |                            |
         |      hold timer elapsed                  |
         |             |                            |
      success          V                            |
         |     [ Retest Full Request ] ----success--/
         |             |          ^
         |          failure       |
         |             |          |
         |             V          |
   [ Retest Minimal Request ]     |
                       |       success
                    failure       |
                       |          |
                       V          |
                   [ Retest With No Reply (send minimal requests) ]
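
The transitions in the diagram can be sketched roughly as
follows.  This is an illustration only, not the code in this
CL: DhcpOptionState, DhcpOptionTracker and their methods are
hypothetical names, and treating "n * failure" as a streak
that a success resets is an assumption.

```cpp
#include <cassert>

// Hypothetical names; the states mirror the labels in the diagram.
enum class DhcpOptionState {
  kNotDetected,
  kSuspected,
  kConfirmed,
  kRetestFullRequest,
  kRetestMinimalRequest,
  kRetestWithNoReply,
};

class DhcpOptionTracker {
 public:
  explicit DhcpOptionTracker(int failure_threshold)
      : failure_threshold_(failure_threshold) {
    assert(failure_threshold > 0);
  }

  DhcpOptionState state() const { return state_; }

  // Only "Not Detected" and "Retest Full Request" send the full request.
  bool ShouldUseMinimalConfig() const {
    return state_ != DhcpOptionState::kNotDetected &&
           state_ != DhcpOptionState::kRetestFullRequest;
  }

  void OnSuccess() {
    switch (state_) {
      case DhcpOptionState::kNotDetected:
        failures_ = 0;  // Assumption: a success clears the failure streak.
        break;
      case DhcpOptionState::kSuspected:
      case DhcpOptionState::kRetestMinimalRequest:
        state_ = DhcpOptionState::kConfirmed;
        break;
      case DhcpOptionState::kRetestFullRequest:
        state_ = DhcpOptionState::kNotDetected;
        failures_ = 0;
        break;
      case DhcpOptionState::kRetestWithNoReply:
        state_ = DhcpOptionState::kRetestFullRequest;
        break;
      case DhcpOptionState::kConfirmed:
        break;  // Results are ignored while confirmed.
    }
  }

  void OnFailure() {
    switch (state_) {
      case DhcpOptionState::kNotDetected:
        if (++failures_ >= failure_threshold_) {
          state_ = DhcpOptionState::kSuspected;
        }
        break;
      case DhcpOptionState::kSuspected:
        state_ = DhcpOptionState::kNotDetected;
        failures_ = 0;
        break;
      case DhcpOptionState::kRetestFullRequest:
        state_ = DhcpOptionState::kRetestMinimalRequest;
        break;
      case DhcpOptionState::kRetestMinimalRequest:
        state_ = DhcpOptionState::kRetestWithNoReply;
        break;
      case DhcpOptionState::kConfirmed:
      case DhcpOptionState::kRetestWithNoReply:
        break;  // No state change shown in the diagram.
    }
  }

  // The hold timer only matters in the "Confirmed" state.
  void OnHoldTimerElapsed() {
    if (state_ == DhcpOptionState::kConfirmed) {
      state_ = DhcpOptionState::kRetestFullRequest;
    }
  }

 private:
  const int failure_threshold_;
  int failures_ = 0;
  DhcpOptionState state_ = DhcpOptionState::kNotDetected;
};
```

In this sketch the Device would call OnSuccess() or
OnFailure() after each DHCP attempt and consult
ShouldUseMinimalConfig() before starting the next one.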

The Service only persists two states: "Not Detected" and
"Confirmed".  This is done via the presence or absence of
the stored "LastDHCPOptionFailure" property, which is the
time the system last entered the "Confirmed" state.  A
third state, "Retest Full Request", is implicitly persisted
if this timestamp is old enough that the hold timer has
expired.
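
As a rough illustration of how a state could be recovered
from the stored timestamp, consider the sketch below.  The
function name, its parameters, the use of seconds, and the
handling of a clock that moved backwards are all assumptions
made for this example, not the implementation in this CL.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical enum covering only the states that survive a restart.
enum class PersistedDhcpState {
  kNotDetected,
  kConfirmed,
  kRetestFullRequest,
};

// Derives the state from the presence and age of the stored
// "LastDHCPOptionFailure" timestamp.  All times are in seconds.
PersistedDhcpState StateFromStorage(bool has_last_failure,
                                    uint64_t last_failure_seconds,
                                    uint64_t now_seconds,
                                    uint64_t hold_seconds) {
  // A zero hold period would make "Confirmed" unreachable.
  assert(hold_seconds > 0);
  if (!has_last_failure) {
    return PersistedDhcpState::kNotDetected;
  }
  // Assumption: if the clock moved backwards, treat no time as elapsed.
  const uint64_t elapsed = now_seconds >= last_failure_seconds
                               ? now_seconds - last_failure_seconds
                               : 0;
  return elapsed >= hold_seconds ? PersistedDhcpState::kRetestFullRequest
                                 : PersistedDhcpState::kConfirmed;
}
```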

BUG=chromium:297607
TEST=Unit tests

Change-Id: I1ee83debf4d11f25678fe3586574ec04f254a83f
Reviewed-on: https://chromium-review.googlesource.com/174634
Reviewed-by: Paul Stewart <pstew@chromium.org>
Commit-Queue: Paul Stewart <pstew@chromium.org>
Tested-by: Paul Stewart <pstew@chromium.org>
diff --git a/metrics.h b/metrics.h
index fc214de..1c9f776 100644
--- a/metrics.h
+++ b/metrics.h
@@ -261,6 +261,11 @@
     kVpnUserAuthenticationTypeMax
   };
 
+  enum DHCPOptionFailure {
+    kDHCPOptionFailure = 1,
+    kDHCPOptionFailureMax
+  };
+
   static const char kMetricDisconnect[];
   static const int kMetricDisconnectMax;
   static const int kMetricDisconnectMin;
@@ -418,6 +423,11 @@
   static const char kMetricVpnUserAuthenticationType[];
   static const int kMetricVpnUserAuthenticationTypeMax;
 
+  // We have detected that a DHCP server can only deliver leases if
+  // we reduce the number of options that we request of it.  This
+  // implies an infrastructure issue.
+  static const char kMetricDHCPOptionFailureDetected[];
+
   explicit Metrics(EventDispatcher *dispatcher);
   virtual ~Metrics();
 
@@ -578,6 +588,9 @@
   // Notifies this object about a corrupted profile.
   virtual void NotifyCorruptedProfile();
 
+  // Notifies this object about a service with DHCP infrastructure problems.
+  virtual void NotifyDHCPOptionFailure(const Service &service);
+
   // Sends linear histogram data to UMA.
   virtual bool SendEnumToUMA(const std::string &name, int sample, int max);