GRPC Connection Backoff Protocol
================================

When a connection attempt to a backend fails, it is typically desirable not to
retry immediately (to avoid flooding the network or the server with requests)
and instead to use some form of exponential backoff.

We have several parameters:
 1. INITIAL_BACKOFF (how long to wait after the first failure before retrying)
 2. MULTIPLIER (factor by which to multiply the backoff after a failed retry)
 3. MAX_BACKOFF (upper bound on backoff)
 4. MIN_CONNECT_TIMEOUT (minimum time we're willing to give a connection to
    complete)
 5. JITTER (maximum random variation applied to each backoff, as a fraction of
    the current backoff)

## Proposed Backoff Algorithm

Exponentially back off the start time of connection attempts up to a limit of
MAX_BACKOFF, with jitter.

```
ConnectWithBackoff()
  current_backoff = INITIAL_BACKOFF
  current_deadline = now() + INITIAL_BACKOFF
  while (TryConnect(Max(current_deadline, now() + MIN_CONNECT_TIMEOUT))
         != SUCCESS)
    SleepUntil(current_deadline)
    current_backoff = Min(current_backoff * MULTIPLIER, MAX_BACKOFF)
    current_deadline = now() + current_backoff +
      UniformRandom(-JITTER * current_backoff, JITTER * current_backoff)
```

With specific parameters of:
 - MIN_CONNECT_TIMEOUT = 20 seconds
 - INITIAL_BACKOFF = 1 second
 - MULTIPLIER = 1.6
 - MAX_BACKOFF = 120 seconds
 - JITTER = 0.2
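
As an illustration only, the pseudocode and parameter values above could be
transcribed into Python roughly as follows. This is a sketch, not a reference
implementation: the `try_connect` callback (assumed to take an absolute
deadline and return `True` on success) and the injectable `now`, `sleep`, and
`rng` arguments are hypothetical names introduced here so the loop can be
exercised with a fake clock.

```python
import random
import time

# Parameter values taken from the list above (all times in seconds).
MIN_CONNECT_TIMEOUT = 20.0
INITIAL_BACKOFF = 1.0
MULTIPLIER = 1.6
MAX_BACKOFF = 120.0
JITTER = 0.2


def connect_with_backoff(try_connect, now=time.monotonic, sleep=time.sleep,
                         rng=random):
    """Retry try_connect(deadline) until it returns True.

    try_connect is a hypothetical callback: it receives an absolute deadline
    (on the same clock as now()) and returns True on success.
    """
    current_backoff = INITIAL_BACKOFF
    current_deadline = now() + INITIAL_BACKOFF
    # Every attempt is given at least MIN_CONNECT_TIMEOUT to complete.
    while not try_connect(max(current_deadline, now() + MIN_CONNECT_TIMEOUT)):
        # SleepUntil(current_deadline); a no-op if the deadline has passed.
        sleep(max(0.0, current_deadline - now()))
        # Grow the backoff geometrically, capped at MAX_BACKOFF.
        current_backoff = min(current_backoff * MULTIPLIER, MAX_BACKOFF)
        # Schedule the next attempt, perturbed by up to +/- JITTER * backoff.
        current_deadline = now() + current_backoff + rng.uniform(
            -JITTER * current_backoff, JITTER * current_backoff)
```

Passing the clock and sleep functions as arguments is a choice made for this
sketch; it keeps the loop deterministic under test and is not something the
algorithm itself requires.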

Implementations with pressing concerns (such as minimizing the number of wakeups
on a mobile phone) may wish to use a different algorithm, and in particular
different jitter logic.

Alternate implementations must ensure that connection backoffs started at the
same time disperse, and must not attempt connections substantially more often
than the above algorithm.
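
One possible way to sanity-check the dispersal requirement is a small
simulation. The sketch below is an assumption-laden illustration rather than a
prescribed test: it assumes every connection attempt fails instantly, and the
helper names `retry_start_times` and `spread_of_last_retry` are hypothetical.
It measures how far apart the last retry of many clients lands when all of
them fail for the first time at the same instant.

```python
import random


def retry_start_times(retries, jitter, rng,
                      initial=1.0, multiplier=1.6, cap=120.0):
    """Times (seconds after the first failure) at which one client retries.

    Assumes each attempt fails instantly, so only the backoff waits count.
    """
    elapsed = initial          # the first retry waits exactly `initial`
    backoff = initial
    times = [elapsed]
    for _ in range(retries - 1):
        backoff = min(backoff * multiplier, cap)
        elapsed += backoff + rng.uniform(-jitter * backoff, jitter * backoff)
        times.append(elapsed)
    return times


def spread_of_last_retry(clients, retries, jitter, seed=0):
    """Gap between the earliest and latest final retry across all clients."""
    rng = random.Random(seed)
    last = [retry_start_times(retries, jitter, rng)[-1]
            for _ in range(clients)]
    return max(last) - min(last)


if __name__ == "__main__":
    # Without jitter every client retries in lockstep; with jitter they spread.
    print("spread without jitter:", spread_of_last_retry(100, 5, 0.0))
    print("spread with jitter:   ", spread_of_last_retry(100, 5, 0.2))
```

An alternate jitter scheme could be run through the same kind of check by
swapping out the line that perturbs `elapsed`.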