| Eric Anderson | 611e7e1 | 2015-05-11 09:38:18 -0700 | [diff] [blame] | 1 | GRPC Connection Backoff Protocol |
| 2 | ================================ |
| 3 | |
| 4 | When we do a connection to a backend which fails, it is typically desirable to |
| 5 | not retry immediately (to avoid flooding the network or the server with |
| 6 | requests) and instead do some form of exponential backoff. |
| 7 | |
| 8 | We have several parameters: |
| 9 | 1. INITIAL_BACKOFF (how long to wait after the first failure before retrying) |
| 10 | 2. MULTIPLIER (factor with which to multiply backoff after a failed retry) |
| David Klempner | e00b0c3 | 2015-07-13 18:02:06 -0700 | [diff] [blame] | 11 | 3. MAX_BACKOFF (upper bound on backoff) |
| 12 | 4. MIN_CONNECT_TIMEOUT (minimum time we're willing to give a connection to |
| 13 | complete) |
| Eric Anderson | 611e7e1 | 2015-05-11 09:38:18 -0700 | [diff] [blame] | 14 | |
| 15 | ## Proposed Backoff Algorithm |
| 16 | |
| 17 | Exponentially back off the start time of connection attempts up to a limit of |
| David Klempner | 0e5d2ef | 2015-06-15 14:48:31 -0700 | [diff] [blame] | 18 | MAX_BACKOFF, with jitter. |
| Eric Anderson | 611e7e1 | 2015-05-11 09:38:18 -0700 | [diff] [blame] | 19 | |
| 20 | ``` |
| 21 | ConnectWithBackoff() |
| 22 | current_backoff = INITIAL_BACKOFF |
| 23 | current_deadline = now() + INITIAL_BACKOFF |
| David Klempner | e00b0c3 | 2015-07-13 18:02:06 -0700 | [diff] [blame] | 24 | while (TryConnect(Max(current_deadline, now() + MIN_CONNECT_TIMEOUT)) |
| Eric Anderson | 611e7e1 | 2015-05-11 09:38:18 -0700 | [diff] [blame] | 25 | != SUCCESS) |
| 26 | SleepUntil(current_deadline) |
| 27 | current_backoff = Min(current_backoff * MULTIPLIER, MAX_BACKOFF) |
| David Klempner | 0e5d2ef | 2015-06-15 14:48:31 -0700 | [diff] [blame] | 28 | current_deadline = now() + current_backoff + |
| David Klempner | 08d16ee | 2015-06-15 15:09:38 -0700 | [diff] [blame] | 29 | UniformRandom(-JITTER * current_backoff, JITTER * current_backoff) |
| David Klempner | 0e5d2ef | 2015-06-15 14:48:31 -0700 | [diff] [blame] | 30 | |
| Eric Anderson | 611e7e1 | 2015-05-11 09:38:18 -0700 | [diff] [blame] | 31 | ``` |
| 32 | |
| David Klempner | 0e5d2ef | 2015-06-15 14:48:31 -0700 | [diff] [blame] | 33 | With specific parameters of |
| David Klempner | ca5add6 | 2015-06-17 18:20:31 -0700 | [diff] [blame] | 34 | MIN_CONNECT_TIMEOUT = 20 seconds |
| 35 | INITIAL_BACKOFF = 1 second |
| David Klempner | 0e5d2ef | 2015-06-15 14:48:31 -0700 | [diff] [blame] | 36 | MULTIPLIER = 1.6 |
| 37 | MAX_BACKOFF = 120 seconds |
| 38 | JITTER = 0.2 |
| 39 | |
| 40 | Implementations with pressing concerns (such as minimizing the number of wakeups |
| 41 | on a mobile phone) may wish to use a different algorithm, and in particular |
| 42 | different jitter logic. |
| 43 | |
| 44 | Alternate implementations must ensure that connection backoffs started at the |
| 45 | same time disperse, and must not attempt connections substantially more often |
| 46 | than the above algorithm. |