GRPC Connection Backoff Protocol
================================

When a connection attempt to a backend fails, it is typically desirable not to
retry immediately (to avoid flooding the network or the server with requests)
and instead to use some form of exponential backoff.

We have several parameters:
 1. INITIAL_BACKOFF (how long to wait after the first failure before retrying)
 2. MULTIPLIER (factor by which to multiply the backoff after a failed retry)
 3. MAX_BACKOFF (upper bound on the backoff)
 4. MIN_CONNECT_TIMEOUT (minimum time to allow a connection attempt to complete)
 5. JITTER (fraction of the current backoff by which a retry deadline is
    randomly shifted in either direction)

## Proposed Backoff Algorithm

Exponentially back off the start time of connection attempts up to a limit of
MAX_BACKOFF, with jitter.

```
ConnectWithBackoff()
  current_backoff = INITIAL_BACKOFF
  current_deadline = now() + INITIAL_BACKOFF
  // Give each attempt at least MIN_CONNECT_TIMEOUT to complete.
  while (TryConnect(Max(current_deadline, now() + MIN_CONNECT_TIMEOUT))
         != SUCCESS)
    SleepUntil(current_deadline)
    current_backoff = Min(current_backoff * MULTIPLIER, MAX_BACKOFF)
    current_deadline = now() + current_backoff +
      UniformRandom(-JITTER * current_backoff, JITTER * current_backoff)
```

With specific parameters of:

- MIN_CONNECT_TIMEOUT = 20 seconds
- INITIAL_BACKOFF = 1 second
- MULTIPLIER = 1.6
- MAX_BACKOFF = 120 seconds
- JITTER = 0.2

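As an illustration only, the algorithm and parameter values above could be sketched in Python roughly as follows. The names `BackoffParams`, `connect_with_backoff`, and `FakeClock`, and the injectable `now`/`sleep`/`rng` arguments, are hypothetical and not part of any gRPC API. The sketch assumes `try_connect(deadline)` returns `True` on success, and it gives each attempt until at least `now() + MIN_CONNECT_TIMEOUT`.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class BackoffParams:
    """Hypothetical container for the parameters listed above (seconds)."""
    initial_backoff: float = 1.0        # INITIAL_BACKOFF
    multiplier: float = 1.6             # MULTIPLIER
    max_backoff: float = 120.0          # MAX_BACKOFF
    min_connect_timeout: float = 20.0   # MIN_CONNECT_TIMEOUT
    jitter: float = 0.2                 # JITTER


def connect_with_backoff(
    try_connect: Callable[[float], bool],
    params: BackoffParams = BackoffParams(),
    now: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> int:
    """Call try_connect(deadline) until it returns True; return the attempt count."""
    rng = rng or random.Random()
    current_backoff = params.initial_backoff
    current_deadline = now() + params.initial_backoff
    attempts = 1
    # Each attempt is allowed to run until at least now() + MIN_CONNECT_TIMEOUT.
    while not try_connect(max(current_deadline, now() + params.min_connect_timeout)):
        # SleepUntil(current_deadline); a no-op if the deadline has already passed.
        sleep(max(0.0, current_deadline - now()))
        current_backoff = min(current_backoff * params.multiplier, params.max_backoff)
        spread = params.jitter * current_backoff
        current_deadline = now() + current_backoff + rng.uniform(-spread, spread)
        attempts += 1
    return attempts


class FakeClock:
    """Deterministic clock so the sketch can be exercised without real sleeping."""

    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


if __name__ == "__main__":
    clock = FakeClock()
    outcomes = iter([False, False, False, True])  # three failures, then success
    attempts = connect_with_backoff(
        lambda deadline: next(outcomes), now=clock.now, sleep=clock.sleep
    )
    print(f"connected after {attempts} attempts, {clock.now():.2f}s of backoff")
```
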
Implementations with pressing concerns (such as minimizing the number of wakeups
on a mobile phone) may wish to use a different algorithm, and in particular
different jitter logic.

Alternate implementations must ensure that connection backoffs started at the
same time disperse, and must not attempt connections substantially more often
than the above algorithm.
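
One way to sanity-check both requirements is a small simulation. The sketch below is an assumption-laden illustration, not a conformance test. It models the algorithm above with attempts that fail instantly, starts many clients at the same moment, and reports how far apart their retry start times drift. The helper names `retry_start_times` and `spread_per_retry` are hypothetical.

```python
import random
from typing import List

INITIAL_BACKOFF = 1.0   # seconds
MULTIPLIER = 1.6
MAX_BACKOFF = 120.0     # seconds
JITTER = 0.2


def retry_start_times(rng: random.Random, retries: int) -> List[float]:
    """Start times of successive retries, in seconds after the initial attempt,
    assuming every attempt fails instantly."""
    backoff = INITIAL_BACKOFF
    t = INITIAL_BACKOFF          # the first retry deadline carries no jitter
    times = [t]
    for _ in range(retries - 1):
        backoff = min(backoff * MULTIPLIER, MAX_BACKOFF)
        t += backoff + rng.uniform(-JITTER * backoff, JITTER * backoff)
        times.append(t)
    return times


def spread_per_retry(clients: int, retries: int, seed: int = 0) -> List[float]:
    """Latest minus earliest start time across clients, for each retry index."""
    rng = random.Random(seed)
    schedules = [retry_start_times(rng, retries) for _ in range(clients)]
    return [max(column) - min(column) for column in zip(*schedules)]


if __name__ == "__main__":
    for index, spread in enumerate(spread_per_retry(clients=100, retries=8), start=1):
        print(f"retry {index}: spread across clients = {spread:.2f}s")
```

In this model the first retry is synchronized across clients (spread of zero), and later retries spread out because each jittered deadline builds on the previous one. The gap between consecutive retries never falls below `(1 - JITTER)` times the nominal backoff, which gives an alternate implementation a concrete lower bound to compare its own attempt rate against.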