GRPC Connection Backoff Protocol
================================

When a connection attempt to a backend fails, it is typically desirable not to
retry immediately (to avoid flooding the network or the server with requests)
and instead to use some form of exponential backoff.

We have several parameters:
 1. INITIAL_BACKOFF (how long to wait after the first failure before retrying)
 2. MULTIPLIER (factor by which to multiply the backoff after a failed retry)
 3. MAX_BACKOFF (upper bound on backoff)
 4. MIN_CONNECT_TIMEOUT (minimum time we are willing to give a connection
    attempt to complete)
 5. JITTER (fraction of the current backoff by which to randomize it)

## Proposed Backoff Algorithm

Exponentially back off the start time of connection attempts up to a limit of
MAX_BACKOFF, with jitter.

```
ConnectWithBackoff()
  current_backoff = INITIAL_BACKOFF
  current_deadline = now() + INITIAL_BACKOFF
  while (TryConnect(Max(current_deadline, now() + MIN_CONNECT_TIMEOUT))
         != SUCCESS)
    SleepUntil(current_deadline)
    current_backoff = Min(current_backoff * MULTIPLIER, MAX_BACKOFF)
    current_deadline = now() + current_backoff +
      UniformRandom(-JITTER * current_backoff, JITTER * current_backoff)
```

With specific parameters of

 * INITIAL_BACKOFF = 20 seconds
 * MULTIPLIER = 1.6
 * MAX_BACKOFF = 120 seconds
 * JITTER = 0.2

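As a rough illustration, the algorithm above might be written in Python as
follows. This is a sketch under stated assumptions, not the gRPC
implementation: `try_connect` is a hypothetical callable that makes one attempt
and gives up at an absolute deadline, the `now`, `sleep`, and `rng` arguments
exist only so the loop can be driven by a fake clock, the connect deadline is
read as the later of `current_deadline` and `now() + MIN_CONNECT_TIMEOUT`, and
the 20-second value for `MIN_CONNECT_TIMEOUT` is an assumption because this
document does not give one.

```python
import random
import time

INITIAL_BACKOFF = 20.0      # seconds
MULTIPLIER = 1.6
MAX_BACKOFF = 120.0         # seconds
JITTER = 0.2
MIN_CONNECT_TIMEOUT = 20.0  # seconds; assumed value, not specified in this document


def connect_with_backoff(try_connect, now=time.monotonic, sleep=time.sleep,
                         rng=random.uniform):
    """Call try_connect(deadline) until it returns True.

    try_connect is a hypothetical callable: it makes one connection attempt
    and gives up at the absolute time `deadline`. Returns the attempt count.
    """
    current_backoff = INITIAL_BACKOFF
    current_deadline = now() + INITIAL_BACKOFF
    attempts = 1
    while not try_connect(max(current_deadline, now() + MIN_CONNECT_TIMEOUT)):
        # Wait out whatever is left of the current backoff window.
        remaining = current_deadline - now()
        if remaining > 0:
            sleep(remaining)
        current_backoff = min(current_backoff * MULTIPLIER, MAX_BACKOFF)
        # Jitter is proportional to the current backoff, in both directions.
        current_deadline = now() + current_backoff + rng(
            -JITTER * current_backoff, JITTER * current_backoff)
        attempts += 1
    return attempts
```

Because the backoff is applied to the start time of each attempt, time spent
inside a slow `try_connect` call counts toward the wait instead of being added
to it.
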
Implementations with pressing concerns (such as minimizing the number of wakeups
on a mobile phone) may wish to use a different algorithm, and in particular
different jitter logic.

Alternate implementations must ensure that connection backoffs started at the
same time disperse, and must not attempt connections substantially more often
than the above algorithm.

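One way to check the dispersal requirement is to simulate many clients that all
start at the same instant and measure how far apart their n-th attempts land.
The sketch below does this for the proposed algorithm, assuming every attempt
fails instantly; the function names and the simulation itself are illustrative
and not part of any gRPC API.

```python
import random


def attempt_times(num_attempts, rng, initial=20.0, multiplier=1.6,
                  max_backoff=120.0, jitter=0.2):
    """Start times of successive attempts when every attempt fails instantly."""
    times = [0.0]                 # the first attempt starts immediately
    backoff = initial
    deadline = initial            # the first retry is scheduled without jitter
    for _ in range(num_attempts - 1):
        times.append(deadline)    # the client sleeps until the deadline, then retries
        backoff = min(backoff * multiplier, max_backoff)
        deadline += backoff + rng.uniform(-jitter * backoff, jitter * backoff)
    return times


def spread_of_attempt(n_clients, attempt_index, seed=0):
    """Gap between the earliest and latest start of one attempt across clients."""
    rng = random.Random(seed)
    starts = [attempt_times(attempt_index + 1, rng)[attempt_index]
              for _ in range(n_clients)]
    return max(starts) - min(starts)
```

Under these assumptions the first retry is not jittered at all (every client
retries at exactly INITIAL_BACKOFF), so dispersal only begins with the second
retry. An alternate implementation could run the same measurement against its
own schedule.
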
## Historical Algorithm in Stubby

Exponentially increase the intervals between connection attempts, up to a limit
of MAX_BACKOFF. This is what Stubby 2 uses. Apart from jitter, it is equivalent
to the proposed algorithm when TryConnect() fails instantly.

```
LegacyConnectWithBackoff()
  current_backoff = INITIAL_BACKOFF
  while (TryConnect(MIN_CONNECT_TIMEOUT) != SUCCESS)
    SleepFor(current_backoff)
    current_backoff = Min(current_backoff * MULTIPLIER, MAX_BACKOFF)
```

The gRPC C implementation currently uses this approach with an initial backoff
of 1 second, a multiplier of 2, and a maximum backoff of 120 seconds. (This will
change.)

Stubby, or at least rpc2, uses exactly this algorithm with an initial backoff
of 1 second, a multiplier of 1.2, and a maximum backoff of 120 seconds.

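To get a feel for how these parameter sets differ, the following sketch counts
how many attempts the legacy loop starts within a time window. It assumes every
attempt fails instantly, so the only delay is the sleep between attempts;
`legacy_attempts_within` is an illustrative helper, not an existing function.

```python
def legacy_attempts_within(window, initial, multiplier, max_backoff):
    """Count attempts started within `window` seconds, each failing instantly."""
    t = 0.0
    backoff = initial
    attempts = 0
    while t <= window:
        attempts += 1            # an attempt starts at time t and fails at once
        t += backoff             # SleepFor(current_backoff)
        backoff = min(backoff * multiplier, max_backoff)
    return attempts


# Parameter sets quoted in this document; the 60-second window is arbitrary.
c_impl = legacy_attempts_within(60, initial=1, multiplier=2, max_backoff=120)
stubby = legacy_attempts_within(60, initial=1, multiplier=1.2, max_backoff=120)
proposed = legacy_attempts_within(60, initial=20, multiplier=1.6, max_backoff=120)
```

Under these assumptions, the first minute sees 6 attempts with the C
implementation's parameters, 15 with Stubby's, and 3 with the proposed
parameters (ignoring jitter).
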
## Use Cases to Consider

* Client tries to connect to a server which is down for multiple hours, e.g.,
  for maintenance
* Client tries to connect to a server which is overloaded
* User is bringing up both a client and a server at the same time
    * In particular, we would like to avoid a large unnecessary delay if the
      client connects to a server which is about to come up
* Client/server are misconfigured such that connection attempts always fail
    * We want to make sure these don’t put too much load on the server by
      default.
* Server is overloaded and wants to transiently make clients back off
* Application has an out-of-band reason to believe a server is back
    * We should consider an out-of-band mechanism for the client to hint that
      we should short-circuit the backoff.
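
For the last use case, one possible shape for such a hint is a backoff sleep
that can be interrupted. The sketch below is purely illustrative: the
`InterruptibleBackoff` class and its method names are hypothetical and do not
correspond to any existing gRPC API. It relies only on `threading.Event` from
the Python standard library.

```python
import threading


class InterruptibleBackoff:
    """Hypothetical helper: a backoff sleep that an application hint can cut short."""

    def __init__(self):
        self._hint = threading.Event()

    def hint_server_is_back(self):
        """Called by the application when it believes the server is reachable."""
        self._hint.set()

    def wait(self, backoff_seconds):
        """Block for up to backoff_seconds; return True if a hint ended the wait."""
        hinted = self._hint.wait(timeout=backoff_seconds)
        # Reset so the next backoff period is not skipped as well. A hint that
        # arrives between wait() returning and clear() is dropped in this sketch.
        self._hint.clear()
        return hinted
```

A connection loop could call `wait(current_backoff)` in place of a plain sleep
and retry immediately when it returns True. Whether a hint should also reset
`current_backoff` to INITIAL_BACKOFF is a design question this sketch leaves
open.
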