GRPC Connection Backoff Protocol
================================

When a connection attempt to a backend fails, it is typically desirable not to
retry immediately (to avoid flooding the network or the server with requests)
and instead to perform some form of exponential backoff.

We have several parameters:
 1. INITIAL_BACKOFF (how long to wait after the first failure before retrying)
 2. MULTIPLIER (factor by which to multiply the backoff after a failed retry)
 3. MAX_BACKOFF (upper bound on the backoff)
 4. MIN_CONNECT_TIMEOUT (lower bound on the timeout given to each connection
    attempt)

## Proposed Backoff Algorithm

Exponentially back off the start time of connection attempts, up to a limit of
MAX_BACKOFF.

```
ConnectWithBackoff()
  current_backoff = INITIAL_BACKOFF
  current_deadline = now() + INITIAL_BACKOFF
  while (TryConnect(Max(current_deadline, now() + MIN_CONNECT_TIMEOUT))
         != SUCCESS)
    SleepUntil(current_deadline)
    current_backoff = Min(current_backoff * MULTIPLIER, MAX_BACKOFF)
    current_deadline = now() + current_backoff
```
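
As a rough illustration, the pseudocode above could be sketched in Python as follows. This is a minimal sketch, not the gRPC implementation: the function name, the injectable `now` and `sleep_until` hooks, and the default values are assumptions made for the example (the 20-second minimum connect timeout in particular is hypothetical). It also assumes that MIN_CONNECT_TIMEOUT is a duration, so the deadline handed to the connection attempt is the later of `current_deadline` and `now() + MIN_CONNECT_TIMEOUT`.

```python
import time
from typing import Callable, Optional


def connect_with_backoff(
    try_connect: Callable[[float], bool],
    initial_backoff: float = 1.0,       # INITIAL_BACKOFF, seconds (illustrative)
    multiplier: float = 2.0,            # MULTIPLIER (illustrative)
    max_backoff: float = 120.0,         # MAX_BACKOFF, seconds (illustrative)
    min_connect_timeout: float = 20.0,  # MIN_CONNECT_TIMEOUT, hypothetical value
    now: Callable[[], float] = time.monotonic,
    sleep_until: Optional[Callable[[float], None]] = None,
) -> int:
    """Call try_connect(deadline) until it returns True.

    Returns the number of attempts made. try_connect receives an absolute
    deadline on the same clock as now().
    """
    if sleep_until is None:
        def sleep_until(deadline: float) -> None:
            # Sleep only for the time remaining until the deadline.
            time.sleep(max(0.0, deadline - now()))

    current_backoff = initial_backoff
    current_deadline = now() + initial_backoff
    attempts = 1
    # Each attempt gets at least min_connect_timeout to complete.
    while not try_connect(max(current_deadline, now() + min_connect_timeout)):
        # Wait out the rest of the backoff window, measured from attempt start.
        sleep_until(current_deadline)
        current_backoff = min(current_backoff * multiplier, max_backoff)
        current_deadline = now() + current_backoff
        attempts += 1
    return attempts
```

Because the backoff window is measured from the start of each attempt, time spent inside a slow `try_connect` counts toward the wait rather than being added on top of it.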

## Historical Algorithm in Stubby

Exponentially increase the intervals between connection attempts, up to a limit
of MAX_BACKOFF. This is what Stubby 2 uses, and it is equivalent to the proposed
algorithm if TryConnect() fails instantly.

```
LegacyConnectWithBackoff()
  current_backoff = INITIAL_BACKOFF
  while (TryConnect(MIN_CONNECT_TIMEOUT) != SUCCESS)
    SleepFor(current_backoff)
    current_backoff = Min(current_backoff * MULTIPLIER, MAX_BACKOFF)
```
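
For comparison, the legacy loop above might look like this in Python. Again this is only a sketch under assumptions: the function name, the injectable `sleep_for` hook, and the default values are illustrative, and the 20-second connect timeout is a hypothetical placeholder.

```python
import time
from typing import Callable


def legacy_connect_with_backoff(
    try_connect: Callable[[float], bool],
    initial_backoff: float = 1.0,       # INITIAL_BACKOFF, seconds (illustrative)
    multiplier: float = 2.0,            # MULTIPLIER (illustrative)
    max_backoff: float = 120.0,         # MAX_BACKOFF, seconds (illustrative)
    min_connect_timeout: float = 20.0,  # MIN_CONNECT_TIMEOUT, hypothetical value
    sleep_for: Callable[[float], None] = time.sleep,
) -> int:
    """Call try_connect(timeout) until it returns True.

    Returns the number of attempts made. Every attempt gets the same
    fixed timeout, and the backoff is slept in full after each failure.
    """
    current_backoff = initial_backoff
    attempts = 1
    while not try_connect(min_connect_timeout):
        # The sleep starts when the attempt ends, not when it began.
        sleep_for(current_backoff)
        current_backoff = min(current_backoff * multiplier, max_backoff)
        attempts += 1
    return attempts
```

Here the gap between the starts of two attempts is the duration of the failed attempt plus the backoff, which is why the two algorithms only coincide when `try_connect` fails instantly.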

The gRPC C implementation currently uses this approach with an initial backoff
of 1 second, a multiplier of 2, and a maximum backoff of 120 seconds. (This will
change.)

Stubby, or at least rpc2, uses exactly this algorithm with an initial backoff
of 1 second, a multiplier of 1.2, and a maximum backoff of 120 seconds.
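
To make the two parameter sets concrete, the short sketch below lists the successive backoff values each one produces. The helper name `backoff_schedule` is made up for this example; only the parameter values come from the text above.

```python
from typing import List


def backoff_schedule(
    initial_backoff: float, multiplier: float, max_backoff: float, count: int
) -> List[float]:
    """Return the first `count` backoff values, in seconds."""
    schedule = []
    current = initial_backoff
    for _ in range(count):
        schedule.append(current)
        current = min(current * multiplier, max_backoff)
    return schedule


# Parameters quoted for the gRPC C implementation: 1 s, x2, capped at 120 s.
grpc_c = backoff_schedule(1.0, 2.0, 120.0, 9)
# grpc_c == [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 120.0, 120.0]

# Parameters quoted for rpc2: 1 s, x1.2, capped at 120 s.
rpc2 = backoff_schedule(1.0, 1.2, 120.0, 30)
below_cap = sum(1 for s in rpc2 if s < 120.0)  # 27 values before the cap
```

With a multiplier of 2 the cap is reached on the eighth backoff, whereas with 1.2 the first 27 backoffs stay below 120 seconds, so the rpc2 settings ramp up far more gradually.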

## Use Cases to Consider

* Client tries to connect to a server which is down for multiple hours, e.g. for
  maintenance
* Client tries to connect to a server which is overloaded
* User is bringing up both a client and a server at the same time
    * In particular, we would like to avoid a large unnecessary delay if the
      client connects to a server which is about to come up
* Client/server are misconfigured such that connection attempts always fail
    * We want to make sure these don’t put too much load on the server by
      default.
* Server is overloaded and wants to transiently make clients back off
* Application has an out-of-band reason to believe a server is back
    * We should consider an out-of-band mechanism for the client to hint that
      we should short-circuit the backoff.
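
One possible shape for such an out-of-band hint, sketched in Python purely for illustration: the backoff sleep waits on an event that the application can set. The class and method names are hypothetical, and resetting the backoff to its initial value after a hint is an assumption of this sketch, not something the protocol specifies.

```python
import threading


class InterruptibleBackoff:
    """Backoff timer whose sleep can be cut short by an application hint."""

    def __init__(self, initial_backoff: float = 1.0, multiplier: float = 2.0,
                 max_backoff: float = 120.0) -> None:
        self._initial = initial_backoff
        self._multiplier = multiplier
        self._max = max_backoff
        self._current = initial_backoff
        self._wake = threading.Event()

    @property
    def current_backoff(self) -> float:
        return self._current

    def wait(self) -> bool:
        """Sleep for the current backoff; return True if a hint cut it short."""
        hinted = self._wake.wait(timeout=self._current)
        self._wake.clear()
        if hinted:
            # Assumption: a hint also resets the backoff to its initial value.
            self._current = self._initial
        else:
            self._current = min(self._current * self._multiplier, self._max)
        return hinted

    def hint_server_is_back(self) -> None:
        """Called by the application when it believes the server is back."""
        self._wake.set()
```

A retry loop would call `wait()` between connection attempts, while any other thread may call `hint_server_is_back()` to trigger an immediate retry.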