GRPC Connection Backoff Protocol
================================

When a connection attempt to a backend fails, it is typically desirable not to
retry immediately (to avoid flooding the network or the server with requests)
and instead to apply some form of exponential backoff.

We have several parameters:
 1. INITIAL_BACKOFF (how long to wait after the first failure before retrying)
 2. MULTIPLIER (factor by which to multiply the backoff after a failed retry)
 3. MAX_BACKOFF (upper bound on the backoff)
 4. MIN_CONNECT_TIMEOUT (minimum time allowed for a connection attempt to
    complete)

## Proposed Backoff Algorithm

Exponentially back off the start time of connection attempts up to a limit of
MAX_BACKOFF.

```
ConnectWithBackoff()
  current_backoff = INITIAL_BACKOFF
  current_deadline = now() + INITIAL_BACKOFF
  while (TryConnect(Max(current_deadline, now() + MIN_CONNECT_TIMEOUT))
         != SUCCESS)
    SleepUntil(current_deadline)
    current_backoff = Min(current_backoff * MULTIPLIER, MAX_BACKOFF)
    current_deadline = now() + current_backoff
```

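The pseudocode above could be sketched in Python roughly as follows. This is
an illustration only: `try_connect` is a hypothetical callable that makes one
connection attempt and gives up at the absolute deadline it is passed, the
`now` and `sleep` hooks exist only so a test can inject a fake clock, and the
20 second `min_connect_timeout` default is an arbitrary assumption (the other
defaults are the values quoted below for the C implementation).

```python
import time


def connect_with_backoff(try_connect, *, initial_backoff=1.0, multiplier=2.0,
                         max_backoff=120.0, min_connect_timeout=20.0,
                         now=time.monotonic, sleep=time.sleep):
    """Call try_connect(deadline) until it returns True.

    try_connect is a hypothetical callable: it attempts one connection and
    returns False if it fails or the absolute time `deadline` passes.
    Returns the number of attempts that were made.
    """
    current_backoff = initial_backoff
    current_deadline = now() + initial_backoff
    attempts = 1
    # Every attempt is given at least min_connect_timeout to complete.
    while not try_connect(max(current_deadline, now() + min_connect_timeout)):
        # SleepUntil(current_deadline): a no-op if the attempt overran it.
        remaining = current_deadline - now()
        if remaining > 0:
            sleep(remaining)
        current_backoff = min(current_backoff * multiplier, max_backoff)
        current_deadline = now() + current_backoff
        attempts += 1
    return attempts
```

Because the deadline is computed before each attempt starts, time spent inside
a failed attempt counts toward the backoff and is not added on top of it.
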
## Historical Algorithm in Stubby

Exponentially increase the intervals between connection attempts, up to a
limit of MAX_BACKOFF. This is what stubby 2 uses, and it is equivalent to the
proposed algorithm if TryConnect() fails instantly.

```
LegacyConnectWithBackoff()
  current_backoff = INITIAL_BACKOFF
  while (TryConnect(MIN_CONNECT_TIMEOUT) != SUCCESS)
    SleepFor(current_backoff)
    current_backoff = Min(current_backoff * MULTIPLIER, MAX_BACKOFF)
```

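To illustrate the equivalence claim, the following sketch computes when each
attempt would start under the two algorithms. It is a simplified model, not
part of either implementation: the function names are made up, every attempt
is assumed to fail after a fixed `attempt_duration`, and the connect timeout
is ignored.

```python
def proposed_start_times(n, attempt_duration, initial=1.0, multiplier=2.0,
                         max_backoff=120.0):
    """Start times of the first n attempts when backing off start times."""
    t, backoff, starts = 0.0, initial, []
    for _ in range(n):
        starts.append(t)
        deadline = t + backoff                   # fixed when the attempt starts
        t = max(t + attempt_duration, deadline)  # attempt fails, then SleepUntil
        backoff = min(backoff * multiplier, max_backoff)
    return starts


def legacy_start_times(n, attempt_duration, initial=1.0, multiplier=2.0,
                       max_backoff=120.0):
    """Start times of the first n attempts when backing off the sleeps."""
    t, backoff, starts = 0.0, initial, []
    for _ in range(n):
        starts.append(t)
        t += attempt_duration + backoff          # attempt fails, then SleepFor
        backoff = min(backoff * multiplier, max_backoff)
    return starts


if __name__ == "__main__":
    print(proposed_start_times(5, 0.0))  # [0.0, 1.0, 3.0, 7.0, 15.0]
    print(legacy_start_times(5, 0.0))    # [0.0, 1.0, 3.0, 7.0, 15.0]
    print(proposed_start_times(5, 0.5))  # [0.0, 1.0, 3.0, 7.0, 15.0]
    print(legacy_start_times(5, 0.5))    # [0.0, 1.5, 4.0, 8.5, 17.0]
```

With instantly failing attempts the two schedules coincide. Once attempts take
time to fail, the legacy schedule drifts later by the accumulated attempt
durations.
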
The gRPC C implementation currently uses this approach with an initial backoff
of 1 second, a multiplier of 2, and a maximum backoff of 120 seconds (this
will change).

Stubby, or at least rpc2, uses exactly this algorithm with an initial backoff
of 1 second, a multiplier of 1.2, and a maximum backoff of 120 seconds.

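To get a feel for how differently these two parameter sets behave, the sketch
below generates the sequence of sleep intervals each one produces. The helper
names are illustrative and are not taken from either code base.

```python
from itertools import islice


def backoff_schedule(initial, multiplier, max_backoff):
    """Yield the successive sleep intervals, in seconds, between attempts."""
    backoff = initial
    while True:
        yield backoff
        backoff = min(backoff * multiplier, max_backoff)


def intervals_before_cap(initial, multiplier, max_backoff):
    """Count the sleep intervals that are still shorter than max_backoff."""
    if multiplier <= 1 and initial < max_backoff:
        raise ValueError("schedule never reaches max_backoff")
    count = 0
    for backoff in backoff_schedule(initial, multiplier, max_backoff):
        if backoff >= max_backoff:
            return count
        count += 1


if __name__ == "__main__":
    # gRPC C parameters quoted above: 1 s initial, x2, 120 s cap.
    print(list(islice(backoff_schedule(1, 2, 120), 8)))
    print(intervals_before_cap(1, 2, 120))
    # rpc2 parameters quoted above: 1 s initial, x1.2, 120 s cap.
    print(intervals_before_cap(1, 1.2, 120))
```

With a multiplier of 2, only the first seven intervals (1 to 64 seconds) are
below the cap. With a multiplier of 1.2, the first 27 are, so a client retries
far more often before it settles at one attempt every 120 seconds.
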
## Use Cases to Consider

* Client tries to connect to a server which is down for multiple hours, e.g.
  for maintenance
* Client tries to connect to a server which is overloaded
* User is bringing up both a client and a server at the same time
  * In particular, we would like to avoid a large unnecessary delay if the
    client connects to a server which is about to come up
* Client/server are misconfigured such that connection attempts always fail
  * We want to make sure these don't put too much load on the server by
    default.
* Server is overloaded and wants to transiently make clients back off
* Application has an out-of-band reason to believe a server is back
  * We should consider an out-of-band mechanism for the client to hint that
    we should short-circuit the backoff.
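
One possible shape for such a hint, purely as a sketch: the backoff wait is
built on an event that the application can set to cut the current sleep short.
The class and method names are hypothetical and this document does not
prescribe an API. Resetting to the initial backoff after a hint is a design
assumption made here, not a requirement.

```python
import threading


class InterruptibleBackoff:
    """Backoff timer whose current wait can be cut short by a hint."""

    def __init__(self, initial=1.0, multiplier=2.0, max_backoff=120.0):
        self._initial = initial
        self._multiplier = multiplier
        self._max = max_backoff
        self.current = initial
        self._hint = threading.Event()

    def hint_server_available(self):
        """Called by the application when it believes the server is back."""
        self._hint.set()

    def wait(self):
        """Block for the current backoff; return True if a hint ended it early."""
        hinted = self._hint.wait(timeout=self.current)
        if hinted:
            self._hint.clear()
            self.current = self._initial  # assumption: start over after a hint
        else:
            self.current = min(self.current * self._multiplier, self._max)
        return hinted
```

A connection loop would call `wait()` between failed attempts, while any other
thread may call `hint_server_available()` to trigger an immediate retry.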