TCP protocol
============

Last updated: 3 June 2017

Contents
========

- Congestion control
- How the new TCP output machine [nyi] works

Congestion control
==================

The following variables are used in the tcp_sock for congestion control:
snd_cwnd        The size of the congestion window.
snd_ssthresh    Slow start threshold. We are in slow start if
                snd_cwnd is less than this.
snd_cwnd_cnt    A counter used to slow down the rate of increase
                once we exceed the slow start threshold.
snd_cwnd_clamp  The maximum size that snd_cwnd can grow to.
snd_cwnd_stamp  Timestamp of when the congestion window was last validated.
snd_cwnd_used   Used as a high-water mark for how much of the
                congestion window is in use. It is used to adjust
                snd_cwnd down when the connection is limited by the
                application rather than the network.
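
The way these fields interact can be sketched in user space. The
following is a minimal model, not kernel code: struct cc_state and the
cc_* helpers are hypothetical names, and the growth rule is a simplified
Reno-style one (one segment per ACK in slow start, one segment per
window of ACKs afterwards), assumed here for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model of the fields listed above. */
struct cc_state {
	uint32_t snd_cwnd;       /* congestion window, in segments */
	uint32_t snd_ssthresh;   /* slow start threshold */
	uint32_t snd_cwnd_cnt;   /* ACKs counted since the last increase */
	uint32_t snd_cwnd_clamp; /* upper bound for snd_cwnd */
};

/* We are in slow start while snd_cwnd is below snd_ssthresh. */
static int cc_in_slow_start(const struct cc_state *s)
{
	return s->snd_cwnd < s->snd_ssthresh;
}

/* Account for one ACK that covers one segment (simplified assumption). */
static void cc_on_ack(struct cc_state *s)
{
	assert(s->snd_cwnd > 0);

	if (cc_in_slow_start(s)) {
		/* Slow start: grow by one segment per ACK, up to the clamp. */
		if (s->snd_cwnd < s->snd_cwnd_clamp)
			s->snd_cwnd++;
		return;
	}

	/*
	 * Past the threshold: snd_cwnd_cnt slows the increase down to
	 * one segment per snd_cwnd ACKs.
	 */
	if (++s->snd_cwnd_cnt >= s->snd_cwnd) {
		s->snd_cwnd_cnt = 0;
		if (s->snd_cwnd < s->snd_cwnd_clamp)
			s->snd_cwnd++;
	}
}
```

Starting from snd_cwnd = 2 and snd_ssthresh = 4, two ACKs bring the
window to 4, after which it takes four more ACKs to reach 5.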

As of 2.6.13, Linux supports pluggable congestion control algorithms.
A congestion control mechanism can be registered through functions in
tcp_cong.c. The functions used by the congestion control mechanism are
registered by passing a tcp_congestion_ops struct to
tcp_register_congestion_control. At a minimum, the congestion control
mechanism must provide a valid name and must implement either the ssthresh,
cong_avoid and undo_cwnd hooks or the "omnipotent" cong_control hook.
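
The registration rule above can be illustrated with a small user-space
model. This is only a sketch: struct cc_ops, cc_register, the hook
signatures, the table size, the name length limit and the error values
are all assumptions made for the example and do not reproduce the
kernel's tcp_congestion_ops or tcp_register_congestion_control.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define CC_NAME_MAX  16 /* assumed name length limit for this model */
#define CC_MAX_ALGOS 8  /* assumed table size for this model */

/* Hypothetical, simplified stand-in for a congestion ops structure. */
struct cc_ops {
	const char *name;
	unsigned int (*ssthresh)(void *sk);
	void (*cong_avoid)(void *sk, unsigned int ack, unsigned int acked);
	unsigned int (*undo_cwnd)(void *sk);
	void (*cong_control)(void *sk);
};

static const struct cc_ops *cc_table[CC_MAX_ALGOS];
static size_t cc_count;

/* Returns 0 on success or a negative errno-style value on failure. */
static int cc_register(const struct cc_ops *ops)
{
	size_t i;

	assert(ops != NULL);

	/* A valid, non-empty name is mandatory. */
	if (!ops->name || ops->name[0] == '\0' ||
	    strlen(ops->name) >= CC_NAME_MAX)
		return -EINVAL;

	/* Either the three classic hooks or cong_control must be present. */
	if (!ops->cong_control &&
	    !(ops->ssthresh && ops->cong_avoid && ops->undo_cwnd))
		return -EINVAL;

	for (i = 0; i < cc_count; i++)
		if (strcmp(cc_table[i]->name, ops->name) == 0)
			return -EEXIST;

	if (cc_count == CC_MAX_ALGOS)
		return -ENOSPC;

	cc_table[cc_count++] = ops;
	return 0;
}

/* Trivial demo hooks so that a complete cc_ops can be built. */
static unsigned int demo_ssthresh(void *sk) { (void)sk; return 2; }
static void demo_cong_avoid(void *sk, unsigned int ack, unsigned int acked)
{
	(void)sk; (void)ack; (void)acked;
}
static unsigned int demo_undo_cwnd(void *sk) { (void)sk; return 10; }
static void demo_cong_control(void *sk) { (void)sk; }
```

In this model, a cc_ops that sets only a name is rejected, while one
that sets a name plus demo_cong_control alone is accepted.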

Private data for a congestion control mechanism is stored in tp->ca_priv.
tcp_ca(tp) returns a pointer to this space. This space is preallocated, so
it is important to check that your private data fits in it. Alternatively,
space can be allocated elsewhere and a pointer to it stored here.
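
One way to make the size check hard to forget is to do it at compile
time. The sketch below models the idea in user space. CA_PRIV_SIZE,
struct conn_model, struct demo_ca and the demo_* helpers are hypothetical
names chosen for the example, and the 64-byte area is an assumed size,
not a value taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CA_PRIV_SIZE 64 /* assumed size of the preallocated area */

/* Hypothetical connection state with a preallocated private area. */
struct conn_model {
	uint32_t snd_cwnd;
	uint64_t ca_priv[CA_PRIV_SIZE / sizeof(uint64_t)];
};

/* Private state of a made-up congestion control mechanism. */
struct demo_ca {
	uint32_t min_rtt_us; /* smallest RTT sample seen so far */
	uint32_t samples;    /* number of RTT samples seen */
};

/* Fail the build, not the running system, if the state does not fit. */
static_assert(sizeof(struct demo_ca) <= CA_PRIV_SIZE,
	      "demo_ca does not fit in the preallocated private area");

/*
 * Accessor in the spirit of tcp_ca(tp). The cast mirrors the usual
 * kernel-style idiom; strictly portable code could memcpy instead.
 */
static struct demo_ca *demo_ca(struct conn_model *c)
{
	return (struct demo_ca *)c->ca_priv;
}

static void demo_init(struct conn_model *c)
{
	struct demo_ca *ca = demo_ca(c);

	memset(ca, 0, sizeof(*ca));
	ca->min_rtt_us = UINT32_MAX;
}

static void demo_on_rtt(struct conn_model *c, uint32_t rtt_us)
{
	struct demo_ca *ca = demo_ca(c);

	if (rtt_us < ca->min_rtt_us)
		ca->min_rtt_us = rtt_us;
	ca->samples++;
}
```

If struct demo_ca ever grows past CA_PRIV_SIZE, compilation stops with
the message given to static_assert.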

There are currently three kinds of congestion control algorithms. The
simplest ones are derived from TCP Reno (highspeed, scalable) and just
provide an alternative congestion window calculation. More complex
ones like BIC try to look at other events to provide better
heuristics. There are also round trip time based algorithms like
Vegas and Westwood+.

Good TCP congestion control is a complex problem because the algorithm
needs to maintain fairness and performance. Please review current
research and RFCs before developing new modules.

The default congestion control mechanism is chosen based on the
DEFAULT_TCP_CONG Kconfig parameter. If you want a particular default
value, you can set it using the sysctl net.ipv4.tcp_congestion_control. The
module will be autoloaded if needed and you will get the expected algorithm.
If you ask for an unknown congestion control method, the sysctl attempt will
fail.
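
The current default can also be read from a program. The sketch below
assumes the sysctl is exposed through procfs at
/proc/sys/net/ipv4/tcp_congestion_control, which is the usual mapping for
net.ipv4.* sysctls. The helper names cc_read_name and cc_current_default
are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed procfs path of sysctl net.ipv4.tcp_congestion_control. */
#define CC_SYSCTL_PATH "/proc/sys/net/ipv4/tcp_congestion_control"

/*
 * Read one line from f into buf and strip the trailing newline.
 * Returns 0 on success, -1 on read error or empty input.
 */
static int cc_read_name(FILE *f, char *buf, size_t len)
{
	assert(f != NULL && buf != NULL && len > 1);

	if (!fgets(buf, (int)len, f))
		return -1;
	buf[strcspn(buf, "\n")] = '\0';
	return buf[0] != '\0' ? 0 : -1;
}

/* Returns 0 and fills buf with the default algorithm name, or -1. */
static int cc_current_default(char *buf, size_t len)
{
	FILE *f = fopen(CC_SYSCTL_PATH, "r");
	int ret;

	if (!f)
		return -1; /* not Linux, or procfs not available */
	ret = cc_read_name(f, buf, len);
	fclose(f);
	return ret;
}
```

Changing the default means writing a name to the same sysctl, which
normally requires administrative privileges.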

If you remove a TCP congestion control module, then you will get the next
available one. Since reno cannot be built as a module and cannot be
removed, it is always available.

How the new TCP output machine [nyi] works.
===========================================

Data is kept on a single queue. The skb->users flag tells us if the frame is
one that has been queued already. To add a frame we throw it on the end. Ack
processing walks down the list from the start.

We keep a set of control flags:

sk->tcp_pend_event

    TCP_PEND_ACK        Ack needed
    TCP_ACK_NOW         Ack needed now
    TCP_WINDOW          Window update check
    TCP_WINZERO         Zero window probing

sk->transmit_queue      Where the transmission frames begin
sk->transmit_new        First new frame pointer
sk->transmit_end        Where to add frames

sk->tcp_last_tx_ack     Last ack seen
sk->tcp_dup_ack         Dup ack count for fast retransmit
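
Since this design is marked [nyi], there is no implementation to point
at. The following user-space sketch only illustrates how a single queue
with the three pointers named above could behave. struct frame,
struct txq, the txq_* helpers and the use of an end sequence number per
frame are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical frame: only what the queue logic needs. */
struct frame {
	uint32_t end_seq;   /* sequence number just past this frame */
	struct frame *next;
};

/* Field names follow the list above; the layout is assumed. */
struct txq {
	struct frame *transmit_queue; /* oldest frame not yet acked */
	struct frame *transmit_new;   /* first frame never sent */
	struct frame *transmit_end;   /* where new frames are added */
};

/* Add a frame at the end of the queue. Returns 0, or -1 if out of memory. */
static int txq_add(struct txq *q, uint32_t end_seq)
{
	struct frame *f = malloc(sizeof(*f));

	if (!f)
		return -1;
	f->end_seq = end_seq;
	f->next = NULL;
	if (q->transmit_end)
		q->transmit_end->next = f;
	else
		q->transmit_queue = f;
	q->transmit_end = f;
	if (!q->transmit_new)
		q->transmit_new = f;
	return 0;
}

/* "Send" the first new frame. Returns 1 and its end_seq, or 0 if none. */
static int txq_send_one(struct txq *q, uint32_t *end_seq)
{
	assert(end_seq != NULL);
	if (!q->transmit_new)
		return 0;
	*end_seq = q->transmit_new->end_seq;
	q->transmit_new = q->transmit_new->next;
	return 1;
}

/* Walk from the start and free every sent frame covered by ack. */
static unsigned int txq_ack(struct txq *q, uint32_t ack)
{
	unsigned int freed = 0;

	while (q->transmit_queue &&
	       q->transmit_queue != q->transmit_new &&
	       (int32_t)(ack - q->transmit_queue->end_seq) >= 0) {
		struct frame *f = q->transmit_queue;

		q->transmit_queue = f->next;
		if (!q->transmit_queue)
			q->transmit_end = NULL;
		free(f);
		freed++;
	}
	return freed;
}
```

In this model frames between transmit_queue and transmit_new have been
sent and are waiting for an ack, while frames from transmit_new onwards
have not been sent yet.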

Frames are queued for output by tcp_write. We do our best to send the frames
off immediately if possible, but otherwise queue them and compute the body
checksum during the copy.

When a write is done we try to clear any pending events and piggyback them.
If the window is full we queue full sized frames. On the first timeout in
zero window we split this.
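
Clearing pending events on a write can be pictured as simple bit
operations. The sketch below is illustrative only: the flag names come
from the list earlier in this section, but the bit values, struct
pend_model, the pend_* helpers and the choice of which events an outgoing
data frame satisfies are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Flag names from the text; the bit values are assumed. */
enum {
	TCP_PEND_ACK = 1u << 0, /* ack needed */
	TCP_ACK_NOW  = 1u << 1, /* ack needed now */
	TCP_WINDOW   = 1u << 2, /* window update check */
	TCP_WINZERO  = 1u << 3, /* zero window probing */
};

/*
 * Assumption: an outgoing data frame carries an ack and a window
 * advertisement, so it can satisfy these three events. Zero window
 * probing concerns the peer's window and stays pending.
 */
#define PIGGYBACK_MASK (TCP_PEND_ACK | TCP_ACK_NOW | TCP_WINDOW)

/* Hypothetical holder for the pending event flags. */
struct pend_model {
	unsigned int tcp_pend_event;
};

static void pend_raise(struct pend_model *p, unsigned int events)
{
	assert(p != NULL);
	p->tcp_pend_event |= events;
}

/* Called when a data frame goes out. Returns the events it carried. */
static unsigned int pend_piggyback(struct pend_model *p)
{
	unsigned int carried;

	assert(p != NULL);
	carried = p->tcp_pend_event & PIGGYBACK_MASK;
	p->tcp_pend_event &= ~carried;
	return carried;
}
```

With TCP_PEND_ACK and TCP_WINZERO both raised, a write in this model
carries the ack and leaves only TCP_WINZERO pending.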

On a timer we walk the retransmit list to send any retransmits, update the
backoff timers, etc. A change of route table stamp causes a change of header
and a recompute. We add any new TCP level headers and refinish the checksum
before sending.