TCP protocol
============

Last updated: 9 February 2008

Contents
========

- Congestion control
- How the new TCP output machine [nyi] works

Congestion control
==================

The following variables are used in the tcp_sock for congestion control:
snd_cwnd        The size of the congestion window
snd_ssthresh    Slow start threshold. We are in slow start if
                snd_cwnd is less than this.
snd_cwnd_cnt    A counter used to slow down the rate of increase
                once we exceed slow start threshold.
snd_cwnd_clamp  This is the maximum size that snd_cwnd can grow to.
snd_cwnd_stamp  Timestamp for when congestion window last validated.
snd_cwnd_used   Used as a highwater mark for how much of the
                congestion window is in use. It is used to adjust
                snd_cwnd down when the link is limited by the
                application rather than the network.

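How snd_cwnd, snd_ssthresh, snd_cwnd_cnt and snd_cwnd_clamp interact can be
sketched in userspace. This is a simplified Reno-style model, not the kernel
code: struct cwnd_model and model_on_ack are hypothetical names, the window is
counted in segments, and loss handling is left out.

```c
#include <stdint.h>

/* Hypothetical userspace model of the fields; not the kernel's tcp_sock. */
struct cwnd_model {
	uint32_t snd_cwnd;       /* congestion window, in segments */
	uint32_t snd_ssthresh;   /* slow start threshold */
	uint32_t snd_cwnd_cnt;   /* ACKs counted since the last linear increase */
	uint32_t snd_cwnd_clamp; /* upper bound for snd_cwnd */
};

/* Process one ACK of new data, Reno style. */
static void model_on_ack(struct cwnd_model *m)
{
	if (m->snd_cwnd < m->snd_ssthresh) {
		/* Slow start: grow by one segment per ACK. */
		if (m->snd_cwnd < m->snd_cwnd_clamp)
			m->snd_cwnd++;
		return;
	}
	/* Congestion avoidance: grow by one segment per window of ACKs. */
	if (++m->snd_cwnd_cnt >= m->snd_cwnd) {
		m->snd_cwnd_cnt = 0;
		if (m->snd_cwnd < m->snd_cwnd_clamp)
			m->snd_cwnd++;
	}
}
```

In this model the window roughly doubles per round trip below snd_ssthresh
and grows by about one segment per round trip above it.
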
As of 2.6.13, Linux supports pluggable congestion control algorithms.
A congestion control mechanism can be registered through functions in
tcp_cong.c. The functions used by the congestion control mechanism are
registered by passing a tcp_congestion_ops struct to
tcp_register_congestion_control. At a minimum, name, ssthresh,
cong_avoid and min_cwnd must be valid.

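The "must be valid" rule can be illustrated with a userspace mock. Everything
here is an assumption made for illustration: struct mock_cong_ops is a
simplified stand-in for tcp_congestion_ops, the callback signatures are
invented (the real ones take kernel types and vary by kernel version), and
returning -EINVAL on a missing member is this mock's choice.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for tcp_congestion_ops. */
struct mock_cong_ops {
	const char *name;
	uint32_t (*ssthresh)(void *sk);
	void (*cong_avoid)(void *sk, uint32_t acked);
	uint32_t (*min_cwnd)(const void *sk);
};

/* Reject a registration that lacks any of the required members. */
static int mock_register_congestion_control(const struct mock_cong_ops *ca)
{
	if (!ca->name || !ca->ssthresh || !ca->cong_avoid || !ca->min_cwnd)
		return -EINVAL;
	return 0;
}

/* Trivial callbacks so that a complete ops struct can be built. */
static uint32_t demo_ssthresh(void *sk) { (void)sk; return 2; }
static void demo_cong_avoid(void *sk, uint32_t acked) { (void)sk; (void)acked; }
static uint32_t demo_min_cwnd(const void *sk) { (void)sk; return 2; }
```

A struct filled with the three demo callbacks and a name passes the check;
clearing any one of the four members makes the mock refuse it.
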
Private data for a congestion control mechanism is stored in tp->ca_priv.
tcp_ca(tp) returns a pointer to this space. This is preallocated space, so
it is important to check that your private data will fit in it.
Alternatively, space can be allocated elsewhere and a pointer to it stored
here.

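One way to make the size check hard to forget is to do it at compile time.
The sketch below assumes a 64 byte private area (DEMO_CA_PRIV_SIZE is a
made-up constant; the real size is defined by the kernel) and a hypothetical
struct demo_ca holding an algorithm's state.

```c
#include <stdint.h>

/* Assumed size of the preallocated area; the real value is kernel-defined. */
#define DEMO_CA_PRIV_SIZE 64

/* Hypothetical private state for an example algorithm. */
struct demo_ca {
	uint32_t min_rtt;
	uint32_t last_cwnd;
	uint32_t loss_count;
};

/* Fail the build if the private state outgrows the preallocated area. */
_Static_assert(sizeof(struct demo_ca) <= DEMO_CA_PRIV_SIZE,
	       "demo_ca does not fit in the private area");
```

If a later change grows struct demo_ca past the limit, the build stops
instead of the module silently overrunning the space.
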
There are currently three kinds of congestion control algorithms. The
simplest ones are derived from TCP Reno (highspeed, scalable) and just
provide an alternative congestion window calculation. More complex
ones like BIC try to look at other events to provide better
heuristics. There are also round trip time based algorithms like
Vegas and Westwood+.

Good TCP congestion control is a complex problem because the algorithm
needs to maintain fairness and performance. Please review current
research and RFCs before developing new modules.

The congestion control mechanism that is used is determined by the
setting of the sysctl net.ipv4.tcp_congestion_control.
The default congestion control will be the last one registered (LIFO);
so if you build everything as modules, the default will be reno. If you
build with the defaults from Kconfig, then CUBIC will be built in (not a
module) and it will end up the default.

If you really want a particular default value then you will need
to set it with the sysctl. If you use a sysctl, the module will be autoloaded
if needed and you will get the expected protocol. If you ask for an
unknown congestion method, then the sysctl attempt will fail.

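The sysctl is also reachable as a file under /proc/sys, so a program can read
or set it without a helper tool. The function names below are hypothetical.
The sketch assumes /proc is mounted and that writing requires sufficient
privileges.

```c
#include <stdio.h>
#include <string.h>

#define DEMO_SYSCTL_PATH "/proc/sys/net/ipv4/tcp_congestion_control"

/* Read the current setting into buf; returns 0 on success, -1 on failure. */
int read_tcp_congestion_control(char *buf, size_t len)
{
	FILE *f = fopen(DEMO_SYSCTL_PATH, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0'; /* drop the trailing newline */
	return 0;
}

/* Write a new setting; returns 0 on success, -1 on failure. */
int write_tcp_congestion_control(const char *name)
{
	FILE *f = fopen(DEMO_SYSCTL_PATH, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(name, f) >= 0;
	/* A rejected value may only show up when the buffer is flushed. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A failed write, for example for an unknown congestion method, shows up here
as a -1 return.
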
If you remove a TCP congestion control module, then you will get the next
available one. Since reno cannot be built as a module, and cannot be
deleted, it will always be available.

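The LIFO default and the fallback on removal can be modelled with a linked
list whose head is the default. This is only a userspace sketch with
hypothetical names (demo_ca_entry, demo_register, demo_unregister); it does
not reproduce the kernel's locking or module handling.

```c
#include <stddef.h>

/* Hypothetical list node standing in for a registered algorithm. */
struct demo_ca_entry {
	const char *name;
	struct demo_ca_entry *next;
};

/* Head of the list; the head entry acts as the default. */
static struct demo_ca_entry *demo_ca_list;

/* Register by pushing onto the head, so the newest entry is the default. */
static void demo_register(struct demo_ca_entry *ca)
{
	ca->next = demo_ca_list;
	demo_ca_list = ca;
}

/* Unregister by unlinking; the default falls back to the next entry. */
static void demo_unregister(struct demo_ca_entry *ca)
{
	struct demo_ca_entry **pp = &demo_ca_list;

	while (*pp && *pp != ca)
		pp = &(*pp)->next;
	if (*pp)
		*pp = ca->next;
}

/* Name of the current default, or NULL if nothing is registered. */
static const char *demo_default_name(void)
{
	return demo_ca_list ? demo_ca_list->name : NULL;
}
```

Registering "reno" first and "cubic" second makes "cubic" the default in this
model; unregistering "cubic" brings "reno" back.
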
How the new TCP output machine [nyi] works.
===========================================

Data is kept on a single queue. The skb->users flag tells us if the frame is
one that has been queued already. To add a frame we throw it on the end. Ack
walks down the list from the start.

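Since this design is marked [nyi], the following is only a guess at its
shape: a singly linked queue where new frames go on the end and an incoming
ack frees frames from the start. All names are hypothetical, and sequence
number wraparound is ignored to keep the sketch short.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical frame covering sequence numbers [seq, end_seq). */
struct demo_frame {
	uint32_t seq;
	uint32_t end_seq;
	struct demo_frame *next;
};

struct demo_queue {
	struct demo_frame *head; /* oldest unacknowledged frame */
	struct demo_frame *tail; /* where new frames are added */
};

/* Add a frame at the end of the queue; returns 0 on success. */
static int demo_enqueue(struct demo_queue *q, uint32_t seq, uint32_t len)
{
	struct demo_frame *f = malloc(sizeof(*f));

	if (!f)
		return -1;
	f->seq = seq;
	f->end_seq = seq + len;
	f->next = NULL;
	if (q->tail)
		q->tail->next = f;
	else
		q->head = f;
	q->tail = f;
	return 0;
}

/* Walk from the start, freeing frames fully covered by ack.
 * Returns the number of frames freed. */
static int demo_ack(struct demo_queue *q, uint32_t ack)
{
	int freed = 0;

	while (q->head && q->head->end_seq <= ack) {
		struct demo_frame *f = q->head;

		q->head = f->next;
		free(f);
		freed++;
	}
	if (!q->head)
		q->tail = NULL;
	return freed;
}
```

A partially acknowledged frame stays on the queue in this model, so the walk
stops at the first frame that the ack does not fully cover.
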
We keep a set of control flags


	sk->tcp_pend_event

		TCP_PEND_ACK        Ack needed
		TCP_ACK_NOW         Needed now
		TCP_WINDOW          Window update check
		TCP_WINZERO         Zero probing


	sk->transmit_queue      The transmission frame begin
	sk->transmit_new        First new frame pointer
	sk->transmit_end        Where to add frames

	sk->tcp_last_tx_ack     Last ack seen
	sk->tcp_dup_ack         Dup ack count for fast retransmit


Frames are queued for output by tcp_write. We do our best to send the frames
off immediately if possible; otherwise we queue them and compute the body
checksum during the copy.

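Computing the checksum in the copy means touching each byte once instead of
twice. A minimal userspace sketch of the idea follows. demo_copy_and_csum is
a hypothetical name, the sum is the usual 16-bit one's complement Internet
checksum over big-endian words, and the buffer is assumed to be frame sized
(well under 64 KiB) so the 32-bit accumulator cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes from src to dst while accumulating the 16-bit one's
 * complement sum of the data. Returns the folded, uninverted sum. */
static uint16_t demo_copy_and_csum(void *dst, const void *src, size_t len)
{
	const uint8_t *s = src;
	uint8_t *d = dst;
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		d[i] = s[i];
		d[i + 1] = s[i + 1];
		sum += ((uint32_t)s[i] << 8) | s[i + 1]; /* big-endian word */
	}
	if (i < len) {
		d[i] = s[i];
		sum += (uint32_t)s[i] << 8; /* odd trailing byte, zero padded */
	}
	while (sum >> 16) /* fold carries back into the low 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

The caller would combine this partial sum with the header sum and invert the
result once the headers are known.
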
When a write is done we try to clear any pending events and piggyback them.
If the window is full we queue full sized frames. On the first timeout in
zero window we split this.

On a timer we walk the retransmit list to send any retransmits, update the
backoff timers, etc. A change of route table stamp causes a change of header
and a recompute. We add any new TCP level headers and refinish the checksum
before sending.