Checksum Offloads in the Linux Networking Stack


Introduction
============

This document describes a set of techniques in the Linux networking stack
to take advantage of checksum offload capabilities of various NICs.

The following technologies are described:
 * TX Checksum Offload
 * LCO: Local Checksum Offload
 * RCO: Remote Checksum Offload

Things that should be documented here but aren't yet:
 * RX Checksum Offload
 * CHECKSUM_UNNECESSARY conversion


TX Checksum Offload
===================

The interface for offloading a transmit checksum to a device is explained
in detail in comments near the top of include/linux/skbuff.h.
In brief, it allows the stack to request that the device fill in a single
ones-complement checksum defined by the sk_buff fields skb->csum_start and
skb->csum_offset. The device should compute the 16-bit ones-complement
checksum (i.e. the 'IP-style' checksum) from csum_start to the end of the
packet, and fill in the result at (csum_start + csum_offset).
Because csum_offset cannot be negative, the previous value of the checksum
field is always included in the checksum computation, so it can be used to
supply any needed corrections to the checksum (such as the sum of the
pseudo-header for UDP or TCP).
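
As an illustration of the arithmetic just described, the following
user-space sketch emulates what the device (or a software fallback) is
asked to do. The helper names (csum_partial_sketch, csum_fold_sketch,
tx_csum_fill) are made up for this example and are not kernel APIs. It
assumes a linear packet buffer, big-endian 16-bit words and an even
csum_offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add len bytes to a running ones-complement sum, treating the buffer
 * as big-endian 16-bit words.  An odd trailing byte is zero-padded.
 */
static uint32_t csum_partial_sketch(const uint8_t *buf, size_t len,
				    uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	return sum;
}

/* Fold a 32-bit accumulator to 16 bits with end-around carry. */
static uint16_t csum_fold_sketch(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Hypothetical stand-in for the device: sum from csum_start to the end
 * of the packet (including whatever seed is already in the checksum
 * field) and write the complement at csum_start + csum_offset.
 */
static void tx_csum_fill(uint8_t *pkt, size_t len, size_t csum_start,
			 size_t csum_offset)
{
	size_t field = csum_start + csum_offset;
	uint16_t csum;

	assert((csum_offset & 1) == 0);	/* sketch assumes word alignment */
	assert(field + 2 <= len);

	csum = (uint16_t)~csum_fold_sketch(
		csum_partial_sketch(pkt + csum_start, len - csum_start, 0));
	pkt[field] = csum >> 8;
	pkt[field + 1] = csum & 0xff;
}
```

If the checksum field is seeded with a value S before tx_csum_fill() runs,
the ones-complement sum from csum_start to the end of the packet afterwards
folds to the complement of S.
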
This interface only allows a single checksum to be offloaded. Where
encapsulation is used, the packet may have multiple checksum fields in
different header layers, and the rest will have to be handled by another
mechanism such as LCO or RCO.
No offloading of the IP header checksum is performed; it is always done in
software. This is OK because when we build the IP header, we obviously
have it in cache, so summing it isn't expensive. It's also rather short.
The requirements for GSO are more complicated, because when segmenting an
encapsulated packet both the inner and outer checksums may need to be
edited or recomputed for each resulting segment. See the skbuff.h comment
(section 'E') for more details.

A driver declares its offload capabilities in netdev->hw_features; see
Documentation/networking/netdev-features for more. Note that a device
which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start
and csum_offset given in the SKB; if it tries to deduce these itself in
hardware (as some NICs do) the driver should check that the values in the
SKB match those which the hardware will deduce, and if not, fall back to
checksumming in software instead (with skb_checksum_help or one of the
skb_csum_off_chk* functions as mentioned in include/linux/skbuff.h). This
is a pain, but that's what you get when hardware tries to be clever.

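The driver-side check described above might look roughly like the
following user-space sketch. It is not taken from any real driver: struct
pkt_meta, hw_can_offload() and the *_SKETCH constants are hypothetical
stand-ins, and the imagined hardware is assumed to locate the first TCP or
UDP header itself and checksum only that. The field offsets 16 (TCP) and 6
(UDP) are the positions of the checksum in the standard headers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IPPROTO_TCP_SKETCH	6
#define IPPROTO_UDP_SKETCH	17

/* Hypothetical summary of the few packet properties the check needs. */
struct pkt_meta {
	size_t csum_start;	/* where the stack wants summing to start */
	size_t csum_offset;	/* checksum field, relative to csum_start */
	size_t l4_offset;	/* L4 header the hardware parser would find */
	uint8_t l4_proto;	/* protocol the hardware parser would see */
};

/* Return true only if the hardware would deduce exactly the start and
 * offset the stack asked for; otherwise the driver should fall back to
 * checksumming in software.
 */
static bool hw_can_offload(const struct pkt_meta *m)
{
	size_t expect_offset;

	switch (m->l4_proto) {
	case IPPROTO_TCP_SKETCH:
		expect_offset = 16;
		break;
	case IPPROTO_UDP_SKETCH:
		expect_offset = 6;
		break;
	default:
		return false;
	}
	return m->csum_start == m->l4_offset &&
	       m->csum_offset == expect_offset;
}
```

With an encapsulated packet, such hardware would find the outer UDP header
while csum_start points at the inner transport header, so the check fails
and the driver takes the software path.
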
The stack should, for the most part, assume that checksum offload is
supported by the underlying device. The only place that should check is
validate_xmit_skb(), and the functions it calls directly or indirectly.
That function checks the offload features requested by the SKB (which
may include other offloads besides TX Checksum Offload) against those
supported and enabled on the device (determined by netdev->features) and,
for any that are missing, performs the corresponding offload in software.
In the case of TX Checksum Offload, that means calling
skb_checksum_help(skb).


LCO: Local Checksum Offload
===========================

LCO is a technique for efficiently computing the outer checksum of an
encapsulated datagram when the inner checksum is due to be offloaded.
The ones-complement sum of a correctly checksummed TCP or UDP packet is
equal to the complement of the sum of the pseudo header, because everything
else gets 'cancelled out' by the checksum field. This is because the sum was
complemented before being written to the checksum field.
More generally, this holds in any case where the 'IP-style' ones-complement
checksum is used, and thus for any checksum that TX Checksum Offload
supports.
That is, if we have set up TX Checksum Offload with a start/offset pair, we
know that after the device has filled in that checksum, the ones-complement
sum from csum_start to the end of the packet will be equal to the
complement of whatever value we put in the checksum field beforehand.
This allows us to compute the outer checksum without looking at the payload:
we simply stop summing when we get to csum_start, then add the complement of
the 16-bit word at (csum_start + csum_offset).
Then, when the true inner checksum is filled in (either by hardware or by
skb_checksum_help()), the outer checksum will become correct by virtue of
the arithmetic.

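The shortcut can be sketched in user space as follows. This is not the
kernel's lco_csum(); the function names are invented for this example, and
it assumes a linear buffer, big-endian 16-bit words, an even distance
between the outer header and csum_start, and that the outer checksum field
already holds any outer pseudo-header correction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add len bytes to a running ones-complement sum, treating the buffer
 * as big-endian 16-bit words.  An odd trailing byte is zero-padded.
 */
static uint32_t csum_partial_sketch(const uint8_t *buf, size_t len,
				    uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	return sum;
}

/* Fold a 32-bit accumulator to 16 bits with end-around carry. */
static uint16_t csum_fold_sketch(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Compute the outer checksum without touching the payload: sum only
 * the bytes from outer_start up to csum_start, then add the complement
 * of the seed currently sitting in the inner checksum field.
 */
static uint16_t lco_outer_csum(const uint8_t *pkt, size_t outer_start,
			       size_t csum_start, size_t csum_offset)
{
	size_t field = csum_start + csum_offset;
	uint16_t seed;
	uint32_t sum;

	assert(outer_start <= csum_start);
	assert(((csum_start - outer_start) & 1) == 0);

	seed = (uint16_t)(pkt[field] << 8 | pkt[field + 1]);
	sum = csum_partial_sketch(pkt + outer_start,
				  csum_start - outer_start,
				  (uint16_t)~seed);
	return (uint16_t)~csum_fold_sketch(sum);
}
```

The value returned is the same one a full pass over the packet would
produce once the inner checksum has been filled in, which can be checked by
filling the inner field in software and summing the whole packet from
outer_start.
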
LCO is performed by the stack when constructing an outer UDP header for an
encapsulation such as VXLAN or GENEVE, in udp_set_csum(), and likewise for
the IPv6 equivalents, in udp6_set_csum().
It is also performed when constructing an IPv4 GRE header, in
net/ipv4/ip_gre.c:build_header(). It is *not* currently performed when
constructing an IPv6 GRE header; the GRE checksum is computed over the
whole packet in net/ipv6/ip6_gre.c:ip6gre_xmit2(), but it should be
possible to use LCO here as IPv6 GRE still uses an IP-style checksum.
All of the LCO implementations use a helper function lco_csum(), in
include/linux/skbuff.h.

LCO can safely be used for nested encapsulations; in this case, the outer
encapsulation layer will sum over both its own header and the 'middle'
header. This does mean that the 'middle' header will get summed multiple
times, but there doesn't seem to be a way to avoid that without incurring
bigger costs (e.g. in SKB bloat).


RCO: Remote Checksum Offload
============================

RCO is a technique for eliding the inner checksum of an encapsulated
datagram, allowing the outer checksum to be offloaded. It does, however,
involve a change to the encapsulation protocols, which the receiver must
also support. For this reason, it is disabled by default.
RCO is detailed in the following Internet-Drafts:
https://tools.ietf.org/html/draft-herbert-remotecsumoffload-00
https://tools.ietf.org/html/draft-herbert-vxlan-rco-00
In Linux, RCO is implemented individually in each encapsulation protocol,
and most tunnel types have flags controlling its use. For instance, VXLAN
has the flag VXLAN_F_REMCSUM_TX (per struct vxlan_rdst) to indicate that
RCO should be used when transmitting to a given remote destination.