Commit | Line | Data |
---|---|---|
1da177e4 LT |
1 | -------------------------------------------------------------------------------- |
2 | + ABSTRACT | |
3 | -------------------------------------------------------------------------------- | |
4 | ||
5 | This file documents the CONFIG_PACKET_MMAP option available with the PACKET | |
6 | socket interface on 2.4 and 2.6 kernels. This type of socket is used to | |
7 | capture network traffic with utilities like tcpdump or any other that uses | |
8 | the libpcap library. | |
9 | ||
10 | You can find the latest version of this document at | |
11 | ||
12 | http://pusa.uv.es/~ulisses/packet_mmap/ | |
13 | ||
14 | Please send me your comments to | |
15 | ||
be2a608b | 16 | Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es> |
1da177e4 LT |
17 | |
18 | ------------------------------------------------------------------------------- | |
19 | + Why use PACKET_MMAP | |
20 | -------------------------------------------------------------------------------- | |
21 | ||
22 | In Linux 2.4/2.6 if PACKET_MMAP is not enabled, the capture process is very | |
23 | inefficient. It uses very limited buffers and requires one system call | |
24 | to capture each packet; it requires two if you want to get the packet's | |
25 | timestamp (like libpcap always does). | |
26 | ||
27 | On the other hand, PACKET_MMAP is very efficient. PACKET_MMAP provides a | |
28 | size-configurable circular buffer mapped in user space. This way, reading packets | |
29 | just requires waiting for them; most of the time there is no need to issue a single | |
30 | system call. Using a shared buffer between the kernel and the user | |
31 | also has the benefit of minimizing packet copies. | |
32 | ||
33 | It's fine to use PACKET_MMAP to improve the performance of the capture process, | |
34 | but it isn't everything. At least, if you are capturing at high speeds (this | |
35 | is relative to the cpu speed), you should check if the device driver of your | |
36 | network interface card supports some sort of interrupt load mitigation or | |
37 | (even better) if it supports NAPI; also make sure it is enabled. | |
38 | ||
39 | -------------------------------------------------------------------------------- | |
40 | + How to use CONFIG_PACKET_MMAP | |
41 | -------------------------------------------------------------------------------- | |
42 | ||
c30fe7f7 | 43 | From the user standpoint, you should use the higher level libpcap library, which |
1da177e4 LT |
44 | is a de facto standard, portable across nearly all operating systems |
45 | including Win32. | |
46 | ||
47 | That said, at the time of this writing, official libpcap 0.8.1 is out and doesn't include | |
48 | support for PACKET_MMAP, and the libpcap included in your distribution probably doesn't either. | |
49 | ||
50 | I'm aware of two implementations of PACKET_MMAP in libpcap: | |
51 | ||
52 | http://pusa.uv.es/~ulisses/packet_mmap/ (by Simon Patarin, based on libpcap 0.6.2) | |
53 | http://public.lanl.gov/cpw/ (by Phil Wood, based on latest libpcap) | |
54 | ||
55 | The rest of this document is intended for people who want to understand | |
56 | the low level details or want to improve libpcap by including PACKET_MMAP | |
57 | support. | |
58 | ||
59 | -------------------------------------------------------------------------------- | |
60 | + How to use CONFIG_PACKET_MMAP directly | |
61 | -------------------------------------------------------------------------------- | |
62 | ||
63 | From the system call standpoint, the use of PACKET_MMAP involves | |
64 | the following process: | |
65 | ||
66 | ||
67 | [setup] socket() -------> creation of the capture socket | |
68 | setsockopt() ---> allocation of the circular buffer (ring) | |
6c28f2c0 | 69 | mmap() ---------> mapping of the allocated buffer to the |
1da177e4 LT |
70 | user process |
71 | ||
72 | [capture] poll() ---------> to wait for incoming packets | |
73 | ||
74 | [shutdown] close() --------> destruction of the capture socket and | |
75 | deallocation of all associated | |
76 | resources. | |
77 | ||
78 | ||
79 | Socket creation and destruction are straightforward, and are done | |
80 | the same way with or without PACKET_MMAP: | |
81 | ||
82 | int fd; | |
83 | ||
84 | fd = socket(PF_PACKET, mode, htons(ETH_P_ALL)); | |
85 | ||
86 | where mode is SOCK_RAW for the raw interface where link level | |
87 | information can be captured or SOCK_DGRAM for the cooked | |
88 | interface where link level information capture is not | |
89 | supported and a link level pseudo-header is provided | |
90 | by the kernel. | |
91 | ||
92 | The destruction of the socket and all associated resources | |
93 | is done by a simple call to close(fd). | |
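
As a minimal sketch of the creation and destruction steps described above, with basic error reporting added. The helper names open_capture_socket and close_capture_socket are hypothetical and only serve the illustration; opening a packet socket normally requires sufficient privileges, so the call may fail for an ordinary user.

```c
#include <arpa/inet.h>       /* htons */
#include <linux/if_ether.h>  /* ETH_P_ALL */
#include <stdio.h>           /* perror */
#include <sys/socket.h>      /* socket, PF_PACKET, SOCK_RAW, SOCK_DGRAM */
#include <unistd.h>          /* close */

/* Hypothetical helper: open a packet socket in the given mode
 * (SOCK_RAW or SOCK_DGRAM). Returns the descriptor, or -1 on error. */
int open_capture_socket(int mode)
{
    int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));

    if (fd < 0)
        perror("socket(PF_PACKET)");
    return fd;
}

/* Hypothetical helper: closing the descriptor releases the socket
 * and all resources associated with it. */
void close_capture_socket(int fd)
{
    if (fd >= 0)
        close(fd);
}
```

A caller would typically use open_capture_socket(SOCK_RAW) and check the result for -1 before going on to the ring setup.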
94 | ||
95 | Next I will describe PACKET_MMAP settings and its constraints, | |
6c28f2c0 | 96 | also the mapping of the circular buffer in the user process and |
1da177e4 LT |
97 | the use of this buffer. |
98 | ||
99 | -------------------------------------------------------------------------------- | |
100 | + PACKET_MMAP settings | |
101 | -------------------------------------------------------------------------------- | |
102 | ||
103 | ||
104 | Setting up PACKET_MMAP from user level code is done with a call like | |
105 | ||
106 | setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req)) | |
107 | ||
108 | The most significant argument in the previous call is the req parameter, | |
109 | which must have the following structure: | |
110 | ||
111 | struct tpacket_req | |
112 | { | |
113 | unsigned int tp_block_size; /* Minimal size of contiguous block */ | |
114 | unsigned int tp_block_nr; /* Number of blocks */ | |
115 | unsigned int tp_frame_size; /* Size of frame */ | |
116 | unsigned int tp_frame_nr; /* Total number of frames */ | |
117 | }; | |
118 | ||
119 | This structure is defined in /usr/include/linux/if_packet.h and establishes a | |
120 | circular buffer (ring) of unswappable memory mapped in the capture process. | |
121 | Being mapped in the capture process allows reading the captured frames and | |
122 | related meta-information like timestamps without requiring a system call. | |
123 | ||
124 | Captured frames are grouped in blocks. Each block is a physically contiguous | |
125 | region of memory and holds tp_block_size/tp_frame_size frames. The total number | |
126 | of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because | |
127 | ||
128 | frames_per_block = tp_block_size/tp_frame_size | |
129 | ||
130 | indeed, packet_set_ring checks that the following condition is true | |
131 | ||
132 | frames_per_block * tp_block_nr == tp_frame_nr | |
133 | ||
134 | ||
135 | Let's see an example, with the following values: | |
136 | ||
137 | tp_block_size= 4096 | |
138 | tp_frame_size= 2048 | |
139 | tp_block_nr = 4 | |
140 | tp_frame_nr = 8 | |
141 | ||
142 | we will get the following buffer structure: | |
143 | ||
144 | block #1 block #2 | |
145 | +---------+---------+ +---------+---------+ | |
146 | | frame 1 | frame 2 | | frame 3 | frame 4 | | |
147 | +---------+---------+ +---------+---------+ | |
148 | ||
149 | block #3 block #4 | |
150 | +---------+---------+ +---------+---------+ | |
151 | | frame 5 | frame 6 | | frame 7 | frame 8 | | |
152 | +---------+---------+ +---------+---------+ | |
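
Since tp_frame_nr is redundant, one way to avoid an inconsistent request is to derive it from the other three values. The following is a sketch under that idea; fill_ring_request is a hypothetical helper name, not part of any kernel or libpcap interface.

```c
#include <linux/if_packet.h>  /* struct tpacket_req */

/* Hypothetical helper: fill a ring request from the three independent
 * values. tp_frame_nr is derived so that
 *     frames_per_block * tp_block_nr == tp_frame_nr
 * holds by construction.
 * Returns 0 on success, -1 if not even one frame fits in a block. */
int fill_ring_request(struct tpacket_req *req, unsigned int block_size,
                      unsigned int block_nr, unsigned int frame_size)
{
    unsigned int frames_per_block;

    if (frame_size == 0 || frame_size > block_size)
        return -1;

    frames_per_block = block_size / frame_size;

    req->tp_block_size = block_size;
    req->tp_block_nr   = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr   = frames_per_block * block_nr;
    return 0;
}
```

With the values of the example above (4096, 4, 2048) the helper yields tp_frame_nr = 8.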
153 | ||
154 | A frame can be of any size, provided that it fits in a block. A block | |
155 | can only hold an integer number of frames, or in other words, a frame cannot | |
6c28f2c0 ML |
156 | span two blocks, so there are some details you have to take into |
157 | account when choosing the frame_size. See "Mapping and use of the circular | |
1da177e4 LT |
158 | buffer (ring)". |
159 | ||
160 | ||
161 | -------------------------------------------------------------------------------- | |
162 | + PACKET_MMAP setting constraints | |
163 | -------------------------------------------------------------------------------- | |
164 | ||
165 | In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch), | |
166 | the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or | |
167 | 16384 in a 64 bit architecture. For information on these kernel versions | |
168 | see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt | |
169 | ||
170 | Block size limit | |
171 | ------------------ | |
172 | ||
173 | As stated earlier, each block is a contiguous physical region of memory. These | |
174 | memory regions are allocated with calls to the __get_free_pages() function. As | |
175 | the name indicates, this function allocates pages of memory, and the second | |
176 | argument is "order", the power-of-two exponent of the number of pages, that is | |
177 | (for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes, | |
178 | order=2 ==> 16384 bytes, etc. The maximum size of a | |
179 | region allocated by __get_free_pages is determined by the MAX_ORDER macro. More | |
180 | precisely the limit can be calculated as: | |
181 | ||
182 | PAGE_SIZE << MAX_ORDER | |
183 | ||
184 | In an i386 architecture PAGE_SIZE is 4096 bytes | |
185 | In a 2.4/i386 kernel MAX_ORDER is 10 | |
186 | In a 2.6/i386 kernel MAX_ORDER is 11 | |
187 | ||
188 | So __get_free_pages can allocate as much as 4 MiB or 8 MiB in a 2.4 or 2.6 kernel | |
189 | respectively, with an i386 architecture. | |
190 | ||
191 | User space programs can include /usr/include/sys/user.h and | |
192 | /usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations. | |
193 | ||
194 | The pagesize can also be determined dynamically with the getpagesize (2) | |
195 | system call. | |
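
The block size limit above can be computed at run time along these lines. This is only a sketch: the function names are hypothetical, sysconf(_SC_PAGESIZE) is used here as a POSIX stand-in for getpagesize, and max_order has to be supplied by the caller because I am assuming it is not available to user space at run time.

```c
#include <unistd.h>  /* sysconf, _SC_PAGESIZE */

/* Largest block size for a given page size and MAX_ORDER, following
 * the formula PAGE_SIZE << MAX_ORDER given in the text. */
unsigned long max_block_size(unsigned long page_size, unsigned int max_order)
{
    return page_size << max_order;
}

/* Same computation using the page size of the running system.
 * Returns 0 if the page size cannot be determined. */
unsigned long max_block_size_here(unsigned int max_order)
{
    long page_size = sysconf(_SC_PAGESIZE);

    if (page_size <= 0)
        return 0;
    return max_block_size((unsigned long)page_size, max_order);
}
```

For example, max_block_size(4096, 10) gives 4194304 bytes (4 MiB) and max_block_size(4096, 11) gives 8388608 bytes (8 MiB), matching the i386 figures above.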
196 | ||
197 | ||
198 | Block number limit | |
199 | -------------------- | |
200 | ||
201 | To understand the constraints of PACKET_MMAP, we have to see the structure | |
202 | used to hold the pointers to each block. | |
203 | ||
204 | Currently, this structure is a vector called pg_vec, dynamically allocated | |
205 | with kmalloc; its size limits the number of blocks that can be allocated. | |
206 | ||
207 | +---+---+---+---+ | |
208 | | x | x | x | x | | |
209 | +---+---+---+---+ | |
210 | | | | | | |
211 | | | | v | |
212 | | | v block #4 | |
213 | | v block #3 | |
214 | v block #2 | |
215 | block #1 | |
216 | ||
217 | ||
2fe0ae78 ML |
218 | kmalloc allocates any number of bytes of physically contiguous memory from |
219 | a pool of pre-determined sizes. This pool of memory is maintained by the slab | |
c30fe7f7 UZ |
220 | allocator, which is ultimately responsible for doing the allocation and |
221 | hence imposes the maximum amount of memory that kmalloc can allocate. | |
1da177e4 LT |
222 | |
223 | In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The | |
224 | predetermined sizes that kmalloc uses can be checked in the "size-<bytes>" | |
225 | entries of /proc/slabinfo | |
226 | ||
227 | In a 32 bit architecture, pointers are 4 bytes long, so the total number of | |
228 | pointers to blocks is | |
229 | ||
230 | 131072/4 = 32768 blocks | |
231 | ||
232 | ||
233 | PACKET_MMAP buffer size calculator | |
234 | ------------------------------------ | |
235 | ||
236 | Definitions: | |
237 | ||
238 | <size-max> : is the maximum size allocable with kmalloc (see /proc/slabinfo) | |
239 | <pointer size>: depends on the architecture -- sizeof(void *) | |
240 | <page size> : depends on the architecture -- PAGE_SIZE or getpagesize (2) | |
241 | <max-order> : is the value defined with MAX_ORDER | |
242 | <frame size> : is an upper bound of the frame's capture size (more on this later) | |
243 | ||
244 | from these definitions we will derive | |
245 | ||
246 | <block number> = <size-max>/<pointer size> | |
247 | <block size> = <page size> << <max-order> | |
248 | ||
249 | so, the max buffer size is | |
250 | ||
251 | <block number> * <block size> | |
252 | ||
253 | and the number of frames is | |
254 | ||
255 | <block number> * <block size> / <frame size> | |
256 | ||
2e150f6e | 257 | Suppose the following parameters, which apply for 2.6 kernel and an |
1da177e4 LT |
258 | i386 architecture: |
259 | ||
260 | <size-max> = 131072 bytes | |
261 | <pointer size> = 4 bytes | |
262 | <page size> = 4096 bytes | |
263 | <max-order> = 11 | |
264 | ||
6c28f2c0 | 265 | and a value for <frame size> of 2048 bytes. These parameters will yield |
1da177e4 LT |
266 | |
267 | <block number> = 131072/4 = 32768 blocks | |
268 | <block size> = 4096 << 11 = 8 MiB. | |
269 | ||
270 | and hence the buffer will have a 262144 MiB size. So it can hold | |
271 | 262144 MiB / 2048 bytes = 134217728 frames | |
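
The calculator above is simple enough to write down as code. The following sketch reproduces it with 64-bit arithmetic, since the results overflow 32 bits; struct ring_limits and ring_limits_calc are hypothetical names chosen for this illustration.

```c
#include <stdint.h>

/* Results of the buffer size calculator (hypothetical type). */
struct ring_limits {
    uint64_t block_number;  /* <size-max> / <pointer size> */
    uint64_t block_size;    /* <page size> << <max-order>, in bytes */
    uint64_t buffer_size;   /* <block number> * <block size>, in bytes */
    uint64_t frame_number;  /* buffer size / <frame size> */
};

/* Apply the formulas from the definitions above. All fields are zero
 * if pointer_size or frame_size is zero (guard against division by 0). */
struct ring_limits ring_limits_calc(uint64_t size_max, uint64_t pointer_size,
                                    uint64_t page_size, unsigned int max_order,
                                    uint64_t frame_size)
{
    struct ring_limits l = { 0, 0, 0, 0 };

    if (pointer_size == 0 || frame_size == 0)
        return l;

    l.block_number = size_max / pointer_size;
    l.block_size   = page_size << max_order;
    l.buffer_size  = l.block_number * l.block_size;
    l.frame_number = l.buffer_size / frame_size;
    return l;
}
```

Calling ring_limits_calc(131072, 4, 4096, 11, 2048) gives 32768 blocks of 8388608 bytes each, a buffer of 274877906944 bytes (262144 MiB) and 134217728 frames, the same figures as in the example.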
272 | ||
273 | ||
274 | Actually, this buffer size is not possible with an i386 architecture. | |
275 | Remember that the memory is allocated in kernel space; on i386, | |
276 | the kernel's memory size is limited to 1 GiB. | |
277 | ||
278 | No memory allocation is freed until the socket is closed. The memory | |
279 | allocations are done with GFP_KERNEL priority; this basically means that | |
280 | the allocation can wait and swap out other processes' memory in order to allocate | |
992caacf | 281 | the necessary memory, so normally limits can be reached. |
1da177e4 LT |
282 | |
283 | Other constraints | |
284 | ------------------- | |
285 | ||
286 | If you check the source code you will see that what I draw here as a frame | |
5d3f083d | 287 | is not only the link level frame. At the beginning of each frame there is a |
1da177e4 LT |
288 | header called struct tpacket_hdr used in PACKET_MMAP to hold link level's frame |
289 | meta information like the timestamp. So what we draw here as a frame is really | |
290 | the following (from include/linux/if_packet.h): | |
291 | ||
292 | /* | |
293 | Frame structure: | |
294 | ||
295 | - Start. Frame must be aligned to TPACKET_ALIGNMENT=16 | |
296 | - struct tpacket_hdr | |
297 | - pad to TPACKET_ALIGNMENT=16 | |
298 | - struct sockaddr_ll | |
3f6dee9b | 299 | - Gap, chosen so that packet data (Start+tp_net) aligns to |
1da177e4 LT |
300 | TPACKET_ALIGNMENT=16 |
301 | - Start+tp_mac: [ Optional MAC header ] | |
302 | - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16. | |
303 | - Pad to align to TPACKET_ALIGNMENT=16 | |
304 | */ | |
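
To make the layout above concrete, here is a sketch of how the individual parts of a frame can be located from a pointer to its struct tpacket_hdr. The three accessor names are hypothetical; the offsets come from the tp_mac and tp_net fields and from the TPACKET_ALIGN macro in linux/if_packet.h.

```c
#include <linux/if_packet.h>  /* struct tpacket_hdr, sockaddr_ll, TPACKET_ALIGN */

/* The struct sockaddr_ll follows the header, padded to TPACKET_ALIGNMENT. */
struct sockaddr_ll *frame_sockaddr(struct tpacket_hdr *hdr)
{
    unsigned char *start = (unsigned char *)hdr;

    return (struct sockaddr_ll *)(void *)
           (start + TPACKET_ALIGN(sizeof(struct tpacket_hdr)));
}

/* Start + tp_mac: the (optional) MAC header. */
unsigned char *frame_mac(struct tpacket_hdr *hdr)
{
    return (unsigned char *)hdr + hdr->tp_mac;
}

/* Start + tp_net: the packet data. */
unsigned char *frame_net(struct tpacket_hdr *hdr)
{
    return (unsigned char *)hdr + hdr->tp_net;
}
```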
305 | ||
306 | ||
307 | The following are conditions that are checked in packet_set_ring | |
308 | ||
309 | tp_block_size must be a multiple of PAGE_SIZE (1) | |
310 | tp_frame_size must be greater than TPACKET_HDRLEN (obvious) | |
311 | tp_frame_size must be a multiple of TPACKET_ALIGNMENT | |
312 | tp_frame_nr must be exactly frames_per_block*tp_block_nr | |
313 | ||
6c28f2c0 | 314 | Note that tp_block_size should be chosen to be a power of two or there will |
1da177e4 LT |
315 | be a waste of memory. |
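
The conditions listed above can be mirrored in user space to reject a bad request before calling setsockopt. This is a sketch, not the kernel's code: ring_request_is_valid is a hypothetical name, and the extra guards against zero values are my own assumption to keep the arithmetic safe.

```c
#include <linux/if_packet.h>  /* tpacket_req, TPACKET_HDRLEN, TPACKET_ALIGNMENT */

/* Returns 1 if the request satisfies the four conditions listed in the
 * text, 0 otherwise. page_size is passed in by the caller. */
int ring_request_is_valid(const struct tpacket_req *req,
                          unsigned long page_size)
{
    unsigned int frames_per_block;

    /* Guard values that would make the checks below meaningless. */
    if (page_size == 0 || req->tp_block_size == 0)
        return 0;

    /* tp_block_size must be a multiple of PAGE_SIZE */
    if (req->tp_block_size % page_size != 0)
        return 0;

    /* tp_frame_size must be greater than TPACKET_HDRLEN */
    if (req->tp_frame_size <= TPACKET_HDRLEN)
        return 0;

    /* tp_frame_size must be a multiple of TPACKET_ALIGNMENT */
    if (req->tp_frame_size % TPACKET_ALIGNMENT != 0)
        return 0;

    /* tp_frame_nr must be exactly frames_per_block * tp_block_nr */
    frames_per_block = req->tp_block_size / req->tp_frame_size;
    if (frames_per_block == 0)
        return 0;
    return frames_per_block * req->tp_block_nr == req->tp_frame_nr;
}
```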
316 | ||
317 | -------------------------------------------------------------------------------- | |
6c28f2c0 | 318 | + Mapping and use of the circular buffer (ring) |
1da177e4 LT |
319 | -------------------------------------------------------------------------------- |
320 | ||
6c28f2c0 | 321 | The mapping of the buffer in the user process is done with the conventional |
1da177e4 LT |
322 | mmap function. Even though the circular buffer is composed of several physically |
323 | discontiguous blocks of memory, they are contiguous in user space, hence | |
324 | just one call to mmap is needed: | |
325 | ||
326 | mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); | |
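
Putting the setsockopt and mmap steps together could look like the sketch below. It assumes that the size passed to mmap is tp_block_size * tp_block_nr, which follows from the ring being made of tp_block_nr blocks; map_ring is a hypothetical helper name, and the SOL_PACKET fallback value is only there in case the C library headers do not define it.

```c
#include <linux/if_packet.h>  /* struct tpacket_req, PACKET_RX_RING */
#include <stddef.h>           /* size_t, NULL */
#include <sys/mman.h>         /* mmap, MAP_FAILED */
#include <sys/socket.h>       /* setsockopt */

#ifndef SOL_PACKET
#define SOL_PACKET 263        /* assumption: value used by Linux headers */
#endif

/* Hypothetical helper: allocate the ring on socket fd and map it.
 * Returns the mapped address and stores its length in *ring_len,
 * or returns MAP_FAILED (leaving *ring_len untouched) on error. */
void *map_ring(int fd, const struct tpacket_req *req, size_t *ring_len)
{
    size_t len = (size_t)req->tp_block_size * req->tp_block_nr;
    void *ring;

    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req)) < 0)
        return MAP_FAILED;

    ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring != MAP_FAILED)
        *ring_len = len;
    return ring;
}
```

The returned length is what a later munmap call would need; closing the socket releases the ring itself.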
327 | ||
328 | If tp_frame_size is a divisor of tp_block_size, frames will be | |
d9195881 | 329 | contiguously spaced by tp_frame_size bytes. If not, after every |
1da177e4 LT |
330 | tp_block_size/tp_frame_size frames there will be a gap between |
331 | the frames. This is because a frame cannot span two | |
332 | blocks. | |
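
The spacing rule above translates into a small address computation that works whether or not tp_frame_size divides tp_block_size. This is a sketch with hypothetical names (frame_offset, frame_at); it assumes the request was already accepted, so tp_frame_size is non-zero and not larger than tp_block_size.

```c
#include <linux/if_packet.h>  /* struct tpacket_req, struct tpacket_hdr */
#include <stddef.h>           /* size_t */

/* Byte offset of frame number index (counting from 0) inside the ring.
 * Frames never cross a block boundary, so the block is located first
 * and the frame is then placed inside that block. */
size_t frame_offset(const struct tpacket_req *req, unsigned int index)
{
    unsigned int frames_per_block = req->tp_block_size / req->tp_frame_size;
    unsigned int block = index / frames_per_block;
    unsigned int slot  = index % frames_per_block;

    return (size_t)block * req->tp_block_size +
           (size_t)slot * req->tp_frame_size;
}

/* Header of frame number index in a ring mapped at address ring. */
struct tpacket_hdr *frame_at(void *ring, const struct tpacket_req *req,
                             unsigned int index)
{
    return (struct tpacket_hdr *)(void *)
           ((unsigned char *)ring + frame_offset(req, index));
}
```

For instance, with tp_block_size = 4096 and tp_frame_size = 1536 only two frames fit in a block, so frame 2 starts at offset 4096 and the last 1024 bytes of each block are unused.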
333 | ||
334 | At the beginning of each frame there is a status field (see | |
335 | struct tpacket_hdr). If this field is 0, the frame is ready | |
336 | to be used by the kernel. If not, there is a frame the user can read | |
337 | and the following flags apply: | |
338 | ||
339 | from include/linux/if_packet.h | |
340 | ||
341 | #define TP_STATUS_COPY 2 | |
342 | #define TP_STATUS_LOSING 4 | |
343 | #define TP_STATUS_CSUMNOTREADY 8 | |
344 | ||
345 | ||
346 | TP_STATUS_COPY : This flag indicates that the frame (and associated | |
347 | meta information) has been truncated because it's | |
348 | larger than tp_frame_size. This packet can be | |
349 | read entirely with recvfrom(). | |
350 | ||
351 | In order to make this work it must be | |
352 | enabled previously with setsockopt() and | |
353 | the PACKET_COPY_THRESH option. | |
354 | ||
355 | The number of frames that can be buffered to | |
356 | be read with recvfrom is limited like a normal socket. | |
357 | See the SO_RCVBUF option in the socket (7) man page. | |
358 | ||
359 | TP_STATUS_LOSING : indicates there were packet drops since the last time | |
360 | statistics were checked with getsockopt() and | |
361 | the PACKET_STATISTICS option. | |
362 | ||
c30fe7f7 | 363 | TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets whose |
1da177e4 LT |
364 | checksum will be done in hardware. So while |
365 | reading the packet we should not try to check the | |
366 | checksum. | |
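
The two socket options mentioned in the flag descriptions above can be used roughly as follows. This is a sketch with hypothetical helper names; I am assuming that PACKET_COPY_THRESH takes an int and that a non-zero value enables the behaviour, since the exact meaning of the value is not described here.

```c
#include <linux/if_packet.h>  /* PACKET_STATISTICS, PACKET_COPY_THRESH,
                                 struct tpacket_stats */
#include <sys/socket.h>       /* getsockopt, setsockopt, socklen_t */

#ifndef SOL_PACKET
#define SOL_PACKET 263        /* assumption: value used by Linux headers */
#endif

/* Read the counters (tp_packets, tp_drops) for socket fd.
 * Returns 0 on success, -1 on error. */
int read_ring_stats(int fd, struct tpacket_stats *st)
{
    socklen_t len = sizeof(*st);

    return getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, st, &len);
}

/* Enable PACKET_COPY_THRESH so that truncated frames can still be read
 * in full with recvfrom(). Returns 0 on success, -1 on error. */
int set_copy_threshold(int fd, int thresh)
{
    return setsockopt(fd, SOL_PACKET, PACKET_COPY_THRESH,
                      &thresh, sizeof(thresh));
}
```

A natural place to call read_ring_stats is right after seeing TP_STATUS_LOSING on a frame.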
367 | ||
368 | For convenience there are also the following defines: | |
369 | ||
370 | #define TP_STATUS_KERNEL 0 | |
371 | #define TP_STATUS_USER 1 | |
372 | ||
373 | The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel | |
374 | receives a packet, it puts it in the buffer and updates the status with | |
375 | at least the TP_STATUS_USER flag. Then the user can read the packet. | |
376 | Once the packet is read, the user must zero the status field so the kernel | |
377 | can use that frame buffer again. | |
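
The hand-over protocol just described can be sketched for a single frame as below. The function name frame_try_consume is hypothetical, and the memory barrier uses the GCC/Clang builtin __sync_synchronize(), which is my assumption about the compiler in use rather than something this document prescribes.

```c
#include <linux/if_packet.h>  /* struct tpacket_hdr, TP_STATUS_* */

/* Hypothetical helper: if the frame belongs to the user, report its
 * captured length and status flags, then give it back to the kernel.
 * Returns 1 if a frame was consumed, 0 if the kernel still owns it. */
int frame_try_consume(volatile struct tpacket_hdr *hdr,
                      unsigned int *len_out, unsigned long *flags_out)
{
    unsigned long status = hdr->tp_status;

    if (status == TP_STATUS_KERNEL)
        return 0;                    /* nothing to read yet */

    *len_out   = hdr->tp_snaplen;    /* number of bytes captured */
    *flags_out = status;             /* e.g. TP_STATUS_LOSING may be set */

    /* ... the packet data would be processed here ... */

    __sync_synchronize();            /* finish reading before releasing */
    hdr->tp_status = TP_STATUS_KERNEL;
    return 1;
}
```

A capture loop would call this on the current frame and advance to the next frame of the ring whenever it returns 1.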
378 | ||
379 | The user can use poll (any other variant should apply too) to check if new | |
380 | packets are in the ring: | |
381 | ||
382 | struct pollfd pfd; | |
383 | ||
384 | pfd.fd = fd; | |
385 | pfd.revents = 0; | |
386 | pfd.events = POLLIN|POLLRDNORM|POLLERR; | |
387 | ||
388 | if (status == TP_STATUS_KERNEL) | |
389 | retval = poll(&pfd, 1, timeout); | |
390 | ||
391 | Checking the status value first and then polling for frames does not | |
392 | incur a race condition. | |
393 | ||
394 | -------------------------------------------------------------------------------- | |
395 | + THANKS | |
396 | -------------------------------------------------------------------------------- | |
397 | ||
398 | Jesse Brandeburg, for fixing my grammatical/spelling errors | |
399 |