Commit | Line | Data |
---|---|---|
8b2c70d1 SR |
1 | Lockless Ring Buffer Design |
2 | =========================== | |
3 | ||
4 | Copyright 2009 Red Hat Inc. | |
5 | Author: Steven Rostedt <srostedt@redhat.com> | |
6 | License: The GNU Free Documentation License, Version 1.2 | |
7 | (dual licensed under the GPL v2) | |
8 | Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto, | |
9 | and Frederic Weisbecker. | |
10 | ||
11 | ||
12 | Written for: 2.6.31 | |
13 | ||
14 | Terminology used in this Document | |
15 | --------------------------------- | |
16 | ||
17 | tail - where new writes happen in the ring buffer. | |
18 | ||
19 | head - where new reads happen in the ring buffer. | |
20 | ||
21 | producer - the task that writes into the ring buffer (same as writer) | |
22 | ||
23 | writer - same as producer | |
24 | ||
25 | consumer - the task that reads from the buffer (same as reader) | |
26 | ||
27 | reader - same as consumer. | |
28 | ||
29 | reader_page - A page outside the ring buffer used solely (for the most part) | |
30 | by the reader. | |
31 | ||
32 | head_page - a pointer to the page that the reader will use next | |
33 | ||
34 | tail_page - a pointer to the page that will be written to next | |
35 | ||
006b4298 | 36 | commit_page - a pointer to the page with the last finished non-nested write. |
8b2c70d1 | 37 | |
006b4298 | 38 | cmpxchg - hardware-assisted atomic transaction that performs the following: |
8b2c70d1 SR |
39 | |
40 | A = B iff previous A == C | |
41 | ||
42 | R = cmpxchg(A, C, B) is saying that we replace A with B if and only if | |
43 | current A is equal to C, and we put the old (current) A into R | |
44 | ||
45 | R gets the previous A regardless if A is updated with B or not. | |
46 | ||
47 | To see if the update was successful a compare of R == C may be used. | |
48 | ||
49 | The Generic Ring Buffer | |
50 | ----------------------- | |
51 | ||
52 | The ring buffer can be used in either an overwrite mode or in | |
53 | producer/consumer mode. | |
54 | ||
006b4298 | 55 | Producer/consumer mode is where if the producer were to fill up the |
8b2c70d1 SR |
56 | buffer before the consumer could free up anything, the producer |
57 | will stop writing to the buffer. This will lose most recent events. | |
58 | ||
006b4298 | 59 | Overwrite mode is where if the producer were to fill up the buffer |
8b2c70d1 SR |
60 | before the consumer could free up anything, the producer will |
61 | overwrite the older data. This will lose the oldest events. | |
62 | ||
006b4298 | 63 | No two writers can write at the same time (on the same per-cpu buffer), |
8b2c70d1 SR |
64 | but a writer may interrupt another writer, but it must finish writing |
65 | before the previous writer may continue. This is very important to the | |
66 | algorithm. The writers act like a "stack". The way interrupts works | |
67 | enforces this behavior. | |
68 | ||
69 | ||
70 | writer1 start | |
71 | <preempted> writer2 start | |
72 | <preempted> writer3 start | |
73 | writer3 finishes | |
74 | writer2 finishes | |
75 | writer1 finishes | |
76 | ||
77 | This is very much like a writer being preempted by an interrupt and | |
78 | the interrupt doing a write as well. | |
79 | ||
80 | Readers can happen at any time. But no two readers may run at the | |
81 | same time, nor can a reader preempt/interrupt another reader. A reader | |
006b4298 | 82 | cannot preempt/interrupt a writer, but it may read/consume from the |
8b2c70d1 SR |
83 | buffer at the same time as a writer is writing, but the reader must be |
84 | on another processor to do so. A reader may read on its own processor | |
85 | and can be preempted by a writer. | |
86 | ||
006b4298 | 87 | A writer can preempt a reader, but a reader cannot preempt a writer. |
8b2c70d1 SR |
88 | But a reader can read the buffer at the same time (on another processor) |
89 | as a writer. | |
90 | ||
006b4298 | 91 | The ring buffer is made up of a list of pages held together by a linked list. |
8b2c70d1 SR |
92 | |
93 | At initialization a reader page is allocated for the reader that is not | |
94 | part of the ring buffer. | |
95 | ||
96 | The head_page, tail_page and commit_page are all initialized to point | |
97 | to the same page. | |
98 | ||
99 | The reader page is initialized to have its next pointer pointing to | |
100 | the head page, and its previous pointer pointing to a page before | |
101 | the head page. | |
102 | ||
103 | The reader has its own page to use. At start up time, this page is | |
104 | allocated but is not attached to the list. When the reader wants | |
006b4298 | 105 | to read from the buffer, if its page is empty (like it is on start-up), |
8b2c70d1 SR |
106 | it will swap its page with the head_page. The old reader page will |
107 | become part of the ring buffer and the head_page will be removed. | |
108 | The page after the inserted page (old reader_page) will become the | |
109 | new head page. | |
110 | ||
111 | Once the new page is given to the reader, the reader could do what | |
112 | it wants with it, as long as a writer has left that page. | |
113 | ||
114 | A sample of how the reader page is swapped: Note this does not | |
115 | show the head page in the buffer, it is for demonstrating a swap | |
116 | only. | |
117 | ||
118 | +------+ | |
119 | |reader| RING BUFFER | |
120 | |page | | |
121 | +------+ | |
122 | +---+ +---+ +---+ | |
123 | | |-->| |-->| | | |
124 | | |<--| |<--| | | |
125 | +---+ +---+ +---+ | |
126 | ^ | ^ | | |
127 | | +-------------+ | | |
128 | +-----------------+ | |
129 | ||
130 | ||
131 | +------+ | |
132 | |reader| RING BUFFER | |
133 | |page |-------------------+ | |
134 | +------+ v | |
135 | | +---+ +---+ +---+ | |
136 | | | |-->| |-->| | | |
137 | | | |<--| |<--| |<-+ | |
138 | | +---+ +---+ +---+ | | |
139 | | ^ | ^ | | | |
140 | | | +-------------+ | | | |
141 | | +-----------------+ | | |
142 | +------------------------------------+ | |
143 | ||
144 | +------+ | |
145 | |reader| RING BUFFER | |
146 | |page |-------------------+ | |
147 | +------+ <---------------+ v | |
148 | | ^ +---+ +---+ +---+ | |
149 | | | | |-->| |-->| | | |
150 | | | | | | |<--| |<-+ | |
151 | | | +---+ +---+ +---+ | | |
152 | | | | ^ | | | |
153 | | | +-------------+ | | | |
154 | | +-----------------------------+ | | |
155 | +------------------------------------+ | |
156 | ||
157 | +------+ | |
158 | |buffer| RING BUFFER | |
159 | |page |-------------------+ | |
160 | +------+ <---------------+ v | |
161 | | ^ +---+ +---+ +---+ | |
162 | | | | | | |-->| | | |
163 | | | New | | | |<--| |<-+ | |
164 | | | Reader +---+ +---+ +---+ | | |
165 | | | page ----^ | | | |
166 | | | | | | |
167 | | +-----------------------------+ | | |
168 | +------------------------------------+ | |
169 | ||
170 | ||
171 | ||
172 | It is possible that the page swapped is the commit page and the tail page, | |
173 | if what is in the ring buffer is less than what is held in a buffer page. | |
174 | ||
175 | ||
176 | reader page commit page tail page | |
177 | | | | | |
178 | v | | | |
179 | +---+ | | | |
180 | | |<----------+ | | |
181 | | |<------------------------+ | |
182 | | |------+ | |
183 | +---+ | | |
184 | | | |
185 | v | |
186 | +---+ +---+ +---+ +---+ | |
187 | <---| |--->| |--->| |--->| |---> | |
188 | --->| |<---| |<---| |<---| |<--- | |
189 | +---+ +---+ +---+ +---+ | |
190 | ||
191 | This case is still valid for this algorithm. | |
192 | When the writer leaves the page, it simply goes into the ring buffer | |
193 | since the reader page still points to the next location in the ring | |
194 | buffer. | |
195 | ||
196 | ||
197 | The main pointers: | |
198 | ||
199 | reader page - The page used solely by the reader and is not part | |
200 | of the ring buffer (may be swapped in) | |
201 | ||
202 | head page - the next page in the ring buffer that will be swapped | |
203 | with the reader page. | |
204 | ||
205 | tail page - the page where the next write will take place. | |
206 | ||
207 | commit page - the page that last finished a write. | |
208 | ||
006b4298 | 209 | The commit page only is updated by the outermost writer in the |
8b2c70d1 SR |
210 | writer stack. A writer that preempts another writer will not move the |
211 | commit page. | |
212 | ||
213 | When data is written into the ring buffer, a position is reserved | |
214 | in the ring buffer and passed back to the writer. When the writer | |
215 | is finished writing data into that position, it commits the write. | |
216 | ||
217 | Another write (or a read) may take place at anytime during this | |
218 | transaction. If another write happens it must finish before continuing | |
219 | with the previous write. | |
220 | ||
221 | ||
222 | Write reserve: | |
223 | ||
224 | Buffer page | |
225 | +---------+ | |
226 | |written | | |
227 | +---------+ <--- given back to writer (current commit) | |
228 | |reserved | | |
229 | +---------+ <--- tail pointer | |
230 | | empty | | |
231 | +---------+ | |
232 | ||
233 | Write commit: | |
234 | ||
235 | Buffer page | |
236 | +---------+ | |
237 | |written | | |
238 | +---------+ | |
239 | |written | | |
25985edc | 240 | +---------+ <--- next position for write (current commit) |
8b2c70d1 SR |
241 | | empty | |
242 | +---------+ | |
243 | ||
244 | ||
245 | If a write happens after the first reserve: | |
246 | ||
247 | Buffer page | |
248 | +---------+ | |
249 | |written | | |
250 | +---------+ <-- current commit | |
251 | |reserved | | |
252 | +---------+ <--- given back to second writer | |
253 | |reserved | | |
254 | +---------+ <--- tail pointer | |
255 | ||
256 | After second writer commits: | |
257 | ||
258 | ||
259 | Buffer page | |
260 | +---------+ | |
261 | |written | | |
262 | +---------+ <--(last full commit) | |
263 | |reserved | | |
264 | +---------+ | |
265 | |pending | | |
266 | |commit | | |
267 | +---------+ <--- tail pointer | |
268 | ||
269 | When the first writer commits: | |
270 | ||
271 | Buffer page | |
272 | +---------+ | |
273 | |written | | |
274 | +---------+ | |
275 | |written | | |
276 | +---------+ | |
277 | |written | | |
278 | +---------+ <--(last full commit and tail pointer) | |
279 | ||
280 | ||
281 | The commit pointer points to the last write location that was | |
282 | committed without preempting another write. When a write that | |
283 | preempted another write is committed, it only becomes a pending commit | |
006b4298 | 284 | and will not be a full commit until all writes have been committed. |
8b2c70d1 SR |
285 | |
286 | The commit page points to the page that has the last full commit. | |
287 | The tail page points to the page with the last write (before | |
288 | committing). | |
289 | ||
290 | The tail page is always equal to or after the commit page. It may | |
291 | be several pages ahead. If the tail page catches up to the commit | |
292 | page then no more writes may take place (regardless of the mode | |
293 | of the ring buffer: overwrite and produce/consumer). | |
294 | ||
006b4298 | 295 | The order of pages is: |
8b2c70d1 SR |
296 | |
297 | head page | |
298 | commit page | |
299 | tail page | |
300 | ||
301 | Possible scenario: | |
302 | tail page | |
303 | head page commit page | | |
304 | | | | | |
305 | v v v | |
306 | +---+ +---+ +---+ +---+ | |
307 | <---| |--->| |--->| |--->| |---> | |
308 | --->| |<---| |<---| |<---| |<--- | |
309 | +---+ +---+ +---+ +---+ | |
310 | ||
311 | There is a special case that the head page is after either the commit page | |
312 | and possibly the tail page. That is when the commit (and tail) page has been | |
313 | swapped with the reader page. This is because the head page is always | |
006b4298 | 314 | part of the ring buffer, but the reader page is not. Whenever there |
8b2c70d1 SR |
315 | has been less than a full page that has been committed inside the ring buffer, |
316 | and a reader swaps out a page, it will be swapping out the commit page. | |
317 | ||
318 | ||
319 | reader page commit page tail page | |
320 | | | | | |
321 | v | | | |
322 | +---+ | | | |
323 | | |<----------+ | | |
324 | | |<------------------------+ | |
325 | | |------+ | |
326 | +---+ | | |
327 | | | |
328 | v | |
329 | +---+ +---+ +---+ +---+ | |
330 | <---| |--->| |--->| |--->| |---> | |
331 | --->| |<---| |<---| |<---| |<--- | |
332 | +---+ +---+ +---+ +---+ | |
333 | ^ | |
334 | | | |
335 | head page | |
336 | ||
337 | ||
338 | In this case, the head page will not move when the tail and commit | |
339 | move back into the ring buffer. | |
340 | ||
006b4298 | 341 | The reader cannot swap a page into the ring buffer if the commit page |
8b2c70d1 SR |
342 | is still on that page. If the read meets the last commit (real commit |
343 | not pending or reserved), then there is nothing more to read. | |
344 | The buffer is considered empty until another full commit finishes. | |
345 | ||
346 | When the tail meets the head page, if the buffer is in overwrite mode, | |
347 | the head page will be pushed ahead one. If the buffer is in producer/consumer | |
348 | mode, the write will fail. | |
349 | ||
350 | Overwrite mode: | |
351 | ||
352 | tail page | |
353 | | | |
354 | v | |
355 | +---+ +---+ +---+ +---+ | |
356 | <---| |--->| |--->| |--->| |---> | |
357 | --->| |<---| |<---| |<---| |<--- | |
358 | +---+ +---+ +---+ +---+ | |
359 | ^ | |
360 | | | |
361 | head page | |
362 | ||
363 | ||
364 | tail page | |
365 | | | |
366 | v | |
367 | +---+ +---+ +---+ +---+ | |
368 | <---| |--->| |--->| |--->| |---> | |
369 | --->| |<---| |<---| |<---| |<--- | |
370 | +---+ +---+ +---+ +---+ | |
371 | ^ | |
372 | | | |
373 | head page | |
374 | ||
375 | ||
376 | tail page | |
377 | | | |
378 | v | |
379 | +---+ +---+ +---+ +---+ | |
380 | <---| |--->| |--->| |--->| |---> | |
381 | --->| |<---| |<---| |<---| |<--- | |
382 | +---+ +---+ +---+ +---+ | |
383 | ^ | |
384 | | | |
385 | head page | |
386 | ||
387 | Note, the reader page will still point to the previous head page. | |
388 | But when a swap takes place, it will use the most recent head page. | |
389 | ||
390 | ||
391 | Making the Ring Buffer Lockless: | |
392 | -------------------------------- | |
393 | ||
394 | The main idea behind the lockless algorithm is to combine the moving | |
395 | of the head_page pointer with the swapping of pages with the reader. | |
396 | State flags are placed inside the pointer to the page. To do this, | |
397 | each page must be aligned in memory by 4 bytes. This will allow the 2 | |
006b4298 | 398 | least significant bits of the address to be used as flags, since |
8b2c70d1 SR |
399 | they will always be zero for the address. To get the address, |
400 | simply mask out the flags. | |
401 | ||
402 | MASK = ~3 | |
403 | ||
404 | address & MASK | |
405 | ||
406 | Two flags will be kept by these two bits: | |
407 | ||
408 | HEADER - the page being pointed to is a head page | |
409 | ||
410 | UPDATE - the page being pointed to is being updated by a writer | |
411 | and was or is about to be a head page. | |
412 | ||
413 | ||
414 | reader page | |
415 | | | |
416 | v | |
417 | +---+ | |
418 | | |------+ | |
419 | +---+ | | |
420 | | | |
421 | v | |
422 | +---+ +---+ +---+ +---+ | |
423 | <---| |--->| |-H->| |--->| |---> | |
424 | --->| |<---| |<---| |<---| |<--- | |
425 | +---+ +---+ +---+ +---+ | |
426 | ||
427 | ||
428 | The above pointer "-H->" would have the HEADER flag set. That is | |
429 | the next page is the next page to be swapped out by the reader. | |
430 | This pointer means the next page is the head page. | |
431 | ||
432 | When the tail page meets the head pointer, it will use cmpxchg to | |
433 | change the pointer to the UPDATE state: | |
434 | ||
435 | ||
436 | tail page | |
437 | | | |
438 | v | |
439 | +---+ +---+ +---+ +---+ | |
440 | <---| |--->| |-H->| |--->| |---> | |
441 | --->| |<---| |<---| |<---| |<--- | |
442 | +---+ +---+ +---+ +---+ | |
443 | ||
444 | tail page | |
445 | | | |
446 | v | |
447 | +---+ +---+ +---+ +---+ | |
448 | <---| |--->| |-U->| |--->| |---> | |
449 | --->| |<---| |<---| |<---| |<--- | |
450 | +---+ +---+ +---+ +---+ | |
451 | ||
452 | "-U->" represents a pointer in the UPDATE state. | |
453 | ||
454 | Any access to the reader will need to take some sort of lock to serialize | |
455 | the readers. But the writers will never take a lock to write to the | |
456 | ring buffer. This means we only need to worry about a single reader, | |
457 | and writes only preempt in "stack" formation. | |
458 | ||
459 | When the reader tries to swap the page with the ring buffer, it | |
460 | will also use cmpxchg. If the flag bit in the pointer to the | |
461 | head page does not have the HEADER flag set, the compare will fail | |
462 | and the reader will need to look for the new head page and try again. | |
006b4298 | 463 | Note, the flags UPDATE and HEADER are never set at the same time. |
8b2c70d1 SR |
464 | |
465 | The reader swaps the reader page as follows: | |
466 | ||
467 | +------+ | |
468 | |reader| RING BUFFER | |
469 | |page | | |
470 | +------+ | |
471 | +---+ +---+ +---+ | |
472 | | |--->| |--->| | | |
473 | | |<---| |<---| | | |
474 | +---+ +---+ +---+ | |
475 | ^ | ^ | | |
476 | | +---------------+ | | |
477 | +-----H-------------+ | |
478 | ||
479 | The reader sets the reader page next pointer as HEADER to the page after | |
480 | the head page. | |
481 | ||
482 | ||
483 | +------+ | |
484 | |reader| RING BUFFER | |
485 | |page |-------H-----------+ | |
486 | +------+ v | |
487 | | +---+ +---+ +---+ | |
488 | | | |--->| |--->| | | |
489 | | | |<---| |<---| |<-+ | |
490 | | +---+ +---+ +---+ | | |
491 | | ^ | ^ | | | |
492 | | | +---------------+ | | | |
493 | | +-----H-------------+ | | |
494 | +--------------------------------------+ | |
495 | ||
496 | It does a cmpxchg with the pointer to the previous head page to make it | |
497 | point to the reader page. Note that the new pointer does not have the HEADER | |
498 | flag set. This action atomically moves the head page forward. | |
499 | ||
500 | +------+ | |
501 | |reader| RING BUFFER | |
502 | |page |-------H-----------+ | |
503 | +------+ v | |
504 | | ^ +---+ +---+ +---+ | |
505 | | | | |-->| |-->| | | |
506 | | | | |<--| |<--| |<-+ | |
507 | | | +---+ +---+ +---+ | | |
508 | | | | ^ | | | |
509 | | | +-------------+ | | | |
510 | | +-----------------------------+ | | |
511 | +------------------------------------+ | |
512 | ||
513 | After the new head page is set, the previous pointer of the head page is | |
514 | updated to the reader page. | |
515 | ||
516 | +------+ | |
517 | |reader| RING BUFFER | |
518 | |page |-------H-----------+ | |
519 | +------+ <---------------+ v | |
520 | | ^ +---+ +---+ +---+ | |
521 | | | | |-->| |-->| | | |
522 | | | | | | |<--| |<-+ | |
523 | | | +---+ +---+ +---+ | | |
524 | | | | ^ | | | |
525 | | | +-------------+ | | | |
526 | | +-----------------------------+ | | |
527 | +------------------------------------+ | |
528 | ||
529 | +------+ | |
530 | |buffer| RING BUFFER | |
531 | |page |-------H-----------+ <--- New head page | |
532 | +------+ <---------------+ v | |
533 | | ^ +---+ +---+ +---+ | |
534 | | | | | | |-->| | | |
535 | | | New | | | |<--| |<-+ | |
536 | | | Reader +---+ +---+ +---+ | | |
537 | | | page ----^ | | | |
538 | | | | | | |
539 | | +-----------------------------+ | | |
540 | +------------------------------------+ | |
541 | ||
006b4298 | 542 | Another important point: The page that the reader page points back to |
8b2c70d1 SR |
543 | by its previous pointer (the one that now points to the new head page) |
544 | never points back to the reader page. That is because the reader page is | |
545 | not part of the ring buffer. Traversing the ring buffer via the next pointers | |
546 | will always stay in the ring buffer. Traversing the ring buffer via the | |
547 | prev pointers may not. | |
548 | ||
549 | Note, the way to determine a reader page is simply by examining the previous | |
550 | pointer of the page. If the next pointer of the previous page does not | |
551 | point back to the original page, then the original page is a reader page: | |
552 | ||
553 | ||
554 | +--------+ | |
555 | | reader | next +----+ | |
556 | | page |-------->| |<====== (buffer page) | |
557 | +--------+ +----+ | |
558 | | | ^ | |
559 | | v | next | |
560 | prev | +----+ | |
561 | +------------->| | | |
562 | +----+ | |
563 | ||
564 | The way the head page moves forward: | |
565 | ||
566 | When the tail page meets the head page and the buffer is in overwrite mode | |
567 | and more writes take place, the head page must be moved forward before the | |
568 | writer may move the tail page. The way this is done is that the writer | |
569 | performs a cmpxchg to convert the pointer to the head page from the HEADER | |
570 | flag to have the UPDATE flag set. Once this is done, the reader will | |
571 | not be able to swap the head page from the buffer, nor will it be able to | |
572 | move the head page, until the writer is finished with the move. | |
573 | ||
574 | This eliminates any races that the reader can have on the writer. The reader | |
006b4298 | 575 | must spin, and this is why the reader cannot preempt the writer. |
8b2c70d1 SR |
576 | |
577 | tail page | |
578 | | | |
579 | v | |
580 | +---+ +---+ +---+ +---+ | |
581 | <---| |--->| |-H->| |--->| |---> | |
582 | --->| |<---| |<---| |<---| |<--- | |
583 | +---+ +---+ +---+ +---+ | |
584 | ||
585 | tail page | |
586 | | | |
587 | v | |
588 | +---+ +---+ +---+ +---+ | |
589 | <---| |--->| |-U->| |--->| |---> | |
590 | --->| |<---| |<---| |<---| |<--- | |
591 | +---+ +---+ +---+ +---+ | |
592 | ||
593 | The following page will be made into the new head page. | |
594 | ||
595 | tail page | |
596 | | | |
597 | v | |
598 | +---+ +---+ +---+ +---+ | |
599 | <---| |--->| |-U->| |-H->| |---> | |
600 | --->| |<---| |<---| |<---| |<--- | |
601 | +---+ +---+ +---+ +---+ | |
602 | ||
603 | After the new head page has been set, we can set the old head page | |
604 | pointer back to NORMAL. | |
605 | ||
606 | tail page | |
607 | | | |
608 | v | |
609 | +---+ +---+ +---+ +---+ | |
610 | <---| |--->| |--->| |-H->| |---> | |
611 | --->| |<---| |<---| |<---| |<--- | |
612 | +---+ +---+ +---+ +---+ | |
613 | ||
614 | After the head page has been moved, the tail page may now move forward. | |
615 | ||
616 | tail page | |
617 | | | |
618 | v | |
619 | +---+ +---+ +---+ +---+ | |
620 | <---| |--->| |--->| |-H->| |---> | |
621 | --->| |<---| |<---| |<---| |<--- | |
622 | +---+ +---+ +---+ +---+ | |
623 | ||
624 | ||
625 | The above are the trivial updates. Now for the more complex scenarios. | |
626 | ||
627 | ||
628 | As stated before, if enough writes preempt the first write, the | |
629 | tail page may make it all the way around the buffer and meet the commit | |
630 | page. At this time, we must start dropping writes (usually with some kind | |
631 | of warning to the user). But what happens if the commit was still on the | |
632 | reader page? The commit page is not part of the ring buffer. The tail page | |
633 | must account for this. | |
634 | ||
635 | ||
636 | reader page commit page | |
637 | | | | |
638 | v | | |
639 | +---+ | | |
640 | | |<----------+ | |
641 | | | | |
642 | | |------+ | |
643 | +---+ | | |
644 | | | |
645 | v | |
646 | +---+ +---+ +---+ +---+ | |
647 | <---| |--->| |-H->| |--->| |---> | |
648 | --->| |<---| |<---| |<---| |<--- | |
649 | +---+ +---+ +---+ +---+ | |
650 | ^ | |
651 | | | |
652 | tail page | |
653 | ||
654 | If the tail page were to simply push the head page forward, the commit when | |
655 | leaving the reader page would not be pointing to the correct page. | |
656 | ||
657 | The solution to this is to test if the commit page is on the reader page | |
658 | before pushing the head page. If it is, then it can be assumed that the | |
659 | tail page wrapped the buffer, and we must drop new writes. | |
660 | ||
661 | This is not a race condition, because the commit page can only be moved | |
006b4298 | 662 | by the outermost writer (the writer that was preempted). |
8b2c70d1 | 663 | This means that the commit will not move while a writer is moving the |
006b4298 | 664 | tail page. The reader cannot swap the reader page if it is also being |
8b2c70d1 SR |
665 | used as the commit page. The reader can simply check that the commit |
666 | is off the reader page. Once the commit page leaves the reader page | |
667 | it will never go back on it unless a reader does another swap with the | |
668 | buffer page that is also the commit page. | |
669 | ||
670 | ||
671 | Nested writes | |
672 | ------------- | |
673 | ||
674 | In the pushing forward of the tail page we must first push forward | |
675 | the head page if the head page is the next page. If the head page | |
676 | is not the next page, the tail page is simply updated with a cmpxchg. | |
677 | ||
678 | Only writers move the tail page. This must be done atomically to protect | |
679 | against nested writers. | |
680 | ||
681 | temp_page = tail_page | |
682 | next_page = temp_page->next | |
683 | cmpxchg(tail_page, temp_page, next_page) | |
684 | ||
685 | The above will update the tail page if it is still pointing to the expected | |
686 | page. If this fails, a nested write pushed it forward, the the current write | |
687 | does not need to push it. | |
688 | ||
689 | ||
690 | temp page | |
691 | | | |
692 | v | |
693 | tail page | |
694 | | | |
695 | v | |
696 | +---+ +---+ +---+ +---+ | |
697 | <---| |--->| |--->| |--->| |---> | |
698 | --->| |<---| |<---| |<---| |<--- | |
699 | +---+ +---+ +---+ +---+ | |
700 | ||
701 | Nested write comes in and moves the tail page forward: | |
702 | ||
703 | tail page (moved by nested writer) | |
704 | temp page | | |
705 | | | | |
706 | v v | |
707 | +---+ +---+ +---+ +---+ | |
708 | <---| |--->| |--->| |--->| |---> | |
709 | --->| |<---| |<---| |<---| |<--- | |
710 | +---+ +---+ +---+ +---+ | |
711 | ||
712 | The above would fail the cmpxchg, but since the tail page has already | |
713 | been moved forward, the writer will just try again to reserve storage | |
714 | on the new tail page. | |
715 | ||
716 | But the moving of the head page is a bit more complex. | |
717 | ||
718 | tail page | |
719 | | | |
720 | v | |
721 | +---+ +---+ +---+ +---+ | |
722 | <---| |--->| |-H->| |--->| |---> | |
723 | --->| |<---| |<---| |<---| |<--- | |
724 | +---+ +---+ +---+ +---+ | |
725 | ||
726 | The write converts the head page pointer to UPDATE. | |
727 | ||
728 | tail page | |
729 | | | |
730 | v | |
731 | +---+ +---+ +---+ +---+ | |
732 | <---| |--->| |-U->| |--->| |---> | |
733 | --->| |<---| |<---| |<---| |<--- | |
734 | +---+ +---+ +---+ +---+ | |
735 | ||
006b4298 | 736 | But if a nested writer preempts here, it will see that the next |
8b2c70d1 SR |
737 | page is a head page, but it is also nested. It will detect that |
738 | it is nested and will save that information. The detection is the | |
739 | fact that it sees the UPDATE flag instead of a HEADER or NORMAL | |
740 | pointer. | |
741 | ||
742 | The nested writer will set the new head page pointer. | |
743 | ||
744 | tail page | |
745 | | | |
746 | v | |
747 | +---+ +---+ +---+ +---+ | |
748 | <---| |--->| |-U->| |-H->| |---> | |
749 | --->| |<---| |<---| |<---| |<--- | |
750 | +---+ +---+ +---+ +---+ | |
751 | ||
752 | But it will not reset the update back to normal. Only the writer | |
753 | that converted a pointer from HEAD to UPDATE will convert it back | |
754 | to NORMAL. | |
755 | ||
756 | tail page | |
757 | | | |
758 | v | |
759 | +---+ +---+ +---+ +---+ | |
760 | <---| |--->| |-U->| |-H->| |---> | |
761 | --->| |<---| |<---| |<---| |<--- | |
762 | +---+ +---+ +---+ +---+ | |
763 | ||
006b4298 | 764 | After the nested writer finishes, the outermost writer will convert |
8b2c70d1 SR |
765 | the UPDATE pointer to NORMAL. |
766 | ||
767 | ||
768 | tail page | |
769 | | | |
770 | v | |
771 | +---+ +---+ +---+ +---+ | |
772 | <---| |--->| |--->| |-H->| |---> | |
773 | --->| |<---| |<---| |<---| |<--- | |
774 | +---+ +---+ +---+ +---+ | |
775 | ||
776 | ||
777 | It can be even more complex if several nested writes came in and moved | |
778 | the tail page ahead several pages: | |
779 | ||
780 | ||
781 | (first writer) | |
782 | ||
783 | tail page | |
784 | | | |
785 | v | |
786 | +---+ +---+ +---+ +---+ | |
787 | <---| |--->| |-H->| |--->| |---> | |
788 | --->| |<---| |<---| |<---| |<--- | |
789 | +---+ +---+ +---+ +---+ | |
790 | ||
791 | The write converts the head page pointer to UPDATE. | |
792 | ||
793 | tail page | |
794 | | | |
795 | v | |
796 | +---+ +---+ +---+ +---+ | |
797 | <---| |--->| |-U->| |--->| |---> | |
798 | --->| |<---| |<---| |<---| |<--- | |
799 | +---+ +---+ +---+ +---+ | |
800 | ||
801 | Next writer comes in, and sees the update and sets up the new | |
802 | head page. | |
803 | ||
804 | (second writer) | |
805 | ||
806 | tail page | |
807 | | | |
808 | v | |
809 | +---+ +---+ +---+ +---+ | |
810 | <---| |--->| |-U->| |-H->| |---> | |
811 | --->| |<---| |<---| |<---| |<--- | |
812 | +---+ +---+ +---+ +---+ | |
813 | ||
814 | The nested writer moves the tail page forward. But does not set the old | |
006b4298 | 815 | update page to NORMAL because it is not the outermost writer. |
8b2c70d1 SR |
816 | |
817 | tail page | |
818 | | | |
819 | v | |
820 | +---+ +---+ +---+ +---+ | |
821 | <---| |--->| |-U->| |-H->| |---> | |
822 | --->| |<---| |<---| |<---| |<--- | |
823 | +---+ +---+ +---+ +---+ | |
824 | ||
825 | Another writer preempts and sees the page after the tail page is a head page. | |
826 | It changes it from HEAD to UPDATE. | |
827 | ||
828 | (third writer) | |
829 | ||
830 | tail page | |
831 | | | |
832 | v | |
833 | +---+ +---+ +---+ +---+ | |
834 | <---| |--->| |-U->| |-U->| |---> | |
835 | --->| |<---| |<---| |<---| |<--- | |
836 | +---+ +---+ +---+ +---+ | |
837 | ||
838 | The writer will move the head page forward: | |
839 | ||
840 | ||
841 | (third writer) | |
842 | ||
843 | tail page | |
844 | | | |
845 | v | |
846 | +---+ +---+ +---+ +---+ | |
847 | <---| |--->| |-U->| |-U->| |-H-> | |
848 | --->| |<---| |<---| |<---| |<--- | |
849 | +---+ +---+ +---+ +---+ | |
850 | ||
851 | But now that the third writer did change the HEAD flag to UPDATE it | |
852 | will convert it to normal: | |
853 | ||
854 | ||
855 | (third writer) | |
856 | ||
857 | tail page | |
858 | | | |
859 | v | |
860 | +---+ +---+ +---+ +---+ | |
861 | <---| |--->| |-U->| |--->| |-H-> | |
862 | --->| |<---| |<---| |<---| |<--- | |
863 | +---+ +---+ +---+ +---+ | |
864 | ||
865 | ||
866 | Then it will move the tail page, and return back to the second writer. | |
867 | ||
868 | ||
869 | (second writer) | |
870 | ||
871 | tail page | |
872 | | | |
873 | v | |
874 | +---+ +---+ +---+ +---+ | |
875 | <---| |--->| |-U->| |--->| |-H-> | |
876 | --->| |<---| |<---| |<---| |<--- | |
877 | +---+ +---+ +---+ +---+ | |
878 | ||
879 | ||
880 | The second writer will fail to move the tail page because it was already | |
881 | moved, so it will try again and add its data to the new tail page. | |
882 | It will return to the first writer. | |
883 | ||
884 | ||
885 | (first writer) | |
886 | ||
887 | tail page | |
888 | | | |
889 | v | |
890 | +---+ +---+ +---+ +---+ | |
891 | <---| |--->| |-U->| |--->| |-H-> | |
892 | --->| |<---| |<---| |<---| |<--- | |
893 | +---+ +---+ +---+ +---+ | |
894 | ||
006b4298 | 895 | The first writer cannot know atomically if the tail page moved |
8b2c70d1 SR |
896 | while it updates the HEAD page. It will then update the head page to |
897 | what it thinks is the new head page. | |
898 | ||
899 | ||
900 | (first writer) | |
901 | ||
902 | tail page | |
903 | | | |
904 | v | |
905 | +---+ +---+ +---+ +---+ | |
906 | <---| |--->| |-U->| |-H->| |-H-> | |
907 | --->| |<---| |<---| |<---| |<--- | |
908 | +---+ +---+ +---+ +---+ | |
909 | ||
910 | Since the cmpxchg returns the old value of the pointer the first writer | |
911 | will see it succeeded in updating the pointer from NORMAL to HEAD. | |
912 | But as we can see, this is not good enough. It must also check to see | |
913 | if the tail page is either where it use to be or on the next page: | |
914 | ||
915 | ||
916 | (first writer) | |
917 | ||
918 | A B tail page | |
919 | | | | | |
920 | v v v | |
921 | +---+ +---+ +---+ +---+ | |
922 | <---| |--->| |-U->| |-H->| |-H-> | |
923 | --->| |<---| |<---| |<---| |<--- | |
924 | +---+ +---+ +---+ +---+ | |
925 | ||
006b4298 RD |
926 | If tail page != A and tail page != B, then it must reset the pointer |
927 | back to NORMAL. The fact that it only needs to worry about nested | |
928 | writers means that it only needs to check this after setting the HEAD page. | |
8b2c70d1 SR |
929 | |
930 | ||
931 | (first writer) | |
932 | ||
933 | A B tail page | |
934 | | | | | |
935 | v v v | |
936 | +---+ +---+ +---+ +---+ | |
937 | <---| |--->| |-U->| |--->| |-H-> | |
938 | --->| |<---| |<---| |<---| |<--- | |
939 | +---+ +---+ +---+ +---+ | |
940 | ||
941 | Now the writer can update the head page. This is also why the head page must | |
006b4298 | 942 | remain in UPDATE and only reset by the outermost writer. This prevents |
8b2c70d1 SR |
943 | the reader from seeing the incorrect head page. |
944 | ||
945 | ||
946 | (first writer) | |
947 | ||
948 | A B tail page | |
949 | | | | | |
950 | v v v | |
951 | +---+ +---+ +---+ +---+ | |
952 | <---| |--->| |--->| |--->| |-H-> | |
953 | --->| |<---| |<---| |<---| |<--- | |
954 | +---+ +---+ +---+ +---+ | |
955 |