Commit | Line | Data |
---|---|---|
1da177e4 LT |
1 | Review Checklist for RCU Patches |
2 | ||
3 | ||
4 | This document contains a checklist for producing and reviewing patches | |
5 | that make use of RCU. Violating any of the rules listed below will | |
6 | result in the same sorts of problems that leaving out a locking primitive | |
7 | would cause. This list is based on experiences reviewing such patches | |
8 | over a rather long period of time, but improvements are always welcome! | |
9 | ||
10 | 0. Is RCU being applied to a read-mostly situation? If the data | |
11 | structure is updated more than about 10% of the time, then | |
12 | you should strongly consider some other approach, unless | |
13 | detailed performance measurements show that RCU is nonetheless | |
14 | the right tool for the job. | |
15 | ||
32300751 PM |
16 | Another exception is where performance is not an issue, and RCU |
17 | provides a simpler implementation. An example of this situation | |
18 | is the dynamic NMI code in the Linux 2.6 kernel, at least on | |
19 | architectures where NMIs are rare. | |
20 | ||
21 | Yet another exception is where the low real-time latency of RCU's | |
22 | read-side primitives is critically important. | |
1da177e4 LT |
23 | |
24 | 1. Does the update code have proper mutual exclusion? | |
25 | ||
26 | RCU does allow -readers- to run (almost) naked, but -writers- must | |
27 | still use some sort of mutual exclusion, such as: | |
28 | ||
29 | a. locking, | |
30 | b. atomic operations, or | |
31 | c. restricting updates to a single task. | |
32 | ||
33 | If you choose #b, be prepared to describe how you have handled | |
34 | memory barriers on weakly ordered machines (pretty much all of | |
35 | them -- even x86 allows reads to be reordered), and be prepared | |
36 | to explain why this added complexity is worthwhile. If you | |
37 | choose #c, be prepared to explain how this single task does not | |
a83f1fe2 PM |
38 | become a major bottleneck on big multiprocessor machines (for |
39 | example, if the task is updating information relating to itself | |
40 | that other tasks can read, there by definition can be no | |
41 | bottleneck). | |
1da177e4 LT |
42 | |
43 | 2. Do the RCU read-side critical sections make proper use of | |
44 | rcu_read_lock() and friends? These primitives are needed | |
32300751 PM |
45 | to prevent grace periods from ending prematurely, which |
46 | could result in data being unceremoniously freed out from | |
47 | under your read-side code, which can greatly increase the | |
48 | actuarial risk of your kernel. | |
1da177e4 | 49 | |
dd81eca8 PM |
50 | As a rough rule of thumb, any dereference of an RCU-protected |
51 | pointer must be covered by rcu_read_lock() or rcu_read_lock_bh() | |
52 | or by the appropriate update-side lock. | |
53 | ||
1da177e4 LT |
54 | 3. Does the update code tolerate concurrent accesses? |
55 | ||
56 | The whole point of RCU is to permit readers to run without | |
57 | any locks or atomic operations. This means that readers will | |
58 | be running while updates are in progress. There are a number | |
59 | of ways to handle this concurrency, depending on the situation: | |
60 | ||
32300751 PM |
61 | a. Use the RCU variants of the list and hlist update |
62 | primitives to add, remove, and replace elements on an | |
63 | RCU-protected list. Alternatively, use the RCU-protected | |
64 | trees that have been added to the Linux kernel. | |
65 | ||
66 | This is almost always the best approach. | |
67 | ||
68 | b. Proceed as in (a) above, but also maintain per-element | |
69 | locks (that are acquired by both readers and writers) | |
70 | that guard per-element state. Of course, fields that | |
71 | the readers refrain from accessing can be guarded by the | |
72 | update-side lock. | |
73 | ||
74 | This works quite well, also. | |
75 | ||
76 | c. Make updates appear atomic to readers. For example, | |
1da177e4 LT |
77 | pointer updates to properly aligned fields will appear |
78 | atomic, as will individual atomic primitives. Operations | |
79 | performed under a lock and sequences of multiple atomic | |
80 | primitives will -not- appear to be atomic. | |
81 | ||
32300751 | 82 | This can work, but is starting to get a bit tricky. |
1da177e4 | 83 | |
32300751 | 84 | d. Carefully order the updates and the reads so that |
1da177e4 LT |
85 | readers see valid data at all phases of the update. |
86 | This is often more difficult than it sounds, especially | |
87 | given modern CPUs' tendency to reorder memory references. | |
88 | One must usually liberally sprinkle memory barriers | |
89 | (smp_wmb(), smp_rmb(), smp_mb()) through the code, | |
90 | making it difficult to understand and to test. | |
91 | ||
92 | It is usually better to group the changing data into | |
93 | a separate structure, so that the change may be made | |
94 | to appear atomic by updating a pointer to reference | |
95 | a new structure containing updated values. | |
96 | ||
97 | 4. Weakly ordered CPUs pose special challenges. Almost all CPUs | |
98 | are weakly ordered -- even i386 CPUs allow reads to be reordered. | |
99 | RCU code must take all of the following measures to prevent | |
100 | memory-corruption problems: | |
101 | ||
102 | a. Readers must maintain proper ordering of their memory | |
103 | accesses. The rcu_dereference() primitive ensures that | |
104 | the CPU picks up the pointer before it picks up the data | |
105 | that the pointer points to. This really is necessary | |
106 | on Alpha CPUs. If you don't believe me, see: | |
107 | ||
108 | http://www.openvms.compaq.com/wizard/wiz_2637.html | |
109 | ||
110 | The rcu_dereference() primitive is also an excellent | |
111 | documentation aid, letting the person reading the code | |
112 | know exactly which pointers are protected by RCU. | |
113 | ||
114 | The rcu_dereference() primitive is used by the various | |
115 | "_rcu()" list-traversal primitives, such as the | |
dd81eca8 PM |
116 | list_for_each_entry_rcu(). Note that it is perfectly |
117 | legal (if redundant) for update-side code to use | |
118 | rcu_dereference() and the "_rcu()" list-traversal | |
119 | primitives. This is particularly useful in code | |
120 | that is common to readers and updaters. | |
1da177e4 | 121 | |
a83f1fe2 PM |
122 | b. If the list macros are being used, the list_add_tail_rcu() |
123 | and list_add_rcu() primitives must be used in order | |
124 | to prevent weakly ordered machines from misordering | |
125 | structure initialization and pointer planting. | |
1da177e4 | 126 | Similarly, if the hlist macros are being used, the |
a83f1fe2 | 127 | hlist_add_head_rcu() primitive is required. |
1da177e4 | 128 | |
a83f1fe2 PM |
129 | c. If the list macros are being used, the list_del_rcu() |
130 | primitive must be used to keep list_del()'s pointer | |
131 | poisoning from inflicting toxic effects on concurrent | |
132 | readers. Similarly, if the hlist macros are being used, | |
133 | the hlist_del_rcu() primitive is required. | |
134 | ||
135 | The list_replace_rcu() primitive may be used to | |
136 | replace an old structure with a new one in an | |
137 | RCU-protected list. | |
138 | ||
139 | d. Updates must ensure that initialization of a given | |
1da177e4 LT |
140 | structure happens before pointers to that structure are |
141 | publicized. Use the rcu_assign_pointer() primitive | |
142 | when publicizing a pointer to a structure that can | |
143 | be traversed by an RCU read-side critical section. | |
144 | ||
32300751 PM |
145 | 5. If call_rcu(), or a related primitive such as call_rcu_bh() or |
146 | call_rcu_sched(), is used, the callback function must be | |
147 | written to be called from softirq context. In particular, | |
148 | it cannot block. | |
1da177e4 | 149 | |
a83f1fe2 | 150 | 6. Since synchronize_rcu() can block, it cannot be called from |
32300751 PM |
151 | any sort of irq context. Ditto for synchronize_sched() and |
152 | synchronize_srcu(). | |
1da177e4 LT |
153 | |
154 | 7. If the updater uses call_rcu(), then the corresponding readers | |
155 | must use rcu_read_lock() and rcu_read_unlock(). If the updater | |
156 | uses call_rcu_bh(), then the corresponding readers must use | |
32300751 PM |
157 | rcu_read_lock_bh() and rcu_read_unlock_bh(). If the updater |
158 | uses call_rcu_sched(), then the corresponding readers must | |
159 | disable preemption. Mixing things up will result in confusion | |
160 | and broken kernels. | |
1da177e4 LT |
161 | |
162 | One exception to this rule: rcu_read_lock() and rcu_read_unlock() | |
163 | may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh() | |
164 | in cases where local bottom halves are already known to be | |
165 | disabled, for example, in irq or softirq context. Commenting | |
166 | such cases is a must, of course! And the jury is still out on | |
167 | whether the increased speed is worth it. | |
168 | ||
32300751 PM |
169 | 8. Although synchronize_rcu() is slower than is call_rcu(), it |
170 | usually results in simpler code. So, unless update performance | |
171 | is critically important or the updaters cannot block, | |
165d6c78 PM |
172 | synchronize_rcu() should be used in preference to call_rcu(). |
173 | ||
174 | An especially important property of the synchronize_rcu() | |
175 | primitive is that it automatically self-limits: if grace periods | |
176 | are delayed for whatever reason, then the synchronize_rcu() | |
177 | primitive will correspondingly delay updates. In contrast, | |
178 | code using call_rcu() should explicitly limit update rate in | |
179 | cases where grace periods are delayed, as failing to do so can | |
180 | result in excessive realtime latencies or even OOM conditions. | |
181 | ||
182 | Ways of gaining this self-limiting property when using call_rcu() | |
183 | include: | |
184 | ||
185 | a. Keeping a count of the number of data-structure elements | |
186 | used by the RCU-protected data structure, including those | |
187 | waiting for a grace period to elapse. Enforce a limit | |
188 | on this number, stalling updates as needed to allow | |
189 | previously deferred frees to complete. | |
190 | ||
191 | Alternatively, limit only the number awaiting deferred | |
192 | free rather than the total number of elements. | |
193 | ||
194 | b. Limiting update rate. For example, if updates occur only | |
195 | once per hour, then no explicit rate limiting is required, | |
196 | unless your system is already badly broken. The dcache | |
197 | subsystem takes this approach -- updates are guarded | |
198 | by a global lock, limiting their rate. | |
199 | ||
200 | c. Trusted update -- if updates can only be done manually by | |
201 | superuser or some other trusted user, then it might not | |
202 | be necessary to automatically limit them. The theory | |
203 | here is that superuser already has lots of ways to crash | |
204 | the machine. | |
205 | ||
206 | d. Use call_rcu_bh() rather than call_rcu(), in order to take | |
207 | advantage of call_rcu_bh()'s faster grace periods. | |
208 | ||
209 | e. Periodically invoke synchronize_rcu(), permitting a limited | |
210 | number of updates per grace period. | |
1da177e4 LT |
211 | |
212 | 9. All RCU list-traversal primitives, which include | |
32300751 | 213 | rcu_dereference(), list_for_each_rcu(), list_for_each_entry_rcu(), |
1da177e4 | 214 | list_for_each_continue_rcu(), and list_for_each_safe_rcu(), |
32300751 PM |
215 | must be either within an RCU read-side critical section or |
216 | must be protected by appropriate update-side locks. RCU | |
1da177e4 LT |
217 | read-side critical sections are delimited by rcu_read_lock() |
218 | and rcu_read_unlock(), or by similar primitives such as | |
219 | rcu_read_lock_bh() and rcu_read_unlock_bh(). | |
220 | ||
32300751 PM |
221 | The reason that it is permissible to use RCU list-traversal |
222 | primitives when the update-side lock is held is that doing so | |
223 | can be quite helpful in reducing code bloat when common code is | |
224 | shared between readers and updaters. | |
1da177e4 LT |
225 | |
226 | 10. Conversely, if you are in an RCU read-side critical section, | |
32300751 PM |
227 | and you don't hold the appropriate update-side lock, you -must- |
228 | use the "_rcu()" variants of the list macros. Failing to do so | |
229 | will break Alpha and confuse people reading your code. | |
a83f1fe2 PM |
230 | |
231 | 11. Note that synchronize_rcu() -only- guarantees to wait until | |
232 | all currently executing rcu_read_lock()-protected RCU read-side | |
233 | critical sections complete. It does -not- necessarily guarantee | |
234 | that all currently running interrupts, NMIs, preempt_disable() | |
235 | code, or idle loops will complete. Therefore, if you do not have | |
236 | rcu_read_lock()-protected read-side critical sections, do -not- | |
237 | use synchronize_rcu(). | |
238 | ||
239 | If you want to wait for some of these other things, you might | |
240 | instead need to use synchronize_irq() or synchronize_sched(). | |
d19720a9 PM |
241 | |
242 | 12. Any lock acquired by an RCU callback must be acquired elsewhere | |
243 | with irq disabled, e.g., via spin_lock_irqsave(). Failing to | |
244 | disable irq on a given acquisition of that lock will result in | |
245 | deadlock as soon as the RCU callback happens to interrupt that | |
246 | acquisition's critical section. | |
621934ee | 247 | |
ef48bd24 PM |
248 | 13. RCU callbacks can be and are executed in parallel. In many cases, |
249 | the callback code simply wrappers around kfree(), so that this | |
250 | is not an issue (or, more accurately, to the extent that it is | |
251 | an issue, the memory-allocator locking handles it). However, | |
252 | if the callbacks do manipulate a shared data structure, they | |
253 | must use whatever locking or other synchronization is required | |
254 | to safely access and/or modify that data structure. | |
255 | ||
32300751 PM |
256 | RCU callbacks are -usually- executed on the same CPU that executed |
257 | the corresponding call_rcu(), call_rcu_bh(), or call_rcu_sched(), | |
258 | but are by -no- means guaranteed to be. For example, if a given | |
259 | CPU goes offline while having an RCU callback pending, then that | |
260 | RCU callback will execute on some surviving CPU. (If this was | |
261 | not the case, a self-spawning RCU callback would prevent the | |
262 | victim CPU from ever going offline.) | |
263 | ||
ef48bd24 | 264 | 14. SRCU (srcu_read_lock(), srcu_read_unlock(), and synchronize_srcu()) |
621934ee PM |
265 | may only be invoked from process context. Unlike other forms of |
266 | RCU, it -is- permissible to block in an SRCU read-side critical | |
267 | section (demarked by srcu_read_lock() and srcu_read_unlock()), | |
268 | hence the "SRCU": "sleepable RCU". Please note that if you | |
269 | don't need to sleep in read-side critical sections, you should | |
270 | be using RCU rather than SRCU, because RCU is almost always | |
271 | faster and easier to use than is SRCU. | |
272 | ||
273 | Also unlike other forms of RCU, explicit initialization | |
274 | and cleanup is required via init_srcu_struct() and | |
275 | cleanup_srcu_struct(). These are passed a "struct srcu_struct" | |
276 | that defines the scope of a given SRCU domain. Once initialized, | |
277 | the srcu_struct is passed to srcu_read_lock(), srcu_read_unlock() | |
278 | and synchronize_srcu(). A given synchronize_srcu() waits only | |
279 | for SRCU read-side critical sections governed by srcu_read_lock() | |
280 | and srcu_read_unlock() calls that have been passd the same | |
281 | srcu_struct. This property is what makes sleeping read-side | |
282 | critical sections tolerable -- a given subsystem delays only | |
283 | its own updates, not those of other subsystems using SRCU. | |
284 | Therefore, SRCU is less prone to OOM the system than RCU would | |
285 | be if RCU's read-side critical sections were permitted to | |
286 | sleep. | |
287 | ||
288 | The ability to sleep in read-side critical sections does not | |
289 | come for free. First, corresponding srcu_read_lock() and | |
290 | srcu_read_unlock() calls must be passed the same srcu_struct. | |
291 | Second, grace-period-detection overhead is amortized only | |
292 | over those updates sharing a given srcu_struct, rather than | |
293 | being globally amortized as they are for other forms of RCU. | |
294 | Therefore, SRCU should be used in preference to rw_semaphore | |
295 | only in extremely read-intensive situations, or in situations | |
296 | requiring SRCU's read-side deadlock immunity or low read-side | |
297 | realtime latency. | |
298 | ||
299 | Note that, rcu_assign_pointer() and rcu_dereference() relate to | |
300 | SRCU just as they do to other forms of RCU. |