Commit | Line | Data |
---|---|---|
f3e97da3 IM |
1 | Runtime locking correctness validator |
2 | ===================================== | |
3 | ||
4 | started by Ingo Molnar <mingo@redhat.com> | |
5 | additions by Arjan van de Ven <arjan@linux.intel.com> | |
6 | ||
7 | Lock-class | |
8 | ---------- | |
9 | ||
10 | The basic object the validator operates upon is a 'class' of locks. | |
11 | ||
12 | A class of locks is a group of locks that are logically the same with | |
13 | respect to locking rules, even if the locks may have multiple (possibly | |
14 | tens of thousands of) instantiations. For example a lock in the inode | |
15 | struct is one class, while each inode has its own instantiation of that | |
16 | lock class. | |
17 | ||
18 | The validator tracks the 'state' of lock-classes, and it tracks | |
19 | dependencies between different lock-classes. The validator maintains a | |
20 | rolling proof that the state and the dependencies are correct. | |
21 | ||
22 | Unlike an lock instantiation, the lock-class itself never goes away: when | |
23 | a lock-class is used for the first time after bootup it gets registered, | |
24 | and all subsequent uses of that lock-class will be attached to this | |
25 | lock-class. | |
26 | ||
27 | State | |
28 | ----- | |
29 | ||
f510b233 | 30 | The validator tracks lock-class usage history into 4n + 1 separate state bits: |
f3e97da3 | 31 | |
f510b233 | 32 | - 'ever held in STATE context' |
0e692a94 LZ |
33 | - 'ever held as readlock in STATE context' |
34 | - 'ever held with STATE enabled' | |
35 | - 'ever held as readlock with STATE enabled' | |
f510b233 PZ |
36 | |
37 | Where STATE can be either one of (kernel/lockdep_states.h) | |
38 | - hardirq | |
39 | - softirq | |
40 | - reclaim_fs | |
f3e97da3 IM |
41 | |
42 | - 'ever used' [ == !unused ] | |
43 | ||
f510b233 PZ |
44 | When locking rules are violated, these state bits are presented in the |
45 | locking error messages, inside curlies. A contrived example: | |
fd7bcea3 JC |
46 | |
47 | modprobe/2287 is trying to acquire lock: | |
f510b233 | 48 | (&sio_locks[i].lock){-.-...}, at: [<c02867fd>] mutex_lock+0x21/0x24 |
fd7bcea3 JC |
49 | |
50 | but task is already holding lock: | |
f510b233 | 51 | (&sio_locks[i].lock){-.-...}, at: [<c02867fd>] mutex_lock+0x21/0x24 |
fd7bcea3 JC |
52 | |
53 | ||
f510b233 PZ |
54 | The bit position indicates STATE, STATE-read, for each of the states listed |
55 | above, and the character displayed in each indicates: | |
fd7bcea3 | 56 | |
992d7ced ML |
57 | '.' acquired while irqs disabled and not in irq context |
58 | '-' acquired in irq context | |
59 | '+' acquired with irqs enabled | |
f510b233 | 60 | '?' acquired in irq context with irqs enabled. |
fd7bcea3 JC |
61 | |
62 | Unused mutexes cannot be part of the cause of an error. | |
63 | ||
64 | ||
f3e97da3 IM |
65 | Single-lock state rules: |
66 | ------------------------ | |
67 | ||
68 | A softirq-unsafe lock-class is automatically hardirq-unsafe as well. The | |
69 | following states are exclusive, and only one of them is allowed to be | |
70 | set for any lock-class: | |
71 | ||
72 | <hardirq-safe> and <hardirq-unsafe> | |
73 | <softirq-safe> and <softirq-unsafe> | |
74 | ||
75 | The validator detects and reports lock usage that violate these | |
76 | single-lock state rules. | |
77 | ||
78 | Multi-lock dependency rules: | |
79 | ---------------------------- | |
80 | ||
81 | The same lock-class must not be acquired twice, because this could lead | |
82 | to lock recursion deadlocks. | |
83 | ||
84 | Furthermore, two locks may not be taken in different order: | |
85 | ||
86 | <L1> -> <L2> | |
87 | <L2> -> <L1> | |
88 | ||
89 | because this could lead to lock inversion deadlocks. (The validator | |
90 | finds such dependencies in arbitrary complexity, i.e. there can be any | |
91 | other locking sequence between the acquire-lock operations, the | |
92 | validator will still track all dependencies between locks.) | |
93 | ||
94 | Furthermore, the following usage based lock dependencies are not allowed | |
95 | between any two lock-classes: | |
96 | ||
97 | <hardirq-safe> -> <hardirq-unsafe> | |
98 | <softirq-safe> -> <softirq-unsafe> | |
99 | ||
100 | The first rule comes from the fact the a hardirq-safe lock could be | |
101 | taken by a hardirq context, interrupting a hardirq-unsafe lock - and | |
102 | thus could result in a lock inversion deadlock. Likewise, a softirq-safe | |
103 | lock could be taken by an softirq context, interrupting a softirq-unsafe | |
104 | lock. | |
105 | ||
106 | The above rules are enforced for any locking sequence that occurs in the | |
107 | kernel: when acquiring a new lock, the validator checks whether there is | |
108 | any rule violation between the new lock and any of the held locks. | |
109 | ||
110 | When a lock-class changes its state, the following aspects of the above | |
111 | dependency rules are enforced: | |
112 | ||
113 | - if a new hardirq-safe lock is discovered, we check whether it | |
114 | took any hardirq-unsafe lock in the past. | |
115 | ||
116 | - if a new softirq-safe lock is discovered, we check whether it took | |
117 | any softirq-unsafe lock in the past. | |
118 | ||
119 | - if a new hardirq-unsafe lock is discovered, we check whether any | |
120 | hardirq-safe lock took it in the past. | |
121 | ||
122 | - if a new softirq-unsafe lock is discovered, we check whether any | |
123 | softirq-safe lock took it in the past. | |
124 | ||
125 | (Again, we do these checks too on the basis that an interrupt context | |
126 | could interrupt _any_ of the irq-unsafe or hardirq-unsafe locks, which | |
127 | could lead to a lock inversion deadlock - even if that lock scenario did | |
128 | not trigger in practice yet.) | |
129 | ||
130 | Exception: Nested data dependencies leading to nested locking | |
131 | ------------------------------------------------------------- | |
132 | ||
133 | There are a few cases where the Linux kernel acquires more than one | |
134 | instance of the same lock-class. Such cases typically happen when there | |
135 | is some sort of hierarchy within objects of the same type. In these | |
136 | cases there is an inherent "natural" ordering between the two objects | |
137 | (defined by the properties of the hierarchy), and the kernel grabs the | |
138 | locks in this fixed order on each of the objects. | |
139 | ||
2fe0ae78 | 140 | An example of such an object hierarchy that results in "nested locking" |
f3e97da3 IM |
141 | is that of a "whole disk" block-dev object and a "partition" block-dev |
142 | object; the partition is "part of" the whole device and as long as one | |
143 | always takes the whole disk lock as a higher lock than the partition | |
144 | lock, the lock ordering is fully correct. The validator does not | |
145 | automatically detect this natural ordering, as the locking rule behind | |
146 | the ordering is not static. | |
147 | ||
148 | In order to teach the validator about this correct usage model, new | |
149 | versions of the various locking primitives were added that allow you to | |
150 | specify a "nesting level". An example call, for the block device mutex, | |
151 | looks like this: | |
152 | ||
153 | enum bdev_bd_mutex_lock_class | |
154 | { | |
155 | BD_MUTEX_NORMAL, | |
156 | BD_MUTEX_WHOLE, | |
157 | BD_MUTEX_PARTITION | |
158 | }; | |
159 | ||
160 | mutex_lock_nested(&bdev->bd_contains->bd_mutex, BD_MUTEX_PARTITION); | |
161 | ||
162 | In this case the locking is done on a bdev object that is known to be a | |
163 | partition. | |
164 | ||
a2ffd275 | 165 | The validator treats a lock that is taken in such a nested fashion as a |
f3e97da3 IM |
166 | separate (sub)class for the purposes of validation. |
167 | ||
168 | Note: When changing code to use the _nested() primitives, be careful and | |
2fe0ae78 | 169 | check really thoroughly that the hierarchy is correctly mapped; otherwise |
f3e97da3 IM |
170 | you can get false positives or false negatives. |
171 | ||
172 | Proof of 100% correctness: | |
173 | -------------------------- | |
174 | ||
175 | The validator achieves perfect, mathematical 'closure' (proof of locking | |
176 | correctness) in the sense that for every simple, standalone single-task | |
992caacf | 177 | locking sequence that occurred at least once during the lifetime of the |
f3e97da3 IM |
178 | kernel, the validator proves it with a 100% certainty that no |
179 | combination and timing of these locking sequences can cause any class of | |
180 | lock related deadlock. [*] | |
181 | ||
182 | I.e. complex multi-CPU and multi-task locking scenarios do not have to | |
183 | occur in practice to prove a deadlock: only the simple 'component' | |
184 | locking chains have to occur at least once (anytime, in any | |
185 | task/context) for the validator to be able to prove correctness. (For | |
186 | example, complex deadlocks that would normally need more than 3 CPUs and | |
187 | a very unlikely constellation of tasks, irq-contexts and timings to | |
188 | occur, can be detected on a plain, lightly loaded single-CPU system as | |
189 | well!) | |
190 | ||
191 | This radically decreases the complexity of locking related QA of the | |
192 | kernel: what has to be done during QA is to trigger as many "simple" | |
193 | single-task locking dependencies in the kernel as possible, at least | |
194 | once, to prove locking correctness - instead of having to trigger every | |
195 | possible combination of locking interaction between CPUs, combined with | |
196 | every possible hardirq and softirq nesting scenario (which is impossible | |
197 | to do in practice). | |
198 | ||
199 | [*] assuming that the validator itself is 100% correct, and no other | |
200 | part of the system corrupts the state of the validator in any way. | |
201 | We also assume that all NMI/SMM paths [which could interrupt | |
202 | even hardirq-disabled codepaths] are correct and do not interfere | |
203 | with the validator. We also assume that the 64-bit 'chain hash' | |
204 | value is unique for every lock-chain in the system. Also, lock | |
205 | recursion must not be higher than 20. | |
206 | ||
207 | Performance: | |
208 | ------------ | |
209 | ||
210 | The above rules require _massive_ amounts of runtime checking. If we did | |
211 | that for every lock taken and for every irqs-enable event, it would | |
212 | render the system practically unusably slow. The complexity of checking | |
213 | is O(N^2), so even with just a few hundred lock-classes we'd have to do | |
214 | tens of thousands of checks for every event. | |
215 | ||
216 | This problem is solved by checking any given 'locking scenario' (unique | |
217 | sequence of locks taken after each other) only once. A simple stack of | |
218 | held locks is maintained, and a lightweight 64-bit hash value is | |
219 | calculated, which hash is unique for every lock chain. The hash value, | |
220 | when the chain is validated for the first time, is then put into a hash | |
221 | table, which hash-table can be checked in a lockfree manner. If the | |
222 | locking chain occurs again later on, the hash table tells us that we | |
223 | dont have to validate the chain again. | |
b804cb9e PM |
224 | |
225 | Troubleshooting: | |
226 | ---------------- | |
227 | ||
228 | The validator tracks a maximum of MAX_LOCKDEP_KEYS number of lock classes. | |
229 | Exceeding this number will trigger the following lockdep warning: | |
230 | ||
231 | (DEBUG_LOCKS_WARN_ON(id >= MAX_LOCKDEP_KEYS)) | |
232 | ||
233 | By default, MAX_LOCKDEP_KEYS is currently set to 8191, and typical | |
234 | desktop systems have less than 1,000 lock classes, so this warning | |
235 | normally results from lock-class leakage or failure to properly | |
236 | initialize locks. These two problems are illustrated below: | |
237 | ||
238 | 1. Repeated module loading and unloading while running the validator | |
239 | will result in lock-class leakage. The issue here is that each | |
240 | load of the module will create a new set of lock classes for | |
241 | that module's locks, but module unloading does not remove old | |
242 | classes (see below discussion of reuse of lock classes for why). | |
243 | Therefore, if that module is loaded and unloaded repeatedly, | |
244 | the number of lock classes will eventually reach the maximum. | |
245 | ||
246 | 2. Using structures such as arrays that have large numbers of | |
247 | locks that are not explicitly initialized. For example, | |
248 | a hash table with 8192 buckets where each bucket has its own | |
249 | spinlock_t will consume 8192 lock classes -unless- each spinlock | |
250 | is explicitly initialized at runtime, for example, using the | |
251 | run-time spin_lock_init() as opposed to compile-time initializers | |
252 | such as __SPIN_LOCK_UNLOCKED(). Failure to properly initialize | |
253 | the per-bucket spinlocks would guarantee lock-class overflow. | |
254 | In contrast, a loop that called spin_lock_init() on each lock | |
255 | would place all 8192 locks into a single lock class. | |
256 | ||
257 | The moral of this story is that you should always explicitly | |
258 | initialize your locks. | |
259 | ||
260 | One might argue that the validator should be modified to allow | |
261 | lock classes to be reused. However, if you are tempted to make this | |
262 | argument, first review the code and think through the changes that would | |
263 | be required, keeping in mind that the lock classes to be removed are | |
264 | likely to be linked into the lock-dependency graph. This turns out to | |
265 | be harder to do than to say. | |
266 | ||
267 | Of course, if you do run out of lock classes, the next thing to do is | |
268 | to find the offending lock classes. First, the following command gives | |
269 | you the number of lock classes currently in use along with the maximum: | |
270 | ||
271 | grep "lock-classes" /proc/lockdep_stats | |
272 | ||
273 | This command produces the following output on a modest system: | |
274 | ||
275 | lock-classes: 748 [max: 8191] | |
276 | ||
277 | If the number allocated (748 above) increases continually over time, | |
278 | then there is likely a leak. The following command can be used to | |
279 | identify the leaking lock classes: | |
280 | ||
281 | grep "BD" /proc/lockdep | |
282 | ||
283 | Run the command and save the output, then compare against the output from | |
284 | a later run of this command to identify the leakers. This same output | |
285 | can also help you find situations where runtime lock initialization has | |
286 | been omitted. |