Cluster-wide Power-up/power-down race avoidance algorithm
=========================================================

This file documents the algorithm which is used to coordinate CPU and
cluster setup and teardown operations and to manage hardware coherency
controls safely.

The section "Rationale" explains what the algorithm is for and why it is
needed. "Basic model" explains general concepts using a simplified view
of the system. The other sections explain the actual details of the
algorithm in use.


Rationale
---------

In a system containing multiple CPUs, it is desirable to have the
ability to turn off individual CPUs when the system is idle, reducing
power consumption and thermal dissipation.

In a system containing multiple clusters of CPUs, it is also desirable
to have the ability to turn off entire clusters.

Turning entire clusters off and on is a risky business, because it
involves performing potentially destructive operations affecting a group
of independently running CPUs, while the OS continues to run. This
means that we need some coordination in order to ensure that critical
cluster-level operations are only performed when it is truly safe to do
so.

Simple locking may not be sufficient to solve this problem, because
mechanisms like Linux spinlocks may rely on coherency mechanisms which
are not immediately enabled when a cluster powers up. Since enabling or
disabling those mechanisms may itself be a non-atomic operation (such as
writing some hardware registers and invalidating large caches), other
methods of coordination are required in order to guarantee safe
power-down and power-up at the cluster level.

This document describes a coherent-memory-based protocol for performing
the needed coordination. It aims to be as lightweight as possible,
while providing the required safety properties.


Basic model
-----------

Each cluster and CPU is assigned a state, as follows:

        DOWN
        COMING_UP
        UP
        GOING_DOWN

            +---------> UP ----------+
            |                        v

        COMING_UP                GOING_DOWN

            ^                        |
            +--------- DOWN <--------+


DOWN:   The CPU or cluster is not coherent, and is either powered off or
        suspended, or is ready to be powered off or suspended.

COMING_UP: The CPU or cluster has committed to moving to the UP state.
        It may be part way through the process of initialisation and
        enabling coherency.

UP:     The CPU or cluster is active and coherent at the hardware
        level. A CPU in this state is not necessarily being used
        actively by the kernel.

GOING_DOWN: The CPU or cluster has committed to moving to the DOWN
        state. It may be part way through the process of teardown and
        coherency exit.

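As an illustration only, the four-state cycle above can be modelled in a
few lines of C. The names basic_state, basic_next() and
basic_is_coherent() are hypothetical and exist only in this sketch; they
are not part of any kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

/* States of the basic model, named as in the text above. */
enum basic_state { DOWN, COMING_UP, UP, GOING_DOWN };

/* Each state has exactly one successor: the cycle shown in the diagram. */
enum basic_state basic_next(enum basic_state s)
{
	switch (s) {
	case DOWN:       return COMING_UP;
	case COMING_UP:  return UP;
	case UP:         return GOING_DOWN;
	case GOING_DOWN: return DOWN;
	}
	return DOWN;	/* not reached for valid input */
}

/*
 * Only UP guarantees hardware coherency.  COMING_UP and GOING_DOWN may
 * be part way through enabling or exiting coherency, so they are
 * treated as "not safely coherent" here.
 */
bool basic_is_coherent(enum basic_state s)
{
	return s == UP;
}
```

Applying basic_next() four times to any state returns to that state,
which is the closed loop drawn in the diagram.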

Each CPU has one of these states assigned to it at any point in time.
The CPU states are described in the "CPU state" section, below.

Each cluster is also assigned a state, but it is necessary to split the
state value into two parts (the "cluster" state and "inbound" state) and
to introduce additional states in order to avoid races between different
CPUs in the cluster simultaneously modifying the state. The
cluster-level states are described in the "Cluster state" section.

To help distinguish the CPU states from cluster states in this
discussion, the state names are given a CPU_ prefix for the CPU states,
and a CLUSTER_ or INBOUND_ prefix for the cluster states.


CPU state
---------

In this algorithm, each individual core in a multi-core processor is
referred to as a "CPU". CPUs are assumed to be single-threaded:
therefore, a CPU can only be doing one thing at a single point in time.

This means that CPUs fit the basic model closely.

The algorithm defines the following states for each CPU in the system:

        CPU_DOWN
        CPU_COMING_UP
        CPU_UP
        CPU_GOING_DOWN

         cluster setup and
        CPU setup complete              policy decision
              +-----------> CPU_UP ------------+
              |                                v

        CPU_COMING_UP                   CPU_GOING_DOWN

              ^                                |
              +----------- CPU_DOWN <----------+
         policy decision             CPU teardown complete
        or hardware event


The definitions of the four states correspond closely to the states of
the basic model.

Transitions between states occur as follows.

A trigger event listed as "(spontaneous)" means that the CPU can
transition to the next state as a result of making local progress only,
with no requirement for any external event to happen.


CPU_DOWN:

        A CPU reaches the CPU_DOWN state when it is ready for
        power-down. On reaching this state, the CPU will typically
        power itself down or suspend itself, via a WFI instruction or a
        firmware call.

        Next state:     CPU_COMING_UP
        Conditions:     none

        Trigger events:

                a) an explicit hardware power-up operation, resulting
                   from a policy decision on another CPU;

                b) a hardware event, such as an interrupt.


CPU_COMING_UP:

        A CPU cannot start participating in hardware coherency until the
        cluster is set up and coherent. If the cluster is not ready,
        then the CPU will wait in the CPU_COMING_UP state until the
        cluster has been set up.

        Next state:     CPU_UP
        Conditions:     The CPU's parent cluster must be in CLUSTER_UP.
        Trigger events: Transition of the parent cluster to CLUSTER_UP.

        Refer to the "Cluster state" section for a description of the
        CLUSTER_UP state.


CPU_UP:

        When a CPU reaches the CPU_UP state, it is safe for the CPU to
        start participating in local coherency.

        This is done by jumping to the kernel's CPU resume code.

        Note that the definition of this state is slightly different
        from the basic model definition: CPU_UP does not mean that the
        CPU is coherent yet, but it does mean that it is safe to resume
        the kernel. The kernel handles the rest of the resume
        procedure, so the remaining steps are not visible as part of the
        race avoidance algorithm.

        The CPU remains in this state until an explicit policy decision
        is made to shut down or suspend the CPU.

        Next state:     CPU_GOING_DOWN
        Conditions:     none
        Trigger events: explicit policy decision


CPU_GOING_DOWN:

        While in this state, the CPU exits coherency, including any
        operations required to achieve this (such as cleaning data
        caches).

        Next state:     CPU_DOWN
        Conditions:     local CPU teardown complete
        Trigger events: (spontaneous)

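The CPU-level rules above can be summarised as a small predicate. This
is a sketch under stated assumptions: the enum values are illustrative
rather than the constants used by the kernel, and
cpu_transition_allowed() is a hypothetical helper that does not exist in
the implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only; the kernel defines its own constants. */
enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };
enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };

/*
 * Return true if a CPU in state 'from' may move to state 'to', given
 * the current "cluster" state of its parent cluster.
 */
bool cpu_transition_allowed(enum cpu_state from, enum cpu_state to,
			    enum cluster_state parent)
{
	switch (from) {
	case CPU_DOWN:
		/* Conditions: none (power-up operation or hardware event). */
		return to == CPU_COMING_UP;
	case CPU_COMING_UP:
		/* The CPU must wait until its parent cluster is CLUSTER_UP. */
		return to == CPU_UP && parent == CLUSTER_UP;
	case CPU_UP:
		/* Conditions: none (explicit policy decision). */
		return to == CPU_GOING_DOWN;
	case CPU_GOING_DOWN:
		/* Taken once local CPU teardown is complete. */
		return to == CPU_DOWN;
	}
	return false;
}
```

The only transition that depends on anything outside the CPU itself is
CPU_COMING_UP to CPU_UP, which is why the cluster state is passed in.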

Cluster state
-------------

A cluster is a group of connected CPUs with some common resources.
Because a cluster contains multiple CPUs, it can be doing multiple
things at the same time. This has some implications. In particular, a
CPU can start up while another CPU is tearing the cluster down.

In this discussion, the "outbound side" is the view of the cluster state
as seen by a CPU tearing the cluster down. The "inbound side" is the
view of the cluster state as seen by a CPU setting the cluster up.

In order to enable safe coordination in such situations, it is important
that a CPU which is setting up the cluster can advertise its state
independently of the CPU which is tearing down the cluster. For this
reason, the cluster state is split into two parts:

        "cluster" state: The global state of the cluster; or the state
                on the outbound side:

                CLUSTER_DOWN
                CLUSTER_UP
                CLUSTER_GOING_DOWN

        "inbound" state: The state of the cluster on the inbound side.

                INBOUND_NOT_COMING_UP
                INBOUND_COMING_UP


        The different pairings of these states result in six possible
        states for the cluster as a whole:

                                    CLUSTER_UP
                  +==========> INBOUND_NOT_COMING_UP -------------+
                  #                                               |
                                                                  |
             CLUSTER_UP     <----+                                |
          INBOUND_COMING_UP      |                                v

                  ^     CLUSTER_GOING_DOWN               CLUSTER_GOING_DOWN
                  #      INBOUND_COMING_UP     <===     INBOUND_NOT_COMING_UP

            CLUSTER_DOWN         |                                |
          INBOUND_COMING_UP <----+                                |
                                                                  |
                  ^                                               |
                  +===========      CLUSTER_DOWN     <------------+
                               INBOUND_NOT_COMING_UP

        Transitions -----> can only be made by the outbound CPU, and
        only involve changes to the "cluster" state.

        Transitions ===##> can only be made by the inbound CPU, and only
        involve changes to the "inbound" state, except where there is no
        further transition possible on the outbound side (i.e., the
        outbound CPU has put the cluster into the CLUSTER_DOWN state).

        The race avoidance algorithm does not provide a way to determine
        which exact CPUs within the cluster play these roles. This must
        be decided in advance by some other means. Refer to the section
        "Last man and First man selection" for more explanation.

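        The arrows in the diagram can also be written down as a table of
        (from, to, role) entries. The following is a minimal sketch of
        that table, not the kernel's data structure: struct pair,
        struct edge, enum role and cluster_transition_allowed() are
        hypothetical names introduced for illustration, and the enum
        values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };
enum role { ROLE_INBOUND, ROLE_OUTBOUND };

/* One of the six combined states: <cluster state>/<inbound state>. */
struct pair { enum cluster_state cluster; enum inbound_state inbound; };

/* One arrow of the diagram, and the side allowed to make it. */
struct edge { struct pair from, to; enum role by; };

static const struct edge edges[] = {
	/* ===##> transitions, made by the inbound CPU */
	{ {CLUSTER_DOWN, INBOUND_NOT_COMING_UP},
	  {CLUSTER_DOWN, INBOUND_COMING_UP}, ROLE_INBOUND },
	{ {CLUSTER_DOWN, INBOUND_COMING_UP},	/* the stated exception: */
	  {CLUSTER_UP, INBOUND_COMING_UP}, ROLE_INBOUND }, /* "cluster" changes */
	{ {CLUSTER_UP, INBOUND_COMING_UP},
	  {CLUSTER_UP, INBOUND_NOT_COMING_UP}, ROLE_INBOUND },
	{ {CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP},
	  {CLUSTER_GOING_DOWN, INBOUND_COMING_UP}, ROLE_INBOUND },
	/* -----> transitions, made by the outbound CPU */
	{ {CLUSTER_UP, INBOUND_NOT_COMING_UP},
	  {CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP}, ROLE_OUTBOUND },
	{ {CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP},
	  {CLUSTER_DOWN, INBOUND_NOT_COMING_UP}, ROLE_OUTBOUND },
	{ {CLUSTER_GOING_DOWN, INBOUND_COMING_UP},
	  {CLUSTER_UP, INBOUND_COMING_UP}, ROLE_OUTBOUND },
	{ {CLUSTER_GOING_DOWN, INBOUND_COMING_UP},
	  {CLUSTER_DOWN, INBOUND_COMING_UP}, ROLE_OUTBOUND },
};

/* Return true if 'by' may move the cluster from 'from' to 'to'. */
bool cluster_transition_allowed(struct pair from, struct pair to,
				enum role by)
{
	size_t i;

	for (i = 0; i < sizeof(edges) / sizeof(edges[0]); i++) {
		const struct edge *e = &edges[i];

		if (e->by == by &&
		    e->from.cluster == from.cluster &&
		    e->from.inbound == from.inbound &&
		    e->to.cluster == to.cluster &&
		    e->to.inbound == to.inbound)
			return true;
	}
	return false;
}
```

        Every outbound entry changes only the "cluster" half of the
        pair, and every inbound entry except the one leaving
        CLUSTER_DOWN/INBOUND_COMING_UP changes only the "inbound" half.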

        CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
        cluster can actually be powered down.

        The parallelism of the inbound and outbound CPUs is observed by
        the existence of two different paths from CLUSTER_GOING_DOWN/
        INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
        model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
        COMING_UP in the basic model). The second path avoids cluster
        teardown completely.

        CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
        model. The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
        is trivial and merely resets the state machine ready for the
        next cycle.

        Details of the allowable transitions follow.

        The next state in each case is notated

                <cluster state>/<inbound state> (<transitioner>)

        where the <transitioner> is the side on which the transition
        can occur; either the inbound or the outbound side.


        CLUSTER_DOWN/INBOUND_NOT_COMING_UP:

                Next state:     CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
                Conditions:     none
                Trigger events:

                        a) an explicit hardware power-up operation, resulting
                           from a policy decision on another CPU;

                        b) a hardware event, such as an interrupt.


        CLUSTER_DOWN/INBOUND_COMING_UP:

                In this state, an inbound CPU sets up the cluster, including
                enabling of hardware coherency at the cluster level and any
                other operations (such as cache invalidation) which are required
                in order to achieve this.

                The purpose of this state is to do sufficient cluster-level
                setup to enable other CPUs in the cluster to enter coherency
                safely.

                Next state:     CLUSTER_UP/INBOUND_COMING_UP (inbound)
                Conditions:     cluster-level setup and hardware coherency complete
                Trigger events: (spontaneous)


        CLUSTER_UP/INBOUND_COMING_UP:

                Cluster-level setup is complete and hardware coherency is
                enabled for the cluster. Other CPUs in the cluster can safely
                enter coherency.

                This is a transient state, leading immediately to
                CLUSTER_UP/INBOUND_NOT_COMING_UP. All other CPUs on the cluster
                should treat these two states as equivalent.

                Next state:     CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
                Conditions:     none
                Trigger events: (spontaneous)


        CLUSTER_UP/INBOUND_NOT_COMING_UP:

                Cluster-level setup is complete and hardware coherency is
                enabled for the cluster. Other CPUs in the cluster can safely
                enter coherency.

                The cluster will remain in this state until a policy decision is
                made to power the cluster down.

                Next state:     CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
                Conditions:     none
                Trigger events: policy decision to power down the cluster


        CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:

                An outbound CPU is tearing the cluster down. The selected CPU
                must wait in this state until all CPUs in the cluster are in the
                CPU_DOWN state.

                When all CPUs are in the CPU_DOWN state, the cluster can be torn
                down, for example by cleaning data caches and exiting
                cluster-level coherency.

                To avoid wasteful unnecessary teardown operations, the outbound
                CPU should check the inbound cluster state for asynchronous
                transitions to INBOUND_COMING_UP. Alternatively, individual
                CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.


                Next states:

                CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
                        Conditions:     cluster torn down and ready to power off
                        Trigger events: (spontaneous)

                CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
                        Conditions:     none
                        Trigger events:

                                a) an explicit hardware power-up operation,
                                   resulting from a policy decision on another
                                   CPU;

                                b) a hardware event, such as an interrupt.


        CLUSTER_GOING_DOWN/INBOUND_COMING_UP:

                The cluster is (or was) being torn down, but another CPU has
                come online in the meantime and is trying to set up the cluster
                again.

                If the outbound CPU observes this state, it has two choices:

                        a) back out of teardown, restoring the cluster to the
                           CLUSTER_UP state;

                        b) finish tearing the cluster down and put the cluster
                           in the CLUSTER_DOWN state; the inbound CPU will
                           set up the cluster again from there.

                Choice (a) permits the removal of some latency by avoiding
                unnecessary teardown and setup operations in situations where
                the cluster is not really going to be powered down.


                Next states:

                CLUSTER_UP/INBOUND_COMING_UP (outbound)
                        Conditions:     cluster-level setup and hardware
                                        coherency complete
                        Trigger events: (spontaneous)

                CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
                        Conditions:     cluster torn down and ready to power off
                        Trigger events: (spontaneous)

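The outbound side of the two CLUSTER_GOING_DOWN states can be sketched
as a single polling step. This is a simplified, single-threaded model
and not the kernel's code: struct cluster_model, enum outbound_result,
outbound_poll() and NR_CPUS are hypothetical, the memory barriers and
cache maintenance a real implementation needs are omitted, and the
sketch assumes that the outbound CPU excludes itself from the CPU_DOWN
check because it is still running. It takes choice (a) whenever it sees
INBOUND_COMING_UP.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* hypothetical number of CPUs in the cluster */

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };
enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

struct cluster_model {
	enum cpu_state cpu[NR_CPUS];
	enum cluster_state cluster;
	enum inbound_state inbound;
};

enum outbound_result { OUTBOUND_DONE, OUTBOUND_ABORTED, OUTBOUND_WAIT };

/* One polling step by the outbound CPU 'self'. */
enum outbound_result outbound_poll(struct cluster_model *c, int self)
{
	int i;

	/* CLUSTER_UP -> CLUSTER_GOING_DOWN is an outbound-side transition. */
	c->cluster = CLUSTER_GOING_DOWN;

	/* An inbound CPU has appeared: take choice (a) and back out. */
	if (c->inbound == INBOUND_COMING_UP) {
		c->cluster = CLUSTER_UP;
		return OUTBOUND_ABORTED;
	}

	/* Every other CPU must reach CPU_DOWN before teardown starts. */
	for (i = 0; i < NR_CPUS; i++)
		if (i != self && c->cpu[i] != CPU_DOWN)
			return OUTBOUND_WAIT;

	/* Cluster-level teardown (cache clean, coherency exit) goes here. */
	c->cluster = CLUSTER_DOWN;
	return OUTBOUND_DONE;
}
```

A caller would repeat outbound_poll() while it returns OUTBOUND_WAIT.
OUTBOUND_ABORTED leaves the model in CLUSTER_UP/INBOUND_COMING_UP, and
OUTBOUND_DONE leaves it in CLUSTER_DOWN/INBOUND_NOT_COMING_UP.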

Last man and First man selection
--------------------------------

The CPU which performs cluster tear-down operations on the outbound side
is commonly referred to as the "last man".

The CPU which performs cluster setup on the inbound side is commonly
referred to as the "first man".

The race avoidance algorithm documented above does not provide a
mechanism to choose which CPUs should play these roles.


Last man:

When shutting down the cluster, all the CPUs involved are initially
executing Linux and hence coherent. Therefore, ordinary spinlocks can
be used to select a last man safely, before the CPUs become
non-coherent.

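One possible way to select a last man under a lock is a per-cluster use
count, sketched below. This is an assumption made for illustration, not
a description of any particular platform's code: struct demo_spinlock
is a minimal user-space stand-in for a Linux spinlock built on C11
atomic_flag, and struct cluster_usage and cpu_is_last_man() are
hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal test-and-set lock; stands in for a Linux spinlock. */
struct demo_spinlock { atomic_flag locked; };

void demo_spin_lock(struct demo_spinlock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->locked,
						 memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

void demo_spin_unlock(struct demo_spinlock *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

struct cluster_usage {
	struct demo_spinlock lock;
	unsigned int cpus_in_use;	/* CPUs not yet committed to power-down */
};

/*
 * Called by a still-coherent CPU that has decided to power down.
 * Returns true for exactly one caller: the one that drops the use
 * count to zero, which therefore becomes the last man.
 */
bool cpu_is_last_man(struct cluster_usage *u)
{
	bool last;

	demo_spin_lock(&u->lock);
	last = (--u->cpus_in_use == 0);
	demo_spin_unlock(&u->lock);
	return last;
}
```

The lock is only safe to use here because every caller is still
coherent at this point; the same approach cannot be used on the inbound
side, as described next.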

First man:

Because CPUs may power up asynchronously in response to external wake-up
events, a dynamic mechanism is needed to make sure that only one CPU
attempts to play the first man role and do the cluster-level
initialisation: any other CPUs must wait for this to complete before
proceeding.

Cluster-level initialisation may involve actions such as configuring
coherency controls in the bus fabric.

The current implementation in mcpm_head.S uses a separate mutual exclusion
mechanism to do this arbitration. This mechanism is documented in
detail in vlocks.txt.


Features and Limitations
------------------------

Implementation:

        The current ARM-based implementation is split between
        arch/arm/common/mcpm_head.S (low-level inbound CPU operations) and
        arch/arm/common/mcpm_entry.c (everything else):

        __mcpm_cpu_going_down() signals the transition of a CPU to the
                CPU_GOING_DOWN state.

        __mcpm_cpu_down() signals the transition of a CPU to the CPU_DOWN
                state.

        A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
                low-level power-up code in mcpm_head.S. This could
                involve CPU-specific setup code, but in the current
                implementation it does not.

        __mcpm_outbound_enter_critical() and __mcpm_outbound_leave_critical()
                handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN
                and from there to CLUSTER_DOWN or back to CLUSTER_UP (in
                the case of an aborted cluster power-down).

                These functions are more complex than the __mcpm_cpu_*()
                functions due to the extra inter-CPU coordination which
                is needed for safe transitions at the cluster level.

        A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
                the low-level power-up code in mcpm_head.S. This
                typically involves platform-specific setup code,
                provided by the platform-specific power_up_setup
                function registered via mcpm_sync_init.

Deep topologies:

        As currently described and implemented, the algorithm does not
        support CPU topologies involving more than two levels (i.e.,
        clusters of clusters are not supported). The algorithm could be
        extended by replicating the cluster-level states for the
        additional topological levels, and modifying the transition
        rules for the intermediate (non-outermost) cluster levels.


Colophon
--------

Originally created and documented by Dave Martin for Linaro Limited, in
collaboration with Nicolas Pitre and Achin Gupta.

Copyright (C) 2012-2013 Linaro Limited
Distributed under the terms of Version 2 of the GNU General Public
License, as defined in linux/COPYING.