Interaction of Suspend code (S3) with the CPU hotplug infrastructure

     (C) 2011 - 2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>


I. How does the regular CPU hotplug code differ from how the Suspend-to-RAM
   infrastructure uses it internally? And where do they share common code?

Well, a picture is worth a thousand words... So ASCII art follows :-)

[This depicts the current design in the kernel, and focusses only on the
interactions involving the freezer and CPU hotplug and also tries to explain
the locking involved. It outlines the notifications involved as well.
But please note that here, only the call paths are illustrated, with the aim
of describing where they take different paths and where they share code.
What happens when regular CPU hotplug and Suspend-to-RAM race with each other
is not depicted here.]

On a high level, the suspend-resume cycle goes like this:

|Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw |
|tasks |    |     cpus      |    |          |    |     cpus     |    |tasks|


More details follow:

                                Suspend call path
                                -----------------

                                  Write 'mem' to
                                /sys/power/state
                                    sysfs file
                                        |
                                        v
                               Acquire pm_mutex lock
                                        |
                                        v
                             Send PM_SUSPEND_PREPARE
                                   notifications
                                        |
                                        v
                                   Freeze tasks
                                        |
                                        |
                                        v
                              disable_nonboot_cpus()
                                   /* start */
                                        |
                                        v
                            Acquire cpu_add_remove_lock
                                        |
                                        v
                             Iterate over CURRENTLY
                                   online CPUs
                                        |
                                        |
                                        |                ----------
                                        v                          | L
             ======>               _cpu_down()                     |
            |              [This takes cpuhotplug.lock             |
  Common    |               before taking down the CPU             |
   code     |               and releases it when done]             | O
            |            While it is at it, notifications          |
            |            are sent when notable events occur,       |
             ======>     by running all registered callbacks.      |
                                        |                          | O
                                        |                          |
                                        |                          |
                                        v                          |
                            Note down these cpus in                | P
                                frozen_cpus mask         ----------
                                        |
                                        v
                           Disable regular cpu hotplug
                        by setting cpu_hotplug_disabled=1
                                        |
                                        v
                            Release cpu_add_remove_lock
                                        |
                                        v
                       /* disable_nonboot_cpus() complete */
                                        |
                                        v
                                   Do suspend
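
The disable_nonboot_cpus() portion of the call path above can be sketched as a
small userspace model. This is only an illustration under stated assumptions,
not the kernel's implementation: the arrays stand in for the kernel's CPU
masks, CPU 0 is assumed to be the boot CPU, NR_CPUS is an arbitrary small
value, and the lock is reduced to comments because the model is
single-threaded.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NR_CPUS 8   /* arbitrary size for this model */

/* Hypothetical stand-ins for kernel state; CPUs 0-3 start out online. */
static bool cpu_online_map[NR_CPUS] = { true, true, true, true };
static bool frozen_cpus[NR_CPUS];
static int cpu_hotplug_disabled;

/* Common code: take one CPU offline. tasks_frozen is 1 on the suspend path. */
static int _cpu_down(int cpu, int tasks_frozen)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    (void)tasks_frozen;   /* would select the *_FROZEN notifications */
    if (!cpu_online_map[cpu])
        return -EINVAL;
    cpu_online_map[cpu] = false;
    return 0;
}

/* Offline every online CPU except the boot CPU (assumed to be CPU 0). */
static int disable_nonboot_cpus(void)
{
    int cpu, error = 0;

    /* Acquire cpu_add_remove_lock (omitted: single-threaded model) */
    for (cpu = 1; cpu < NR_CPUS; cpu++) {
        if (!cpu_online_map[cpu])
            continue;                 /* only CURRENTLY online CPUs */
        error = _cpu_down(cpu, 1);
        if (error)
            break;
        frozen_cpus[cpu] = true;      /* note down in frozen_cpus mask */
    }
    if (!error)
        cpu_hotplug_disabled = 1;     /* disable regular cpu hotplug */
    /* Release cpu_add_remove_lock */
    return error;
}
```

After a successful call, only CPU 0 remains online, CPUs 1-3 are recorded in
frozen_cpus, and cpu_hotplug_disabled is 1.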



Resume follows the same pattern in reverse, with the counterparts being (in
the order of execution during resume):
* enable_nonboot_cpus() which involves:
   |  Acquire cpu_add_remove_lock
   |  Reset cpu_hotplug_disabled to 0, thereby enabling regular cpu hotplug
   |  Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop]
   |  Release cpu_add_remove_lock
   v

* thaw tasks
* send PM_POST_SUSPEND notifications
* Release pm_mutex lock.
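
The enable_nonboot_cpus() steps listed above can be sketched in the same
spirit. Again this is a hedged userspace model rather than kernel code: the
initial state (only CPU 0 online, CPUs 1-3 in frozen_cpus) is an assumed
post-suspend snapshot, and clearing each frozen_cpus entry as the CPU comes
back is a simplification chosen for this model.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8   /* arbitrary size for this model */

/* Assumed state right after suspend: only the boot CPU (CPU 0) is online. */
static bool cpu_online_map[NR_CPUS] = { true };
static bool frozen_cpus[NR_CPUS]    = { false, true, true, true };
static int cpu_hotplug_disabled     = 1;

/* Common code: bring one CPU online. tasks_frozen is 1 on the resume path. */
static int _cpu_up(int cpu, int tasks_frozen)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    (void)tasks_frozen;   /* would select the *_FROZEN notifications */
    cpu_online_map[cpu] = true;
    return 0;
}

static void enable_nonboot_cpus(void)
{
    int cpu;

    /* Acquire cpu_add_remove_lock (omitted: single-threaded model) */
    cpu_hotplug_disabled = 0;         /* regular cpu hotplug allowed again */
    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!frozen_cpus[cpu])
            continue;                 /* only CPUs noted during suspend */
        if (_cpu_up(cpu, 1) == 0)
            frozen_cpus[cpu] = false;
    }
    /* Release cpu_add_remove_lock */
}
```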


Note that the pm_mutex lock is acquired at the very beginning, when the
suspend operation is just starting, and is released only after the entire
cycle is complete (i.e., suspend + resume).



                          Regular CPU hotplug call path
                          -----------------------------

                                Write 0 (or 1) to
                       /sys/devices/system/cpu/cpu*/online
                                    sysfs file
                                        |
                                        |
                                        v
                                    cpu_down()
                                        |
                                        v
                           Acquire cpu_add_remove_lock
                                        |
                                        v
                          If cpu_hotplug_disabled is 1
                                return gracefully
                                        |
                                        |
                                        v
             ======>                _cpu_down()
            |              [This takes cpuhotplug.lock
  Common    |               before taking down the CPU
   code     |               and releases it when done]
            |            While it is at it, notifications
            |           are sent when notable events occur,
             ======>    by running all registered callbacks.
                                        |
                                        |
                                        v
                          Release cpu_add_remove_lock
                               [That's it!, for
                              regular CPU hotplug]
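
The regular hotplug path above can be sketched as follows. This is a userspace
model under assumptions, not the kernel source: the diagram only says "return
gracefully", so the -EBUSY return value used here is an assumption of this
sketch, and the lock is reduced to comments.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NR_CPUS 4   /* arbitrary size for this model */

/* Hypothetical stand-ins for kernel state; all CPUs start out online. */
static bool cpu_online_map[NR_CPUS] = { true, true, true, true };
static int cpu_hotplug_disabled;    /* set to 1 by the suspend path */

/* Common code shared with the suspend path. */
static int _cpu_down(int cpu, int tasks_frozen)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    (void)tasks_frozen;
    if (!cpu_online_map[cpu])
        return -EINVAL;
    cpu_online_map[cpu] = false;
    return 0;
}

/* Entry point reached from the sysfs 'online' file in this model. */
static int cpu_down(int cpu)
{
    int err;

    /* Acquire cpu_add_remove_lock (omitted: single-threaded model) */
    if (cpu_hotplug_disabled) {
        err = -EBUSY;               /* assumed code for "return gracefully" */
        goto out;
    }
    err = _cpu_down(cpu, 0);        /* regular hotplug: tasks_frozen = 0 */
out:
    /* Release cpu_add_remove_lock */
    return err;
}
```

With cpu_hotplug_disabled set, the request is refused before the common code
is reached, which is how the suspend path keeps regular hotplug out while the
system is suspended.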



So, as can be seen from the two diagrams (the parts marked as "Common code"),
regular CPU hotplug and the suspend code path converge at the _cpu_down() and
_cpu_up() functions. They differ in the arguments passed to these functions,
in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen'
argument. But during suspend, since the tasks are already frozen by the time
the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called
with the 'tasks_frozen' argument set to 1.
[See below for some known issues regarding this.]
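
The effect of the 'tasks_frozen' argument on the notification that callbacks
receive can be sketched like this. The constant names follow the notifier
events mentioned in this document; the numeric values are modelled on the
notifier-era include/linux/cpu.h but should be treated as illustrative, and
hotplug_event() and event_is_frozen() are hypothetical helpers introduced only
for this example.

```c
#include <assert.h>

/* Illustrative values; see the note above about their provenance. */
#define CPU_ONLINE        0x0002
#define CPU_DEAD          0x0007
#define CPU_TASKS_FROZEN  0x0010
#define CPU_ONLINE_FROZEN (CPU_ONLINE | CPU_TASKS_FROZEN)
#define CPU_DEAD_FROZEN   (CPU_DEAD | CPU_TASKS_FROZEN)

/* Hypothetical helper: derive the event sent to registered callbacks. */
static unsigned long hotplug_event(unsigned long base_event, int tasks_frozen)
{
    unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;

    assert((base_event & CPU_TASKS_FROZEN) == 0);
    return base_event | mod;
}

/* Hypothetical helper: a callback can test whether tasks were frozen. */
static int event_is_frozen(unsigned long event)
{
    return (event & CPU_TASKS_FROZEN) != 0;
}
```

A callback that needs different behaviour during suspend/resume can branch on
the frozen bit, which is why passing the wrong 'tasks_frozen' value leads to
the wrong code being run.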


Important files and functions/entry points:
------------------------------------------

kernel/power/process.c : freeze_processes(), thaw_processes()
kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish()
kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus()



II. What are the issues involved in CPU hotplug?
    -------------------------------------------

There are some interesting situations involving CPU hotplug and microcode
update on the CPUs, as discussed below:

[Please bear in mind that the kernel requests the microcode images from
userspace, using the request_firmware() function defined in
drivers/base/firmware_class.c]


a. When all the CPUs are identical:

   This is the most common situation and it is quite straightforward: we want
   to apply the same microcode revision to each of the CPUs.
   To take x86 as an example, the collect_cpu_info() function defined in
   arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU
   and thereby in applying the correct microcode revision to it.
   But note that the kernel does not maintain a common microcode image for
   all the CPUs, in order to handle case 'b' described below.


b. When some of the CPUs are different from the rest:

   In this case, since we probably need to apply different microcode revisions
   to different CPUs, the kernel maintains a copy of the correct microcode
   image for each CPU (after appropriate CPU type/model discovery using
   functions such as collect_cpu_info()).


c. When a CPU is physically hot-unplugged and a new (and possibly different
   type of) CPU is hot-plugged into the system:

   In the current design of the kernel, whenever a CPU is taken offline during
   a regular CPU hotplug operation, upon receiving the CPU_DEAD notification
   (which is sent by the CPU hotplug code), the microcode update driver's
   callback for that event reacts by freeing the kernel's copy of the
   microcode image for that CPU.

   Hence, when a new CPU is brought online, since the kernel finds that it
   doesn't have the microcode image, it does the CPU type/model discovery
   afresh and then requests the appropriate microcode image for that CPU from
   userspace, which is subsequently applied.

   For example, in x86, the mc_cpu_callback() function (which is the microcode
   update driver's callback registered for CPU hotplug events) calls
   microcode_update_cpu() which would call microcode_init_cpu() in this case,
   instead of microcode_resume_cpu(), when it finds that the kernel doesn't
   have a valid microcode image. This ensures that the CPU type/model
   discovery is performed and the right microcode is applied to the CPU after
   getting it from userspace.


d. Handling microcode update during suspend/hibernate:

   Strictly speaking, during a CPU hotplug operation which does not involve
   physically removing or inserting CPUs, the CPUs are not actually powered
   off during a CPU offline. They are just put to the lowest C-states possible.
   Hence, in such a case, it is not really necessary to re-apply microcode
   when the CPUs are brought back online, since they wouldn't have lost the
   image during the CPU offline operation.

   This is the usual scenario encountered during a resume after a suspend.
   However, in the case of hibernation, since all the CPUs are completely
   powered off, during restore it becomes necessary to apply the microcode
   images to all the CPUs.

   [Note that we don't expect someone to physically pull out nodes and insert
   nodes with a different type of CPUs in-between a suspend-resume or a
   hibernate/restore cycle.]

   In the current design of the kernel, however, during a CPU offline
   operation as part of the suspend/hibernate cycle (the CPU_DEAD_FROZEN
   notification), the existing copy of the microcode image in the kernel is
   not freed up. And during the CPU online operations (during resume/restore),
   since the kernel finds that it already has copies of the microcode images
   for all the CPUs, it just applies them to the CPUs, avoiding any
   re-discovery of CPU type/model and the need for validating whether the
   microcode revisions are right for the CPUs or not (due to the above
   assumption that physical CPU hotplug will not be done in-between
   suspend/resume or hibernate/restore cycles).
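
The difference between cases 'c' and 'd' can be sketched with a small userspace
model of the microcode driver's hotplug callback. Everything here is
hypothetical and named with a _model suffix to make that clear: the event
enum, the have_image[] array (standing in for the kernel's per-CPU copy of the
microcode image) and the returned action codes are inventions of this sketch,
not the driver's real interface.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4   /* arbitrary size for this model */

/* Hypothetical event and action codes for this model only. */
enum hp_event { EV_ONLINE, EV_ONLINE_FROZEN, EV_DEAD, EV_DEAD_FROZEN };
enum ucode_action { UCODE_NONE, UCODE_INIT, UCODE_RESUME };

/* Stand-in for "the kernel holds a copy of the image for this CPU". */
static bool have_image[NR_CPUS];

/* Models the choice between the init path and the resume path. */
static enum ucode_action microcode_update_cpu_model(int cpu)
{
    if (have_image[cpu])
        return UCODE_RESUME;   /* re-apply the cached image, no discovery */
    have_image[cpu] = true;    /* discovery + image fetched from userspace */
    return UCODE_INIT;
}

static enum ucode_action mc_cpu_callback_model(enum hp_event event, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    switch (event) {
    case EV_ONLINE:
    case EV_ONLINE_FROZEN:
        return microcode_update_cpu_model(cpu);
    case EV_DEAD:
        have_image[cpu] = false;   /* regular offline: free the image */
        return UCODE_NONE;
    case EV_DEAD_FROZEN:
    default:
        return UCODE_NONE;         /* suspend/hibernate: keep the image */
    }
}
```

In this model an offline/online pair during suspend ends on the resume path,
while a regular offline/online pair goes through discovery again.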


III. Are there any known problems when regular CPU hotplug and suspend race
     with each other?

Yes, they are listed below:

1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to
   the _cpu_down() and _cpu_up() functions is *always* 0.
   This might not reflect the true current state of the system, since the
   tasks could have been frozen by an out-of-band event such as a suspend
   operation in progress. Hence, it will lead to wrong notifications being
   sent during the cpu online/offline events (e.g., CPU_ONLINE notification
   instead of CPU_ONLINE_FROZEN) which in turn will lead to execution of
   inappropriate code by the callbacks registered for such CPU hotplug events.

2. If a regular CPU hotplug stress test happens to race with the freezer due
   to a suspend operation in progress at the same time, then we could hit the
   situation described below:

    * A regular cpu online operation continues its journey from userspace
      into the kernel, since the freezing has not yet begun.
    * Then the freezer gets to work and freezes userspace.
    * If cpu online has not yet completed the microcode update stuff by now,
      it will now start waiting on the frozen userspace in the
      TASK_UNINTERRUPTIBLE state, in order to get the microcode image.
    * Now the freezer continues and tries to freeze the remaining tasks. But
      due to this wait mentioned above, the freezer won't be able to freeze
      the cpu online hotplug task and hence freezing of tasks fails.

   As a result of this task freezing failure, the suspend operation gets
   aborted.
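
The race in item 2 can be sketched as a toy model of the freezer. This is a
loose illustration under assumptions, not the freezer's actual algorithm: the
task states, the task names, freeze_tasks_model() and suspend_race_model() are
all hypothetical, and the -EBUSY return value for a failed freeze is an
assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task states for this model only. */
enum task_state_model { RUNNABLE, UNINTERRUPTIBLE_WAIT, FROZEN };

struct task_model {
    const char *name;
    enum task_state_model state;
};

/* Freeze every task that can be frozen; fail if any task is stuck waiting. */
static int freeze_tasks_model(struct task_model *tasks, size_t n)
{
    size_t i, todo = 0;

    assert(tasks != NULL);
    for (i = 0; i < n; i++) {
        if (tasks[i].state == UNINTERRUPTIBLE_WAIT) {
            todo++;              /* cannot be frozen while it waits */
            continue;
        }
        tasks[i].state = FROZEN;
    }
    return todo ? -EBUSY : 0;    /* assumed error code */
}

/* One userspace helper plus the task performing the cpu online operation. */
static int suspend_race_model(bool online_waits_for_microcode)
{
    struct task_model tasks[] = {
        { "firmware helper (userspace)", RUNNABLE },
        { "cpu online writer",
          online_waits_for_microcode ? UNINTERRUPTIBLE_WAIT : RUNNABLE },
    };

    return freeze_tasks_model(tasks, sizeof(tasks) / sizeof(tasks[0]));
}
```

When the cpu online task is already waiting for the microcode image, the
userspace helper gets frozen, the waiter never wakes up, and the freeze (and
hence the suspend) fails in this model.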