Commit | Line | Data |
---|---|---|
0c87f9b5 PM |
1 | NO_HZ: Reducing Scheduling-Clock Ticks |
2 | ||
3 | ||
4 | This document describes Kconfig options and boot parameters that can | |
5 | reduce the number of scheduling-clock interrupts, thereby improving energy | |
6 | efficiency and reducing OS jitter. Reducing OS jitter is important for | |
7 | some types of computationally intensive high-performance computing (HPC) | |
8 | applications and for real-time applications. | |
9 | ||
295fde89 PM |
10 | There are three main ways of managing scheduling-clock interrupts |
11 | (also known as "scheduling-clock ticks" or simply "ticks"): | |
0c87f9b5 | 12 | |
295fde89 PM |
13 | 1. Never omit scheduling-clock ticks (CONFIG_HZ_PERIODIC=y or |
14 | CONFIG_NO_HZ=n for older kernels). You normally will -not- | |
15 | want to choose this option. | |
0c87f9b5 | 16 | |
295fde89 PM |
17 | 2. Omit scheduling-clock ticks on idle CPUs (CONFIG_NO_HZ_IDLE=y or |
18 | CONFIG_NO_HZ=y for older kernels). This is the most common | |
19 | approach, and should be the default. | |
0c87f9b5 | 20 | |
295fde89 PM |
21 | 3. Omit scheduling-clock ticks on CPUs that are either idle or that |
22 | have only one runnable task (CONFIG_NO_HZ_FULL=y). Unless you | |
23 | are running realtime applications or certain types of HPC | |
24 | workloads, you will normally -not- want this option. | |
25 | ||
26 | These three cases are described in the following three sections, followed | |
8bdf7a25 PM |
27 | by a fourth section on RCU-specific considerations, a fifth section |
28 | discussing testing, and a sixth and final section listing known issues. | |
0c87f9b5 PM |
29 | |
30 | ||
295fde89 PM |
31 | NEVER OMIT SCHEDULING-CLOCK TICKS |
32 | ||
33 | Very old versions of Linux from the 1990s and the very early 2000s | |
34 | are incapable of omitting scheduling-clock ticks. It turns out that | |
35 | there are some situations where this old-school approach is still the | |
36 | right approach, for example, in heavy workloads with lots of tasks | |
37 | that use short bursts of CPU, where there are very frequent idle | |
38 | periods, but where these idle periods are also quite short (tens or | |
39 | hundreds of microseconds). For these types of workloads, | |
40 | scheduling-clock interrupts will normally be delivered anyway because | |
41 | there will frequently be multiple runnable tasks per CPU. In these cases, | |
42 | attempting to turn off the scheduling-clock interrupt will have no effect | |
43 | other than increasing the overhead of switching to and from idle and | |
44 | transitioning between user and kernel execution. | |
45 | ||
46 | This mode of operation can be selected using CONFIG_HZ_PERIODIC=y (or | |
47 | CONFIG_NO_HZ=n for older kernels). | |
48 | ||
49 | However, if you are instead running a light workload with long idle | |
50 | periods, failing to omit scheduling-clock interrupts will result in | |
51 | excessive power consumption. This is especially bad on battery-powered | |
52 | devices, where it results in extremely short battery lifetimes. If you | |
53 | are running light workloads, you should therefore read the following | |
54 | section. | |
55 | ||
56 | In addition, if you are running either a real-time workload or an HPC | |
57 | workload with short iterations, the scheduling-clock interrupts can | |
58 | degrade your application's performance. If this describes your workload, | |
59 | you should read the following two sections. | |
60 | ||
61 | ||
62 | OMIT SCHEDULING-CLOCK TICKS FOR IDLE CPUs | |
0c87f9b5 PM |
63 | |
64 | If a CPU is idle, there is little point in sending it a scheduling-clock | |
65 | interrupt. After all, the primary purpose of a scheduling-clock interrupt | |
66 | is to force a busy CPU to shift its attention among multiple duties, | |
67 | and an idle CPU has no duties to shift its attention among. | |
68 | ||
69 | The CONFIG_NO_HZ_IDLE=y Kconfig option causes the kernel to avoid sending | |
70 | scheduling-clock interrupts to idle CPUs, which is critically important | |
71 | both to battery-powered devices and to highly virtualized mainframes. | |
72 | A battery-powered device running a CONFIG_HZ_PERIODIC=y kernel would | |
73 | drain its battery very quickly, easily 2-3 times as fast as would the | |
74 | same device running a CONFIG_NO_HZ_IDLE=y kernel. A mainframe running | |
75 | 1,500 OS instances might find that half of its CPU time was consumed by | |
76 | unnecessary scheduling-clock interrupts. In these situations, there | |
77 | is strong motivation to avoid sending scheduling-clock interrupts to | |
78 | idle CPUs. That said, dyntick-idle mode is not free: | |
79 | ||
80 | 1. It increases the number of instructions executed on the path | |
81 | to and from the idle loop. | |
82 | ||
83 | 2. On many architectures, dyntick-idle mode also increases the | |
84 | number of expensive clock-reprogramming operations. | |
85 | ||
86 | Therefore, systems with aggressive real-time response constraints often | |
87 | run CONFIG_HZ_PERIODIC=y kernels (or CONFIG_NO_HZ=n for older kernels) | |
88 | in order to avoid degrading from-idle transition latencies. | |
89 | ||
90 | An idle CPU that is not receiving scheduling-clock interrupts is said to | |
91 | be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running | |
92 | tickless". The remainder of this document will use "dyntick-idle mode". | |
93 | ||
94 | There is also a boot parameter "nohz=" that can be used to disable | |
95 | dyntick-idle mode in CONFIG_NO_HZ_IDLE=y kernels by specifying "nohz=off". | |
96 | By default, CONFIG_NO_HZ_IDLE=y kernels boot with "nohz=on", enabling | |
97 | dyntick-idle mode. | |
98 | ||
99 | ||
295fde89 | 100 | OMIT SCHEDULING-CLOCK TICKS FOR CPUs WITH ONLY ONE RUNNABLE TASK |
0c87f9b5 PM |
101 | |
102 | If a CPU has only one runnable task, there is little point in sending it | |
103 | a scheduling-clock interrupt because there is no other task to switch to. | |
295fde89 PM |
104 | Note that omitting scheduling-clock ticks for CPUs with only one runnable |
105 | task implies also omitting them for idle CPUs. | |
0c87f9b5 PM |
106 | |
107 | The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid | |
108 | sending scheduling-clock interrupts to CPUs with a single runnable task, | |
109 | and such CPUs are said to be "adaptive-ticks CPUs". This is important | |
110 | for applications with aggressive real-time response constraints because | |
111 | it allows them to improve their worst-case response times by the maximum | |
112 | duration of a scheduling-clock interrupt. It is also important for | |
113 | computationally intensive short-iteration workloads: If any CPU is | |
114 | delayed during a given iteration, all the other CPUs will be forced to | |
115 | wait idle while the delayed CPU finishes. Thus, the delay is multiplied | |
116 | by one less than the number of CPUs. In these situations, there is | |
117 | again strong motivation to avoid sending scheduling-clock interrupts. | |
118 | ||
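As a rough illustration of the "multiplied by one less than the number of CPUs" remark above, the following Python sketch estimates how much CPU time is lost when one CPU is delayed. It assumes that every iteration ends in a barrier, so that all other CPUs wait for the full delay. The function name and the figures used (a 10-microsecond delay, 64 CPUs) are made up for illustration and are not measurements.

```python
def idle_cpu_time(delay_us, ncpus):
    """CPU time (in microseconds) spent waiting by the other CPUs.

    Assumes a barrier at the end of each iteration, so each of the
    remaining (ncpus - 1) CPUs sits idle for the whole delay.
    """
    if ncpus < 1:
        raise ValueError("need at least one CPU")
    return delay_us * (ncpus - 1)


# Hypothetical figures: one CPU of a 64-CPU job is delayed by 10 microseconds.
print(idle_cpu_time(10, 64))  # 630
```

Under these assumptions, a delay that looks negligible on one CPU costs 63 times as much across the whole job, which is why short-iteration workloads care about even brief scheduling-clock interrupts.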
119 | By default, no CPU will be an adaptive-ticks CPU. The "nohz_full=" | |
120 | boot parameter specifies the adaptive-ticks CPUs. For example, | |
121 | "nohz_full=1,6-8" says that CPUs 1, 6, 7, and 8 are to be adaptive-ticks | |
122 | CPUs. Note that you are prohibited from marking all of the CPUs as | |
123 | adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain | |
8bdf7a25 PM |
124 | online to handle timekeeping tasks in order to ensure that system |
125 | calls like gettimeofday() return accurate values on adaptive-tick CPUs. | |
126 | (This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no running | |
127 | user processes to observe slight drifts in clock rate.) Therefore, the | |
128 | boot CPU is prohibited from entering adaptive-ticks mode. Specifying a | |
129 | "nohz_full=" mask that includes the boot CPU will result in a boot-time | |
130 | error message, and the boot CPU will be removed from the mask. Note that | |
131 | this means that your system must have at least two CPUs in order for | |
132 | CONFIG_NO_HZ_FULL=y to do anything for you. | |
0c87f9b5 PM |
133 | |
134 | Alternatively, the CONFIG_NO_HZ_FULL_ALL=y Kconfig parameter specifies | |
135 | that all CPUs other than the boot CPU are adaptive-ticks CPUs. This | |
136 | Kconfig parameter will be overridden by the "nohz_full=" boot parameter, | |
137 | so that if both the CONFIG_NO_HZ_FULL_ALL=y Kconfig parameter and | |
138 | the "nohz_full=1" boot parameter are specified, the boot parameter will | |
139 | prevail so that only CPU 1 will be an adaptive-ticks CPU. | |
140 | ||
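The selection rules described above can be modeled with a small Python sketch. This is not the kernel's implementation; it only mirrors the rules stated in the text. The function names are hypothetical, the boot CPU is assumed to be CPU 0, and the parser handles only the comma-and-range form shown in the example ("1,6-8").

```python
def parse_cpu_list(spec):
    """Parse a CPU list such as "1,6-8" into a sorted list of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def adaptive_tick_cpus(ncpus, nohz_full=None, full_all=False, boot_cpu=0):
    """Return the CPUs that would be adaptive-ticks CPUs under the rules above.

    nohz_full: value of the "nohz_full=" boot parameter, or None if absent.
    full_all:  True models a kernel built with CONFIG_NO_HZ_FULL_ALL=y.
    boot_cpu:  assumed to be CPU 0; it is always removed from the mask.
    """
    if nohz_full is not None:
        # The boot parameter overrides the Kconfig default.
        cpus = parse_cpu_list(nohz_full)
    elif full_all:
        cpus = list(range(ncpus))
    else:
        # By default, no CPU is an adaptive-ticks CPU.
        cpus = []
    return [c for c in cpus if c != boot_cpu]


print(adaptive_tick_cpus(16, nohz_full="1,6-8"))            # [1, 6, 7, 8]
print(adaptive_tick_cpus(4, nohz_full="1", full_all=True))  # [1]
print(adaptive_tick_cpus(4, full_all=True))                 # [1, 2, 3]
```

The second call shows the override: even though the model assumes CONFIG_NO_HZ_FULL_ALL=y, only CPU 1 is selected because "nohz_full=1" was given.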
141 | Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded. | |
142 | This is covered in the "RCU IMPLICATIONS" section below. | |
143 | ||
144 | Normally, a CPU remains in adaptive-ticks mode as long as possible. | |
145 | In particular, transitioning to kernel mode does not automatically change | |
146 | the mode. Instead, the CPU will exit adaptive-ticks mode only if needed, | |
147 | for example, if that CPU enqueues an RCU callback. | |
148 | ||
149 | Just as with dyntick-idle mode, the benefits of adaptive-tick mode do | |
150 | not come for free: | |
151 | ||
152 | 1. CONFIG_NO_HZ_FULL selects CONFIG_NO_HZ_COMMON, so you cannot run | |
153 | adaptive ticks without also running dyntick idle. This dependency | |
154 | extends down into the implementation, so that all of the costs | |
155 | of CONFIG_NO_HZ_IDLE are also incurred by CONFIG_NO_HZ_FULL. | |
156 | ||
157 | 2. The user/kernel transitions are slightly more expensive due | |
158 | to the need to inform kernel subsystems (such as RCU) about | |
159 | the change in mode. | |
160 | ||
161 | 3. POSIX CPU timers on adaptive-tick CPUs may miss their deadlines | |
162 | (perhaps indefinitely) because they currently rely on | |
163 | scheduling-tick interrupts. This will likely be fixed in | |
164 | one of two ways: (1) Prevent CPUs with POSIX CPU timers from | |
165 | entering adaptive-tick mode, or (2) Use hrtimers or another | |
166 | adaptive-ticks-immune mechanism to cause the POSIX CPU timer to | |
167 | fire properly. | |
168 | ||
169 | 4. If there are more perf events pending than the hardware can | |
170 | accommodate, they are normally round-robined so as to collect | |
171 | all of them over time. Adaptive-tick mode may prevent this | |
172 | round-robining from happening. This will likely be fixed by | |
173 | preventing CPUs with large numbers of perf events pending from | |
174 | entering adaptive-tick mode. | |
175 | ||
176 | 5. Scheduler statistics for adaptive-tick CPUs may be computed | |
177 | slightly differently than those for non-adaptive-tick CPUs. | |
178 | This might in turn perturb load-balancing of real-time tasks. | |
179 | ||
180 | 6. The LB_BIAS scheduler feature is disabled by adaptive ticks. | |
181 | ||
182 | Although improvements are expected over time, adaptive ticks is quite | |
183 | useful for many types of real-time and compute-intensive applications. | |
184 | However, the drawbacks listed above mean that adaptive ticks should not | |
185 | (yet) be enabled by default. | |
186 | ||
187 | ||
188 | RCU IMPLICATIONS | |
189 | ||
190 | There are situations in which idle CPUs cannot be permitted to | |
191 | enter either dyntick-idle mode or adaptive-tick mode, the most | |
192 | common being when such a CPU has RCU callbacks pending. | |
193 | ||
194 | The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such CPUs | |
195 | to enter dyntick-idle mode or adaptive-tick mode anyway. In this case, | |
196 | a timer will awaken these CPUs every four jiffies in order to ensure | |
197 | that the RCU callbacks are processed in a timely fashion. | |
198 | ||
199 | Another approach is to offload RCU callback processing to "rcuo" kthreads | |
200 | using the CONFIG_RCU_NOCB_CPU=y Kconfig option. The specific CPUs to | |
201 | offload may be selected via several methods: | |
202 | ||
203 | 1. One of three mutually exclusive Kconfig options specifies a | |
204 | build-time default for the CPUs to offload: | |
205 | ||
206 | a. The CONFIG_RCU_NOCB_CPU_NONE=y Kconfig option results in | |
207 | no CPUs being offloaded. | |
208 | ||
209 | b. The CONFIG_RCU_NOCB_CPU_ZERO=y Kconfig option causes | |
210 | CPU 0 to be offloaded. | |
211 | ||
212 | c. The CONFIG_RCU_NOCB_CPU_ALL=y Kconfig option causes all | |
213 | CPUs to be offloaded. Note that the callbacks will be | |
214 | offloaded to "rcuo" kthreads, and that those kthreads | |
215 | will in fact run on some CPU. However, this approach | |
216 | gives fine-grained control on exactly which CPUs the | |
217 | callbacks run on, along with their scheduling priority | |
218 | (including the default of SCHED_OTHER), and it further | |
219 | allows this control to be varied dynamically at runtime. | |
220 | ||
221 | 2. The "rcu_nocbs=" kernel boot parameter takes a comma-separated | |
222 | list of CPUs and CPU ranges; for example, "1,3-5" selects CPUs 1, | |
223 | 3, 4, and 5. The specified CPUs will be offloaded in addition to | |
224 | any CPUs specified as offloaded by CONFIG_RCU_NOCB_CPU_ZERO=y or | |
225 | CONFIG_RCU_NOCB_CPU_ALL=y. This means that the "rcu_nocbs=" boot | |
226 | parameter has no effect for kernels built with CONFIG_RCU_NOCB_CPU_ALL=y. | |
227 | ||
228 | The offloaded CPUs will never queue RCU callbacks, and therefore RCU | |
229 | never prevents offloaded CPUs from entering either dyntick-idle mode | |
230 | or adaptive-tick mode. That said, note that it is up to userspace to | |
231 | pin the "rcuo" kthreads to specific CPUs if desired. Otherwise, the | |
232 | scheduler will decide where to run them, which might or might not be | |
233 | where you want them to run. | |
234 | ||
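One way userspace might pin the "rcuo" kthreads is sketched below in Python. This is an assumption-laden sketch rather than a recommended tool. It assumes the kthreads can be recognized by a task name beginning with "rcuo" in /proc/<pid>/comm, that the caller has sufficient privilege (normally root), and that the kernel permits changing their affinity. The function names are hypothetical; os.sched_setaffinity() is the standard Python wrapper for the Linux affinity system call.

```python
import os


def find_rcuo_kthreads(proc_root="/proc"):
    """Return a sorted list of (pid, name) for tasks whose name starts with "rcuo".

    Matching on the name prefix is an assumption; an unrelated task that
    happens to use the same prefix would also be returned.
    """
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # the task exited while we were scanning
        if name.startswith("rcuo"):
            found.append((int(entry), name))
    return sorted(found)


def pin_rcuo_kthreads(cpus, proc_root="/proc"):
    """Pin every "rcuo" kthread found to the given set of CPUs."""
    for pid, _name in find_rcuo_kthreads(proc_root):
        os.sched_setaffinity(pid, cpus)


# Example (run as root): keep callback processing on CPU 0 only.
# pin_rcuo_kthreads({0})
```

Because the affinity is set from userspace, it can be changed again at any time, which matches the point above that placement of the "rcuo" kthreads is a runtime policy decision.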
235 | ||
8bdf7a25 PM |
236 | TESTING |
237 | ||
238 | So you enable all the OS-jitter features described in this document, | |
239 | but do not see any change in your workload's behavior. Is this because | |
240 | your workload isn't affected that much by OS jitter, or is it because | |
241 | something else is in the way? This section helps answer this question | |
242 | by providing a simple OS-jitter test suite, which is available on branch | |
243 | master of the following git archive: | |
244 | ||
245 | git://git.kernel.org/pub/scm/linux/kernel/git/frederic/dynticks-testing.git | |
246 | ||
247 | Clone this archive and follow the instructions in the README file. | |
248 | This test procedure will produce a trace that will allow you to evaluate | |
249 | whether or not you have succeeded in removing OS jitter from your system. | |
250 | If this trace shows that you have removed OS jitter as much as is | |
251 | possible, then you can conclude that your workload is not all that | |
252 | sensitive to OS jitter. | |
253 | ||
254 | Note: this test requires that your system have at least two CPUs. | |
255 | We do not currently have a good way to remove OS jitter from single-CPU | |
256 | systems. | |
257 | ||
258 | ||
0c87f9b5 PM |
259 | KNOWN ISSUES |
260 | ||
261 | o Dyntick-idle slows transitions to and from idle slightly. | |
262 | In practice, this has not been a problem except for the most | |
263 | aggressive real-time workloads, which have the option of disabling | |
264 | dyntick-idle mode, an option that most of them take. However, | |
265 | some workloads will no doubt want to use adaptive ticks to | |
266 | eliminate scheduling-clock interrupt latencies. Here are some | |
267 | options for these workloads: | |
268 | ||
269 | a. Use PMQOS from userspace to inform the kernel of your | |
270 | latency requirements (preferred). | |
271 | ||
272 | b. On x86 systems, use the "idle=mwait" boot parameter. | |
273 | ||
274 | c. On x86 systems, use the "intel_idle.max_cstate=" boot parameter | |
275 | to limit the maximum C-state depth. | |
276 | ||
277 | d. On x86 systems, use the "idle=poll" boot parameter. | |
278 | However, please note that use of this parameter can cause | |
279 | your CPU to overheat, which may cause thermal throttling | |
280 | to degrade your latencies -- and that this degradation can | |
281 | be even worse than that of dyntick-idle. Furthermore, | |
282 | this parameter effectively disables Turbo Mode on Intel | |
283 | CPUs, which can significantly reduce maximum performance. | |
284 | ||
285 | o Adaptive-ticks slows user/kernel transitions slightly. | |
286 | This is not expected to be a problem for computationally intensive | |
287 | workloads, which have few such transitions. Careful benchmarking | |
288 | will be required to determine whether or not other workloads | |
289 | are significantly affected by this effect. | |
290 | ||
291 | o Adaptive-ticks does not do anything unless there is only one | |
292 | runnable task for a given CPU, even though there are a number | |
293 | of other situations where the scheduling-clock tick is not | |
294 | needed. To give but one example, consider a CPU that has one | |
295 | runnable high-priority SCHED_FIFO task and an arbitrary number | |
296 | of low-priority SCHED_OTHER tasks. In this case, the CPU is | |
297 | required to run the SCHED_FIFO task until it either blocks or | |
298 | some other higher-priority task awakens on (or is assigned to) | |
299 | this CPU, so there is no point in sending a scheduling-clock | |
300 | interrupt to this CPU. However, the current implementation | |
301 | nevertheless sends scheduling-clock interrupts to CPUs having a | |
302 | single runnable SCHED_FIFO task and multiple runnable SCHED_OTHER | |
303 | tasks, even though these interrupts are unnecessary. | |
304 | ||
ce5f4fc8 PM |
305 | And even when there are multiple runnable tasks on a given CPU, |
306 | there is little point in interrupting that CPU until the current | |
307 | running task's timeslice expires, which is almost always much | |
308 | longer than the time until the next scheduling-clock interrupt. | |
309 | ||
0c87f9b5 PM |
310 | Better handling of these sorts of situations is future work. |
311 | ||
312 | o A reboot is required to reconfigure both adaptive ticks and RCU | |
313 | callback offloading. Runtime reconfiguration could be provided | |
314 | if needed; however, due to the complexity of reconfiguring RCU at | |
315 | runtime, there would need to be an earthshakingly good reason. | |
316 | Especially given that you have the straightforward option of | |
317 | simply offloading RCU callbacks from all CPUs and pinning them | |
318 | where you want them whenever you want them pinned. | |
319 | ||
320 | o Additional configuration is required to deal with other sources | |
321 | of OS jitter, including interrupts and system-utility tasks | |
322 | and processes. This configuration normally involves binding | |
323 | interrupts and tasks to particular CPUs. | |
324 | ||
325 | o Some sources of OS jitter can currently be eliminated only by | |
326 | constraining the workload. For example, the only way to eliminate | |
327 | OS jitter due to global TLB shootdowns is to avoid the unmapping | |
328 | operations (such as kernel module unload operations) that | |
329 | result in these shootdowns. For another example, page faults | |
330 | and TLB misses can be reduced (and in some cases eliminated) by | |
331 | using huge pages and by constraining the amount of memory used | |
332 | by the application. Pre-faulting the working set can also be | |
333 | helpful, especially when combined with the mlock() and mlockall() | |
334 | system calls. | |
335 | ||
336 | o Unless all CPUs are idle, at least one CPU must keep the | |
337 | scheduling-clock interrupt going in order to support accurate | |
338 | timekeeping. | |
339 | ||
ce5f4fc8 PM |
340 | o If there might potentially be some adaptive-ticks CPUs, there |
341 | will be at least one CPU keeping the scheduling-clock interrupt | |
342 | going, even if all CPUs are otherwise idle. | |
343 | ||
344 | Better handling of this situation is ongoing work. | |
345 | ||
346 | o Some process-handling operations still require the occasional | |
347 | scheduling-clock tick. These operations include calculating CPU | |
348 | load, maintaining sched average, computing CFS entity vruntime, | |
349 | computing avenrun, and carrying out load balancing. They are | |
350 | currently accommodated by a scheduling-clock tick every second | |
351 | or so. Ongoing work will eliminate the need even for these | |
352 | infrequent scheduling-clock ticks. |