CPUSETS
-------

Copyright (C) 2004 BULL SA.
Written by Simon.Derr@bull.net

Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
Modified by Paul Jackson <pj@sgi.com>
Modified by Christoph Lameter <clameter@sgi.com>

CONTENTS:
=========

1. Cpusets
  1.1 What are cpusets ?
  1.2 Why are cpusets needed ?
  1.3 How are cpusets implemented ?
  1.4 What are exclusive cpusets ?
  1.5 What does notify_on_release do ?
  1.6 What is memory_pressure ?
  1.7 What is memory spread ?
  1.8 How do I use cpusets ?
2. Usage Examples and Syntax
  2.1 Basic Usage
  2.2 Adding/removing cpus
  2.3 Setting flags
  2.4 Attaching processes
3. Questions
4. Contact

1. Cpusets
==========

1.1 What are cpusets ?
----------------------

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks.  In this document "Memory Node" refers to
an on-line node that contains memory.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset.  They form a nested
hierarchy visible in a virtual file system.  These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Each task has a pointer to a cpuset.  Multiple tasks may reference
the same cpuset.  Requests by a task, using the sched_setaffinity(2)
system call to include CPUs in its CPU affinity mask, and using the
mbind(2) and set_mempolicy(2) system calls to include Memory Nodes
in its memory policy, are both filtered through that task's cpuset,
filtering out any CPUs or Memory Nodes not in that cpuset.  The
scheduler will not schedule a task on a CPU that is not allowed in
its cpus_allowed vector, and the kernel page allocator will not
allocate a page on a node that is not allowed in the requesting
task's mems_allowed vector.

User level code may create and destroy cpusets by name in the cpuset
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.


1.2 Why are cpusets needed ?
----------------------------

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA), presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently, more modest sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.

This can be especially valuable on:

 * Web Servers running multiple instances of the same web application,
 * Servers running different applications (for instance, a web server
   and a database), or
 * NUMA systems running large HPC applications with demanding
   performance characteristics.

These subsets, or "soft partitions", must be able to be dynamically
adjusted as the job mix changes, without impacting other concurrently
executing jobs.  The location of a running job's pages may also be
moved when its memory locations are changed.

The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets.  It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.


1.3 How are cpusets implemented ?
---------------------------------

Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cpuset structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
   those Memory Nodes allowed in that task's cpuset.
 - The root cpuset contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
   of the parent's CPU and Memory Node resources.
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   browsing and manipulation from user space.
 - A cpuset may be marked exclusive, which ensures that no other
   cpuset (except direct ancestors and descendants) may contain
   any overlapping CPUs or Memory Nodes.
 - You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few simple hooks
into the rest of the kernel, none in performance critical paths:

 - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that task's cpuset.
 - in sched.c migrate_all_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that task's cpuset.
 - in page_alloc.c, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page recovery to the current cpuset.

In addition, a new file system of type "cpuset" may be mounted,
typically at /dev/cpuset, to enable browsing and modifying the cpusets
presently known to the kernel.  No new system calls are added for
cpusets - all support for querying and modifying cpusets is via
this cpuset file system.

Each task under /proc has an added file named 'cpuset', displaying
the cpuset name, as the path relative to the root of the cpuset file
system.

The /proc/<pid>/status file for each task has two added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the format seen in the following example:

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Mems_allowed:   ffffffff,ffffffff
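
For example, a task can display its own cpuset path, and its allowed
CPUs and Memory Nodes, from the shell:

  cat /proc/self/cpuset
  grep allowed /proc/self/status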

Each cpuset is represented by a directory in the cpuset file system
containing the following files describing that cpuset:

 - cpus: list of CPUs in that cpuset
 - mems: list of Memory Nodes in that cpuset
 - memory_migrate flag: if set, move pages to cpuset's nodes
 - cpu_exclusive flag: is cpu placement exclusive?
 - mem_exclusive flag: is memory placement exclusive?
 - tasks: list of tasks (by pid) attached to that cpuset
 - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
 - memory_pressure: measure of how much paging pressure in cpuset

In addition, only the root cpuset has the following file:
 - memory_pressure_enabled flag: compute memory_pressure?

New cpusets are created using the mkdir system call or shell
command.  The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpuset's directory, as listed above.

The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task to a cpuset, automatically inherited at
fork by any children of that task, allows organizing the work load
on a system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset.  A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.

Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.

The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
 - It can only be marked exclusive if its parent is.
 - If its cpu or memory is exclusive, they may not overlap any sibling.

These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them change, to ensure nothing overlaps an
exclusive cpuset.  Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.

The cpus and mems files in the root (top_cpuset) cpuset are
read-only.  The cpus file automatically tracks the value of
cpu_online_map using a CPU hotplug notifier, and the mems file
automatically tracks the value of node_states[N_MEMORY] (i.e.,
nodes with memory) using the cpuset_track_online_nodes() hook.


1.4 What are exclusive cpusets ?
--------------------------------

If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.

A cpuset that is mem_exclusive restricts kernel allocations for
page, buffer and other data commonly shared by the kernel across
multiple users.  All cpusets, whether mem_exclusive or not, restrict
allocations of memory for user space.  This enables configuring a
system so that several independent jobs can share common kernel data,
such as file system pages, while isolating each job's user allocation
in its own cpuset.  To do this, construct a large mem_exclusive cpuset
to hold all the jobs, and construct child, non-mem_exclusive cpusets
for each individual job.  Only a small amount of typical kernel memory,
such as requests from interrupt handlers, is allowed to be taken
outside even a mem_exclusive cpuset.
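
For example, such a layout might be constructed as follows (a sketch;
the cpuset names and the CPU and node lists are illustrative):

  cd /dev/cpuset
  mkdir jobs
  /bin/echo 0-15 > jobs/cpus
  /bin/echo 0-3 > jobs/mems
  /bin/echo 1 > jobs/mem_exclusive
  mkdir jobs/job1
  /bin/echo 0-7 > jobs/job1/cpus
  /bin/echo 0-1 > jobs/job1/mems

Here the mem_exclusive 'jobs' cpuset confines shared kernel data,
such as file system pages, to its nodes, while each non-mem_exclusive
child confines only its own job's user space allocations.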


1.5 What does notify_on_release do ?
------------------------------------

If the notify_on_release flag is enabled (1) in a cpuset, then whenever
the last task in the cpuset leaves (exits or attaches to some other
cpuset) and the last child cpuset of that cpuset is removed, the
kernel runs the command /sbin/cpuset_release_agent, supplying the
pathname (relative to the mount point of the cpuset file system) of the
abandoned cpuset.  This enables automatic removal of abandoned cpusets.
The default value of notify_on_release in the root cpuset at system
boot is disabled (0).  The default value of other cpusets at creation
is the current value of their parent's notify_on_release setting.
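
A minimal release agent might do nothing more than remove the
abandoned directory (a sketch, assuming the cpuset file system is
mounted at /dev/cpuset):

  #!/bin/sh
  # /sbin/cpuset_release_agent: the kernel passes the abandoned
  # cpuset's pathname, relative to the mount point, in $1.
  rmdir "/dev/cpuset/$1"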


1.6 What is memory_pressure ?
-----------------------------
The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate at which the tasks in a cpuset are attempting to free up
in-use memory on the nodes of the cpuset, in order to satisfy
additional memory requests.

This enables batch managers, monitoring jobs running in dedicated
cpusets, to efficiently detect what level of memory pressure that job
is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.

This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure.  It's up to the
batch manager or other user code to decide what to do about it and
take action.

==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero.  So only
    systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm,
    the system load imposed by a batch scheduler monitoring this
    metric is sharply reduced on large systems, because a scan of
    the tasklist can be avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a
    single read, instead of having to read and accumulate results
    for a period of time.

    Because this meter is per-cpuset rather than per-task or mm,
    the batch scheduler can obtain the key information, memory
    pressure in a cpuset, with a single read, rather than having to
    query and accumulate results over all the (dynamically changing)
    set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per-cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.
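
For example, a batch manager might enable the metric once in the root
cpuset, then poll each job's cpuset with a single read (a sketch; the
cpuset name is illustrative):

  /bin/echo 1 > /dev/cpuset/memory_pressure_enabled
  cat /dev/cpuset/Charlie/memory_pressure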


1.7 What is memory spread ?
---------------------------
There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related in
kernel data structures.  They are called 'memory_spread_page' and
'memory_spread_slab'.

If the per-cpuset boolean flag file 'memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running.

If the per-cpuset boolean flag file 'memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as those for inodes and dentries, evenly over all the nodes that
the faulting task is allowed to use, instead of preferring to put those
pages on the node where the task is running.

The setting of these flags does not affect the anonymous data segment
or stack segment pages of a task.

By default, both kinds of memory spreading are off, and memory
pages are allocated on the node local to where the task is running,
except perhaps as modified by the task's NUMA mempolicy or cpuset
configuration, so long as sufficient free memory pages are available.

When new cpusets are created, they inherit the memory spread settings
of their parent.

Setting memory spreading causes allocations for the affected page
or slab caches to ignore the task's NUMA mempolicy and be spread
instead.  Tasks using mbind() or set_mempolicy() calls to set NUMA
mempolicies will not notice any change in these calls as a result of
their containing task's memory spread settings.  If memory spreading
is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations.

Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag
files.  By default they contain "0", meaning that the feature is off
for that cpuset.  If a "1" is written to that file, then that turns
the named feature on.
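
For example, both kinds of spreading can be turned on for a cpuset
(a sketch; the cpuset name is illustrative):

  /bin/echo 1 > /dev/cpuset/Charlie/memory_spread_page
  /bin/echo 1 > /dev/cpuset/Charlie/memory_spread_slab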

The implementation is simple.

Setting the flag 'memory_spread_page' turns on a per-process flag
PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset.  The page allocation calls for the page cache
are modified to perform an inline check for this PF_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation.

Similarly, setting 'memory_spread_slab' turns on the flag
PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node().

The cpuset_mem_spread_node() routine is also simple.  It uses the
value of a per-task rotor cpuset_mem_spread_rotor to select the next
node in the current task's mems_allowed to prefer for the allocation.

This memory placement policy is also known (in other contexts) as
round-robin or interleave.

This policy can provide substantial improvements for jobs that need
to place thread local data on the corresponding node, but that need
to access large file system data sets that must be spread across
the several nodes in the job's cpuset in order to fit.  Without this
policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the job's cpuset
can become very uneven.


1.8 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel
code, such as the scheduler, and due to the fact that the kernel
does not support one task updating the memory placement of another
task directly, the impact on a task of changing its cpuset CPU
or Memory Node placement, or of changing to which cpuset a task
is attached, is subtle.

If a cpuset has its Memory Nodes modified, then for each task attached
to that cpuset, the next time that the kernel attempts to allocate
a page of memory for that task, the kernel will notice the change
in the task's cpuset, and update its per-task memory placement to
remain within the new cpuset's memory placement.  If the task was using
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
its new cpuset, then the task will continue to use whatever subset
of MPOL_BIND nodes are still allowed in the new cpuset.  If the task
was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
in the new cpuset, then the task will be essentially treated as if it
were MPOL_BIND bound to the new cpuset (even though its numa placement,
as queried by get_mempolicy(), doesn't change).  If a task is moved
from one cpuset to another, then the kernel will adjust the task's
memory placement, as above, the next time that the kernel attempts
to allocate a page of memory for that task.

If a cpuset has its CPUs modified, then each task using that
cpuset does _not_ change its behavior automatically.  In order to
minimize the impact on the critical scheduling code in the kernel,
tasks will continue to use their prior CPU placement until they
are rebound to their cpuset, by rewriting their pid to the 'tasks'
file of their cpuset.  If a task had been bound to some subset of its
cpuset using the sched_setaffinity() call, and if any of that subset
is still allowed in its new cpuset settings, then the task will be
restricted to the intersection of the CPUs it was allowed on before,
and its new cpuset CPU placement.  If, on the other hand, there is
no overlap between a task's prior placement and its new cpuset CPU
placement, then the task will be allowed to run on any CPU allowed
in its new cpuset.  If a task is moved from one cpuset to another,
its CPU placement is updated in the same way as if the task's pid
were rewritten to the 'tasks' file of its current cpuset.

In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
but the processor placement is not updated, until that task's pid is
rewritten to the 'tasks' file of its cpuset.  This is done to avoid
impacting the scheduler code in the kernel with a check for changes
in a task's processor placement.
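
For example, after shrinking a cpuset's 'cpus', each attached task can
be rebound by rewriting its pid to the 'tasks' file of that same
cpuset (a sketch; the cpuset name and pid 1234 are illustrative):

  /bin/echo 2 > /dev/cpuset/Charlie/cpus
  /bin/echo 1234 > /dev/cpuset/Charlie/tasks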

Normally, once a page is allocated (given a physical page
of main memory) then that page stays on whatever node it
was allocated, so long as it remains allocated, even if the
cpuset's memory placement policy 'mems' subsequently changes.
If the cpuset flag file 'memory_migrate' is set true, then when
tasks are attached to that cpuset, any pages that task had
allocated to it on nodes in its previous cpuset are migrated
to the task's new cpuset.  The relative placement of a page within
the cpuset is preserved during these migration operations if possible.
For example, if the page was on the second valid node of the prior
cpuset, then the page will be placed on the second valid node of
the new cpuset.

Also if 'memory_migrate' is set true, then if that cpuset's
'mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'mems',
will be moved to nodes in the new setting of 'mems'.
Pages that were not in the task's prior cpuset, or in the cpuset's
prior 'mems' setting, will not be moved.
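
For example, enabling 'memory_migrate' and then rewriting 'mems'
moves the job's pages to the new nodes (a sketch; the cpuset name and
node number are illustrative):

  /bin/echo 1 > /dev/cpuset/Charlie/memory_migrate
  /bin/echo 2 > /dev/cpuset/Charlie/mems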

There is an exception to the above.  If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset,
then the kernel will automatically update the cpus_allowed of all
tasks attached to that cpuset to allow all CPUs.  When memory
hotplug functionality for removing Memory Nodes is available, a
similar exception is expected to apply there as well.  In general,
the kernel prefers to violate cpuset placement, over starving a task
that has had all its allowed CPUs or Memory Nodes taken offline.  User
code should reconfigure cpusets to only refer to online CPUs and Memory
Nodes when using hotplug to add or remove such resources.

There is a second exception to the above.  GFP_ATOMIC requests are
kernel internal allocations that must be satisfied immediately.
The kernel may drop some request, in rare cases even panic, if a
GFP_ATOMIC alloc fails.  If the request cannot be satisfied within
the current task's cpuset, then we relax the cpuset, and look for
memory anywhere we can find it.  It's better to violate the cpuset
than stress the kernel.

To start a new job that is to be contained within a cpuset, the steps are:

 1) mkdir /dev/cpuset
 2) mount -t cpuset none /dev/cpuset
 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
    the /dev/cpuset virtual file system.
 4) Start a task that will be the "founding father" of the new job.
 5) Attach that task to the new cpuset by writing its pid to the
    /dev/cpuset tasks file for that cpuset.
 6) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will set up a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cpuset:

  mount -t cpuset none /dev/cpuset
  cd /dev/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpus
  /bin/echo 1 > mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cpuset Charlie
  # The next line should display '/Charlie'
  cat /proc/self/cpuset

In the future, a C library interface to cpusets will likely be
available.  For now, the only way to query or modify cpusets is
via the cpuset file system, using the various cd, mkdir, echo, cat,
rmdir commands from the shell, or their equivalent from C.

The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset.  The mbind and set_mempolicy
calls can be done at the shell prompt using the numactl command
(part of Andi Kleen's numa package).
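
For example (a sketch; the program name and the CPU and node numbers
are illustrative):

  taskset -c 2 ./myapp
  numactl --membind=1 ./myapp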

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying and using cpusets can be done through the cpuset
virtual filesystem.

To mount it, type:
# mount -t cpuset none /dev/cpuset

Then under /dev/cpuset you can find a tree that corresponds to the
tree of the cpusets in the system.  For instance, /dev/cpuset
is the cpuset that holds the whole system.

If you want to create a new cpuset under /dev/cpuset:
# cd /dev/cpuset
# mkdir my_cpuset

Now you want to do something with this cpuset.
# cd my_cpuset

In this directory you can find several files:
# ls
cpus  cpu_exclusive  mems  mem_exclusive  tasks

Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
it, and its properties.  By writing to these files you can manipulate
the cpuset.

Set some flags:
# /bin/echo 1 > cpu_exclusive

Add some cpus:
# /bin/echo 0-7 > cpus

Add some mems:
# /bin/echo 0-7 > mems

Now attach your shell to this cpuset:
# /bin/echo $$ > tasks

You can also create cpusets inside your cpuset by using mkdir in this
directory.
# mkdir my_sub_cs

To remove a cpuset, just use rmdir:
# rmdir my_sub_cs
This will fail if the cpuset is in use (has child cpusets, or has
processes attached).

2.2 Adding/removing cpus
------------------------

This is the syntax to use when writing in the cpus or mems files
in cpuset directories:

# /bin/echo 1-4 > cpus      -> set cpus list to cpus 1,2,3,4
# /bin/echo 1,2,3,4 > cpus  -> set cpus list to cpus 1,2,3,4
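
Writing to these files replaces the entire list, so removing a cpu
means rewriting the list without it; for example, to drop cpu 2 from
the 1-4 list above (a sketch):

# /bin/echo 1,3-4 > cpus    -> set cpus list to cpus 1,3,4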

2.3 Setting flags
-----------------

The syntax is very simple:

# /bin/echo 1 > cpu_exclusive  -> set flag 'cpu_exclusive'
# /bin/echo 0 > cpu_exclusive  -> unset flag 'cpu_exclusive'

2.4 Attaching processes
-----------------------

# /bin/echo PID > tasks

Note that it is PID, not PIDs.  You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another:

# /bin/echo PID1 > tasks
# /bin/echo PID2 > tasks
  ...
# /bin/echo PIDn > tasks
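
A shell loop can automate this; for example (a sketch; 'myjob' is an
illustrative process name):

# for pid in $(pidof myjob); do /bin/echo $pid > tasks; done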


3. Questions
============

Q: what's up with this '/bin/echo' ?
A: bash's builtin 'echo' command does not check its calls to write()
   for errors.  If you use it in the cpuset file system, you won't be
   able to tell whether a command succeeded or failed.

Q: When I attach processes, only the first one on the line actually
   gets attached !
A: We can only return one error code per call to write(), so you
   should write only ONE pid per call.

4. Contact
==========

Web: http://www.bullopensource.org/cpuset