Commit | Line | Data |
---|---|---|
712e5e34 DF |
1 | Deadline Task Scheduling |
2 | ------------------------ | |
3 | ||
4 | CONTENTS | |
5 | ======== | |
6 | ||
7 | 0. WARNING | |
8 | 1. Overview | |
9 | 2. Scheduling algorithm | |
10 | 3. Scheduling Real-Time Tasks | |
11 | 4. Bandwidth management | |
12 | 4.1 System-wide settings | |
13 | 4.2 Task interface | |
14 | 4.3 Default behavior | |
15 | 5. Tasks CPU affinity | |
16 | 5.1 SCHED_DEADLINE and cpusets HOWTO | |
17 | 6. Future plans | |
18 | ||
19 | ||
20 | 0. WARNING | |
21 | ========== | |
22 | ||
23 | Fiddling with these settings can result in unpredictable or even unstable | |
24 | system behavior. As for -rt (group) scheduling, it is assumed that root users | |
25 | know what they're doing. | |
26 | ||
27 | ||
28 | 1. Overview | |
29 | =========== | |
30 | ||
31 | The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is | |
32 | basically an implementation of the Earliest Deadline First (EDF) scheduling | |
33 | algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS) | |
34 | that makes it possible to isolate the behavior of tasks from each other. | |
35 | ||
36 | ||
37 | 2. Scheduling algorithm | |
38 | ======================= | |
39 | ||
40 | SCHED_DEADLINE uses three parameters, named "runtime", "period", and | |
b56bfc6c | 41 | "deadline", to schedule tasks. A SCHED_DEADLINE task should receive |
712e5e34 DF |
42 | "runtime" microseconds of execution time every "period" microseconds, and |
43 | these "runtime" microseconds are available within "deadline" microseconds | |
44 | from the beginning of the period. In order to implement this behaviour, | |
45 | every time the task wakes up, the scheduler computes a "scheduling deadline" | |
46 | consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then | |
47 | scheduled using EDF[1] on these scheduling deadlines (the task with the | |
b56bfc6c LA |
48 | earliest scheduling deadline is selected for execution). Notice that the |
49 | task actually receives "runtime" time units within "deadline" if a proper | |
50 | "admission control" strategy (see Section "4. Bandwidth management") is used | |
51 | (clearly, if the system is overloaded this guarantee cannot be respected). | |
712e5e34 DF |
52 | |
53 | Summing up, the CBS[2,3] algorithm assigns scheduling deadlines to tasks so | |
54 | that each task runs for at most its runtime every period, avoiding any | |
55 | interference between different tasks (bandwidth isolation), while the EDF[1] | |
ad67dc31 LA |
56 | algorithm selects the task with the earliest scheduling deadline as the one |
57 | to be executed next. Thanks to this feature, tasks that do not strictly comply | |
58 | with the "traditional" real-time task model (see Section 3) can effectively | |
59 | use the new policy. | |
712e5e34 DF |
60 | |
61 | In more detail, the CBS algorithm assigns scheduling deadlines to | |
62 | tasks in the following way: | |
63 | ||
64 | - Each SCHED_DEADLINE task is characterised by the "runtime", | |
65 | "deadline", and "period" parameters; | |
66 | ||
67 | - The state of the task is described by a "scheduling deadline", and | |
ad67dc31 | 68 | a "remaining runtime". These two parameters are initially set to 0; |
712e5e34 DF |
69 | |
70 | - When a SCHED_DEADLINE task wakes up (becomes ready for execution), | |
71 | the scheduler checks if | |
72 | ||
ad67dc31 LA |
73 | remaining runtime runtime |
74 | ---------------------------------- > --------- | |
75 | scheduling deadline - current time period | |
712e5e34 DF |
76 | |
77 | then, if the scheduling deadline is smaller than the current time, or | |
78 | this condition is verified, the scheduling deadline and the | |
ad67dc31 | 79 | remaining runtime are re-initialised as |
712e5e34 DF |
80 | |
81 | scheduling deadline = current time + deadline | |
ad67dc31 | 82 | remaining runtime = runtime |
712e5e34 | 83 | |
ad67dc31 | 84 | otherwise, the scheduling deadline and the remaining runtime are |
712e5e34 DF |
85 | left unchanged; |
86 | ||
87 | - When a SCHED_DEADLINE task executes for an amount of time t, its | |
ad67dc31 | 88 | remaining runtime is decreased as |
712e5e34 | 89 | |
ad67dc31 | 90 | remaining runtime = remaining runtime - t |
712e5e34 DF |
91 | |
92 | (technically, the runtime is decreased at every tick, or when the | |
93 | task is descheduled / preempted); | |
94 | ||
ad67dc31 | 95 | - When the remaining runtime becomes less than or equal to 0, the task is |
712e5e34 DF |
96 | said to be "throttled" (also known as "depleted" in real-time literature) |
97 | and cannot be scheduled until its scheduling deadline. The "replenishment | |
98 | time" for this task (see next item) is set to be equal to the current | |
99 | value of the scheduling deadline; | |
100 | ||
101 | - When the current time is equal to the replenishment time of a | |
ad67dc31 | 102 | throttled task, the scheduling deadline and the remaining runtime are |
712e5e34 DF |
103 | updated as |
104 | ||
105 | scheduling deadline = scheduling deadline + period | |
ad67dc31 | 106 | remaining runtime = remaining runtime + runtime |
712e5e34 DF |
107 | |
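The CBS rules listed above can be sketched as a small state machine. This is
only an illustrative model, not the kernel implementation: the class name
CbsTask and its methods wakeup(), run() and replenish() are hypothetical, time
is an abstract integer, and the wakeup test is written in cross-multiplied
form to avoid a division by zero when the scheduling deadline equals the
current time.

```python
from dataclasses import dataclass


@dataclass
class CbsTask:
    # Static parameters of the task (all in the same, arbitrary, time unit).
    runtime: int
    deadline: int
    period: int
    # Dynamic state; both values start at 0 as described in the text.
    sched_deadline: int = 0
    remaining_runtime: int = 0
    throttled: bool = False
    replenish_at: int = 0

    def wakeup(self, now: int) -> None:
        """Apply the wakeup rule: re-initialise the state if needed."""
        slack = self.sched_deadline - now
        # remaining/slack > runtime/period, cross-multiplied (assumption:
        # equivalent form chosen here to avoid dividing by zero).
        overflow = self.remaining_runtime * self.period > self.runtime * slack
        if self.sched_deadline < now or overflow:
            self.sched_deadline = now + self.deadline
            self.remaining_runtime = self.runtime
        # Otherwise both values are left unchanged.

    def run(self, t: int) -> None:
        """Account t units of execution; throttle when the budget is gone."""
        self.remaining_runtime -= t
        if self.remaining_runtime <= 0:
            self.throttled = True
            self.replenish_at = self.sched_deadline

    def replenish(self) -> None:
        """Called when the current time reaches replenish_at."""
        self.sched_deadline += self.period
        self.remaining_runtime += self.runtime
        self.throttled = False
```

For instance, a task with runtime 10, deadline 100 and period 100 that first
wakes up at time 1000 gets scheduling deadline 1100 and a budget of 10; if it
overruns its budget by 2 units, the replenishment leaves it with a budget of 8
and a scheduling deadline of 1200.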
108 | ||
109 | 3. Scheduling Real-Time Tasks | |
110 | ============================= | |
111 | ||
112 | * BIG FAT WARNING ****************************************************** | |
113 | * | |
114 | * This section contains a (not-thorough) summary of classical deadline | |
115 | * scheduling theory, and how it applies to SCHED_DEADLINE. | |
116 | * The reader can "safely" skip to Section 4 if only interested in seeing | |
117 | * how the scheduling policy can be used. Anyway, we strongly recommend | |
118 | * to come back here and continue reading (once the urge for testing is | |
119 | * satisfied :P) to be sure of fully understanding all technical details. | |
120 | ************************************************************************ | |
121 | ||
122 | There are no limitations on what kind of task can exploit this new | |
123 | scheduling discipline, even if it must be said that it is particularly | |
124 | suited for periodic or sporadic real-time tasks that need guarantees on their | |
125 | timing behavior, e.g., multimedia, streaming, control applications, etc. | |
126 | ||
127 | A typical real-time task is composed of a repetition of computation phases | |
128 | (task instances, or jobs) which are activated in a periodic or sporadic | |
129 | fashion. | |
130 | Each job J_j (where J_j is the j^th job of the task) is characterised by an | |
131 | arrival time r_j (the time when the job starts), an amount of computation | |
132 | time c_j needed to finish the job, and a job absolute deadline d_j, which | |
133 | is the time within which the job should be finished. The maximum execution | |
134 | time max_j{c_j} is called "Worst Case Execution Time" (WCET) for the task. | |
135 | A real-time task can be periodic with period P if r_{j+1} = r_j + P, or | |
136 | sporadic with minimum inter-arrival time P if r_{j+1} >= r_j + P. Finally, | |
137 | d_j = r_j + D, where D is the task's relative deadline. | |
b56bfc6c LA |
138 | The utilisation of a real-time task is defined as the ratio between its |
139 | WCET and its period (or minimum inter-arrival time), and represents | |
140 | the fraction of CPU time needed to execute the task. | |
141 | ||
142 | If the total utilisation sum_i(WCET_i/P_i) is larger than M (with M equal | |
143 | to the number of CPUs), then the scheduler is unable to respect all the | |
144 | deadlines. | |
145 | Note that total utilisation is defined as the sum of the utilisations | |
146 | WCET_i/P_i over all the real-time tasks in the system. When considering | |
147 | multiple real-time tasks, the parameters of the i-th task are indicated | |
148 | with the "_i" suffix. | |
149 | Moreover, if the total utilisation is larger than M, then we risk starving | |
150 | non-real-time tasks by real-time tasks. | |
151 | If, instead, the total utilisation is smaller than M, then non-real-time | |
152 | tasks will not be starved and the system might be able to respect all the | |
153 | deadlines. | |
154 | As a matter of fact, in this case it is possible to provide an upper bound | |
155 | for tardiness (defined as the maximum between 0 and the difference | |
156 | between the finishing time of a job and its absolute deadline). | |
157 | More precisely, it can be proven that using a global EDF scheduler the | |
158 | maximum tardiness of each task is smaller than or equal to | |
159 | ((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max | |
160 | where WCET_max = max_i{WCET_i} is the maximum WCET, WCET_min=min_i{WCET_i} | |
161 | is the minimum WCET, and U_max = max_i{WCET_i/P_i} is the maximum utilisation. | |
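As a worked instance of the bound above, the following sketch evaluates the
formula for a task set given as (WCET, period) pairs. The function name
gedf_tardiness_bound is hypothetical, and the check that the total utilisation
is smaller than M simply mirrors the precondition stated in the text.

```python
def gedf_tardiness_bound(tasks, m):
    """Evaluate the global EDF tardiness bound quoted in the text.

    tasks: list of (wcet, period) pairs, all in the same time unit.
    m:     number of CPUs.
    """
    total_u = sum(wcet / period for wcet, period in tasks)
    if total_u >= m:
        # The bound is only stated for total utilisation smaller than M.
        raise ValueError("total utilisation must be smaller than M")
    wcet_max = max(wcet for wcet, _ in tasks)
    wcet_min = min(wcet for wcet, _ in tasks)
    u_max = max(wcet / period for wcet, period in tasks)
    return ((m - 1) * wcet_max - wcet_min) / (m - (m - 2) * u_max) + wcet_max
```

With M = 2 and three tasks having WCETs 2, 4 and 3 and period 10, the bound is
((2 - 1) * 4 - 2) / (2 - 0 * 0.4) + 4 = 5 time units.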
162 | ||
163 | If M=1 (uniprocessor system), or in case of partitioned scheduling (each | |
164 | real-time task is statically assigned to one and only one CPU), it is | |
165 | possible to formally check if all the deadlines are respected. | |
166 | If D_i = P_i for all tasks, then EDF is able to respect all the deadlines | |
167 | of all the tasks executing on a CPU if and only if the total utilisation | |
168 | of the tasks running on such a CPU is smaller than or equal to 1. | |
169 | If D_i != P_i for some task, then it is possible to define the density of | |
170 | a task as WCET_i/min{D_i,P_i}, and EDF is able to respect all the deadlines | |
171 | of all the tasks running on a CPU if the sum sum_i WCET_i/min{D_i,P_i} of the | |
172 | densities of the tasks running on such a CPU is smaller than or equal to 1 | |
173 | (notice that this condition is only sufficient, and not necessary). | |
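The two single-CPU tests above can be sketched as follows. The function name
edf_cpu_check and its three string results are hypothetical; exact rational
arithmetic is used so that a sum equal to 1 is not misclassified by floating
point rounding. The "unknown" result reflects the fact that the density test
is only sufficient.

```python
from fractions import Fraction


def edf_cpu_check(tasks):
    """Classify a task set assigned to a single CPU.

    tasks: list of (wcet, deadline, period) integer triples.
    Returns "schedulable", "not schedulable" or "unknown".
    """
    utilisation = sum(Fraction(c, p) for c, _, p in tasks)
    if utilisation > 1:
        # Total utilisation above the number of CPUs (here 1): deadlines
        # cannot all be respected.
        return "not schedulable"
    if all(d == p for _, d, p in tasks):
        # D_i = P_i for all tasks: utilisation <= 1 is necessary and sufficient.
        return "schedulable"
    density = sum(Fraction(c, min(d, p)) for c, d, p in tasks)
    # The density test is sufficient but not necessary.
    return "schedulable" if density <= 1 else "unknown"
```

For example, two tasks (WCET 1, D 2, P 4) and (WCET 2, D 3, P 4) have total
utilisation 3/4 but total density 7/6, so this test alone cannot decide
whether EDF respects their deadlines.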
174 | ||
175 | On multiprocessor systems with global EDF scheduling (non partitioned | |
176 | systems), a sufficient test for schedulability can not be based on the | |
177 | utilisations (it can be shown that task sets with utilisations slightly | |
178 | larger than 1 can miss deadlines regardless of the number of CPUs M). | |
179 | However, as previously stated, enforcing that the total utilisation is smaller | |
180 | than M is enough to guarantee that non real-time tasks are not starved and | |
181 | that the tardiness of real-time tasks has an upper bound. | |
712e5e34 DF |
182 | |
183 | SCHED_DEADLINE can be used to schedule real-time tasks guaranteeing that | |
184 | the jobs' deadlines of a task are respected. In order to do this, a task | |
185 | must be scheduled by setting: | |
186 | ||
187 | - runtime >= WCET | |
188 | - deadline = D | |
189 | - period <= P | |
190 | ||
191 | IOW, if runtime >= WCET and if period is <= P, then the scheduling deadlines | |
192 | and the absolute deadlines (d_j) coincide, so a proper admission control | |
193 | makes it possible to respect the jobs' absolute deadlines for this task (this is what is | |
194 | called "hard schedulability property" and is an extension of Lemma 1 of [2]). | |
ad67dc31 LA |
195 | Notice that if runtime > deadline the admission control will surely reject |
196 | this task, as it is not possible to respect its temporal constraints. | |
712e5e34 DF |
197 | |
198 | References: | |
199 | 1 - C. L. Liu and J. W. Layland. Scheduling algorithms for | |
200 | multiprogramming in a hard-real-time environment. Journal of the Association for | |
201 | Computing Machinery, 20(1), 1973. | |
202 | 2 - L. Abeni, G. Buttazzo. Integrating Multimedia Applications in Hard | |
203 | Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems | |
204 | Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf | |
205 | 3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab | |
ad67dc31 | 206 | Technical Report. http://disi.unitn.it/~abeni/tr-98-01.pdf |
712e5e34 DF |
207 | |
208 | 4. Bandwidth management | |
209 | ======================= | |
210 | ||
b56bfc6c LA |
211 | As previously mentioned, in order for -deadline scheduling to be |
212 | effective and useful (that is, to be able to provide "runtime" time units | |
213 | within "deadline"), it is important to have some method to keep the allocation | |
214 | of the available fractions of CPU time to the various tasks under control. | |
215 | This is usually called "admission control" and if it is not performed, then | |
216 | no guarantee can be given on the actual scheduling of the -deadline tasks. | |
217 | ||
218 | As already stated in Section 3, a necessary condition to be respected to | |
219 | correctly schedule a set of real-time tasks is that the total utilisation | |
220 | is smaller than M. When talking about -deadline tasks, this requires that | |
221 | the sum of the ratio between runtime and period for all tasks is smaller | |
222 | than M. Notice that the ratio runtime/period is equivalent to the utilisation | |
223 | of a "traditional" real-time task, and is also often referred to as | |
224 | "bandwidth". | |
225 | The interface used to control the CPU bandwidth that can be allocated | |
226 | to -deadline tasks is similar to the one already used for -rt | |
0d9ba8b0 JL |
227 | tasks with real-time group scheduling (a.k.a. RT-throttling - see |
228 | Documentation/scheduler/sched-rt-group.txt), and is based on readable/ | |
229 | writable control files located in procfs (for system-wide settings). | |
230 | Notice that per-group settings (controlled through cgroupfs) are still not | |
231 | defined for -deadline tasks, because more discussion is needed in order to | |
232 | figure out how we want to manage SCHED_DEADLINE bandwidth at the task group | |
233 | level. | |
234 | ||
235 | A main difference between deadline bandwidth management and RT-throttling | |
712e5e34 | 236 | is that -deadline tasks have bandwidth on their own (while -rt ones don't!), |
0d9ba8b0 | 237 | and thus we don't need a higher level throttling mechanism to enforce the |
b56bfc6c LA |
238 | desired bandwidth. In other words, this means that interface parameters are |
239 | only used at admission control time (i.e., when the user calls | |
240 | sched_setattr()). Scheduling is then performed considering actual tasks' | |
241 | parameters, so that CPU bandwidth is allocated to SCHED_DEADLINE tasks | |
242 | respecting their needs in terms of granularity. Therefore, using this simple | |
243 | interface we can put a cap on total utilization of -deadline tasks (i.e., | |
244 | \Sum (runtime_i / period_i) < global_dl_utilization_cap). | |
712e5e34 DF |
245 | |
246 | 4.1 System-wide settings | |
247 | ------------------------ | |
248 | ||
249 | The system-wide settings are configured under the /proc virtual file system. | |
250 | ||
0d9ba8b0 JL |
251 | For now the -rt knobs are used for -deadline admission control and the |
252 | -deadline runtime is accounted against the -rt runtime. We realise that this | |
253 | isn't entirely desirable; however, it is better to have a small interface for | |
254 | now, and be able to change it easily later. The ideal situation (see Section 6) is to | |
255 | run -rt tasks from a -deadline server; in which case the -rt bandwidth is a | |
256 | direct subset of dl_bw. | |
712e5e34 DF |
257 | |
258 | This means that, for a root_domain comprising M CPUs, -deadline tasks | |
259 | can be created while the sum of their bandwidths stays below: | |
260 | ||
261 | M * (sched_rt_runtime_us / sched_rt_period_us) | |
262 | ||
263 | It is also possible to disable this bandwidth management logic, and | |
264 | thus be free to oversubscribe the system up to any arbitrary level. | |
265 | This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us. | |
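The admission test described above can be sketched as a pure function. This is
an illustrative model and not the kernel code: the function name
dl_admission_ok is hypothetical, the default values are the ones given in
Section 4.3, and admitting a task set whose total bandwidth is exactly equal
to the cap is an assumption made here.

```python
from fractions import Fraction


def dl_admission_ok(tasks, cpus, rt_runtime_us=950000, rt_period_us=1000000):
    """Check a set of -deadline tasks against the root_domain bandwidth cap.

    tasks: list of (runtime, period) integer pairs, in the same time unit.
    cpus:  number of CPUs (M) in the root_domain.
    """
    if rt_runtime_us == -1:
        # Bandwidth management disabled: any level of oversubscription passes.
        return True
    cap = cpus * Fraction(rt_runtime_us, rt_period_us)
    total = sum(Fraction(runtime, period) for runtime, period in tasks)
    # Assumption: equality with the cap is treated as admissible.
    return total <= cap
```

With the default values and a root_domain of 2 CPUs the cap is 1.9, so three
tasks with bandwidth 0.5 each are admitted while a fourth one is not.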
266 | ||
267 | ||
268 | 4.2 Task interface | |
269 | ------------------ | |
270 | ||
271 | Specifying a periodic/sporadic task that executes for a given amount of | |
272 | runtime at each instance, and that is scheduled according to the urgency of | |
273 | its own timing constraints needs, in general, a way of declaring: | |
274 | - a (maximum/typical) instance execution time, | |
275 | - a minimum interval between consecutive instances, | |
276 | - a time constraint by which each instance must be completed. | |
277 | ||
278 | Therefore: | |
279 | * a new struct sched_attr, containing all the necessary fields is | |
280 | provided; | |
281 | * the new scheduling related syscalls that manipulate it, i.e., | |
282 | sched_setattr() and sched_getattr() are implemented. | |
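As a sketch of what the task interface carries, the following mirrors the
layout of struct sched_attr with ctypes. The field order and the policy number
are taken from the Linux uapi headers as understood here and should be checked
against the headers of the running kernel; the names SchedAttr and
make_deadline_attr are hypothetical. The three time fields are expressed in
nanoseconds, so the helper converts from microseconds. Issuing the actual
sched_setattr() system call is left out on purpose, because the system call
number is architecture specific.

```python
import ctypes

SCHED_DEADLINE = 6  # policy number, assumed from the Linux uapi headers


class SchedAttr(ctypes.Structure):
    """Mirror of struct sched_attr (layout assumed, verify against headers)."""
    _fields_ = [
        ("size", ctypes.c_uint32),
        ("sched_policy", ctypes.c_uint32),
        ("sched_flags", ctypes.c_uint64),
        ("sched_nice", ctypes.c_int32),      # used by SCHED_NORMAL/SCHED_BATCH
        ("sched_priority", ctypes.c_uint32), # used by SCHED_FIFO/SCHED_RR
        ("sched_runtime", ctypes.c_uint64),  # nanoseconds
        ("sched_deadline", ctypes.c_uint64), # nanoseconds
        ("sched_period", ctypes.c_uint64),   # nanoseconds
    ]


def make_deadline_attr(runtime_us, deadline_us, period_us):
    """Fill a SchedAttr for SCHED_DEADLINE from values in microseconds."""
    if not runtime_us <= deadline_us <= period_us:
        raise ValueError("expected runtime <= deadline <= period")
    attr = SchedAttr()
    attr.size = ctypes.sizeof(SchedAttr)
    attr.sched_policy = SCHED_DEADLINE
    attr.sched_runtime = runtime_us * 1000
    attr.sched_deadline = deadline_us * 1000
    attr.sched_period = period_us * 1000
    return attr
```

For example, make_deadline_attr(10000, 100000, 100000) describes a task that
needs 10 ms of execution every 100 ms, with the deadline equal to the period.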
283 | ||
284 | ||
285 | 4.3 Default behavior | |
286 | -------------------- | |
287 | ||
288 | The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to | |
289 | 950000. With rt_period equal to 1000000, by default, it means that -deadline | |
290 | tasks can use at most 95%, multiplied by the number of CPUs that compose the | |
291 | root_domain, for each root_domain. | |
b56bfc6c LA |
292 | This means that non -deadline tasks will receive at least 5% of the CPU time, |
293 | and that -deadline tasks will receive their runtime with a guaranteed | |
294 | worst-case delay with respect to the "deadline" parameter. If "deadline" = "period" | |
295 | and the cpuset mechanism is used to implement partitioned scheduling (see | |
296 | Section 5), then this simple setting of the bandwidth management is able to | |
297 | deterministically guarantee that -deadline tasks will receive their runtime | |
298 | in a period. | |
299 | ||
300 | Finally, notice that in order not to jeopardize the admission control a | |
301 | -deadline task cannot fork. | |
712e5e34 DF |
302 | |
303 | 5. Tasks CPU affinity | |
304 | ===================== | |
305 | ||
306 | -deadline tasks cannot have an affinity mask smaller than the entire | |
307 | root_domain they are created on. However, affinities can be specified | |
308 | through the cpuset facility (Documentation/cgroups/cpusets.txt). | |
309 | ||
310 | 5.1 SCHED_DEADLINE and cpusets HOWTO | |
311 | ------------------------------------ | |
312 | ||
313 | An example of a simple configuration (pin a -deadline task to CPU0) | |
314 | follows (rt-app is used to create a -deadline task). | |
315 | ||
316 | mkdir /dev/cpuset | |
317 | mount -t cgroup -o cpuset cpuset /dev/cpuset | |
318 | cd /dev/cpuset | |
319 | mkdir cpu0 | |
320 | echo 0 > cpu0/cpuset.cpus | |
321 | echo 0 > cpu0/cpuset.mems | |
322 | echo 1 > cpuset.cpu_exclusive | |
323 | echo 0 > cpuset.sched_load_balance | |
324 | echo 1 > cpu0/cpuset.cpu_exclusive | |
325 | echo 1 > cpu0/cpuset.mem_exclusive | |
326 | echo $$ > cpu0/tasks | |
327 | rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify | |
328 | task affinity) | |
329 | ||
330 | 6. Future plans | |
331 | =============== | |
332 | ||
333 | Still missing: | |
334 | ||
335 | - refinements to deadline inheritance, especially regarding the possibility | |
336 | of retaining bandwidth isolation among non-interacting tasks. This is | |
337 | being studied from both theoretical and practical points of view, and | |
338 | hopefully we should be able to produce some demonstrative code soon; | |
339 | - (c)group based bandwidth management, and maybe scheduling; | |
340 | - access control for non-root users (and related security concerns to | |
341 | address): what is the best way to allow unprivileged use of the mechanisms, | |
342 | and how to prevent non-root users from "cheating" the system? | |
343 | ||
344 | As already discussed, we are also planning to merge this work with the EDF | |
345 | throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in | |
346 | the preliminary phases of the merge and we really seek feedback that would | |
347 | help us decide on the direction it should take. |