Commit | Line | Data |
---|---|---|
712e5e34 DF |
1 | Deadline Task Scheduling |
2 | ------------------------ | |
3 | ||
4 | CONTENTS | |
5 | ======== | |
6 | ||
7 | 0. WARNING | |
8 | 1. Overview | |
9 | 2. Scheduling algorithm | |
10 | 3. Scheduling Real-Time Tasks | |
11 | 4. Bandwidth management | |
12 | 4.1 System-wide settings | |
13 | 4.2 Task interface | |
14 | 4.3 Default behavior | |
15 | 5. Tasks CPU affinity | |
16 | 5.1 SCHED_DEADLINE and cpusets HOWTO | |
17 | 6. Future plans | |
18 | ||
19 | ||
20 | 0. WARNING | |
21 | ========== | |
22 | ||
23 | Fiddling with these settings can result in an unpredictable or even unstable | |
24 | system behavior. As for -rt (group) scheduling, it is assumed that root users | |
25 | know what they're doing. | |
26 | ||
27 | ||
28 | 1. Overview | |
29 | =========== | |
30 | ||
31 | The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is | |
32 | basically an implementation of the Earliest Deadline First (EDF) scheduling | |
33 | algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS) | |
34 | that makes it possible to isolate the behavior of tasks between each other. | |
35 | ||
36 | ||
37 | 2. Scheduling algorithm | |
38 | ================== | |
39 | ||
40 | SCHED_DEADLINE uses three parameters, named "runtime", "period", and | |
41 | "deadline" to schedule tasks. A SCHED_DEADLINE task is guaranteed to receive | |
42 | "runtime" microseconds of execution time every "period" microseconds, and | |
43 | these "runtime" microseconds are available within "deadline" microseconds | |
44 | from the beginning of the period. In order to implement this behaviour, | |
45 | every time the task wakes up, the scheduler computes a "scheduling deadline" | |
46 | consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then | |
47 | scheduled using EDF[1] on these scheduling deadlines (the task with the | |
48 | smallest scheduling deadline is selected for execution). Notice that this | |
49 | guaranteed is respected if a proper "admission control" strategy (see Section | |
50 | "4. Bandwidth management") is used. | |
51 | ||
52 | Summing up, the CBS[2,3] algorithms assigns scheduling deadlines to tasks so | |
53 | that each task runs for at most its runtime every period, avoiding any | |
54 | interference between different tasks (bandwidth isolation), while the EDF[1] | |
55 | algorithm selects the task with the smallest scheduling deadline as the one | |
56 | to be executed first. Thanks to this feature, also tasks that do not | |
57 | strictly comply with the "traditional" real-time task model (see Section 3) | |
58 | can effectively use the new policy. | |
59 | ||
60 | In more details, the CBS algorithm assigns scheduling deadlines to | |
61 | tasks in the following way: | |
62 | ||
63 | - Each SCHED_DEADLINE task is characterised by the "runtime", | |
64 | "deadline", and "period" parameters; | |
65 | ||
66 | - The state of the task is described by a "scheduling deadline", and | |
67 | a "current runtime". These two parameters are initially set to 0; | |
68 | ||
69 | - When a SCHED_DEADLINE task wakes up (becomes ready for execution), | |
70 | the scheduler checks if | |
71 | ||
72 | current runtime runtime | |
73 | ---------------------------------- > ---------------- | |
74 | scheduling deadline - current time period | |
75 | ||
76 | then, if the scheduling deadline is smaller than the current time, or | |
77 | this condition is verified, the scheduling deadline and the | |
78 | current budget are re-initialised as | |
79 | ||
80 | scheduling deadline = current time + deadline | |
81 | current runtime = runtime | |
82 | ||
83 | otherwise, the scheduling deadline and the current runtime are | |
84 | left unchanged; | |
85 | ||
86 | - When a SCHED_DEADLINE task executes for an amount of time t, its | |
87 | current runtime is decreased as | |
88 | ||
89 | current runtime = current runtime - t | |
90 | ||
91 | (technically, the runtime is decreased at every tick, or when the | |
92 | task is descheduled / preempted); | |
93 | ||
94 | - When the current runtime becomes less or equal than 0, the task is | |
95 | said to be "throttled" (also known as "depleted" in real-time literature) | |
96 | and cannot be scheduled until its scheduling deadline. The "replenishment | |
97 | time" for this task (see next item) is set to be equal to the current | |
98 | value of the scheduling deadline; | |
99 | ||
100 | - When the current time is equal to the replenishment time of a | |
101 | throttled task, the scheduling deadline and the current runtime are | |
102 | updated as | |
103 | ||
104 | scheduling deadline = scheduling deadline + period | |
105 | current runtime = current runtime + runtime | |
106 | ||
107 | ||
108 | 3. Scheduling Real-Time Tasks | |
109 | ============================= | |
110 | ||
111 | * BIG FAT WARNING ****************************************************** | |
112 | * | |
113 | * This section contains a (not-thorough) summary on classical deadline | |
114 | * scheduling theory, and how it applies to SCHED_DEADLINE. | |
115 | * The reader can "safely" skip to Section 4 if only interested in seeing | |
116 | * how the scheduling policy can be used. Anyway, we strongly recommend | |
117 | * to come back here and continue reading (once the urge for testing is | |
118 | * satisfied :P) to be sure of fully understanding all technical details. | |
119 | ************************************************************************ | |
120 | ||
121 | There are no limitations on what kind of task can exploit this new | |
122 | scheduling discipline, even if it must be said that it is particularly | |
123 | suited for periodic or sporadic real-time tasks that need guarantees on their | |
124 | timing behavior, e.g., multimedia, streaming, control applications, etc. | |
125 | ||
126 | A typical real-time task is composed of a repetition of computation phases | |
127 | (task instances, or jobs) which are activated on a periodic or sporadic | |
128 | fashion. | |
129 | Each job J_j (where J_j is the j^th job of the task) is characterised by an | |
130 | arrival time r_j (the time when the job starts), an amount of computation | |
131 | time c_j needed to finish the job, and a job absolute deadline d_j, which | |
132 | is the time within which the job should be finished. The maximum execution | |
133 | time max_j{c_j} is called "Worst Case Execution Time" (WCET) for the task. | |
134 | A real-time task can be periodic with period P if r_{j+1} = r_j + P, or | |
135 | sporadic with minimum inter-arrival time P is r_{j+1} >= r_j + P. Finally, | |
136 | d_j = r_j + D, where D is the task's relative deadline. | |
137 | ||
138 | SCHED_DEADLINE can be used to schedule real-time tasks guaranteeing that | |
139 | the jobs' deadlines of a task are respected. In order to do this, a task | |
140 | must be scheduled by setting: | |
141 | ||
142 | - runtime >= WCET | |
143 | - deadline = D | |
144 | - period <= P | |
145 | ||
146 | IOW, if runtime >= WCET and if period is >= P, then the scheduling deadlines | |
147 | and the absolute deadlines (d_j) coincide, so a proper admission control | |
148 | allows to respect the jobs' absolute deadlines for this task (this is what is | |
149 | called "hard schedulability property" and is an extension of Lemma 1 of [2]). | |
150 | ||
151 | References: | |
152 | 1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram- | |
153 | ming in a hard-real-time environment. Journal of the Association for | |
154 | Computing Machinery, 20(1), 1973. | |
155 | 2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard | |
156 | Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems | |
157 | Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf | |
158 | 3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab | |
159 | Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps | |
160 | ||
161 | 4. Bandwidth management | |
162 | ======================= | |
163 | ||
164 | In order for the -deadline scheduling to be effective and useful, it is | |
165 | important to have some method to keep the allocation of the available CPU | |
166 | bandwidth to the tasks under control. | |
167 | This is usually called "admission control" and if it is not performed at all, | |
168 | no guarantee can be given on the actual scheduling of the -deadline tasks. | |
169 | ||
170 | Since when RT-throttling has been introduced each task group has a bandwidth | |
171 | associated, calculated as a certain amount of runtime over a period. | |
172 | Moreover, to make it possible to manipulate such bandwidth, readable/writable | |
173 | controls have been added to both procfs (for system wide settings) and cgroupfs | |
174 | (for per-group settings). | |
175 | Therefore, the same interface is being used for controlling the bandwidth | |
176 | distrubution to -deadline tasks. | |
177 | ||
178 | However, more discussion is needed in order to figure out how we want to manage | |
179 | SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE | |
180 | uses (for now) a less sophisticated, but actually very sensible, mechanism to | |
181 | ensure that a certain utilization cap is not overcome per each root_domain. | |
182 | ||
183 | Another main difference between deadline bandwidth management and RT-throttling | |
184 | is that -deadline tasks have bandwidth on their own (while -rt ones don't!), | |
185 | and thus we don't need an higher level throttling mechanism to enforce the | |
186 | desired bandwidth. | |
187 | ||
188 | 4.1 System wide settings | |
189 | ------------------------ | |
190 | ||
191 | The system wide settings are configured under the /proc virtual file system. | |
192 | ||
193 | For now the -rt knobs are used for dl admission control and the -deadline | |
194 | runtime is accounted against the -rt runtime. We realise that this isn't | |
195 | entirely desirable; however, it is better to have a small interface for now, | |
196 | and be able to change it easily later. The ideal situation (see 5.) is to run | |
197 | -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct | |
198 | subset of dl_bw. | |
199 | ||
200 | This means that, for a root_domain comprising M CPUs, -deadline tasks | |
201 | can be created while the sum of their bandwidths stays below: | |
202 | ||
203 | M * (sched_rt_runtime_us / sched_rt_period_us) | |
204 | ||
205 | It is also possible to disable this bandwidth management logic, and | |
206 | be thus free of oversubscribing the system up to any arbitrary level. | |
207 | This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us. | |
208 | ||
209 | ||
210 | 4.2 Task interface | |
211 | ------------------ | |
212 | ||
213 | Specifying a periodic/sporadic task that executes for a given amount of | |
214 | runtime at each instance, and that is scheduled according to the urgency of | |
215 | its own timing constraints needs, in general, a way of declaring: | |
216 | - a (maximum/typical) instance execution time, | |
217 | - a minimum interval between consecutive instances, | |
218 | - a time constraint by which each instance must be completed. | |
219 | ||
220 | Therefore: | |
221 | * a new struct sched_attr, containing all the necessary fields is | |
222 | provided; | |
223 | * the new scheduling related syscalls that manipulate it, i.e., | |
224 | sched_setattr() and sched_getattr() are implemented. | |
225 | ||
226 | ||
227 | 4.3 Default behavior | |
228 | --------------------- | |
229 | ||
230 | The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to | |
231 | 950000. With rt_period equal to 1000000, by default, it means that -deadline | |
232 | tasks can use at most 95%, multiplied by the number of CPUs that compose the | |
233 | root_domain, for each root_domain. | |
234 | ||
235 | A -deadline task cannot fork. | |
236 | ||
237 | 5. Tasks CPU affinity | |
238 | ===================== | |
239 | ||
240 | -deadline tasks cannot have an affinity mask smaller that the entire | |
241 | root_domain they are created on. However, affinities can be specified | |
242 | through the cpuset facility (Documentation/cgroups/cpusets.txt). | |
243 | ||
244 | 5.1 SCHED_DEADLINE and cpusets HOWTO | |
245 | ------------------------------------ | |
246 | ||
247 | An example of a simple configuration (pin a -deadline task to CPU0) | |
248 | follows (rt-app is used to create a -deadline task). | |
249 | ||
250 | mkdir /dev/cpuset | |
251 | mount -t cgroup -o cpuset cpuset /dev/cpuset | |
252 | cd /dev/cpuset | |
253 | mkdir cpu0 | |
254 | echo 0 > cpu0/cpuset.cpus | |
255 | echo 0 > cpu0/cpuset.mems | |
256 | echo 1 > cpuset.cpu_exclusive | |
257 | echo 0 > cpuset.sched_load_balance | |
258 | echo 1 > cpu0/cpuset.cpu_exclusive | |
259 | echo 1 > cpu0/cpuset.mem_exclusive | |
260 | echo $$ > cpu0/tasks | |
261 | rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify | |
262 | task affinity) | |
263 | ||
264 | 6. Future plans | |
265 | =============== | |
266 | ||
267 | Still missing: | |
268 | ||
269 | - refinements to deadline inheritance, especially regarding the possibility | |
270 | of retaining bandwidth isolation among non-interacting tasks. This is | |
271 | being studied from both theoretical and practical points of view, and | |
272 | hopefully we should be able to produce some demonstrative code soon; | |
273 | - (c)group based bandwidth management, and maybe scheduling; | |
274 | - access control for non-root users (and related security concerns to | |
275 | address), which is the best way to allow unprivileged use of the mechanisms | |
276 | and how to prevent non-root users "cheat" the system? | |
277 | ||
278 | As already discussed, we are planning also to merge this work with the EDF | |
279 | throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in | |
280 | the preliminary phases of the merge and we really seek feedback that would | |
281 | help us decide on the direction it should take. |