Real-Time group scheduling
--------------------------

CONTENTS
========

0. WARNING
1. Overview
  1.1 The problem
  1.2 The solution
2. The interface
  2.1 System-wide settings
  2.2 Default behaviour
  2.3 Basis for grouping tasks
3. Future plans


0. WARNING
==========

Fiddling with these settings can result in an unstable system. The knobs are
root-only and assume that root knows what they are doing.

Most notably:

* very small values in sched_rt_period_us can result in an unstable
  system when the period is smaller than either the available hrtimer
  resolution, or the time it takes to handle the budget refresh itself.

* very small values in sched_rt_runtime_us can result in an unstable
  system when the runtime is so small the system has difficulty making
  forward progress (NOTE: the migration thread and kstopmachine both
  are real-time processes).

1. Overview
===========


1.1 The problem
---------------

Realtime scheduling is all about determinism: a group has to be able to rely on
the amount of bandwidth (i.e. CPU time) being constant. In order to schedule
multiple groups of realtime tasks, each group must be assigned a fixed portion
of the available CPU time. Without a minimum guarantee a realtime group can
obviously fall short. A fuzzy upper limit is of no use since it cannot be
relied upon. This leaves us with just the single fixed portion.

1.2 The solution
----------------

CPU time is divided by specifying how much time may be spent running in a
given period. Each realtime group is allocated a "run time", which the other
realtime groups are not permitted to use.

Any time not allocated to a realtime group will be used to run normal priority
tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
SCHED_OTHER.

Let's consider an example: a fixed-frame-rate realtime renderer must deliver 25
frames a second, which yields a period of 0.04s per frame. Now say it will also
have to play some music and respond to input, leaving around 80% of the CPU
time dedicated to the graphics. We can then give this group a run time of
0.8 * 0.04s = 0.032s.

This way the graphics group will have a 0.04s period with a 0.032s run time
limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
needs only about 3% CPU time to do so, it can do with 0.03 * 0.005s =
0.00015s. So this group can be scheduled with a period of 0.005s and a run time
of 0.00015s.

The remaining CPU time will be used for user input and other tasks. Because the
realtime tasks have explicitly been allocated the CPU time they need to perform
their tasks, buffer underruns in the graphics or audio can be eliminated.

NOTE: the above example is not fully implemented yet (as of 2.6.25). We still
lack an EDF scheduler to make non-uniform periods usable.
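
The budget arithmetic is easy to check in the shell; a minimal sketch, using
the utilization figures assumed in the example above:

  gfx_period=40000                           # 25 frames/s -> 0.04s per frame
  gfx_runtime=$((gfx_period * 80 / 100))     # 80% of the period = 32000us
  audio_period=5000                          # DMA refill every 0.005s
  audio_runtime=$((audio_period * 3 / 100))  # 3% of the period = 150us
  echo "graphics: ${gfx_runtime}us of every ${gfx_period}us"
  echo "audio:    ${audio_runtime}us of every ${audio_period}us"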


2. The interface
================


2.1 System-wide settings
------------------------

The system-wide settings are configured under the /proc virtual file system:

/proc/sys/kernel/sched_rt_period_us:
  The scheduling period that is equivalent to 100% CPU bandwidth.

/proc/sys/kernel/sched_rt_runtime_us:
  A global limit on how much time realtime scheduling may use. Even without
  CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
  processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
  available to all realtime groups.

 * Time is specified in us because the interface is s32. This gives an
   operating range from 1us to about 35 minutes.
 * sched_rt_period_us takes values from 1 to INT_MAX.
 * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
 * A run time of -1 specifies runtime == period, i.e. no limit.
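
For example, the current values can be inspected (and, as root, changed)
through these files; the values shown are the defaults described in
section 2.2:

  # cat /proc/sys/kernel/sched_rt_period_us
  1000000
  # cat /proc/sys/kernel/sched_rt_runtime_us
  950000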


2.2 Default behaviour
---------------------

The default values are 1000000 (1s) for sched_rt_period_us and 950000 (0.95s)
for sched_rt_runtime_us. This leaves 0.05s to be used by SCHED_OTHER (non-RT
tasks). These defaults were chosen so that a run-away realtime task will not
lock up the machine but leave a little time to recover it. By setting runtime
to -1 you'd get the old behaviour back.
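
For example, restoring the old unlimited behaviour is a single write (as
root):

  # echo -1 > /proc/sys/kernel/sched_rt_runtime_us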

By default all bandwidth is assigned to the root group and new groups get the
period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
want to assign bandwidth to another group, reduce the root group's bandwidth
and assign some or all of the difference to another group.
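
A sketch of how that might look using the control-group interface described in
section 2.3 below (the /cgroup mount point and the group name "myrtgroup" are
arbitrary examples):

  # mount -t cgroup -o cpu none /cgroup       (skip if already mounted)
  # mkdir /cgroup/myrtgroup
  # echo 850000 > /cgroup/cpu.rt_runtime_us   (take 100000us from the root group)
  # echo 100000 > /cgroup/myrtgroup/cpu.rt_runtime_us

After this, tasks in myrtgroup may run realtime for up to 0.1s of every 1s
period.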

Realtime group scheduling means you have to assign a portion of total CPU
bandwidth to the group before it will accept realtime tasks. Therefore you will
not be able to run realtime tasks as any user other than root until you have
done that, even if the user has the rights to run processes with realtime
priority!


2.3 Basis for grouping tasks
----------------------------

There are two compile-time settings for allocating CPU bandwidth. These are
configured using the "Basis for grouping tasks" multiple choice menu under
General setup > Group CPU Scheduler:

a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" = "user id")

This lets you use the virtual files under
"/sys/kernel/uids/<uid>/cpu_rt_runtime_us" to control the CPU time reserved for
each user.
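
For example, to reserve 10000us per period for user 1000 (an arbitrary example
uid):

  # echo 10000 > /sys/kernel/uids/1000/cpu_rt_runtime_us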

The other option is:

b. CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups")

This uses the /cgroup virtual file system and
"/cgroup/<cgroup>/cpu.rt_runtime_us" to control the CPU time reserved for each
control group instead.

For more information on working with control groups, you should read
Documentation/cgroups/cgroups.txt as well.

Group settings are checked against the following limits in order to keep the
configuration schedulable:

   \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period

For now, this can be simplified to just the following (but see Future plans):

   \Sum_{i} runtime_{i} <= global_runtime
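
This sum is easy to check by hand; a sketch using the cgroup interface
(assuming the /cgroup mount from section 2.2, and that no group is set
to -1):

  total=0
  for f in /cgroup/*/cpu.rt_runtime_us; do
          total=$((total + $(cat "$f")))
  done
  echo "allocated: ${total}us of $(cat /proc/sys/kernel/sched_rt_runtime_us)us"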


3. Future plans
===============

There is work in progress to make the scheduling period for each group
("/sys/kernel/uids/<uid>/cpu_rt_period_us" or
"/cgroup/<cgroup>/cpu.rt_period_us" respectively) configurable as well.

The constraint on the period is that a subgroup must have a period smaller
than or equal to its parent's. But realistically it's not very useful _yet_,
as it's prone to starvation without deadline scheduling.

Consider two sibling groups A and B; both have 50% bandwidth, but A's
period is twice the length of B's.

* group A: period=100000us, runtime=10000us
    - this runs for 0.01s once every 0.1s

* group B: period= 50000us, runtime=10000us
    - this runs for 0.01s twice every 0.1s (or once every 0.05s).
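
Once the per-group period files mentioned above are available, configuring
this example could look as follows (a sketch; the paths assume the /cgroup
layout from section 2.3):

  # mkdir /cgroup/A /cgroup/B
  # echo 100000 > /cgroup/A/cpu.rt_period_us
  # echo  10000 > /cgroup/A/cpu.rt_runtime_us
  # echo  50000 > /cgroup/B/cpu.rt_period_us
  # echo  10000 > /cgroup/B/cpu.rt_runtime_us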

This means that currently a while (1) loop in A will run for the full period of
B and can starve B's tasks (assuming they are of lower priority) for a whole
period.

The next project will be SCHED_EDF (Earliest Deadline First scheduling) to
bring full deadline scheduling to the Linux kernel. Deadline scheduling the
above groups and treating the end of the period as a deadline will ensure that
they both get their allocated time.

Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
the biggest challenge, as the current Linux PI infrastructure is geared towards
the limited static priority levels 0-99. With deadline scheduling you need to
do deadline inheritance (since priority is inversely proportional to the
deadline delta (deadline - now)).

This means the whole PI machinery will have to be reworked - and that is one of
the most complex pieces of code we have.