Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 1 | Real-Time group scheduling |
| 2 | -------------------------- |
| 3 | |
| 4 | CONTENTS |
| 5 | ======== |
| 6 | |
| 7 | 1. Overview |
| 8 | 1.1 The problem |
| 9 | 1.2 The solution |
| 10 | 2. The interface |
| 11 | 2.1 System-wide settings |
| 12 | 2.2 Default behaviour |
| 13 | 2.3 Basis for grouping tasks |
| 14 | 3. Future plans |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 15 | |
| 16 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 17 | 1. Overview |
| 18 | =========== |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 19 | |
| 20 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 21 | 1.1 The problem |
| 22 | --------------- |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 23 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 24 | Realtime scheduling is all about determinism, a group has to be able to rely on |
| 25 | the amount of bandwidth (eg. CPU time) being constant. In order to schedule |
| 26 | multiple groups of realtime tasks, each group must be assigned a fixed portion |
| 27 | of the CPU time available. Without a minimum guarantee a realtime group can |
| 28 | obviously fall short. A fuzzy upper limit is of no use since it cannot be |
| 29 | relied upon. Which leaves us with just the single fixed portion. |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 30 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 31 | 1.2 The solution |
| 32 | ---------------- |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 33 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 34 | CPU time is divided by means of specifying how much time can be spent running |
| 35 | in a given period. We allocate this "run time" for each realtime group which |
| 36 | the other realtime groups will not be permitted to use. |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 37 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 38 | Any time not allocated to a realtime group will be used to run normal priority |
| 39 | tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by |
| 40 | SCHED_OTHER. |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 41 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 42 | Let's consider an example: a frame fixed realtime renderer must deliver 25 |
| 43 | frames a second, which yields a period of 0.04s per frame. Now say it will also |
| 44 | have to play some music and respond to input, leaving it with around 80% CPU |
| 45 | time dedicated for the graphics. We can then give this group a run time of 0.8 |
| 46 | * 0.04s = 0.032s. |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 47 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 48 | This way the graphics group will have a 0.04s period with a 0.032s run time |
| 49 | limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but |
| 50 | needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s = |
| 51 | 0.00015s. So this group can be scheduled with a period of 0.005s and a run time |
| 52 | of 0.00015s. |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 53 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 54 | The remaining CPU time will be used for user input and other tass. Because |
| 55 | realtime tasks have explicitly allocated the CPU time they need to perform |
| 56 | their tasks, buffer underruns in the graphocs or audio can be eliminated. |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 57 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 58 | NOTE: the above example is not fully implemented as of yet (2.6.25). We still |
| 59 | lack an EDF scheduler to make non-uniform periods usable. |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 60 | |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 61 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 62 | 2. The Interface |
| 63 | ================ |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 64 | |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 65 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 66 | 2.1 System wide settings |
| 67 | ------------------------ |
| 68 | |
| 69 | The system wide settings are configured under the /proc virtual file system: |
| 70 | |
| 71 | /proc/sys/kernel/sched_rt_period_us: |
| 72 | The scheduling period that is equivalent to 100% CPU bandwidth |
| 73 | |
| 74 | /proc/sys/kernel/sched_rt_runtime_us: |
| 75 | A global limit on how much time realtime scheduling may use. Even without |
| 76 | CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime |
| 77 | processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth |
| 78 | available to all realtime groups. |
| 79 | |
| 80 | * Time is specified in us because the interface is s32. This gives an |
| 81 | operating range from 1us to about 35 minutes. |
| 82 | * sched_rt_period_us takes values from 1 to INT_MAX. |
| 83 | * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1). |
| 84 | * A run time of -1 specifies runtime == period, ie. no limit. |
| 85 | |
| 86 | |
| 87 | 2.2 Default behaviour |
| 88 | --------------------- |
| 89 | |
| 90 | The default values for sched_rt_period_us (1000000 or 1s) and |
| 91 | sched_rt_runtime_us (950000 or 0.95s). This gives 0.05s to be used by |
| 92 | SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away |
| 93 | realtime tasks will not lock up the machine but leave a little time to recover |
| 94 | it. By setting runtime to -1 you'd get the old behaviour back. |
| 95 | |
| 96 | By default all bandwidth is assigned to the root group and new groups get the |
| 97 | period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you |
| 98 | want to assign bandwidth to another group, reduce the root group's bandwidth |
| 99 | and assign some or all of the difference to another group. |
| 100 | |
| 101 | Realtime group scheduling means you have to assign a portion of total CPU |
| 102 | bandwidth to the group before it will accept realtime tasks. Therefore you will |
| 103 | not be able to run realtime tasks as any user other than root until you have |
| 104 | done that, even if the user has the rights to run processes with realtime |
| 105 | priority! |
| 106 | |
| 107 | |
| 108 | 2.3 Basis for grouping tasks |
| 109 | ---------------------------- |
| 110 | |
| 111 | There are two compile-time settings for allocating CPU bandwidth. These are |
| 112 | configured using the "Basis for grouping tasks" multiple choice menu under |
| 113 | General setup > Group CPU Scheduler: |
| 114 | |
| 115 | a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" = "user id") |
| 116 | |
| 117 | This lets you use the virtual files under |
| 118 | "/sys/kernel/uids/<uid>/cpu_rt_runtime_us" to control he CPU time reserved for |
| 119 | each user . |
| 120 | |
| 121 | The other option is: |
| 122 | |
| 123 | .o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups") |
| 124 | |
| 125 | This uses the /cgroup virtual file system and "/cgroup/<cgroup>/cpu.rt_runtime_us" |
| 126 | to control the CPU time reserved for each control group instead. |
| 127 | |
| 128 | For more information on working with control groups, you should read |
| 129 | Documentation/cgroups.txt as well. |
| 130 | |
| 131 | Group settings are checked against the following limits in order to keep the configuration |
| 132 | schedulable: |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 133 | |
| 134 | \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period |
| 135 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 136 | For now, this can be simplified to just the following (but see Future plans): |
| 137 | |
| 138 | \Sum_{i} runtime_{i} <= global_runtime |
| 139 | |
| 140 | |
| 141 | 3. Future plans |
| 142 | =============== |
| 143 | |
| 144 | There is work in progress to make the scheduling period for each group |
| 145 | ("/sys/kernel/uids/<uid>/cpu_rt_period_us" or |
| 146 | "/cgroup/<cgroup>/cpu.rt_period_us" respectively) configurable as well. |
| 147 | |
| 148 | The constraint on the period is that a subgroup must have a smaller or |
| 149 | equal period to its parent. But realistically its not very useful _yet_ |
| 150 | as its prone to starvation without deadline scheduling. |
| 151 | |
| 152 | Consider two sibling groups A and B; both have 50% bandwidth, but A's |
| 153 | period is twice the length of B's. |
| 154 | |
| 155 | * group A: period=100000us, runtime=10000us |
| 156 | - this runs for 0.01s once every 0.1s |
| 157 | |
| 158 | * group B: period= 50000us, runtime=10000us |
| 159 | - this runs for 0.01s twice every 0.1s (or once every 0.05 sec). |
| 160 | |
| 161 | This means that currently a while (1) loop in A will run for the full period of |
| 162 | B and can starve B's tasks (assuming they are of lower priority) for a whole |
| 163 | period. |
| 164 | |
| 165 | The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring |
| 166 | full deadline scheduling to the linux kernel. Deadline scheduling the above |
| 167 | groups and treating end of the period as a deadline will ensure that they both |
| 168 | get their allocated time. |
| 169 | |
| 170 | Implementing SCHED_EDF might take a while to complete. Priority Inheritance is |
| 171 | the biggest challenge as the current linux PI infrastructure is geared towards |
| 172 | the limited static priority levels 0-139. With deadline scheduling you need to |
| 173 | do deadline inheritance (since priority is inversely proportional to the |
| 174 | deadline delta (deadline - now). |
| 175 | |
| 176 | This means the whole PI machinery will have to be reworked - and that is one of |
| 177 | the most complex pieces of code we have. |