sched: Enhance the scheduler migration load fixup feature

In the current frequency guidance implementation, the scheduler migrates
a task's load from the source CPU to the destination CPU when the task
migrates. The underlying assumption is that a task will stay on the
destination CPU following the migration. Hence a CPU's load should
reflect the sum of the loads of all tasks that last ran on that CPU
prior to window expiration, even if those tasks executed on some other
CPU earlier in that window before being migrated.

However, given the ubiquitous nature of migrations, the above assumption
is flawed. It often causes the scheduler to accumulate, on a single CPU,
load that in reality ran concurrently on multiple CPUs and will continue
to run concurrently in subsequent windows. This over-reports load on a
single CPU, which in turn drives CPU frequency higher than necessary.

This is the first patch in a series that changes how load fixups are
done upon migration in order to prevent load over-reporting. In this
patch, we stop doing migration fixups for intra-cluster migrations;
inter-cluster migration fixups are retained.
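
The decision boils down to a cluster comparison at migration time. A
minimal standalone C sketch of the idea follows; same_cluster() and
fixup_busy_time() here are illustrative stand-ins, not the exact kernel
symbols:

  #include <stdbool.h>

  #define NR_CPUS 8

  /* Illustrative per-CPU cluster id table. */
  static int cpu_cluster_id[NR_CPUS];

  static bool same_cluster(int src_cpu, int dst_cpu)
  {
          return cpu_cluster_id[src_cpu] == cpu_cluster_id[dst_cpu];
  }

  static void fixup_busy_time(int src_cpu, int dst_cpu)
  {
          /* Intra-cluster migration: no fixup at all. */
          if (same_cluster(src_cpu, dst_cpu))
                  return;

          /* Inter-cluster migration: move the task's busy time from
           * the source cluster to the destination CPU (see below). */
  }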

In order to achieve the above, we make use of the per-CPU footprint of
each task introduced in the previous patch. Upon inter-cluster
migration, we walk every CPU in the source cluster and subtract the
migrating task's contribution to the busy time on each of those CPUs.
The sum of these contributions is then added to the destination CPU,
allowing it to ramp up to the appropriate frequency for that task.
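
A minimal sketch of that walk, assuming a hypothetical per-task
footprint array indexed by CPU (curr_window_cpu[]/prev_window_cpu[] are
illustrative names, not the actual kernel fields):

  #include <stdint.h>

  #define NR_CPUS 8

  /* Hypothetical per-task footprint: busy time contributed to each
   * CPU in the current and previous windows. */
  struct task_footprint {
          uint64_t curr_window_cpu[NR_CPUS];
          uint64_t prev_window_cpu[NR_CPUS];
  };

  /* Sum and clear the task's contribution on every CPU of the source
   * cluster; the caller credits the totals to the destination CPU so
   * it can ramp up to the appropriate frequency. */
  static void collect_src_cluster_load(struct task_footprint *fp,
                                       const int *src_cpus, int nr_src,
                                       uint64_t *curr_sum,
                                       uint64_t *prev_sum)
  {
          int i;

          *curr_sum = 0;
          *prev_sum = 0;
          for (i = 0; i < nr_src; i++) {
                  int cpu = src_cpus[i];

                  *curr_sum += fp->curr_window_cpu[cpu];
                  *prev_sum += fp->prev_window_cpu[cpu];
                  fp->curr_window_cpu[cpu] = 0;
                  fp->prev_window_cpu[cpu] = 0;
          }
  }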

Subtracting load from each of the source CPUs is not trivial, however,
as it would require all of their runqueue locks to be held. To get
around this we introduce a deferred load subtraction mechanism whereby
the subtraction of load from each of the source CPUs is deferred until
an opportune moment. This opportune moment is when the governor comes
asking the scheduler for load. At that time, all the necessary runqueue
locks are already held.
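
The mechanism can be sketched as a per-CPU "pending subtraction" that is
recorded at migration time and applied when the governor reads load with
the rq lock held. The structure and helper names below are illustrative
only:

  #include <stdint.h>

  struct load_subtraction {
          uint64_t window_start;   /* window the subtraction belongs to */
          uint64_t curr_sub;       /* amount pending for the curr window */
          uint64_t prev_sub;       /* amount pending for the prev window */
  };

  struct cpu_load {
          uint64_t curr_runnable_sum;
          uint64_t prev_runnable_sum;
          struct load_subtraction subs;
  };

  /* Called at migration time: only record how much must be removed. */
  static void record_subtraction(struct cpu_load *cl, uint64_t ws,
                                 uint64_t curr, uint64_t prev)
  {
          cl->subs.window_start = ws;
          cl->subs.curr_sub += curr;
          cl->subs.prev_sub += prev;
  }

  /* Called when the governor queries load, with the rq lock held:
   * apply and clear the pending subtraction. */
  static void apply_subtraction(struct cpu_load *cl)
  {
          cl->curr_runnable_sum -= cl->subs.curr_sub;
          cl->prev_runnable_sum -= cl->subs.prev_sub;
          cl->subs.curr_sub = 0;
          cl->subs.prev_sub = 0;
  }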

There are a few cases to consider when doing the deferred subtraction;
a sketch covering them follows the last case below. Since we are not
holding all of the runqueue locks, other CPUs in the source cluster can
be in a different window than the source CPU that the task is migrating
from.

Case 1: The other CPU in the source cluster is in the same window
No special consideration is needed.

Case 2: The other CPU in the source cluster is ahead by one window
In this case, we will end up doing a redundant update to the
subtraction load for the prev window. There is no way to avoid this
redundant update, though, without holding the rq lock.

Case 3: The other CPU in the source cluster is trailing by one window
In this case, we might end up overwriting old data for that CPU. This
is not a problem, however, because the other CPU will move to the same
window once it calls update_task_ravg(). This relies on windows being
kept synchronized between CPUs, which is true today.
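
Building on the record_subtraction()/struct cpu_load sketch above, the
three cases could be handled roughly as follows; the exact bucketing is
an assumption for illustration, not the actual kernel logic:

  static void bucket_subtraction(struct cpu_load *cl,
                                 uint64_t other_ws, uint64_t src_ws,
                                 uint64_t window_size,
                                 uint64_t curr, uint64_t prev)
  {
          if (other_ws == src_ws) {
                  /* Case 1: same window; subtract from the curr and
                   * prev buckets directly. */
                  record_subtraction(cl, other_ws, curr, prev);
          } else if (other_ws == src_ws + window_size) {
                  /* Case 2: other CPU already rolled over; the source's
                   * curr window maps onto its prev window. The
                   * prev-window update is redundant but harmless
                   * without the rq lock. */
                  record_subtraction(cl, other_ws, 0, curr + prev);
          } else {
                  /* Case 3: other CPU is trailing; overwrite its stale
                   * data. It will catch up to src_ws on its next
                   * update_task_ravg(). */
                  cl->subs.window_start = src_ws;
                  cl->subs.curr_sub = curr;
                  cl->subs.prev_sub = prev;
          }
  }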

Finally, we must deal with frequency aggregation. When frequency
aggregation is in effect, there is little point in maintaining a per-CPU
footprint since the load of all related tasks has to be reported on a
single CPU. Therefore, when a task enters a related group, we clear out
all of its per-CPU contributions and add its load to the task CPU's
cpu_time struct. From that point onwards we stop managing per-CPU
contributions upon inter-cluster migrations since that work is
redundant. Finally, when a task exits a related group, we must walk
every CPU and reset all of its per-CPU contributions. We then set the
contribution for the task's CPU to the respective curr/prev sum values
and add those sums to the task CPU's rq runnable sums.
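
A rough sketch of the two transitions, with illustrative field and
helper names rather than the actual kernel structures:

  #include <stdint.h>

  #define NR_CPUS 8

  /* Illustrative group-level accounting on the task's CPU. */
  struct grp_cpu_time {
          uint64_t curr_runnable_sum;
          uint64_t prev_runnable_sum;
  };

  /* Illustrative per-task demand with its per-CPU footprint. */
  struct task_demand {
          uint64_t curr_window, prev_window;
          uint64_t curr_window_cpu[NR_CPUS];
          uint64_t prev_window_cpu[NR_CPUS];
  };

  /* Group entry: drop the per-CPU footprint and account the whole
   * curr/prev load on the task CPU's group cpu_time. */
  static void transfer_to_group(struct task_demand *td,
                                struct grp_cpu_time *ct)
  {
          int cpu;

          for (cpu = 0; cpu < NR_CPUS; cpu++) {
                  td->curr_window_cpu[cpu] = 0;
                  td->prev_window_cpu[cpu] = 0;
          }
          ct->curr_runnable_sum += td->curr_window;
          ct->prev_runnable_sum += td->prev_window;
  }

  /* Group exit: reset every CPU contribution, credit the task's CPU
   * with the full curr/prev sums and add them to its rq runnable
   * sums. */
  static void transfer_from_group(struct task_demand *td, int task_cpu,
                                  uint64_t *rq_curr_sum,
                                  uint64_t *rq_prev_sum)
  {
          int cpu;

          for (cpu = 0; cpu < NR_CPUS; cpu++) {
                  td->curr_window_cpu[cpu] = 0;
                  td->prev_window_cpu[cpu] = 0;
          }
          td->curr_window_cpu[task_cpu] = td->curr_window;
          td->prev_window_cpu[task_cpu] = td->prev_window;
          *rq_curr_sum += td->curr_window;
          *rq_prev_sum += td->prev_window;
  }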

Change-Id: I1f8d596e6c930f3f6f00e24109ddbe8b121f8d6b
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
3 files changed