Cluster-wide Power-up/power-down race avoidance algorithm
=========================================================

This file documents the algorithm which is used to coordinate CPU and
cluster setup and teardown operations and to manage hardware coherency
controls safely.

The section "Rationale" explains what the algorithm is for and why it is
needed.  "Basic model" explains general concepts using a simplified view
of the system.  The other sections explain the actual details of the
algorithm in use.


Rationale
---------

In a system containing multiple CPUs, it is desirable to have the
ability to turn off individual CPUs when the system is idle, reducing
power consumption and thermal dissipation.

In a system containing multiple clusters of CPUs, it is also desirable
to have the ability to turn off entire clusters.

Turning entire clusters off and on is a risky business, because it
involves performing potentially destructive operations affecting a group
of independently running CPUs, while the OS continues to run.  This
means that we need some coordination in order to ensure that critical
cluster-level operations are only performed when it is truly safe to do
so.

Simple locking may not be sufficient to solve this problem, because
mechanisms like Linux spinlocks may rely on coherency mechanisms which
are not immediately enabled when a cluster powers up.  Since enabling or
disabling those mechanisms may itself be a non-atomic operation (such as
writing some hardware registers and invalidating large caches), other
methods of coordination are required in order to guarantee safe
power-down and power-up at the cluster level.

The mechanism presented in this document describes a coherent memory
based protocol for performing the needed coordination.  It aims to be as
lightweight as possible, while providing the required safety properties.


Basic model
-----------

Each cluster and CPU is assigned a state, as follows:

    DOWN
    COMING_UP
    UP
    GOING_DOWN

        +---------> UP ----------+
        |                        v

    COMING_UP               GOING_DOWN

        ^                        |
        +--------- DOWN <--------+

DOWN: The CPU or cluster is not coherent, and is either powered off or
    suspended, or is ready to be powered off or suspended.

COMING_UP: The CPU or cluster has committed to moving to the UP state.
    It may be part way through the process of initialisation and
    enabling coherency.

UP: The CPU or cluster is active and coherent at the hardware
    level.  A CPU in this state is not necessarily being used
    actively by the kernel.

GOING_DOWN: The CPU or cluster has committed to moving to the DOWN
    state.  It may be part way through the process of teardown and
    coherency exit.

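
As a concrete illustration, the four-state cycle can be modelled as a
tiny table-driven validity check.  This is plain C written for this
document, not kernel code, and all names in it are invented:

```c
#include <assert.h>
#include <stdbool.h>

/* The four states of the basic model, in cycle order. */
enum basic_state { DOWN, COMING_UP, UP, GOING_DOWN, NR_BASIC_STATES };

/*
 * The basic model permits exactly one successor per state:
 * DOWN -> COMING_UP -> UP -> GOING_DOWN -> DOWN.
 */
static bool basic_transition_valid(enum basic_state from,
                                   enum basic_state to)
{
    static const enum basic_state next[NR_BASIC_STATES] = {
        [DOWN]       = COMING_UP,
        [COMING_UP]  = UP,
        [UP]         = GOING_DOWN,
        [GOING_DOWN] = DOWN,
    };

    return next[from] == to;
}
```

Because each state has exactly one legal successor, an observer that
reads a stale state value can only ever be one step behind, which is
what makes the basic model easy to reason about.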

Each CPU has one of these states assigned to it at any point in time.
The CPU states are described in the "CPU state" section, below.

Each cluster is also assigned a state, but it is necessary to split the
state value into two parts (the "cluster" state and "inbound" state) and
to introduce additional states in order to avoid races between different
CPUs in the cluster simultaneously modifying the state.  The
cluster-level states are described in the "Cluster state" section.

To help distinguish the CPU states from cluster states in this
discussion, the state names are given a CPU_ prefix for the CPU states,
and a CLUSTER_ or INBOUND_ prefix for the cluster states.


CPU state
---------

In this algorithm, each individual core in a multi-core processor is
referred to as a "CPU".  CPUs are assumed to be single-threaded:
therefore, a CPU can only be doing one thing at a single point in time.

This means that CPUs fit the basic model closely.

The algorithm defines the following states for each CPU in the system:

    CPU_DOWN
    CPU_COMING_UP
    CPU_UP
    CPU_GOING_DOWN

     cluster setup and
    CPU setup complete          policy decision
          +-----------> CPU_UP ------------+
          |                                v

     CPU_COMING_UP                  CPU_GOING_DOWN

          ^                                |
          +----------- CPU_DOWN <----------+
        policy decision           CPU teardown complete
     or hardware event


The definitions of the four states correspond closely to the states of
the basic model.

Transitions between states occur as follows.

A trigger event (spontaneous) means that the CPU can transition to the
next state as a result of making local progress only, with no
requirement for any external event to happen.


CPU_DOWN:
    A CPU reaches the CPU_DOWN state when it is ready for power-down.
    On reaching this state, the CPU will typically power itself down
    or suspend itself, via a WFI instruction or a firmware call.

    Next state: CPU_COMING_UP
    Conditions: none

    Trigger events:

        a) an explicit hardware power-up operation, resulting from a
           policy decision on another CPU;

        b) a hardware event, such as an interrupt.


CPU_COMING_UP:
    A CPU cannot start participating in hardware coherency until the
    cluster is set up and coherent.  If the cluster is not ready, then
    the CPU will wait in the CPU_COMING_UP state until the cluster has
    been set up.

    Next state: CPU_UP
    Conditions: The CPU's parent cluster must be in CLUSTER_UP.
    Trigger events: Transition of the parent cluster to CLUSTER_UP.

    Refer to the "Cluster state" section for a description of the
    CLUSTER_UP state.


CPU_UP:
    When a CPU reaches the CPU_UP state, it is safe for the CPU to
    start participating in local coherency.

    This is done by jumping to the kernel's CPU resume code.

    Note that the definition of this state is slightly different from
    the basic model definition: CPU_UP does not mean that the CPU is
    coherent yet, but it does mean that it is safe to resume the
    kernel.  The kernel handles the rest of the resume procedure, so
    the remaining steps are not visible as part of the race avoidance
    algorithm.

    The CPU remains in this state until an explicit policy decision is
    made to shut down or suspend the CPU.

    Next state: CPU_GOING_DOWN
    Conditions: none
    Trigger events: explicit policy decision


CPU_GOING_DOWN:
    While in this state, the CPU exits coherency, including any
    operations required to achieve this (such as cleaning data
    caches).

    Next state: CPU_DOWN
    Conditions: local CPU teardown complete
    Trigger events: (spontaneous)
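
The four CPU-state transitions above, including the one
externally-conditioned step (CPU_COMING_UP waiting for its parent
cluster), can be sketched as a simulation.  This is illustrative plain
C, not the kernel's implementation; all names are invented:

```c
#include <assert.h>
#include <stdbool.h>

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };

/* Simplified view of the parent cluster, for this sketch only. */
enum parent_cluster { CLUSTER_NOT_UP, CLUSTER_IS_UP };

/*
 * Attempt one step of the CPU state machine.  The
 * CPU_COMING_UP -> CPU_UP transition is the only one with an external
 * condition: the parent cluster must already be up.  Returns true if
 * the CPU advanced to the next state.
 */
static bool cpu_advance(enum cpu_state *cpu, enum parent_cluster cluster)
{
    switch (*cpu) {
    case CPU_DOWN:          /* power-up operation or hardware event */
        *cpu = CPU_COMING_UP;
        return true;
    case CPU_COMING_UP:     /* must wait for the parent cluster */
        if (cluster != CLUSTER_IS_UP)
            return false;
        *cpu = CPU_UP;
        return true;
    case CPU_UP:            /* explicit policy decision */
        *cpu = CPU_GOING_DOWN;
        return true;
    case CPU_GOING_DOWN:    /* spontaneous, once teardown completes */
        *cpu = CPU_DOWN;
        return true;
    }
    return false;
}
```

The sketch makes the key property visible: a CPU coming up makes no
progress at all until its cluster's state allows it to.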


Cluster state
-------------

A cluster is a group of connected CPUs with some common resources.
Because a cluster contains multiple CPUs, it can be doing multiple
things at the same time.  This has some implications.  In particular, a
CPU can start up while another CPU is tearing the cluster down.

In this discussion, the "outbound side" is the view of the cluster state
as seen by a CPU tearing the cluster down.  The "inbound side" is the
view of the cluster state as seen by a CPU setting the cluster up.

In order to enable safe coordination in such situations, it is important
that a CPU which is setting up the cluster can advertise its state
independently of the CPU which is tearing down the cluster.  For this
reason, the cluster state is split into two parts:

    "cluster" state: The global state of the cluster; or the state
        on the outbound side:

        CLUSTER_DOWN
        CLUSTER_UP
        CLUSTER_GOING_DOWN

    "inbound" state: The state of the cluster on the inbound side.

        INBOUND_NOT_COMING_UP
        INBOUND_COMING_UP


The different pairings of these states result in six possible
states for the cluster as a whole:

                 CLUSTER_UP
       +==========> INBOUND_NOT_COMING_UP -------------+
       #                                               |
                                                       |
     CLUSTER_UP     <----+                             |
  INBOUND_COMING_UP      |                             v

       ^           CLUSTER_GOING_DOWN         CLUSTER_GOING_DOWN
       #            INBOUND_COMING_UP  <===  INBOUND_NOT_COMING_UP

     CLUSTER_DOWN        |                             |
  INBOUND_COMING_UP <----+                             |
                                                       |
       ^                                               |
       +===========     CLUSTER_DOWN     <-------------+
                    INBOUND_NOT_COMING_UP

Transitions -----> can only be made by the outbound CPU, and
only involve changes to the "cluster" state.

Transitions ===##> can only be made by the inbound CPU, and only
involve changes to the "inbound" state, except where there is no
further transition possible on the outbound side (i.e., the
outbound CPU has put the cluster into the CLUSTER_DOWN state).

The race avoidance algorithm does not provide a way to determine
which exact CPUs within the cluster play these roles.  This must
be decided in advance by some other means.  Refer to the section
"Last man and first man selection" for more explanation.
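
The reason the split works can be sketched directly: each side owns one
field and never performs a read-modify-write on the other side's field,
so the state variables themselves need no atomic operations (and hence
no working coherency for locks).  Illustrative C only, with invented
names; the memory-ordering and cache-maintenance requirements of the
real protocol are omitted here:

```c
#include <assert.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

/*
 * The cluster state is split so that each side owns one field: the
 * outbound CPU writes only ->cluster, and the inbound CPU writes only
 * ->inbound (except when the outbound side has already reached
 * CLUSTER_DOWN and can make no further transitions).
 */
struct split_cluster_state {
    enum cluster_state cluster;  /* written by the outbound side */
    enum inbound_state inbound;  /* written by the inbound side */
};

/* Inbound CPU advertises that it intends to bring the cluster up. */
static void inbound_announce(struct split_cluster_state *s)
{
    s->inbound = INBOUND_COMING_UP;
}

/* Outbound CPU checks whether teardown should be abandoned. */
static int outbound_should_abort(const struct split_cluster_state *s)
{
    return s->inbound == INBOUND_COMING_UP;
}
```

Note that the inbound CPU's announcement never touches the "cluster"
field, so it cannot race with whatever the outbound CPU is doing to it.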


CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
cluster can actually be powered down.

The parallelism of the inbound and outbound CPUs is observed by
the existence of two different paths from CLUSTER_GOING_DOWN/
INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
COMING_UP in the basic model).  The second path avoids cluster
teardown completely.

CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
model.  The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
is trivial and merely resets the state machine ready for the
next cycle.

Details of the allowable transitions follow.

The next state in each case is notated

    <cluster state>/<inbound state> (<transitioner>)

where the <transitioner> is the side on which the transition
can occur; either the inbound or the outbound side.


CLUSTER_DOWN/INBOUND_NOT_COMING_UP:

    Next state: CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
    Conditions: none
    Trigger events:

        a) an explicit hardware power-up operation, resulting from a
           policy decision on another CPU;

        b) a hardware event, such as an interrupt.


CLUSTER_DOWN/INBOUND_COMING_UP:

    In this state, an inbound CPU sets up the cluster, including
    enabling of hardware coherency at the cluster level and any other
    operations (such as cache invalidation) which are required in
    order to achieve this.

    The purpose of this state is to do sufficient cluster-level setup
    to enable other CPUs in the cluster to enter coherency safely.

    Next state: CLUSTER_UP/INBOUND_COMING_UP (inbound)
    Conditions: cluster-level setup and hardware coherency complete
    Trigger events: (spontaneous)


CLUSTER_UP/INBOUND_COMING_UP:

    Cluster-level setup is complete and hardware coherency is enabled
    for the cluster.  Other CPUs in the cluster can safely enter
    coherency.

    This is a transient state, leading immediately to
    CLUSTER_UP/INBOUND_NOT_COMING_UP.  All other CPUs on the cluster
    should treat these two states as equivalent.

    Next state: CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
    Conditions: none
    Trigger events: (spontaneous)


CLUSTER_UP/INBOUND_NOT_COMING_UP:

    Cluster-level setup is complete and hardware coherency is enabled
    for the cluster.  Other CPUs in the cluster can safely enter
    coherency.

    The cluster will remain in this state until a policy decision is
    made to power the cluster down.

    Next state: CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
    Conditions: none
    Trigger events: policy decision to power down the cluster


CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:

    An outbound CPU is tearing the cluster down.  The selected CPU
    must wait in this state until all CPUs in the cluster are in the
    CPU_DOWN state.

    When all CPUs are in the CPU_DOWN state, the cluster can be torn
    down, for example by cleaning data caches and exiting
    cluster-level coherency.

    To avoid unnecessary teardown operations, the outbound CPU should
    check the inbound cluster state for asynchronous transitions to
    INBOUND_COMING_UP.  Alternatively, individual CPUs can be checked
    for entry into CPU_COMING_UP or CPU_UP.
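
One polling step of the outbound CPU's wait described above might look
like the following sketch.  This is an illustrative C simulation with
invented names; in the real implementation the outbound CPU inspects
the other CPUs of the cluster, with the necessary memory ordering and
cache maintenance around each read:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CLUSTER_CPUS 4  /* other CPUs in the cluster, for this sketch */

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

enum teardown_verdict { TEARDOWN_WAIT, TEARDOWN_PROCEED, TEARDOWN_ABORT };

/*
 * Decide, at one polling point, whether the outbound CPU may proceed
 * with cluster teardown.  Teardown may proceed only once every other
 * CPU in the cluster is in CPU_DOWN; it should be abandoned as soon
 * as the inbound state, or any CPU's entry into CPU_COMING_UP or
 * CPU_UP, reveals a concurrent power-up.
 */
static enum teardown_verdict
outbound_poll(const enum cpu_state cpus[NR_CLUSTER_CPUS],
              enum inbound_state inbound)
{
    bool all_down = true;

    if (inbound == INBOUND_COMING_UP)
        return TEARDOWN_ABORT;

    for (int i = 0; i < NR_CLUSTER_CPUS; i++) {
        if (cpus[i] == CPU_COMING_UP || cpus[i] == CPU_UP)
            return TEARDOWN_ABORT;
        if (cpus[i] != CPU_DOWN)
            all_down = false;
    }

    return all_down ? TEARDOWN_PROCEED : TEARDOWN_WAIT;
}
```

A CPU still in CPU_GOING_DOWN simply keeps the outbound CPU waiting; it
is not grounds for aborting, since it is already committed to DOWN.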


    Next states:

    CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
        Conditions: cluster torn down and ready to power off
        Trigger events: (spontaneous)

    CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
        Conditions: none
        Trigger events:

            a) an explicit hardware power-up operation, resulting
               from a policy decision on another CPU;

            b) a hardware event, such as an interrupt.


CLUSTER_GOING_DOWN/INBOUND_COMING_UP:

    The cluster is (or was) being torn down, but another CPU has come
    online in the meantime and is trying to set up the cluster again.

    If the outbound CPU observes this state, it has two choices:

        a) back out of teardown, restoring the cluster to the
           CLUSTER_UP state;

        b) finish tearing the cluster down and put the cluster in the
           CLUSTER_DOWN state; the inbound CPU will set up the
           cluster again from there.

    Choice (a) saves some latency, by avoiding unnecessary teardown
    and setup operations in situations where the cluster is not
    really going to be powered down.


    Next states:

    CLUSTER_UP/INBOUND_COMING_UP (outbound)
        Conditions: cluster-level setup and hardware coherency
            complete
        Trigger events: (spontaneous)

    CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
        Conditions: cluster torn down and ready to power off
        Trigger events: (spontaneous)


Last man and first man selection
--------------------------------

The CPU which performs cluster tear-down operations on the outbound side
is commonly referred to as the "last man".

The CPU which performs cluster setup on the inbound side is commonly
referred to as the "first man".

The race avoidance algorithm documented above does not provide a
mechanism to choose which CPUs should play these roles.


Last man:

When shutting down the cluster, all the CPUs involved are initially
executing Linux and hence coherent.  Therefore, ordinary spinlocks can
be used to select a last man safely, before the CPUs become
non-coherent.
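
For example, a last man can be chosen by decrementing a count of CPUs
still up under ordinary locking; since coherency is still guaranteed at
this point, normal synchronisation primitives are safe.  The sketch
below uses a C11 atomic as a stand-in for a spinlock-protected counter;
it is illustrative only and not the kernel's actual selection logic:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdatomic.h>

/*
 * Each CPU leaving the cluster decrements the count of CPUs still up.
 * Whoever takes the count to zero is the last man and inherits the
 * cluster teardown work; everyone else only tears down its own state.
 */
static bool cpu_leave_is_last_man(atomic_int *cpus_up)
{
    /* fetch_sub returns the value before the decrement. */
    return atomic_fetch_sub(cpus_up, 1) == 1;
}
```

The decrement must happen while the departing CPU is still coherent;
after this point it proceeds to CPU_GOING_DOWN on its own.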


First man:

Because CPUs may power up asynchronously in response to external wake-up
events, a dynamic mechanism is needed to make sure that only one CPU
attempts to play the first man role and do the cluster-level
initialisation: any other CPUs must wait for this to complete before
proceeding.

Cluster-level initialisation may involve actions such as configuring
coherency controls in the bus fabric.

The current implementation in mcpm_head.S uses a separate mutual
exclusion mechanism to do this arbitration.  This mechanism is
documented in detail in vlocks.txt.
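
The shape of the arbitration is a simple election: the first CPU to
claim an ownership slot performs the initialisation, and the rest wait
for it to finish.  The sketch below uses a C11 compare-and-swap purely
to show that shape; the real mcpm_head.S cannot rely on coherent
atomics this early, which is exactly why it uses the vlock mechanism
described in vlocks.txt instead.  All names here are invented:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdatomic.h>

#define NO_FIRST_MAN (-1)

/*
 * Attempt to claim the first-man role.  Exactly one caller can move
 * the owner slot from NO_FIRST_MAN to its own CPU number; all others
 * see the claim fail and must wait for cluster setup to complete.
 */
static bool try_become_first_man(atomic_int *owner, int my_cpu)
{
    int expected = NO_FIRST_MAN;

    return atomic_compare_exchange_strong(owner, &expected, my_cpu);
}
```

Losers of the election would then poll the cluster state, waiting for
the winner to advance it to CLUSTER_UP before entering coherency.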


Features and Limitations
------------------------

Implementation:

    The current ARM-based implementation is split between
    arch/arm/common/mcpm_head.S (low-level inbound CPU operations) and
    arch/arm/common/mcpm_entry.c (everything else):

    __mcpm_cpu_going_down() signals the transition of a CPU to the
        CPU_GOING_DOWN state.

    __mcpm_cpu_down() signals the transition of a CPU to the CPU_DOWN
        state.

    A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
        low-level power-up code in mcpm_head.S.  This could involve
        CPU-specific setup code, but in the current implementation it
        does not.

    __mcpm_outbound_enter_critical() and __mcpm_outbound_leave_critical()
        handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN and
        from there to CLUSTER_DOWN or back to CLUSTER_UP (in the case
        of an aborted cluster power-down).

        These functions are more complex than the __mcpm_cpu_*()
        functions due to the extra inter-CPU coordination which is
        needed for safe transitions at the cluster level.

    A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via the
        low-level power-up code in mcpm_head.S.  This typically
        involves platform-specific setup code, provided by the
        platform-specific power_up_setup function registered via
        mcpm_sync_init.

Deep topologies:

    As currently described and implemented, the algorithm does not
    support CPU topologies involving more than two levels (i.e.,
    clusters of clusters are not supported).  The algorithm could be
    extended by replicating the cluster-level states for the
    additional topological levels, and modifying the transition rules
    for the intermediate (non-outermost) cluster levels.


Colophon
--------

Originally created and documented by Dave Martin for Linaro Limited, in
collaboration with Nicolas Pitre and Achin Gupta.

Copyright (C) 2012-2013 Linaro Limited
Distributed under the terms of Version 2 of the GNU General Public
License, as defined in linux/COPYING.