Blame - Documentation/x86/intel_rdt_ui.txt - kernel/msm-5.4

User Interface for Resource Allocation in Intel Resource Director Technology

Copyright (C) 2016 Intel Corporation

Fenghua Yu <fenghua.yu@intel.com>
Tony Luck <tony.luck@intel.com>
Vikas Shivappa <vikas.shivappa@intel.com>

This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
X86 /proc/cpuinfo flag bits "rdt", "cqm", "cat_l3" and "cdp_l3".

To use the feature mount the file system:

# mount -t resctrl resctrl [-o cdp] /sys/fs/resctrl

mount options are:

"cdp": Enable code/data prioritization in L3 cache allocations.

RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control.

The mount succeeds if either allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.

Info directory
--------------

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.

Each cache resource (L3/L2) subdirectory contains the following files
related to allocation:

"num_closids": The number of CLOSIDs which are valid for this
resource. The kernel uses the smallest number of
CLOSIDs of all enabled resources as the limit.

"cbm_mask": The bitmask which is valid for this resource.
This mask is equivalent to 100%.

"min_cbm_bits": The minimum number of consecutive bits which
must be set when writing a mask.

The memory bandwidth (MB) subdirectory contains the following files
related to allocation:

"min_bandwidth": The minimum memory bandwidth percentage which
a user can request.

"bandwidth_gran": The granularity in which the memory bandwidth
percentage is allocated. The allocated
b/w percentage is rounded off to the next
control step available on the hardware. The
available bandwidth control steps are:
min_bandwidth + N * bandwidth_gran.

"delay_linear": Indicates if the delay scale is linear or
non-linear. This field is purely informational.

If RDT monitoring is available there will be an "L3_MON" directory
with the following files:

"num_rmids": The number of RMIDs available. This is the
upper bound for how many "CTRL_MON" + "MON"
groups can be created.

"mon_features": Lists the monitoring events if
monitoring is enabled for the resource.

"max_threshold_occupancy":
Read/write file that provides the largest value (in
bytes) at which a previously used LLC_occupancy
counter can be considered for re-use.

Resource alloc and monitor groups
---------------------------------

Resource groups are represented as directories in the resctrl file
system. The default group is the root directory which, immediately
after mounting, owns all the tasks and cpus in the system and can make
full use of all resources.

On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.

On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.

Removing a directory will move all tasks and cpus owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.

All groups contain the following files:

"tasks":
Reading this file shows the list of all tasks that belong to
this group. Writing a task id to the file will add a task to the
group. If the group is a CTRL_MON group the task is removed from
whichever previous CTRL_MON group owned the task and also from
any MON group that owned the task. If the group is a MON group,
then the task must already belong to the CTRL_MON parent of this
group. The task is removed from any previous MON group.

"cpus":
Reading this file shows a bitmask of the logical CPUs owned by
this group. Writing a mask to this file will add and remove
CPUs to/from this group. As with the tasks file a hierarchy is
maintained where MON groups may only include CPUs owned by the
parent CTRL_MON group.

"cpus_list":
Just like "cpus", only using ranges of CPUs instead of bitmasks.

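The relationship between the "cpus" bitmask and the "cpus_list" ranges
can be sketched in shell. This is only an illustration: cpumask_to_list
is a hypothetical helper, not part of resctrl, and it assumes a single
hexadecimal word of at most 63 bits without the comma separators that
larger systems may print.

```shell
# Hypothetical helper: convert a hex CPU bitmask (as read from "cpus")
# into the range notation used by "cpus_list".
# Assumption: the mask fits in one word of at most 63 bits.
cpumask_to_list() {
    mask=$((0x$1))
    cpu=0
    start=-1        # first CPU of the run being collected, -1 if none
    out=""
    while [ "$mask" -ne 0 ] || [ "$start" -ge 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            # Bit set: open a new run unless one is already open.
            if [ "$start" -lt 0 ]; then
                start=$cpu
            fi
        elif [ "$start" -ge 0 ]; then
            # Bit clear: close the open run and append it to the output.
            end=$((cpu - 1))
            if [ "$start" -eq "$end" ]; then
                range=$start
            else
                range="$start-$end"
            fi
            out="${out:+$out,}$range"
            start=-1
        fi
        mask=$((mask >> 1))
        cpu=$((cpu + 1))
    done
    echo "$out"
}
```

For instance, "cpumask_to_list f0" prints "4-7", which is the set of
CPUs selected by writing f0 to a "cpus" file.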
When control is enabled all CTRL_MON groups will also contain:

"schemata":
A list of all the resources available to this group.
Each resource has its own line and format - see below for details.

When monitoring is enabled all groups (CTRL_MON and MON) will also
contain:

"mon_data":
This contains a set of files organized by L3 domain and by
RDT event. E.g. on a system with two L3 domains there will
be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
directories has one file per event (e.g. "llc_occupancy",
"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
files provide a readout of the current value of the event for
all tasks in the group. In CTRL_MON groups these files provide
the sum for all tasks in the CTRL_MON group and all tasks in
its MON groups. Please see the example section for more details
on usage.

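As a rough sketch of how the "mon_data" layout can be consumed, the
helper below adds up "llc_occupancy" over all L3 domains of one group.
The function name group_llc_occupancy is made up for this illustration.
It assumes that each event file holds a single decimal number, as in
the examples later in this document.

```shell
# Hypothetical helper: sum llc_occupancy over every L3 domain of a group.
# $1 is the group directory, e.g. /sys/fs/resctrl/p1
group_llc_occupancy() {
    total=0
    for f in "$1"/mon_data/mon_L3_*/llc_occupancy; do
        # Skip the unexpanded pattern when no domain directory exists.
        [ -r "$f" ] || continue
        total=$((total + $(cat "$f")))
    done
    echo "$total"
}
```

Called on a MON group this reports the tasks of that group only. Called
on a CTRL_MON group it reports the aggregate described above.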
Resource allocation rules
-------------------------
When a task is running the following rules define which resources are
available to it:

1) If the task is a member of a non-default group, then the schemata
for that group is used.

2) Else if the task belongs to the default group, but is running on a
CPU that is assigned to some specific group, then the schemata for the
CPU's group is used.

3) Otherwise the schemata for the default group is used.

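The precedence of the three rules above can be written down as a small
decision function. This is purely illustrative: the kernel applies these
rules internally, and both the function name effective_group and the use
of the literal string "default" for the root group are assumptions made
for this sketch.

```shell
# Hypothetical model of the allocation rules.
# $1: group the task belongs to ("default" for the root group)
# $2: group the current CPU is assigned to ("default" for the root group)
effective_group() {
    task_group=$1
    cpu_group=$2
    if [ "$task_group" != "default" ]; then
        # Rule 1: membership of a non-default group always wins.
        echo "$task_group"
    elif [ "$cpu_group" != "default" ]; then
        # Rule 2: a default-group task inherits the group of its CPU.
        echo "$cpu_group"
    else
        # Rule 3: fall back to the default group.
        echo "default"
    fi
}
```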
Resource monitoring rules
-------------------------
1) If a task is a member of a MON group, or non-default CTRL_MON group
then RDT events for the task will be reported in that group.

2) If a task is a member of the default CTRL_MON group, but is running
on a CPU that is assigned to some specific group, then the RDT events
for the task will be reported in that group.

3) Otherwise RDT events for the task will be reported in the root level
"mon_data" group.

Notes on cache occupancy monitoring and control
-----------------------------------------------
When moving a task from one group to another you should remember that
this only affects new cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
the task to a new group and immediately check the occupancy of the old
and new groups you will likely see that the old group is still showing
3 MB and the new group zero. When the task accesses locations still in
cache from before the move, the h/w does not update any counters. On a
busy system you will likely see the occupancy in the old group go down
as cache lines are evicted and re-used while the occupancy in the new
group rises as the task accesses memory and loads into the cache are
counted based on membership in the new group.

The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.

Hardware uses a CLOSID (Class of service ID) and an RMID (Resource
monitoring ID) to identify a control group and a monitoring group
respectively. Each resource group is mapped to these IDs based on the
kind of group. The number of CLOSIDs and RMIDs is limited by the
hardware, so the creation of a "CTRL_MON" directory may fail if we run
out of either CLOSIDs or RMIDs, and the creation of a "MON" group may
fail if we run out of RMIDs.

max_threshold_occupancy - generic concepts
------------------------------------------

Note that an RMID once freed may not be immediately available for use as
the RMID still tags the cache lines of its previous user. Hence such
RMIDs are placed on a limbo list and checked again once the cache
occupancy has gone down. If at some point the system has many limbo
RMIDs that are not yet ready to be used, the user may see -EBUSY
during mkdir.

max_threshold_occupancy is a user configurable value that determines the
occupancy at which an RMID can be freed.

Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, or multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps). To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id

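One possible way to walk the sysfs files mentioned above is sketched
below. The function name list_cache_ids is invented for this example,
and the optional base directory argument exists only so the loop can be
pointed at a copy of the tree. The sketch relies on the "level" file
that sits next to "id" in each cache index directory.

```shell
# Hypothetical helper: print "<cpu> L<level> id=<cache id>" for every
# cache index of every logical CPU.
# $1 (optional): base directory, defaults to /sys/devices/system/cpu
list_cache_ids() {
    base=${1:-/sys/devices/system/cpu}
    for idfile in "$base"/cpu[0-9]*/cache/index[0-9]*/id; do
        # Older kernels or some caches may not expose an id file.
        [ -r "$idfile" ] || continue
        dir=${idfile%/id}           # .../cpuN/cache/indexM
        cpu=${dir%/cache/*}         # .../cpuN
        cpu=${cpu##*/}              # cpuN
        level=$(cat "$dir/level")
        printf '%s L%s id=%s\n' "$cpu" "$level" "$(cat "$idfile")"
    done
}
```

CPUs that print the same id at the same level share that cache, and
that id is the value used as <cache_id> in the schemata file.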
Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each cpu model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". X86 hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.

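The contiguity rule and the four-way split above can be checked with a
little shell arithmetic. Both helpers below are hypothetical and exist
only for this illustration. cbm_slice assumes the number of mask bits
divides evenly by the number of parts.

```shell
# Hypothetical helper: succeed if all '1' bits of the mask are contiguous.
# $1: mask, e.g. 0x6
cbm_is_contiguous() {
    m=$(($1))
    [ "$m" -ne 0 ] || return 1
    # Drop trailing zero bits so the run of ones starts at bit 0.
    while [ $((m & 1)) -eq 0 ]; do
        m=$((m >> 1))
    done
    # A solid run of ones plus one is a power of two, so the AND is zero.
    [ $((m & (m + 1))) -eq 0 ]
}

# Hypothetical helper: print the mask (hex, no 0x prefix) of one slice
# when a cache mask is split into equal parts.
# $1: total mask bits, $2: number of parts, $3: slice index from 0
cbm_slice() {
    width=$(($1 / $2))
    printf '%x\n' $(( ((1 << width) - 1) << (width * $3) ))
}
```

With these, "cbm_slice 20 4 0" through "cbm_slice 20 4 3" print 1f,
3e0, 7c00 and f8000, while "cbm_is_contiguous 0x5" fails.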
Memory bandwidth (b/w) percentage
---------------------------------
For the memory b/w resource, the user controls the resource by indicating
the percentage of total memory b/w.

The minimum bandwidth percentage value for each cpu model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the cpu model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.

The bandwidth throttling is a core specific mechanism on some Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth.

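The control step rule (min_bw + N * bw_gran, rounded to the next step)
can be worked through with the sketch below. The helper mb_next_step is
hypothetical. It reads "next" as rounding up and caps the result at
100%, both of which are assumptions about the hardware behavior rather
than something this document specifies.

```shell
# Hypothetical helper: round a requested bandwidth percentage up to the
# next control step min_bw + N * bw_gran.
# $1: min_bandwidth, $2: bandwidth_gran, $3: requested percentage
mb_next_step() {
    min=$1
    gran=$2
    req=$3
    if [ "$req" -le "$min" ]; then
        # Requests at or below the minimum get the minimum.
        echo "$min"
        return 0
    fi
    # Smallest N such that min + N * gran >= req (integer ceiling).
    n=$(( (req - min + gran - 1) / gran ))
    step=$(( min + n * gran ))
    # Assumption: a percentage can never exceed 100.
    if [ "$step" -gt 100 ]; then
        step=100
    fi
    echo "$step"
}
```

With min_bandwidth 10 and bandwidth_gran 10, a request of 35 becomes 40
and a request of 20 stays at 20.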
L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
With CDP disabled the L3 schemata format is:

L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 schemata file details (CDP enabled via mount option to resctrl)
------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:

L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 schemata file details
------------------------
L2 cache does not support code and data prioritization, so the
schemata format is always:

L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

Memory b/w Allocation details
-----------------------------

The memory b/w domain is the L3 cache.

MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change. E.g.

# cat schemata
L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
# echo "L3DATA:2=3c0;" > schemata
# cat schemata
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff

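A script that needs a single value from this format could pick it out
along the following lines. The helper schemata_get is hypothetical and
reads schemata text on standard input. It strips leading blanks from the
resource name as a precaution, which is an assumption and not something
this document requires.

```shell
# Hypothetical helper: print the value set for one resource on one
# domain, reading schemata text from standard input.
# $1: resource name (e.g. L3DATA), $2: cache id
schemata_get() {
    awk -F: -v res="$1" -v id="$2" '
        # Tolerate leading blanks in front of the resource name.
        { gsub(/^[ \t]+/, "", $1) }
        $1 == res {
            # Split "0=fffff;1=fffff;..." into per-domain entries.
            n = split($2, doms, ";")
            for (i = 1; i <= n; i++) {
                split(doms[i], kv, "=")
                if (kv[1] == id)
                    print kv[2]
            }
        }'
}
```

For example, "schemata_get L3DATA 2 < schemata" run against the second
listing above prints 3c0.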
Examples for RDT allocation usage:

Example 1
---------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
# echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket 0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

Example 2
---------
Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks:

# echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.

# mkdir p0
# echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.

# echo 1234 > p0/tasks
# taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache):

# mkdir p1
# echo "L3:0=7c00;1=fffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2 5678

For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like this (assuming min_bandwidth is 10 and
bandwidth_gran is 10):

For our first real time task this would request 20% memory b/w on socket
0.

# echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.

# echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata

Example 3
---------

A single socket system which has real-time tasks running on cores 4-7 and
non real-time workload assigned to cores 0-3. The real-time tasks share
text and data, so a per task association is not required and due to
interaction with the kernel it's desired that the kernel on these cores
shares L3 with the tasks.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks:

# echo -e "L3:0=3ff\nMB:0=50" > schemata

Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.

# mkdir p0
# echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata

Finally we move cores 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.

# echo F0 > p0/cpus

Locking between applications
----------------------------

Certain operations on the resctrl filesystem, composed of reads/writes
to/from multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

1. Read the CBMs from the schemata file of each directory
2. Find a contiguous set of bits in the global CBM bitmask that is clear
in all of the directory CBMs
3. Create a new directory
4. Write the bits found in step 2 to the new directory's "schemata" file

If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.

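Step 2 of the list above could look roughly like the sketch below.
find_free_cbm is a hypothetical helper that handles a single cache
domain and takes masks with a 0x prefix. It only does the bit search,
so it is still subject to the race described above unless the whole
sequence is serialized against other applications.

```shell
# Hypothetical helper: find the lowest contiguous run of free bits.
# $1: full mask from info/<resource>/cbm_mask, e.g. 0xfffff
# $2: number of contiguous bits wanted
# remaining arguments: masks already used by existing directories
# Prints the mask (hex, no 0x prefix), or returns 1 if nothing fits.
find_free_cbm() {
    full=$(($1))
    width=$2
    shift 2
    used=0
    for m in "$@"; do
        used=$((used | $m))
    done
    cand=$(( (1 << width) - 1 ))
    while [ "$cand" -le "$full" ]; do
        # The candidate must lie inside the full mask and overlap no one.
        if [ $((cand & full)) -eq "$cand" ] && [ $((cand & used)) -eq 0 ]; then
            printf '%x\n' "$cand"
            return 0
        fi
        cand=$((cand << 1))
    done
    return 1
}
```

With a 20-bit mask where 0x3ff and 0xf8000 are already taken,
"find_free_cbm 0xfffff 5 0x3ff 0xf8000" prints 7c00.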
To coordinate atomic operations on the resctrl filesystem and to avoid
the problem above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
command.

Write lock:

A) Take flock(LOCK_EX) on /sys/fs/resctrl
B) Read/write the directory structure.
C) Release the lock with flock(LOCK_UN)

Read lock:

A) Take flock(LOCK_SH) on /sys/fs/resctrl
B) If successful, read the directory structure.
C) Release the lock with flock(LOCK_UN)

Example with bash:

# Atomically read directory structure
$ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

# Read directory contents and create new subdirectory

$ cat create-dir.sh
find /sys/fs/resctrl/ > output.txt
# function-of is a placeholder for the logic that picks a free mask
mask=$(function-of output.txt)
mkdir /sys/fs/resctrl/newres/
echo "$mask" > /sys/fs/resctrl/newres/schemata

$ flock /sys/fs/resctrl/ ./create-dir.sh

Example with C:

/*
 * Example code to take advisory locks
 * before accessing resctrl filesystem
 */
#include <sys/file.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

void resctrl_take_shared_lock(int fd)
{
	int ret;

	/* take shared lock on resctrl filesystem */
	ret = flock(fd, LOCK_SH);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_take_exclusive_lock(int fd)
{
	int ret;

	/* take exclusive lock on resctrl filesystem */
	ret = flock(fd, LOCK_EX);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_release_lock(int fd)
{
	int ret;

	/* release lock on resctrl filesystem */
	ret = flock(fd, LOCK_UN);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

int main(void)
{
	int fd;

	fd = open("/sys/fs/resctrl", O_RDONLY | O_DIRECTORY);
	if (fd == -1) {
		perror("open");
		exit(-1);
	}
	resctrl_take_shared_lock(fd);
	/* code to read directory contents */
	resctrl_release_lock(fd);

	resctrl_take_exclusive_lock(fd);
	/* code to read and write directory contents */
	resctrl_release_lock(fd);

	return 0;
}

Examples for RDT Monitoring along with allocation usage:

Reading monitored data
----------------------
Reading an event file (e.g. mon_data/mon_L3_00/llc_occupancy) shows
the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.


Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
------------------------------------------------------------------------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
# echo 5678 > p1/tasks
# echo 5679 > p1/tasks

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Create monitor groups and assign a subset of tasks to each monitor group.

# cd /sys/fs/resctrl/p1/mon_groups
# mkdir m11 m12
# echo 5678 > m11/tasks
# echo 5679 > m12/tasks

Fetch data (data shown in bytes)

# cat m11/mon_data/mon_L3_00/llc_occupancy
16234000
# cat m11/mon_data/mon_L3_01/llc_occupancy
14789000
# cat m12/mon_data/mon_L3_00/llc_occupancy
16789000

The parent CTRL_MON group shows the aggregated data.

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
31234000

Example 2 (Monitor a task from its creation)
--------------------------------------------
On a two socket machine (one L3 cache per socket)

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1

An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.

# echo $$ > /sys/fs/resctrl/p1/tasks
# <cmd>

Fetch the data

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
31789000

Example 3 (Monitor without CAT support or before creating CAT groups)
---------------------------------------------------------------------

Assume a system like HSW has only CQM and no CAT support. In this case
resctrl will still mount but CTRL_MON directories cannot be created.
The user can still create different MON groups within the root group and
thereby monitor all tasks, including kernel threads.

This can also be used to profile a job's cache footprint before
assigning it to an allocation group.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir mon_groups/m01
# mkdir mon_groups/m02

# echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
# echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks

Monitor the groups separately and also get per domain data. From the
output below it is apparent that the tasks are mostly doing work on
domain (socket) 0.

# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
34555
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
32789

Example 4 (Monitor real time tasks)
-----------------------------------

A single socket system which has real time tasks running on cores 4-7
and non real time tasks on other cpus. We want to monitor the cache
occupancy of the real time threads on these cores.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p1

Move the cpus 4-7 over to p1

# echo f0 > p1/cpus

View the llc occupancy snapshot

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
11234000