User Interface for Resource Allocation in Intel Resource Director Technology

Copyright (C) 2016 Intel Corporation

Fenghua Yu <fenghua.yu@intel.com>
Tony Luck <tony.luck@intel.com>
Vikas Shivappa <vikas.shivappa@intel.com>

This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
X86 /proc/cpuinfo flag bits:
RDT (Resource Director Technology) Allocation - "rdt_a"
CAT (Cache Allocation Technology) - "cat_l3", "cat_l2"
CDP (Code and Data Prioritization) - "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring) - "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring) - "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation) - "mba"
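
As a quick check of which of these flags a machine reports, something
like the following sketch may help. The helper name rdt_flags is an
assumption made for illustration; it only scans the first "flags" line
of a cpuinfo-style file and prints the RDT flags it finds.

```shell
# Print the RDT-related flags found in a cpuinfo-style file.
# rdt_flags is a hypothetical helper name; the file defaults to /proc/cpuinfo.
rdt_flags() (
    cpuinfo=${1:-/proc/cpuinfo}
    # Use the first "flags" line and strip everything up to the colon.
    flags=$(sed -n '/^flags/{s/^[^:]*://p;q;}' "$cpuinfo")
    for want in rdt_a cat_l3 cat_l2 cdp_l3 cdp_l2 cqm_llc cqm_occup_llc \
                cqm_mbm_total cqm_mbm_local mba; do
        # $flags is deliberately unquoted so it splits into words.
        for have in $flags; do
            if [ "$have" = "$want" ]; then
                echo "$want"
                break
            fi
        done
    done
)
```

Running "rdt_flags" with no argument reads /proc/cpuinfo; an empty
result suggests that none of the flags listed above is reported.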

To use the feature mount the file system:

 # mount -t resctrl resctrl [-o cdp[,cdpl2]] /sys/fs/resctrl

mount options are:

"cdp": Enable code/data prioritization in L3 cache allocations.
"cdpl2": Enable code/data prioritization in L2 cache allocations.

L2 and L3 CDP are controlled separately.

RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control.

The mount succeeds if either allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.

Info directory
--------------

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.

Each subdirectory contains the following files with respect to
allocation:

Cache resource (L3/L2) subdirectory contains the following files
related to allocation:

"num_closids":      The number of CLOSIDs which are valid for this
                    resource. The kernel uses the smallest number of
                    CLOSIDs of all enabled resources as limit.

"cbm_mask":         The bitmask which is valid for this resource.
                    This mask is equivalent to 100%.

"min_cbm_bits":     The minimum number of consecutive bits which
                    must be set when writing a mask.

"shareable_bits":   Bitmask of the part of the resource that is
                    shareable with other executing entities (e.g.
                    I/O). User can use this when setting up
                    exclusive cache partitions. Note that some
                    platforms support devices that have their own
                    settings for cache use which can override
                    these bits.

Memory bandwidth (MB) subdirectory contains the following files
with respect to allocation:

"min_bandwidth":    The minimum memory bandwidth percentage which
                    user can request.

"bandwidth_gran":   The granularity in which the memory bandwidth
                    percentage is allocated. The allocated
                    b/w percentage is rounded off to the next
                    control step available on the hardware. The
                    available bandwidth control steps are:
                    min_bandwidth + N * bandwidth_gran.

"delay_linear":     Indicates if the delay scale is linear or
                    non-linear. This field is purely informational.

If RDT monitoring is available there will be an "L3_MON" directory
with the following files:

"num_rmids":        The number of RMIDs available. This is the
                    upper bound for how many "CTRL_MON" + "MON"
                    groups can be created.

"mon_features":     Lists the monitoring events if
                    monitoring is enabled for the resource.

"max_threshold_occupancy":
                    Read/write file that provides the largest value
                    (in bytes) at which a previously used
                    LLC_occupancy counter can be considered for
                    re-use.

Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information than can be
conveyed in the error returns from file operations. E.g.

        # echo L3:0=f7 > schemata
        bash: echo: write error: Invalid argument
        # cat info/last_cmd_status
        mask f7 has non-consecutive 1-bits
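
A script can fold this check into its writes. The sketch below is one
possible wrapper, not part of the interface: resctrl_write is a
hypothetical helper name, and the mount point is passed as a parameter
(defaulting to /sys/fs/resctrl) so the function can be tried against
an ordinary directory.

```shell
# Write VALUE to FILE (relative to the resctrl mount point) and, if the
# write fails, report the text found in info/last_cmd_status.
# resctrl_write is a hypothetical helper; ROOT defaults to /sys/fs/resctrl.
resctrl_write() (
    file=$1
    value=$2
    root=${3:-/sys/fs/resctrl}
    if echo "$value" > "$root/$file"; then
        return 0
    fi
    # The errno alone says little; last_cmd_status carries the detail.
    echo "write to $file failed: $(cat "$root/info/last_cmd_status")" >&2
    return 1
)
```

For instance "resctrl_write p0/schemata L3:0=f7" would be expected to
fail and print the "non-consecutive 1-bits" message shown above.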

Resource alloc and monitor groups
---------------------------------

Resource groups are represented as directories in the resctrl file
system. The default group is the root directory which, immediately
after mounting, owns all the tasks and cpus in the system and can make
full use of all resources.

On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.

On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.

Removing a directory will move all tasks and cpus owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.

All groups contain the following files:

"tasks":
        Reading this file shows the list of all tasks that belong to
        this group. Writing a task id to the file will add a task to the
        group. If the group is a CTRL_MON group the task is removed from
        whichever previous CTRL_MON group owned the task and also from
        any MON group that owned the task. If the group is a MON group,
        then the task must already belong to the CTRL_MON parent of this
        group. The task is removed from any previous MON group.


"cpus":
        Reading this file shows a bitmask of the logical CPUs owned by
        this group. Writing a mask to this file will add and remove
        CPUs to/from this group. As with the tasks file a hierarchy is
        maintained where MON groups may only include CPUs owned by the
        parent CTRL_MON group.


"cpus_list":
        Just like "cpus", only using ranges of CPUs instead of bitmasks.
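
To illustrate how the two forms relate, the sketch below converts a
CPU list such as "4-7" into the corresponding hexadecimal bitmask.
The name cpulist_to_mask is made up for this example, and the helper
only handles CPU numbers 0-31; it is not a description of how the
kernel parses either file.

```shell
# Convert a CPU list such as "0-3,6" into a hex bitmask.
# Sketch only: cpulist_to_mask is a hypothetical name and the function
# assumes CPU numbers in the range 0-31.
cpulist_to_mask() (
    mask=0
    # Split the list on commas, then expand each "lo-hi" range.
    for part in $(echo "$1" | tr ',' ' '); do
        lo=${part%-*}      # text before the dash (or the whole item)
        hi=${part#*-}      # text after the dash (or the whole item)
        cpu=$lo
        while [ "$cpu" -le "$hi" ]; do
            mask=$(( mask | (1 << cpu) ))
            cpu=$(( cpu + 1 ))
        done
    done
    printf '%x\n' "$mask"
)
```

With this helper "cpulist_to_mask 4-7" prints "f0", so writing "4-7"
to "cpus_list" and "f0" to "cpus" describe the same set of CPUs.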


When control is enabled all CTRL_MON groups will also contain:

"schemata":
        A list of all the resources available to this group.
        Each resource has its own line and format - see below for details.

When monitoring is enabled all MON groups will also contain:

"mon_data":
        This contains a set of files organized by L3 domain and by
        RDT event. E.g. on a system with two L3 domains there will
        be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
        directories has one file per event (e.g. "llc_occupancy",
        "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
        files provide a readout of the current value of the event for
        all tasks in the group. In CTRL_MON groups these files provide
        the sum for all tasks in the CTRL_MON group and all tasks in
        MON groups. Please see example section for more details on usage.
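
Because the files are split by L3 domain, a per-group total has to be
added up by the reader. The following sketch does that for one event.
The helper name mon_event_total is an assumption, and so is the choice
to skip any file whose content is not a plain number.

```shell
# Print the sum of one event over all L3 domains of a group directory.
# mon_event_total is a hypothetical helper name.
mon_event_total() (
    group_dir=$1
    event=${2:-llc_occupancy}
    total=0
    for f in "$group_dir"/mon_data/mon_L3_*/"$event"; do
        [ -r "$f" ] || continue
        val=$(cat "$f")
        # Assumption: skip anything that is not a plain decimal number.
        case "$val" in
            ''|*[!0-9]*) continue ;;
        esac
        total=$(( total + val ))
    done
    echo "$total"
)
```

For example "mon_event_total /sys/fs/resctrl/p1" would add up
"llc_occupancy" over every "mon_L3_*" directory of group "p1".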

Resource allocation rules
-------------------------
When a task is running the following rules define which resources are
available to it:

1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for the
   CPU's group is used.

3) Otherwise the schemata for the default group is used.
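
The three rules above can be read as a small decision function. The
sketch below is only a model of the rules as written here, not kernel
code; the name effective_group and the use of the word "default" for
the root group are choices made for this example.

```shell
# effective_group TASK_GROUP CPU_GROUP
# Model of the allocation rules: prints the group whose schemata applies.
# "default" stands for the root group; all names are illustrative.
effective_group() (
    task_group=$1    # group that lists the task in its "tasks" file
    cpu_group=$2     # group that owns the CPU the task is running on
    if [ "$task_group" != "default" ]; then
        echo "$task_group"      # rule 1: task membership wins
    elif [ "$cpu_group" != "default" ]; then
        echo "$cpu_group"       # rule 2: fall back to the CPU's group
    else
        echo "default"          # rule 3: default group
    fi
)
```

So a task left in the default group but running on a CPU owned by
"p0" is governed by the schemata of "p0", while a task explicitly
placed in "p1" keeps the schemata of "p1" wherever it runs.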

Resource monitoring rules
-------------------------
1) If a task is a member of a MON group, or non-default CTRL_MON group
   then RDT events for the task will be reported in that group.

2) If a task is a member of the default CTRL_MON group, but is running
   on a CPU that is assigned to some specific group, then the RDT events
   for the task will be reported in that group.

3) Otherwise RDT events for the task will be reported in the root level
   "mon_data" group.


Notes on cache occupancy monitoring and control
-----------------------------------------------
When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
the task to a new group and immediately check the occupancy of the old
and new groups you will likely see that the old group is still showing
3 MB and the new group zero. When the task accesses locations still in
cache from before the move, the h/w does not update any counters. On a
busy system you will likely see the occupancy in the old group go down
as cache lines are evicted and re-used while the occupancy in the new
group rises as the task accesses memory and loads into the cache are
counted based on membership in the new group.

The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.

Hardware uses a CLOSID (Class of service ID) and an RMID (Resource
monitoring ID) to identify a control group and a monitoring group
respectively. Each resource group is mapped to these IDs based on the
kind of group. The number of CLOSIDs and RMIDs is limited by the
hardware, hence the creation of a "CTRL_MON" directory may fail if we
run out of either CLOSIDs or RMIDs, and creation of a "MON" group may
fail if we run out of RMIDs.

max_threshold_occupancy - generic concepts
------------------------------------------

Note that an RMID once freed may not be immediately available for use as
the RMID is still tagged to the cache lines of its previous user.
Hence such RMIDs are placed on a limbo list and checked again once the
cache occupancy has gone down. If at some point the system has many
limbo RMIDs that are not yet ready to be used, the user may see -EBUSY
during mkdir.

max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.

Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, and multiple cores could share an L2 cache. So
instead of using "socket" or "core" to define the set of logical cpus
sharing a resource we use a "Cache ID". At a given cache level this will
be a unique number across the whole system (but it isn't guaranteed to
be a contiguous sequence, there may be gaps). To find the ID for each
logical CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
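
A loop over those files gives a quick overview. This is a minimal
sketch; the function name cache_ids and its one-line-per-file output
format are choices made for the example, and the sysfs root is a
parameter so the loop can be tried on a copy of the tree.

```shell
# Print "cpuN indexM id=ID" for every cache id file under a sysfs CPU tree.
# cache_ids is a hypothetical helper; the root defaults to the real sysfs path.
cache_ids() (
    cpu_root=${1:-/sys/devices/system/cpu}
    for id_file in "$cpu_root"/cpu[0-9]*/cache/index*/id; do
        [ -r "$id_file" ] || continue
        index_dir=${id_file%/id}            # .../cpuN/cache/indexM
        cpu_dir=${index_dir%/cache/*}       # .../cpuN
        printf '%s %s id=%s\n' "${cpu_dir##*/}" "${index_dir##*/}" \
            "$(cat "$id_file")"
    done
)
```

CPUs that print the same id for the same index directory share that
cache and therefore belong to the same domain in the schemata file.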

Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each cpu model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". X86 hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
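
Both the contiguity rule and the equal partitioning can be checked
with shell arithmetic. The two helpers below are a sketch with made-up
names (cbm_is_contiguous, cbm_partition); masks are given in hex
without the "0x" prefix, as they appear in the schemata file.

```shell
# Succeed (exit status 0) if the hex mask is non-zero and all of its
# 1-bits form one contiguous block. Hypothetical helper name.
cbm_is_contiguous() (
    m=$(( 0x$1 ))
    # Filling the zeros below the lowest 1-bit and adding one must carry
    # past the highest 1-bit, leaving no bit in common with the mask.
    [ "$m" -ne 0 ] && [ $(( ((m | (m - 1)) + 1) & m )) -eq 0 ]
)

# Print PARTS equal, non-overlapping masks for a BITS-wide CBM.
# Assumes BITS is divisible by PARTS. Hypothetical helper name.
cbm_partition() (
    bits=$1
    parts=$2
    width=$(( bits / parts ))
    i=0
    while [ "$i" -lt "$parts" ]; do
        printf '%x\n' $(( ((1 << width) - 1) << (i * width) ))
        i=$(( i + 1 ))
    done
)
```

Here "cbm_partition 20 4" prints 1f, 3e0, 7c00 and f8000, the four
masks given above, while "cbm_is_contiguous 5" fails because 0x5 has
a gap between its two bits.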

Memory bandwidth (b/w) percentage
---------------------------------
For the memory b/w resource, the user controls the resource by
indicating the percentage of total memory b/w.

The minimum bandwidth percentage value for each cpu model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the cpu model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.
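
As a worked example of the step formula, the sketch below lists the
control steps for given min_bw and bw_gran values and picks the step
for a requested percentage. The helper names are made up, the upper
limit of 100 is assumed because the values are percentages, and
"next" is read here as "next higher step", which is an interpretation
rather than something this document states.

```shell
# List the control steps min_bw + N * bw_gran up to 100 percent.
# mb_steps is a hypothetical helper name.
mb_steps() (
    min_bw=$1
    gran=$2
    bw=$min_bw
    while [ "$bw" -le 100 ]; do
        echo "$bw"
        bw=$(( bw + gran ))
    done
)

# Print the first control step that is not below REQUEST.
# Assumption: "rounded to the next control step" means rounding up.
mb_next_step() (
    request=$1
    min_bw=$2
    gran=$3
    step=$min_bw
    while [ "$step" -lt "$request" ]; do
        step=$(( step + gran ))
    done
    echo "$step"
)
```

With min_bw and bw_gran both 10, "mb_steps 10 10" lists 10, 20, and so
on up to 100, and under the rounding assumption a request of 35 maps
to the 40% step.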

The bandwidth throttling is a core specific mechanism on some Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth.

L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
With CDP disabled the L3 schemata format is:

        L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 schemata file details (CDP enabled via mount option to resctrl)
------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:

        L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
        L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 schemata file details
------------------------
Without the "cdpl2" mount option the L2 schemata format is:

        L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

When L2 CDP is enabled with the "cdpl2" mount option, L2 control is
split into L2DATA and L2CODE lines in the same way as for L3.

Memory b/w Allocation details
-----------------------------

Memory b/w domain is L3 cache.

        MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change. E.g.

# cat schemata
L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
# echo "L3DATA:2=3c0;" > schemata
# cat schemata
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
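
Scripts often need a single value back out of that output. The sketch
below reads schemata text on standard input and prints the value for
one resource and one cache ID. The name schemata_get is hypothetical,
and tolerating leading blanks before the resource name is a defensive
assumption rather than a documented part of the format.

```shell
# schemata_get RESOURCE CACHE_ID < schemata
# Print the value assigned to CACHE_ID on the RESOURCE line.
# schemata_get is a hypothetical helper name.
schemata_get() (
    resource=$1
    cache_id=$2
    # Keep only the requested line, drop the "NAME:" prefix, then split
    # the remaining "id=value" pairs on ';' and pick the matching id.
    sed -n "s/^[[:space:]]*$resource://p" | tr ';' '\n' |
    while IFS='=' read -r id value; do
        if [ "$id" = "$cache_id" ]; then
            echo "$value"
        fi
    done
)
```

Fed the final output above, "schemata_get L3DATA 2 < schemata" prints
"3c0" and "schemata_get L3CODE 2 < schemata" prints "fffff".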

Examples for RDT allocation usage:

Example 1
---------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
# echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket 0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

Example 2
---------
Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks, pid=1234 running on processor 0 and pid=5678 running
on processor 1, run on socket 0 of a 2-socket, dual core machine. To avoid
noisy neighbors, each of the two real-time tasks exclusively occupies one
quarter of L3 cache on socket 0.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks:

# echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.

# mkdir p0
# echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.

# echo 1234 > p0/tasks
# taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache):

# mkdir p1
# echo "L3:0=7c00;1=fffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2 5678

For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like this (assume min_bandwidth is 10 and
bandwidth_gran is 10):

For our first real time task this would request 20% memory b/w on socket
0.

# echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.

# echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata

Example 3
---------

A single socket system which has real-time tasks running on cores 4-7 and
non real-time workload assigned to cores 0-3. The real-time tasks share
text and data, so a per task association is not required and due to
interaction with the kernel it's desired that the kernel on these cores
shares L3 with the tasks.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks:

# echo -e "L3:0=3ff\nMB:0=50" > schemata

Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.

# mkdir p0
# echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata

Finally we move cores 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.

# echo F0 > p0/cpus

Locking between applications
----------------------------

Certain operations on the resctrl filesystem, composed of read/writes
to/from multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

  1. Read the cbmmasks from each directory
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in all of the directory cbmmasks
  3. Create a new directory
  4. Set the bits found in step 2 to the new directory "schemata" file

If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.
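
Step 2 of the list above can be sketched with shell arithmetic. The
helper name cbm_find_free and its argument order are assumptions made
for this example; it takes the CBM width, the number of bits wanted
and the masks already in use (hex, without "0x"), and prints the
lowest free contiguous mask.

```shell
# cbm_find_free BITS WANT USED_MASK...
# Print the lowest contiguous WANT-bit mask inside a BITS-wide CBM that
# overlaps none of the USED_MASKs; fail if there is none.
# cbm_find_free is a hypothetical helper name.
cbm_find_free() (
    bits=$1
    want=$2
    shift 2
    used=0
    for m in "$@"; do
        used=$(( used | 0x$m ))     # union of all masks in use
    done
    candidate=$(( (1 << want) - 1 ))
    shift_by=0
    while [ $(( shift_by + want )) -le "$bits" ]; do
        if [ $(( (candidate << shift_by) & used )) -eq 0 ]; then
            printf '%x\n' $(( candidate << shift_by ))
            return 0
        fi
        shift_by=$(( shift_by + 1 ))
    done
    return 1
)
```

With masks 3ff and f8000 in use on a 20-bit CBM, "cbm_find_free 20 5
3ff f8000" prints "7c00". The search on its own does nothing about the
race described above: two callers running it at the same time would
both be handed the same mask.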

To coordinate atomic operations on the resctrlfs and to avoid the problem
above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
script command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock with flock(LOCK_UN)

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) If success read the directory structure.
 C) Release the lock with flock(LOCK_UN)

Example with bash:

# Atomically read directory structure
$ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

# Read directory contents and create new subdirectory

$ cat create-dir.sh
find /sys/fs/resctrl/ > output.txt
# function-of is a placeholder for the logic that picks a free mask
mask=$(function-of output.txt)
mkdir /sys/fs/resctrl/newres/
echo "$mask" > /sys/fs/resctrl/newres/schemata

$ flock /sys/fs/resctrl/ ./create-dir.sh

Example with C:

/*
 * Example code to take advisory locks
 * before accessing resctrl filesystem
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>

void resctrl_take_shared_lock(int fd)
{
	int ret;

	/* take shared lock on resctrl filesystem */
	ret = flock(fd, LOCK_SH);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_take_exclusive_lock(int fd)
{
	int ret;

	/* take exclusive lock on resctrl filesystem */
	ret = flock(fd, LOCK_EX);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_release_lock(int fd)
{
	int ret;

	/* release lock on resctrl filesystem */
	ret = flock(fd, LOCK_UN);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

int main(void)
{
	int fd;

	/* open the mount point itself; the lock is taken on this fd */
	fd = open("/sys/fs/resctrl", O_RDONLY | O_DIRECTORY);
	if (fd == -1) {
		perror("open");
		exit(-1);
	}
	resctrl_take_shared_lock(fd);
	/* code to read directory contents */
	resctrl_release_lock(fd);

	resctrl_take_exclusive_lock(fd);
	/* code to read and write directory contents */
	resctrl_release_lock(fd);

	return 0;
}

Examples for RDT Monitoring along with allocation usage:

Reading monitored data
----------------------
Reading an event file (e.g. mon_data/mon_L3_00/llc_occupancy) shows
the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.


Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
---------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
# echo 5678 > p1/tasks
# echo 5679 > p1/tasks

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Create monitor groups and assign a subset of tasks to each monitor group.

# cd /sys/fs/resctrl/p1/mon_groups
# mkdir m11 m12
# echo 5678 > m11/tasks
# echo 5679 > m12/tasks

Fetch data (data shown in bytes)

# cat m11/mon_data/mon_L3_00/llc_occupancy
16234000
# cat m11/mon_data/mon_L3_01/llc_occupancy
14789000
# cat m12/mon_data/mon_L3_00/llc_occupancy
16789000

The parent CTRL_MON group shows the aggregated data.

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
31234000

Example 2 (Monitor a task from its creation)
---------
On a two socket machine (one L3 cache per socket)

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1

An RMID is allocated to the group as soon as it is created and hence the
<cmd> below is monitored from its creation.

# echo $$ > /sys/fs/resctrl/p1/tasks
# <cmd>

Fetch the data

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
31789000

Example 3 (Monitor without CAT support or before creating CAT groups)
---------

Assume a system like HSW has only CQM and no CAT support. In this case
resctrl will still mount but CTRL_MON directories cannot be created.
The user can still create different MON groups within the root group
and so monitor all tasks, including kernel threads.

This can also be used to profile a job's cache footprint before
assigning it to an allocation group.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir mon_groups/m01
# mkdir mon_groups/m02

# echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
# echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks

Monitor the groups separately and also get per domain data. From the
output below it is apparent that the tasks are mostly doing work on
domain (socket) 0.

# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
34555
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
32789


Example 4 (Monitor real time tasks)
-----------------------------------

A single socket system which has real time tasks running on cores 4-7
and non real time tasks on other cpus. We want to monitor the cache
occupancy of the real time threads on these cores.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p1

Move the cpus 4-7 over to p1
# echo f0 > p1/cpus

View the llc occupancy snapshot

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
11234000