blob: 857e5165df37dcf334714a482a2b06217f88ce3b [file] [log] [blame]
Brendan Gregg3a391c22016-02-08 01:20:31 -08001Demonstrations of runqlat, the Linux eBPF/bcc version.
2
3
4This program summarizes scheduler run queue latency as a histogram, showing
5how long tasks spent waiting their turn to run on-CPU.
6
7Here is a heavily loaded system:
8
9# ./runqlat
10Tracing run queue latency... Hit Ctrl-C to end.
11^C
12 usecs : count distribution
13 0 -> 1 : 233 |*********** |
14 2 -> 3 : 742 |************************************ |
15 4 -> 7 : 203 |********** |
16 8 -> 15 : 173 |******** |
17 16 -> 31 : 24 |* |
18 32 -> 63 : 0 | |
19 64 -> 127 : 30 |* |
20 128 -> 255 : 6 | |
21 256 -> 511 : 3 | |
22 512 -> 1023 : 5 | |
23 1024 -> 2047 : 27 |* |
24 2048 -> 4095 : 30 |* |
25 4096 -> 8191 : 20 | |
26 8192 -> 16383 : 29 |* |
27 16384 -> 32767 : 809 |****************************************|
28 32768 -> 65535 : 64 |*** |
29
30The distribution is bimodal, with one mode between 0 and 15 microseconds,
31and another between 16 and 65 milliseconds. These modes are visible as the
32spikes in the ASCII distribution (which is merely a visual representation
33of the "count" column). As an example of reading one line: 809 events fell
34into the 16384 to 32767 microsecond range (16 to 32 ms) while tracing.
35
36I would expect the two modes to be due the workload: 16 hot CPU-bound threads,
37and many other mostly idle threads doing occasional work. I suspect the mostly
38idle threads will run with a higher priority when they wake up, and are
39the reason for the low latency mode. The high latency mode will be the
40CPU-bound threads. More analysis with this and other tools can confirm.
41
42
43A -m option can be used to show milliseconds instead, as well as an interval
44and a count. For example, showing three x five second summary in milliseconds:
45
46# ./runqlat -m 5 3
47Tracing run queue latency... Hit Ctrl-C to end.
48
49 msecs : count distribution
50 0 -> 1 : 3818 |****************************************|
51 2 -> 3 : 39 | |
52 4 -> 7 : 39 | |
53 8 -> 15 : 62 | |
54 16 -> 31 : 2214 |*********************** |
55 32 -> 63 : 226 |** |
56
57 msecs : count distribution
58 0 -> 1 : 3775 |****************************************|
59 2 -> 3 : 52 | |
60 4 -> 7 : 37 | |
61 8 -> 15 : 65 | |
62 16 -> 31 : 2230 |*********************** |
63 32 -> 63 : 212 |** |
64
65 msecs : count distribution
66 0 -> 1 : 3816 |****************************************|
67 2 -> 3 : 49 | |
68 4 -> 7 : 40 | |
69 8 -> 15 : 53 | |
70 16 -> 31 : 2228 |*********************** |
71 32 -> 63 : 221 |** |
72
73This shows a similar distribution across the three summaries.
74
75
76A -p option can be used to show one PID only, which is filtered in kernel for
77efficiency. For example, PID 4505, and one second summaries:
78
79# ./runqlat -mp 4505 1
80Tracing run queue latency... Hit Ctrl-C to end.
81
82 msecs : count distribution
83 0 -> 1 : 1 |* |
84 2 -> 3 : 2 |*** |
85 4 -> 7 : 1 |* |
86 8 -> 15 : 0 | |
87 16 -> 31 : 25 |****************************************|
88 32 -> 63 : 3 |**** |
89
90 msecs : count distribution
91 0 -> 1 : 0 | |
92 2 -> 3 : 2 |** |
93 4 -> 7 : 0 | |
94 8 -> 15 : 1 |* |
95 16 -> 31 : 30 |****************************************|
96 32 -> 63 : 1 |* |
97
98 msecs : count distribution
99 0 -> 1 : 0 | |
100 2 -> 3 : 0 | |
101 4 -> 7 : 0 | |
102 8 -> 15 : 0 | |
103 16 -> 31 : 28 |****************************************|
104 32 -> 63 : 2 |** |
105
106 msecs : count distribution
107 0 -> 1 : 1 |* |
108 2 -> 3 : 0 | |
109 4 -> 7 : 0 | |
110 8 -> 15 : 0 | |
111 16 -> 31 : 27 |****************************************|
112 32 -> 63 : 4 |***** |
113[...]
114
115For comparison, here is pidstat(1) for that process:
116
117# pidstat -p 4505 1
118Linux 4.4.0-virtual (bgregg-xxxxxxxx) 02/08/2016 _x86_64_ (8 CPU)
119
12008:56:11 AM UID PID %usr %system %guest %CPU CPU Command
12108:56:12 AM 0 4505 9.00 3.00 0.00 12.00 0 bash
12208:56:13 AM 0 4505 7.00 5.00 0.00 12.00 0 bash
12308:56:14 AM 0 4505 10.00 2.00 0.00 12.00 0 bash
12408:56:15 AM 0 4505 11.00 2.00 0.00 13.00 0 bash
12508:56:16 AM 0 4505 9.00 3.00 0.00 12.00 0 bash
126[...]
127
128This is a synthetic workload that is CPU bound. It's only spending 12% on-CPU
129each second because of high CPU demand on this server: the remaining time
130is spent waiting on a run queue, as visualized by runqlat.
131
132
133Here is the same system, but when it is CPU idle:
134
135# ./runqlat 5 1
136Tracing run queue latency... Hit Ctrl-C to end.
137
138 usecs : count distribution
139 0 -> 1 : 2250 |******************************** |
140 2 -> 3 : 2340 |********************************** |
141 4 -> 7 : 2746 |****************************************|
142 8 -> 15 : 418 |****** |
143 16 -> 31 : 93 |* |
144 32 -> 63 : 28 | |
145 64 -> 127 : 119 |* |
146 128 -> 255 : 9 | |
147 256 -> 511 : 4 | |
148 512 -> 1023 : 20 | |
149 1024 -> 2047 : 22 | |
150 2048 -> 4095 : 5 | |
151 4096 -> 8191 : 2 | |
152
153Back to a microsecond scale, this time there is little run queue latency past 1
154millisecond, as would be expected.
155
156
157Now 16 threads are performing heavy disk I/O:
158
159# ./runqlat 5 1
160Tracing run queue latency... Hit Ctrl-C to end.
161
162 usecs : count distribution
163 0 -> 1 : 204 | |
164 2 -> 3 : 944 |* |
165 4 -> 7 : 16315 |********************* |
166 8 -> 15 : 29897 |****************************************|
167 16 -> 31 : 1044 |* |
168 32 -> 63 : 23 | |
169 64 -> 127 : 128 | |
170 128 -> 255 : 24 | |
171 256 -> 511 : 5 | |
172 512 -> 1023 : 13 | |
173 1024 -> 2047 : 15 | |
174 2048 -> 4095 : 13 | |
175 4096 -> 8191 : 10 | |
176
177The distribution hasn't changed too much. While the disks are 100% busy, there
178is still plenty of CPU headroom, and threads still don't spend much time
179waiting their turn.
180
181
182A -P option will print a distribution for each PID:
183
184# ./runqlat -P
185Tracing run queue latency... Hit Ctrl-C to end.
186^C
187
188pid = 0
189 usecs : count distribution
190 0 -> 1 : 351 |******************************** |
191 2 -> 3 : 96 |******** |
192 4 -> 7 : 437 |****************************************|
193 8 -> 15 : 12 |* |
194 16 -> 31 : 10 | |
195 32 -> 63 : 0 | |
196 64 -> 127 : 16 |* |
197 128 -> 255 : 0 | |
198 256 -> 511 : 0 | |
199 512 -> 1023 : 0 | |
200 1024 -> 2047 : 0 | |
201 2048 -> 4095 : 0 | |
202 4096 -> 8191 : 0 | |
203 8192 -> 16383 : 1 | |
204
205pid = 12929
206 usecs : count distribution
207 0 -> 1 : 1 |****************************************|
208 2 -> 3 : 0 | |
209 4 -> 7 : 1 |****************************************|
210
211pid = 12930
212 usecs : count distribution
213 0 -> 1 : 0 | |
214 2 -> 3 : 0 | |
215 4 -> 7 : 0 | |
216 8 -> 15 : 0 | |
217 16 -> 31 : 1 |****************************************|
218 32 -> 63 : 0 | |
219 64 -> 127 : 1 |****************************************|
220
221pid = 12931
222 usecs : count distribution
223 0 -> 1 : 0 | |
224 2 -> 3 : 0 | |
225 4 -> 7 : 1 |******************** |
226 8 -> 15 : 0 | |
227 16 -> 31 : 0 | |
228 32 -> 63 : 0 | |
229 64 -> 127 : 0 | |
230 128 -> 255 : 0 | |
231 256 -> 511 : 0 | |
232 512 -> 1023 : 2 |****************************************|
233
234pid = 12932
235 usecs : count distribution
236 0 -> 1 : 0 | |
237 2 -> 3 : 0 | |
238 4 -> 7 : 0 | |
239 8 -> 15 : 0 | |
240 16 -> 31 : 0 | |
241 32 -> 63 : 0 | |
242 64 -> 127 : 0 | |
243 128 -> 255 : 1 |****************************************|
244 256 -> 511 : 0 | |
245 512 -> 1023 : 1 |****************************************|
246
247pid = 7
248 usecs : count distribution
249 0 -> 1 : 0 | |
250 2 -> 3 : 426 |************************************* |
251 4 -> 7 : 457 |****************************************|
252 8 -> 15 : 16 |* |
253
254pid = 9
255 usecs : count distribution
256 0 -> 1 : 0 | |
257 2 -> 3 : 0 | |
258 4 -> 7 : 425 |****************************************|
259 8 -> 15 : 16 |* |
260
261pid = 11
262 usecs : count distribution
263 0 -> 1 : 0 | |
264 2 -> 3 : 10 |****************************************|
265
266pid = 14
267 usecs : count distribution
268 0 -> 1 : 0 | |
269 2 -> 3 : 8 |****************************************|
270 4 -> 7 : 2 |********** |
271
272pid = 18
273 usecs : count distribution
274 0 -> 1 : 414 |****************************************|
275 2 -> 3 : 0 | |
276 4 -> 7 : 20 |* |
277 8 -> 15 : 8 | |
278
279pid = 12928
280 usecs : count distribution
281 0 -> 1 : 0 | |
282 2 -> 3 : 0 | |
283 4 -> 7 : 1 |****************************************|
284 8 -> 15 : 0 | |
285 16 -> 31 : 0 | |
286 32 -> 63 : 0 | |
287 64 -> 127 : 1 |****************************************|
288
289pid = 1867
290 usecs : count distribution
291 0 -> 1 : 0 | |
292 2 -> 3 : 0 | |
293 4 -> 7 : 0 | |
294 8 -> 15 : 15 |****************************************|
295 16 -> 31 : 1 |** |
296 32 -> 63 : 0 | |
297 64 -> 127 : 0 | |
298 128 -> 255 : 4 |********** |
299
300pid = 1871
301 usecs : count distribution
302 0 -> 1 : 0 | |
303 2 -> 3 : 0 | |
304 4 -> 7 : 0 | |
305 8 -> 15 : 2 |****************************************|
306 16 -> 31 : 0 | |
307 32 -> 63 : 0 | |
308 64 -> 127 : 0 | |
309 128 -> 255 : 0 | |
310 256 -> 511 : 0 | |
311 512 -> 1023 : 1 |******************** |
312
313pid = 1876
314 usecs : count distribution
315 0 -> 1 : 0 | |
316 2 -> 3 : 0 | |
317 4 -> 7 : 0 | |
318 8 -> 15 : 1 |****************************************|
319 16 -> 31 : 0 | |
320 32 -> 63 : 0 | |
321 64 -> 127 : 0 | |
322 128 -> 255 : 0 | |
323 256 -> 511 : 1 |****************************************|
324
325pid = 1878
326 usecs : count distribution
327 0 -> 1 : 0 | |
328 2 -> 3 : 0 | |
329 4 -> 7 : 0 | |
330 8 -> 15 : 0 | |
331 16 -> 31 : 3 |****************************************|
332
333pid = 1880
334 usecs : count distribution
335 0 -> 1 : 0 | |
336 2 -> 3 : 0 | |
337 4 -> 7 : 0 | |
338 8 -> 15 : 3 |****************************************|
339
340pid = 9307
341 usecs : count distribution
342 0 -> 1 : 0 | |
343 2 -> 3 : 0 | |
344 4 -> 7 : 0 | |
345 8 -> 15 : 1 |****************************************|
346
347pid = 1886
348 usecs : count distribution
349 0 -> 1 : 0 | |
350 2 -> 3 : 0 | |
351 4 -> 7 : 1 |******************** |
352 8 -> 15 : 2 |****************************************|
353
354pid = 1888
355 usecs : count distribution
356 0 -> 1 : 0 | |
357 2 -> 3 : 0 | |
358 4 -> 7 : 0 | |
359 8 -> 15 : 3 |****************************************|
360
361pid = 3297
362 usecs : count distribution
363 0 -> 1 : 0 | |
364 2 -> 3 : 0 | |
365 4 -> 7 : 0 | |
366 8 -> 15 : 1 |****************************************|
367
368pid = 1892
369 usecs : count distribution
370 0 -> 1 : 0 | |
371 2 -> 3 : 0 | |
372 4 -> 7 : 0 | |
373 8 -> 15 : 0 | |
374 16 -> 31 : 1 |******************** |
375 32 -> 63 : 0 | |
376 64 -> 127 : 0 | |
377 128 -> 255 : 0 | |
378 256 -> 511 : 0 | |
379 512 -> 1023 : 2 |****************************************|
380
381pid = 7024
382 usecs : count distribution
383 0 -> 1 : 0 | |
384 2 -> 3 : 0 | |
385 4 -> 7 : 0 | |
386 8 -> 15 : 4 |****************************************|
387
388pid = 16468
389 usecs : count distribution
390 0 -> 1 : 0 | |
391 2 -> 3 : 0 | |
392 4 -> 7 : 0 | |
393 8 -> 15 : 3 |****************************************|
394
395pid = 12922
396 usecs : count distribution
397 0 -> 1 : 1 |****************************************|
398 2 -> 3 : 0 | |
399 4 -> 7 : 0 | |
400 8 -> 15 : 1 |****************************************|
401 16 -> 31 : 1 |****************************************|
402 32 -> 63 : 0 | |
403 64 -> 127 : 1 |****************************************|
404
405pid = 12923
406 usecs : count distribution
407 0 -> 1 : 0 | |
408 2 -> 3 : 0 | |
409 4 -> 7 : 1 |******************** |
410 8 -> 15 : 0 | |
411 16 -> 31 : 0 | |
412 32 -> 63 : 0 | |
413 64 -> 127 : 2 |****************************************|
414 128 -> 255 : 0 | |
415 256 -> 511 : 0 | |
416 512 -> 1023 : 1 |******************** |
417 1024 -> 2047 : 1 |******************** |
418
419pid = 12924
420 usecs : count distribution
421 0 -> 1 : 0 | |
422 2 -> 3 : 0 | |
423 4 -> 7 : 2 |******************** |
424 8 -> 15 : 4 |****************************************|
425 16 -> 31 : 1 |********** |
426 32 -> 63 : 0 | |
427 64 -> 127 : 0 | |
428 128 -> 255 : 0 | |
429 256 -> 511 : 0 | |
430 512 -> 1023 : 0 | |
431 1024 -> 2047 : 1 |********** |
432
433pid = 12925
434 usecs : count distribution
435 0 -> 1 : 0 | |
436 2 -> 3 : 0 | |
437 4 -> 7 : 0 | |
438 8 -> 15 : 0 | |
439 16 -> 31 : 0 | |
440 32 -> 63 : 0 | |
441 64 -> 127 : 1 |****************************************|
442
443pid = 12926
444 usecs : count distribution
445 0 -> 1 : 0 | |
446 2 -> 3 : 1 |****************************************|
447 4 -> 7 : 0 | |
448 8 -> 15 : 1 |****************************************|
449 16 -> 31 : 0 | |
450 32 -> 63 : 0 | |
451 64 -> 127 : 0 | |
452 128 -> 255 : 0 | |
453 256 -> 511 : 0 | |
454 512 -> 1023 : 1 |****************************************|
455
456pid = 12927
457 usecs : count distribution
458 0 -> 1 : 1 |****************************************|
459 2 -> 3 : 0 | |
460 4 -> 7 : 1 |****************************************|
461
462
Brenden Blancoa6875602016-05-02 23:32:44 -0700463A -L option will print a distribution for each TID:
464
465# ./runqlat -L
466Tracing run queue latency... Hit Ctrl-C to end.
467^C
468
469tid = 0
470 usecs : count distribution
471 0 -> 1 : 593 |**************************** |
472 2 -> 3 : 829 |****************************************|
473 4 -> 7 : 300 |************** |
474 8 -> 15 : 321 |*************** |
475 16 -> 31 : 132 |****** |
476 32 -> 63 : 58 |** |
477 64 -> 127 : 0 | |
478 128 -> 255 : 0 | |
479 256 -> 511 : 13 | |
480
481tid = 7
482 usecs : count distribution
483 0 -> 1 : 8 |******** |
484 2 -> 3 : 19 |******************** |
485 4 -> 7 : 37 |****************************************|
486[...]
487
488
Brendan Gregg82769382017-04-18 14:23:14 -0500489And a --pidnss option (short for PID namespaces) will print for each PID
490namespace, for analyzing container performance:
491
492# ./runqlat --pidnss -m
493Tracing run queue latency... Hit Ctrl-C to end.
494^C
495
496pidns = 4026532870
497 msecs : count distribution
498 0 -> 1 : 40 |****************************************|
499 2 -> 3 : 1 |* |
500 4 -> 7 : 0 | |
501 8 -> 15 : 0 | |
502 16 -> 31 : 0 | |
503 32 -> 63 : 2 |** |
504 64 -> 127 : 5 |***** |
505
506pidns = 4026532809
507 msecs : count distribution
508 0 -> 1 : 67 |****************************************|
509
510pidns = 4026532748
511 msecs : count distribution
512 0 -> 1 : 63 |****************************************|
513
514pidns = 4026532687
515 msecs : count distribution
516 0 -> 1 : 7 |****************************************|
517
518pidns = 4026532626
519 msecs : count distribution
520 0 -> 1 : 45 |****************************************|
521 2 -> 3 : 0 | |
522 4 -> 7 : 0 | |
523 8 -> 15 : 0 | |
524 16 -> 31 : 0 | |
525 32 -> 63 : 0 | |
526 64 -> 127 : 3 |** |
527
528pidns = 4026531836
529 msecs : count distribution
530 0 -> 1 : 314 |****************************************|
531 2 -> 3 : 1 | |
532 4 -> 7 : 11 |* |
533 8 -> 15 : 28 |*** |
534 16 -> 31 : 137 |***************** |
535 32 -> 63 : 86 |********** |
536 64 -> 127 : 1 | |
537
538pidns = 4026532382
539 msecs : count distribution
540 0 -> 1 : 285 |****************************************|
541 2 -> 3 : 5 | |
542 4 -> 7 : 16 |** |
543 8 -> 15 : 9 |* |
544 16 -> 31 : 69 |********* |
545 32 -> 63 : 25 |*** |
546
547Many of these distributions have two modes: the second, in this case, is
548caused by capping CPU usage via CPU shares.
549
550
Brendan Gregg3a391c22016-02-08 01:20:31 -0800551USAGE message:
552
553# ./runqlat -h
Brendan Gregg82769382017-04-18 14:23:14 -0500554usage: runqlat.py [-h] [-T] [-m] [-P] [--pidnss] [-L] [-p PID]
555 [interval] [count]
Brendan Gregg3a391c22016-02-08 01:20:31 -0800556
Brendan Gregg82769382017-04-18 14:23:14 -0500557Summarize run queue (scheduler) latency as a histogram
Brendan Gregg3a391c22016-02-08 01:20:31 -0800558
559positional arguments:
560 interval output interval, in seconds
561 count number of outputs
562
563optional arguments:
564 -h, --help show this help message and exit
565 -T, --timestamp include timestamp on output
566 -m, --milliseconds millisecond histogram
567 -P, --pids print a histogram per process ID
Brendan Gregg82769382017-04-18 14:23:14 -0500568 --pidnss print a histogram per PID namespace
Brenden Blancoa6875602016-05-02 23:32:44 -0700569 -L, --tids print a histogram per thread ID
Brendan Gregg3a391c22016-02-08 01:20:31 -0800570 -p PID, --pid PID trace this PID only
571
572examples:
573 ./runqlat # summarize run queue latency as a histogram
574 ./runqlat 1 10 # print 1 second summaries, 10 times
575 ./runqlat -mT 1 # 1s summaries, milliseconds, and timestamps
576 ./runqlat -P # show each PID separately
577 ./runqlat -p 185 # trace PID 185 only