Wenbo Zhang | 15fbd7c | 2019-11-15 03:47:27 -0500 | [diff] [blame] | 1 | Demonstrations of compactstall, the Linux eBPF/bcc version. |
| 2 | |
| 3 | |
| 4 | compactsnoop traces the compact zone system-wide, and print various details. |
| 5 | Example output (manual trigger by echo 1 > /proc/sys/vm/compact_memory): |
| 6 | |
| 7 | # ./compactsnoop |
| 8 | COMM PID NODE ZONE ORDER MODE LAT(ms) STATUS |
| 9 | zsh 23685 0 ZONE_DMA -1 SYNC 0.025 complete |
| 10 | zsh 23685 0 ZONE_DMA32 -1 SYNC 3.925 complete |
| 11 | zsh 23685 0 ZONE_NORMAL -1 SYNC 113.975 complete |
| 12 | zsh 23685 1 ZONE_NORMAL -1 SYNC 81.57 complete |
| 13 | zsh 23685 0 ZONE_DMA -1 SYNC 0.02 complete |
| 14 | zsh 23685 0 ZONE_DMA32 -1 SYNC 4.631 complete |
| 15 | zsh 23685 0 ZONE_NORMAL -1 SYNC 113.975 complete |
| 16 | zsh 23685 1 ZONE_NORMAL -1 SYNC 80.647 complete |
| 17 | zsh 23685 0 ZONE_DMA -1 SYNC 0.020 complete |
| 18 | zsh 23685 0 ZONE_DMA32 -1 SYNC 3.367 complete |
| 19 | zsh 23685 0 ZONE_NORMAL -1 SYNC 115.18 complete |
| 20 | zsh 23685 1 ZONE_NORMAL -1 SYNC 81.766 complete |
| 21 | zsh 23685 0 ZONE_DMA -1 SYNC 0.025 complete |
| 22 | zsh 23685 0 ZONE_DMA32 -1 SYNC 4.346 complete |
| 23 | zsh 23685 0 ZONE_NORMAL -1 SYNC 114.570 complete |
| 24 | zsh 23685 1 ZONE_NORMAL -1 SYNC 80.820 complete |
| 25 | zsh 23685 0 ZONE_DMA -1 SYNC 0.026 complete |
| 26 | zsh 23685 0 ZONE_DMA32 -1 SYNC 4.611 complete |
| 27 | zsh 23685 0 ZONE_NORMAL -1 SYNC 113.993 complete |
| 28 | zsh 23685 1 ZONE_NORMAL -1 SYNC 80.928 complete |
| 29 | zsh 23685 0 ZONE_DMA -1 SYNC 0.02 complete |
| 30 | zsh 23685 0 ZONE_DMA32 -1 SYNC 3.889 complete |
| 31 | zsh 23685 0 ZONE_NORMAL -1 SYNC 113.776 complete |
| 32 | zsh 23685 1 ZONE_NORMAL -1 SYNC 80.727 complete |
| 33 | ^C |
| 34 | |
| 35 | While tracing, the processes alloc pages due to memory fragmentation is too |
| 36 | serious to meet contiguous memory requirements in the system, compact zone |
| 37 | events happened, which will increase the waiting delay of the processes. |
| 38 | |
| 39 | compactsnoop can be useful for discovering when compact_stall(/proc/vmstat) |
| 40 | continues to increase, whether it is caused by some critical processes or not. |
| 41 | |
| 42 | The STATUS include (CentOS 7.6's kernel) |
| 43 | |
| 44 | compact_status = { |
| 45 | # COMPACT_SKIPPED: compaction didn't start as it was not possible or direct reclaim was more suitable |
| 46 | 0: "skipped", |
| 47 | # COMPACT_CONTINUE: compaction should continue to another pageblock |
| 48 | 1: "continue", |
| 49 | # COMPACT_PARTIAL: direct compaction partially compacted a zone and there are suitable pages |
| 50 | 2: "partial", |
| 51 | # COMPACT_COMPLETE: The full zone was compacted |
| 52 | 3: "complete", |
| 53 | } |
| 54 | |
| 55 | or (kernel 4.7 and above) |
| 56 | |
| 57 | compact_status = { |
| 58 | # COMPACT_NOT_SUITABLE_ZONE: For more detailed tracepoint output - internal to compaction |
| 59 | 0: "not_suitable_zone", |
| 60 | # COMPACT_SKIPPED: compaction didn't start as it was not possible or direct reclaim was more suitable |
| 61 | 1: "skipped", |
| 62 | # COMPACT_DEFERRED: compaction didn't start as it was deferred due to past failures |
| 63 | 2: "deferred", |
| 64 | # COMPACT_NOT_SUITABLE_PAGE: For more detailed tracepoint output - internal to compaction |
| 65 | 3: "no_suitable_page", |
| 66 | # COMPACT_CONTINUE: compaction should continue to another pageblock |
| 67 | 4: "continue", |
Michael Prokop | c14d02a | 2020-01-09 02:29:18 +0100 | [diff] [blame] | 68 | # COMPACT_COMPLETE: The full zone was compacted scanned but wasn't successful to compact suitable pages. |
Wenbo Zhang | 15fbd7c | 2019-11-15 03:47:27 -0500 | [diff] [blame] | 69 | 5: "complete", |
Michael Prokop | c14d02a | 2020-01-09 02:29:18 +0100 | [diff] [blame] | 70 | # COMPACT_PARTIAL_SKIPPED: direct compaction has scanned part of the zone but wasn't successful to compact suitable pages. |
Wenbo Zhang | 15fbd7c | 2019-11-15 03:47:27 -0500 | [diff] [blame] | 71 | 6: "partial_skipped", |
| 72 | # COMPACT_CONTENDED: compaction terminated prematurely due to lock contentions |
| 73 | 7: "contended", |
| 74 | # COMPACT_SUCCESS: direct compaction terminated after concluding that the allocation should now succeed |
| 75 | 8: "success", |
| 76 | } |
| 77 | |
| 78 | The -p option can be used to filter on a PID, which is filtered in-kernel. Here |
| 79 | I've used it with -T to print timestamps: |
| 80 | |
| 81 | # ./compactsnoop -Tp 24376 |
| 82 | TIME(s) COMM PID NODE ZONE ORDER MODE LAT(ms) STATUS |
| 83 | 101.364115000 zsh 24376 0 ZONE_DMA -1 SYNC 0.025 complete |
| 84 | 101.364555000 zsh 24376 0 ZONE_DMA32 -1 SYNC 3.925 complete |
| 85 | ^C |
| 86 | |
| 87 | This shows the zsh process allocs pages, and compact zone events happening, |
| 88 | and the delays are not affected much. |
| 89 | |
| 90 | A maximum tracing duration can be set with the -d option. For example, to trace |
| 91 | for 2 seconds: |
| 92 | |
| 93 | # ./compactsnoop -d 2 |
| 94 | COMM PID NODE ZONE ORDER MODE LAT(ms) STATUS |
| 95 | zsh 26385 0 ZONE_DMA -1 SYNC 0.025444 complete |
| 96 | ^C |
| 97 | |
| 98 | The -e option prints out extra columns |
| 99 | |
| 100 | # ./compactsnoop -e |
| 101 | COMM PID NODE ZONE ORDER MODE FRAGIDX MIN LOW HIGH FREE LAT(ms) STATUS |
| 102 | summ 28276 1 ZONE_NORMAL 3 ASYNC 0.728 11284 14105 16926 14193 3.58 partial |
| 103 | summ 28276 0 ZONE_NORMAL 2 ASYNC -1.000 11043 13803 16564 14479 0.0 complete |
| 104 | summ 28276 1 ZONE_NORMAL 2 ASYNC -1.000 11284 14105 16926 14785 0.019 complete |
| 105 | summ 28276 0 ZONE_NORMAL 2 ASYNC -1.000 11043 13803 16564 15199 0.006 partial |
| 106 | summ 28276 1 ZONE_NORMAL 2 ASYNC -1.000 11284 14105 16926 17360 0.030 complete |
| 107 | summ 28276 0 ZONE_NORMAL 2 ASYNC -1.000 11043 13803 16564 15443 0.024 complete |
| 108 | summ 28276 1 ZONE_NORMAL 2 ASYNC -1.000 11284 14105 16926 15634 0.018 complete |
| 109 | summ 28276 1 ZONE_NORMAL 3 ASYNC 0.832 11284 14105 16926 15301 0.006 partial |
| 110 | summ 28276 0 ZONE_NORMAL 2 ASYNC -1.000 11043 13803 16564 14774 0.005 partial |
| 111 | summ 28276 1 ZONE_NORMAL 3 ASYNC 0.733 11284 14105 16926 19888 0.012 partial |
| 112 | ^C |
| 113 | |
| 114 | The FRAGIDX is short for fragmentation index, which only makes sense if an |
| 115 | allocation of a requested size would fail. If that is true, the fragmentation |
| 116 | index indicates whether external fragmentation or a lack of memory was the |
| 117 | problem. The value can be used to determine if page reclaim or compaction |
| 118 | should be used. |
| 119 | |
| 120 | Index is between 0 and 1 so return within 3 decimal places |
| 121 | |
| 122 | 0 => allocation would fail due to lack of memory |
| 123 | 1 => allocation would fail due to fragmentation |
| 124 | |
| 125 | We can see the whole buddy's fragmentation index from /sys/kernel/debug/extfrag/extfrag_index |
| 126 | |
| 127 | The MIN/LOW/HIGH shows the watermarks of the zone, which can also get from |
| 128 | /proc/zoneinfo, and FREE means nr_free_pages (can be found in /proc/zoneinfo too). |
| 129 | |
| 130 | |
| 131 | The -K option prints out kernel stack |
| 132 | |
| 133 | # ./compactsnoop -K -e |
| 134 | |
| 135 | summ 28276 0 ZONE_NORMAL 3 ASYNC 0.528 11043 13803 16564 22654 13.258 partial |
| 136 | kretprobe_trampoline+0x0 |
| 137 | try_to_compact_pages+0x121 |
| 138 | __alloc_pages_direct_compact+0xac |
| 139 | __alloc_pages_slowpath+0x3e9 |
| 140 | __alloc_pages_nodemask+0x404 |
| 141 | alloc_pages_current+0x98 |
| 142 | new_slab+0x2c5 |
| 143 | ___slab_alloc+0x3ac |
| 144 | __slab_alloc+0x40 |
| 145 | kmem_cache_alloc_node+0x8b |
| 146 | copy_process+0x18e |
| 147 | do_fork+0x91 |
| 148 | sys_clone+0x16 |
| 149 | stub_clone+0x44 |
| 150 | |
| 151 | summ 28276 1 ZONE_NORMAL 3 ASYNC -1.000 11284 14105 16926 22074 0.008 partial |
| 152 | kretprobe_trampoline+0x0 |
| 153 | try_to_compact_pages+0x121 |
| 154 | __alloc_pages_direct_compact+0xac |
| 155 | __alloc_pages_slowpath+0x3e9 |
| 156 | __alloc_pages_nodemask+0x404 |
| 157 | alloc_pages_current+0x98 |
| 158 | new_slab+0x2c5 |
| 159 | ___slab_alloc+0x3ac |
| 160 | __slab_alloc+0x40 |
| 161 | kmem_cache_alloc_node+0x8b |
| 162 | copy_process+0x18e |
| 163 | do_fork+0x91 |
| 164 | sys_clone+0x16 |
| 165 | stub_clone+0x44 |
| 166 | |
| 167 | summ 28276 0 ZONE_NORMAL 3 ASYNC 0.527 11043 13803 16564 25653 9.812 partial |
| 168 | kretprobe_trampoline+0x0 |
| 169 | try_to_compact_pages+0x121 |
| 170 | __alloc_pages_direct_compact+0xac |
| 171 | __alloc_pages_slowpath+0x3e9 |
| 172 | __alloc_pages_nodemask+0x404 |
| 173 | alloc_pages_current+0x98 |
| 174 | new_slab+0x2c5 |
| 175 | ___slab_alloc+0x3ac |
| 176 | __slab_alloc+0x40 |
| 177 | kmem_cache_alloc_node+0x8b |
| 178 | copy_process+0x18e |
| 179 | do_fork+0x91 |
| 180 | sys_clone+0x16 |
| 181 | stub_clone+0x44 |
| 182 | |
| 183 | # ./compactsnoop -h |
| 184 | usage: compactsnoop.py [-h] [-T] [-p PID] [-d DURATION] [-K] [-e] |
| 185 | |
| 186 | Trace compact zone |
| 187 | |
| 188 | optional arguments: |
| 189 | -h, --help show this help message and exit |
| 190 | -T, --timestamp include timestamp on output |
| 191 | -p PID, --pid PID trace this PID only |
| 192 | -d DURATION, --duration DURATION |
| 193 | total duration of trace in seconds |
| 194 | -K, --kernel-stack output kernel stack trace |
| 195 | -e, --extended_fields |
| 196 | show system memory state |
| 197 | |
| 198 | examples: |
| 199 | ./compactsnoop # trace all compact stall |
| 200 | ./compactsnoop -T # include timestamps |
| 201 | ./compactsnoop -d 10 # trace for 10 seconds only |
| 202 | ./compactsnoop -K # output kernel stack trace |
| 203 | ./compactsnoop -e # show extended fields |