samuelnair | ceff3e0 | 2017-09-12 13:13:10 +0800 | [diff] [blame] | 1 | Demonstrations of nfsdist, the Linux eBPF/bcc version. |
| 2 | |
| 3 | nfsdist traces NFS reads, writes, opens, and getattr, and summarizes their |
| 4 | latency as a power-of-2 histogram. For example: |
| 5 | |
| 6 | |
| 7 | ./nfsdist.py |
| 8 | |
| 9 | Tracing NFS operation latency... Hit Ctrl-C to end. |
| 10 | |
| 11 | operation = read |
| 12 | usecs : count distribution |
| 13 | 0 -> 1 : 4 | | |
| 14 | 2 -> 3 : 0 | | |
| 15 | 4 -> 7 : 0 | | |
| 16 | 8 -> 15 : 7107 |************** | |
| 17 | 16 -> 31 : 19864 |****************************************| |
| 18 | 32 -> 63 : 1494 |*** | |
| 19 | 64 -> 127 : 491 | | |
| 20 | 128 -> 255 : 1810 |*** | |
| 21 | 256 -> 511 : 6356 |************ | |
| 22 | 512 -> 1023 : 4860 |********* | |
| 23 | 1024 -> 2047 : 3070 |****** | |
| 24 | 2048 -> 4095 : 1853 |*** | |
| 25 | 4096 -> 8191 : 921 |* | |
| 26 | 8192 -> 16383 : 122 | | |
| 27 | 16384 -> 32767 : 15 | | |
| 28 | 32768 -> 65535 : 5 | | |
| 29 | 65536 -> 131071 : 2 | | |
| 30 | 131072 -> 262143 : 1 | | |
| 31 | |
| 32 | operation = write |
| 33 | usecs : count distribution |
| 34 | 0 -> 1 : 0 | | |
| 35 | 2 -> 3 : 0 | | |
| 36 | 4 -> 7 : 0 | | |
| 37 | 8 -> 15 : 1 | | |
| 38 | 16 -> 31 : 0 | | |
| 39 | 32 -> 63 : 9 | | |
| 40 | 64 -> 127 : 19491 |****************************************| |
| 41 | 128 -> 255 : 3064 |****** | |
| 42 | 256 -> 511 : 940 |* | |
| 43 | 512 -> 1023 : 365 | | |
| 44 | 1024 -> 2047 : 312 | | |
| 45 | 2048 -> 4095 : 119 | | |
| 46 | 4096 -> 8191 : 31 | | |
| 47 | 8192 -> 16383 : 84 | | |
| 48 | 16384 -> 32767 : 31 | | |
| 49 | 32768 -> 65535 : 5 | | |
| 50 | 65536 -> 131071 : 3 | | |
| 51 | 131072 -> 262143 : 0 | | |
| 52 | 262144 -> 524287 : 1 | | |
| 53 | |
| 54 | operation = getattr |
| 55 | usecs : count distribution |
| 56 | 0 -> 1 : 27 |****************************************| |
| 57 | 2 -> 3 : 2 |** | |
| 58 | 4 -> 7 : 3 |**** | |
| 59 | 8 -> 15 : 0 | | |
| 60 | 16 -> 31 : 0 | | |
| 61 | 32 -> 63 : 0 | | |
| 62 | 64 -> 127 : 0 | | |
| 63 | 128 -> 255 : 0 | | |
| 64 | 256 -> 511 : 2 |** | |
| 65 | 512 -> 1023 : 2 |** | |
| 66 | |
| 67 | operation = open |
| 68 | usecs : count distribution |
| 69 | 0 -> 1 : 0 | | |
| 70 | 2 -> 3 : 0 | | |
| 71 | 4 -> 7 : 0 | | |
| 72 | 8 -> 15 : 0 | | |
| 73 | 16 -> 31 : 0 | | |
| 74 | 32 -> 63 : 0 | | |
| 75 | 64 -> 127 : 0 | | |
| 76 | 128 -> 255 : 0 | | |
| 77 | 256 -> 511 : 2 |****************************************| |
| 78 | |
| 79 | |
| 80 | In this example you can see that the read traffic is rather bi-modal, with about |
| 81 | 26K reads falling within 8 - 30 usecs and about 18K reads spread between 128 - |
| 82 | 8191 usecs. Write traffic is largely clustered in the 64 - 127 usecs bracket. |
| 83 | The faster read traffic is probably coming from a filesystem cache and the slower |
| 84 | traffic from disk. The reason why the writes are so consistently fast is because |
| 85 | this example test was run on a couple of VM's and I believe the hypervisor was |
| 86 | caching all the write traffic to memory. |
| 87 | |
| 88 | This "latency" is measured from when the operation was issued from the VFS |
| 89 | interface to the file system, to when it completed. This spans everything: |
| 90 | RPC latency, network latency, file system CPU cycles, file system locks, run |
| 91 | queue latency, etc. This is a better measure of the latency suffered by |
| 92 | applications reading from a NFS share and can better expose problems |
| 93 | experienced by NFS clients. |
| 94 | |
| 95 | Note that this only traces the common NFS operations (read, write, open and |
| 96 | getattr). I chose to include getattr as a significant percentage of NFS |
| 97 | traffic end up being getattr calls and are a good indicator of problems |
| 98 | with an NFS server. |
| 99 | |
| 100 | An optional interval and a count can be provided, as well as -m to show the |
| 101 | distributions in milliseconds. For example: |
| 102 | |
| 103 | ./nfsdist -m 1 5 |
| 104 | Tracing NFS operation latency... Hit Ctrl-C to end. |
| 105 | |
| 106 | 11:02:39: |
| 107 | |
| 108 | operation = write |
| 109 | msecs : count distribution |
| 110 | 0 -> 1 : 1 | | |
| 111 | 2 -> 3 : 24 |******** | |
| 112 | 4 -> 7 : 114 |****************************************| |
| 113 | 8 -> 15 : 9 |*** | |
| 114 | 16 -> 31 : 1 | | |
| 115 | 32 -> 63 : 1 | | |
| 116 | |
| 117 | 11:02:40: |
| 118 | |
| 119 | operation = write |
| 120 | msecs : count distribution |
| 121 | 0 -> 1 : 0 | | |
| 122 | 2 -> 3 : 11 |*** | |
| 123 | 4 -> 7 : 111 |****************************************| |
| 124 | 8 -> 15 : 13 |**** | |
| 125 | 16 -> 31 : 1 | | |
| 126 | |
| 127 | 11:02:41: |
| 128 | |
| 129 | operation = write |
| 130 | msecs : count distribution |
| 131 | 0 -> 1 : 0 | | |
| 132 | 2 -> 3 : 21 |****** | |
| 133 | 4 -> 7 : 137 |****************************************| |
| 134 | 8 -> 15 : 3 | | |
| 135 | |
| 136 | This shows a write workload, with writes hovering primarily in the 4-7ms range. |
| 137 | |
| 138 | USAGE message: |
| 139 | |
| 140 | |
| 141 | ./nfsdist -h |
| 142 | usage: nfsdist.py [-h] [-T] [-m] [-p PID] [interval] [count] |
| 143 | |
| 144 | Summarize NFS operation latency |
| 145 | |
| 146 | positional arguments: |
| 147 | interval output interval, in seconds |
| 148 | count number of outputs |
| 149 | |
| 150 | optional arguments: |
| 151 | -h, --help show this help message and exit |
| 152 | -T, --notimestamp don't include timestamp on interval output |
| 153 | -m, --milliseconds output in milliseconds |
| 154 | -p PID, --pid PID trace this PID only |
| 155 | |
| 156 | examples: |
| 157 | ./nfsdist # show operation latency as a histogram |
| 158 | ./nfsdist -p 181 # trace PID 181 only |
| 159 | ./nfsdist 1 10 # print 1 second summaries, 10 times |
| 160 | ./nfsdist -m 5 # 5s summaries, milliseconds |