This repository contains various scripts to extract monitoring data and metrics
from LTTng kernel traces.
This README only describes the usage of `iotop.py`, but the procedure is much
the same for the other scripts.
`iotop.py` displays the I/O usage in the trace, per disk, per network interface,
per FD and per process. It tracks the number of bytes and requests performed,
and the latency of the I/O syscalls and block devices.
The user can specify a threshold to see the requests that took more than a
certain time to complete. For each such request, the script extracts the begin
and end timestamps, making it easier to dig into the trace and understand why
the latency happened. It is also possible to see all the I/O requests performed
by a list of processes (see the `--name` parameter).
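To make the threshold idea concrete, here is a minimal, hypothetical Python sketch of the kind of filtering `--latency` performs; the `Request` type and the sample values are illustrative, not taken from the script's actual internals:

```python
# Hypothetical sketch of latency-threshold filtering, as performed by
# iotop.py's --latency option. The Request records stand in for
# syscalls reconstructed from the trace; values are illustrative.
from collections import namedtuple

Request = namedtuple("Request", ["proc", "pid", "begin_ns", "end_ns"])

def over_threshold(requests, threshold_ms):
    """Yield (request, duration_ms) for requests slower than threshold_ms,
    keeping the begin/end timestamps so they can be found in the trace."""
    for r in requests:
        duration_ms = (r.end_ns - r.begin_ns) / 1_000_000
        if duration_ms > threshold_ms:
            yield r, duration_ms

requests = [
    Request("postgres", 1348, 0, 78_257_000),   # 78.257 ms, reported
    Request("postgres", 4219, 0, 316_894_648),  # 316.895 ms, reported
    Request("postgres", 4220, 0, 12_000_000),   # 12 ms, filtered out
]

for req, ms in over_threshold(requests, 50):
    print("%s (%d): %.3f ms" % (req.proc, req.pid, ms))
```

The timestamps are kept in nanoseconds (as in the trace) and only converted to milliseconds for the comparison, so nothing is lost when digging back into the trace.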
## Requirements

* Babeltrace 1.2 (with Python bindings compiled)
## Install on Ubuntu (12.04 and 14.04 at least)
    apt-get install -y software-properties-common  # python-software-properties on 12.04
    apt-add-repository -y ppa:lttng/ppa
    apt-get update
    apt-get -y install lttng-tools babeltrace lttng-modules-dkms python3-babeltrace python3-progressbar
(If your user is part of the `tracing` group, you do not need to be root for
the next steps; after a fresh install, you need to log out and log back in for
the group membership to take effect.)
## Trace creation

    lttng create
    lttng enable-channel -k bla --subbuf-size=4M
    lttng enable-event -k sched_switch,block_rq_complete,block_rq_issue,block_bio_remap,block_bio_backmerge,netif_receive_skb,net_dev_xmit,sched_process_fork,sched_process_exec,lttng_statedump_process_state,lttng_statedump_file_descriptor,lttng_statedump_block_device,writeback_pages_written,mm_vmscan_wakeup_kswapd,mm_page_free,mm_page_alloc,block_dirty_buffer,irq_handler_entry,irq_handler_exit,softirq_entry,softirq_exit,softirq_raise -c bla
    lttng enable-event -k --syscall -a -c bla
    lttng start
    # ...run the workload you want to trace...
    lttng stop
    lttng destroy
## Remote trace creation
You can also create a trace on a server and send it to a remote host. The
remote host only needs to run `lttng-relayd -d` and be reachable over the
network. The only difference from the commands above is the trace session
creation:

    lttng create -U net://<remote-host>
Once you have collected your trace, you can run `iotop.py` on it. In this
example, we want to extract all the I/O requests that took more than 50 ms to
complete and also display the general statistics. The use case in this example
is concurrent reads and writes in a PostgreSQL database. Here are some excerpts
from the output of the script:
    ./iotop.py --no-progress --latency 50 /home/julien/lttng-traces/reddit-postgres01/pgread-writes/kernel/
    [16:36:02.916132296 - 16:36:02.994389303] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 78.257 ms
    [16:36:03.007633298 - 16:36:03.091402482] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 83.769 ms
    [16:36:03.107392794 - 16:36:03.200894697] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 93.502 ms

    [16:36:09.580490482 - 16:36:09.897385130] postgres (4219) syscall_entry_recvfrom(fd = 9 <socket>, count = 5) = 5, 316.895 ms
    2715 pages allocated during the period
    [16:36:09.614727970 - 16:36:09.914140692] postgres (4220) syscall_entry_recvfrom(fd = 9 <socket>, count = 5) = 5, 299.413 ms
    2396 pages allocated during the period
    [16:36:09.750251458 - 16:36:09.950608296] postgres (4224) syscall_entry_recvfrom(fd = 9 <socket>, count = 5) = 5, 200.357 ms
    904 pages allocated during the period

    [16:36:06.783206662 - 16:36:11.748147198] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 4964.941 ms
    3541 pages allocated during the period
    Tue Oct 7 16:36:00 2014 to Tue Oct 7 16:36:18 2014
    ################################################################################################################################################
    ██████████████████████████████████████████████████ 16777220 16.0 MB lttng-consumerd (2619), 0 B disk, 4.0 B net, 0 B block, 16.0 MB unknown
    █████ 1804880 1.72 MB lttng-consumerd (2623), 0 B disk, 0 B net, 0 B block, 1.72 MB unknown
    █ 407686 398.13 KB postgres (4219), 121.05 KB disk, 277.07 KB net, 0 B block, 8.0 B unknown
    262235 256.09 KB postgres (1348), 0 B disk, 255.97 KB net, 0 B block, 117.0 B unknown
    209730 204.81 KB postgres (4218), 204.81 KB disk, 0 B net, 0 B block, 0 B unknown
    126737 123.77 KB postgres (4220), 117.5 KB disk, 6.26 KB net, 0 B block, 8.0 B unknown
    124579 121.66 KB postgres (4226), 117.5 KB disk, 4.15 KB net, 0 B block, 8.0 B unknown
    124189 121.28 KB postgres (4221), 117.5 KB disk, 3.77 KB net, 0 B block, 8.0 B unknown
    124125 121.22 KB postgres (4222), 117.5 KB disk, 3.71 KB net, 0 B block, 8.0 B unknown
    124125 121.22 KB postgres (4224), 117.5 KB disk, 3.71 KB net, 0 B block, 8.0 B unknown

    ###################################################################################################################################################
    ██████████████████████████████████████████████████ 16777336 16.0 MB lttng-consumerd (2619), 0 B disk, 8.0 MB net, 0 B block, 8.0 MB unknown
    ██████ 2304240 2.2 MB postgres (4219), 2.0 MB disk, 202.23 KB net, 1.76 MB block, 0 B unknown
    █████ 1812800 1.73 MB lttng-consumerd (2623), 0 B disk, 887.73 KB net, 0 B block, 882.58 KB unknown
    ██ 743760 726.33 KB postgres (1165), 8.0 KB disk, 6.33 KB net, 0 B block, 712.0 KB unknown
    162500 158.69 KB postgres (1168), 158.69 KB disk, 0 B net, 160.0 KB block, 0 B unknown
    82592 80.66 KB postgres (1348), 0 B disk, 80.66 KB net, 0 B block, 0 B unknown
    40960 40.0 KB postgres (1166), 0 B disk, 0 B net, 40.0 KB block, 40.0 KB unknown
    13156 12.85 KB lttng (4227), 12.85 KB disk, 0 B net, 0 B block, 0 B unknown
    12256 11.97 KB postgres (4220), 2.0 B disk, 11.97 KB net, 0 B block, 0 B unknown
    10450 10.21 KB postgres (4226), 2.0 B disk, 10.2 KB net, 0 B block, 0 B unknown
    ###############################################################################
    0 0 B ksoftirqd/0 (3)
    0 0 B kworker/0:0 (4)
    0 0 B kworker/0:0H (5)

    #########################################################################################
    ██████████████████████████████████████████████████ 1843200 1.76 MB postgres (4219)
    ████ 163840 160.0 KB postgres (1168)
    ██ 102400 100.0 KB kworker/u8:0 (1540)
    ██ 98304 96.0 KB jbd2/vda1-8 (257)
    █ 40960 40.0 KB postgres (1166)
    8192 8.0 KB kworker/u9:0 (4197)
    4096 4.0 KB kworker/u9:2 (1381)

    0 0 B ksoftirqd/0 (3)
    ##########################################################################################################################
    ██████████████████████████████████████████████████ 9289728 anon_inode:[lttng_stream] (lttng-consumerd) 8.86 MB (31 2619)
    █████████████████████████████████████████████ 8388608 pipe:[53306] (lttng-consumerd) 8.0 MB (11 2619)
    ████ 903760 pipe:[53309] (lttng-consumerd) 882.58 KB (17 2619)
    █ 325340 socket (postgres) 317.71 KB (9 4219)
    █ 262235 socket:[8893] (postgres) 256.09 KB (9 1348)
    76600 socket:[10713] (postgres) 74.8 KB (8 4218)
    52 /dev/ptmx 52.0 B (17 1)
    48 socket:[54589] (sshd) 48.0 B (3 3329)
    32 /root/trace2.sh 32.0 B (255 4211)
    1 /dev/pts/1 1.0 B (0 4211)

    ###########################################################################################################################
    ██████████████████████████████████████████████████ 9289728 anon_inode:[lttng_stream] (lttng-consumerd) 8.86 MB (31 2619)
    █████████████████████████████████████████████ 8388608 pipe:[53306] (lttng-consumerd) 8.0 MB (11 2619)
    ████ 903760 pipe:[53309] (lttng-consumerd) 882.58 KB (17 2619)
    █ 325340 socket (postgres) 317.71 KB (9 4219)
    █ 262235 socket:[8893] (postgres) 256.09 KB (9 1348)
    76600 socket:[10713] (postgres) 74.8 KB (8 4218)
    65536 /var/lib/postgresql/9.1/main/base/16384/16611 64.0 KB (9 1165)
    52 /dev/ptmx 52.0 B (17 1)
    48 socket:[54589] (sshd) 48.0 B (3 3329)
    32 /root/trace2.sh 32.0 B (255 4211)
    1 /dev/pts/1 1.0 B (0 4211)
    ###############################################################################
    ███████████████████████████████████████████████████████████████████ 4416 vda1

    ###############################################################################
    ████████████████████████████████████████████████████████████████████ 177 vda1
    Disk request time/sector
    ###############################################################################
    ███████████████████████████████████████████████████████████████ 0.014 ms vda1

    ###############################################################################
    ████████████████████████████████████████████████████████ 757250 739.5 KB eth0
    ██████ 82200 80.27 KB lo

    ###############################################################################
    ████████████████████████████████████████████████████████ 9811620 9.36 MB eth0
    trace2.sh requests latency (ms)
    ####################################################################################
    ██████████████████████████████████████████████████ 18011.544456 16:36:00.788362068
    postgres requests latency (ms)
    ###################################################################################
    78.257007 16:36:02.916132296
    83.769184 16:36:03.007633298
    93.501903 16:36:03.107392794

    ███ 316.894648 16:36:09.580490482
    ███ 299.412722 16:36:09.614727970
    ██ 200.356838 16:36:09.750251458

    ██████████████████████████████████████████████████ 4964.940536 16:36:06.783206662
Track the page cache and extract the latencies associated with pages flushed to
disk. To do that, we rely on the assumption that pages are flushed in FIFO
order. It might not be 100% accurate, but it already gives great results.
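The FIFO assumption can be sketched in Python as follows; this is a simplified, hypothetical model of the bookkeeping, not the script's actual code, and the process and file names are illustrative:

```python
# Simplified model of the FIFO page-flush assumption: dirty pages are
# recorded in order, and a flush of N pages is attributed to the N
# oldest dirty pages. Process/file names below are illustrative.
from collections import Counter, deque

class PageCacheModel:
    def __init__(self):
        self.dirty = deque()  # (process, filename), oldest first

    def page_dirtied(self, process, filename):
        """Record one newly dirtied page."""
        self.dirty.append((process, filename))

    def pages_flushed(self, nr_pages):
        """Attribute a flush of nr_pages to the oldest dirty pages."""
        flushed = Counter()
        for _ in range(min(nr_pages, len(self.dirty))):
            flushed[self.dirty.popleft()] += 1
        return flushed

model = PageCacheModel()
for _ in range(2):
    model.page_dirtied("redis-server", "/var/log/redis/redis-server.log")
model.page_dirtied("vim", ".blabla.swp")

# A flush of 2 pages is blamed on redis-server's log pages, since they
# were dirtied first (FIFO); vim's swap page stays dirty.
print(model.pages_flushed(2))
```

This is also why the attribution "might not be 100% accurate": the kernel does not strictly flush pages in the order they were dirtied.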
Here is an example of saving a file in Vim:
    [19:57:51.173332284 - 19:57:51.177794657] vim (31517) syscall_entry_fsync(fd = 4 <blabla>) = 0, 4.462 ms
    1 dirty page(s) were flushed (assuming FIFO):

    13 active dirty filesystem page(s) (known):
    redis-server (2092): 2 pages
    - /var/log/redis/redis-server.log : 2 pages

    - .blabla.swp : 2 pages
    lttng-consumerd (6750): 9 pages
    - unknown (origin not found) : 9 pages
Another example, when running the `sync` command:
    [19:57:53.046840755 - 19:57:53.072809609] sync (31554) syscall_entry_sync(fd = <unknown>) = 0, 25.969 ms
    23 dirty page(s) were flushed (assuming FIFO):
    redis-server (2092): 2 pages
    - /var/log/redis/redis-server.log : 2 pages

    - /home/julien/.viminfo.tmp : 6 pages
    - .blabla.swp : 3 pages
    lttng-consumerd (6750): 12 pages
    - unknown (origin not found) : 12 pages
PostgreSQL with `sys_fdatasync`:
    [13:49:39.908599447 - 13:49:39.915930730] postgres (1137) sys_fdatasync(fd = 7 </var/lib/postgresql/9.1/main/pg_xlog/000000010000000000000008>) = 0, 7.331 ms
    2 pages allocated during the period
    88 dirty page(s) were flushed (assuming FIFO):
    postgres (1137): 88 pages
    - /var/lib/postgresql/9.1/main/pg_xlog/000000010000000000000008 : 88 pages
    68 last dirtied filesystem page(s):
    postgres (2419): 68 pages
    - base/11951/18410 : 46 pages
    - base/11951/18407 : 10 pages
    - base/11951/18407_fsm : 6 pages
    - base/11951/18410_fsm : 6 pages
Detecting a fight for I/O between a huge write and PostgreSQL:
    [13:49:47.242730583 - 13:49:47.442835037] python (2353) sys_write(fd = 3 </root/bla>, count = 102395904) = 102395904, 200.104 ms
    34760 pages allocated during the period
    woke up kswapd during the period
    10046 pages written on disk
    freed 33753 pages from the cache during the period
    1397 last dirtied filesystem page(s):
    python (2353): 1325 pages
    - /root/bla : 1325 pages
    postgres (2419): 72 pages
    - base/11951/18419 : 72 pages
We hope you have fun trying it. Please remember it is a work in progress; feedback, bug reports and ideas are always welcome!