# lttng-analyses

This repository contains various scripts to extract monitoring data and metrics
from LTTng kernel traces.

This README only describes the usage of `iotop.py`, but the procedure is pretty
much the same for the other scripts.

`iotop.py` displays the I/O usage in the trace, per disk, per network interface,
per FD and per process. It tracks the number of bytes and requests performed
and the latency of the I/O syscalls and block devices.

The user can specify a threshold to see the requests that took more than a
certain time to complete; this extracts the begin and end timestamps of those
requests, so it is then easier to dig into the trace and understand why this
latency happened. It is also possible to see all the I/O requests performed by
a list of processes (see the `--name` parameter).
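As a rough illustration of what the latency threshold does, here is a minimal
sketch built on the Babeltrace Python bindings (listed in the requirements
below). It simply pairs syscall entry/exit events per CPU and prints the
requests slower than a threshold; the event-name prefixes and the 50 ms value
are assumptions for the example, not the actual implementation of `iotop.py`:

```python
#!/usr/bin/env python3
# Minimal sketch, not iotop.py itself: pair syscall entry/exit events per CPU
# and report the requests that took longer than a threshold.
import sys
from babeltrace import TraceCollection

THRESHOLD_MS = 50.0  # same idea as iotop.py's --latency option

tc = TraceCollection()
if tc.add_trace(sys.argv[1], 'ctf') is None:
    sys.exit('Cannot open trace: %s' % sys.argv[1])

pending = {}  # cpu_id -> (syscall name, entry timestamp in ns)
for event in tc.events:
    cpu = event['cpu_id']
    if event.name.startswith('syscall_entry_') or event.name.startswith('sys_'):
        pending[cpu] = (event.name, event.timestamp)
    elif event.name.startswith('syscall_exit_') or event.name == 'exit_syscall':
        if cpu in pending:
            name, begin = pending.pop(cpu)
            latency_ms = (event.timestamp - begin) / 1000000.0
            if latency_ms >= THRESHOLD_MS:
                print('%s: %.3f ms' % (name, latency_ms))
```

In practice the shipped script already does all of this (plus the per-process
filtering behind `--name`), as the example run below shows.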
## Requirements
* LTTng 2.5
* Babeltrace 1.2 (with the Python bindings compiled)
* Python 3.4+
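A quick way to make sure the Babeltrace bindings are actually available for
Python 3 (a common pitfall) is to try importing them:

```python
# Sanity check: run this with python3; if it raises ImportError, the
# Babeltrace Python bindings are missing or were built for Python 2.
import babeltrace
print(babeltrace.TraceCollection)
```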
## Install on Ubuntu (12.04 and 14.04 at least)
```
apt-get install -y software-properties-common (or python-software-properties on 12.04)
apt-add-repository -y ppa:lttng/ppa
apt-get update
apt-get -y install lttng-tools babeltrace lttng-modules-dkms python3-babeltrace python3-progressbar
```

(If your user is part of the `tracing` group, you do not need to be root for the
next steps; note that after a fresh install you have to log out and log back in
for the group membership to take effect.)

## Trace creation
```
lttng create
lttng enable-channel -k bla --subbuf-size=4M
lttng enable-event -k sched_switch,block_rq_complete,block_rq_issue,block_bio_remap,block_bio_backmerge,netif_receive_skb,net_dev_xmit,sched_process_fork,sched_process_exec,lttng_statedump_process_state,lttng_statedump_file_descriptor,lttng_statedump_block_device,writeback_pages_written,mm_vmscan_wakeup_kswapd,mm_page_free,mm_page_alloc,block_dirty_buffer -c bla
lttng enable-event -k --syscall -a -c bla
lttng start
...do stuff...
lttng stop
lttng destroy
```

## Remote trace creation
You can also create a trace on a server and send it to a remote host. The
remote host only needs to run `lttng-relayd -d` and be reachable over the network.
The only difference with the above commands is the trace session creation:
```
lttng create -U net://<remote-host>
```
## Run the analysis
Once you have collected your trace, you can run `iotop.py` on it. In this
example, we want to extract all the I/O requests that took more than 50 ms to
complete and also display the general statistics. The use case in this example
is concurrent reads and writes in a PostgreSQL database. Here are some extracts
from the output of the script:
```
./iotop.py --no-progress --latency 50 /home/julien/lttng-traces/reddit-postgres01/pgread-writes/kernel/
[16:36:02.916132296 - 16:36:02.994389303] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 78.257 ms
[16:36:03.007633298 - 16:36:03.091402482] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 83.769 ms
[16:36:03.107392794 - 16:36:03.200894697] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 93.502 ms
[...]
[16:36:09.580490482 - 16:36:09.897385130] postgres (4219) syscall_entry_recvfrom(fd = 9 <socket>, count = 5) = 5, 316.895 ms
 2715 pages allocated during the period
[16:36:09.614727970 - 16:36:09.914140692] postgres (4220) syscall_entry_recvfrom(fd = 9 <socket>, count = 5) = 5, 299.413 ms
 2396 pages allocated during the period
[16:36:09.750251458 - 16:36:09.950608296] postgres (4224) syscall_entry_recvfrom(fd = 9 <socket>, count = 5) = 5, 200.357 ms
 904 pages allocated during the period
[...]
[16:36:06.783206662 - 16:36:11.748147198] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 4964.941 ms
 3541 pages allocated during the period
Tue Oct 7 16:36:00 2014 to Tue Oct 7 16:36:18 2014
Syscall I/O Read
################################################################################################################################################
██████████████████████████████████████████████████ 16777220 16.0 MB lttng-consumerd (2619), 0 B disk, 4.0 B net, 0 B block, 16.0 MB unknown
█████ 1804880 1.72 MB lttng-consumerd (2623), 0 B disk, 0 B net, 0 B block, 1.72 MB unknown
█ 407686 398.13 KB postgres (4219), 121.05 KB disk, 277.07 KB net, 0 B block, 8.0 B unknown
 262235 256.09 KB postgres (1348), 0 B disk, 255.97 KB net, 0 B block, 117.0 B unknown
 209730 204.81 KB postgres (4218), 204.81 KB disk, 0 B net, 0 B block, 0 B unknown
 126737 123.77 KB postgres (4220), 117.5 KB disk, 6.26 KB net, 0 B block, 8.0 B unknown
 124579 121.66 KB postgres (4226), 117.5 KB disk, 4.15 KB net, 0 B block, 8.0 B unknown
 124189 121.28 KB postgres (4221), 117.5 KB disk, 3.77 KB net, 0 B block, 8.0 B unknown
 124125 121.22 KB postgres (4222), 117.5 KB disk, 3.71 KB net, 0 B block, 8.0 B unknown
 124125 121.22 KB postgres (4224), 117.5 KB disk, 3.71 KB net, 0 B block, 8.0 B unknown
Syscall I/O Write
###################################################################################################################################################
██████████████████████████████████████████████████ 16777336 16.0 MB lttng-consumerd (2619), 0 B disk, 8.0 MB net, 0 B block, 8.0 MB unknown
██████ 2304240 2.2 MB postgres (4219), 2.0 MB disk, 202.23 KB net, 1.76 MB block, 0 B unknown
█████ 1812800 1.73 MB lttng-consumerd (2623), 0 B disk, 887.73 KB net, 0 B block, 882.58 KB unknown
██ 743760 726.33 KB postgres (1165), 8.0 KB disk, 6.33 KB net, 0 B block, 712.0 KB unknown
 162500 158.69 KB postgres (1168), 158.69 KB disk, 0 B net, 160.0 KB block, 0 B unknown
 82592 80.66 KB postgres (1348), 0 B disk, 80.66 KB net, 0 B block, 0 B unknown
 40960 40.0 KB postgres (1166), 0 B disk, 0 B net, 40.0 KB block, 40.0 KB unknown
 13156 12.85 KB lttng (4227), 12.85 KB disk, 0 B net, 0 B block, 0 B unknown
 12256 11.97 KB postgres (4220), 2.0 B disk, 11.97 KB net, 0 B block, 0 B unknown
 10450 10.21 KB postgres (4226), 2.0 B disk, 10.2 KB net, 0 B block, 0 B unknown
Block I/O Read
###############################################################################
 0 0 B init (1)
 0 0 B kthreadd (2)
 0 0 B ksoftirqd/0 (3)
 0 0 B kworker/0:0 (4)
 0 0 B kworker/0:0H (5)
 0 0 B rcu_sched (7)
 0 0 B rcuos/0 (8)
 0 0 B rcuos/1 (9)
 0 0 B rcuos/2 (10)
 0 0 B rcuos/3 (11)
Block I/O Write
#########################################################################################
██████████████████████████████████████████████████ 1843200 1.76 MB postgres (4219)
████ 163840 160.0 KB postgres (1168)
██ 102400 100.0 KB kworker/u8:0 (1540)
██ 98304 96.0 KB jbd2/vda1-8 (257)
█ 40960 40.0 KB postgres (1166)
 8192 8.0 KB kworker/u9:0 (4197)
 4096 4.0 KB kworker/u9:2 (1381)
 0 0 B init (1)
 0 0 B kthreadd (2)
 0 0 B ksoftirqd/0 (3)
Files Read
##########################################################################################################################
██████████████████████████████████████████████████ 9289728 anon_inode:[lttng_stream] (lttng-consumerd) 8.86 MB (31 2619)
█████████████████████████████████████████████ 8388608 pipe:[53306] (lttng-consumerd) 8.0 MB (11 2619)
████ 903760 pipe:[53309] (lttng-consumerd) 882.58 KB (17 2619)
█ 325340 socket (postgres) 317.71 KB (9 4219)
█ 262235 socket:[8893] (postgres) 256.09 KB (9 1348)
 76600 socket:[10713] (postgres) 74.8 KB (8 4218)
 52 /dev/ptmx 52.0 B (17 1)
 48 socket:[54589] (sshd) 48.0 B (3 3329)
 32 /root/trace2.sh 32.0 B (255 4211)
 1 /dev/pts/1 1.0 B (0 4211)
Files Write
###########################################################################################################################
██████████████████████████████████████████████████ 9289728 anon_inode:[lttng_stream] (lttng-consumerd) 8.86 MB (31 2619)
█████████████████████████████████████████████ 8388608 pipe:[53306] (lttng-consumerd) 8.0 MB (11 2619)
████ 903760 pipe:[53309] (lttng-consumerd) 882.58 KB (17 2619)
█ 325340 socket (postgres) 317.71 KB (9 4219)
█ 262235 socket:[8893] (postgres) 256.09 KB (9 1348)
 76600 socket:[10713] (postgres) 74.8 KB (8 4218)
 65536 /var/lib/postgresql/9.1/main/base/16384/16611 64.0 KB (9 1165)
 52 /dev/ptmx 52.0 B (17 1)
 48 socket:[54589] (sshd) 48.0 B (3 3329)
 32 /root/trace2.sh 32.0 B (255 4211)
 1 /dev/pts/1 1.0 B (0 4211)
Disk nr_sector
###############################################################################
███████████████████████████████████████████████████████████████████ 4416 vda1
Disk nr_requests
###############################################################################
████████████████████████████████████████████████████████████████████ 177 vda1
Disk request time/sector
###############################################################################
███████████████████████████████████████████████████████████████ 0.014 ms vda1
 0.0 ms 0
Network recv_bytes
###############################################################################
████████████████████████████████████████████████████████ 757250 739.5 KB eth0
██████ 82200 80.27 KB lo
Network sent_bytes
###############################################################################
████████████████████████████████████████████████████████ 9811620 9.36 MB eth0
 85000 83.01 KB lo
trace2.sh requests latency (ms)
####################################################################################
██████████████████████████████████████████████████ 18011.544456 16:36:00.788362068
postgres requests latency (ms)
###################################################################################
 78.257007 16:36:02.916132296
 83.769184 16:36:03.007633298
 93.501903 16:36:03.107392794
[...]
███ 316.894648 16:36:09.580490482
███ 299.412722 16:36:09.614727970
██ 200.356838 16:36:09.750251458
[...]
██████████████████████████████████████████████████ 4964.940536 16:36:06.783206662
```
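The per-disk figures above (`Disk nr_sector`, `Disk nr_requests`,
`Disk request time/sector`) are derived from the block-layer events enabled
earlier. A minimal sketch of that derivation, pairing `block_rq_issue` with the
matching `block_rq_complete`, is shown below; the field names (`dev`, `sector`,
`nr_sector`) and the simple pairing key are assumptions for the example, not
the script's actual bookkeeping:

```python
# Minimal sketch, not the actual iotop.py code: accumulate per-device request
# and sector counts, plus the average request time per sector, by pairing
# block_rq_issue with the matching block_rq_complete.
import sys
from babeltrace import TraceCollection

tc = TraceCollection()
if tc.add_trace(sys.argv[1], 'ctf') is None:
    sys.exit('Cannot open trace: %s' % sys.argv[1])

issued = {}  # (dev, sector) -> issue timestamp in ns
stats = {}   # dev -> [nr_requests, nr_sectors, total_time_ns]

for event in tc.events:
    if event.name == 'block_rq_issue':
        issued[(event['dev'], event['sector'])] = event.timestamp
    elif event.name == 'block_rq_complete':
        key = (event['dev'], event['sector'])
        if key in issued:
            per_dev = stats.setdefault(event['dev'], [0, 0, 0])
            per_dev[0] += 1
            per_dev[1] += event['nr_sector']
            per_dev[2] += event.timestamp - issued.pop(key)

for dev, (nr_requests, nr_sectors, total_ns) in stats.items():
    per_sector_ms = total_ns / 1000000.0 / nr_sectors if nr_sectors else 0.0
    print('dev %s: %d requests, %d sectors, %.3f ms/sector' %
          (dev, nr_requests, nr_sectors, per_sector_ms))
```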
## Work in progress
Track the page cache and extract the latencies associated with flushing pages to
disk. To do that, we rely on the assumption that pages are flushed in FIFO
order. It might not be 100% accurate, but it already gives great results, as the
examples below show.
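Conceptually, the bookkeeping behind this FIFO assumption can be pictured with
the following sketch (hypothetical helper functions and structures, not the
actual implementation): pages are queued with their owner as they get dirtied,
and writeback events pop pages from the head of the queue to attribute them.

```python
# Minimal sketch of the FIFO assumption (hypothetical structures, not the
# actual implementation): pages are queued when they are dirtied and later
# attributed, oldest first, to whoever dirtied them when writeback happens.
from collections import deque

dirty_queue = deque()  # each entry: (process name, tid, filename)

def page_dirtied(proc, tid, filename):
    """Record one page dirtied by (proc, tid) on filename
    (driven by events such as block_dirty_buffer)."""
    dirty_queue.append((proc, tid, filename))

def pages_written_back(nr_pages):
    """Attribute nr_pages flushed pages in FIFO order (driven by events such
    as writeback_pages_written); returns {(proc, tid, filename): count}."""
    flushed = {}
    for _ in range(min(nr_pages, len(dirty_queue))):
        entry = dirty_queue.popleft()
        flushed[entry] = flushed.get(entry, 0) + 1
    return flushed
```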
An example when saving a file in vim:
```
[19:57:51.173332284 - 19:57:51.177794657] vim (31517) syscall_entry_fsync(fd = 4 <blabla>) = 0, 4.462 ms
    1 dirty page(s) were flushed (assuming FIFO):
        vim (31517): 1 pages
            - blabla : 1 pages
    13 active dirty filesystem page(s) (known):
        redis-server (2092): 2 pages
            - /var/log/redis/redis-server.log : 2 pages
        vim (31517): 2 pages
            - .blabla.swp : 2 pages
        lttng-consumerd (6750): 9 pages
            - unknown (origin not found) : 9 pages
```

Another example, when running the `sync` command:
```
[19:57:53.046840755 - 19:57:53.072809609] sync (31554) syscall_entry_sync(fd = <unknown>) = 0, 25.969 ms
    23 dirty page(s) were flushed (assuming FIFO):
        redis-server (2092): 2 pages
            - /var/log/redis/redis-server.log : 2 pages
        vim (31517): 9 pages
            - /home/julien/.viminfo.tmp : 6 pages
            - .blabla.swp : 3 pages
        lttng-consumerd (6750): 12 pages
            - unknown (origin not found) : 12 pages
```

PostgreSQL with `sys_fdatasync`:
```
[13:49:39.908599447 - 13:49:39.915930730] postgres (1137) sys_fdatasync(fd = 7 </var/lib/postgresql/9.1/main/pg_xlog/000000010000000000000008>) = 0, 7.331 ms
    2 pages allocated during the period
    88 dirty page(s) were flushed (assuming FIFO):
        postgres (1137): 88 pages
            - /var/lib/postgresql/9.1/main/pg_xlog/000000010000000000000008 : 88 pages
    68 last dirtied filesystem page(s):
        postgres (2419): 68 pages
            - base/11951/18410 : 46 pages
            - base/11951/18407 : 10 pages
            - base/11951/18407_fsm : 6 pages
            - base/11951/18410_fsm : 6 pages
```

Detecting a fight for I/O between a huge write and PostgreSQL:
```
[13:49:47.242730583 - 13:49:47.442835037] python (2353) sys_write(fd = 3 </root/bla>, count = 102395904) = 102395904, 200.104 ms
    34760 pages allocated during the period
    woke up kswapd during the period
    10046 pages written on disk
    freed 33753 pages from the cache during the period
    1397 last dirtied filesystem page(s):
        python (2353): 1325 pages
            - /root/bla : 1325 pages
        postgres (2419): 72 pages
            - base/11951/18419 : 72 pages
```

## Conclusion
Hope you have fun trying it, and please remember it is a work in progress;
feedback, bug reports and ideas are always welcome!