# lttng-analyses

This repository contains various scripts to extract monitoring data and metrics
from LTTng kernel traces.

This README only describes the usage of iotop.py, but the procedure is pretty
much the same for the other scripts.

`iotop.py` displays the I/O usage in the trace, per-disk, per-network interface,
per-FD and per-process. It tracks the number of bytes and requests performed
and the latency of the I/O syscalls and block devices.

The user can specify a threshold to see only the requests that took more than a
certain time to complete. This extracts the begin and end timestamps of those
requests, so it is then easier to dig into the trace and understand why the
latency happened. It is also possible to see all the I/O requests performed by
a list of processes (see the `--name` parameter), as shown below.
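
For example, assuming a kernel trace sitting under `/path/to/trace/kernel/` (a
placeholder path), invocations could look like this. The `--latency` value is
in milliseconds, as in the full example further below; the exact argument
format of `--name` is an assumption based on the description above:
```
# show only the I/O requests that took more than 50 ms to complete
./iotop.py --latency 50 /path/to/trace/kernel/
# show all the I/O requests performed by a given process (argument format assumed)
./iotop.py --name postgres /path/to/trace/kernel/
```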

## Requirements
* LTTng 2.5
* Babeltrace 1.2 (with python bindings compiled)
* Python 3.4+
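
A quick sanity check for these requirements might look like the following,
assuming the packages from the next section are installed (the `babeltrace`
Python module is the one shipped by `python3-babeltrace`):
```
lttng --version
python3 --version
python3 -c "import babeltrace"   # fails if the Python bindings are missing
```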

## Install on Ubuntu (12.04 and 14.04 at least)
```
apt-get install -y software-properties-common   # use python-software-properties on 12.04
apt-add-repository -y ppa:lttng/ppa
apt-get update
apt-get -y install lttng-tools babeltrace lttng-modules-dkms python3-babeltrace python3-progressbar
```

If your user is part of the `tracing` group, you do not need to be root for the
next steps. Note that after a fresh install, you have to log out and log back in
for the group membership to take effect.
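
One typical way to get into that group is shown below; the group name `tracing`
comes from the note above, and a re-login is still needed afterwards:
```
sudo usermod -aG tracing $USER   # then log out and log back in
```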

## Trace creation
```
lttng create
lttng enable-channel -k bla --subbuf-size=4M
lttng enable-event -k sched_switch,block_rq_complete,block_rq_issue,block_bio_remap,block_bio_backmerge,netif_receive_skb,net_dev_xmit,sched_process_fork,sched_process_exec,lttng_statedump_process_state,lttng_statedump_file_descriptor,lttng_statedump_block_device,writeback_pages_written,mm_vmscan_wakeup_kswapd,mm_page_free,mm_page_alloc,block_dirty_buffer -c bla
lttng enable-event -k --syscall -a -c bla
lttng start
# ... do stuff ...
lttng stop
lttng destroy
```

## Remote trace creation
You can also create a trace on a server and send it to a remote host. The
remote host only needs to run `lttng-relayd -d` and be reachable over the
network. The only difference from the commands above is the trace session
creation:
```
lttng create -U net://<remote-host>
```
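
Putting both sides together, a minimal remote setup looks like this (the
hostnames are placeholders, and the default output location on the relay host
is an assumption, so check your `lttng-relayd` setup):
```
# on the remote host that will receive the trace
lttng-relayd -d   # received traces typically end up under ~/lttng-traces/

# on the traced machine, only the create step changes:
lttng create -U net://<remote-host>
```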

## Run the analysis
Once you have collected your trace, you can run iotop.py on it. In this
example, we want to extract all the I/O requests that took more than 50 ms to
complete and also display the general statistics. The use case in this example
is concurrent reads and writes in a PostgreSQL database. Here are some excerpts
from the output of the script:

```
./iotop.py --no-progress --latency 50 /home/julien/lttng-traces/reddit-postgres01/pgread-writes/kernel/
[16:36:02.916132296 - 16:36:02.994389303] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 78.257 ms
[16:36:03.007633298 - 16:36:03.091402482] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 83.769 ms
[16:36:03.107392794 - 16:36:03.200894697] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 93.502 ms
[...]
[16:36:09.580490482 - 16:36:09.897385130] postgres (4219) syscall_entry_recvfrom(fd = 9 <socket>, count = 5) = 5, 316.895 ms
2715 pages allocated during the period
[16:36:09.614727970 - 16:36:09.914140692] postgres (4220) syscall_entry_recvfrom(fd = 9 <socket>, count = 5) = 5, 299.413 ms
2396 pages allocated during the period
[16:36:09.750251458 - 16:36:09.950608296] postgres (4224) syscall_entry_recvfrom(fd = 9 <socket>, count = 5) = 5, 200.357 ms
904 pages allocated during the period
[...]
[16:36:06.783206662 - 16:36:11.748147198] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 4964.941 ms
3541 pages allocated during the period
Tue Oct 7 16:36:00 2014 to Tue Oct 7 16:36:18 2014
Syscall I/O Read
################################################################################################################################################
██████████████████████████████████████████████████ 16777220 16.0 MB lttng-consumerd (2619), 0 B disk, 4.0 B net, 0 B block, 16.0 MB unknown
█████ 1804880 1.72 MB lttng-consumerd (2623), 0 B disk, 0 B net, 0 B block, 1.72 MB unknown
█ 407686 398.13 KB postgres (4219), 121.05 KB disk, 277.07 KB net, 0 B block, 8.0 B unknown
262235 256.09 KB postgres (1348), 0 B disk, 255.97 KB net, 0 B block, 117.0 B unknown
209730 204.81 KB postgres (4218), 204.81 KB disk, 0 B net, 0 B block, 0 B unknown
126737 123.77 KB postgres (4220), 117.5 KB disk, 6.26 KB net, 0 B block, 8.0 B unknown
124579 121.66 KB postgres (4226), 117.5 KB disk, 4.15 KB net, 0 B block, 8.0 B unknown
124189 121.28 KB postgres (4221), 117.5 KB disk, 3.77 KB net, 0 B block, 8.0 B unknown
124125 121.22 KB postgres (4222), 117.5 KB disk, 3.71 KB net, 0 B block, 8.0 B unknown
124125 121.22 KB postgres (4224), 117.5 KB disk, 3.71 KB net, 0 B block, 8.0 B unknown
Syscall I/O Write
###################################################################################################################################################
██████████████████████████████████████████████████ 16777336 16.0 MB lttng-consumerd (2619), 0 B disk, 8.0 MB net, 0 B block, 8.0 MB unknown
██████ 2304240 2.2 MB postgres (4219), 2.0 MB disk, 202.23 KB net, 1.76 MB block, 0 B unknown
█████ 1812800 1.73 MB lttng-consumerd (2623), 0 B disk, 887.73 KB net, 0 B block, 882.58 KB unknown
██ 743760 726.33 KB postgres (1165), 8.0 KB disk, 6.33 KB net, 0 B block, 712.0 KB unknown
162500 158.69 KB postgres (1168), 158.69 KB disk, 0 B net, 160.0 KB block, 0 B unknown
82592 80.66 KB postgres (1348), 0 B disk, 80.66 KB net, 0 B block, 0 B unknown
40960 40.0 KB postgres (1166), 0 B disk, 0 B net, 40.0 KB block, 40.0 KB unknown
13156 12.85 KB lttng (4227), 12.85 KB disk, 0 B net, 0 B block, 0 B unknown
12256 11.97 KB postgres (4220), 2.0 B disk, 11.97 KB net, 0 B block, 0 B unknown
10450 10.21 KB postgres (4226), 2.0 B disk, 10.2 KB net, 0 B block, 0 B unknown
Block I/O Read
###############################################################################
0 0 B init (1)
0 0 B kthreadd (2)
0 0 B ksoftirqd/0 (3)
0 0 B kworker/0:0 (4)
0 0 B kworker/0:0H (5)
0 0 B rcu_sched (7)
0 0 B rcuos/0 (8)
0 0 B rcuos/1 (9)
0 0 B rcuos/2 (10)
0 0 B rcuos/3 (11)
Block I/O Write
#########################################################################################
██████████████████████████████████████████████████ 1843200 1.76 MB postgres (4219)
████ 163840 160.0 KB postgres (1168)
██ 102400 100.0 KB kworker/u8:0 (1540)
██ 98304 96.0 KB jbd2/vda1-8 (257)
█ 40960 40.0 KB postgres (1166)
8192 8.0 KB kworker/u9:0 (4197)
4096 4.0 KB kworker/u9:2 (1381)
0 0 B init (1)
0 0 B kthreadd (2)
0 0 B ksoftirqd/0 (3)
Files Read
##########################################################################################################################
██████████████████████████████████████████████████ 9289728 anon_inode:[lttng_stream] (lttng-consumerd) 8.86 MB (31 2619)
█████████████████████████████████████████████ 8388608 pipe:[53306] (lttng-consumerd) 8.0 MB (11 2619)
████ 903760 pipe:[53309] (lttng-consumerd) 882.58 KB (17 2619)
█ 325340 socket (postgres) 317.71 KB (9 4219)
█ 262235 socket:[8893] (postgres) 256.09 KB (9 1348)
76600 socket:[10713] (postgres) 74.8 KB (8 4218)
52 /dev/ptmx 52.0 B (17 1)
48 socket:[54589] (sshd) 48.0 B (3 3329)
32 /root/trace2.sh 32.0 B (255 4211)
1 /dev/pts/1 1.0 B (0 4211)
Files Write
###########################################################################################################################
██████████████████████████████████████████████████ 9289728 anon_inode:[lttng_stream] (lttng-consumerd) 8.86 MB (31 2619)
█████████████████████████████████████████████ 8388608 pipe:[53306] (lttng-consumerd) 8.0 MB (11 2619)
████ 903760 pipe:[53309] (lttng-consumerd) 882.58 KB (17 2619)
█ 325340 socket (postgres) 317.71 KB (9 4219)
█ 262235 socket:[8893] (postgres) 256.09 KB (9 1348)
76600 socket:[10713] (postgres) 74.8 KB (8 4218)
65536 /var/lib/postgresql/9.1/main/base/16384/16611 64.0 KB (9 1165)
52 /dev/ptmx 52.0 B (17 1)
48 socket:[54589] (sshd) 48.0 B (3 3329)
32 /root/trace2.sh 32.0 B (255 4211)
1 /dev/pts/1 1.0 B (0 4211)
Disk nr_sector
###############################################################################
███████████████████████████████████████████████████████████████████ 4416 vda1
Disk nr_requests
###############################################################################
████████████████████████████████████████████████████████████████████ 177 vda1
Disk request time/sector
###############################################################################
███████████████████████████████████████████████████████████████ 0.014 ms vda1
0.0 ms 0
Network recv_bytes
###############################################################################
████████████████████████████████████████████████████████ 757250 739.5 KB eth0
██████ 82200 80.27 KB lo
Network sent_bytes
###############################################################################
████████████████████████████████████████████████████████ 9811620 9.36 MB eth0
85000 83.01 KB lo
trace2.sh requests latency (ms)
####################################################################################
██████████████████████████████████████████████████ 18011.544456 16:36:00.788362068
postgres requests latency (ms)
###################################################################################
78.257007 16:36:02.916132296
83.769184 16:36:03.007633298
93.501903 16:36:03.107392794
[...]
███ 316.894648 16:36:09.580490482
███ 299.412722 16:36:09.614727970
██ 200.356838 16:36:09.750251458
[...]
██████████████████████████████████████████████████ 4964.940536 16:36:06.783206662
```

## Work in progress
Track the page cache and extract the latencies associated with pages flushed to
disk. To do that, we rely on the assumption that the pages are flushed in FIFO
order. It might not be 100% accurate, but it already gives great results:

Here is an example when saving a file in vim:
```
[19:57:51.173332284 - 19:57:51.177794657] vim (31517) syscall_entry_fsync(fd = 4 <blabla>) = 0, 4.462 ms
1 dirty page(s) were flushed (assuming FIFO):
vim (31517): 1 pages
- blabla : 1 pages
13 active dirty filesystem page(s) (known):
redis-server (2092): 2 pages
- /var/log/redis/redis-server.log : 2 pages
vim (31517): 2 pages
- .blabla.swp : 2 pages
lttng-consumerd (6750): 9 pages
- unknown (origin not found) : 9 pages
```

Another example, when running the `sync` command:
```
[19:57:53.046840755 - 19:57:53.072809609] sync (31554) syscall_entry_sync(fd = <unknown>) = 0, 25.969 ms
23 dirty page(s) were flushed (assuming FIFO):
redis-server (2092): 2 pages
- /var/log/redis/redis-server.log : 2 pages
vim (31517): 9 pages
- /home/julien/.viminfo.tmp : 6 pages
- .blabla.swp : 3 pages
lttng-consumerd (6750): 12 pages
- unknown (origin not found) : 12 pages
```

PostgreSQL with `sys_fdatasync`:
```
[13:49:39.908599447 - 13:49:39.915930730] postgres (1137) sys_fdatasync(fd = 7 </var/lib/postgresql/9.1/main/pg_xlog/000000010000000000000008>) = 0, 7.331 ms
2 pages allocated during the period
88 dirty page(s) were flushed (assuming FIFO):
postgres (1137): 88 pages
- /var/lib/postgresql/9.1/main/pg_xlog/000000010000000000000008 : 88 pages
68 last dirtied filesystem page(s):
postgres (2419): 68 pages
- base/11951/18410 : 46 pages
- base/11951/18407 : 10 pages
- base/11951/18407_fsm : 6 pages
- base/11951/18410_fsm : 6 pages
```

Detecting I/O contention between a large write and PostgreSQL:
```
[13:49:47.242730583 - 13:49:47.442835037] python (2353) sys_write(fd = 3 </root/bla>, count = 102395904) = 102395904, 200.104 ms
34760 pages allocated during the period
woke up kswapd during the period
10046 pages written on disk
freed 33753 pages from the cache during the period
1397 last dirtied filesystem page(s):
python (2353): 1325 pages
- /root/bla : 1325 pages
postgres (2419): 72 pages
- base/11951/18419 : 72 pages
```

## Conclusion
We hope you have fun trying it. Please remember it is a work in progress;
feedback, bug reports and ideas are always welcome!