# lttng-analyses

This repository contains various scripts to extract monitoring data and metrics
from LTTng kernel traces.

This README only describes the usage of `iotop.py`, but the procedure is much
the same for the other scripts.

`iotop.py` displays the I/O usage in the trace, per-disk, per-network interface,
per-FD and per-process. It tracks the number of bytes and requests performed,
and the latency of the I/O syscalls and block devices.

The user can specify a threshold to see the requests that took more than a
certain time to complete; this extracts the begin and end timestamps of those
requests, so it is easier to dig into the trace and understand why the latency
happened. It is also possible to see all the I/O requests performed by a list
of processes (see the `--name` parameter).
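
For example, here is a hypothetical invocation combining the latency threshold
and the process filter (the exact value format accepted by `--name` may
differ; check the script's help output):

```
# Show only the I/O requests of postgres processes that took more than 50 ms
./iotop.py --latency 50 --name postgres /path/to/trace/kernel/
```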

## Requirements
* LTTng 2.5
* Babeltrace 1.2 (with the Python bindings compiled)
* Python 3.4+

## Install on Ubuntu (12.04 and 14.04 at least)
```
apt-get install -y software-properties-common (or python-software-properties on 12.04)
apt-add-repository -y ppa:lttng/ppa
apt-get update
apt-get -y install lttng-tools babeltrace lttng-modules-dkms python3-babeltrace python3-progressbar
```
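
Once installed, a quick sanity check can confirm the versions listed under
Requirements (assuming the `python3-babeltrace` package installs the bindings
as the `babeltrace` module):

```
lttng --version                  # should report 2.5 or later
python3 -c 'import babeltrace'   # exits silently if the Python bindings are found
```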

(If your user is part of the `tracing` group, you do not need to be root for
the next steps; note that after a fresh install, you have to log out and log
back in for the group membership to take effect.)
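
To add your user to that group (assuming the LTTng packages created a group
named `tracing`):

```
sudo usermod -aG tracing $USER   # takes effect after logging out and back in
```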

## Trace creation
```
lttng create
lttng enable-channel -k bla --subbuf-size=4M
lttng enable-event -k sched_switch,block_rq_complete,block_rq_issue,block_bio_remap,block_bio_backmerge,netif_receive_skb,net_dev_xmit,sched_process_fork,sched_process_exec,lttng_statedump_process_state,lttng_statedump_file_descriptor,lttng_statedump_block_device,writeback_pages_written,mm_vmscan_wakeup_kswapd,mm_page_free,mm_page_alloc,block_dirty_buffer -c bla
lttng enable-event -k --syscall -a -c bla
lttng start
..do stuff...
lttng stop
lttng destroy
```
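
By default, the trace ends up under `~/lttng-traces/<session-name>/` on the
tracing machine; the analysis scripts take the `kernel/` subdirectory of that
path as argument. A sketch (the session directory name below is illustrative):

```
# The timestamped session directory name will differ on your system
./iotop.py ~/lttng-traces/auto-20141007-163600/kernel/
```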

## Remote trace creation
You can also create a trace on a server and send it to a remote host. The
remote host only needs to run `lttng-relayd -d` and be reachable over the
network. The only difference from the above commands is the trace session
creation:
```
lttng create -U net://<remote-host>
```
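
For completeness, the only command needed on the collection side is the relay
daemon mentioned above:

```
# On the remote host, before creating the session on the traced machine
lttng-relayd -d
```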

## Run the analysis
Once you have collected your trace, you can run `iotop.py` on it. In this
example, we want to extract all the I/O requests that took more than 50 ms to
complete and also display the general statistics. The use case in this example
is concurrent reads and writes in a PostgreSQL database. Here are some
excerpts from the script's output:

```
./iotop.py --no-progress --latency 50 /home/julien/lttng-traces/reddit-postgres01/pgread-writes/kernel/
[16:36:02.916132296 - 16:36:02.994389303] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 78.257 ms
[16:36:03.007633298 - 16:36:03.091402482] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 83.769 ms
[16:36:03.107392794 - 16:36:03.200894697] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 93.502 ms
[...]
[16:36:09.580490482 - 16:36:09.897385130] postgres (4219) syscall_entry_recvfrom(fd = 9 <socket>, count = 5) = 5, 316.895 ms
2715 pages allocated during the period
[16:36:09.614727970 - 16:36:09.914140692] postgres (4220) syscall_entry_recvfrom(fd = 9 <socket>, count = 5) = 5, 299.413 ms
2396 pages allocated during the period
[16:36:09.750251458 - 16:36:09.950608296] postgres (4224) syscall_entry_recvfrom(fd = 9 <socket>, count = 5) = 5, 200.357 ms
904 pages allocated during the period
[...]
[16:36:06.783206662 - 16:36:11.748147198] postgres (1348) syscall_entry_recvfrom(fd = 9 <socket:[8893]>, count = 5) = 5, 4964.941 ms
3541 pages allocated during the period
Tue Oct 7 16:36:00 2014 to Tue Oct 7 16:36:18 2014
Syscall I/O Read
################################################################################################################################################
██████████████████████████████████████████████████ 16777220 16.0 MB lttng-consumerd (2619), 0 B disk, 4.0 B net, 0 B block, 16.0 MB unknown
█████ 1804880 1.72 MB lttng-consumerd (2623), 0 B disk, 0 B net, 0 B block, 1.72 MB unknown
█ 407686 398.13 KB postgres (4219), 121.05 KB disk, 277.07 KB net, 0 B block, 8.0 B unknown
262235 256.09 KB postgres (1348), 0 B disk, 255.97 KB net, 0 B block, 117.0 B unknown
209730 204.81 KB postgres (4218), 204.81 KB disk, 0 B net, 0 B block, 0 B unknown
126737 123.77 KB postgres (4220), 117.5 KB disk, 6.26 KB net, 0 B block, 8.0 B unknown
124579 121.66 KB postgres (4226), 117.5 KB disk, 4.15 KB net, 0 B block, 8.0 B unknown
124189 121.28 KB postgres (4221), 117.5 KB disk, 3.77 KB net, 0 B block, 8.0 B unknown
124125 121.22 KB postgres (4222), 117.5 KB disk, 3.71 KB net, 0 B block, 8.0 B unknown
124125 121.22 KB postgres (4224), 117.5 KB disk, 3.71 KB net, 0 B block, 8.0 B unknown
Syscall I/O Write
###################################################################################################################################################
██████████████████████████████████████████████████ 16777336 16.0 MB lttng-consumerd (2619), 0 B disk, 8.0 MB net, 0 B block, 8.0 MB unknown
██████ 2304240 2.2 MB postgres (4219), 2.0 MB disk, 202.23 KB net, 1.76 MB block, 0 B unknown
█████ 1812800 1.73 MB lttng-consumerd (2623), 0 B disk, 887.73 KB net, 0 B block, 882.58 KB unknown
██ 743760 726.33 KB postgres (1165), 8.0 KB disk, 6.33 KB net, 0 B block, 712.0 KB unknown
162500 158.69 KB postgres (1168), 158.69 KB disk, 0 B net, 160.0 KB block, 0 B unknown
82592 80.66 KB postgres (1348), 0 B disk, 80.66 KB net, 0 B block, 0 B unknown
40960 40.0 KB postgres (1166), 0 B disk, 0 B net, 40.0 KB block, 40.0 KB unknown
13156 12.85 KB lttng (4227), 12.85 KB disk, 0 B net, 0 B block, 0 B unknown
12256 11.97 KB postgres (4220), 2.0 B disk, 11.97 KB net, 0 B block, 0 B unknown
10450 10.21 KB postgres (4226), 2.0 B disk, 10.2 KB net, 0 B block, 0 B unknown
Block I/O Read
###############################################################################
0 0 B init (1)
0 0 B kthreadd (2)
0 0 B ksoftirqd/0 (3)
0 0 B kworker/0:0 (4)
0 0 B kworker/0:0H (5)
0 0 B rcu_sched (7)
0 0 B rcuos/0 (8)
0 0 B rcuos/1 (9)
0 0 B rcuos/2 (10)
0 0 B rcuos/3 (11)
Block I/O Write
#########################################################################################
██████████████████████████████████████████████████ 1843200 1.76 MB postgres (4219)
████ 163840 160.0 KB postgres (1168)
██ 102400 100.0 KB kworker/u8:0 (1540)
██ 98304 96.0 KB jbd2/vda1-8 (257)
█ 40960 40.0 KB postgres (1166)
8192 8.0 KB kworker/u9:0 (4197)
4096 4.0 KB kworker/u9:2 (1381)
0 0 B init (1)
0 0 B kthreadd (2)
0 0 B ksoftirqd/0 (3)
Files Read
##########################################################################################################################
██████████████████████████████████████████████████ 9289728 anon_inode:[lttng_stream] (lttng-consumerd) 8.86 MB (31 2619)
█████████████████████████████████████████████ 8388608 pipe:[53306] (lttng-consumerd) 8.0 MB (11 2619)
████ 903760 pipe:[53309] (lttng-consumerd) 882.58 KB (17 2619)
█ 325340 socket (postgres) 317.71 KB (9 4219)
█ 262235 socket:[8893] (postgres) 256.09 KB (9 1348)
76600 socket:[10713] (postgres) 74.8 KB (8 4218)
52 /dev/ptmx 52.0 B (17 1)
48 socket:[54589] (sshd) 48.0 B (3 3329)
32 /root/trace2.sh 32.0 B (255 4211)
1 /dev/pts/1 1.0 B (0 4211)
Files Write
###########################################################################################################################
██████████████████████████████████████████████████ 9289728 anon_inode:[lttng_stream] (lttng-consumerd) 8.86 MB (31 2619)
█████████████████████████████████████████████ 8388608 pipe:[53306] (lttng-consumerd) 8.0 MB (11 2619)
████ 903760 pipe:[53309] (lttng-consumerd) 882.58 KB (17 2619)
█ 325340 socket (postgres) 317.71 KB (9 4219)
█ 262235 socket:[8893] (postgres) 256.09 KB (9 1348)
76600 socket:[10713] (postgres) 74.8 KB (8 4218)
65536 /var/lib/postgresql/9.1/main/base/16384/16611 64.0 KB (9 1165)
52 /dev/ptmx 52.0 B (17 1)
48 socket:[54589] (sshd) 48.0 B (3 3329)
32 /root/trace2.sh 32.0 B (255 4211)
1 /dev/pts/1 1.0 B (0 4211)
Disk nr_sector
###############################################################################
███████████████████████████████████████████████████████████████████ 4416 vda1
Disk nr_requests
###############################################################################
████████████████████████████████████████████████████████████████████ 177 vda1
Disk request time/sector
###############################################################################
███████████████████████████████████████████████████████████████ 0.014 ms vda1
0.0 ms 0
Network recv_bytes
###############################################################################
████████████████████████████████████████████████████████ 757250 739.5 KB eth0
██████ 82200 80.27 KB lo
Network sent_bytes
###############################################################################
████████████████████████████████████████████████████████ 9811620 9.36 MB eth0
85000 83.01 KB lo
trace2.sh requests latency (ms)
####################################################################################
██████████████████████████████████████████████████ 18011.544456 16:36:00.788362068
postgres requests latency (ms)
###################################################################################
78.257007 16:36:02.916132296
83.769184 16:36:03.007633298
93.501903 16:36:03.107392794
[...]
███ 316.894648 16:36:09.580490482
███ 299.412722 16:36:09.614727970
██ 200.356838 16:36:09.750251458
[...]
██████████████████████████████████████████████████ 4964.940536 16:36:06.783206662
```

## Work in progress
Track the page cache and extract the latencies associated with flushing pages
to disk. In order to do that, we rely on the assumption that the pages are
flushed in FIFO order. It might not be 100% accurate, but it already gives
great results:

Here is an example when saving a file in vim:
```
[19:57:51.173332284 - 19:57:51.177794657] vim (31517) syscall_entry_fsync(fd = 4 <blabla>) = 0, 4.462 ms
1 dirty page(s) were flushed (assuming FIFO):
vim (31517): 1 pages
- blabla : 1 pages
13 active dirty filesystem page(s) (known):
redis-server (2092): 2 pages
- /var/log/redis/redis-server.log : 2 pages
vim (31517): 2 pages
- .blabla.swp : 2 pages
lttng-consumerd (6750): 9 pages
- unknown (origin not found) : 9 pages
```

Another example, when running the `sync` command:
```
[19:57:53.046840755 - 19:57:53.072809609] sync (31554) syscall_entry_sync(fd = <unknown>) = 0, 25.969 ms
23 dirty page(s) were flushed (assuming FIFO):
redis-server (2092): 2 pages
- /var/log/redis/redis-server.log : 2 pages
vim (31517): 9 pages
- /home/julien/.viminfo.tmp : 6 pages
- .blabla.swp : 3 pages
lttng-consumerd (6750): 12 pages
- unknown (origin not found) : 12 pages
```

PostgreSQL with `sys_fdatasync`:
```
[13:49:39.908599447 - 13:49:39.915930730] postgres (1137) sys_fdatasync(fd = 7 </var/lib/postgresql/9.1/main/pg_xlog/000000010000000000000008>) = 0, 7.331 ms
2 pages allocated during the period
88 dirty page(s) were flushed (assuming FIFO):
postgres (1137): 88 pages
- /var/lib/postgresql/9.1/main/pg_xlog/000000010000000000000008 : 88 pages
68 last dirtied filesystem page(s):
postgres (2419): 68 pages
- base/11951/18410 : 46 pages
- base/11951/18407 : 10 pages
- base/11951/18407_fsm : 6 pages
- base/11951/18410_fsm : 6 pages
```

Detecting a fight for I/O between a huge write and PostgreSQL:
```
[13:49:47.242730583 - 13:49:47.442835037] python (2353) sys_write(fd = 3 </root/bla>, count = 102395904) = 102395904, 200.104 ms
34760 pages allocated during the period
woke up kswapd during the period
10046 pages written on disk
freed 33753 pages from the cache during the period
1397 last dirtied filesystem page(s):
python (2353): 1325 pages
- /root/bla : 1325 pages
postgres (2419): 72 pages
- base/11951/18419 : 72 pages
```

## Conclusion
We hope you have fun trying it. Please remember that this is a work in
progress; feedback, bug reports and ideas are always welcome!