fix: lost link to the LTTng documentation
[deliverable/lttng-analyses.git] / README.md
1 # lttng-analyses
2
3 This repository contains various scripts to extract monitoring data and metrics
4 from LTTng kernel traces.
5
6 As opposed to other diagnostic or monitoring solutions, this approach is designed
7 to allow users to record their system's activity with a low overhead, wait
8 for a problem to occur and then diagnose its cause offline.
9
10 This solution allows the user to target hard to find problems and dig until the
11 root cause is found.
12
13 This README describes the implemented analyses as well as how to use them.
14
15 ## Requirements
16 * LTTng >= 2.5
17 * Babeltrace >= 1.2 (with python bindings built)
18 * Python >= 3.4
19
20 ## Install on Ubuntu (12.04 and 14.04 at least)
21 ```bash
22 apt-get install -y software-properties-common (or python-software-properties on 12.04)
23 apt-add-repository -y ppa:lttng/ppa
24 apt-get update
25 apt-get -y install lttng-tools babeltrace lttng-modules-dkms python3-babeltrace python3-progressbar
26 ```
27
28 ```bash
29 git clone https://github.com/lttng/lttng-analyses.git
30 ```
31
32 ## Trace creation
33 Here are the basic commands to create a trace, for more information on the
34 LTTng setup, please refer to the [LTTng
35 documentation](http://lttng.org/docs/#doc-tracing-the-linux-kernel)
36
37 ### Automatic
38 From the cloned git tree:
39 ```bash
40 ./lttng-analyses-record
41 ```
42
43 ### Manual
44 ```bash
45 lttng create
46 lttng enable-channel -k bla --subbuf-size=4M
47 lttng enable-event -k sched_switch,block_rq_complete,block_rq_issue,block_bio_remap,block_bio_backmerge,netif_receive_skb,net_dev_xmit,sched_process_fork,sched_process_exec,lttng_statedump_process_state,lttng_statedump_file_descriptor,lttng_statedump_block_device,writeback_pages_written,mm_vmscan_wakeup_kswapd,mm_page_free,mm_page_alloc,block_dirty_buffer,irq_handler_entry,irq_handler_exit,softirq_entry,softirq_exit,softirq_raise -c bla
48 lttng enable-event -k --syscall -a -c bla
49 lttng start
50 ..do stuff...
51 lttng stop
52 lttng destroy
53 ```
54
55 ### Remote trace creation
56 You can also create a trace on a server and send it to a remote host. The
57 remote host only needs to run `lttng-relayd -d` and be reachable over the network.
58 The only difference with the above commands is the tracing session's creation:
59 ```bash
60 lttng create -U net://<remote-host>
61 ```
62
63 ## Implemented analyses
64
65 * CPU usage for the whole system
66 * CPU usage per-process
67 * Process CPU migration count
68 * Memory usage per-process (as seen by the kernel)
69 * Memory usage system-wide (as seen by the kernel)
70 * I/O usage (syscalls, disk, network)
71 * I/O operations log (with latency and usage)
72 * I/O latency statistics (open, read, write, sync operations)
73 * I/O latency frequency distribution
74 * Interrupt handler duration statistics (count, min, max, average stdev)
75 * Interrupt handler duration top
76 * Interrupt handler duration log
77 * Interrupt handler duration frequency distribution
78 * SoftIRQ handler latency statistics
79 * Syscalls usage statistics
80
81 All of the analyses share the same code architecture making it possible
82 to filter by timerange, process name, PID, min and max values using the
83 same command-line options. Also note that reported timestamps can
84 optionally be expressed in the GMT timezone to allow easy sharing between
85 teams.
86
87 The project's architecture makes it easy to add new analyses or to reuse
88 the analysis backend in external tools which may then present the results
89 in their own format (as opposed to text).
90
91 ## Examples
92 After having collected your trace, any script contained in this repository
93 can be used to run an analysis. Read on for some examples!
94
95 ### I/O
96 #### I/O latency stats
97 ```bash
98 $ ./lttng-iolatencystats mytrace/
99 Timerange: [2015-01-06 10:58:26.140545481, 2015-01-06 10:58:27.229358936]
100 Syscalls latency statistics (usec):
101 Type Count Min Average Max Stdev
102 -----------------------------------------------------------------------------------------
103 Open 45 5.562 13.835 77.683 15.263
104 Read 109 0.316 5.774 62.569 9.277
105 Write 101 0.256 7.060 48.531 8.555
106 Sync 207 19.384 40.664 160.188 21.201
107
108 Disk latency statistics (usec):
109 Name Count Min Average Max Stdev
110 -----------------------------------------------------------------------------------------
111 dm-0 108 0.001 0.004 0.007 1.306
112 ```
113
114 #### I/O latency frequency distribution
115 ```bash
116 $ ./lttng-iolatencyfreq mytrace/
117 Timerange: [2015-01-06 10:58:26.140545481, 2015-01-06 10:58:27.229358936]
118 Open latency distribution (usec)
119 ###############################################################################
120 5.562 ███████████████████████████████████████████████████████████████████ 25
121 9.168 ██████████ 4
122 12.774 █████████████████████ 8
123 16.380 ████████ 3
124 19.986 █████ 2
125 23.592 0
126 27.198 0
127 30.804 0
128 34.410 ██ 1
129 38.016 0
130 41.623 0
131 45.229 0
132 48.835 0
133 52.441 0
134 56.047 0
135 59.653 0
136 63.259 0
137 66.865 0
138 70.471 0
139 74.077 █████ 2
140 ```
141
142 #### I/O latency top
143 ```bash
144 $ ./lttng-iolatencytop analysis-20150115-120942/ --limit 3 --minsize 2
145 Checking the trace for lost events...
146 Timerange: [2015-01-15 12:18:37.216484041, 2015-01-15 12:18:53.821580313]
147 Top open syscall latencies (usec)
148 Begin End Name Duration (usec) Size Proc PID Filename
149 [12:18:50.432950815,12:18:50.870648568] open 437697.753 N/A apache2 31517 /var/lib/php5/sess_0ifir2hangm8ggaljdphl9o5b5 (fd=13)
150 [12:18:52.946080165,12:18:52.946132278] open 52.113 N/A apache2 31588 /var/lib/php5/sess_mr9045p1k55vin1h0vg7rhgd63 (fd=13)
151 [12:18:46.800846035,12:18:46.800874916] open 28.881 N/A apache2 31591 /var/lib/php5/sess_r7c12pccfvjtas15g3j69u14h0 (fd=13)
152 [12:18:51.389797604,12:18:51.389824426] open 26.822 N/A apache2 31520 /var/lib/php5/sess_4sdb1rtjkhb78sabnoj8gpbl00 (fd=13)
153
154 Top read syscall latencies (usec)
155 Begin End Name Duration (usec) Size Proc PID Filename
156 [12:18:37.256073107,12:18:37.256555967] read 482.860 7.00 B bash 10237 unknown (origin not found) (fd=3)
157 [12:18:52.000209798,12:18:52.000252304] read 42.506 1.00 KB irqbalance 1337 /proc/interrupts (fd=3)
158 [12:18:37.256559439,12:18:37.256601615] read 42.176 5.00 B bash 10237 unknown (origin not found) (fd=3)
159 [12:18:42.000281918,12:18:42.000320016] read 38.098 1.00 KB irqbalance 1337 /proc/interrupts (fd=3)
160
161 Top write syscall latencies (usec)
162 Begin End Name Duration (usec) Size Proc PID Filename
163 [12:18:49.913241516,12:18:49.915908862] write 2667.346 95.00 B apache2 31584 /var/log/apache2/access.log (fd=8)
164 [12:18:37.472823631,12:18:37.472859836] writev 36.205 21.97 KB apache2 31544 unknown (origin not found) (fd=12)
165 [12:18:37.991578372,12:18:37.991612724] writev 34.352 21.97 KB apache2 31589 unknown (origin not found) (fd=12)
166 [12:18:39.547778549,12:18:39.547812515] writev 33.966 21.97 KB apache2 31584 unknown (origin not found) (fd=12)
167
168 Top sync syscall latencies (usec)
169 Begin End Name Duration (usec) Size Proc PID Filename
170 [12:18:50.162776739,12:18:51.157522361] sync 994745.622 N/A sync 22791 None (fd=None)
171 [12:18:37.227867532,12:18:37.232289687] sync_file_range 4422.155 N/A lttng-consumerd 19964 /home/julien/lttng-traces/analysis-20150115-120942/kernel/metadata (fd=32)
172 [12:18:37.238076585,12:18:37.239012027] sync_file_range 935.442 N/A lttng-consumerd 19964 /home/julien/lttng-traces/analysis-20150115-120942/kernel/metadata (fd=32)
173 [12:18:37.220974711,12:18:37.221647124] sync_file_range 672.413 N/A lttng-consumerd 19964 /home/julien/lttng-traces/analysis-20150115-120942/kernel/metadata (fd=32)
174 ```
175
176 #### I/O operations log
177 ```bash
178 $ ./lttng-iolog mytrace/
179 [10:58:26.221618530,10:58:26.221620659] write 2.129 8.00 B /usr/bin/x-term 11793 anon_inode:[eventfd] (fd=5)
180 [10:58:26.221623609,10:58:26.221628055] read 4.446 50.00 B /usr/bin/x-term 11793 /dev/ptmx (fd=24)
181 [10:58:26.221638929,10:58:26.221640008] write 1.079 8.00 B /usr/bin/x-term 11793 anon_inode:[eventfd] (fd=5)
182 [10:58:26.221676232,10:58:26.221677385] read 1.153 8.00 B /usr/bin/x-term 11793 anon_inode:[eventfd] (fd=5)
183 [10:58:26.223401804,10:58:26.223411683] open 9.879 N/A sleep 12420 /etc/ld.so.cache (fd=3)
184 [10:58:26.223448060,10:58:26.223455577] open 7.517 N/A sleep 12420 /lib/x86_64-linux-gnu/libc.so.6 (fd=3)
185 [10:58:26.223456522,10:58:26.223458898] read 2.376 832.00 B sleep 12420 /lib/x86_64-linux-gnu/libc.so.6 (fd=3)
186 [10:58:26.223918068,10:58:26.223929316] open 11.248 N/A sleep 12420 (fd=3)
187 [10:58:26.231881565,10:58:26.231895970] writev 14.405 16.00 B /usr/bin/x-term 11793 socket:[45650] (fd=4)
188 [10:58:26.231979636,10:58:26.231988446] recvmsg 8.810 16.00 B Xorg 1827 socket:[47480] (fd=38)
189 ```
190
191
192 #### I/O usage top
193 ```bash
194 $ ./lttng-iousagetop traces/pgread-writes
195 Timerange: [2014-10-07 16:36:00.733214969, 2014-10-07 16:36:18.804584183]
196 Per-process I/O Read
197 ###############################################################################
198 ██████████████████████████████████████████████████ 16.00 MB lttng-consumerd (2619) 0 B file 4.00 B net 16.00 MB unknown
199 █████ 1.72 MB lttng-consumerd (2619) 0 B file 0 B net 1.72 MB unknown
200 █ 398.13 KB postgres (4219) 121.05 KB file 277.07 KB net 8.00 B unknown
201 256.09 KB postgres (1348) 0 B file 255.97 KB net 117.00 B unknown
202 204.81 KB postgres (4218) 204.81 KB file 0 B net 0 B unknown
203 123.77 KB postgres (4220) 117.50 KB file 6.26 KB net 8.00 B unknown
204 Per-process I/O Write
205 ###############################################################################
206 ██████████████████████████████████████████████████ 16.00 MB lttng-consumerd (2619) 0 B file 8.00 MB net 8.00 MB unknown
207 ██████ 2.20 MB postgres (4219) 2.00 MB file 202.23 KB net 0 B unknown
208 █████ 1.73 MB lttng-consumerd (2619) 0 B file 887.73 KB net 882.58 KB unknown
209 ██ 726.33 KB postgres (1165) 8.00 KB file 6.33 KB net 712.00 KB unknown
210 158.69 KB postgres (1168) 158.69 KB file 0 B net 0 B unknown
211 80.66 KB postgres (1348) 0 B file 80.66 KB net 0 B unknown
212 Files Read
213 ###############################################################################
214 ██████████████████████████████████████████████████ 8.00 MB anon_inode:[lttng_stream] (lttng-consumerd) 'fd 32 in lttng-consumerd (2619)'
215 █████ 834.41 KB base/16384/pg_internal.init 'fd 7 in postgres (4219)', 'fd 7 in postgres (4220)', 'fd 7 in postgres (4221)', 'fd 7 in postgres (4222)', 'fd 7 in postgres (4223)', 'fd 7 in postgres (4224)', 'fd 7 in postgres (4225)', 'fd 7 in postgres (4226)'
216 █ 256.09 KB socket:[8893] (postgres) 'fd 9 in postgres (1348)'
217 █ 174.69 KB pg_stat_tmp/pgstat.stat 'fd 9 in postgres (4218)', 'fd 9 in postgres (1167)'
218 109.48 KB global/pg_internal.init 'fd 7 in postgres (4218)', 'fd 7 in postgres (4219)', 'fd 7 in postgres (4220)', 'fd 7 in postgres (4221)', 'fd 7 in postgres (4222)', 'fd 7 in postgres (4223)', 'fd 7 in postgres (4224)', 'fd 7 in postgres (4225)', 'fd 7 in postgres (4226)'
219 104.30 KB base/11951/pg_internal.init 'fd 7 in postgres (4218)'
220 12.85 KB socket (lttng-sessiond) 'fd 30 in lttng-sessiond (384)'
221 4.50 KB global/pg_filenode.map 'fd 7 in postgres (4218)', 'fd 7 in postgres (4219)', 'fd 7 in postgres (4220)', 'fd 7 in postgres (4221)', 'fd 7 in postgres (4222)', 'fd 7 in postgres (4223)', 'fd 7 in postgres (4224)', 'fd 7 in postgres (4225)', 'fd 7 in postgres (4226)'
222 4.16 KB socket (postgres) 'fd 9 in postgres (4226)'
223 4.00 KB /proc/interrupts 'fd 3 in irqbalance (1104)'
224 Files Write
225 ###############################################################################
226 ██████████████████████████████████████████████████ 8.00 MB socket:[56371] (lttng-consumerd) 'fd 30 in lttng-consumerd (2619)'
227 █████████████████████████████████████████████████ 8.00 MB pipe:[53306] (lttng-consumerd) 'fd 12 in lttng-consumerd (2619)'
228 ██████████ 1.76 MB pg_xlog/00000001000000000000000B 'fd 31 in postgres (4219)'
229 █████ 887.82 KB socket:[56369] (lttng-consumerd) 'fd 26 in lttng-consumerd (2619)'
230 █████ 882.58 KB pipe:[53309] (lttng-consumerd) 'fd 18 in lttng-consumerd (2619)'
231 160.00 KB /var/lib/postgresql/9.1/main/base/16384/16602 'fd 14 in postgres (1165)'
232 158.69 KB pg_stat_tmp/pgstat.tmp 'fd 3 in postgres (1168)'
233 144.00 KB /var/lib/postgresql/9.1/main/base/16384/16613 'fd 12 in postgres (1165)'
234 88.00 KB /var/lib/postgresql/9.1/main/base/16384/16609 'fd 11 in postgres (1165)'
235 78.28 KB socket:[8893] (postgres) 'fd 9 in postgres (1348)'
236 Block I/O Read
237 ###############################################################################
238 Block I/O Write
239 ###############################################################################
240 ██████████████████████████████████████████████████ 1.76 MB postgres (pid=4219)
241 ████ 160.00 KB postgres (pid=1168)
242 ██ 100.00 KB kworker/u8:0 (pid=1540)
243 ██ 96.00 KB jbd2/vda1-8 (pid=257)
244 █ 40.00 KB postgres (pid=1166)
245 8.00 KB kworker/u9:0 (pid=4197)
246 4.00 KB kworker/u9:2 (pid=1381)
247 Disk nr_sector
248 ###############################################################################
249 ███████████████████████████████████████████████████████████████████ 4416.00 sectors vda1
250 Disk nr_requests
251 ###############################################################################
252 ████████████████████████████████████████████████████████████████████ 177.00 requests vda1
253 Disk request time/sector
254 ###############################################################################
255 ██████████████████████████████████████████████████████████████████ 0.01 ms vda1
256 Network recv_bytes
257 ###############################################################################
258 ███████████████████████████████████████████████████████ 739.50 KB eth0
259 █████ 80.27 KB lo
260 Network sent_bytes
261 ###############################################################################
262 ████████████████████████████████████████████████████████ 9.36 MB eth0
263 ```
264
265 ### IRQ
266 #### Handler duration and raise latency statistics
267 ```bash
268 $ ./lttng-irqstats mytrace/
269 Timerange: [2014-03-11 16:05:41.314824752, 2014-03-11 16:05:45.041994298]
270 Hard IRQ Duration (us)
271 count min avg max stdev
272 ----------------------------------------------------------------------------------|
273 1: <i8042> 30 10.901 45.500 64.510 18.447 |
274 42: <ahci> 259 3.203 7.863 21.426 3.183 |
275 43: <eth0> 2 3.859 3.976 4.093 0.165 |
276 44: <iwlwifi> 92 0.300 3.995 6.542 2.181 |
277
278 Soft IRQ Duration (us) Raise latency (us)
279 count min avg max stdev | count min avg max stdev
280 ----------------------------------------------------------------------------------|------------------------------------------------------------
281 1: <TIMER_SOFTIRQ> 495 0.202 21.058 51.060 11.047 | 53 2.141 11.217 20.005 7.233
282 3: <NET_RX_SOFTIRQ> 14 0.133 9.177 32.774 10.483 | 14 0.763 3.703 10.902 3.448
283 4: <BLOCK_SOFTIRQ> 257 5.981 29.064 125.862 15.891 | 257 0.891 3.104 15.054 2.046
284 6: <TASKLET_SOFTIRQ> 26 0.309 1.198 1.748 0.329 | 26 9.636 39.222 51.430 11.246
285 7: <SCHED_SOFTIRQ> 299 1.185 14.768 90.465 15.992 | 298 1.286 31.387 61.700 11.866
286 9: <RCU_SOFTIRQ> 338 0.592 3.387 13.745 1.356 | 147 2.480 29.299 64.453 14.286
287 ```
288
289 #### Handler duration frequency distribution
290 ```bash
291 $ ./lttng-irqfreq --timerange [16:05:42,16:05:45] --irq 44 --stats mytrace/
292 Timerange: [2014-03-11 16:05:42.042034570, 2014-03-11 16:05:44.998914297]
293 Hard IRQ Duration (us)
294 count min avg max stdev
295 ----------------------------------------------------------------------------------|
296 44: <iwlwifi> 72 0.300 4.018 6.542 2.164 |
297 Frequency distribution iwlwifi (44)
298 ###############################################################################
299 0.300 █████ 1.00
300 0.612 ██████████████████████████████████████████████████████████████ 12.00
301 0.924 ████████████████████ 4.00
302 1.236 ██████████ 2.00
303 1.548 0.00
304 1.861 █████ 1.00
305 2.173 0.00
306 2.485 █████ 1.00
307 2.797 ██████████████████████████ 5.00
308 3.109 █████ 1.00
309 3.421 ███████████████ 3.00
310 3.733 0.00
311 4.045 █████ 1.00
312 4.357 █████ 1.00
313 4.669 ██████████ 2.00
314 4.981 ██████████ 2.00
315 5.294 █████████████████████████████████████████ 8.00
316 5.606 ████████████████████████████████████████████████████████████████████ 13.00
317 5.918 ██████████████████████████████████████████████████████████████ 12.00
318 6.230 ███████████████ 3.00
319 ```
320
321 ### Others
322 There are a lot of other scripts, we encourage you to try them and read the
323 ```--help``` to see all the available options.
324
325 ## Work in progress
326 Track the page cache and extract the latencies associated with pages flush to disk.
327 In order to do that, we rely on the assumption that the pages are flushed in a FIFO
328 order. It might not be 100% accurate, but it already gives great results :
329
330 An example here when saving a file in vim:
331 ```
332 [19:57:51.173332284 - 19:57:51.177794657] vim (31517) syscall_entry_fsync(fd = 4 <blabla>) = 0, 4.462 ms
333 1 dirty page(s) were flushed (assuming FIFO):
334 vim (31517): 1 pages
335 - blabla : 1 pages
336 13 active dirty filesystem page(s) (known):
337 redis-server (2092): 2 pages
338 - /var/log/redis/redis-server.log : 2 pages
339 vim (31517): 2 pages
340 - .blabla.swp : 2 pages
341 lttng-consumerd (6750): 9 pages
342 - unknown (origin not found) : 9 pages
343 ```
344
345 An other example when running the 'sync' command:
346 ```
347 [19:57:53.046840755 - 19:57:53.072809609] sync (31554) syscall_entry_sync(fd = <unknown>) = 0, 25.969 ms
348 23 dirty page(s) were flushed (assuming FIFO):
349 redis-server (2092): 2 pages
350 - /var/log/redis/redis-server.log : 2 pages
351 vim (31517): 9 pages
352 - /home/julien/.viminfo.tmp : 6 pages
353 - .blabla.swp : 3 pages
354 lttng-consumerd (6750): 12 pages
355 - unknown (origin not found) : 12 pages
356 ```
357
358 PostgreSQL with 'sys_fdatasync':
359 ```
360 [13:49:39.908599447 - 13:49:39.915930730] postgres (1137) sys_fdatasync(fd = 7 </var/lib/postgresql/9.1/main/pg_xlog/000000010000000000000008>) = 0, 7.331 ms
361 2 pages allocated during the period
362 88 dirty page(s) were flushed (assuming FIFO):
363 postgres (1137): 88 pages
364 - /var/lib/postgresql/9.1/main/pg_xlog/000000010000000000000008 : 88 pages
365 68 last dirtied filesystem page(s):
366 postgres (2419): 68 pages
367 - base/11951/18410 : 46 pages
368 - base/11951/18407 : 10 pages
369 - base/11951/18407_fsm : 6 pages
370 - base/11951/18410_fsm : 6 pages
371 ```
372
373 Detecting a fight for the I/O between a huge write and postgresql:
374 ```
375 [13:49:47.242730583 - 13:49:47.442835037] python (2353) sys_write(fd = 3 </root/bla>, count = 102395904) = 102395904, 200.104 ms
376 34760 pages allocated during the period
377 woke up kswapd during the period
378 10046 pages written on disk
379 freed 33753 pages from the cache during the period
380 1397 last dirtied filesystem page(s):
381 python (2353): 1325 pages
382 - /root/bla : 1325 pages
383 postgres (2419): 72 pages
384 - base/11951/18419 : 72 pages
385 ```
386
387 ## Limitations
388
389 The main limitation of this project is the fact that it can be quite slow to
390 process a large trace. This project is a work in progress and we focus on the
391 problem-solving aspect. Therefore, features have been prioritized over
392 performance for now.
393
394 One other aspect is the fact that the state is not persistent; the trace has
395 to be re-processed if another analysis script is to be used on the same trace.
396 Some scripts belonging to the same category allow the combination of multiple
397 analyses into a single pass (see ```--freq```, ```--log```, ```--usage```,
398 ```--latencystats```, etc). We are planning to add a way to save the state
399 and/or create an interactive environment to allow the user to run multiple
400 analyses on the same trace without having to process the trace every time.
401
402
403 ## Conclusion
404 We hope you have fun trying this project and please remember it is a work in
405 progress; feedback, bug reports and improvement ideas are always welcome!
This page took 0.040042 seconds and 6 git commands to generate.