Merge pull request #9 from PSRCode/patch-1
[deliverable/lttng-analyses.git] / README.md
CommitLineData
28ccfe49
JD
1# lttng-analyses
2
3This repository contains various scripts to extract monitoring data and metrics
4from LTTng kernel traces.
5
e3de48f7
JD
6Compared to other diagnostics/monitoring solutions, this approach is designed
7to allow users to record the activity of their system with a low overhead, wait
8for a problem to appear and then diagnose offline to identify the performance
9problem they encountered.
28ccfe49 10
e3de48f7
JD
11With this solution, we target really hard to find problems and dig until we
12found the root cause.
28ccfe49 13
e3de48f7
JD
14This README describes the analyses implemented and the usage of the whole
15project.
28ccfe49
JD
16
17## Requirements
e3de48f7
JD
18* LTTng >= 2.5
19* Babeltrace >= 1.2 (with python bindings compiled)
20* Python >= 3.4
28ccfe49
JD
21
22## Install on Ubuntu (12.04 and 14.04 at least)
e3de48f7 23```bash
851bf235
JD
24apt-get install -y software-properties-common (or python-software-properties on 12.04)
25apt-add-repository -y ppa:lttng/ppa
26apt-get update
27apt-get -y install lttng-tools babeltrace lttng-modules-dkms python3-babeltrace python3-progressbar
e3de48f7
JD
28git clone https://github.com/lttng/lttng-analyses.git
29```
28ccfe49
JD
30
31## Trace creation
34f4c6f4
JD
32Here are the basic commands to create a trace, for more information on the
33LTTng setup, please refer to the [LTTng
34documentation](http://lttng.org/docs/#doc-tracing-the-linux-kernel)
35
e3de48f7
JD
36### Automatic
37From the cloned git tree:
38```bash
39./lttng-analyses-record
28ccfe49 40```
e3de48f7
JD
41
42### Manual
43```bash
851bf235
JD
44lttng create
45lttng enable-channel -k bla --subbuf-size=4M
0e634170 46lttng enable-event -k sched_switch,block_rq_complete,block_rq_issue,block_bio_remap,block_bio_backmerge,netif_receive_skb,net_dev_xmit,sched_process_fork,sched_process_exec,lttng_statedump_process_state,lttng_statedump_file_descriptor,lttng_statedump_block_device,writeback_pages_written,mm_vmscan_wakeup_kswapd,mm_page_free,mm_page_alloc,block_dirty_buffer,irq_handler_entry,irq_handler_exit,softirq_entry,softirq_exit,softirq_raise -c bla
851bf235
JD
47lttng enable-event -k --syscall -a -c bla
48lttng start
49..do stuff...
50lttng stop
51lttng destroy
28ccfe49
JD
52```
53
e3de48f7 54### Remote trace creation
28ccfe49
JD
55You can also create a trace on a server and send it to a remote host. The
56remote host only need to run `lttng-relayd -d` and be reachable by network.
e3de48f7
JD
57The only difference with the above commands is the trace session creation:
58```bash
851bf235 59lttng create -U net://<remote-host>
3d184c69 60```
28ccfe49 61
e3de48f7
JD
62## Implemented analyses
63
64* CPU usage for the whole system
65* CPU usage per-process
66* Process CPU migration count
67* Memory usage per-process (as seen by the kernel)
68* Memory usage system-wide (as seen by the kernel)
69* I/O usage (syscalls, disk, network)
70* I/O operations log (with latency and usage)
71* I/O latency statistics (open, read, write, sync operations)
72* I/O latency frequency distribution
73* Interrupt handler duration statistics (count, min, max, average stdev)
74* Interrupt handler duration top
75* Interrupt handler duration log
76* Interrupt handler duration frequency distribution
77* SoftIRQ handler latency statistics
78* Syscalls usage statistics
79
80All of the analyses share the same code architecture, so it is possible
81to filter by timerange, process name, PID, min and max values with the
82same options. Also, all the timestamp reported can optionally be output
83in the GMT timezone to allow easy sharing between teams.
84
85The code architecture allows to easily add new analyses and an external
86tool can import the analysis backend and output the results in its own
87way (instead of text).
88
fe109453 89## Examples
e3de48f7
JD
90Once you have collected your trace, you can run any script from the repository
91directly, following are some examples.
92
93### I/O
94#### I/O latency stats
95```bash
fe109453 96$ ./lttng-iolatencystats mytrace/
e3de48f7
JD
97Timerange: [2015-01-06 10:58:26.140545481, 2015-01-06 10:58:27.229358936]
98Syscalls latency statistics (usec):
99Type Count Min Average Max Stdev
100-----------------------------------------------------------------------------------------
101Open 45 5.562 13.835 77.683 15.263
102Read 109 0.316 5.774 62.569 9.277
103Write 101 0.256 7.060 48.531 8.555
104Sync 207 19.384 40.664 160.188 21.201
105
106Disk latency statistics (usec):
107Name Count Min Average Max Stdev
108-----------------------------------------------------------------------------------------
109dm-0 108 0.001 0.004 0.007 1.306
110```
111
112#### I/O latency frequency distribution
113```bash
fe109453 114$ ./lttng-iolatencyfreq mytrace/
e3de48f7
JD
115Timerange: [2015-01-06 10:58:26.140545481, 2015-01-06 10:58:27.229358936]
116Open latency distribution (usec)
28ccfe49 117###############################################################################
e3de48f7
JD
118 5.562 ███████████████████████████████████████████████████████████████████ 25
119 9.168 ██████████ 4
12012.774 █████████████████████ 8
12116.380 ████████ 3
12219.986 █████ 2
12323.592 0
12427.198 0
12530.804 0
12634.410 ██ 1
12738.016 0
12841.623 0
12945.229 0
13048.835 0
13152.441 0
13256.047 0
13359.653 0
13463.259 0
13566.865 0
13670.471 0
13774.077 █████ 2
138```
139
140#### I/O operations log
141```bash
fe109453 142$ ./lttng-iolog mytrace/
e3de48f7
JD
143[10:58:26.221618530,10:58:26.221620659] write 2.129 8.00 B /usr/bin/x-term 11793 anon_inode:[eventfd] (fd=5)
144[10:58:26.221623609,10:58:26.221628055] read 4.446 50.00 B /usr/bin/x-term 11793 /dev/ptmx (fd=24)
145[10:58:26.221638929,10:58:26.221640008] write 1.079 8.00 B /usr/bin/x-term 11793 anon_inode:[eventfd] (fd=5)
146[10:58:26.221676232,10:58:26.221677385] read 1.153 8.00 B /usr/bin/x-term 11793 anon_inode:[eventfd] (fd=5)
147[10:58:26.223401804,10:58:26.223411683] open 9.879 N/A sleep 12420 /etc/ld.so.cache (fd=3)
148[10:58:26.223448060,10:58:26.223455577] open 7.517 N/A sleep 12420 /lib/x86_64-linux-gnu/libc.so.6 (fd=3)
149[10:58:26.223456522,10:58:26.223458898] read 2.376 832.00 B sleep 12420 /lib/x86_64-linux-gnu/libc.so.6 (fd=3)
150[10:58:26.223918068,10:58:26.223929316] open 11.248 N/A sleep 12420 (fd=3)
151[10:58:26.231881565,10:58:26.231895970] writev 14.405 16.00 B /usr/bin/x-term 11793 socket:[45650] (fd=4)
152[10:58:26.231979636,10:58:26.231988446] recvmsg 8.810 16.00 B Xorg 1827 socket:[47480] (fd=38)
153```
154
155
156#### I/O usage top
157```bash
fe109453 158$ ./lttng-iousagetop traces/pgread-writes
e3de48f7
JD
159Timerange: [2014-10-07 16:36:00.733214969, 2014-10-07 16:36:18.804584183]
160Per-process I/O Read
161###############################################################################
162██████████████████████████████████████████████████ 16.00 MB lttng-consumerd (2619) 0 B file 4.00 B net 16.00 MB unknown
163█████ 1.72 MB lttng-consumerd (2619) 0 B file 0 B net 1.72 MB unknown
164█ 398.13 KB postgres (4219) 121.05 KB file 277.07 KB net 8.00 B unknown
165 256.09 KB postgres (1348) 0 B file 255.97 KB net 117.00 B unknown
166 204.81 KB postgres (4218) 204.81 KB file 0 B net 0 B unknown
167 123.77 KB postgres (4220) 117.50 KB file 6.26 KB net 8.00 B unknown
168Per-process I/O Write
169###############################################################################
170██████████████████████████████████████████████████ 16.00 MB lttng-consumerd (2619) 0 B file 8.00 MB net 8.00 MB unknown
171██████ 2.20 MB postgres (4219) 2.00 MB file 202.23 KB net 0 B unknown
172█████ 1.73 MB lttng-consumerd (2619) 0 B file 887.73 KB net 882.58 KB unknown
173██ 726.33 KB postgres (1165) 8.00 KB file 6.33 KB net 712.00 KB unknown
174 158.69 KB postgres (1168) 158.69 KB file 0 B net 0 B unknown
175 80.66 KB postgres (1348) 0 B file 80.66 KB net 0 B unknown
28ccfe49 176Files Read
e3de48f7
JD
177###############################################################################
178██████████████████████████████████████████████████ 8.00 MB anon_inode:[lttng_stream] (lttng-consumerd) 'fd 32 in lttng-consumerd (2619)'
179█████ 834.41 KB base/16384/pg_internal.init 'fd 7 in postgres (4219)', 'fd 7 in postgres (4220)', 'fd 7 in postgres (4221)', 'fd 7 in postgres (4222)', 'fd 7 in postgres (4223)', 'fd 7 in postgres (4224)', 'fd 7 in postgres (4225)', 'fd 7 in postgres (4226)'
180█ 256.09 KB socket:[8893] (postgres) 'fd 9 in postgres (1348)'
181█ 174.69 KB pg_stat_tmp/pgstat.stat 'fd 9 in postgres (4218)', 'fd 9 in postgres (1167)'
182 109.48 KB global/pg_internal.init 'fd 7 in postgres (4218)', 'fd 7 in postgres (4219)', 'fd 7 in postgres (4220)', 'fd 7 in postgres (4221)', 'fd 7 in postgres (4222)', 'fd 7 in postgres (4223)', 'fd 7 in postgres (4224)', 'fd 7 in postgres (4225)', 'fd 7 in postgres (4226)'
183 104.30 KB base/11951/pg_internal.init 'fd 7 in postgres (4218)'
184 12.85 KB socket (lttng-sessiond) 'fd 30 in lttng-sessiond (384)'
185 4.50 KB global/pg_filenode.map 'fd 7 in postgres (4218)', 'fd 7 in postgres (4219)', 'fd 7 in postgres (4220)', 'fd 7 in postgres (4221)', 'fd 7 in postgres (4222)', 'fd 7 in postgres (4223)', 'fd 7 in postgres (4224)', 'fd 7 in postgres (4225)', 'fd 7 in postgres (4226)'
186 4.16 KB socket (postgres) 'fd 9 in postgres (4226)'
187 4.00 KB /proc/interrupts 'fd 3 in irqbalance (1104)'
19952e0f 188Files Write
e3de48f7
JD
189###############################################################################
190██████████████████████████████████████████████████ 8.00 MB socket:[56371] (lttng-consumerd) 'fd 30 in lttng-consumerd (2619)'
191█████████████████████████████████████████████████ 8.00 MB pipe:[53306] (lttng-consumerd) 'fd 12 in lttng-consumerd (2619)'
192██████████ 1.76 MB pg_xlog/00000001000000000000000B 'fd 31 in postgres (4219)'
193█████ 887.82 KB socket:[56369] (lttng-consumerd) 'fd 26 in lttng-consumerd (2619)'
194█████ 882.58 KB pipe:[53309] (lttng-consumerd) 'fd 18 in lttng-consumerd (2619)'
195 160.00 KB /var/lib/postgresql/9.1/main/base/16384/16602 'fd 14 in postgres (1165)'
196 158.69 KB pg_stat_tmp/pgstat.tmp 'fd 3 in postgres (1168)'
197 144.00 KB /var/lib/postgresql/9.1/main/base/16384/16613 'fd 12 in postgres (1165)'
198 88.00 KB /var/lib/postgresql/9.1/main/base/16384/16609 'fd 11 in postgres (1165)'
199 78.28 KB socket:[8893] (postgres) 'fd 9 in postgres (1348)'
200Block I/O Read
201###############################################################################
202Block I/O Write
203###############################################################################
204██████████████████████████████████████████████████ 1.76 MB postgres (pid=4219)
205████ 160.00 KB postgres (pid=1168)
206██ 100.00 KB kworker/u8:0 (pid=1540)
207██ 96.00 KB jbd2/vda1-8 (pid=257)
208█ 40.00 KB postgres (pid=1166)
209 8.00 KB kworker/u9:0 (pid=4197)
210 4.00 KB kworker/u9:2 (pid=1381)
28ccfe49
JD
211Disk nr_sector
212###############################################################################
e3de48f7 213███████████████████████████████████████████████████████████████████ 4416.00 sectors vda1
28ccfe49
JD
214Disk nr_requests
215###############################################################################
e3de48f7 216████████████████████████████████████████████████████████████████████ 177.00 requests vda1
28ccfe49
JD
217Disk request time/sector
218###############################################################################
e3de48f7 219██████████████████████████████████████████████████████████████████ 0.01 ms vda1
28ccfe49
JD
220Network recv_bytes
221###############################################################################
e3de48f7
JD
222███████████████████████████████████████████████████████ 739.50 KB eth0
223█████ 80.27 KB lo
28ccfe49
JD
224Network sent_bytes
225###############################################################################
e3de48f7 226████████████████████████████████████████████████████████ 9.36 MB eth0
28ccfe49 227```
19952e0f 228
e3de48f7
JD
229### IRQ
230#### Handler duration and raise latency statistics
231```bash
fe109453 232$ ./lttng-irqstats mytrace/
e3de48f7
JD
233Timerange: [2014-03-11 16:05:41.314824752, 2014-03-11 16:05:45.041994298]
234Hard IRQ Duration (us)
235 count min avg max stdev
236----------------------------------------------------------------------------------|
2371: <i8042> 30 10.901 45.500 64.510 18.447 |
23842: <ahci> 259 3.203 7.863 21.426 3.183 |
23943: <eth0> 2 3.859 3.976 4.093 0.165 |
24044: <iwlwifi> 92 0.300 3.995 6.542 2.181 |
241
242Soft IRQ Duration (us) Raise latency (us)
243 count min avg max stdev | count min avg max stdev
244----------------------------------------------------------------------------------|------------------------------------------------------------
2451: <TIMER_SOFTIRQ> 495 0.202 21.058 51.060 11.047 | 53 2.141 11.217 20.005 7.233
2463: <NET_RX_SOFTIRQ> 14 0.133 9.177 32.774 10.483 | 14 0.763 3.703 10.902 3.448
2474: <BLOCK_SOFTIRQ> 257 5.981 29.064 125.862 15.891 | 257 0.891 3.104 15.054 2.046
2486: <TASKLET_SOFTIRQ> 26 0.309 1.198 1.748 0.329 | 26 9.636 39.222 51.430 11.246
2497: <SCHED_SOFTIRQ> 299 1.185 14.768 90.465 15.992 | 298 1.286 31.387 61.700 11.866
2509: <RCU_SOFTIRQ> 338 0.592 3.387 13.745 1.356 | 147 2.480 29.299 64.453 14.286
251```
252
253#### Handler duration frequency distribution
254```bash
fe109453 255$ ./lttng-irqfreq --timerange [16:05:42,16:05:45] --irq 44 --stats mytrace/
e3de48f7
JD
256Timerange: [2014-03-11 16:05:42.042034570, 2014-03-11 16:05:44.998914297]
257Hard IRQ Duration (us)
258 count min avg max stdev
259----------------------------------------------------------------------------------|
26044: <iwlwifi> 72 0.300 4.018 6.542 2.164 |
261Frequency distribution iwlwifi (44)
262###############################################################################
2630.300 █████ 1.00
2640.612 ██████████████████████████████████████████████████████████████ 12.00
2650.924 ████████████████████ 4.00
2661.236 ██████████ 2.00
2671.548 0.00
2681.861 █████ 1.00
2692.173 0.00
2702.485 █████ 1.00
2712.797 ██████████████████████████ 5.00
2723.109 █████ 1.00
2733.421 ███████████████ 3.00
2743.733 0.00
2754.045 █████ 1.00
2764.357 █████ 1.00
2774.669 ██████████ 2.00
2784.981 ██████████ 2.00
2795.294 █████████████████████████████████████████ 8.00
2805.606 ████████████████████████████████████████████████████████████████████ 13.00
2815.918 ██████████████████████████████████████████████████████████████ 12.00
2826.230 ███████████████ 3.00
283```
284
285### Others
286There are a lot of other scripts, we encourage you to try them and read the
287```--help``` to see all the available options.
288
173fa485 289## Work in progress
19952e0f
JD
290Track the page cache and extract the latencies associated with pages flush to disk.
291In order to do that, we rely on the assumption that the pages are flushed in a FIFO
292order. It might not be 100% accurate, but it already gives great results :
293
294An example here when saving a file in vim:
173fa485 295```
19952e0f
JD
296[19:57:51.173332284 - 19:57:51.177794657] vim (31517) syscall_entry_fsync(fd = 4 <blabla>) = 0, 4.462 ms
297 1 dirty page(s) were flushed (assuming FIFO):
298 vim (31517): 1 pages
299 - blabla : 1 pages
300 13 active dirty filesystem page(s) (known):
301 redis-server (2092): 2 pages
302 - /var/log/redis/redis-server.log : 2 pages
303 vim (31517): 2 pages
304 - .blabla.swp : 2 pages
305 lttng-consumerd (6750): 9 pages
306 - unknown (origin not found) : 9 pages
307```
308
309An other example when running the 'sync' command:
173fa485 310```
19952e0f
JD
311[19:57:53.046840755 - 19:57:53.072809609] sync (31554) syscall_entry_sync(fd = <unknown>) = 0, 25.969 ms
312 23 dirty page(s) were flushed (assuming FIFO):
313 redis-server (2092): 2 pages
314 - /var/log/redis/redis-server.log : 2 pages
315 vim (31517): 9 pages
316 - /home/julien/.viminfo.tmp : 6 pages
317 - .blabla.swp : 3 pages
318 lttng-consumerd (6750): 12 pages
319 - unknown (origin not found) : 12 pages
320```
321
2ef22645 322PostgreSQL with 'sys_fdatasync':
19952e0f
JD
323```
324[13:49:39.908599447 - 13:49:39.915930730] postgres (1137) sys_fdatasync(fd = 7 </var/lib/postgresql/9.1/main/pg_xlog/000000010000000000000008>) = 0, 7.331 ms
325 2 pages allocated during the period
326 88 dirty page(s) were flushed (assuming FIFO):
327 postgres (1137): 88 pages
328 - /var/lib/postgresql/9.1/main/pg_xlog/000000010000000000000008 : 88 pages
329 68 last dirtied filesystem page(s):
330 postgres (2419): 68 pages
331 - base/11951/18410 : 46 pages
332 - base/11951/18407 : 10 pages
333 - base/11951/18407_fsm : 6 pages
334 - base/11951/18410_fsm : 6 pages
335```
336
2ef22645 337Detecting a fight for the I/O between a huge write and PostgreSQL:
19952e0f
JD
338```
339[13:49:47.242730583 - 13:49:47.442835037] python (2353) sys_write(fd = 3 </root/bla>, count = 102395904) = 102395904, 200.104 ms
340 34760 pages allocated during the period
341 woke up kswapd during the period
342 10046 pages written on disk
343 freed 33753 pages from the cache during the period
344 1397 last dirtied filesystem page(s):
345 python (2353): 1325 pages
346 - /root/bla : 1325 pages
347 postgres (2419): 72 pages
348 - base/11951/18419 : 72 pages
349```
350
e3de48f7
JD
351## Limitations
352
353The main limitation of this project is the fact that it can be quite slow to
354process a large trace. This project is a work in progress and we focus on the
355problem-solving approach. So for now we didn't spend much time on the
356performance of the analyses but rather the features.
357
358One other aspect is the fact that the state is not persistent, so when the
359script exits, it has to re-process the trace if we start another script on the
360same trace. Some scripts of the same category allow to combine multiple
361analyses into one pass (see ```--freq```, ```--log```, ```--usage```,
362```--latencystats```, etc). We are planning to add a way to save the state
363and/or create an interactive environment to allow the user to run multiple
364analyses on the same trace without having to process the trace every time.
365
366
19952e0f 367## Conclusion
e3de48f7 368We hope you have fun trying it and please remember it is a work in progress,
2ef22645 369feedback, bug reports and improvement ideas are always welcome !
This page took 0.053912 seconds and 5 git commands to generate.