# lttng-analyses

This repository contains various scripts to extract monitoring data and metrics
from LTTng kernel traces.
As opposed to other diagnostic or monitoring solutions, this approach is
designed to let users record their system's activity with low overhead, wait
for a problem to occur, and then diagnose its cause offline.

This solution allows the user to target hard-to-find problems and dig until the
root cause is found.

This README describes the implemented analyses as well as how to use them.
## Requirements

* Babeltrace >= 1.2 (with Python bindings built)
## Install on Ubuntu (12.04 and 14.04 at least)

```bash
apt-get install -y software-properties-common  # python-software-properties on 12.04
apt-add-repository -y ppa:lttng/ppa
apt-get update
apt-get -y install lttng-tools babeltrace lttng-modules-dkms python3-babeltrace python3-progressbar
```

```bash
git clone https://github.com/lttng/lttng-analyses.git
```
## Trace creation

Here are the basic commands to create a trace. For more information on the
LTTng setup, please refer to the [LTTng
documentation](http://lttng.org/docs/#doc-tracing-the-linux-kernel).

### Automatic

From the cloned git tree:

```bash
./lttng-analyses-record
```
### Manual

```bash
lttng create
lttng enable-channel -k bla --subbuf-size=4M
lttng enable-event -k sched_switch,block_rq_complete,block_rq_issue,block_bio_remap,block_bio_backmerge,netif_receive_skb,net_dev_xmit,sched_process_fork,sched_process_exec,lttng_statedump_process_state,lttng_statedump_file_descriptor,lttng_statedump_block_device,writeback_pages_written,mm_vmscan_wakeup_kswapd,mm_page_free,mm_page_alloc,block_dirty_buffer,irq_handler_entry,irq_handler_exit,softirq_entry,softirq_exit,softirq_raise -c bla
lttng enable-event -k --syscall -a -c bla
lttng start
# ... run the workload to trace, then:
lttng stop
lttng destroy
```
### Remote trace creation

You can also create a trace on a server and send it to a remote host. The
remote host only needs to run `lttng-relayd -d` and be reachable over the
network. The only difference with the above commands is the tracing session's
creation:

```bash
lttng create -U net://<remote-host>
```
## Implemented analyses

* CPU usage for the whole system
* CPU usage per-process
* Process CPU migration count
* Memory usage per-process (as seen by the kernel)
* Memory usage system-wide (as seen by the kernel)
* I/O usage (syscalls, disk, network)
* I/O operations log (with latency and usage)
* I/O latency statistics (open, read, write, sync operations)
* I/O latency frequency distribution
* Interrupt handler duration statistics (count, min, max, average, stdev)
* Interrupt handler duration top
* Interrupt handler duration log
* Interrupt handler duration frequency distribution
* SoftIRQ handler latency statistics
* Syscall usage statistics
All of the analyses share the same code architecture, making it possible to
filter by time range, process name, PID, and min/max values using the same
command-line options. Also note that reported timestamps can optionally be
expressed in the GMT timezone to allow easy sharing between users in different
timezones.

The project's architecture makes it easy to add new analyses or to reuse the
analysis backend in external tools, which may then present the results in their
own format (as opposed to text).
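As a rough illustration of that separation (hypothetical class and event
format, not the project's real API), an external tool could feed events to an
analysis object and render the accumulated results however it likes:

```python
# Illustrative sketch only: the class and event dicts below are
# hypothetical, not lttng-analyses' actual API. The point is that the
# analysis accumulates state from events while the caller decides how
# to present the results (text, JSON, GUI, ...).
from collections import Counter

class CpuMigrationCount:
    """Count migration-style events per process name."""
    def __init__(self):
        self.migrations = Counter()

    def process_event(self, event):
        if event["name"] == "sched_migrate_task":
            self.migrations[event["comm"]] += 1

analysis = CpuMigrationCount()
for ev in [{"name": "sched_migrate_task", "comm": "postgres"},
           {"name": "sched_switch", "comm": "postgres"},
           {"name": "sched_migrate_task", "comm": "postgres"}]:
    analysis.process_event(ev)

# The caller renders the results in its own format
print(dict(analysis.migrations))  # → {'postgres': 2}
```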
## Usage

After having collected your trace, any script contained in this repository can
be used to run an analysis. Read on for some examples!
#### I/O latency stats

```
$ ./lttng-iolatencystats mytrace/
Timerange: [2015-01-06 10:58:26.140545481, 2015-01-06 10:58:27.229358936]
Syscalls latency statistics (usec):
Type                    Count            Min        Average            Max          Stdev
-----------------------------------------------------------------------------------------
Open                       45          5.562         13.835         77.683         15.263
Read                      109          0.316          5.774         62.569          9.277
Write                     101          0.256          7.060         48.531          8.555
Sync                      207         19.384         40.664        160.188         21.201

Disk latency statistics (usec):
Name                    Count            Min        Average            Max          Stdev
-----------------------------------------------------------------------------------------
dm-0                      108          0.001          0.004          0.007          1.306
```
#### I/O latency frequency distribution

```
$ ./lttng-iolatencyfreq mytrace/
Timerange: [2015-01-06 10:58:26.140545481, 2015-01-06 10:58:27.229358936]
Open latency distribution (usec)
###############################################################################
 5.562 ███████████████████████████████████████████████████████████████████  25
12.774 █████████████████████                                                  8
```
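The bar output above can be approximated in a few lines of Python. This is a
hedged sketch of the idea (bucket data, bar character, and function name are
assumptions), not the project's actual rendering code:

```python
# Illustrative sketch: render a text histogram like the frequency
# distributions above. Bars are scaled so the largest bucket fills the
# available width.
def render_histogram(buckets, width=69):
    """buckets: list of (lower_bound_usec, count) pairs."""
    peak = max(count for _, count in buckets) or 1
    lines = []
    for lower, count in buckets:
        bar = "█" * int(count * width / peak)
        lines.append(f"{lower:7.3f} {bar} {count}")
    return "\n".join(lines)

print(render_histogram([(5.562, 25), (6.274, 12), (12.774, 8)]))
```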
#### I/O latency top

```
$ ./lttng-iolatencytop analysis-20150115-120942/ --limit 3 --minsize 2
Checking the trace for lost events...
Timerange: [2015-01-15 12:18:37.216484041, 2015-01-15 12:18:53.821580313]
Top open syscall latencies (usec)
Begin               End                  Name            Duration (usec)      Size  Proc             PID    Filename
[12:18:50.432950815,12:18:50.870648568]  open                 437697.753       N/A  apache2          31517  /var/lib/php5/sess_0ifir2hangm8ggaljdphl9o5b5 (fd=13)
[12:18:52.946080165,12:18:52.946132278]  open                     52.113       N/A  apache2          31588  /var/lib/php5/sess_mr9045p1k55vin1h0vg7rhgd63 (fd=13)
[12:18:46.800846035,12:18:46.800874916]  open                     28.881       N/A  apache2          31591  /var/lib/php5/sess_r7c12pccfvjtas15g3j69u14h0 (fd=13)
[12:18:51.389797604,12:18:51.389824426]  open                     26.822       N/A  apache2          31520  /var/lib/php5/sess_4sdb1rtjkhb78sabnoj8gpbl00 (fd=13)

Top read syscall latencies (usec)
Begin               End                  Name            Duration (usec)      Size  Proc             PID    Filename
[12:18:37.256073107,12:18:37.256555967]  read                    482.860    7.00 B  bash             10237  unknown (origin not found) (fd=3)
[12:18:52.000209798,12:18:52.000252304]  read                     42.506   1.00 KB  irqbalance       1337   /proc/interrupts (fd=3)
[12:18:37.256559439,12:18:37.256601615]  read                     42.176    5.00 B  bash             10237  unknown (origin not found) (fd=3)
[12:18:42.000281918,12:18:42.000320016]  read                     38.098   1.00 KB  irqbalance       1337   /proc/interrupts (fd=3)

Top write syscall latencies (usec)
Begin               End                  Name            Duration (usec)      Size  Proc             PID    Filename
[12:18:49.913241516,12:18:49.915908862]  write                  2667.346   95.00 B  apache2          31584  /var/log/apache2/access.log (fd=8)
[12:18:37.472823631,12:18:37.472859836]  writev                   36.205  21.97 KB  apache2          31544  unknown (origin not found) (fd=12)
[12:18:37.991578372,12:18:37.991612724]  writev                   34.352  21.97 KB  apache2          31589  unknown (origin not found) (fd=12)
[12:18:39.547778549,12:18:39.547812515]  writev                   33.966  21.97 KB  apache2          31584  unknown (origin not found) (fd=12)

Top sync syscall latencies (usec)
Begin               End                  Name            Duration (usec)      Size  Proc             PID    Filename
[12:18:50.162776739,12:18:51.157522361]  sync                 994745.622       N/A  sync             22791  None (fd=None)
[12:18:37.227867532,12:18:37.232289687]  sync_file_range        4422.155       N/A  lttng-consumerd  19964  /home/julien/lttng-traces/analysis-20150115-120942/kernel/metadata (fd=32)
[12:18:37.238076585,12:18:37.239012027]  sync_file_range         935.442       N/A  lttng-consumerd  19964  /home/julien/lttng-traces/analysis-20150115-120942/kernel/metadata (fd=32)
[12:18:37.220974711,12:18:37.221647124]  sync_file_range         672.413       N/A  lttng-consumerd  19964  /home/julien/lttng-traces/analysis-20150115-120942/kernel/metadata (fd=32)
```
#### I/O operations log

```
$ ./lttng-iolog mytrace/
[10:58:26.221618530,10:58:26.221620659]  write      2.129    8.00 B  /usr/bin/x-term  11793  anon_inode:[eventfd] (fd=5)
[10:58:26.221623609,10:58:26.221628055]  read       4.446   50.00 B  /usr/bin/x-term  11793  /dev/ptmx (fd=24)
[10:58:26.221638929,10:58:26.221640008]  write      1.079    8.00 B  /usr/bin/x-term  11793  anon_inode:[eventfd] (fd=5)
[10:58:26.221676232,10:58:26.221677385]  read       1.153    8.00 B  /usr/bin/x-term  11793  anon_inode:[eventfd] (fd=5)
[10:58:26.223401804,10:58:26.223411683]  open       9.879       N/A  sleep            12420  /etc/ld.so.cache (fd=3)
[10:58:26.223448060,10:58:26.223455577]  open       7.517       N/A  sleep            12420  /lib/x86_64-linux-gnu/libc.so.6 (fd=3)
[10:58:26.223456522,10:58:26.223458898]  read       2.376  832.00 B  sleep            12420  /lib/x86_64-linux-gnu/libc.so.6 (fd=3)
[10:58:26.223918068,10:58:26.223929316]  open      11.248       N/A  sleep            12420  (fd=3)
[10:58:26.231881565,10:58:26.231895970]  writev    14.405   16.00 B  /usr/bin/x-term  11793  socket:[45650] (fd=4)
[10:58:26.231979636,10:58:26.231988446]  recvmsg    8.810   16.00 B  Xorg             1827   socket:[47480] (fd=38)
```
#### I/O usage top

```
$ ./lttng-iousagetop traces/pgread-writes
Timerange: [2014-10-07 16:36:00.733214969, 2014-10-07 16:36:18.804584183]
Per-process I/O Read
###############################################################################
██████████████████████████████████████████████████ 16.00 MB lttng-consumerd (2619) 0 B file 4.00 B net 16.00 MB unknown
█████ 1.72 MB lttng-consumerd (2619) 0 B file 0 B net 1.72 MB unknown
█ 398.13 KB postgres (4219) 121.05 KB file 277.07 KB net 8.00 B unknown
  256.09 KB postgres (1348) 0 B file 255.97 KB net 117.00 B unknown
  204.81 KB postgres (4218) 204.81 KB file 0 B net 0 B unknown
  123.77 KB postgres (4220) 117.50 KB file 6.26 KB net 8.00 B unknown
Per-process I/O Write
###############################################################################
██████████████████████████████████████████████████ 16.00 MB lttng-consumerd (2619) 0 B file 8.00 MB net 8.00 MB unknown
██████ 2.20 MB postgres (4219) 2.00 MB file 202.23 KB net 0 B unknown
█████ 1.73 MB lttng-consumerd (2619) 0 B file 887.73 KB net 882.58 KB unknown
██ 726.33 KB postgres (1165) 8.00 KB file 6.33 KB net 712.00 KB unknown
  158.69 KB postgres (1168) 158.69 KB file 0 B net 0 B unknown
  80.66 KB postgres (1348) 0 B file 80.66 KB net 0 B unknown
Files Read
###############################################################################
██████████████████████████████████████████████████ 8.00 MB anon_inode:[lttng_stream] (lttng-consumerd) 'fd 32 in lttng-consumerd (2619)'
█████ 834.41 KB base/16384/pg_internal.init 'fd 7 in postgres (4219)', 'fd 7 in postgres (4220)', 'fd 7 in postgres (4221)', 'fd 7 in postgres (4222)', 'fd 7 in postgres (4223)', 'fd 7 in postgres (4224)', 'fd 7 in postgres (4225)', 'fd 7 in postgres (4226)'
█ 256.09 KB socket:[8893] (postgres) 'fd 9 in postgres (1348)'
█ 174.69 KB pg_stat_tmp/pgstat.stat 'fd 9 in postgres (4218)', 'fd 9 in postgres (1167)'
  109.48 KB global/pg_internal.init 'fd 7 in postgres (4218)', 'fd 7 in postgres (4219)', 'fd 7 in postgres (4220)', 'fd 7 in postgres (4221)', 'fd 7 in postgres (4222)', 'fd 7 in postgres (4223)', 'fd 7 in postgres (4224)', 'fd 7 in postgres (4225)', 'fd 7 in postgres (4226)'
  104.30 KB base/11951/pg_internal.init 'fd 7 in postgres (4218)'
  12.85 KB socket (lttng-sessiond) 'fd 30 in lttng-sessiond (384)'
  4.50 KB global/pg_filenode.map 'fd 7 in postgres (4218)', 'fd 7 in postgres (4219)', 'fd 7 in postgres (4220)', 'fd 7 in postgres (4221)', 'fd 7 in postgres (4222)', 'fd 7 in postgres (4223)', 'fd 7 in postgres (4224)', 'fd 7 in postgres (4225)', 'fd 7 in postgres (4226)'
  4.16 KB socket (postgres) 'fd 9 in postgres (4226)'
  4.00 KB /proc/interrupts 'fd 3 in irqbalance (1104)'
Files Write
###############################################################################
██████████████████████████████████████████████████ 8.00 MB socket:[56371] (lttng-consumerd) 'fd 30 in lttng-consumerd (2619)'
█████████████████████████████████████████████████ 8.00 MB pipe:[53306] (lttng-consumerd) 'fd 12 in lttng-consumerd (2619)'
██████████ 1.76 MB pg_xlog/00000001000000000000000B 'fd 31 in postgres (4219)'
█████ 887.82 KB socket:[56369] (lttng-consumerd) 'fd 26 in lttng-consumerd (2619)'
█████ 882.58 KB pipe:[53309] (lttng-consumerd) 'fd 18 in lttng-consumerd (2619)'
  160.00 KB /var/lib/postgresql/9.1/main/base/16384/16602 'fd 14 in postgres (1165)'
  158.69 KB pg_stat_tmp/pgstat.tmp 'fd 3 in postgres (1168)'
  144.00 KB /var/lib/postgresql/9.1/main/base/16384/16613 'fd 12 in postgres (1165)'
  88.00 KB /var/lib/postgresql/9.1/main/base/16384/16609 'fd 11 in postgres (1165)'
  78.28 KB socket:[8893] (postgres) 'fd 9 in postgres (1348)'
Block I/O Read
###############################################################################
Block I/O Write
###############################################################################
██████████████████████████████████████████████████ 1.76 MB postgres (pid=4219)
████ 160.00 KB postgres (pid=1168)
██ 100.00 KB kworker/u8:0 (pid=1540)
██ 96.00 KB jbd2/vda1-8 (pid=257)
█ 40.00 KB postgres (pid=1166)
  8.00 KB kworker/u9:0 (pid=4197)
  4.00 KB kworker/u9:2 (pid=1381)
Disk nr_sector
###############################################################################
███████████████████████████████████████████████████████████████████ 4416.00 sectors vda1
Disk nr_requests
###############################################################################
████████████████████████████████████████████████████████████████████ 177.00 requests vda1
Disk request time/sector
###############################################################################
██████████████████████████████████████████████████████████████████ 0.01 ms vda1
Network recv_bytes
###############################################################################
███████████████████████████████████████████████████████ 739.50 KB eth0
Network sent_bytes
###############################################################################
████████████████████████████████████████████████████████ 9.36 MB eth0
```
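The sizes in these reports (B, KB, MB) can be produced with a small formatter
along the following lines; an illustrative sketch, not the tool's actual code,
and it assumes binary units (1 KB = 1024 B):

```python
# Illustrative sketch: format a byte count the way the reports above
# display sizes. Binary units are an assumption here.
def format_size(nbytes):
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(nbytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            if unit == "B":
                return f"{int(size)} {unit}"
            return f"{size:.2f} {unit}"
        size /= 1024

print(format_size(16 * 1024 * 1024))  # → 16.00 MB
print(format_size(407))               # → 407 B
```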
#### Handler duration and raise latency statistics

```
$ ./lttng-irqstats mytrace/
Timerange: [2014-03-11 16:05:41.314824752, 2014-03-11 16:05:45.041994298]
Hard IRQ                                             Duration (us)
                       count          min          avg          max        stdev
----------------------------------------------------------------------------------|
1:  <i8042>               30       10.901       45.500       64.510       18.447  |
42: <ahci>               259        3.203        7.863       21.426        3.183  |
43: <eth0>                 2        3.859        3.976        4.093        0.165  |
44: <iwlwifi>             92        0.300        3.995        6.542        2.181  |

Soft IRQ                                             Duration (us)                                        Raise latency (us)
                       count          min          avg          max        stdev  |  count          min          avg          max        stdev
----------------------------------------------------------------------------------|------------------------------------------------------------
1:  <TIMER_SOFTIRQ>      495        0.202       21.058       51.060       11.047  |     53        2.141       11.217       20.005        7.233
3:  <NET_RX_SOFTIRQ>      14        0.133        9.177       32.774       10.483  |     14        0.763        3.703       10.902        3.448
4:  <BLOCK_SOFTIRQ>      257        5.981       29.064      125.862       15.891  |    257        0.891        3.104       15.054        2.046
6:  <TASKLET_SOFTIRQ>     26        0.309        1.198        1.748        0.329  |     26        9.636       39.222       51.430       11.246
7:  <SCHED_SOFTIRQ>      299        1.185       14.768       90.465       15.992  |    298        1.286       31.387       61.700       11.866
9:  <RCU_SOFTIRQ>        338        0.592        3.387       13.745        1.356  |    147        2.480       29.299       64.453       14.286
```
#### Handler duration frequency distribution

```
$ ./lttng-irqfreq --timerange [16:05:42,16:05:45] --irq 44 --stats mytrace/
Timerange: [2014-03-11 16:05:42.042034570, 2014-03-11 16:05:44.998914297]
Hard IRQ                                             Duration (us)
                       count          min          avg          max        stdev
----------------------------------------------------------------------------------|
44: <iwlwifi>             72        0.300        4.018        6.542        2.164  |
Frequency distribution iwlwifi (44)
###############################################################################
0.612 ██████████████████████████████████████████████████████████████ 12.00
0.924 ████████████████████ 4.00
1.236 ██████████ 2.00
2.797 ██████████████████████████ 5.00
3.421 ███████████████ 3.00
4.669 ██████████ 2.00
4.981 ██████████ 2.00
5.294 █████████████████████████████████████████ 8.00
5.606 ████████████████████████████████████████████████████████████████████ 13.00
5.918 ██████████████████████████████████████████████████████████████ 12.00
6.230 ███████████████ 3.00
```
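The `--timerange [begin,end]` value shown above can be parsed in a few lines.
This sketch mirrors the option's syntax but is an assumption, not the
project's actual option-parsing code:

```python
# Illustrative sketch: parse a --timerange value of the form
# "[HH:MM:SS,HH:MM:SS]" into two datetime.time objects.
import re
from datetime import time

def parse_timerange(arg):
    m = re.fullmatch(r"\[(\d+):(\d+):(\d+),(\d+):(\d+):(\d+)\]", arg)
    if m is None:
        raise ValueError(f"invalid timerange: {arg!r}")
    h1, m1, s1, h2, m2, s2 = map(int, m.groups())
    return time(h1, m1, s1), time(h2, m2, s2)

begin, end = parse_timerange("[16:05:42,16:05:45]")
print(begin, end)  # → 16:05:42 16:05:45
```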
There are many other scripts; we encourage you to try them and read their
```--help``` output to see all the available options.
### Page cache

This analysis tracks the page cache and extracts the latencies associated with
flushing pages to disk. To do that, we rely on the assumption that pages are
flushed in FIFO order. This might not be 100% accurate, but it already gives
great results:
An example when saving a file in vim:

```
[19:57:51.173332284 - 19:57:51.177794657] vim (31517) syscall_entry_fsync(fd = 4 <blabla>) = 0, 4.462 ms
1 dirty page(s) were flushed (assuming FIFO):
13 active dirty filesystem page(s) (known):
redis-server (2092): 2 pages
- /var/log/redis/redis-server.log : 2 pages
- .blabla.swp : 2 pages
lttng-consumerd (6750): 9 pages
- unknown (origin not found) : 9 pages
```
Another example, when running the `sync` command:

```
[19:57:53.046840755 - 19:57:53.072809609] sync (31554) syscall_entry_sync(fd = <unknown>) = 0, 25.969 ms
23 dirty page(s) were flushed (assuming FIFO):
redis-server (2092): 2 pages
- /var/log/redis/redis-server.log : 2 pages
- /home/julien/.viminfo.tmp : 6 pages
- .blabla.swp : 3 pages
lttng-consumerd (6750): 12 pages
- unknown (origin not found) : 12 pages
```
PostgreSQL with `sys_fdatasync`:

```
[13:49:39.908599447 - 13:49:39.915930730] postgres (1137) sys_fdatasync(fd = 7 </var/lib/postgresql/9.1/main/pg_xlog/000000010000000000000008>) = 0, 7.331 ms
2 pages allocated during the period
88 dirty page(s) were flushed (assuming FIFO):
postgres (1137): 88 pages
- /var/lib/postgresql/9.1/main/pg_xlog/000000010000000000000008 : 88 pages
68 last dirtied filesystem page(s):
postgres (2419): 68 pages
- base/11951/18410 : 46 pages
- base/11951/18407 : 10 pages
- base/11951/18407_fsm : 6 pages
- base/11951/18410_fsm : 6 pages
```
Detecting a fight for I/O between a large write and PostgreSQL:

```
[13:49:47.242730583 - 13:49:47.442835037] python (2353) sys_write(fd = 3 </root/bla>, count = 102395904) = 102395904, 200.104 ms
34760 pages allocated during the period
woke up kswapd during the period
10046 pages written on disk
freed 33753 pages from the cache during the period
1397 last dirtied filesystem page(s):
python (2353): 1325 pages
- /root/bla : 1325 pages
postgres (2419): 72 pages
- base/11951/18419 : 72 pages
```
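The FIFO flush assumption used above can be sketched as follows. The names and
bookkeeping are purely illustrative of the idea, not the project's actual
implementation:

```python
# Illustrative sketch of the FIFO assumption: dirty pages enter a queue
# when they are dirtied and are assumed to leave it, oldest first, when
# a flush of N pages completes.
from collections import deque, Counter

dirty_pages = deque()  # one (process, filename) entry per dirtied page

def page_dirtied(process, filename):
    dirty_pages.append((process, filename))

def pages_flushed(nr_pages):
    """Attribute a flush of nr_pages to the oldest dirty pages (FIFO)."""
    flushed = Counter()
    for _ in range(min(nr_pages, len(dirty_pages))):
        flushed[dirty_pages.popleft()] += 1
    return flushed

page_dirtied("redis-server (2092)", "/var/log/redis/redis-server.log")
page_dirtied("redis-server (2092)", "/var/log/redis/redis-server.log")
page_dirtied("vim (31517)", ".blabla.swp")
for (proc, name), count in pages_flushed(2).items():
    print(f"{proc}: {count} page(s) of {name}")
```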
## Limitations

The main limitation of this project is that it can be quite slow to process a
large trace. This project is a work in progress and we focus on the
problem-solving aspect, so features have been prioritized over performance.
Another limitation is that the state is not persistent: the trace has to be
re-processed if another analysis script is to be used on the same trace. Some
scripts belonging to the same category allow combining multiple analyses into a
single pass (see ```--freq```, ```--log```, ```--usage```,
```--latencystats```, etc.). We are planning to add a way to save the state
and/or create an interactive environment so the user can run multiple analyses
on the same trace without having to process the trace every time.
## Conclusion

We hope you have fun trying this project. Please remember it is a work in
progress; feedback, bug reports and improvement ideas are always welcome!