Commit | Line | Data |
---|---|---|
5ba9f198 MD |
1 | |
2 | RFC: Common Trace Format Requirements (v1.4) | |
3 | ||
4 | Mathieu Desnoyers, EfficiOS Inc. | |
5 | ||
6 | The goal of the present document is to gather the trace format requirements | |
7 | from the embedded, telecom, high-performance and kernel communities. It consists | |
8 | of an overview of the trace format, tracer and trace analyzer requirements to | |
9 | consider for a Common Trace Format proposal. | |
10 | ||
11 | This document includes requirements from: | |
12 | ||
13 | Steven Rostedt <rostedt@goodmis.org> | |
14 | Dominique Toupin <dominique.toupin@ericsson.com> | |
15 | Aaron Spear <aaron_spear@mentor.com> | |
16 | Philippe Maisonneuve <Philippe.Maisonneuve@windriver.com> | |
17 | Felix Burton <Felix.Burton@windriver.com> | |
18 | Andrew McDermott <Andrew.McDermott@windriver.com> | |
cf59b300 | 19 | Frank Ch. Eigler <fche@redhat.com> |
5ba9f198 MD |
20 | Michel Dagenais <michel.dagenais@polymtl.ca> |
21 | Stefan Hajnoczi <stefanha@gmail.com> | |
22 | Multi-Core Association Tool Infrastructure Workgroup | |
23 | (http://www.multicore-association.org/workgroup/tiwg.php) | |
24 | ||
25 | ||
26 | * Trace Format Requirements | |
27 | ||
28 | These are requirements on the trace format per se. This section discusses the | |
29 | layout of data in the trace, explaining the rationale behind the choices. The | |
30 | rationale for the trace format choices may refer to the tracer and trace | |
31 | analyzer requirements stated below. This section starts by presenting the common | |
32 | trace model, and then specifies the requirements of an instance of this model | |
33 | specifically tailored to efficient kernel- and user-space tracing requirements. | |
34 | ||
35 | ||
36 | 1) Architecture | |
37 | ||
38 | This high-level model is meant to be an industry-wide, common model, fulfilling | |
39 | the tracing requirements. It is meant to be application-, architecture-, and | |
40 | language-agnostic. | |
41 | ||
42 | 1.1) Core model | |
43 | ||
44 | - Event | |
45 | ||
46 | An event is an information record contained within the trace. | |
47 | ||
48 | - Events must be in physical order within a section. Their physical position | |
49 | relative to other events within the section specifies their order relative to | |
50 | other events within the same section. | |
51 | - Event type (numeric identifier: maps to metadata) | |
52 | - Unique ID assigned within a section. | |
53 | - Event payload | |
54 | - Variable event size | |
55 | - Size limitations: maximum event size should be configurable. | |
56 | - Size information available through metadata. | |
57 | - Support various data alignments for architectures, standards, and | |
58 | languages: | |
59 | - Natural alignment of data for architectures with slow non-aligned | |
60 | writes. | |
61 | - Packed layout of headers for architectures with efficient non-aligned | |
62 | writes. | |
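
As a rough illustration of the two alignment options above, the sketch below uses Python's `struct` module to encode the same event header with natural alignment and with a packed layout. The header itself (an 8-bit event ID followed by a 32-bit timestamp) is an assumption made for this example, not something these requirements specify.

```python
import struct

# Hypothetical event header: 8-bit event id followed by a 32-bit timestamp.
# "@" uses the native alignment of the machine running this code (padding is
# inserted before the 32-bit field); "=" uses native byte order with no padding.
NATURAL = struct.Struct("@BI")
PACKED = struct.Struct("=BI")


def encode(layout, event_id, timestamp):
    """Encode one header with the given layout."""
    return layout.pack(event_id, timestamp)


natural_bytes = encode(NATURAL, 7, 1000)
packed_bytes = encode(PACKED, 7, 1000)

# The packed header is always 5 bytes; the natural one is typically larger.
print(len(natural_bytes), len(packed_bytes))
```

The packed form saves space on every event, while the aligned form avoids non-aligned writes on architectures where they are slow.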
63 | ||
64 | - Section | |
65 | ||
66 | A section within the trace can be thought of as an ELF section in an ELF | |
67 | binary. It contains a sequence of physically contiguous event records. | |
68 | ||
69 | - Multi-level section identifier | |
70 | - e.g.: section name / CPU number | |
71 | - Contains a subset of event types | |
72 | ||
73 | The parallel with ELF sections is used here to conceptually demonstrate the idea | |
74 | of section, but the similarity stops there. A trace is peculiar in that we have | |
75 | to continuously append to each section, and we ideally need to have no | |
76 | interaction between sections. Therefore, for storage, recording all sections | |
77 | into a single file is not recommended; a directory made of one file per section | |
78 | is better suited. | |
79 | ||
80 | ||
81 | - Metadata | |
82 | ||
83 | Metadata describes the environment in which the application runs. It defines | |
84 | the basic types of the domains, as well as the mapping between each event and | |
85 | the types of its fields. The metadata scope (what it describes) is a whole | |
86 | trace, which consists of one or many sections. | |
87 | ||
88 | The metadata can be either contained in the trace (better usability for telecom | |
89 | scenarios) or added alongside the trace data by a separate module (for DSP | |
90 | scenarios). Metadata checksumming (only for statically generated metadata) | |
91 | and/or versioning can be used to ensure consistency between sections and | |
92 | metadata in the latter case. | |
93 | ||
94 | - Trace version | |
95 | - Major number (increment breaks compatibility) | |
96 | - Minor number (increment keeps compatibility) | |
97 | - Describe the invariant properties of the environment where the trace was | |
98 | generated. | |
99 | - Contain unique domain identifier (kernel, process ID and timestamp, | |
100 | hypervisor) | |
101 | - Describes the runtime environment. | |
102 | - Report target bitness | |
103 | - Report target byte order | |
104 | - Data types (see section 1.2 Extensions below) | |
105 | - Architecture-agnostic (text-based) | |
106 | - Ought to be parsed with a regular grammar | |
107 | - Mapping to event types, e.g. (section, event) tuples, with: | |
108 | ( section identifier, event numerical identifier ) | |
109 | - Description of event context fields (per section) | |
110 | - Can be streamed along with the trace as a trace section | |
111 | - Support dynamic addition of new event types while trace is active (required | |
112 | to support module/shared object loading and dynamic probes) | |
113 | - Metadata section should be efficient and reliable. Additional information | |
114 | could be kept in separate sections, outside of metadata. | |
115 | - Metadata description language not imposed by standard | |
116 | - Metadata format identifier placed at the beginning of the metadata. | |
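
To make the role of the metadata more concrete, here is a minimal sketch of how an analyzer might use the (section, event numerical identifier) mapping and the reported byte order to decode an event payload. The dictionary layout, the event names and the field names are all hypothetical; the requirements above do not impose a metadata description language.

```python
import struct

# Hypothetical in-memory form of parsed metadata: the target byte order plus a
# mapping from (section identifier, event numerical identifier) to event types.
METADATA = {
    "byte_order": "<",  # assumed little-endian target
    "events": {
        ("kernel", 0): {
            "name": "sched_switch",
            "fields": [("prev_pid", "i"), ("next_pid", "i")],
        },
        ("kernel", 1): {
            "name": "syscall_entry",
            "fields": [("syscall_id", "H")],
        },
    },
}


def decode_event(metadata, section, event_id, payload):
    """Look up the event type in the metadata and decode its packed payload."""
    event_type = metadata["events"][(section, event_id)]
    fmt = metadata["byte_order"] + "".join(code for _, code in event_type["fields"])
    values = struct.unpack(fmt, payload)
    names = [name for name, _ in event_type["fields"]]
    return event_type["name"], dict(zip(names, values))
```

Because the event record only carries a numeric identifier, adding a new event type while the trace is active amounts to adding one entry to this mapping.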
117 | ||
118 | ||
119 | 1.2) Extensions (optional capabilities) | |
120 | ||
121 | - Event | |
122 | - Optional context (thread id, virtual cpu id, execution mode (irq/bh/thread), | |
123 | CPU/board/node id, event ordering identifier, timestamp, | |
124 | current hardware performance counter information, event | |
125 | size) | |
126 | - Optional ordering capability across sections: | |
127 | - Ordering identifier required for trace containing many event streams | |
128 | - Either timestamp-based or based on unique sequence numbers | |
129 | - Optional time-flow capability: per-event timestamps | |
130 | - It should be possible to have context information only in some event records | |
131 | within a section. E.g., timestamp written every few events. | |
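
The optional ordering capability across sections can be sketched as a merge of per-section streams. The example assumes that every event carries an ordering identifier (a timestamp or a sequence number) and that each section is already ordered by it, as implied by the physical ordering of events within a section; the tuple representation of an event is purely illustrative.

```python
import heapq


def merge_sections(sections):
    """Merge per-section event lists into one globally ordered list.

    Each section is a list of (ordering_id, payload) tuples, already sorted
    by ordering_id within that section.
    """
    return list(heapq.merge(*sections, key=lambda event: event[0]))
```

For instance, merging one stream per CPU yields a single system-wide event sequence without re-sorting the whole trace.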
132 | ||
133 | - Section | |
134 | - Optional context applying to all events contained in that section | |
135 | (thread id, virtual cpu id, execution mode (irq/bh/thread), CPU/board/node | |
136 | id) | |
137 | - Support piece-wise compression | |
138 | - Support checksumming | |
139 | ||
140 | - Metadata | |
141 | - Execution environment information | |
142 | - Data types available: integer, strings, arrays, sequence, floats, | |
143 | structures, maps (aka enumerations), bitfields, ... | |
144 | - Describe type alignment. | |
145 | - Describe type size. | |
146 | - Describe type signedness. | |
147 | - Other type examples: | |
148 | - gcc "vector" type. (packed data) | |
149 | http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html | |
150 | - gcc complex type (e.g. complex short, float, double...) | |
151 | - gcc _Fract and _Accum http://gcc.gnu.org/wiki/FixedPointArithmetic | |
152 | http://gcc.gnu.org/onlinedocs/gcc/Fixed_002dPoint.html | |
153 | - Describes trace capabilities, for instance: | |
154 | - Event ordering across sections | |
155 | - Time flow information | |
156 | - In event header | |
157 | - Or possibly payload of pre-specified sections and/or events | |
158 | - Ability to perform event ordering across traces | |
159 | ||
160 | - Optional per-event "current state tracking" information. | |
161 | ||
162 | This per-event taxonomy allows automated creation of a state machine that | |
163 | keeps track of state updates within the taxonomy tree. | |
164 | ||
165 | Described in a file-system path-like taxonomy with an additional [] | |
166 | operator which indicates a lookup by value, e.g.: | |
167 | ||
168 | * For events in the trace stream updating the current state only based on | |
169 | information known from the context (either derived from the per-section or | |
170 | per-event context information): | |
171 | ||
172 | E.g., associated with a scheduling change event: | |
173 | ||
174 | "cpu[section/cpu]/thread = field/next_pid" | |
175 | Updates the current value of the current section's cpu "thread" attribute | |
176 | (e.g. currently running thread). | |
177 | ||
178 | E.g., associated with a system call: | |
179 | ||
180 | "thread[cpu[section/cpu]/thread]/syscall[field/syscall_id]/id | |
181 | = field/syscall_id" | |
182 | ||
183 | Updates the state value of the current thread "syscall" attribute. | |
184 | ||
185 | * For events in the trace stream targeting a path that depends on other | |
186 | fields in that same event (would be common for a full system state dump at | |
187 | trace start): | |
188 | ||
189 | E.g., associated with a thread listing event: | |
190 | "thread[field/pid]/pid = field/pid" | |
191 | ||
192 | E.g., associated with a thread memory maps listing event: | |
193 | "thread[field/pid]/mmap[field/address]/address = field/address" | |
194 | "thread[field/pid]/mmap[field/address]/end = field/end" | |
195 | "thread[field/pid]/mmap[field/address]/flags = field/flags" | |
196 | "thread[field/pid]/mmap[field/address]/pgoff = field/pgoff" | |
197 | "thread[field/pid]/mmap[field/address]/inode = field/inode" | |
198 | ||
199 | All per-event context information (e.g. repeating the current PID and CPU | |
200 | for each event) can be represented with this taxonomy, e.g., in the | |
201 | section description: | |
202 | ||
203 | "section/pid = field/pid" | |
204 | "section/cpu = field/cpu" | |
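
The following sketch shows one possible way to interpret the state-tracking expressions above. It assumes that "field/..." refers to the fields of the current event, that "section/..." refers to the per-section context, and that any other path refers to the tracked state tree; the function names and the nested-dictionary representation are inventions of this example.

```python
def split_path(path):
    """Split a path on '/' characters that are not inside a [] lookup."""
    parts, depth, current = [], 0, ""
    for ch in path:
        if ch == "/" and depth == 0:
            parts.append(current)
            current = ""
        else:
            depth += ch == "["
            depth -= ch == "]"
            current += ch
    parts.append(current)
    return parts


def resolve_keys(path, state, ctx):
    """Flatten a path into a list of keys, evaluating each [] lookup by value."""
    keys = []
    for part in split_path(path):
        if "[" in part:
            name, inner = part[:-1].split("[", 1)
            keys += [name, read(inner, state, ctx)]
        else:
            keys.append(part)
    return keys


def read(path, state, ctx):
    """Read a value from the event/section context or from the state tree."""
    keys = resolve_keys(path, state, ctx)
    node = ctx if keys[0] in ctx else state  # "field" and "section" roots
    for key in keys:
        node = node[key]
    return node


def apply_rule(rule, state, ctx):
    """Apply one 'target = source' expression to the state tree."""
    target, source = (side.strip() for side in rule.split("=", 1))
    keys = resolve_keys(target, state, ctx)
    node = state
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = read(source, state, ctx)
```

Applying "cpu[section/cpu]/thread = field/next_pid" with a section context of cpu 0 and an event field next_pid of 42 sets state["cpu"][0]["thread"] to 42, which a later system call event can then use to locate the current thread.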
205 | ||
206 | ||
207 | 2) Linux-specific Model | |
208 | ||
209 | (Linux instance, specific to the reference implementation) | |
210 | ||
211 | Instance of the model specifically tailored to the Linux kernel and C | |
212 | programs/libraries requirements. Allows for either packed events, or events | |
213 | aligned following the ISO C standard. | |
214 | ||
215 | - Event | |
216 | - Payload | |
217 | - Initially support ISO C naturally aligned and packed type layouts. | |
218 | ||
219 | - Each section represented as a trace stream (typically 1 trace stream per cpu | |
220 | per section) to allow the tracer to easily append to these sections. | |
221 | Identifier: section name / CPU ID | |
222 | Each section has a CPU ID identifier in its context information. | |
223 | ||
224 | - Trace stream | |
225 | - Should have no hard-coded limit on size of a file generated by saving the | |
226 | trace stream (64 bit file position is fine) | |
227 | - Event lost count should be localized. It should apply to a limited time | |
228 | interval and to a tracefile, hence to a specific section, so the trace | |
229 | analyzer can provide basic information about what kind of events were lost | |
230 | and where they were lost in the trace. | |
231 | - A stream is divided into packets, each of which consists of one or many event | |
232 | records. | |
233 | - Should be optionally compressible piece-wise (packet per packet). | |
234 | - Optional checksum on the packet content (except packet header), with a | |
235 | selection of checksum algorithms. Performed on a per-packet basis. | |
236 | - Packet headers should contain a sequence number to help UDP streaming | |
237 | reassembly. | |
238 | - Packet headers should be allowed to contain extra space reserved for | |
239 | encapsulation into a UDP packet without copy. | |
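
As an illustration of the packet requirements above, the sketch below builds packets whose header carries a sequence number and a checksum computed over the packet content only, then reassembles a stream from packets received out of order. The header layout (three little-endian 32-bit fields) and the choice of CRC32 are assumptions for this example; the requirements only ask for a sequence number and a selectable checksum algorithm.

```python
import struct
import zlib

# Hypothetical packet header: sequence number, content size, CRC32 of content.
HEADER = struct.Struct("<III")


def make_packet(seq, content):
    """Prefix the content with a header; the checksum excludes the header."""
    return HEADER.pack(seq, len(content), zlib.crc32(content)) + content


def parse_packet(packet):
    """Return (sequence number, content), raising ValueError if corrupted."""
    seq, size, checksum = HEADER.unpack_from(packet)
    content = packet[HEADER.size:HEADER.size + size]
    if zlib.crc32(content) != checksum:
        raise ValueError("corrupted packet %d" % seq)
    return seq, content


def reassemble(packets):
    """Reorder packets by sequence number and report the missing ones."""
    received = dict(parse_packet(p) for p in packets)
    expected = range(min(received), max(received) + 1)
    lost = [seq for seq in expected if seq not in received]
    stream = b"".join(received[seq] for seq in sorted(received))
    return stream, lost
```

The list of missing sequence numbers gives the analyzer a localized view of where data was lost, in the spirit of the event lost count requirement.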
240 | ||
241 | - Compact representation | |
242 | - Minimize the overhead in terms of disk/network/serial port/memory bandwidth. | |
243 | - A compact representation can keep more information in smaller buffers, | |
244 | thus needs less memory to keep the same amount of information around. | |
245 | Also useful to improve cache locality in flight recorder mode. | |
246 | ||
247 | - Natural alignment of headers for architectures with slow non-aligned writes. | |
248 | ||
249 | - Packed layout of headers for architectures with efficient non-aligned writes. | |
250 | ||
251 | - Should have a 1 to 1 mapping between the memory buffers and the generated | |
252 | trace files: allows zero-copy with splice(). | |
253 | ||
254 | - Use target endianness | |
255 | ||
256 | - Portable across different target (tracer)/host (analyzer) architectures | |
257 | ||
258 | - It should be possible to generate metadata from descriptions written in header | |
259 | files (extraction with C preprocessor macros is one solution). | |
260 | ||
261 | ||
262 | * Requirements on the Tracers | |
263 | ||
264 | Higher-level tracer requirements that seem appropriate to support some of the | |
265 | trace format requirements stated above. | |
266 | ||
267 | Enumerating these higher-level requirements influences the trace format in many | |
268 | ways. For instance, a requirement for compactness leads to schemes where all | |
269 | information repetition should be eliminated. Thus the need for optional | |
270 | per-section context information. Another example is the requirement for speed | |
fbf2fa4f | 271 | and streaming. The requirement for speed and streaming leads to zero-copy |
5ba9f198 MD |
272 | implementations, which imply that the trace format should be written natively by |
273 | the tracer. The tracer requirements stated in this section are stated to ensure | |
274 | that the trace format structure makes it possible for a tracer to cope with the | |
275 | requirements, not to require that all tracers do so. | |
276 | ||
277 | ||
278 | *Fast* | |
279 | - Low-overhead | |
280 | - Handle large trace throughput (multiple GB per minute) | |
281 | - Scalable to high number of cores | |
282 | - Per-cpu memory buffers | |
283 | - Scalability and performance-aware synchronization | |
284 | ||
285 | *Compact* | |
286 | - Environments without filesystem | |
287 | - Need to buffer events in target RAM to send them in groups to a host for | |
288 | analysis | |
289 | - Ability to tune the size of buffers and transmission medium to minimize the | |
290 | impact on the traced system. | |
291 | - Streaming (live monitoring) | |
292 | - Through sockets (USB, network) | |
293 | - Through serial ports | |
294 | - There must be a related protocol for streaming this event data. | |
295 | ||
296 | - Availability of flight recorder (synonym: overwrite) mode | |
297 | - Exclusive ownership of reader data. | |
298 | - Buffer size should be per group of events. | |
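
Flight recorder (overwrite) mode can be pictured as a fixed-size buffer in which the newest events replace the oldest ones. The class below is only a conceptual sketch with hypothetical names; it ignores locking, per-cpu buffers and reader ownership, and it assumes that counting overwritten events is how the tracer accounts for data that never reached a reader.

```python
from collections import deque


class FlightRecorder:
    """Keep only the most recent `capacity` events, overwriting older ones."""

    def __init__(self, capacity):
        self.events = deque(maxlen=capacity)
        self.written = 0
        self.overwritten = 0

    def write(self, event):
        # A full deque drops its oldest entry on append; count that as a loss.
        if len(self.events) == self.events.maxlen:
            self.overwritten += 1
        self.events.append(event)
        self.written += 1

    def snapshot(self):
        """Return the retained events, oldest first (e.g. for a crash dump)."""
        return list(self.events)
```

A compact event representation directly increases how much history such a buffer can retain for a given amount of memory.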
299 | ||
300 | - Output trace to disk | |
301 | - Trace buffers available in crash dump to allow post-mortem analysis | |
302 | - Fine-grained timestamps | |
303 | ||
304 | - Lockless (lock-free, ideally wait-free; aka starvation-free) | |
305 | ||
306 | - Buffer introspection: event written, read and lost counts. | |
307 | ||
308 | - Ability to iteratively narrow the level of details and traced time window | |
309 | following an initial high level "state" overview provided by an initial trace | |
310 | collecting everything. | |
311 | ||
312 | - Support kernel module instrumentation | |
313 | ||
314 | - Standard way(s) for a host to upload/access trace log data from a | |
315 | target/JTAG device/simulator/etc. | |
316 | ||
317 | - Conditional tracing in kernel space. | |
318 | ||
319 | - Compatibility with power management subsystem (trace collection shall not be a | |
320 | reason for waking up a device) | |
321 | ||
322 | - Well defined and stable trace configuration and control API across kernel | |
323 | versions. | |
324 | ||
325 | - Create and run more than one trace session in parallel | |
326 | - monitoring by system administrators | |
327 | - field engineers troubleshooting a specific problem | |
328 | ||
329 | ||
330 | * Trace Analyzer Requirements | |
331 | ||
332 | The trace analyzer requirements stated in this section are stated to ensure that | |
333 | the trace format structure makes it possible for a trace analyzer to cope with | |
334 | the requirements, not to require that all trace analyzers do so. | |
335 | ||
336 | - Ability to cope with huge traces (> 10 GB) | |
337 | - Should be possible to do a binary search on the file to find events by time | |
338 | at least (perhaps combined with smart indexing/summary data). | |
339 | - File format should be as dense as possible, but not at the expense of | |
340 | analysis performance (speed is more important than size since disks are | |
341 | getting cheaper) | |
342 | - Must not be required to scan through all events in order to start | |
343 | analyzing (by time anyway) | |
344 | - Support live viewing of trace streams | |
345 | - Standard description of a trace event context. | |
346 | (PERI-XML calls it "Dimensions") | |
347 | - Manage system-wide event scoping with the following hierarchy: | |
348 | (address space identifier, section name, event name) |
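
The requirement to find events by time without scanning the whole trace can be sketched as a binary search over a small per-packet index. The index format used here, a sorted list of (first timestamp, file offset) pairs, is an assumption for illustration; the requirements do not define an index.

```python
import bisect


def find_packet(index, timestamp):
    """Return the file offset of the packet that may contain `timestamp`.

    `index` is a list of (first_timestamp, file_offset) pairs sorted by
    first_timestamp. Returns None if the timestamp precedes the trace.
    """
    starts = [begin for begin, _ in index]
    pos = bisect.bisect_right(starts, timestamp) - 1
    return index[pos][1] if pos >= 0 else None
```

With such an index, an analyzer can seek directly to the packet of interest in a trace of many gigabytes and start decoding from there.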