| 1 | |
| 2 | RFC: Common Trace Format Requirements (v1.4) |
| 3 | |
| 4 | Mathieu Desnoyers, EfficiOS Inc. |
| 5 | |
| 6 | The goal of the present document is to gather the trace format requirements |
| 7 | from the embedded, telecom, high-performance and kernel communities. It consists |
| 8 | of an overview of the trace format, tracer and trace analyzer requirements to |
| 9 | consider for a Common Trace Format proposal. |
| 10 | |
| 11 | This document includes requirements from: |
| 12 | |
| 13 | Steven Rostedt <rostedt@goodmis.org> |
| 14 | Dominique Toupin <dominique.toupin@ericsson.com> |
| 15 | Aaron Spear <aaron_spear@mentor.com> |
| 16 | Philippe Maisonneuve <Philippe.Maisonneuve@windriver.com> |
| 17 | Felix Burton <Felix.Burton@windriver.com> |
| 18 | Andrew McDermott <Andrew.McDermott@windriver.com> |
| 19 | Frank Ch. Eigler <fche@redhat.com> |
| 20 | Michel Dagenais <michel.dagenais@polymtl.ca> |
| 21 | Stefan Hajnoczi <stefanha@gmail.com> |
| 22 | Multi-Core Association Tool Infrastructure Workgroup |
| 23 | (http://www.multicore-association.org/workgroup/tiwg.php) |
| 24 | |
| 25 | |
| 26 | * Trace Format Requirements |
| 27 | |
| 28 | These are requirements on the trace format per se. This section discusses the |
| 29 | layout of data in the trace, explaining the rationale behind the choices. The |
| 30 | rationale for the trace format choices may refer to the tracer and trace |
| 31 | analyzer requirements stated below. This section starts by presenting the common |
| 32 | trace model, and then specifies the requirements of an instance of this model |
| 33 | specifically tailored to efficient kernel- and user-space tracing requirements. |
| 34 | |
| 35 | |
| 36 | 1) Architecture |
| 37 | |
| 38 | This high-level model is meant to be an industry-wide, common model, fulfilling |
| 39 | the tracing requirements. It is meant to be application-, architecture-, and |
| 40 | language-agnostic. |
| 41 | |
| 42 | 1.1) Core model |
| 43 | |
| 44 | - Event |
| 45 | |
| 46 | An event is an information record contained within the trace. |
| 47 | |
| 48 | - Events must be in physical order within a section. Their physical position |
| 49 | relative to other events within the section specifies their order relative to |
| 50 | other events within the same section. |
| 51 | - Event type (numeric identifier: maps to metadata) |
| 52 | - Unique ID assigned within a section. |
| 53 | - Event payload |
| 54 | - Variable event size |
| 55 | - Size limitations: maximum event size should be configurable. |
| 56 | - Size information available through metadata. |
| 57 | - Support various data alignments for architectures, standards, and |
| 58 | languages: |
| 59 | - Natural alignment of data for architectures with slow non-aligned |
| 60 | writes. |
| 61 | - Packed layout of headers for architectures with efficient non-aligned |
| 62 | writes. |
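
The difference between the two layouts above can be illustrated with a short
Python sketch. The three-field event record (1-byte event type, 4-byte field,
8-byte field) is a hypothetical example, not a layout taken from this document;
the `struct` module is only used here to compute sizes on the host running the
sketch.

```python
import struct

# Hypothetical event record: 1-byte event type, 4-byte field, 8-byte field.
FIELDS = "BIQ"

# "@" uses the host's native sizes and alignment, so padding may be inserted
# between fields (natural alignment).
natural_size = struct.calcsize("@" + FIELDS)

# "=" uses the host byte order with standard sizes and no padding (packed).
packed_size = struct.calcsize("=" + FIELDS)

# Writing and reading back one packed record.
packed_record = struct.pack("=" + FIELDS, 7, 0x11223344, 0x0102030405060708)
event_type, field_a, field_b = struct.unpack("=" + FIELDS, packed_record)
```

The packed record is never larger than the naturally aligned one; the price is
that its fields may start at non-aligned offsets, which is why the choice is
left to the architecture.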
| 63 | |
| 64 | - Section |
| 65 | |
| 66 | A section within the trace can be thought of as an ELF section in an ELF |
| 67 | binary. It contains a sequence of physically contiguous event records. |
| 68 | |
| 69 | - Multi-level section identifier |
| 70 | - e.g.: section name / CPU number |
| 71 | - Contains a subset of event types |
| 72 | |
| 73 | The parallel with ELF sections is used here to conceptually demonstrate the idea |
| 74 | of section, but the similarity stops there. A trace is peculiar in that we have |
| 75 | to continuously append to each section, and we ideally need to have no |
| 76 | interaction between sections. Therefore, for storage, recording all sections |
| 77 | into a single file is not recommended; a directory made of one file per section |
| 78 | is better suited. |
| 79 | |
| 80 | |
| 81 | - Metadata |
| 82 | |
| 83 | Metadata is the description of the environment of the application. It defines |
| 84 | the basic types of the domains, as well as the mapping between each event and |
| 85 | the types of its fields. The metadata scope (what it describes) is a whole |
| 86 | trace, which consists of one or many sections. |
| 87 | |
| 88 | The metadata can be either contained in the trace (better usability for telecom |
| 89 | scenarios) or added alongside the trace data by a separate module (for DSP |
| 90 | scenarios). Metadata checksumming (only for statically generated metadata) |
| 91 | and/or versioning can be used to ensure consistency between sections and |
| 92 | metadata in the latter case. |
| 93 | |
| 94 | - Trace version |
| 95 | - Major number (increment breaks compatibility) |
| 96 | - Minor number (increment keeps compatibility) |
| 97 | - Describe the invariant properties of the environment where the trace was |
| 98 | generated. |
| 99 | - Contain unique domain identifier (kernel, process ID and timestamp, |
| 100 | hypervisor) |
| 101 | - Describes the runtime environment. |
| 102 | - Report target bitness |
| 103 | - Report target byte order |
| 104 | - Data types (see section 1.2 Extensions below) |
| 105 | - Architecture-agnostic (text-based) |
| 106 | - Ought to be parsed with a regular grammar |
| 107 | - Mapping to event types, e.g. (section, event) tuples, with: |
| 108 | ( section identifier, event numerical identifier ) |
| 109 | - Description of event context fields (per section) |
| 110 | - Can be streamed along with the trace as a trace section |
| 111 | - Support dynamic addition of new event types while trace is active (required |
| 112 | to support module/shared object loading and dynamic probes) |
| 113 | - Metadata section should be efficient and reliable. Additional information |
| 114 | could be kept in separate sections, outside of metadata. |
| 115 | - Metadata description language not imposed by standard |
| 116 | - Metadata format identifier placed at the beginning of the metadata. |
| 117 | |
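As a rough sketch of two of the metadata requirements above (version numbers
and the mapping from (section identifier, event numerical identifier) tuples to
event types), consider the following Python fragment. All names (`TraceVersion`,
`EventType`, `Metadata`, `can_read`) are hypothetical, and the compatibility
rule is only one possible reading of "minor increment keeps compatibility".

```python
from typing import Dict, NamedTuple, Tuple


class TraceVersion(NamedTuple):
    major: int
    minor: int


class EventType(NamedTuple):
    name: str
    fields: Tuple[str, ...]


def can_read(reader: TraceVersion, trace: TraceVersion) -> bool:
    # Assumption: only a major number change breaks compatibility.
    return reader.major == trace.major


class Metadata:
    def __init__(self, version: TraceVersion) -> None:
        self.version = version
        self._event_types: Dict[Tuple[str, int], EventType] = {}

    def add_event_type(self, section: str, event_id: int,
                       event_type: EventType) -> None:
        # Event types may be added while the trace is active, but an
        # identifier must stay unique within its section.
        key = (section, event_id)
        if key in self._event_types:
            raise ValueError("duplicate event id %d in section %s"
                             % (event_id, section))
        self._event_types[key] = event_type

    def lookup(self, section: str, event_id: int) -> EventType:
        return self._event_types[(section, event_id)]


metadata = Metadata(TraceVersion(1, 4))
metadata.add_event_type("kernel", 0, EventType("sched_switch",
                                               ("prev_pid", "next_pid")))
metadata.add_event_type("kernel", 1, EventType("syscall_entry",
                                               ("syscall_id",)))
```

An analyzer would first check the version with `can_read()`, then resolve each
numeric event identifier it meets in a section through `lookup()`.
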
| 118 | |
| 119 | 1.2) Extensions (optional capabilities) |
| 120 | |
| 121 | - Event |
| 122 | - Optional context (thread id, virtual cpu id, execution mode (irq/bh/thread), |
| 123 | CPU/board/node id, event ordering identifier, timestamp, |
| 124 | current hardware performance counter information, event |
| 125 | size) |
| 126 | - Optional ordering capability across sections: |
| 127 | - Ordering identifier required for trace containing many event streams |
| 128 | - Either timestamp-based or based on unique sequence numbers |
| 129 | - Optional time-flow capability: per-event timestamps |
| 130 | - It should be possible to have context information only in some event records |
| 131 | within a section. E.g., timestamp written every few events. |
| 132 | |
| 133 | - Section |
| 134 | - Optional context applying to all events contained in that section |
| 135 | (thread id, virtual cpu id, execution mode (irq/bh/thread), CPU/board/node |
| 136 | id) |
| 137 | - Support piece-wise compression |
| 138 | - Support checksumming |
| 139 | |
| 140 | - Metadata |
| 141 | - Execution environment information |
| 142 | - Data types available: integer, strings, arrays, sequence, floats, |
| 143 | structures, maps (aka enumerations), bitfields, ... |
| 144 | - Describe type alignment. |
| 145 | - Describe type size. |
| 146 | - Describe type signedness. |
| 147 | - Other type examples: |
| 148 | - gcc "vector" type. (packed data) |
| 149 | http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html |
| 150 | - gcc complex type (e.g. complex short, float, double...) |
| 151 | - gcc _Fract and _Accum http://gcc.gnu.org/wiki/FixedPointArithmetic |
| 152 | http://gcc.gnu.org/onlinedocs/gcc/Fixed_002dPoint.html |
| 153 | - Describes trace capabilities, for instance: |
| 154 | - Event ordering across sections |
| 155 | - Time flow information |
| 156 | - In event header |
| 157 | - Or possibly payload of pre-specified sections and/or events |
| 158 | - Ability to perform event ordering across traces |
| 159 | |
| 160 | - Optional per-event "current state tracking" information. |
| 161 | |
| 162 | This per-event taxonomy allows automated creation of a state machine that |
| 163 | keeps track of state updates within the taxonomy tree. |
| 164 | |
| 165 | Described in a file-system path-like taxonomy with additional [] |
| 166 | operator which indicates a lookup by value, e.g.: |
| 167 | |
| 168 | * For events in the trace stream updating the current state only based on |
| 169 | information known from the context (either derived from the per-section or |
| 170 | per-event context information): |
| 171 | |
| 172 | E.g., associated with a scheduling change event: |
| 173 | |
| 174 | "cpu[section/cpu]/thread = field/next_pid" |
| 175 | Updates the current value of the current section's cpu "thread" attribute |
| 176 | (e.g. currently running thread). |
| 177 | |
| 178 | E.g., associated with a system call: |
| 179 | |
| 180 | "thread[cpu[section/cpu]/thread]/syscall[field/syscall_id]/id |
| 181 | = field/syscall_id" |
| 182 | |
| 183 | Updates the state value of the current thread "syscall" attribute. |
| 184 | |
| 185 | * For events in the trace stream targeting a path that depends on other |
| 186 | fields in that same event (would be common for full system state dump at |
| 187 | trace start): |
| 188 | |
| 189 | E.g., associated with a thread listing event: |
| 190 | "thread[field/pid]/pid = field/pid" |
| 191 | |
| 192 | E.g., associated with a thread memory maps listing event: |
| 193 | "thread[field/pid]/mmap[field/address]/address = field/address" |
| 194 | "thread[field/pid]/mmap[field/address]/end = field/end" |
| 195 | "thread[field/pid]/mmap[field/address]/flags = field/flags" |
| 196 | "thread[field/pid]/mmap[field/address]/pgoff = field/pgoff" |
| 197 | "thread[field/pid]/mmap[field/address]/inode = field/inode" |
| 198 | |
| 199 | All per-event context information (e.g. repeating the current PID and CPU |
| 200 | for each event) can be represented with this taxonomy, e.g., in the |
| 201 | section description: |
| 202 | |
| 203 | "section/pid = field/pid" |
| 204 | "section/cpu = field/cpu" |
| 205 | |
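To make the taxonomy above more concrete, here is a minimal Python sketch of an
interpreter for such rules. It assumes a grammar inferred from the examples
only: path components are separated by "/", a "[path]" suffix is replaced by
the value that path evaluates to, and paths starting with "field/" or
"section/" read from the event payload or the section context. The names
`StateTracker` and `split_top_level` are hypothetical.

```python
def split_top_level(path):
    """Split a path on "/" separators that are not inside brackets."""
    parts, depth, current = [], 0, ""
    for ch in path:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "/" and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    parts.append(current)
    return parts


class StateTracker:
    def __init__(self):
        self.state = {}  # maps a tuple of resolved path keys to a value

    def _value(self, path, section, fields):
        head, _, rest = path.partition("/")
        if head == "field":
            return fields[rest]
        if head == "section":
            return section[rest]
        # Otherwise the path designates a value already in the state tree.
        return self.state[self._keys(path, section, fields)]

    def _keys(self, path, section, fields):
        keys = []
        for component in split_top_level(path):
            if component.endswith("]"):
                # "name[inner]": look up by the value "inner" evaluates to.
                name, inner = component[:-1].split("[", 1)
                keys += [name, self._value(inner, section, fields)]
            else:
                keys.append(component)
        return tuple(keys)

    def apply(self, rule, section, fields):
        target, source = (side.strip() for side in rule.split("=", 1))
        key = self._keys(target, section, fields)
        self.state[key] = self._value(source, section, fields)


tracker = StateTracker()
# Scheduling change on CPU 0: thread 42 starts running.
tracker.apply("cpu[section/cpu]/thread = field/next_pid",
              {"cpu": 0}, {"next_pid": 42})
# System call entered by whichever thread currently runs on CPU 0.
tracker.apply("thread[cpu[section/cpu]/thread]/syscall[field/syscall_id]/id"
              " = field/syscall_id",
              {"cpu": 0}, {"syscall_id": 3})
```

The second rule only resolves because the first one recorded the running thread
of CPU 0, which is the kind of dependency an automatically generated state
machine would have to track.
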
| 206 | |
| 207 | 2) Linux-specific Model |
| 208 | |
| 209 | (Linux instance, specific to the reference implementation) |
| 210 | |
| 211 | Instance of the model specifically tailored to the Linux kernel and C |
| 212 | programs/libraries requirements. Allows for either packed events, or events |
| 213 | aligned following the ISO C standard. |
| 214 | |
| 215 | - Event |
| 216 | - Payload |
| 217 | - Initially support ISO C naturally aligned and packed type layouts. |
| 218 | |
| 219 | - Each section represented as a trace stream (typically 1 trace stream per cpu |
| 220 | per section) to allow the tracer to easily append to these sections. |
| 221 | Identifier: section name / CPU ID |
| 222 | Each section has a CPU ID identifier in its context information. |
| 223 | |
| 224 | - Trace stream |
| 225 | - Should have no hard-coded limit on size of a file generated by saving the |
| 226 | trace stream (64 bit file position is fine) |
| 227 | - Event lost count should be localized. It should apply to a limited time |
| 228 | interval and to a tracefile, hence to a specific section, so the trace |
| 229 | analyzer can provide basic information about what kind of events were lost |
| 230 | and where they were lost in the trace. |
| 231 | - A stream is divided into packets, each of which consists of one or many event |
| 232 | records. |
| 233 | - Should be optionally compressible piece-wise (packet per packet). |
| 234 | - Optional checksum on the packet content (except packet header), with a |
| 235 | selection of checksum algorithms. Performed on a per-packet basis. |
| 236 | - Packet headers should contain a sequence number to help UDP streaming |
| 237 | reassembly. |
| 238 | - Packet headers should be allowed to contain extra space reserved for |
| 239 | encapsulation into a UDP packet without copy. |
| 240 | |
| 241 | - Compact representation |
| 242 | - Minimize the overhead in terms of disk/network/serial port/memory bandwidth. |
| 243 | - A compact representation can keep more information in smaller buffers, |
| 244 | thus needs less memory to keep the same amount of information around. |
| 245 | Also useful to improve cache locality in flight recorder mode. |
| 246 | |
| 247 | - Natural alignment of headers for architectures with slow non-aligned writes. |
| 248 | |
| 249 | - Packed layout of headers for architectures with efficient non-aligned writes. |
| 250 | |
| 251 | - Should have a 1 to 1 mapping between the memory buffers and the generated |
| 252 | trace files: allows zero-copy with splice(). |
| 253 | |
| 254 | - Use target endianness |
| 255 | |
| 256 | - Portable across different target (tracer)/host (analyzer) architectures |
| 257 | |
| 258 | - It should be possible to generate metadata from descriptions written in header |
| 259 | files (extraction with C preprocessor macros is one solution). |
| 260 | |
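The packet-level requirements of the trace stream (sequence number, localized
event lost count, per-packet checksum over the content only, optional
piece-wise compression) could be sketched as follows in Python. The header
layout, the little-endian byte order, the choice of CRC-32 and zlib, and the
function names are all assumptions made for illustration; this document does
not specify any of them.

```python
import struct
import zlib

# Hypothetical packet header: sequence number (u64), events lost (u32),
# stored content length (u32), CRC-32 of the stored content (u32), flags (u8).
HEADER = struct.Struct("<QIIIB")
FLAG_COMPRESSED = 0x01


def write_packet(seq, events_lost, content, compress=False):
    flags = 0
    if compress:
        # Each packet is compressed on its own (piece-wise compression).
        content = zlib.compress(content)
        flags |= FLAG_COMPRESSED
    # The checksum covers the packet content, not the header.
    checksum = zlib.crc32(content)
    header = HEADER.pack(seq, events_lost, len(content), checksum, flags)
    return header + content


def read_packet(packet):
    seq, events_lost, length, checksum, flags = HEADER.unpack_from(packet)
    content = packet[HEADER.size:HEADER.size + length]
    if zlib.crc32(content) != checksum:
        raise ValueError("packet %d: checksum mismatch" % seq)
    if flags & FLAG_COMPRESSED:
        content = zlib.decompress(content)
    return seq, events_lost, content


# Packets received out of order (e.g. over UDP) are reassembled by
# sorting on the sequence number found in each header.
received = [write_packet(1, 0, b"second", compress=True),
            write_packet(0, 2, b"first")]
ordered = sorted((read_packet(p) for p in received), key=lambda p: p[0])
```

Because the lost count travels in each packet header, an analyzer can tell that
two events were dropped before packet 0 of this stream rather than somewhere in
the trace as a whole.
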
| 261 | |
| 262 | * Requirements on the Tracers |
| 263 | |
| 264 | Higher-level tracer requirements that seem appropriate to support some of the |
| 265 | trace format requirements stated above. |
| 266 | |
| 267 | Enumerating these higher-level requirements influences the trace format in many |
| 268 | ways. For instance, a requirement for compactness leads to schemes where all |
| 269 | information repetition should be eliminated, hence the need for optional |
| 270 | per-section context information. Another example is the requirement for speed |
| 271 | and streaming. The requirement for speed and streaming leads to zero-copy |
| 272 | implementations, which imply that the trace format should be written natively by |
| 273 | the tracer. The tracer requirements in this section are stated to ensure |
| 274 | that the trace format structure makes it possible for a tracer to cope with the |
| 275 | requirements, not to require that all tracers do so. |
| 276 | |
| 277 | |
| 278 | *Fast* |
| 279 | - Low-overhead |
| 280 | - Handle large trace throughput (multi-GB per minute) |
| 281 | - Scalable to high number of cores |
| 282 | - Per-cpu memory buffers |
| 283 | - Scalability and performance-aware synchronization |
| 284 | |
| 285 | *Compact* |
| 286 | - Environments without filesystem |
| 287 | - Need to buffer events in target RAM to send them in groups to a host for |
| 288 | analysis |
| 289 | - Ability to tune the size of buffers and transmission medium to minimize the |
| 290 | impact on the traced system. |
| 291 | - Streaming (live monitoring) |
| 292 | - Through sockets (USB, network) |
| 293 | - Through serial ports |
| 294 | - There must be a related protocol for streaming this event data. |
| 295 | |
| 296 | - Availability of flight recorder (synonym: overwrite) mode |
| 297 | - Exclusive ownership of reader data. |
| 298 | - Buffer size should be per group of events. |
| 299 | |
| 300 | - Output trace to disk |
| 301 | - Trace buffers available in crash dump to allow post-mortem analysis |
| 302 | - Fine-grained timestamps |
| 303 | |
| 304 | - Lockless (lock-free, ideally wait-free; aka starvation-free) |
| 305 | |
| 306 | - Buffer introspection: event written, read and lost counts. |
| 307 | |
| 308 | - Ability to iteratively narrow the level of details and traced time window |
| 309 | following an initial high level "state" overview provided by an initial trace |
| 310 | collecting everything. |
| 311 | |
| 312 | - Support kernel module instrumentation |
| 313 | |
| 314 | - Standard way(s) for a host to upload/access trace log data from a |
| 315 | target/JTAG device/simulator/etc. |
| 316 | |
| 317 | - Conditional tracing in kernel space. |
| 318 | |
| 319 | - Compatibility with power management subsystem (trace collection shall not be a |
| 320 | reason for waking up a device) |
| 321 | |
| 322 | - Well defined and stable trace configuration and control API across kernel |
| 323 | versions. |
| 324 | |
| 325 | - Create and run more than one trace session in parallel |
| 326 | - monitoring by system administrators |
| 327 | - field engineers troubleshooting a specific problem |
| 328 | |
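The flight recorder (overwrite) mode and the buffer introspection counters
listed above can be illustrated with a deliberately simple Python sketch. It is
single-threaded and uses a bounded queue, so it says nothing about the lockless
or per-cpu requirements; the class name `FlightRecorderBuffer` and its methods
are hypothetical.

```python
from collections import deque


class FlightRecorderBuffer:
    """Fixed-capacity buffer that overwrites its oldest events when full."""

    def __init__(self, capacity):
        self._events = deque()
        self._capacity = capacity
        self.written = 0  # events accepted by write()
        self.read = 0     # events handed to a reader
        self.lost = 0     # events overwritten before being read

    def write(self, event):
        if len(self._events) == self._capacity:
            # Overwrite mode: drop the oldest unread event.
            self._events.popleft()
            self.lost += 1
        self._events.append(event)
        self.written += 1

    def read_all(self):
        # The reader takes exclusive ownership of the events it consumes.
        events = list(self._events)
        self._events.clear()
        self.read += len(events)
        return events


buffer = FlightRecorderBuffer(capacity=3)
for event in range(5):
    buffer.write(event)
snapshot = buffer.read_all()
```

After five writes into a three-slot buffer, the snapshot holds only the three
most recent events and the counters account for the two that were overwritten.
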
| 329 | |
| 330 | * Trace Analyzer Requirements |
| 331 | |
| 332 | The trace analyzer requirements in this section are stated to ensure that |
| 333 | the trace format structure makes it possible for a trace analyzer to cope with |
| 334 | the requirements, not to require that all trace analyzers do so. |
| 335 | |
| 336 | - Ability to cope with huge traces (> 10 GB) |
| 337 | - Should be possible to do a binary search on the file to find events by time |
| 338 | at least (perhaps combined with smart indexing/summary data). |
| 339 | - File format should be as dense as possible, but not at the expense of |
| 340 | analysis performance (faster is more important than bigger since disks are |
| 341 | getting cheaper) |
| 342 | - Must not be required to scan through all events in order to start |
| 343 | analyzing (by time anyway) |
| 344 | - Support live viewing of trace streams |
| 345 | - Standard description of a trace event context. |
| 346 | (PERI-XML calls it "Dimensions") |
| 347 | - Manage system-wide event scoping with the following hierarchy: |
| 348 | (address space identifier, section name, event name) |