
   1
   2 RFC: Common Trace Format Proposal for Linux (v1)
   3
   4 Mathieu Desnoyers, EfficiOS Inc.
   5
   6 The goal of the present document is to propose a trace format that suits the
   7 needs of the embedded, telecom, high-performance and kernel communities.  It is
   8 based on the Common Trace Format Requirements (v1.4) document. It is designed to
   9 be natively generated by tracing of a Linux kernel and Linux user-space
  10 applications written in C/C++.
  11
  12 A reference implementation of a library to read and write this trace format is
  13 being implemented within the BabelTrace project, a converter between trace
  14 formats. The development tree is available at:
  15
  16   git tree:   git://git.efficios.com/babeltrace.git
  17   gitweb:     http://git.efficios.com/?p=babeltrace.git
  18
  19
  20 1. Preliminary definitions
  21
  22   - Trace: An ordered sequence of events.
  23   - Section: Group of events, containing a subset of the trace event types.
  24   - Packet: A sequence of physically contiguous events within a section.
  25   - Event: This is the basic entry in a trace. (aka: a trace record).
  26     - An event identifier (ID) relates to the class (a type) of event within
  27       a section.
  28         e.g. section: high_throughput, event: irq_entry.
  29     - An event (or event record) relates to a specific instance of an event
  30       class.
  31         e.g. section: high_throughput, event: irq_entry, at time X, on CPU Y
  32
  33
  34 2. High-level representation of a trace
  35
  36 A trace is divided into multiple trace streams, each representing an information
  37 stream specific to:
  38
  39  - a section,
  40  - a processor.
  41
  42 A trace "section" consists of a collection of trace streams (typically one trace
  43 stream per cpu) containing a subset of the trace event types.
  44
  45 Because each trace stream is appended to while a trace is being recorded, each
  46 is associated with a separate file for disk output. Therefore, a trace stored to
  47 disk can be represented as a directory containing one file per trace stream.
  48
  49 A metadata section contains information on trace event types. It describes:
  50
  51 - Trace version.
  52 - Types available.
  53 - Per-section event header description.
  54 - Per-section event header selection.
  55 - Per-section event context fields.
  56 - Per-event
  57   - Event type to section mapping.
  58   - Event type to name mapping.
  59   - Event type to ID mapping.
  60   - Event fields description.
  61
  62
  63 3. Trace Section
  64
  65 A trace section is divided into contiguous packets of variable size. These
  66 subdivisions allow the trace analyzer to perform a fast binary search by time
  67 within the section (typically indexing only the packet headers)
  68 without reading the whole section. These subdivisions have a variable size to
  69 eliminate the need to transfer the packet padding when partially filled packets
  70 must be sent when streaming a trace for live viewing/analysis. Dividing sections
  71 into packets is also useful for network streaming over UDP and flight recorder
  72 mode tracing (a whole packet can be swapped out of the buffer atomically for
  73 reading).
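
The binary search by time mentioned above can be sketched as follows. This is
only an illustration: the packet_index_entry structure and the find_packet()
name are hypothetical, and the sketch assumes the analyzer has already built an
in-memory index, sorted by time, from the packet headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical index entry built from one packet header. */
struct packet_index_entry {
	uint64_t timestamp_begin;	/* first timestamp covered by the packet */
	uint64_t timestamp_end;		/* last timestamp covered by the packet */
	uint64_t file_offset;		/* where the packet starts in the file */
};

/*
 * Return the index of the first packet whose timestamp_end is >= ts,
 * or n if every packet ends before ts. Entries must be sorted by time.
 */
static size_t find_packet(const struct packet_index_entry *idx, size_t n,
			  uint64_t ts)
{
	size_t lo = 0, hi = n;

	assert(idx != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (idx[mid].timestamp_end < ts)
			lo = mid + 1;	/* packet ends too early, look right */
		else
			hi = mid;	/* candidate, keep looking left */
	}
	return lo;
}
```

Only the index is touched during the search; the packet content is read once
the target packet is known.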
  74
  75 The section header is repeated at the beginning of each packet to allow
  76 flexibility in terms of:
  77
  78   - streaming support,
  79   - allowing arbitrary buffers to be discarded without making the trace
  80     unreadable,
  81   - handling UDP packet loss, either by coping with the missing packet or by
  82     asking for re-transmission,
  83   - transparently supporting flight recorder mode,
  84   - transparently supporting crash dump.
  85
  86 The section header will therefore be referred to as the "packet header"
  87 throughout the rest of this document.
  88
  89
  90 4. Types
  91
  92 4.1 Basic types
  93
  94 A basic type is a scalar type, as described in this section.
  95
  96 4.1.1 Type inheritance
  97
  98 Type specifications can be inherited to allow deriving concrete types from an
  99 abstract type. For example, see the uint32_t type derived from the "integer"
 100 abstract type below ("Integers" section). Concrete types have a precise binary
 101 representation in the trace. Abstract types have methods to read and write these
 102 types, but must be derived into a concrete type to be usable in an event field.
 103
 104 Concrete types inherit from abstract types. Abstract types can inherit from
 105 other abstract types.
 106
 107 4.1.2 Alignment
 108
 109 We define "byte-packed" types as aligned on the byte size, namely 8-bit.
 110 We define "bit-packed" types as following on the next bit, as defined by the
 111 "bitfields" section.
 112 We define "natural alignment" of a basic type as the lesser value between the
 113 type size and the architecture word size.
 114
 115 All basic types, except bitfields, are either aligned on their "natural"
 116 alignment or byte-packed, depending on the architecture preference.
 117 Architectures providing fast unaligned writes byte-pack basic types to save
 118 space, aligning each type on byte boundaries (8-bit). Architectures with slow
 119 unaligned writes align types on the lesser value between their size and the
 120 architecture word size (the type "natural" alignment on the architecture).
 121
 122 Note that the natural alignment for 64-bit integers and double-precision
 123 floating point values is fixed to 32-bit on a 32-bit architecture, but to 64-bit
 124 for a 64-bit architecture.
 125
 126 Metadata attribute representation:
 127
 128   align = value;                                /* value in bits */
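
As an illustration of the alignment rules above, the following sketch computes
the natural alignment of a basic type and rounds a bit offset up to a given
alignment. The function names are hypothetical, and the rounding helper assumes
the alignment is a power of two (which holds for 1, 8, 16, 32 and 64 bits).

```c
#include <assert.h>
#include <stdint.h>

/* Natural alignment in bits: the lesser of the type size and word size. */
static unsigned int natural_align_bits(unsigned int type_bits,
				       unsigned int word_bits)
{
	return type_bits < word_bits ? type_bits : word_bits;
}

/* Round a bit offset up to the next multiple of align_bits. */
static uint64_t align_offset_bits(uint64_t offset_bits,
				  unsigned int align_bits)
{
	/* Assumption of this sketch: align_bits is a power of two. */
	assert(align_bits != 0 && (align_bits & (align_bits - 1)) == 0);
	return (offset_bits + align_bits - 1) & ~((uint64_t)align_bits - 1);
}
```

For instance, a 64-bit integer gets a 32-bit natural alignment on a 32-bit
architecture, whereas a byte-packed layout would pass 8 as the alignment.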
 129
 130 4.1.3 Byte order
 131
 132 By default, target architecture endianness is used. Byte order can be overridden
 133 for a basic type by specifying a "byte_order" attribute. Typical use-case is to
 134 specify the network byte order (big endian: "be") to save data captured from the
 135 network into the trace without conversion. If not specified, the byte order is
 136 native.
 137
 138 Metadata representation:
 139
 140   byte_order = native OR network OR be OR le;   /* network and be are aliases */
 141
 142 4.1.4 Size
 143
 144 Type size, in bits, for integers and floats is that returned by "sizeof()" in C
 145 multiplied by CHAR_BIT.
 146 We require the size of "char" and "unsigned char" types (CHAR_BIT) to be fixed
 147 to 8 bits for cross-endianness compatibility.
 148
 149 Metadata representation:
 150
 151   size = value;    (value is in bits)
 152
 153 4.1.5 Integers
 154
 155 Signed integers are represented in two's complement. Integer alignment, size,
 156 signedness and byte ordering are defined in the metadata. Integers aligned on
 157 byte size (8-bit) and with length multiple of byte size (8-bit) correspond to
 158 the C99 standard integers. In addition, integers with alignment and/or size that
 159 are _not_ a multiple of the byte size are permitted; these correspond to the C99
 160 standard bitfields, with the added specification that the CTF integer bitfields
 161 have a fixed binary representation. A MIT-licensed reference implementation of
 162 the CTF portable bitfields is available at:
 163
 164   http://git.efficios.com/?p=babeltrace.git;a=blob;f=include/babeltrace/bitfield.h
 165
 166 Binary representation of integers:
 167
 168 - On little and big endian:
 169   - Within a byte, high bits correspond to the integer's high bits, and low
 170     bits correspond to its low bits.
 171 - On little endian:
 172   - Integers spanning multiple bytes are placed from the least significant to
 173     the most significant byte.
 174   - Consecutive integers are placed from lower bits to higher bits (even within
 175     a byte).
 176 - On big endian:
 177   - Integers spanning multiple bytes are placed from the most significant to
 178     the least significant byte.
 179   - Consecutive integers are placed from higher bits to lower bits (even within
 180     a byte).
 181
 182 This binary representation is derived from the bitfield implementation in GCC
 183 for little and big endian. However, contrary to what GCC does, integers can
 184 cross unit boundaries (no padding is required). Padding can be explicitly
 185 added (see 4.1.6 GNU/C bitfields) to follow the GCC layout if needed.
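
The bit placement rules above can be sketched as a pair of bit-by-bit readers.
This is not the BabelTrace bitfield.h implementation; the function names are
hypothetical, only unsigned values are handled, and the sketch favors clarity
over speed. Offsets and lengths are expressed in bits from the start of the
buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Little endian: stream bit p is bit (p % 8) of byte p / 8, LSB first. */
static uint64_t read_bits_le(const uint8_t *buf, size_t off, unsigned int len)
{
	uint64_t v = 0;
	unsigned int k;

	assert(len <= 64);
	for (k = 0; k < len; k++) {
		size_t p = off + k;
		uint64_t bit = (buf[p / 8] >> (p % 8)) & 1;

		v |= bit << k;		/* first stream bit is the LSB */
	}
	return v;
}

/* Big endian: stream bit p is bit (7 - p % 8) of byte p / 8, MSB first. */
static uint64_t read_bits_be(const uint8_t *buf, size_t off, unsigned int len)
{
	uint64_t v = 0;
	unsigned int k;

	assert(len <= 64);
	for (k = 0; k < len; k++) {
		size_t p = off + k;
		uint64_t bit = (buf[p / 8] >> (7 - p % 8)) & 1;

		v = (v << 1) | bit;	/* first stream bit is the MSB */
	}
	return v;
}
```

With the two bytes 0xB5 0x01, a 5-bit integer followed by a 4-bit integer reads
as 21 then 13 in little endian, and as 22 then 10 in big endian. The second
integer crosses the byte boundary in both cases, without padding.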
 186
 187 Metadata representation:
 188
 189   abstract_type integer {
 190     signed = true OR false;                     /* default false */
 191     byte_order = native OR network OR be OR le; /* default native */
 192     size = value;                               /* value in bits, no default */
 193     align = value;                              /* value in bits */
 194   }
 195
 196 Example of type inheritance (creation of a concrete type uint32_t):
 197
 198 type uint32_t {
 199   parent = integer;
 200   size = 32;
 201   signed = false;
 202   align = 32;
 203 }
 204
 205 Definition of a 5-bit signed bitfield:
 206
 207 type int5_t {
 208   parent = integer;
 209   size = 5;
 210   signed = true;
 211   align = 1;
 212 }
 213
 214 4.1.6 GNU/C bitfields
 215
 216 The GNU/C bitfields follow closely the integer representation, with a
 217 particularity on alignment: if a bitfield cannot fit in the current unit, the
 218 unit is padded and the bitfield starts at the following unit. We therefore need
 219 to express the extra "unit size" information.
 220
 221 Metadata representation:
 222
 223 abstract_type gcc_bitfield {
 224   parent = integer;
 225   unit_size = value;
 226 }
 227
 228 As an example, the following structure declared in C compiled by GCC:
 229
 230 struct example {
 231   short a:12;
 232   short b:5;
 233 };
 234
 235 Would correspond to the following structure, aligned on the largest element
 236 (short). The second bitfield would be aligned on the next unit boundary, because
 237 it would not fit in the current unit.
 238
 239 type struct_example {
 240   parent = struct;
 241   fields = {
 242     {
 243       type {
 244         parent = gcc_bitfield;
 245         unit_size = 16;                         /* sizeof(short) */
 246         size = 12;
 247         signed = true;
 248         align = 1;
 249       },
 250       a,
 251     },
 252     {
 253       type {
 254         parent = gcc_bitfield;
 255         unit_size = 16;                         /* sizeof(short) */
 256         size = 5;
 257         signed = true;
 258         align = 1;
 259       },
 260       b,
 261     },
 262   };
 263 }
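
The unit padding rule can be sketched as a helper that returns the bit offset
at which a GNU/C bitfield starts. The function name is hypothetical, and the
sketch ignores special cases such as zero-width bitfields.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given the current bit offset, return where a bitfield of "size" bits
 * starts when it must not straddle a unit of "unit_size" bits.
 */
static uint64_t gcc_bitfield_start(uint64_t offset, unsigned int size,
				   unsigned int unit_size)
{
	uint64_t used;

	assert(unit_size != 0 && size <= unit_size);
	used = offset % unit_size;	/* bits already used in this unit */
	if (used + size > unit_size)
		return offset + (unit_size - used);	/* pad to next unit */
	return offset;
}
```

For struct example above, field a starts at bit 0 and ends at bit 12. Field b
(5 bits) does not fit in the 4 bits left in the 16-bit unit, so it starts at
bit 16.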
 264
 265 4.1.7 Floating point
 266
 267 The floating point values byte ordering is defined in the metadata.
 268
 269 Floating point values follow the IEEE 754-2008 standard interchange formats.
 270 Description of the floating point values include the exponent and mantissa size
 271 in bits. Some requirements are imposed on the floating point values:
 272
 273 - FLT_RADIX must be 2.
 274 - mant_dig is the number of digits represented in the mantissa. It is specified
 275   by the ISO C99 standard, section 5.2.4, as FLT_MANT_DIG, DBL_MANT_DIG and
 276   LDBL_MANT_DIG as defined by <float.h>.
 277 - exp_dig is the number of digits represented in the exponent. Given that
 278   mant_dig is one bit more than its actual size in bits (leading 1 is not
 279   needed) and also given that the sign bit always takes one bit, exp_dig can be
 280   specified as:
 281
 282   - sizeof(float) * CHAR_BIT - FLT_MANT_DIG
 283   - sizeof(double) * CHAR_BIT - DBL_MANT_DIG
 284   - sizeof(long double) * CHAR_BIT - LDBL_MANT_DIG
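
As a quick check of these formulas, the following sketch evaluates them for
float and double. The function names are hypothetical. long double is left out
of the sketch because its storage size can include padding on some ABIs.

```c
#include <assert.h>
#include <float.h>
#include <limits.h>

/* exp_dig for "float", following the formula above. */
static int float_exp_dig(void)
{
	assert(FLT_RADIX == 2);		/* required by this format */
	return (int)(sizeof(float) * CHAR_BIT) - FLT_MANT_DIG;
}

/* exp_dig for "double", following the formula above. */
static int double_exp_dig(void)
{
	assert(FLT_RADIX == 2);
	return (int)(sizeof(double) * CHAR_BIT) - DBL_MANT_DIG;
}
```

On platforms using the IEEE 754 binary32 and binary64 formats, these evaluate
to 8 and 11 respectively.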
 285
 286 Metadata representation:
 287
 288 abstract_type floating_point {
 289    exp_dig = value;
 290    mant_dig = value;
 291    byte_order = native OR network OR be OR le;
 292 }
 293
 294 Example of type inheritance:
 295
 296 type float {
 297   exp_dig = 8;         /* sizeof(float) * CHAR_BIT - FLT_MANT_DIG */
 298   mant_dig = 24;       /* FLT_MANT_DIG */
 299   byte_order = native;
 300 }
 301
 302 TODO: define NaN, +inf, -inf behavior.
 303
 304 4.1.8 Enumerations
 305
 306 Enumerations are a mapping between an integer type and a table of strings. The
 307 numerical representation of the enumeration follows the integer type specified
 308 by the metadata. The enumeration mapping table is detailed in the enumeration
 309 description within the metadata.
 310
 311 abstract_type enum  {
 312   parent = integer;
 313   map = {
 314     { value , string },
 315     { value , string },
 316     { value , string },
 317     ...
 318   };
 319 }
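
On the reader side, such a mapping table could be held as an array of
value/string pairs. The following is a minimal sketch with hypothetical names;
a linear scan is used for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One { value, string } entry of an enumeration map. */
struct enum_mapping {
	int64_t value;
	const char *string;
};

/* Return the string mapped to value, or NULL if the value is unmapped. */
static const char *enum_to_string(const struct enum_mapping *map, size_t n,
				  int64_t value)
{
	size_t i;

	assert(map != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (map[i].value == value)
			return map[i].string;
	}
	return NULL;
}
```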
 320
 321
 322 4.2 Compound types
 323
 324 4.2.1 Structures
 325
 326 Structures are aligned on the largest alignment required by basic types
 327 contained within the structure. (This follows the ISO/C standard for structures)
 328
 329 Metadata representation:
 330
 331 abstract_type struct {
 332   fields = {
 333     { field_type, field_name },
 334     { field_type, field_name },
 335     ...
 336   };
 337 }
 338
 339 Example:
 340
 341 type struct_example {
 342   parent = struct;
 343   fields = {
 344     {
 345       type {                 /* Nameless type */
 346         parent = integer;
 347         size = 16;
 348         signed = true;
 349         align = 16;
 350       },
 351       first_field_name,
 352     },
 353     {
 354       uint64_t,              /* Named type declared in the metadata */
 355       second_field_name,
 356     }
 357   };
 358 }
 359
 360 The fields are placed in a sequence next to each other. They each possess a
 361 field name, which is a unique identifier within the structure.
 362
 363 4.2.2 Arrays
 364
 365 Arrays are fixed-length. Their length is declared in the type declaration within
 366 the metadata. They contain an array of "inner type" elements, which can refer to
 367 any type not containing the type of the array being declared (no circular
 368 dependency).
 369
 370 Metadata representation:
 371
 372 abstract_type array {
 373   length = value;
 374   elem_type = type;
 375 }
 376
 377 E.g.:
 378
 379 type example_array {
 380   parent = array;
 381   length = 10;
 382   elem_type = uint32_t;
 383 }
 384
 385 4.2.3 Sequences
 386
 387 Sequences are dynamically-sized arrays. They start with an integer that specifies
 388 the length of the sequence, followed by an array of "inner type" elements.
 389
 390 abstract_type sequence {
 391   length_type = type;   /* Inheriting from integer */
 392   elem_type = type;
 393 }
 394
 395 The integer type follows the integer types specifications, and the sequence
 396 elements follow the "array" specifications.
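
To make the layout concrete, here is a sketch that decodes one particular
sequence instance: a byte-packed sequence whose length_type is a 16-bit
little-endian unsigned integer and whose elem_type is an 8-bit unsigned
integer. Both type choices and the function name are assumptions made for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return a pointer to the first element and store the element count in
 * *len, or return NULL if the buffer is too short for the sequence.
 */
static const uint8_t *read_sequence_u16le_u8(const uint8_t *buf,
					     size_t buf_len, uint16_t *len)
{
	uint16_t n;

	assert(buf != NULL && len != NULL);
	if (buf_len < 2)
		return NULL;		/* length prefix is truncated */
	n = (uint16_t)(buf[0] | (buf[1] << 8));
	if (buf_len - 2 < n)
		return NULL;		/* elements are truncated */
	*len = n;
	return buf + 2;
}
```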
 397
 398 4.2.4 Strings
 399
 400 Strings are variable-size arrays of bytes terminated by a '\0' ("NUL")
 401 character. Their encoding is described in the metadata. In the absence of an
 402 encoding attribute, the default encoding is UTF-8.
 403
 404 abstract_type string {
 405   encoding = UTF8 OR ASCII;
 406 }
 407
 408
 409 5. Trace Packet Header
 410
 411 - Aligned on page size. Fixed size. Fields aligned on their natural size or
 412   packed (depending on the architecture preference).
 413   No padding at the end of the trace packet header. Native architecture byte
 414   ordering.
 415 - Magic number (CTF magic numbers: 0xC1FC1FC1 and its reverse endianness
 416   representation: 0xC11FFCC1) It needs to have a non-symmetric bytewise
 417   representation. Used to distinguish between big and little endian traces (this
 418   information is determined by knowing the endianness of the architecture
 419   reading the trace and comparing the magic number against its value and the
 420   reverse, 0xC11FFCC1). This magic number specifies that we use the CTF metadata
 421   description language described in this document. Different magic numbers
 422   should be used for other metadata description languages.
 423 - Session ID, used to ensure the packet matches the metadata used.
 424   (note: we cannot use a metadata checksum because metadata can be appended to
 425    while tracing is active)
 426 - Packet content size (in bytes).
 427 - Packet size (in bytes, includes padding).
 428 - Packet content checksum (optional). Checksum excludes the packet header.
 429 - Per-section packet sequence count (to deal with UDP packet loss). The number
 430   of significant sequence counter bits should also be present, so wrap-arounds
 431   are dealt with correctly.
 432 - Timestamp at the beginning and end of the packet. This range should include
 433   all event timestamps contained therein.
 434 - Events discarded count
 435   - Snapshot of a per-section free-running counter, counting the number of
 436     events discarded that were supposed to be written in the section prior to
 437     the first event in the packet.
 438     * Note: producer-consumer buffer full condition should fill the current
 439             packet with padding so we know exactly where events have been
 440             discarded.
 441 - Lossless compression scheme used for the packet content. Applied directly to
 442   raw data.
 443   0: no compression scheme
 444   1: bzip2
 445   2: gzip
 446 - Cipher used for the packet content. Applied after compression.
 447   0: no encryption
 448   1: AES
 449 - Checksum scheme used for the packet content. Applied after encryption.
 450   0: no checksum
 451   1: md5
 452   2: sha1
 453   3: crc32
 454
 455 type packet_header {
 456   parent = struct;
 457   fields = {
 458     { uint32_t, magic },
 459     { uint32_t, session_id },
 460     { uint32_t, content_size },
 461     { uint32_t, packet_size },
 462     { uint32_t, checksum },
 463     { uint32_t, section_packet_count },
 464     { uint64_t, timestamp_begin },
 465     { uint64_t, timestamp_end },
 466     { uint32_t, events_discarded },
 467     { uint8_t,  section_packet_count_bits },    /* Significant counter bits */
 468     { uint8_t,  compression_scheme },
 469     { uint8_t,  encryption_scheme },
 470     { uint8_t,  checksum_scheme },
 471   };
 472 };
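
The endianness check based on the magic number can be sketched as follows. The
enum and function names are hypothetical; the sketch assumes the magic number
is the first field of the packet header, as in the structure above, and that
the caller provides at least 4 readable bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CTF_MAGIC		0xC1FC1FC1U
#define CTF_MAGIC_REVERSED	0xC11FFCC1U

enum trace_byte_order {
	TRACE_NATIVE,		/* same endianness as the reader */
	TRACE_REVERSED,		/* opposite endianness, swap on read */
	TRACE_UNKNOWN		/* not a packet in this format */
};

static enum trace_byte_order detect_byte_order(const uint8_t *packet)
{
	uint32_t magic;

	assert(packet != NULL);
	/* Load the first 4 bytes using the reader's native byte order. */
	memcpy(&magic, packet, sizeof(magic));
	if (magic == CTF_MAGIC)
		return TRACE_NATIVE;
	if (magic == CTF_MAGIC_REVERSED)
		return TRACE_REVERSED;
	return TRACE_UNKNOWN;
}
```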
 473
 474
 475 6. Event Structure
 476
 477 The overall structure of an event is:
 478
 479   - Event Header (as specified by the section metadata)
 480   - Extended Event Header (as specified by the event header)
 481   - Event Context (as specified by the section metadata)
 482   - Event Payload (as specified by the event metadata)
 483
 484
 485 6.1 Event Header
 486
 487 One major factor can vary between sections: the number of event IDs assigned to
 488 a section. Luckily, this information tends to stay relatively constant (modulo
 489 event registration while trace is being recorded), so we can specify different
 490 representations for sections containing few event IDs and sections containing
 491 many event IDs, so we end up representing the event ID and timestamp as densely
 492 as possible in each case.
 493
 494 We therefore provide two types of event headers. Type 1 accommodates sections
 495 with at most 31 event IDs (0 to 30). Type 2 accommodates sections with more
 496 than 31 event IDs.
 497
 498 The "extended headers" are used in the rare occasions where the information
 499 cannot be represented in the ranges available in the event header.
 500
 501 Types uintX_t represent an X-bit unsigned integer.
 502
 503
 504 6.1.1 Type 1 - Few event IDs
 505
 506   - Aligned on 32-bit (or 8-bit if byte-packed, depending on the architecture
 507     preference).
 508   - Fixed size: 32 bits.
 509   - Native architecture byte ordering.
 510
 511 type event_header_1 {
 512   parent = struct;
 513   fields = {
 514     { uint5_t, id },    /*
 515                          * id: range: 0 - 30.
 516                          * id 31 is reserved to indicate a following
 517                          * extended header.
 518                          */
 519     { uint27_t, timestamp },
 520   };
 521 };
 522
 523 The end of a type 1 header is aligned on a 32-bit boundary (or packed).
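
Decoding a type 1 header can be sketched as follows, applying the bit placement
rules of section 4.1.5 to the id and timestamp fields. The C structure and
function names are hypothetical, and the sketch assumes the 32-bit header word
has already been loaded using the byte order of the trace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EVENT_ID_EXTENDED	31U	/* reserved id: extended header follows */

/* Decoded (unpacked) view of a type 1 event header. */
struct decoded_header_1 {
	uint8_t id;		/* 5 bits in the trace */
	uint32_t timestamp;	/* 27 bits in the trace */
};

static struct decoded_header_1 decode_header_1(uint32_t word, int big_endian)
{
	struct decoded_header_1 h;

	if (big_endian) {
		h.id = (uint8_t)(word >> 27);	/* first field: top 5 bits */
		h.timestamp = word & 0x07FFFFFFU;
	} else {
		h.id = (uint8_t)(word & 0x1FU);	/* first field: low 5 bits */
		h.timestamp = word >> 5;
	}
	return h;
}

/* Non-zero when the header must be followed by an extended header. */
static int needs_extended_header(const struct decoded_header_1 *h)
{
	assert(h != NULL);
	return h->id == EVENT_ID_EXTENDED;
}
```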
 524
 525
 526 6.1.2 Extended Type 1 Event Header
 527
 528   - Follows struct event_header_1, which is aligned on 32-bit, so no need to
 529     realign.
 530   - Fixed size: 96 bits.
 531   - Native architecture byte ordering.
 532
 533 type event_header_1_ext {
 534   parent = struct;
 535   fields = {
 536     { uint32_t, id },           /* 32-bit event IDs */
 537     { uint64_t, timestamp },    /* 64-bit timestamps */
 538   };
 539 };
 540
 541 The end of a type 1 extended header is aligned on the natural alignment of a
 542 64-bit integer (or 8-bit if byte-packed).
 543
 544
 545 6.1.3 Type 2 - Many event IDs
 546
 547   - Aligned on 32-bit (or 8-bit if byte-packed, depending on the architecture
 548     preference).
 549   - Fixed size: 48 bits.
 550   - Native architecture byte ordering.
 551
 552 type event_header_2 {
 553   parent = struct;
 554   fields = {
 555     { uint32_t, timestamp },
 556     { uint16_t, id },   /*
 557                          * id: range: 0 - 65534.
 558                          * id 65535 is reserved to indicate a following
 559                          * extended header.
 560                          */
 561   };
 562 };
 563
 564 The end of a type 2 header is aligned on a 16-bit boundary (or 8-bit if
 565 byte-packed).
 566
 567
 568 6.1.4 Extended Type 2 Event Header
 569
 570   - Follows struct event_header_2, which ends on a 16-bit boundary, so we
 571     need to align on 64-bit integer natural alignment (or 8-bit if
 572     byte-packed).
 573   - Fixed size: 96 bits.
 574   - Native architecture byte ordering.
 575
 576 type event_header_2_ext {
 577   parent = struct;
 578   fields = {
 579     { uint64_t, timestamp },    /* 64-bit timestamps */
 580     { uint32_t, id },           /* 32-bit event IDs */
 581   };
 582 };
 583
 584 The end of a type 2 extended header is aligned on the natural alignment of a
 585 32-bit integer (or 8-bit if byte-packed).
 586
 587
 588 6.2 Event Context
 589
 590 The event context contains information relative to the current event. The choice
 591 and meaning of this information is specified by the metadata "section"
 592 information. For this trace format, event context is usually empty, except when
 593 the metadata "section" information specifies otherwise by declaring a non-empty
 594 structure for the event context. An example of event context is to save the
 595 event payload size with each event, or to save the current PID with each event.
 596
 597 6.2.1 Event Context Description
 598
 599 Event context example. These are declared within the section declaration within
 600 the metadata.
 601
 602 type per_section_event_ctx {
 603   parent = struct;
 604   fields = {
 605     { uint32_t, pid },
 606     { uint16_t, payload_size },
 607   };
 608 };
 609
 610
 611 6.3 Event Payload
 612
 613 An event payload contains fields specific to a given event type. The fields
 614 belonging to an event type are described in the event-specific metadata
 615 within a structure type.
 616
 617 6.3.1 Padding
 618
 619 No padding at the end of the event payload. This differs from the ISO/C standard
 620 for structures, but follows the CTF standard for structures. In a trace, even
 621 though it makes sense to align the beginning of a structure, it really makes no
 622 sense to add padding at the end of the structure, because structures are usually
 623 not followed by a structure of the same type.
 624
 625 This trick can be done by adding a zero-length "end" field at the end of the C
 626 structures, and by using the offset of this field rather than using sizeof()
 627 when calculating the size of a structure (see section "A.1 Helper macros").
 628
 629 6.3.2 Alignment
 630
 631 The event payload is aligned on the largest alignment required by types
 632 contained within the payload. (This follows the ISO/C standard for structures)
 633
 634
 635
 636 7. Metadata
 637
 638 The metadata is located in a tracefile section named "metadata". It is made of
 639 "packets", each of which starts with a packet header. The event types within
 640 the metadata section have neither an event header nor an event context. Each
 641 event only contains a null-terminated "string" payload, which is a metadata
 642 description entry. The events are packed one next to another. The packet
 643 header contains, amongst other fields, the session ID and magic
 644 number.
 645
 646 The metadata can be parsed by reading through the metadata strings, skipping
 647 spaces, newlines and null-characters.
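
The skipping step can be sketched as follows. The function name is
hypothetical. Treating tabs and carriage returns as separators is an assumption
of this sketch; the text above only names spaces, newlines and null-characters.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Starting at pos, skip separators in the metadata payload and return
 * the position of the next significant character (or len at the end).
 */
static size_t skip_separators(const char *buf, size_t len, size_t pos)
{
	assert(buf != NULL && pos <= len);
	while (pos < len &&
	       (buf[pos] == ' ' || buf[pos] == '\n' || buf[pos] == '\0' ||
		buf[pos] == '\t' || buf[pos] == '\r'))
		pos++;
	return pos;
}
```

Because null-characters are skipped like whitespace, a declaration may continue
across two consecutive string payloads.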
 648
 649 trace {
 650   major = value;        /* Trace format version */
 651   minor = value;
 652 }
 653
 654 section {
 655   name = section_name;
 656   event {
 657     /* Type 1 - Few event IDs; Type 2 - Many event IDs */
 658     header_type = type1 OR type2;
 659     context {
 660       event_size = true OR false;  /* Includes event size field or not */
 661     }
 662   }
 663 }
 664
 665 event {
 666   name = event_name;
 667   id = value;                   /* Numeric identifier within the section */
 668   section = section_name;
 669   fields = type inheriting from "struct" abstract type.
 670 }
 671
 672 /* More detail on types in section 4. Types */
 673
 674 /* Named types */
 675 type typename {
 676    ...
 677 }
 678
 679 /* Unnamed types, contained within compound type fields */
 680 type {
 681    ...
 682 }
 683
 684 A.1 Helper macros
 685
 686 The following two macros keep track of the size of a GNU/C structure without
 687 padding at the end by placing HEADER_END as the last field. A one-byte end field
 688 is used for C90 compatibility (C99 flexible arrays could be used here). Note
 689 that this does not affect the effective structure size, which should always be
 690 calculated with the header_sizeof() helper.
 691
 692 #define HEADER_END              char end_field
 693 #define header_sizeof(type)     offsetof(typeof(type), end_field)
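
A usage sketch of these macros follows. The example_header structure and the
helper function are hypothetical. The sketch spells typeof as __typeof__, the
GNU C spelling that GCC and Clang also accept in strict ISO modes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEADER_END		char end_field
#define header_sizeof(type)	offsetof(__typeof__(type), end_field)

/* Hypothetical header: 5 bytes of content, then the end marker. */
struct example_header {
	uint32_t a;
	uint8_t b;
	HEADER_END;
};

/* Size of the header content, excluding the C tail padding. */
static size_t example_header_content_size(void)
{
	assert(header_sizeof(struct example_header) <
	       sizeof(struct example_header));
	return header_sizeof(struct example_header);
}
```

On common ABIs where uint32_t is aligned on 4 bytes, header_sizeof() yields 5
for this structure, whereas sizeof() yields 8.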