+
+RFC: Common Trace Format Proposal for Linux (v1)
+
+Mathieu Desnoyers, EfficiOS Inc.
+
+The goal of the present document is to propose a trace format that suits the
+needs of the embedded, telecom, high-performance and kernel communities. It is
+based on the Common Trace Format Requirements (v1.4) document. It is designed to
+be natively generated by tracing of a Linux kernel and Linux user-space
+applications written in C/C++.
+
+A reference implementation of a library to read and write this trace format is
+being developed within the BabelTrace project, a converter between trace
+formats. The development tree is available at:
+
+ git tree: git://git.efficios.com/babeltrace.git
+ gitweb: http://git.efficios.com/?p=babeltrace.git
+
+
+1. Preliminary definitions
+
+ - Trace: An ordered sequence of events.
+ - Section: Group of events, containing a subset of the trace event types.
+ - Packet: A sequence of physically contiguous events within a section.
+ - Event: This is the basic entry in a trace. (aka: a trace record).
+ - An event identifier (ID) relates to the class (a type) of event within
+ a section.
+ e.g. section: high_throughput, event: irq_entry.
+ - An event (or event record) relates to a specific instance of an event
+ class.
+ e.g. section: high_throughput, event: irq_entry, at time X, on CPU Y
+
+
+2. High-level representation of a trace
+
+A trace is divided into multiple trace streams, each representing an information
+stream specific to:
+
+ - a section,
+ - a processor.
+
+A trace "section" consists of a collection of trace streams (typically one trace
+stream per cpu) containing a subset of the trace event types.
+
+Because each trace stream is appended to while a trace is being recorded, each
+is associated with a separate file for disk output. Therefore, a trace stored to
+disk can be represented as a directory containing one file per trace stream.
+
+A metadata section contains information on trace event types. It describes:
+
+- Trace version.
+- Types available.
+- Per-section event header description.
+- Per-section event header selection.
+- Per-section event context fields.
+- Per-event
+ - Event type to section mapping.
+ - Event type to name mapping.
+ - Event type to ID mapping.
+ - Event fields description.
+
+
+3. Trace Section
+
+A trace section is divided into contiguous packets of variable size. These
+subdivisions allow the trace analyzer to perform a fast binary search by time
+within the section (typically requiring only the packet headers to be indexed)
+without reading the whole section. The packets have a variable size so that
+padding need not be transferred when partially filled packets must be sent
+while streaming a trace for live viewing/analysis. Dividing sections
+into packets is also useful for network streaming over UDP and flight recorder
+mode tracing (a whole packet can be swapped out of the buffer atomically for
+reading).
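+
+As an illustration of the time-based lookup described above, here is a minimal
+C sketch of a binary search over an index of packet headers. The
+packet_index_entry structure and find_packet() are hypothetical names, not part
+of the format; the sketch assumes the index is sorted by time and that packet
+time ranges do not overlap.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory index entry built from one packet header. */
struct packet_index_entry {
	uint64_t timestamp_begin;	/* first timestamp covered by the packet */
	uint64_t timestamp_end;		/* last timestamp covered by the packet */
	uint64_t file_offset;		/* where the packet starts in the file */
};

/*
 * Return the position of the packet whose [begin, end] range contains ts,
 * or -1 if no packet covers it. Entries must be sorted by time and must
 * not overlap.
 */
static inline long find_packet(const struct packet_index_entry *idx,
			       size_t count, uint64_t ts)
{
	size_t lo = 0, hi = count;	/* search window is [lo, hi) */

	assert(idx != NULL || count == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ts < idx[mid].timestamp_begin)
			hi = mid;
		else if (ts > idx[mid].timestamp_end)
			lo = mid + 1;
		else
			return (long) mid;
	}
	return -1;
}
```
+
+Only the index is touched during the search; the packet found at file_offset
+is then the only one that needs to be read.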
+
+The section header is repeated at the beginning of each packet to allow
+flexibility in terms of:
+
+ - streaming support,
+ - allowing arbitrary buffers to be discarded without making the trace
+ unreadable,
+ - handling UDP packet loss, either by dealing with the missing packet or by
+ asking for re-transmission,
+ - transparently supporting flight recorder mode,
+ - transparently supporting crash dump.
+
+The section header will therefore be referred to as the "packet header"
+throughout the rest of this document.
+
+
+4. Types
+
+4.1 Basic types
+
+A basic type is a scalar type, as described in this section.
+
+4.1.1 Type inheritance
+
+Type specifications can be inherited to allow deriving concrete types from an
+abstract type. For example, see the uint32_t type derived from the "integer"
+abstract type below ("Integers" section). Concrete types have a precise binary
+representation in the trace. Abstract types have methods to read and write these
+types, but must be derived into a concrete type to be usable in an event field.
+
+Concrete types inherit from abstract types. Abstract types can inherit from
+other abstract types.
+
+4.1.2 Alignment
+
+We define "byte-packed" types as aligned on the byte size, namely 8-bit.
+We define "bit-packed" types as following on the next bit, as defined by the
+"bitfields" section.
+We define "natural alignment" of a basic type as the lesser value between the
+type size and the architecture word size.
+
+All basic types, except bitfields, are either aligned on their "natural"
+alignment or byte-packed, depending on the architecture preference.
+Architectures providing fast unaligned writes byte-pack basic types to save
+space, aligning each type on byte boundaries (8-bit). Architectures with slow
+unaligned writes align types on the lesser value between their size and the
+architecture word size (the type "natural" alignment on the architecture).
+
+Note that the natural alignment for 64-bit integers and double-precision
+floating point values is fixed to 32-bit on a 32-bit architecture, but to 64-bit
+for a 64-bit architecture.
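+
+The alignment rule above can be sketched in C as follows. Both helper names are
+hypothetical, sizes are in bits, and the architecture word size is passed in
+explicitly rather than detected.
+
```c
#include <assert.h>
#include <stdint.h>

/* Natural alignment, in bits: the lesser of type size and word size. */
static inline uint32_t natural_align_bits(uint32_t type_size_bits,
					  uint32_t word_size_bits)
{
	return type_size_bits < word_size_bits ? type_size_bits
					       : word_size_bits;
}

/*
 * Round a bit offset up to the next multiple of align_bits.
 * align_bits must be a non-zero power of two.
 */
static inline uint64_t align_offset_bits(uint64_t offset_bits,
					 uint32_t align_bits)
{
	assert(align_bits != 0 && (align_bits & (align_bits - 1)) == 0);
	return (offset_bits + align_bits - 1) & ~((uint64_t) align_bits - 1);
}
```
+
+For instance, a 64-bit integer gets a 32-bit alignment with a 32-bit word size,
+and a field placed at bit offset 40 with that alignment starts at bit 64.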
+
+Metadata attribute representation:
+
+ align = value; /* value in bits */
+
+4.1.3 Byte order
+
+By default, the target architecture endianness is used. Byte order can be
+overridden for a basic type by specifying a "byte_order" attribute. A typical
+use case is to specify the network byte order (big endian: "be") to save data
+captured from the network into the trace without conversion. If not specified,
+the byte order is native.
+
+Metadata representation:
+
+ byte_order = native OR network OR be OR le; /* network and be are aliases */
+
+4.1.4 Size
+
+Type size, in bits, for integers and floats is that returned by "sizeof()" in C
+multiplied by CHAR_BIT.
+We require the size of "char" and "unsigned char" types (CHAR_BIT) to be fixed
+to 8 bits for cross-endianness compatibility.
+
+Metadata representation:
+
+ size = value; /* value in bits */
+
+4.1.5 Integers
+
+Signed integers are represented in two's complement. Integer alignment, size,
+signedness and byte ordering are defined in the metadata. Integers aligned on
+byte size (8-bit) and with length multiple of byte size (8-bit) correspond to
+the C99 standard integers. In addition, integers with alignment and/or size that
+are _not_ a multiple of the byte size are permitted; these correspond to the C99
+standard bitfields, with the added specification that the CTF integer bitfields
+have a fixed binary representation. An MIT-licensed reference implementation of
+the CTF portable bitfields is available at:
+
+ http://git.efficios.com/?p=babeltrace.git;a=blob;f=include/babeltrace/bitfield.h
+
+Binary representation of integers:
+
+- On little and big endian:
+ - Within a byte, high bits correspond to an integer's high bits, and low bits
+ correspond to low bits.
+- On little endian:
+ - Integers spanning multiple bytes are placed from the least significant to
+ the most significant.
+ - Consecutive integers are placed from lower bits to higher bits (even within
+ a byte).
+- On big endian:
+ - Integers spanning multiple bytes are placed from the most significant to the
+ least significant.
+ - Consecutive integers are placed from higher bits to lower bits (even within
+ a byte).
+
+This binary representation is derived from the bitfield implementation in GCC
+for little and big endian. However, contrary to what GCC does, integers can
+cross unit boundaries (no padding is required). Padding can be explicitly
+added (see 4.1.6 GNU/C bitfields) to follow the GCC layout if needed.
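+
+The little endian layout above, including integers that cross byte boundaries,
+can be sketched bit by bit as follows. This is only an illustration under the
+assumption of a little endian trace; it is not the BabelTrace bitfield.h
+implementation, and the function names are hypothetical.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write the low `len` bits of `value` at bit offset `start` of `buf`, using
 * the little endian layout: bit 0 of the value lands on the lowest free bit.
 */
static inline void bitfield_write_le(uint8_t *buf, size_t start, unsigned len,
				     uint64_t value)
{
	unsigned i;

	assert(len >= 1 && len <= 64);
	for (i = 0; i < len; i++) {
		size_t pos = start + i;
		uint8_t mask = (uint8_t) (1u << (pos % 8));

		if ((value >> i) & 1)
			buf[pos / 8] |= mask;
		else
			buf[pos / 8] &= (uint8_t) ~mask;
	}
}

/* Read back `len` bits starting at bit offset `start` (unsigned result). */
static inline uint64_t bitfield_read_le(const uint8_t *buf, size_t start,
					unsigned len)
{
	uint64_t value = 0;
	unsigned i;

	assert(len >= 1 && len <= 64);
	for (i = 0; i < len; i++) {
		size_t pos = start + i;

		if ((buf[pos / 8] >> (pos % 8)) & 1)
			value |= (uint64_t) 1 << i;
	}
	return value;
}
```
+
+Writing the 12-bit value 0xABC at bit 0 followed by the 5-bit value 0x15 at
+bit 12 yields the bytes 0xBC, 0x5A, 0x01: the second integer starts in the
+upper half of the second byte and spills into the third.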
+
+Metadata representation:
+
+ abstract_type integer {
+ signed = true OR false; /* default false */
+ byte_order = native OR network OR be OR le; /* default native */
+ size = value; /* value in bits, no default */
+ align = value; /* value in bits */
+ }
+
+Example of type inheritance (creation of a concrete type uint32_t):
+
+type uint32_t {
+ parent = integer;
+ size = 32;
+ signed = false;
+ align = 32;
+}
+
+Definition of a 5-bit signed bitfield:
+
+type int5_t {
+ parent = integer;
+ size = 5;
+ signed = true;
+ align = 1;
+}
+
+4.1.6 GNU/C bitfields
+
+The GNU/C bitfields follow closely the integer representation, with a
+particularity on alignment: if a bitfield cannot fit in the current unit, the
+unit is padded and the bitfield starts at the following unit. We therefore need
+to express the extra "unit size" information.
+
+Metadata representation:
+
+abstract_type gcc_bitfield {
+ parent = integer;
+ unit_size = value;
+}
+
+As an example, the following C structure, compiled by GCC:
+
+struct example {
+ short a:12;
+ short b:5;
+};
+
+corresponds to the following structure, aligned on the largest element
+(short). The second bitfield is aligned on the next unit boundary, because
+it does not fit in the current unit.
+
+type struct_example {
+ parent = struct;
+ fields = {
+ {
+ type {
+ parent = gcc_bitfield;
+ unit_size = 16; /* sizeof(short) * CHAR_BIT */
+ size = 12;
+ signed = true;
+ align = 1;
+ },
+ a,
+ },
+ {
+ type {
+ parent = gcc_bitfield;
+ unit_size = 16; /* sizeof(short) * CHAR_BIT */
+ size = 5;
+ signed = true;
+ align = 1;
+ },
+ b,
+ },
+ };
+}
+
+4.1.7 Floating point
+
+The floating point values byte ordering is defined in the metadata.
+
+Floating point values follow the IEEE 754-2008 standard interchange formats.
+The description of a floating point type includes the exponent and mantissa
+sizes in bits. Some requirements are imposed on the floating point values:
+
+- FLT_RADIX must be 2.
+- mant_dig is the number of digits represented in the mantissa. It is specified
+ by the ISO C99 standard, section 5.2.4, as FLT_MANT_DIG, DBL_MANT_DIG and
+ LDBL_MANT_DIG as defined by <float.h>.
+- exp_dig is the number of digits represented in the exponent. Given that
+ mant_dig is one bit more than its actual size in bits (leading 1 is not
+ needed) and also given that the sign bit always takes one bit, exp_dig can be
+ specified as:
+
+ - sizeof(float) * CHAR_BIT - FLT_MANT_DIG
+ - sizeof(double) * CHAR_BIT - DBL_MANT_DIG
+ - sizeof(long double) * CHAR_BIT - LDBL_MANT_DIG
+
+Metadata representation:
+
+abstract_type floating_point {
+ exp_dig = value;
+ mant_dig = value;
+ byte_order = native OR network OR be OR le;
+}
+
+Example of type inheritance:
+
+type float {
+ parent = floating_point;
+ exp_dig = 8; /* sizeof(float) * CHAR_BIT - FLT_MANT_DIG */
+ mant_dig = 24; /* FLT_MANT_DIG */
+ byte_order = native;
+}
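+
+The exp_dig formula above can be checked against <float.h> with a small C
+helper. The helper name is hypothetical, and the sketch assumes the type has no
+padding bits, which holds for the IEEE 754 binary32 and binary64 formats.
+
```c
#include <assert.h>
#include <float.h>
#include <limits.h>
#include <stddef.h>

/* exp_dig as defined above: total size in bits minus mant_dig. */
static inline unsigned exp_dig(size_t type_size_bytes, unsigned mant_dig)
{
	assert(type_size_bytes * CHAR_BIT > mant_dig);
	return (unsigned) (type_size_bytes * CHAR_BIT - mant_dig);
}
```
+
+On such platforms, exp_dig(sizeof(float), FLT_MANT_DIG) gives 8 and
+exp_dig(sizeof(double), DBL_MANT_DIG) gives 11.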
+
+TODO: define NaN, +inf, -inf behavior.
+
+4.1.8 Enumerations
+
+Enumerations are a mapping between an integer type and a table of strings. The
+numerical representation of the enumeration follows the integer type specified
+by the metadata. The enumeration mapping table is detailed in the enumeration
+description within the metadata.
+
+abstract_type enum {
+ parent = integer;
+ map = {
+ { value , string },
+ { value , string },
+ { value , string },
+ ...
+ };
+}
+
+
+4.2 Compound types
+
+4.2.1 Structures
+
+Structures are aligned on the largest alignment required by basic types
+contained within the structure. (This follows the ISO/C standard for structures)
+
+Metadata representation:
+
+abstract_type struct {
+ fields = {
+ { field_type, field_name },
+ { field_type, field_name },
+ ...
+ };
+}
+
+Example:
+
+type struct_example {
+ parent = struct;
+ fields = {
+ {
+ type { /* Nameless type */
+ parent = integer;
+ size = 16;
+ signed = true;
+ align = 16;
+ },
+ first_field_name,
+ },
+ {
+ uint64_t, /* Named type declared in the metadata */
+ second_field_name,
+ }
+ };
+}
+
+The fields are placed in a sequence next to each other. They each possess a
+field name, which is a unique identifier within the structure.
+
+4.2.2 Arrays
+
+Arrays are fixed-length. Their length is declared in the type declaration within
+the metadata. They contain an array of "inner type" elements, which can refer to
+any type not containing the type of the array being declared (no circular
+dependency).
+
+Metadata representation:
+
+abstract_type array {
+ length = value;
+ elem_type = type;
+}
+
+E.g.:
+
+type example_array {
+ parent = array;
+ length = 10;
+ elem_type = uint32_t;
+}
+
+4.2.3 Sequences
+
+Sequences are dynamically-sized arrays. They start with an integer that
+specifies the length of the sequence, followed by an array of "inner type"
+elements.
+
+abstract_type sequence {
+ length_type = type; /* Inheriting from integer */
+ elem_type = type;
+}
+
+The integer type follows the integer types specifications, and the sequence
+elements follow the "array" specifications.
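+
+A minimal C sketch of decoding a sequence, assuming a byte-packed layout, a
+16-bit unsigned length type, 32-bit unsigned elements and native byte order.
+These choices and the function name are assumptions made for illustration; the
+actual types come from the metadata.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Decode a byte-packed sequence: a uint16_t length, then `length` uint32_t
 * elements. Copies at most `max` elements into `out`, stores the element
 * count in `length`, and returns the number of bytes the sequence occupies
 * in `buf`, or 0 if `buf` is too short to hold it.
 */
static inline size_t sequence_read_u32(const uint8_t *buf, size_t buf_len,
				       uint32_t *out, size_t max,
				       uint16_t *length)
{
	size_t total, n;

	assert(buf != NULL && out != NULL && length != NULL);
	if (buf_len < sizeof(*length))
		return 0;
	memcpy(length, buf, sizeof(*length));	/* length prefix comes first */
	total = sizeof(*length) + (size_t) *length * sizeof(uint32_t);
	if (buf_len < total)
		return 0;
	n = *length < max ? *length : max;
	memcpy(out, buf + sizeof(*length), n * sizeof(uint32_t));
	return total;
}
```
+
+The returned byte count tells the reader where the next field starts, since
+the size of a sequence is only known once its length prefix has been read.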
+
+4.2.4 Strings
+
+Strings are an array of bytes of variable size and are terminated by a '\0'
+"NUL" character. Their encoding is described in the metadata. In the absence of
+an encoding attribute, the default encoding is UTF-8.
+
+abstract_type string {
+ encoding = UTF8 OR ASCII;
+}
+
+
+5. Trace Packet Header
+
+- Aligned on page size. Fixed size. Fields aligned on their natural size or
+ packed (depending on the architecture preference).
+ No padding at the end of the trace packet header. Native architecture byte
+ ordering.
+- Magic number (CTF magic number: 0xC1FC1FC1; its reverse endianness
+ representation is 0xC11FFCC1). It needs to have a non-symmetric bytewise
+ representation. It is used to distinguish between big and little endian traces
+ (this information is determined by knowing the endianness of the architecture
+ reading the trace and comparing the magic number against its value and the
+ reverse, 0xC11FFCC1). This magic number specifies that we use the CTF metadata
+ description language described in this document. Different magic numbers
+ should be used for other metadata description languages.
+- Session ID, used to ensure the packet matches the metadata used.
+ (note: we cannot use a metadata checksum because metadata can be appended to
+ while tracing is active)
+- Packet content size (in bytes).
+- Packet size (in bytes, includes padding).
+- Packet content checksum (optional). Checksum excludes the packet header.
+- Per-section packet sequence count (to deal with UDP packet loss). The number
+ of significant sequence counter bits should also be present, so wrap-arounds
+ are dealt with correctly.
+- Timestamp at the beginning and end of the packet. This range should include
+ all event timestamps contained therein.
+- Events discarded count
+ - Snapshot of a per-section free-running counter, counting the number of
+ events discarded that were supposed to be written in the section prior to
+ the first event in the packet.
+ * Note: producer-consumer buffer full condition should fill the current
+ packet with padding so we know exactly where events have been
+ discarded.
+- Lossless compression scheme used for the packet content. Applied directly to
+ raw data.
+ 0: no compression scheme
+ 1: bzip2
+ 2: gzip
+- Cipher used for the packet content. Applied after compression.
+ 0: no encryption
+ 1: AES
+- Checksum scheme used for the packet content. Applied after encryption.
+ 0: no checksum
+ 1: md5
+ 2: sha1
+ 3: crc32
+
+type packet_header {
+ parent = struct;
+ fields = {
+ { uint32_t, magic },
+ { uint32_t, session_id },
+ { uint32_t, content_size },
+ { uint32_t, packet_size },
+ { uint32_t, checksum },
+ { uint32_t, section_packet_count },
+ { uint64_t, timestamp_begin },
+ { uint64_t, timestamp_end },
+ { uint32_t, events_discarded },
+ { uint8_t, section_packet_count_bits }, /* Significant counter bits */
+ { uint8_t, compression_scheme },
+ { uint8_t, encryption_scheme },
+ { uint8_t, checksum_scheme },
+ };
+};
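+
+Two of the fields above lend themselves to a short C sketch: byte order
+detection from the magic number, and packet loss detection from the sequence
+count and its number of significant bits. All identifiers below are
+hypothetical; the sketch assumes the magic field was loaded as a native 32-bit
+integer and that packets arrive in order.
+
```c
#include <assert.h>
#include <stdint.h>

#define CTF_MAGIC		0xC1FC1FC1u	/* value from the list above */
#define CTF_MAGIC_SWAPPED	0xC11FFCC1u	/* same bytes, reversed */

enum trace_byte_order {
	TRACE_ORDER_NATIVE,	/* same endianness as the reader */
	TRACE_ORDER_REVERSED,	/* opposite endianness */
	TRACE_ORDER_INVALID,	/* not a CTF packet header */
};

/* Reverse the four bytes of a 32-bit value. */
static inline uint32_t bswap32(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
	       ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* `magic` is the first header field, loaded in the reader's native order. */
static inline enum trace_byte_order detect_byte_order(uint32_t magic)
{
	if (magic == CTF_MAGIC)
		return TRACE_ORDER_NATIVE;
	if (magic == CTF_MAGIC_SWAPPED)
		return TRACE_ORDER_REVERSED;
	return TRACE_ORDER_INVALID;
}

/*
 * Number of packets missing between two consecutively received packets,
 * given their sequence counts and the number of significant counter bits.
 */
static inline uint32_t packets_lost(uint32_t prev_count, uint32_t cur_count,
				    unsigned count_bits)
{
	uint32_t mask;

	assert(count_bits >= 1 && count_bits <= 32);
	mask = count_bits == 32 ? UINT32_MAX
				: ((uint32_t) 1 << count_bits) - 1;
	return (cur_count - prev_count - 1) & mask;
}
```
+
+Masking the difference with the significant bits is what makes the wrap-around
+case work: with an 8-bit counter, receiving count 1 right after count 254
+reports two lost packets (255 and 0).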
+
+
+6. Event Structure
+
+The overall structure of an event is:
+
+ - Event Header (as specified by the section metadata)
+ - Extended Event Header (as specified by the event header)
+ - Event Context (as specified by the section metadata)
+ - Event Payload (as specified by the event metadata)
+
+
+6.1 Event Header
+
+One major factor can vary between sections: the number of event IDs assigned to
+a section. Luckily, this information tends to stay relatively constant (modulo
+event registration while trace is being recorded), so we can specify different
+representations for sections containing few event IDs and sections containing
+many event IDs, so we end up representing the event ID and timestamp as densely
+as possible in each case.
+
+We therefore provide two types of event headers. Type 1 accommodates sections
+with fewer than 31 event IDs. Type 2 accommodates sections with 31 or more event
+IDs.
+
+The "extended headers" are used in the rare occasions where the information
+cannot be represented in the ranges available in the event header.
+
+Types uintX_t represent an X-bit unsigned integer.
+
+
+6.1.1 Type 1 - Few event IDs
+
+ - Aligned on 32-bit (or 8-bit if byte-packed, depending on the architecture
+ preference).
+ - Fixed size: 32 bits.
+ - Native architecture byte ordering.
+
+type event_header_1 {
+ parent = struct;
+ fields = {
+ { uint5_t, id }, /*
+ * id: range: 0 - 30.
+ * id 31 is reserved to indicate a following
+ * extended header.
+ */
+ { uint27_t, timestamp },
+ };
+};
+
+The end of a type 1 header is aligned on a 32-bit boundary (or packed).
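+
+A minimal C sketch of packing and unpacking a type 1 header held in a native
+32-bit word. It assumes the little endian layout of section 4.1.5, where the
+first field (id) occupies the lowest bits; the function and macro names are
+hypothetical.
+
```c
#include <assert.h>
#include <stdint.h>

#define EVENT_ID_BITS		5
#define EVENT_ID_EXTENDED	31u		/* reserved: extended header */
#define EVENT_TS_MASK		0x07FFFFFFu	/* low 27 timestamp bits */

/*
 * Pack a type 1 header into one 32-bit word. Assumed layout: id in the five
 * lowest bits, truncated timestamp in the 27 bits above it.
 */
static inline uint32_t event_header_1_pack(uint32_t id, uint64_t timestamp)
{
	assert(id < EVENT_ID_EXTENDED);	/* 0 - 30 only; 31 is reserved */
	return id | (((uint32_t) timestamp & EVENT_TS_MASK) << EVENT_ID_BITS);
}

static inline uint32_t event_header_1_id(uint32_t header)
{
	return header & ((1u << EVENT_ID_BITS) - 1);
}

static inline uint32_t event_header_1_timestamp(uint32_t header)
{
	return header >> EVENT_ID_BITS;
}

/* True when the reserved id announces a following extended header. */
static inline int event_header_1_is_extended(uint32_t header)
{
	return event_header_1_id(header) == EVENT_ID_EXTENDED;
}
```
+
+Only the low 27 bits of the timestamp survive packing, which is why a reader
+needs the extended header whenever the full value cannot be recovered from
+context.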
+
+
+6.1.2 Extended Type 1 Event Header
+
+ - Follows struct event_header_1, which is aligned on 32-bit, so no need to
+ realign.
+ - Fixed size: 96 bits.
+ - Native architecture byte ordering.
+
+type event_header_1_ext {
+ parent = struct;
+ fields = {
+ { uint32_t, id }, /* 32-bit event IDs */
+ { uint64_t, timestamp }, /* 64-bit timestamps */
+ };
+};
+
+The end of a type 1 extended header is aligned on the natural alignment of a
+64-bit integer (or 8-bit if byte-packed).
+
+
+6.1.3 Type 2 - Many event IDs
+
+ - Aligned on 32-bit (or 8-bit if byte-packed, depending on the architecture
+ preference).
+ - Fixed size: 48 bits.
+ - Native architecture byte ordering.
+
+type event_header_2 {
+ parent = struct;
+ fields = {
+ { uint32_t, timestamp },
+ { uint16_t, id }, /*
+ * id: range: 0 - 65534.
+ * id 65535 is reserved to indicate a following
+ * extended header.
+ */
+ };
+};
+
+The end of a type 2 header is aligned on a 16-bit boundary (or 8-bit if
+byte-packed).
+
+
+6.1.4 Extended Type 2 Event Header
+
+ - Follows struct event_header_2, whose end is aligned on a 16-bit boundary, so
+ we need to align on 64-bit integer natural alignment (or 8-bit if
+ byte-packed).
+ - Fixed size: 96 bits.
+ - Native architecture byte ordering.
+
+type event_header_2_ext {
+ parent = struct;
+ fields = {
+ { uint64_t, timestamp }, /* 64-bit timestamps */
+ { uint32_t, id }, /* 32-bit event IDs */
+ };
+};
+
+The end of a type 2 extended header is aligned on the natural alignment of a
+32-bit integer (or 8-bit if byte-packed).
+
+
+6.2 Event Context
+
+The event context contains information relative to the current event. The choice
+and meaning of this information is specified by the metadata "section"
+information. For this trace format, event context is usually empty, except when
+the metadata "section" information specifies otherwise by declaring a non-empty
+structure for the event context. Examples of event context are the event
+payload size or the current PID, saved with each event.
+
+6.2.1 Event Context Description
+
+Event context example. These are declared within the section declaration within
+the metadata.
+
+type per_section_event_ctx {
+ parent = struct;
+ fields = {
+ { uint, pid },
+ { uint16_t, payload_size },
+ };
+};
+
+
+6.3 Event Payload
+
+An event payload contains fields specific to a given event type. The fields
+belonging to an event type are described in the event-specific metadata
+within a structure type.
+
+6.3.1 Padding
+
+No padding at the end of the event payload. This differs from the ISO/C standard
+for structures, but follows the CTF standard for structures. In a trace, even
+though it makes sense to align the beginning of a structure, it really makes no
+sense to add padding at the end of the structure, because structures are usually
+not followed by a structure of the same type.
+
+This trick can be done by adding an "end" field at the end of the C structures
+(one byte long for C90 compatibility), and by using the offset of this field
+rather than using sizeof() when calculating the size of a structure (see section
+"A.1 Helper macros").
+
+6.3.2 Alignment
+
+The event payload is aligned on the largest alignment required by types
+contained within the payload. (This follows the ISO/C standard for structures)
+
+
+
+7. Metadata
+
+The metadata is located in a tracefile section named "metadata". It is made of
+"packets", each of which starts with a packet header containing, amongst other
+fields, the session ID and magic number. Events within the metadata section
+have neither an event header nor an event context. Each event only
+contains a null-terminated "string" payload, which is a metadata description
+entry. The events are packed one next to another.
+
+The metadata can be parsed by reading through the metadata strings, skipping
+spaces, newlines and null-characters.
+
+trace {
+ major = value; /* Trace format version */
+ minor = value;
+}
+
+section {
+ name = section_name;
+ event {
+ /* Type 1 - Few event IDs; Type 2 - Many event IDs */
+ header_type = type1 OR type2;
+ context {
+ event_size = true OR false; /* Includes event size field or not */
+ }
+ }
+}
+
+event {
+ name = event_name;
+ id = value; /* Numeric identifier within the section */
+ section = section_name;
+ fields = type inheriting from "struct" abstract type.
+}
+
+/* More detail on types in section 4. Types */
+
+/* Named types */
+type typename {
+ ...
+}
+
+/* Unnamed types, contained within compound type fields */
+type {
+ ...
+}
+
+A.1 Helper macros
+
+The following two macros keep track of the size of a GNU/C structure without
+padding at the end by placing HEADER_END as the last field. A one-byte end field
+is used for C90 compatibility (C99 flexible arrays could be used here). Note
+that this does not affect the effective structure size, which should always be
+calculated with the header_sizeof() helper.
+
+#include <stddef.h> /* offsetof(); typeof() below is a GNU C extension */
+#define HEADER_END char end_field
+#define header_sizeof(type) offsetof(typeof(type), end_field)
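+
+As a usage sketch, assuming GCC or Clang, the macros can be applied to a
+hypothetical payload structure as follows. __typeof__ is used here instead of
+typeof so that the example also builds in strict ISO C modes.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEADER_END		char end_field
/* __typeof__ is the GCC/Clang spelling accepted in strict ISO modes. */
#define header_sizeof(type)	offsetof(__typeof__(type), end_field)

/* Hypothetical event payload, for illustration only. */
struct example_payload {
	uint64_t address;
	uint16_t flags;
	HEADER_END;	/* marks where the meaningful data stops */
};

/* Number of bytes to copy into the trace for one payload. */
static inline size_t example_payload_trace_size(void)
{
	size_t size = header_sizeof(struct example_payload);

	assert(size <= sizeof(struct example_payload));
	return size;
}
```
+
+With an 8-byte address and a 2-byte flags field, the helper returns 10, whereas
+sizeof() also counts the end field and any trailing padding.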