| Keith Wyss | 3d0bc9e | 2017-08-02 21:47:27 +0000 | [diff] [blame] | 1 | ====================================== | 
|  | 2 | XRay Flight Data Recorder Trace Format | 
|  | 3 | ====================================== | 
|  | 4 |  | 
|  | 5 | :Version: 1 as of 2017-07-20 | 
|  | 6 |  | 
|  | 7 | .. contents:: | 
|  | 8 | :local: | 
|  | 9 |  | 
|  | 10 |  | 
|  | 11 | Introduction | 
|  | 12 | ============ | 
|  | 13 |  | 
|  | 14 | When gathering XRay traces in Flight Data Recorder mode, each thread of an | 
|  | 15 | application will claim buffers to fill with trace data, which at some point | 
|  | 16 | is finalized and flushed. | 
|  | 17 |  | 
| Eric Christopher | 66c5bbc | 2018-05-27 09:19:03 +0000 | [diff] [blame] | 18 | A goal of the profiler is to minimize overhead, the flushed data directly | 
| Keith Wyss | 3d0bc9e | 2017-08-02 21:47:27 +0000 | [diff] [blame] | 19 | corresponds to the buffer. | 
|  | 20 |  | 
|  | 21 | This document describes the format of a trace file. | 
|  | 22 |  | 
|  | 23 |  | 
|  | 24 | General | 
|  | 25 | ======= | 
|  | 26 |  | 
|  | 27 | Each trace file corresponds to a sequence of events in a particular thread. | 
|  | 28 |  | 
|  | 29 | The file has a header followed by a sequence of discriminated record types. | 
|  | 30 |  | 
| Sylvestre Ledru | 3655495 | 2017-08-12 15:08:11 +0000 | [diff] [blame] | 31 | The endianness of byte fields matches the endianess of the platform which | 
| Keith Wyss | 3d0bc9e | 2017-08-02 21:47:27 +0000 | [diff] [blame] | 32 | produced the trace file. | 
|  | 33 |  | 
|  | 34 |  | 
|  | 35 | Header Section | 
|  | 36 | ============== | 
|  | 37 |  | 
|  | 38 | A trace file begins with a 32 byte header. | 
|  | 39 |  | 
|  | 40 | +-------------------+-----------------+----------------------------------------+ | 
|  | 41 | | Field             | Size (bytes)    | Description                            | | 
|  | 42 | +===================+=================+========================================+ | 
|  | 43 | | version           | ``2``           | Anticipates versioned  readers. This   | | 
|  | 44 | |                   |                 | document describes the format when     | | 
|  | 45 | |                   |                 | version == 1                           | | 
|  | 46 | +-------------------+-----------------+----------------------------------------+ | 
|  | 47 | | type              | ``2``           | An enumeration encoding the type of    | | 
|  | 48 | |                   |                 | trace. Flight Data Recorder mode       | | 
|  | 49 | |                   |                 | traces have type == 1                  | | 
|  | 50 | +-------------------+-----------------+----------------------------------------+ | 
|  | 51 | | bitfield          | ``4``           | Holds parameters that are not aligned  | | 
|  | 52 | |                   |                 | to bytes. Further described below.     | | 
|  | 53 | +-------------------+-----------------+----------------------------------------+ | 
|  | 54 | | cycle_frequency   | ``8``           | The frequency in hertz of the CPU      | | 
|  | 55 | |                   |                 | oscillator used to measure duration of | | 
|  | 56 | |                   |                 | events in ticks.                       | | 
|  | 57 | +-------------------+-----------------+----------------------------------------+ | 
|  | 58 | | buffer_size       | ``8``           | The size in bytes of the data portion  | | 
|  | 59 | |                   |                 | of the trace following the header.     | | 
|  | 60 | +-------------------+-----------------+----------------------------------------+ | 
|  | 61 | | reserved          | ``8``           | Reserved for future use.               | | 
|  | 62 | +-------------------+-----------------+----------------------------------------+ | 
|  | 63 |  | 
|  | 64 | The bitfield parameter of the file header is composed of the following fields. | 
|  | 65 |  | 
|  | 66 | +-------------------+----------------+-----------------------------------------+ | 
|  | 67 | | Field             | Size (bits)    | Description                             | | 
|  | 68 | +===================+================+=========================================+ | 
|  | 69 | | constant_tsc      | ``1``          | Whether the platform's timestamp        | | 
|  | 70 | |                   |                | counter used to record ticks between    | | 
|  | 71 | |                   |                | events ticks at a constant frequency    | | 
|  | 72 | |                   |                | despite CPU frequency changes.          | | 
|  | 73 | |                   |                | 0 == non-constant. 1 == constant.       | | 
|  | 74 | +-------------------+----------------+-----------------------------------------+ | 
|  | 75 | | nonstop_tsc       | ``1``          | Whether the tsc continues to count      | | 
|  | 76 | |                   |                | despite whether the CPU is in a low     | | 
|  | 77 | |                   |                | power state. 0 == stop. 1 == non-stop.  | | 
|  | 78 | +-------------------+----------------+-----------------------------------------+ | 
|  | 79 | | reserved          | ``30``         | Not meaningful.                         | | 
|  | 80 | +-------------------+----------------+-----------------------------------------+ | 
|  | 81 |  | 
|  | 82 |  | 
|  | 83 | Data Section | 
|  | 84 | ============ | 
|  | 85 |  | 
|  | 86 | Following the header in a trace is a data section with size matching the | 
|  | 87 | buffer_size field in the header. | 
|  | 88 |  | 
|  | 89 | The data section is a stream of elements of different types. | 
|  | 90 |  | 
|  | 91 | There are a few categories of data in the sequence. | 
|  | 92 |  | 
|  | 93 | - ``Function Records``: Function Records contain the timing of entry into and | 
|  | 94 | exit from function execution. Function Records have 8 bytes each. | 
|  | 95 |  | 
|  | 96 | - ``Metadata Records``: Metadata records serve many purposes. Mostly, they | 
|  | 97 | capture information that may be too costly to record for each function, but | 
|  | 98 | that is required to contextualize the fine-grained timings. They also are used | 
|  | 99 | as markers for user-defined Event Data payloads. Metadata records have 16 | 
|  | 100 | bytes each. | 
|  | 101 |  | 
|  | 102 | - ``Event Data``: Free form data may be associated with events that are traced | 
|  | 103 | by the binary and encode data defined by a handler function. Event data is | 
|  | 104 | always preceded with a marker record which indicates how large it is. | 
|  | 105 |  | 
|  | 106 | - ``Function Arguments``: The arguments to some functions are included in the | 
|  | 107 | trace. These are either pointer addresses or primitives that are read and | 
|  | 108 | logged independently of their types in a high level language. To the tracer, | 
| Eric Christopher | 66c5bbc | 2018-05-27 09:19:03 +0000 | [diff] [blame] | 109 | they are all numbers. Function Records that have attached arguments will | 
|  | 110 | indicate their presence on the function entry record. We only support logging | 
|  | 111 | contiguous function argument sequences starting with argument zero, which will | 
|  | 112 | be the "this" pointer for member function invocations. For example, we don't | 
|  | 113 | support logging the first and third argument. | 
| Keith Wyss | 3d0bc9e | 2017-08-02 21:47:27 +0000 | [diff] [blame] | 114 |  | 
|  | 115 | A reader of the memory format must maintain a state machine. The format makes no | 
|  | 116 | attempt to pad for alignment, and it is not seekable. | 
|  | 117 |  | 
|  | 118 |  | 
|  | 119 | Function Records | 
|  | 120 | ---------------- | 
|  | 121 |  | 
|  | 122 | Function Records have an 8 byte layout. This layout encodes information to | 
|  | 123 | reconstruct a call stack of instrumented function and their durations. | 
|  | 124 |  | 
|  | 125 | +---------------+--------------+-----------------------------------------------+ | 
|  | 126 | | Field         | Size (bits)  | Description                                   | | 
|  | 127 | +===============+==============+===============================================+ | 
|  | 128 | | discriminant  | ``1``        | Indicates whether a reader should read a      | | 
|  | 129 | |               |              | Function or Metadata record. Set to ``0`` for | | 
|  | 130 | |               |              | Function records.                             | | 
|  | 131 | +---------------+--------------+-----------------------------------------------+ | 
|  | 132 | | action        | ``3``        | Specifies whether the function is being       | | 
|  | 133 | |               |              | entered, exited, or is a non-standard entry   | | 
|  | 134 | |               |              | or exit produced by optimizations.            | | 
|  | 135 | +---------------+--------------+-----------------------------------------------+ | 
|  | 136 | | function_id   | ``28``       | A numeric ID for the function. Resolved to a  | | 
|  | 137 | |               |              | name via the xray instrumentation map. The    | | 
|  | 138 | |               |              | instrumentation map is built by xray at       | | 
|  | 139 | |               |              | compile time into an object file and pairs    | | 
|  | 140 | |               |              | the function ids to addresses. It is used for | | 
|  | 141 | |               |              | patching and as a lookup into the binary's    | | 
|  | 142 | |               |              | symbols to obtain names.                      | | 
|  | 143 | +---------------+--------------+-----------------------------------------------+ | 
|  | 144 | | tsc_delta     | ``32``       | The number of ticks of the timestamp counter  | | 
|  | 145 | |               |              | since a previous record recorded a delta or   | | 
|  | 146 | |               |              | other TSC resetting event.                    | | 
|  | 147 | +---------------+--------------+-----------------------------------------------+ | 
|  | 148 |  | 
|  | 149 | On little-endian machines, the bitfields are ordered from least significant bit | 
|  | 150 | bit to most significant bit. A reader can read an 8 bit value and apply the mask | 
|  | 151 | ``0x01`` for the discriminant. Similarly, they can read 32 bits and unsigned | 
|  | 152 | shift right by ``0x04`` to obtain the function_id field. | 
|  | 153 |  | 
|  | 154 | On big-endian machine, the bitfields are written in order from most significant | 
|  | 155 | bit to least significant bit. A reader would read an 8 bit value and unsigned | 
|  | 156 | shift right by 7 bits for the discriminant. The function_id field could be | 
|  | 157 | obtained by reading a 32 bit value and applying the mask ``0x0FFFFFFF``. | 
|  | 158 |  | 
|  | 159 | Function action types are as follows. | 
|  | 160 |  | 
|  | 161 | +---------------+--------------+-----------------------------------------------+ | 
|  | 162 | | Type          | Number       | Description                                   | | 
|  | 163 | +===============+==============+===============================================+ | 
|  | 164 | | Entry         | ``0``        | Typical function entry.                       | | 
|  | 165 | +---------------+--------------+-----------------------------------------------+ | 
|  | 166 | | Exit          | ``1``        | Typical function exit.                        | | 
|  | 167 | +---------------+--------------+-----------------------------------------------+ | 
| Dean Michael Berris | 0f84a7d | 2017-09-18 06:08:46 +0000 | [diff] [blame] | 168 | | Tail_Exit     | ``2``        | An exit from a function due to tail call      | | 
| Keith Wyss | 3d0bc9e | 2017-08-02 21:47:27 +0000 | [diff] [blame] | 169 | |               |              | optimization.                                 | | 
|  | 170 | +---------------+--------------+-----------------------------------------------+ | 
|  | 171 | | Entry_Args    | ``3``        | A function entry that records arguments.      | | 
|  | 172 | +---------------+--------------+-----------------------------------------------+ | 
|  | 173 |  | 
|  | 174 | Entry_Args records do not contain the arguments themselves. Instead, metadata | 
|  | 175 | records for each of the logged args follow the function record in the stream. | 
|  | 176 |  | 
|  | 177 |  | 
|  | 178 | Metadata Records | 
|  | 179 | ---------------- | 
|  | 180 |  | 
|  | 181 | Interspersed throughout the buffer are 16 byte Metadata records. For typically | 
|  | 182 | instrumented binaries, they will be sparser than Function records, and they | 
|  | 183 | provide a fuller picture of the binary execution state. | 
|  | 184 |  | 
|  | 185 | Metadata record layout is partially record dependent, but they share a common | 
|  | 186 | structure. | 
|  | 187 |  | 
|  | 188 | The same bit field rules described for function records apply to the first byte | 
|  | 189 | of MetadataRecords. Within this byte, little endian machines use lsb to msb | 
|  | 190 | ordering and big endian machines use msb to lsb ordering. | 
|  | 191 |  | 
|  | 192 | +---------------+--------------+-----------------------------------------------+ | 
|  | 193 | | Field         | Size         | Description                                   | | 
|  | 194 | +===============+==============+===============================================+ | 
|  | 195 | | discriminant  | ``1 bit``    | Indicates whether a reader should read a      | | 
|  | 196 | |               |              | Function or Metadata record. Set to ``1`` for | | 
|  | 197 | |               |              | Metadata records.                             | | 
|  | 198 | +---------------+--------------+-----------------------------------------------+ | 
|  | 199 | | record_kind   | ``7 bits``   | The type of Metadata record.                  | | 
|  | 200 | +---------------+--------------+-----------------------------------------------+ | 
|  | 201 | | data          | ``15 bytes`` | A data field used differently for each record | | 
|  | 202 | |               |              | type.                                         | | 
|  | 203 | +---------------+--------------+-----------------------------------------------+ | 
|  | 204 |  | 
|  | 205 | Here is a table of the enumerated record kinds. | 
|  | 206 |  | 
|  | 207 | +--------+---------------------------+ | 
|  | 208 | | Number | Type                      | | 
|  | 209 | +========+===========================+ | 
|  | 210 | | 0      | NewBuffer                 | | 
|  | 211 | +--------+---------------------------+ | 
|  | 212 | | 1      | EndOfBuffer               | | 
|  | 213 | +--------+---------------------------+ | 
|  | 214 | | 2      | NewCPUId                  | | 
|  | 215 | +--------+---------------------------+ | 
|  | 216 | | 3      | TSCWrap                   | | 
|  | 217 | +--------+---------------------------+ | 
|  | 218 | | 4      | WallTimeMarker            | | 
|  | 219 | +--------+---------------------------+ | 
|  | 220 | | 5      | CustomEventMarker         | | 
|  | 221 | +--------+---------------------------+ | 
|  | 222 | | 6      | CallArgument              | | 
|  | 223 | +--------+---------------------------+ | 
|  | 224 |  | 
|  | 225 |  | 
|  | 226 | NewBuffer Records | 
|  | 227 | ----------------- | 
|  | 228 |  | 
|  | 229 | Each buffer begins with a NewBuffer record immediately after the header. | 
|  | 230 | It records the thread ID of the thread that the trace belongs to. | 
|  | 231 |  | 
|  | 232 | Its data segment is as follows. | 
|  | 233 |  | 
|  | 234 | +---------------+--------------+-----------------------------------------------+ | 
|  | 235 | | Field         | Size (bytes) | Description                                   | | 
|  | 236 | +===============+==============+===============================================+ | 
|  | 237 | | thread_Id     | ``2``        | Thread ID for buffer.                         | | 
|  | 238 | +---------------+--------------+-----------------------------------------------+ | 
|  | 239 | | reserved      | ``13``       | Unused.                                       | | 
|  | 240 | +---------------+--------------+-----------------------------------------------+ | 
|  | 241 |  | 
|  | 242 |  | 
|  | 243 | WallClockTime Records | 
|  | 244 | --------------------- | 
|  | 245 |  | 
|  | 246 | Following the NewBuffer record, each buffer records an absolute time as a frame | 
|  | 247 | of reference for the durations recorded by timestamp counter deltas. | 
|  | 248 |  | 
|  | 249 | Its data segment is as follows. | 
|  | 250 |  | 
|  | 251 | +---------------+--------------+-----------------------------------------------+ | 
|  | 252 | | Field         | Size (bytes) | Description                                   | | 
|  | 253 | +===============+==============+===============================================+ | 
|  | 254 | | seconds       | ``8``        | Seconds on absolute timescale. The starting   | | 
|  | 255 | |               |              | point is unspecified and depends on the       | | 
|  | 256 | |               |              | implementation and platform configured by the | | 
|  | 257 | |               |              | tracer.                                       | | 
|  | 258 | +---------------+--------------+-----------------------------------------------+ | 
|  | 259 | | microseconds  | ``4``        | The microsecond component of the time.        | | 
|  | 260 | +---------------+--------------+-----------------------------------------------+ | 
|  | 261 | | reserved      | ``3``        | Unused.                                       | | 
|  | 262 | +---------------+--------------+-----------------------------------------------+ | 
|  | 263 |  | 
|  | 264 |  | 
|  | 265 | NewCpuId Records | 
|  | 266 | ---------------- | 
|  | 267 |  | 
|  | 268 | Each function entry invokes a routine to determine what CPU is executing. | 
|  | 269 | Typically, this is done with readtscp, which reads the timestamp counter at the | 
|  | 270 | same time. | 
|  | 271 |  | 
|  | 272 | If the tracing detects that the execution has switched CPUs or if this is the | 
|  | 273 | first instrumented entry point, the tracer will output a NewCpuId record. | 
|  | 274 |  | 
|  | 275 | Its data segment is as follows. | 
|  | 276 |  | 
|  | 277 | +---------------+--------------+-----------------------------------------------+ | 
|  | 278 | | Field         | Size (bytes) | Description                                   | | 
|  | 279 | +===============+==============+===============================================+ | 
|  | 280 | | cpu_id        | ``2``        | CPU Id.                                       | | 
|  | 281 | +---------------+--------------+-----------------------------------------------+ | 
|  | 282 | | absolute_tsc  | ``8``        | The absolute value of the timestamp counter.  | | 
|  | 283 | +---------------+--------------+-----------------------------------------------+ | 
|  | 284 | | reserved      | ``5``        | Unused.                                       | | 
|  | 285 | +---------------+--------------+-----------------------------------------------+ | 
|  | 286 |  | 
|  | 287 |  | 
|  | 288 | TSCWrap Records | 
|  | 289 | --------------- | 
|  | 290 |  | 
|  | 291 | Since each function record uses a 32 bit value to represent the number of ticks | 
|  | 292 | of the timestamp counter since the last reference, it is possible for this value | 
|  | 293 | to overflow, particularly for sparsely instrumented binaries. | 
|  | 294 |  | 
|  | 295 | When this delta would not fit into a 32 bit representation, a reference absolute | 
|  | 296 | timestamp counter record is written in the form of a TSCWrap record. | 
|  | 297 |  | 
|  | 298 | Its data segment is as follows. | 
|  | 299 |  | 
|  | 300 | +---------------+--------------+-----------------------------------------------+ | 
|  | 301 | | Field         | Size (bytes) | Description                                   | | 
|  | 302 | +===============+==============+===============================================+ | 
|  | 303 | | absolute_tsc  | ``8``        | Timestamp counter value.                      | | 
|  | 304 | +---------------+--------------+-----------------------------------------------+ | 
|  | 305 | | reserved      | ``7``        | Unused.                                       | | 
|  | 306 | +---------------+--------------+-----------------------------------------------+ | 
|  | 307 |  | 
|  | 308 |  | 
|  | 309 | CallArgument Records | 
|  | 310 | -------------------- | 
|  | 311 |  | 
|  | 312 | Immediately following an Entry_Args type function record, there may be one or | 
|  | 313 | more CallArgument records that contain the traced function's parameter values. | 
|  | 314 |  | 
|  | 315 | The order of the CallArgument Record sequency corresponds one to one with the | 
|  | 316 | order of the function parameters. | 
|  | 317 |  | 
|  | 318 | CallArgument data segment: | 
|  | 319 |  | 
|  | 320 | +---------------+--------------+-----------------------------------------------+ | 
|  | 321 | | Field         | Size (bytes) | Description                                   | | 
|  | 322 | +===============+==============+===============================================+ | 
|  | 323 | | argument      | ``8``        | Numeric argument (may be pointer address).    | | 
|  | 324 | +---------------+--------------+-----------------------------------------------+ | 
|  | 325 | | reserved      | ``7``        | Unused.                                       | | 
|  | 326 | +---------------+--------------+-----------------------------------------------+ | 
|  | 327 |  | 
|  | 328 |  | 
|  | 329 | CustomEventMarker Records | 
|  | 330 | ------------------------- | 
|  | 331 |  | 
|  | 332 | XRay provides the feature of logging custom events. This may be leveraged to | 
|  | 333 | record tracing info for RPCs or similarly trace data that is application | 
|  | 334 | specific. | 
|  | 335 |  | 
|  | 336 | Custom Events themselves are an unstructured (application defined) segment of | 
|  | 337 | memory with arbitrary size within the buffer. They are preceded by | 
|  | 338 | CustomEventMarkers to indicate their presence and size. | 
|  | 339 |  | 
|  | 340 | CustomEventMarker data segment: | 
|  | 341 |  | 
|  | 342 | +---------------+--------------+-----------------------------------------------+ | 
|  | 343 | | Field         | Size (bytes) | Description                                   | | 
|  | 344 | +===============+==============+===============================================+ | 
|  | 345 | | event_size    | ``4``        | Size of preceded event.                       | | 
|  | 346 | +---------------+--------------+-----------------------------------------------+ | 
|  | 347 | | absolute_tsc  | ``8``        | A timestamp counter of the event.             | | 
|  | 348 | +---------------+--------------+-----------------------------------------------+ | 
|  | 349 | | reserved      | ``3``        | Unused.                                       | | 
|  | 350 | +---------------+--------------+-----------------------------------------------+ | 
|  | 351 |  | 
|  | 352 |  | 
|  | 353 | EndOfBuffer Records | 
|  | 354 | ------------------- | 
|  | 355 |  | 
|  | 356 | An EndOfBuffer record type indicates that there is no more trace data in this | 
|  | 357 | buffer. The reader is expected to seek past the remaining buffer_size expressed | 
|  | 358 | before the start of buffer and look for either another header or EOF. | 
|  | 359 |  | 
|  | 360 |  | 
|  | 361 | Format Grammar and Invariants | 
|  | 362 | ============================= | 
|  | 363 |  | 
|  | 364 | Not all sequences of Metadata records and Function records are valid data. A | 
|  | 365 | sequence should be parsed as a state machine. The expectations for a valid | 
|  | 366 | format can be expressed as a context free grammar. | 
|  | 367 |  | 
|  | 368 | This is an attempt to explain the format with statements in EBNF format. | 
|  | 369 |  | 
|  | 370 | - Format := Header ThreadBuffer* EOF | 
|  | 371 |  | 
|  | 372 | - ThreadBuffer := NewBuffer WallClockTime NewCPUId BodySequence* End | 
|  | 373 |  | 
|  | 374 | - BodySequence := NewCPUId | TSCWrap | Function | CustomEvent | 
|  | 375 |  | 
|  | 376 | - Function := (Function_Entry_Args CallArgument*) | Function_Other_Type | 
|  | 377 |  | 
|  | 378 | - CustomEvent := CustomEventMarker CustomEventUnstructuredMemory | 
|  | 379 |  | 
|  | 380 | - End := EndOfBuffer RemainingBufferSizeToSkip | 
|  | 381 |  | 
|  | 382 |  | 
|  | 383 | Function Record Order | 
|  | 384 | --------------------- | 
|  | 385 |  | 
|  | 386 | There are a few clarifications that may help understand what is expected of | 
|  | 387 | Function records. | 
|  | 388 |  | 
|  | 389 | - Functions with an Exit are expected to have a corresponding Entry or | 
|  | 390 | Entry_Args function record precede them in the trace. | 
|  | 391 |  | 
|  | 392 | - Tail_Exit Function records record the Function ID of the function whose return | 
|  | 393 | address the program counter will take. In other words, the final function that | 
|  | 394 | would be popped off of the call stack if tail call optimization was not used. | 
|  | 395 |  | 
|  | 396 | - Not all functions marked for instrumentation are necessarily in the trace. The | 
|  | 397 | tracer uses heuristics to preserve the trace for non-trivial functions. | 
|  | 398 |  | 
|  | 399 | - Not every entry must have a traced Exit or Tail Exit. The buffer may run out | 
|  | 400 | of space or the program may request for the tracer to finalize toreturn the | 
|  | 401 | buffer before an instrumented function exits. |