| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" |
| "http://www.w3.org/TR/html4/strict.dtd"> |
| <html> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
| <title>LLVM Bitcode File Format</title> |
| <link rel="stylesheet" href="llvm.css" type="text/css"> |
| </head> |
| <body> |
| <div class="doc_title"> LLVM Bitcode File Format </div> |
| <ol> |
| <li><a href="#abstract">Abstract</a></li> |
| <li><a href="#overview">Overview</a></li> |
| <li><a href="#bitstream">Bitstream Format</a> |
| <ol> |
| <li><a href="#magic">Magic Numbers</a></li> |
| <li><a href="#primitives">Primitives</a></li> |
| <li><a href="#abbrevid">Abbreviation IDs</a></li> |
| <li><a href="#blocks">Blocks</a></li> |
| <li><a href="#datarecord">Data Records</a></li> |
| <li><a href="#abbreviations">Abbreviations</a></li> |
| <li><a href="#stdblocks">Standard Blocks</a></li> |
| </ol> |
| </li> |
| <li><a href="#wrapper">Bitcode Wrapper Format</a> |
| </li> |
| <li><a href="#llvmir">LLVM IR Encoding</a> |
| <ol> |
| <li><a href="#basics">Basics</a></li> |
| </ol> |
| </li> |
| </ol> |
| <div class="doc_author"> |
| <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a> |
| and <a href="http://www.reverberate.org">Joshua Haberman</a>. |
| </p> |
| </div> |
| |
| <!-- *********************************************************************** --> |
| <div class="doc_section"> <a name="abstract">Abstract</a></div> |
| <!-- *********************************************************************** --> |
| |
| <div class="doc_text"> |
| |
| <p>This document describes the LLVM bitstream file format and the encoding of |
| the LLVM IR into it.</p> |
| |
| </div> |
| |
| <!-- *********************************************************************** --> |
| <div class="doc_section"> <a name="overview">Overview</a></div> |
| <!-- *********************************************************************** --> |
| |
| <div class="doc_text"> |
| |
| <p> |
| What is commonly known as the LLVM bitcode file format (also, sometimes |
| anachronistically known as bytecode) is actually two things: a <a |
| href="#bitstream">bitstream container format</a> |
| and an <a href="#llvmir">encoding of LLVM IR</a> into the container format.</p> |
| |
| <p> |
| The bitstream format is an abstract encoding of structured data, very |
| similar to XML in some ways. Like XML, bitstream files contain tags, and nested |
| structures, and you can parse the file without having to understand the tags. |
| Unlike XML, the bitstream format is a binary encoding, and unlike XML it |
| provides a mechanism for the file to self-describe "abbreviations", which are |
| effectively size optimizations for the content.</p> |
| |
| <p>LLVM IR files may be optionally embedded into a <a |
| href="#wrapper">wrapper</a> structure that makes it easy to embed extra data |
| along with LLVM IR files.</p> |
| |
| <p>This document first describes the LLVM bitstream format, describes the |
| wrapper format, then describes the record structure used by LLVM IR files. |
| </p> |
| |
| </div> |
| |
| <!-- *********************************************************************** --> |
| <div class="doc_section"> <a name="bitstream">Bitstream Format</a></div> |
| <!-- *********************************************************************** --> |
| |
| <div class="doc_text"> |
| |
| <p> |
| The bitstream format is literally a stream of bits, with a very simple |
| structure. This structure consists of the following concepts: |
| </p> |
| |
| <ul> |
| <li>A "<a href="#magic">magic number</a>" that identifies the contents of |
| the stream.</li> |
| <li>Encoding <a href="#primitives">primitives</a> like variable bit-rate |
| integers.</li> |
| <li><a href="#blocks">Blocks</a>, which define nested content.</li> |
| <li><a href="#datarecord">Data Records</a>, which describe entities within the |
| file.</li> |
| <li>Abbreviations, which specify compression optimizations for the file.</li> |
| </ul> |
| |
| <p>Note that the <a |
| href="CommandGuide/html/llvm-bcanalyzer.html">llvm-bcanalyzer</a> tool can be |
| used to dump and inspect arbitrary bitstreams, which is very useful for |
| understanding the encoding.</p> |
| |
| </div> |
| |
| <!-- ======================================================================= --> |
| <div class="doc_subsection"><a name="magic">Magic Numbers</a> |
| </div> |
| |
| <div class="doc_text"> |
| |
| <p>The first two bytes of a bitcode file are 'BC' (0x42, 0x43). |
| The second two bytes are an application-specific magic number. Generic |
| bitcode tools can look at only the first two bytes to verify the file is |
| bitcode, while application-specific programs will want to look at all four.</p> |
| |
| </div> |
| |
| <!-- ======================================================================= --> |
| <div class="doc_subsection"><a name="primitives">Primitives</a> |
| </div> |
| |
| <div class="doc_text"> |
| |
| <p> |
| A bitstream literally consists of a stream of bits, which are read in order |
| starting with the least significant bit of each byte. The stream is made up of a |
| number of primitive values that encode a stream of unsigned integer values. |
| These |
| integers are are encoded in two ways: either as <a href="#fixedwidth">Fixed |
| Width Integers</a> or as <a href="#variablewidth">Variable Width |
| Integers</a>. |
| </p> |
| |
| </div> |
| |
| <!-- _______________________________________________________________________ --> |
| <div class="doc_subsubsection"> <a name="fixedwidth">Fixed Width Integers</a> |
| </div> |
| |
| <div class="doc_text"> |
| |
| <p>Fixed-width integer values have their low bits emitted directly to the file. |
| For example, a 3-bit integer value encodes 1 as 001. Fixed width integers |
| are used when there are a well-known number of options for a field. For |
| example, boolean values are usually encoded with a 1-bit wide integer. |
| </p> |
| |
| </div> |
| |
| <!-- _______________________________________________________________________ --> |
| <div class="doc_subsubsection"> <a name="variablewidth">Variable Width |
| Integers</a></div> |
| |
| <div class="doc_text"> |
| |
| <p>Variable-width integer (VBR) values encode values of arbitrary size, |
| optimizing for the case where the values are small. Given a 4-bit VBR field, |
| any 3-bit value (0 through 7) is encoded directly, with the high bit set to |
| zero. Values larger than N-1 bits emit their bits in a series of N-1 bit |
| chunks, where all but the last set the high bit.</p> |
| |
| <p>For example, the value 27 (0x1B) is encoded as 1011 0011 when emitted as a |
| vbr4 value. The first set of four bits indicates the value 3 (011) with a |
| continuation piece (indicated by a high bit of 1). The next word indicates a |
| value of 24 (011 << 3) with no continuation. The sum (3+24) yields the value |
| 27. |
| </p> |
| |
| </div> |
| |
| <!-- _______________________________________________________________________ --> |
| <div class="doc_subsubsection"> <a name="char6">6-bit characters</a></div> |
| |
| <div class="doc_text"> |
| |
| <p>6-bit characters encode common characters into a fixed 6-bit field. They |
| represent the following characters with the following 6-bit values:</p> |
| |
| <div class="doc_code"> |
| <pre> |
| 'a' .. 'z' — 0 .. 25 |
| 'A' .. 'Z' — 26 .. 51 |
| '0' .. '9' — 52 .. 61 |
| '.' — 62 |
| '_' — 63 |
| </pre> |
| </div> |
| |
| <p>This encoding is only suitable for encoding characters and strings that |
| consist only of the above characters. It is completely incapable of encoding |
| characters not in the set.</p> |
| |
| </div> |
| |
| <!-- _______________________________________________________________________ --> |
| <div class="doc_subsubsection"> <a name="wordalign">Word Alignment</a></div> |
| |
| <div class="doc_text"> |
| |
| <p>Occasionally, it is useful to emit zero bits until the bitstream is a |
| multiple of 32 bits. This ensures that the bit position in the stream can be |
| represented as a multiple of 32-bit words.</p> |
| |
| </div> |
| |
| |
| <!-- ======================================================================= --> |
| <div class="doc_subsection"><a name="abbrevid">Abbreviation IDs</a> |
| </div> |
| |
| <div class="doc_text"> |
| |
| <p> |
| A bitstream is a sequential series of <a href="#blocks">Blocks</a> and |
| <a href="#datarecord">Data Records</a>. Both of these start with an |
| abbreviation ID encoded as a fixed-bitwidth field. The width is specified by |
| the current block, as described below. The value of the abbreviation ID |
| specifies either a builtin ID (which have special meanings, defined below) or |
| one of the abbreviation IDs defined by the stream itself. |
| </p> |
| |
| <p> |
| The set of builtin abbrev IDs is: |
| </p> |
| |
| <ul> |
| <li><tt>0 - <a href="#END_BLOCK">END_BLOCK</a></tt> — This abbrev ID marks |
| the end of the current block.</li> |
| <li><tt>1 - <a href="#ENTER_SUBBLOCK">ENTER_SUBBLOCK</a></tt> — This |
| abbrev ID marks the beginning of a new block.</li> |
| <li><tt>2 - <a href="#DEFINE_ABBREV">DEFINE_ABBREV</a></tt> — This defines |
| a new abbreviation.</li> |
| <li><tt>3 - <a href="#UNABBREV_RECORD">UNABBREV_RECORD</a></tt> — This ID |
| specifies the definition of an unabbreviated record.</li> |
| </ul> |
| |
| <p>Abbreviation IDs 4 and above are defined by the stream itself, and specify |
| an <a href="#abbrev_records">abbreviated record encoding</a>.</p> |
| |
| </div> |
| |
| <!-- ======================================================================= --> |
| <div class="doc_subsection"><a name="blocks">Blocks</a> |
| </div> |
| |
| <div class="doc_text"> |
| |
| <p> |
| Blocks in a bitstream denote nested regions of the stream, and are identified by |
| a content-specific id number (for example, LLVM IR uses an ID of 12 to represent |
| function bodies). Block IDs 0-7 are reserved for <a href="#stdblocks">standard blocks</a> |
| whose meaning is defined by Bitcode; block IDs 8 and greater are |
| application specific. Nested blocks capture the hierachical structure of the data |
| encoded in it, and various properties are associated with blocks as the file is |
| parsed. Block definitions allow the reader to efficiently skip blocks |
| in constant time if the reader wants a summary of blocks, or if it wants to |
| efficiently skip data they do not understand. The LLVM IR reader uses this |
| mechanism to skip function bodies, lazily reading them on demand. |
| </p> |
| |
| <p> |
| When reading and encoding the stream, several properties are maintained for the |
| block. In particular, each block maintains: |
| </p> |
| |
| <ol> |
| <li>A current abbrev id width. This value starts at 2, and is set every time a |
| block record is entered. The block entry specifies the abbrev id width for |
| the body of the block.</li> |
| |
| <li>A set of abbreviations. Abbreviations may be defined within a block, in |
| which case they are only defined in that block (neither subblocks nor |
| enclosing blocks see the abbreviation). Abbreviations can also be defined |
| inside a <tt><a href="#BLOCKINFO">BLOCKINFO</a></tt> block, in which case |
| they are defined in all blocks that match the ID that the BLOCKINFO block is |
| describing. |
| </li> |
| </ol> |
| |
| <p> |
| As sub blocks are entered, these properties are saved and the new sub-block has |
| its own set of abbreviations, and its own abbrev id width. When a sub-block is |
| popped, the saved values are restored. |
| </p> |
| |
| </div> |
| |
| <!-- _______________________________________________________________________ --> |
| <div class="doc_subsubsection"> <a name="ENTER_SUBBLOCK">ENTER_SUBBLOCK |
| Encoding</a></div> |
| |
| <div class="doc_text"> |
| |
| <p><tt>[ENTER_SUBBLOCK, blockid<sub>vbr8</sub>, newabbrevlen<sub>vbr4</sub>, |
| <align32bits>, blocklen<sub>32</sub>]</tt></p> |
| |
| <p> |
| The <tt>ENTER_SUBBLOCK</tt> abbreviation ID specifies the start of a new block |
| record. The <tt>blockid</tt> value is encoded as an 8-bit VBR identifier, and |
| indicates the type of block being entered, which can be |
| a <a href="#stdblocks">standard block</a> or an application-specific block. |
| The <tt>newabbrevlen</tt> value is a 4-bit VBR, which specifies the abbrev id |
| width for the sub-block. The <tt>blocklen</tt> value is a 32-bit aligned value |
| that specifies the size of the subblock in 32-bit words. This value allows the |
| reader to skip over the entire block in one jump. |
| </p> |
| |
| </div> |
| |
| <!-- _______________________________________________________________________ --> |
| <div class="doc_subsubsection"> <a name="END_BLOCK">END_BLOCK |
| Encoding</a></div> |
| |
| <div class="doc_text"> |
| |
| <p><tt>[END_BLOCK, <align32bits>]</tt></p> |
| |
| <p> |
| The <tt>END_BLOCK</tt> abbreviation ID specifies the end of the current block |
| record. Its end is aligned to 32-bits to ensure that the size of the block is |
| an even multiple of 32-bits. |
| </p> |
| |
| </div> |
| |
| |
| |
| <!-- ======================================================================= --> |
| <div class="doc_subsection"><a name="datarecord">Data Records</a> |
| </div> |
| |
| <div class="doc_text"> |
| <p> |
| Data records consist of a record code and a number of (up to) 64-bit integer |
| values. The interpretation of the code and values is application specific and |
| there are multiple different ways to encode a record (with an unabbrev record or |
| with an abbreviation). In the LLVM IR format, for example, there is a record |
| which encodes the target triple of a module. The code is |
| <tt>MODULE_CODE_TRIPLE</tt>, and the values of the record are the ASCII codes |
| for the characters in the string. |
| </p> |
| |
| </div> |
| |
| <!-- _______________________________________________________________________ --> |
| <div class="doc_subsubsection"> <a name="UNABBREV_RECORD">UNABBREV_RECORD |
| Encoding</a></div> |
| |
| <div class="doc_text"> |
| |
| <p><tt>[UNABBREV_RECORD, code<sub>vbr6</sub>, numops<sub>vbr6</sub>, |
| op0<sub>vbr6</sub>, op1<sub>vbr6</sub>, ...]</tt></p> |
| |
| <p> |
| An <tt>UNABBREV_RECORD</tt> provides a default fallback encoding, which is both |
| completely general and extremely inefficient. It can describe an arbitrary |
| record by emitting the code and operands as vbrs. |
| </p> |
| |
| <p> |
| For example, emitting an LLVM IR target triple as an unabbreviated record |
| requires emitting the <tt>UNABBREV_RECORD</tt> abbrevid, a vbr6 for the |
| <tt>MODULE_CODE_TRIPLE</tt> code, a vbr6 for the length of the string, which is |
| equal to the number of operands, and a vbr6 for each character. Because there |
| are no letters with values less than 32, each letter would need to be emitted as |
| at least a two-part VBR, which means that each letter would require at least 12 |
| bits. This is not an efficient encoding, but it is fully general. |
| </p> |
| |
| </div> |
| |
| <!-- _______________________________________________________________________ --> |
| <div class="doc_subsubsection"> <a name="abbrev_records">Abbreviated Record |
| Encoding</a></div> |
| |
| <div class="doc_text"> |
| |
| <p><tt>[<abbrevid>, fields...]</tt></p> |
| |
| <p> |
| An abbreviated record is a abbreviation id followed by a set of fields that are |
| encoded according to the <a href="#abbreviations">abbreviation definition</a>. |
| This allows records to be encoded significantly more densely than records |
| encoded with the <tt><a href="#UNABBREV_RECORD">UNABBREV_RECORD</a></tt> type, |
| and allows the abbreviation types to be specified in the stream itself, which |
| allows the files to be completely self describing. The actual encoding of |
| abbreviations is defined below. |
| </p> |
| |
| </div> |
| |
| <!-- ======================================================================= --> |
| <div class="doc_subsection"><a name="abbreviations">Abbreviations</a> |
| </div> |
| |
| <div class="doc_text"> |
| <p> |
| Abbreviations are an important form of compression for bitstreams. The idea is |
| to specify a dense encoding for a class of records once, then use that encoding |
| to emit many records. It takes space to emit the encoding into the file, but |
| the space is recouped (hopefully plus some) when the records that use it are |
| emitted. |
| </p> |
| |
| <p> |
| Abbreviations can be determined dynamically per client, per file. Because the |
| abbreviations are stored in the bitstream itself, different streams of the same |
| format can contain different sets of abbreviations if the specific stream does |
| not need it. As a concrete example, LLVM IR files usually emit an abbreviation |
| for binary operators. If a specific LLVM module contained no or few binary |
| operators, the abbreviation does not need to be emitted. |
| </p> |
| </div> |
| |
| <!-- _______________________________________________________________________ --> |
| <div class="doc_subsubsection"><a name="DEFINE_ABBREV">DEFINE_ABBREV |
| Encoding</a></div> |
| |
| <div class="doc_text"> |
| |
| <p><tt>[DEFINE_ABBREV, numabbrevops<sub>vbr5</sub>, abbrevop0, abbrevop1, |
| ...]</tt></p> |
| |
| <p> |
| A <tt>DEFINE_ABBREV</tt> record adds an abbreviation to the list of currently |
| defined abbreviations in the scope of this block. This definition only exists |
| inside this immediate block — it is not visible in subblocks or enclosing |
| blocks. Abbreviations are implicitly assigned IDs sequentially starting from 4 |
| (the first application-defined abbreviation ID). Any abbreviations defined in a |
| <tt>BLOCKINFO</tt> record receive IDs first, in order, followed by any |
| abbreviations defined within the block itself. Abbreviated data records |
| reference this ID to indicate what abbreviation they are invoking. |
| </p> |
| |
| <p> |
| An abbreviation definition consists of the <tt>DEFINE_ABBREV</tt> abbrevid |
| followed by a VBR that specifies the number of abbrev operands, then the abbrev |
| operands themselves. Abbreviation operands come in three forms. They all start |
| with a single bit that indicates whether the abbrev operand is a literal operand |
| (when the bit is 1) or an encoding operand (when the bit is 0). |
| </p> |
| |
| <ol> |
| <li>Literal operands — <tt>[1<sub>1</sub>, litvalue<sub>vbr8</sub>]</tt> |
| — Literal operands specify that the value in the result is always a single |
| specific value. This specific value is emitted as a vbr8 after the bit |
| indicating that it is a literal operand.</li> |
| <li>Encoding info without data — <tt>[0<sub>1</sub>, |
| encoding<sub>3</sub>]</tt> — Operand encodings that do not have extra |
| data are just emitted as their code. |
| </li> |
| <li>Encoding info with data — <tt>[0<sub>1</sub>, encoding<sub>3</sub>, |
| value<sub>vbr5</sub>]</tt> — Operand encodings that do have extra data are |
| emitted as their code, followed by the extra data. |
| </li> |
| </ol> |
| |
| <p>The possible operand encodings are:</p> |
| |
| <ol> |
| <li>Fixed: The field should be emitted as |
| a <a href="#fixedwidth">fixed-width value</a>, whose width is specified by |
| the operand's extra data.</li> |
| <li>VBR: The field should be emitted as |
| a <a href="#variablewidth">variable-width value</a>, whose width is |
| specified by the operand's extra data.</li> |
| <li>Array: This field is an array of values. The array operand |
| has no extra data, but expects another operand to follow it which indicates |
| the element type of the array. When reading an array in an abbreviated |
| record, the first integer is a vbr6 that indicates the array length, |
| followed by the encoded elements of the array. An array may only occur as |
| the last operand of an abbreviation (except for the one final operand that |
| gives the array's type).</li> |
| <li>Char6: This field should be emitted as |
| a <a href="#char6">char6-encoded value</a>. This operand type takes no |
| extra data.</li> |
| <li>Blob: This field is emitted as a vbr6, followed by padding to a |
| 32-bit boundary (for alignment) and an array of 8-bit objects. The array of |
| bytes is further followed by tail padding to ensure that its total length is |
| a multiple of 4 bytes. This makes it very efficient for the reader to |
| decode the data without having to make a copy of it: it can use a pointer to |
| the data in the mapped in file and poke directly at it. A blob may only |
| occur as the last operand of an abbreviation.</li> |
| </ol> |
| |
| <p> |
| For example, target triples in LLVM modules are encoded as a record of the |
| form <tt>[TRIPLE, 'a', 'b', 'c', 'd']</tt>. Consider if the bitstream emitted |
| the following abbrev entry: |
| </p> |
| |
| <div class="doc_code"> |
| <pre> |
| [0, Fixed, 4] |
| [0, Array] |
| [0, Char6] |
| </pre> |
| </div> |
| |
| <p> |
| When emitting a record with this abbreviation, the above entry would be emitted |
| as: |
| </p> |
| |
| <div class="doc_code"> |
| <p> |
| <tt>[4<sub>abbrevwidth</sub>, 2<sub>4</sub>, 4<sub>vbr6</sub>, 0<sub>6</sub>, |
| 1<sub>6</sub>, 2<sub>6</sub>, 3<sub>6</sub>]</tt> |
| </p> |
| </div> |
| |
| <p>These values are:</p> |
| |
| <ol> |
| <li>The first value, 4, is the abbreviation ID for this abbreviation.</li> |
| <li>The second value, 2, is the code for <tt>TRIPLE</tt> in LLVM IR files.</li> |
| <li>The third value, 4, is the length of the array.</li> |
| <li>The rest of the values are the char6 encoded values |
| for <tt>"abcd"</tt>.</li> |
| </ol> |
| |
| <p> |
| With this abbreviation, the triple is emitted with only 37 bits (assuming a |
| abbrev id width of 3). Without the abbreviation, significantly more space would |
| be required to emit the target triple. Also, because the <tt>TRIPLE</tt> value |
| is not emitted as a literal in the abbreviation, the abbreviation can also be |
| used for any other string value. |
| </p> |
| |
| </div> |
| |
| <!-- ======================================================================= --> |
| <div class="doc_subsection"><a name="stdblocks">Standard Blocks</a> |
| </div> |
| |
| <div class="doc_text"> |
| |
| <p> |
| In addition to the basic block structure and record encodings, the bitstream |
| also defines specific builtin block types. These block types specify how the |
| stream is to be decoded or other metadata. In the future, new standard blocks |
| may be added. Block IDs 0-7 are reserved for standard blocks. |
| </p> |
| |
| </div> |
| |
| <!-- _______________________________________________________________________ --> |
| <div class="doc_subsubsection"><a name="BLOCKINFO">#0 - BLOCKINFO |
| Block</a></div> |
| |
| <div class="doc_text"> |
| |
| <p> |
| The <tt>BLOCKINFO</tt> block allows the description of metadata for other |
| blocks. The currently specified records are: |
| </p> |
| |
| <div class="doc_code"> |
| <pre> |
| [SETBID (#1), blockid] |
| [DEFINE_ABBREV, ...] |
| [BLOCKNAME, ...name...] |
| [SETRECORDNAME, RecordID, ...name...] |
| </pre> |
| </div> |
| |
| <p> |
| The <tt>SETBID</tt> record indicates which block ID is being |
| described. <tt>SETBID</tt> records can occur multiple times throughout the |
| block to change which block ID is being described. There must be |
| a <tt>SETBID</tt> record prior to any other records. |
| </p> |
| |
| <p> |
| Standard <tt>DEFINE_ABBREV</tt> records can occur inside <tt>BLOCKINFO</tt> |
| blocks, but unlike their occurrence in normal blocks, the abbreviation is |
| defined for blocks matching the block ID we are describing, <i>not</i> the |
| <tt>BLOCKINFO</tt> block itself. The abbreviations defined |
| in <tt>BLOCKINFO</tt> blocks receive abbreviation IDs as described |
| in <tt><a href="#DEFINE_ABBREV">DEFINE_ABBREV</a></tt>. |
| </p> |
| |
| <p>The <tt>BLOCKNAME</tt> can optionally occur in this block. The elements of |
| the record are the bytes for the string name of the block. llvm-bcanalyzer uses |
| this to dump out bitcode files symbolically.</p> |
| |
| <p>The <tt>SETRECORDNAME</tt> record can optionally occur in this block. The |
| first entry is a record ID number and the rest of the elements of the record are |
| the bytes for the string name of the record. llvm-bcanalyzer uses |
| this to dump out bitcode files symbolically.</p> |
| |
| <p> |
| Note that although the data in <tt>BLOCKINFO</tt> blocks is described as |
| "metadata," the abbreviations they contain are essential for parsing records |
| from the corresponding blocks. It is not safe to skip them. |
| </p> |
| |
| </div> |
| |
| <!-- *********************************************************************** --> |
| <div class="doc_section"> <a name="wrapper">Bitcode Wrapper Format</a></div> |
| <!-- *********************************************************************** --> |
| |
| <div class="doc_text"> |
| |
| <p> |
| Bitcode files for LLVM IR may optionally be wrapped in a simple wrapper |
| structure. This structure contains a simple header that indicates the offset |
| and size of the embedded BC file. This allows additional information to be |
| stored alongside the BC file. The structure of this file header is: |
| </p> |
| |
| <div class="doc_code"> |
| <p> |
| <tt>[Magic<sub>32</sub>, Version<sub>32</sub>, Offset<sub>32</sub>, |
| Size<sub>32</sub>, CPUType<sub>32</sub>]</tt> |
| </p> |
| </div> |
| |
| <p> |
| Each of the fields are 32-bit fields stored in little endian form (as with |
| the rest of the bitcode file fields). The Magic number is always |
| <tt>0x0B17C0DE</tt> and the version is currently always <tt>0</tt>. The Offset |
| field is the offset in bytes to the start of the bitcode stream in the file, and |
| the Size field is a size in bytes of the stream. CPUType is a target-specific |
| value that can be used to encode the CPU of the target. |
| </p> |
| |
| </div> |
| |
| <!-- *********************************************************************** --> |
| <div class="doc_section"> <a name="llvmir">LLVM IR Encoding</a></div> |
| <!-- *********************************************************************** --> |
| |
| <div class="doc_text"> |
| |
| <p> |
| LLVM IR is encoded into a bitstream by defining blocks and records. It uses |
| blocks for things like constant pools, functions, symbol tables, etc. It uses |
| records for things like instructions, global variable descriptors, type |
| descriptions, etc. This document does not describe the set of abbreviations |
| that the writer uses, as these are fully self-described in the file, and the |
| reader is not allowed to build in any knowledge of this. |
| </p> |
| |
| </div> |
| |
| <!-- ======================================================================= --> |
| <div class="doc_subsection"><a name="basics">Basics</a> |
| </div> |
| |
| <!-- _______________________________________________________________________ --> |
| <div class="doc_subsubsection"><a name="ir_magic">LLVM IR Magic Number</a></div> |
| |
| <div class="doc_text"> |
| |
| <p> |
| The magic number for LLVM IR files is: |
| </p> |
| |
| <div class="doc_code"> |
| <p> |
| <tt>[0x0<sub>4</sub>, 0xC<sub>4</sub>, 0xE<sub>4</sub>, 0xD<sub>4</sub>]</tt> |
| </p> |
| </div> |
| |
| <p> |
| When combined with the bitcode magic number and viewed as bytes, this is |
| <tt>"BC 0xC0DE"</tt>. |
| </p> |
| |
| </div> |
| |
| <!-- _______________________________________________________________________ --> |
| <div class="doc_subsubsection"><a name="ir_signed_vbr">Signed VBRs</a></div> |
| |
| <div class="doc_text"> |
| |
| <p> |
| <a href="#variablewidth">Variable Width Integers</a> are an efficient way to |
| encode arbitrary sized unsigned values, but is an extremely inefficient way to |
| encode signed values (as signed values are otherwise treated as maximally large |
| unsigned values). |
| </p> |
| |
| <p> |
| As such, signed vbr values of a specific width are emitted as follows: |
| </p> |
| |
| <ul> |
| <li>Positive values are emitted as vbrs of the specified width, but with their |
| value shifted left by one.</li> |
| <li>Negative values are emitted as vbrs of the specified width, but the negated |
| value is shifted left by one, and the low bit is set.</li> |
| </ul> |
| |
| <p> |
| With this encoding, small positive and small negative values can both be emitted |
| efficiently. |
| </p> |
| |
| </div> |
| |
| |
| <!-- _______________________________________________________________________ --> |
| <div class="doc_subsubsection"><a name="ir_blocks">LLVM IR Blocks</a></div> |
| |
| <div class="doc_text"> |
| |
| <p> |
| LLVM IR is defined with the following blocks: |
| </p> |
| |
| <ul> |
| <li>8 — <tt>MODULE_BLOCK</tt> — This is the top-level block that |
| contains the entire module, and describes a variety of per-module |
| information.</li> |
| <li>9 — <tt>PARAMATTR_BLOCK</tt> — This enumerates the parameter |
| attributes.</li> |
| <li>10 — <tt>TYPE_BLOCK</tt> — This describes all of the types in |
| the module.</li> |
| <li>11 — <tt>CONSTANTS_BLOCK</tt> — This describes constants for a |
| module or function.</li> |
| <li>12 — <tt>FUNCTION_BLOCK</tt> — This describes a function |
| body.</li> |
| <li>13 — <tt>TYPE_SYMTAB_BLOCK</tt> — This describes the type symbol |
| table.</li> |
| <li>14 — <tt>VALUE_SYMTAB_BLOCK</tt> — This describes a value symbol |
| table.</li> |
| </ul> |
| |
| </div> |
| |
| <!-- ======================================================================= --> |
| <div class="doc_subsection"><a name="MODULE_BLOCK">MODULE_BLOCK Contents</a> |
| </div> |
| |
| <div class="doc_text"> |
| |
| <p> |
| </p> |
| |
| </div> |
| |
| |
| <!-- *********************************************************************** --> |
| <hr> |
| <address> <a href="http://jigsaw.w3.org/css-validator/check/referer"><img |
| src="http://jigsaw.w3.org/css-validator/images/vcss-blue" alt="Valid CSS"></a> |
| <a href="http://validator.w3.org/check/referer"><img |
| src="http://www.w3.org/Icons/valid-html401-blue" alt="Valid HTML 4.01"></a> |
| <a href="mailto:sabre@nondot.org">Chris Lattner</a><br> |
| <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br> |
| Last modified: $Date$ |
| </address> |
| </body> |
| </html> |