| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" | 
 | 2 |                       "http://www.w3.org/TR/html4/strict.dtd"> | 
| Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 3 | <html> | 
 | 4 | <head> | 
 | 5 |   <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> | 
 | 6 |   <title>LLVM Bitcode File Format</title> | 
 | 7 |   <link rel="stylesheet" href="llvm.css" type="text/css"> | 
| Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 8 | </head> | 
 | 9 | <body> | 
 | 10 | <div class="doc_title"> LLVM Bitcode File Format </div> | 
 | 11 | <ol> | 
 | 12 |   <li><a href="#abstract">Abstract</a></li> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 13 |   <li><a href="#overview">Overview</a></li> | 
 | 14 |   <li><a href="#bitstream">Bitstream Format</a> | 
 | 15 |     <ol> | 
 | 16 |     <li><a href="#magic">Magic Numbers</a></li> | 
| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 17 |     <li><a href="#primitives">Primitives</a></li> | 
 | 18 |     <li><a href="#abbrevid">Abbreviation IDs</a></li> | 
 | 19 |     <li><a href="#blocks">Blocks</a></li> | 
 | 20 |     <li><a href="#datarecord">Data Records</a></li> | 
| Chris Lattner | daeb63c | 2007-05-12 07:49:15 +0000 | [diff] [blame] | 21 |     <li><a href="#abbreviations">Abbreviations</a></li> | 
| Chris Lattner | 7300af5 | 2007-05-13 00:59:52 +0000 | [diff] [blame] | 22 |     <li><a href="#stdblocks">Standard Blocks</a></li> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 23 |     </ol> | 
 | 24 |   </li> | 
| Chris Lattner | 6fa6a32 | 2008-07-09 05:14:23 +0000 | [diff] [blame] | 25 |   <li><a href="#wrapper">Bitcode Wrapper Format</a> | 
 | 26 |   </li> | 
| Chris Lattner | 69b3e40 | 2007-05-13 01:39:44 +0000 | [diff] [blame] | 27 |   <li><a href="#llvmir">LLVM IR Encoding</a> | 
 | 28 |     <ol> | 
 | 29 |     <li><a href="#basics">Basics</a></li> | 
 | 30 |     </ol> | 
 | 31 |   </li> | 
| Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 32 | </ol> | 
 | 33 | <div class="doc_author"> | 
| Chris Lattner | f19b8e4 | 2007-10-08 18:42:45 +0000 | [diff] [blame] | 34 |   <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a> | 
 | 35 |   and <a href="http://www.reverberate.org">Joshua Haberman</a>. | 
| Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 36 | </p> | 
 | 37 | </div> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 38 |  | 
| Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 39 | <!-- *********************************************************************** --> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 40 | <div class="doc_section"> <a name="abstract">Abstract</a></div> | 
| Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 41 | <!-- *********************************************************************** --> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 42 |  | 
| Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 43 | <div class="doc_text"> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 44 |  | 
 | 45 | <p>This document describes the LLVM bitstream file format and the encoding of | 
 | 46 | the LLVM IR into it.</p> | 
 | 47 |  | 
| Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 48 | </div> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 49 |  | 
| Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 50 | <!-- *********************************************************************** --> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 51 | <div class="doc_section"> <a name="overview">Overview</a></div> | 
| Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 52 | <!-- *********************************************************************** --> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 53 |  | 
| Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 54 | <div class="doc_text"> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 55 |  | 
 | 56 | <p> | 
What is commonly known as the LLVM bitcode file format (also sometimes
anachronistically called bytecode) is actually two things: a <a
href="#bitstream">bitstream container format</a>
and an <a href="#llvmir">encoding of LLVM IR</a> into the container format.</p>
 | 61 |  | 
 | 62 | <p> | 
The bitstream format is an abstract encoding of structured data, very
similar to XML in some ways.  Like XML, bitstream files contain tags and nested
structures, and you can parse the file without having to understand the tags.
Unlike XML, the bitstream format is a binary encoding, and it
provides a mechanism for the file to self-describe "abbreviations", which are
effectively size optimizations for the content.</p>
 | 69 |  | 
| Chris Lattner | 6fa6a32 | 2008-07-09 05:14:23 +0000 | [diff] [blame] | 70 | <p>LLVM IR files may be optionally embedded into a <a  | 
 | 71 | href="#wrapper">wrapper</a> structure that makes it easy to embed extra data | 
 | 72 | along with LLVM IR files.</p> | 
 | 73 |  | 
<p>This document first describes the LLVM bitstream format, then the
wrapper format, and finally the record structure used by LLVM IR files.
</p>
 | 77 |  | 
| Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 78 | </div> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 79 |  | 
 | 80 | <!-- *********************************************************************** --> | 
 | 81 | <div class="doc_section"> <a name="bitstream">Bitstream Format</a></div> | 
 | 82 | <!-- *********************************************************************** --> | 
 | 83 |  | 
 | 84 | <div class="doc_text"> | 
 | 85 |  | 
 | 86 | <p> | 
 | 87 | The bitstream format is literally a stream of bits, with a very simple | 
 | 88 | structure.  This structure consists of the following concepts: | 
 | 89 | </p> | 
 | 90 |  | 
 | 91 | <ul> | 
| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 92 | <li>A "<a href="#magic">magic number</a>" that identifies the contents of | 
 | 93 |     the stream.</li> | 
 | 94 | <li>Encoding <a href="#primitives">primitives</a> like variable bit-rate | 
 | 95 |     integers.</li>  | 
 | 96 | <li><a href="#blocks">Blocks</a>, which define nested content.</li>  | 
 | 97 | <li><a href="#datarecord">Data Records</a>, which describe entities within the | 
 | 98 |     file.</li>  | 
<li><a href="#abbreviations">Abbreviations</a>, which specify compression
    optimizations for the file.</li>
 | 100 | </ul> | 
 | 101 |  | 
 | 102 | <p>Note that the <a  | 
 | 103 | href="CommandGuide/html/llvm-bcanalyzer.html">llvm-bcanalyzer</a> tool can be | 
 | 104 | used to dump and inspect arbitrary bitstreams, which is very useful for | 
 | 105 | understanding the encoding.</p> | 
 | 106 |  | 
 | 107 | </div> | 
 | 108 |  | 
 | 109 | <!-- ======================================================================= --> | 
 | 110 | <div class="doc_subsection"><a name="magic">Magic Numbers</a> | 
 | 111 | </div> | 
 | 112 |  | 
 | 113 | <div class="doc_text"> | 
 | 114 |  | 
| Chris Lattner | f19b8e4 | 2007-10-08 18:42:45 +0000 | [diff] [blame] | 115 | <p>The first two bytes of a bitcode file are 'BC' (0x42, 0x43). | 
 | 116 | The second two bytes are an application-specific magic number.  Generic | 
 | 117 | bitcode tools can look at only the first two bytes to verify the file is | 
 | 118 | bitcode, while application-specific programs will want to look at all four.</p> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 119 |  | 
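<p>The two-level check described above can be sketched as follows.  This is a
minimal illustration, not part of any LLVM API: the function name
<tt>classify_magic</tt> is hypothetical, and the application-specific half of
the magic number is treated as opaque bytes.</p>

```python
def classify_magic(header: bytes):
    """Split the first four bytes of a stream into the two magic halves.

    Returns (is_bitcode, app_magic). A generic tool only needs is_bitcode;
    an application-specific tool would also compare app_magic against the
    value it expects.
    """
    if len(header) < 4:
        raise ValueError("need at least four bytes to inspect the magic number")
    is_bitcode = header[0:2] == b"BC"  # 0x42, 0x43
    app_magic = header[2:4]            # application-specific, opaque here
    return is_bitcode, app_magic
```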
 | 120 | </div> | 
 | 121 |  | 
| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 122 | <!-- ======================================================================= --> | 
 | 123 | <div class="doc_subsection"><a name="primitives">Primitives</a> | 
 | 124 | </div> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 125 |  | 
 | 126 | <div class="doc_text"> | 
 | 127 |  | 
| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 128 | <p> | 
A bitstream literally consists of a stream of bits, which are read in order
starting with the least significant bit of each byte.  The stream is made up of a
number of primitive values that encode a stream of unsigned integer values.
These integers are encoded in two ways: either as <a href="#fixedwidth">Fixed
Width Integers</a> or as <a href="#variablewidth">Variable Width
Integers</a>.
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 136 | </p> | 
 | 137 |  | 
 | 138 | </div> | 
 | 139 |  | 
| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 140 | <!-- _______________________________________________________________________ --> | 
 | 141 | <div class="doc_subsubsection"> <a name="fixedwidth">Fixed Width Integers</a> | 
 | 142 | </div> | 
 | 143 |  | 
 | 144 | <div class="doc_text"> | 
 | 145 |  | 
<p>Fixed-width integer values have their low bits emitted directly to the file.
   For example, a 3-bit integer value encodes 1 as 001.  Fixed-width integers
   are used when there is a well-known number of options for a field.  For
   example, boolean values are usually encoded with a 1-bit wide integer.
 | 150 | </p> | 
 | 151 |  | 
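<p>A reader for fixed-width fields might look like the sketch below.  The
class name <tt>BitReader</tt> and its method are hypothetical.  The sketch
assumes that the first bit read from the stream becomes the least significant
bit of the decoded value, which is consistent with the bit order described
under <a href="#primitives">Primitives</a> but is stated here as an
assumption.</p>

```python
class BitReader:
    """Reads bits from a byte string, least significant bit of each byte first."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position, measured in bits

    def read_fixed(self, width: int) -> int:
        """Read a fixed-width unsigned integer of `width` bits."""
        value = 0
        for i in range(width):
            byte = self.data[self.pos >> 3]         # byte holding the next bit
            bit = (byte >> (self.pos & 7)) & 1      # bits are taken LSB first
            value |= bit << i                       # assumption: first bit is the low bit
            self.pos += 1
        return value
```

<p>Reading a 3-bit field followed by a 1-bit boolean then amounts to calling
<tt>read_fixed(3)</tt> and <tt>read_fixed(1)</tt> in sequence on the same
reader.</p>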
 | 152 | </div> | 
 | 153 |  | 
 | 154 | <!-- _______________________________________________________________________ --> | 
 | 155 | <div class="doc_subsubsection"> <a name="variablewidth">Variable Width | 
 | 156 | Integers</a></div> | 
 | 157 |  | 
 | 158 | <div class="doc_text"> | 
 | 159 |  | 
<p>Variable-width integer (VBR) values encode values of arbitrary size,
optimizing for the case where the values are small.  Given a 4-bit VBR field,
any 3-bit value (0 through 7) is encoded directly, with the high bit set to
zero.  More generally, an N-bit VBR field carries N-1 bits of data per chunk:
values that need more than N-1 bits are emitted as a series of N-1 bit chunks,
lowest-order chunk first, where all but the last chunk set the high bit.</p>
 | 165 |  | 
<p>For example, the value 27 (0x1B) is encoded as 1011 0011 when emitted as a
vbr4 value.  The first set of four bits indicates the value 3 (011) with a
continuation piece (indicated by a high bit of 1).  The next set of four bits
indicates a value of 24 (011 &lt;&lt; 3) with no continuation.  The sum (3+24)
yields the value 27.
 | 171 | </p> | 
 | 172 |  | 
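<p>The chunking rule can be sketched as a pair of helper functions.  The names
<tt>encode_vbr</tt> and <tt>decode_vbr</tt> are hypothetical, and the sketch
works on a list of chunk values rather than on a packed bitstream, so that the
vbr4 example above can be checked directly.</p>

```python
def encode_vbr(value: int, width: int) -> list:
    """Split an unsigned value into VBR chunks of `width` bits each.

    Each chunk holds width-1 data bits; the high bit marks a continuation.
    Chunks are produced lowest-order first.
    """
    data_bits = width - 1
    mask = (1 << data_bits) - 1
    continuation = 1 << data_bits
    chunks = []
    while True:
        chunk = value & mask
        value >>= data_bits
        if value:
            chunks.append(chunk | continuation)  # more chunks follow
        else:
            chunks.append(chunk)                 # last chunk, high bit clear
            return chunks


def decode_vbr(chunks: list, width: int) -> int:
    """Reassemble a value from VBR chunks, stopping at the first chunk
    whose continuation bit is clear."""
    data_bits = width - 1
    mask = (1 << data_bits) - 1
    value = 0
    shift = 0
    for chunk in chunks:
        value |= (chunk & mask) << shift
        shift += data_bits
        if not chunk >> data_bits:
            break
    return value
```

<p>With these helpers, <tt>encode_vbr(27, 4)</tt> yields the two chunks
<tt>0b1011</tt> and <tt>0b0011</tt> from the example above.</p>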
 | 173 | </div> | 
 | 174 |  | 
 | 175 | <!-- _______________________________________________________________________ --> | 
 | 176 | <div class="doc_subsubsection"> <a name="char6">6-bit characters</a></div> | 
 | 177 |  | 
 | 178 | <div class="doc_text"> | 
 | 179 |  | 
 | 180 | <p>6-bit characters encode common characters into a fixed 6-bit field.  They | 
| Chris Lattner | f1d64e9 | 2007-05-12 07:50:14 +0000 | [diff] [blame] | 181 | represent the following characters with the following 6-bit values:</p> | 
| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 182 |  | 
 | 183 | <ul> | 
 | 184 | <li>'a' .. 'z' - 0 .. 25</li> | 
| Chris Lattner | f19b8e4 | 2007-10-08 18:42:45 +0000 | [diff] [blame] | 185 | <li>'A' .. 'Z' - 26 .. 51</li> | 
 | 186 | <li>'0' .. '9' - 52 .. 61</li> | 
| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 187 | <li>'.' - 62</li> | 
 | 188 | <li>'_' - 63</li> | 
 | 189 | </ul> | 
 | 190 |  | 
 | 191 | <p>This encoding is only suitable for encoding characters and strings that | 
 | 192 | consist only of the above characters.  It is completely incapable of encoding | 
 | 193 | characters not in the set.</p> | 
 | 194 |  | 
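<p>The table above maps directly onto a 64-character alphabet.  The following
sketch, with the hypothetical names <tt>CHAR6_ALPHABET</tt>,
<tt>encode_char6</tt> and <tt>decode_char6</tt>, shows the mapping in both
directions and rejects characters outside the set.</p>

```python
import string

# 'a'..'z' -> 0..25, 'A'..'Z' -> 26..51, '0'..'9' -> 52..61, '.' -> 62, '_' -> 63
CHAR6_ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits + "._"
CHAR6_VALUES = {ch: i for i, ch in enumerate(CHAR6_ALPHABET)}


def encode_char6(text: str) -> list:
    """Return the 6-bit value for each character, or raise ValueError."""
    values = []
    for ch in text:
        if ch not in CHAR6_VALUES:
            raise ValueError(f"{ch!r} cannot be represented as a 6-bit character")
        values.append(CHAR6_VALUES[ch])
    return values


def decode_char6(values: list) -> str:
    """Map 6-bit values (0..63) back to characters."""
    for v in values:
        if not 0 <= v < 64:
            raise ValueError(f"{v} is not a 6-bit value")
    return "".join(CHAR6_ALPHABET[v] for v in values)
```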
 | 195 | </div> | 
 | 196 |  | 
 | 197 | <!-- _______________________________________________________________________ --> | 
 | 198 | <div class="doc_subsubsection"> <a name="wordalign">Word Alignment</a></div> | 
 | 199 |  | 
 | 200 | <div class="doc_text"> | 
 | 201 |  | 
 | 202 | <p>Occasionally, it is useful to emit zero bits until the bitstream is a | 
 | 203 | multiple of 32 bits.  This ensures that the bit position in the stream can be | 
 | 204 | represented as a multiple of 32-bit words.</p> | 
 | 205 |  | 
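<p>The number of zero bits needed to reach the next 32-bit boundary follows
from the current bit position alone.  A minimal sketch, using the hypothetical
helper name <tt>align32_padding</tt>:</p>

```python
def align32_padding(bit_pos: int) -> int:
    """Number of zero bits to emit so that bit_pos becomes a multiple of 32."""
    return (-bit_pos) % 32
```

<p>A position that is already aligned needs no padding, so the helper returns
0 for positions 0, 32, 64, and so on.</p>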
 | 206 | </div> | 
 | 207 |  | 
 | 208 |  | 
 | 209 | <!-- ======================================================================= --> | 
 | 210 | <div class="doc_subsection"><a name="abbrevid">Abbreviation IDs</a> | 
 | 211 | </div> | 
 | 212 |  | 
 | 213 | <div class="doc_text"> | 
 | 214 |  | 
 | 215 | <p> | 
A bitstream is a sequential series of <a href="#blocks">Blocks</a> and
<a href="#datarecord">Data Records</a>.  Both of these start with an
abbreviation ID encoded as a fixed-bitwidth field.  The width is specified by
the current block, as described below.  The value of the abbreviation ID
specifies either a builtin ID (which has a special meaning, defined below) or
one of the abbreviation IDs defined by the stream itself.
 | 222 | </p> | 
 | 223 |  | 
 | 224 | <p> | 
 | 225 | The set of builtin abbrev IDs is: | 
 | 226 | </p> | 
 | 227 |  | 
 | 228 | <ul> | 
 | 229 | <li>0 - <a href="#END_BLOCK">END_BLOCK</a> - This abbrev ID marks the end of the | 
 | 230 |     current block.</li> | 
 | 231 | <li>1 - <a href="#ENTER_SUBBLOCK">ENTER_SUBBLOCK</a> - This abbrev ID marks the | 
 | 232 |     beginning of a new block.</li> | 
| Chris Lattner | daeb63c | 2007-05-12 07:49:15 +0000 | [diff] [blame] | 233 | <li>2 - <a href="#DEFINE_ABBREV">DEFINE_ABBREV</a> - This defines a new | 
 | 234 |     abbreviation.</li> | 
 | 235 | <li>3 - <a href="#UNABBREV_RECORD">UNABBREV_RECORD</a> - This ID specifies the | 
 | 236 |     definition of an unabbreviated record.</li> | 
| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 237 | </ul> | 
 | 238 |  | 
| Chris Lattner | daeb63c | 2007-05-12 07:49:15 +0000 | [diff] [blame] | 239 | <p>Abbreviation IDs 4 and above are defined by the stream itself, and specify | 
 | 240 | an <a href="#abbrev_records">abbreviated record encoding</a>.</p> | 
| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 241 |  | 
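<p>A reader typically dispatches on the abbreviation ID it has just read.  The
sketch below only classifies an ID according to the list above; the names
<tt>BUILTIN_ABBREV_IDS</tt> and <tt>describe_abbrev_id</tt> are hypothetical.</p>

```python
BUILTIN_ABBREV_IDS = {
    0: "END_BLOCK",
    1: "ENTER_SUBBLOCK",
    2: "DEFINE_ABBREV",
    3: "UNABBREV_RECORD",
}


def describe_abbrev_id(abbrev_id: int) -> str:
    """Return the builtin name, or note that the ID is defined by the stream."""
    if abbrev_id < 0:
        raise ValueError("abbreviation IDs are unsigned")
    if abbrev_id in BUILTIN_ABBREV_IDS:
        return BUILTIN_ABBREV_IDS[abbrev_id]
    # IDs 4 and above index into the abbreviations defined by the stream.
    return f"stream-defined abbreviation (index {abbrev_id - 4})"
```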
 | 242 | </div> | 
 | 243 |  | 
 | 244 | <!-- ======================================================================= --> | 
 | 245 | <div class="doc_subsection"><a name="blocks">Blocks</a> | 
 | 246 | </div> | 
 | 247 |  | 
 | 248 | <div class="doc_text"> | 
 | 249 |  | 
 | 250 | <p> | 
Blocks in a bitstream denote nested regions of the stream, and are identified by
a content-specific id number (for example, LLVM IR uses an ID of 12 to represent
function bodies).  Block IDs 0-7 are reserved for <a href="#stdblocks">standard blocks</a>
whose meaning is defined by Bitcode; block IDs 8 and greater are
application specific.  Nested blocks capture the hierarchical structure of the
data encoded in the stream, and various properties are associated with blocks
as the file is parsed.  Block definitions allow the reader to skip blocks
in constant time, whether it wants only a summary of the blocks or needs to
pass over data it does not understand.  The LLVM IR reader uses this
mechanism to skip function bodies, lazily reading them on demand.
 | 261 | </p> | 
 | 262 |  | 
 | 263 | <p> | 
 | 264 | When reading and encoding the stream, several properties are maintained for the | 
 | 265 | block.  In particular, each block maintains: | 
 | 266 | </p> | 
 | 267 |  | 
 | 268 | <ol> | 
 | 269 | <li>A current abbrev id width.  This value starts at 2, and is set every time a | 
 | 270 |     block record is entered.  The block entry specifies the abbrev id width for | 
 | 271 |     the body of the block.</li> | 
 | 272 |  | 
| Chris Lattner | f19b8e4 | 2007-10-08 18:42:45 +0000 | [diff] [blame] | 273 | <li>A set of abbreviations.  Abbreviations may be defined within a block, in | 
 | 274 |     which case they are only defined in that block (neither subblocks nor | 
 | 275 |     enclosing blocks see the abbreviation).  Abbreviations can also be defined | 
 | 276 |     inside a <a href="#BLOCKINFO">BLOCKINFO</a> block, in which case they are | 
 | 277 |     defined in all blocks that match the ID that the BLOCKINFO block is describing. | 
| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 278 | </li> | 
 | 279 | </ol> | 
 | 280 |  | 
<p>As sub-blocks are entered, these properties are saved and the new sub-block
has its own set of abbreviations, and its own abbrev id width.  When a sub-block
is exited, the saved values are restored.</p>
 | 284 |  | 
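<p>The save-and-restore behaviour can be sketched with an explicit stack.  The
class <tt>BlockScopes</tt> and its members are hypothetical names, and the
sketch deliberately leaves out abbreviations contributed by a
<a href="#BLOCKINFO">BLOCKINFO</a> block, which a real reader would add when
a block is entered.</p>

```python
class BlockScopes:
    """Tracks the per-block state described above as blocks are entered and left."""

    def __init__(self):
        self.abbrev_width = 2   # initial abbrev id width at the top level
        self.abbrevs = []       # abbreviations visible in the current block
        self._saved = []        # saved (width, abbrevs) of enclosing blocks

    def enter(self, new_abbrev_width: int):
        """Save the enclosing block's state and start a fresh scope."""
        self._saved.append((self.abbrev_width, self.abbrevs))
        self.abbrev_width = new_abbrev_width
        self.abbrevs = []       # sub-blocks do not see the parent's abbreviations

    def leave(self):
        """Restore the state saved when the current block was entered."""
        self.abbrev_width, self.abbrevs = self._saved.pop()
```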
 | 285 | </div> | 
 | 286 |  | 
 | 287 | <!-- _______________________________________________________________________ --> | 
 | 288 | <div class="doc_subsubsection"> <a name="ENTER_SUBBLOCK">ENTER_SUBBLOCK | 
 | 289 | Encoding</a></div> | 
 | 290 |  | 
 | 291 | <div class="doc_text"> | 
 | 292 |  | 
<p><tt>[ENTER_SUBBLOCK, blockid<sub>vbr8</sub>, newabbrevlen<sub>vbr4</sub>,
     &lt;align32bits&gt;, blocklen<sub>32</sub>]</tt></p>
 | 295 |  | 
 | 296 | <p> | 
The ENTER_SUBBLOCK abbreviation ID specifies the start of a new block record.
The <tt>blockid</tt> value is encoded as an 8-bit VBR identifier, and indicates
the type of block being entered (which can be a <a href="#stdblocks">standard
block</a> or an application-specific block).  The
<tt>newabbrevlen</tt> value is a 4-bit VBR which specifies the
abbrev id width for the sub-block.  The <tt>blocklen</tt> is a 32-bit value,
aligned to a 32-bit boundary, that specifies the size of the subblock in
32-bit words.  This value allows the reader to skip over the entire block in
one jump.
 | 305 | </p> | 
 | 306 |  | 
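<p>Reading the block header and skipping the body can be sketched as follows.
All names here (<tt>BitReader</tt>, <tt>read_block_header</tt>,
<tt>skip_block</tt>) are hypothetical.  The sketch assumes that the
ENTER_SUBBLOCK abbreviation ID has already been consumed, and that
<tt>blocklen</tt> counts the 32-bit words that follow the <tt>blocklen</tt>
field itself.</p>

```python
class BitReader:
    """Minimal LSB-first bit reader with the primitives needed for a block header."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # position in bits

    def read_fixed(self, width: int) -> int:
        value = 0
        for i in range(width):
            bit = (self.data[self.pos >> 3] >> (self.pos & 7)) & 1
            value |= bit << i
            self.pos += 1
        return value

    def read_vbr(self, width: int) -> int:
        data_bits = width - 1
        value, shift = 0, 0
        while True:
            chunk = self.read_fixed(width)
            value |= (chunk & ((1 << data_bits) - 1)) << shift
            shift += data_bits
            if not chunk >> data_bits:   # continuation bit clear: last chunk
                return value

    def align32(self):
        self.pos += (-self.pos) % 32     # skip padding up to a 32-bit boundary


def read_block_header(reader: BitReader):
    """Read [blockid vbr8, newabbrevlen vbr4, <align32bits>, blocklen 32]."""
    block_id = reader.read_vbr(8)
    new_abbrev_len = reader.read_vbr(4)
    reader.align32()
    block_len_words = reader.read_fixed(32)
    return block_id, new_abbrev_len, block_len_words


def skip_block(reader: BitReader, block_len_words: int):
    """Jump over the block body without decoding it."""
    reader.pos += block_len_words * 32
```

<p>Because the length is stored up front, skipping costs one addition
regardless of how much data the block contains.</p>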
 | 307 | </div> | 
 | 308 |  | 
 | 309 | <!-- _______________________________________________________________________ --> | 
 | 310 | <div class="doc_subsubsection"> <a name="END_BLOCK">END_BLOCK | 
 | 311 | Encoding</a></div> | 
 | 312 |  | 
 | 313 | <div class="doc_text"> | 
 | 314 |  | 
<p><tt>[END_BLOCK, &lt;align32bits&gt;]</tt></p>
 | 316 |  | 
 | 317 | <p> | 
The END_BLOCK abbreviation ID specifies the end of the current block record.
Its end is aligned to 32 bits to ensure that the size of the block is a whole
multiple of 32 bits.</p>
 | 321 |  | 
 | 322 | </div> | 
 | 323 |  | 
 | 324 |  | 
 | 325 |  | 
 | 326 | <!-- ======================================================================= --> | 
 | 327 | <div class="doc_subsection"><a name="datarecord">Data Records</a> | 
 | 328 | </div> | 
 | 329 |  | 
 | 330 | <div class="doc_text"> | 
| Chris Lattner | daeb63c | 2007-05-12 07:49:15 +0000 | [diff] [blame] | 331 | <p> | 
Data records consist of a record code and a number of (up to) 64-bit integer
values.  The interpretation of the code and values is application specific, and
there are multiple ways to encode a record (with an unabbreviated record
or with an abbreviation).  In the LLVM IR format, for example, there is a record
which encodes the target triple of a module.  The code is MODULE_CODE_TRIPLE,
and the values of the record are the ASCII codes for the characters in the
string.</p>
 | 339 |  | 
 | 340 | </div> | 
 | 341 |  | 
 | 342 | <!-- _______________________________________________________________________ --> | 
 | 343 | <div class="doc_subsubsection"> <a name="UNABBREV_RECORD">UNABBREV_RECORD | 
 | 344 | Encoding</a></div> | 
 | 345 |  | 
 | 346 | <div class="doc_text"> | 
 | 347 |  | 
 | 348 | <p><tt>[UNABBREV_RECORD, code<sub>vbr6</sub>, numops<sub>vbr6</sub>, | 
 | 349 |        op0<sub>vbr6</sub>, op1<sub>vbr6</sub>, ...]</tt></p> | 
 | 350 |  | 
 | 351 | <p>An UNABBREV_RECORD provides a default fallback encoding, which is both | 
 | 352 | completely general and also extremely inefficient.  It can describe an arbitrary | 
 | 353 | record, by emitting the code and operands as vbrs.</p> | 
 | 354 |  | 
<p>For example, emitting an LLVM IR target triple as an unabbreviated record
requires emitting the UNABBREV_RECORD abbrevid, a vbr6 for the
MODULE_CODE_TRIPLE code, a vbr6 for the length of the string (which is equal to
the number of operands), and a vbr6 for each character.  A single vbr6 chunk
carries only 5 bits of data, and no letter has an ASCII value less than 32, so
each letter must be emitted as at least a two-chunk VBR, which means that each
letter requires at least 12 bits.  This is not an efficient encoding, but it is
fully general.</p>
 | 362 |  | 
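<p>The cost of the unabbreviated form can be estimated with a short
calculation.  The helper names <tt>vbr_bits</tt> and
<tt>unabbrev_record_bits</tt> are hypothetical, and the string "abcd" and the
abbrev id width of 3 are only illustrative inputs; the record code 2 for the
target triple is the value used elsewhere in this document.</p>

```python
def vbr_bits(value: int, width: int) -> int:
    """Number of bits a value occupies when emitted as a VBR of the given width."""
    data_bits = width - 1
    chunks = 1
    while value >> (data_bits * chunks):
        chunks += 1
    return chunks * width


def unabbrev_record_bits(abbrev_width: int, code: int, operands: list) -> int:
    """Size of [UNABBREV_RECORD, code vbr6, numops vbr6, op0 vbr6, ...] in bits."""
    bits = abbrev_width                  # the UNABBREV_RECORD abbrev ID itself
    bits += vbr_bits(code, 6)
    bits += vbr_bits(len(operands), 6)
    bits += sum(vbr_bits(op, 6) for op in operands)
    return bits


# Illustrative input: the string "abcd" as a target triple record (code 2).
triple_bits = unabbrev_record_bits(3, 2, [ord(c) for c in "abcd"])
```

<p>Here each letter costs 12 bits, so the four characters alone take 48 bits
before the abbrev ID, code and length are counted.</p>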
 | 363 | </div> | 
 | 364 |  | 
 | 365 | <!-- _______________________________________________________________________ --> | 
 | 366 | <div class="doc_subsubsection"> <a name="abbrev_records">Abbreviated Record | 
 | 367 | Encoding</a></div> | 
 | 368 |  | 
 | 369 | <div class="doc_text"> | 
 | 370 |  | 
<p><tt>[&lt;abbrevid&gt;, fields...]</tt></p>
 | 372 |  | 
<p>An abbreviated record is an abbreviation id followed by a set of fields that
are encoded according to the <a href="#abbreviations">abbreviation
definition</a>.  This allows records to be encoded significantly more densely
than records encoded with the <a href="#UNABBREV_RECORD">UNABBREV_RECORD</a>
type, and allows the abbreviation types to be specified in the stream itself,
which allows the files to be completely self-describing.  The actual encoding
of abbreviations is defined below.
 | 380 | </p> | 
 | 381 |  | 
 | 382 | </div> | 
 | 383 |  | 
 | 384 | <!-- ======================================================================= --> | 
 | 385 | <div class="doc_subsection"><a name="abbreviations">Abbreviations</a> | 
 | 386 | </div> | 
 | 387 |  | 
 | 388 | <div class="doc_text"> | 
 | 389 | <p> | 
 | 390 | Abbreviations are an important form of compression for bitstreams.  The idea is | 
 | 391 | to specify a dense encoding for a class of records once, then use that encoding | 
 | 392 | to emit many records.  It takes space to emit the encoding into the file, but | 
 | 393 | the space is recouped (hopefully plus some) when the records that use it are | 
 | 394 | emitted. | 
 | 395 | </p> | 
| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 396 |  | 
 | 397 | <p> | 
Abbreviations can be determined dynamically per client, per file.  Because the
abbreviations are stored in the bitstream itself, different streams of the same
format can contain different sets of abbreviations, and a stream can omit any
abbreviation it does not need.  As a concrete example, LLVM IR files usually
emit an abbreviation for binary operators.  If a specific LLVM module contains
few or no binary operators, the abbreviation does not need to be emitted.
| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 404 | </p> | 
| Chris Lattner | daeb63c | 2007-05-12 07:49:15 +0000 | [diff] [blame] | 405 | </div> | 
 | 406 |  | 
 | 407 | <!-- _______________________________________________________________________ --> | 
 | 408 | <div class="doc_subsubsection"><a name="DEFINE_ABBREV">DEFINE_ABBREV | 
 | 409 |  Encoding</a></div> | 
 | 410 |  | 
 | 411 | <div class="doc_text"> | 
 | 412 |  | 
 | 413 | <p><tt>[DEFINE_ABBREV, numabbrevops<sub>vbr5</sub>, abbrevop0, abbrevop1, | 
 | 414 |  ...]</tt></p> | 
 | 415 |  | 
| Chris Lattner | f19b8e4 | 2007-10-08 18:42:45 +0000 | [diff] [blame] | 416 | <p>A DEFINE_ABBREV record adds an abbreviation to the list of currently | 
 | 417 | defined abbreviations in the scope of this block.  This definition only | 
 | 418 | exists inside this immediate block -- it is not visible in subblocks or | 
 | 419 | enclosing blocks. | 
 | 420 | Abbreviations are implicitly assigned IDs | 
 | 421 | sequentially starting from 4 (the first application-defined abbreviation ID). | 
 | 422 | Any abbreviations defined in a BLOCKINFO record receive IDs first, in order, | 
 | 423 | followed by any abbreviations defined within the block itself. | 
 | 424 | Abbreviated data records reference this ID to indicate what abbreviation | 
 | 425 | they are invoking.</p> | 
 | 426 |  | 
| Chris Lattner | daeb63c | 2007-05-12 07:49:15 +0000 | [diff] [blame] | 427 | <p>An abbreviation definition consists of the DEFINE_ABBREV abbrevid followed | 
 | 428 | by a VBR that specifies the number of abbrev operands, then the abbrev | 
 | 429 | operands themselves.  Abbreviation operands come in three forms.  They all start | 
 | 430 | with a single bit that indicates whether the abbrev operand is a literal operand | 
 | 431 | (when the bit is 1) or an encoding operand (when the bit is 0).</p> | 
 | 432 |  | 
 | 433 | <ol> | 
 | 434 | <li>Literal operands - <tt>[1<sub>1</sub>, litvalue<sub>vbr8</sub>]</tt> - | 
 | 435 | Literal operands specify that the value in the result | 
 | 436 | is always a single specific value.  This specific value is emitted as a vbr8 | 
 | 437 | after the bit indicating that it is a literal operand.</li> | 
 | 438 | <li>Encoding info without data - <tt>[0<sub>1</sub>, encoding<sub>3</sub>]</tt> | 
| Chris Lattner | 7300af5 | 2007-05-13 00:59:52 +0000 | [diff] [blame] | 439 |  - Operand encodings that do not have extra data are just emitted as their code. | 
| Chris Lattner | daeb63c | 2007-05-12 07:49:15 +0000 | [diff] [blame] | 440 | </li> | 
 | 441 | <li>Encoding info with data - <tt>[0<sub>1</sub>, encoding<sub>3</sub>,  | 
| Chris Lattner | 7300af5 | 2007-05-13 00:59:52 +0000 | [diff] [blame] | 442 | value<sub>vbr5</sub>]</tt> - Operand encodings that do have extra data are | 
 | 443 | emitted as their code, followed by the extra data. | 
| Chris Lattner | daeb63c | 2007-05-12 07:49:15 +0000 | [diff] [blame] | 444 | </li> | 
 | 445 | </ol> | 
| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 446 |  | 
| Chris Lattner | 7300af5 | 2007-05-13 00:59:52 +0000 | [diff] [blame] | 447 | <p>The possible operand encodings are:</p> | 
 | 448 |  | 
 | 449 | <ul> | 
 | 450 | <li>1 - Fixed - The field should be emitted as a <a  | 
 | 451 |     href="#fixedwidth">fixed-width value</a>, whose width | 
| Chris Lattner | f19b8e4 | 2007-10-08 18:42:45 +0000 | [diff] [blame] | 452 |     is specified by the operand's extra data.</li> | 
| Chris Lattner | 7300af5 | 2007-05-13 00:59:52 +0000 | [diff] [blame] | 453 | <li>2 - VBR - The field should be emitted as a <a  | 
 | 454 |     href="#variablewidth">variable-width value</a>, whose width | 
| Chris Lattner | f19b8e4 | 2007-10-08 18:42:45 +0000 | [diff] [blame] | 455 |     is specified by the operand's extra data.</li> | 
 | 456 | <li>3 - Array - This field is an array of values.  The array operand has no | 
 | 457 |     extra data, but expects another operand to follow it which indicates the | 
 | 458 |     element type of the array.  When reading an array in an abbreviated record, | 
 | 459 |     the first integer is a vbr6 that indicates the array length, followed by | 
 | 460 |     the encoded elements of the array.  An array may only occur as the last | 
 | 461 |     operand of an abbreviation (except for the one final operand that gives | 
 | 462 |     the array's type).</li> | 
| Chris Lattner | 7300af5 | 2007-05-13 00:59:52 +0000 | [diff] [blame] | 463 | <li>4 - Char6 - This field should be emitted as a <a href="#char6">char6-encoded | 
| Chris Lattner | f19b8e4 | 2007-10-08 18:42:45 +0000 | [diff] [blame] | 464 |     value</a>.  This operand type takes no extra data.</li> | 
| Chris Lattner | 7300af5 | 2007-05-13 00:59:52 +0000 | [diff] [blame] | 465 | </ul> | 
 | 466 |  | 
<p>For example, target triples in LLVM modules are encoded as a record of the
form <tt>[TRIPLE, 'a', 'b', 'c', 'd']</tt>.  Suppose the bitstream emitted
the following abbreviation definition:</p>
 | 470 |  | 
 | 471 | <ul> | 
 | 472 | <li><tt>[0, Fixed, 4]</tt></li> | 
 | 473 | <li><tt>[0, Array]</tt></li> | 
 | 474 | <li><tt>[0, Char6]</tt></li> | 
 | 475 | </ul> | 
 | 476 |  | 
<p>When emitting a record with this abbreviation, the record above would be
emitted as:</p>
 | 479 |  | 
 | 480 | <p><tt>[4<sub>abbrevwidth</sub>, 2<sub>4</sub>, 4<sub>vbr6</sub>, | 
 | 481 |    0<sub>6</sub>, 1<sub>6</sub>, 2<sub>6</sub>, 3<sub>6</sub>]</tt></p> | 
 | 482 |  | 
 | 483 | <p>These values are:</p> | 
 | 484 |  | 
 | 485 | <ol> | 
 | 486 | <li>The first value, 4, is the abbreviation ID for this abbreviation.</li> | 
 | 487 | <li>The second value, 2, is the code for TRIPLE in LLVM IR files.</li> | 
 | 488 | <li>The third value, 4, is the length of the array.</li> | 
 | 489 | <li>The rest of the values are the char6 encoded values for "abcd".</li> | 
 | 490 | </ol> | 
 | 491 |  | 
<p>With this abbreviation, the triple is emitted with only 37 bits (assuming an
abbrev id width of 3): 3 bits for the abbreviation ID, 4 for the code, 6 for
the array length, and 6 for each of the four characters.  Without the
abbreviation, significantly more space would be required to emit the target
triple.  Also, since the TRIPLE value is not emitted as a literal in the
abbreviation, the abbreviation can also be used for any other string value.
 | 497 | </p> | 
 | 498 |  | 
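<p>The worked example can be reproduced with a small encoder that turns a
record into a list of (value, bit width) fields.  This is only a sketch:
the function names and the tuple representation of abbreviation operands are
hypothetical, and literal operands are left out for brevity.</p>

```python
CHAR6 = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789._"


def vbr_bits(value: int, width: int) -> int:
    """Bits occupied by a value emitted as a VBR of the given width."""
    chunks = 1
    while value >> ((width - 1) * chunks):
        chunks += 1
    return chunks * width


def encode_scalar(op: tuple, value: int) -> tuple:
    """Encode one value under a Fixed, VBR or Char6 operand as (value, bits)."""
    kind = op[0]
    if kind == "fixed":
        return (value, op[1])
    if kind == "vbr":
        return (value, vbr_bits(value, op[1]))
    if kind == "char6":
        return (CHAR6.index(chr(value)), 6)   # raises ValueError if not representable
    raise ValueError(f"unsupported operand kind: {kind}")


def encode_record(abbrev_id: int, abbrev_width: int, ops: list, record: list) -> list:
    """Encode `record` (code first, then operands) using the abbreviation `ops`."""
    fields = [(abbrev_id, abbrev_width)]
    values = iter(record)
    i = 0
    while i < len(ops):
        if ops[i][0] == "array":
            elements = list(values)           # an array consumes the remaining values
            fields.append((len(elements), vbr_bits(len(elements), 6)))  # vbr6 length
            fields += [encode_scalar(ops[i + 1], e) for e in elements]
            i += 2                            # skip the element-type operand too
        else:
            fields.append(encode_scalar(ops[i], next(values)))
            i += 1
    return fields


# The abbreviation [Fixed 4], [Array], [Char6] applied to [TRIPLE=2, 'a', 'b', 'c', 'd'].
triple_abbrev = [("fixed", 4), ("array",), ("char6",)]
triple_fields = encode_record(4, 3, triple_abbrev, [2] + [ord(c) for c in "abcd"])
total_bits = sum(bits for _, bits in triple_fields)
```

<p>Summing the widths of the resulting fields gives the 37 bits quoted
above.</p>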
| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 499 | </div> | 
 | 500 |  | 
| Chris Lattner | 7300af5 | 2007-05-13 00:59:52 +0000 | [diff] [blame] | 501 | <!-- ======================================================================= --> | 
 | 502 | <div class="doc_subsection"><a name="stdblocks">Standard Blocks</a> | 
 | 503 | </div> | 
 | 504 |  | 
 | 505 | <div class="doc_text"> | 
 | 506 |  | 
 | 507 | <p> | 
 | 508 | In addition to the basic block structure and record encodings, the bitstream | 
 | 509 | also defines specific builtin block types.  These block types specify how the | 
 | 510 | stream is to be decoded or other metadata.  In the future, new standard blocks | 
| Chris Lattner | f19b8e4 | 2007-10-08 18:42:45 +0000 | [diff] [blame] | 511 | may be added.  Block IDs 0-7 are reserved for standard blocks. | 
| Chris Lattner | 7300af5 | 2007-05-13 00:59:52 +0000 | [diff] [blame] | 512 | </p> | 
 | 513 |  | 
 | 514 | </div> | 
 | 515 |  | 
 | 516 | <!-- _______________________________________________________________________ --> | 
 | 517 | <div class="doc_subsubsection"><a name="BLOCKINFO">#0 - BLOCKINFO | 
 | 518 | Block</a></div> | 
 | 519 |  | 
 | 520 | <div class="doc_text"> | 
 | 521 |  | 
 | 522 | <p>The BLOCKINFO block allows the description of metadata for other blocks.  The | 
 | 523 |   currently specified records are:</p> | 
 | 524 |   | 
 | 525 | <ul> | 
 | 526 | <li><tt>[SETBID (#1), blockid]</tt></li> | 
 | 527 | <li><tt>[DEFINE_ABBREV, ...]</tt></li> | 
 | 528 | </ul> | 
 | 529 |  | 
 | 530 | <p> | 
| Chris Lattner | f19b8e4 | 2007-10-08 18:42:45 +0000 | [diff] [blame] | 531 | The SETBID record indicates which block ID is being described.  SETBID | 
 | 532 | records can occur multiple times throughout the block to change which | 
 | 533 | block ID is being described.  There must be a SETBID record prior to | 
 | 534 | any other records. | 
 | 535 | </p> | 
 | 536 |  | 
 | 537 | <p> | 
 | 538 | Standard DEFINE_ABBREV records can occur inside BLOCKINFO blocks, but unlike | 
 | 539 | their occurrence in normal blocks, the abbreviation is defined for blocks | 
 | 540 | matching the block ID we are describing, <i>not</i> the BLOCKINFO block itself. | 
 | 541 | The abbreviations defined in BLOCKINFO blocks receive abbreviation ids | 
 | 542 | as described in <a href="#DEFINE_ABBREV">DEFINE_ABBREV</a>. | 
 | 543 | </p> | 
 | 544 |  | 
 | 545 | <p> | 
 | 546 | Note that although the data in BLOCKINFO blocks is described as "metadata," the | 
 | 547 | abbreviations they contain are essential for parsing records from the | 
 | 548 | corresponding blocks.  It is not safe to skip them. | 
| Chris Lattner | 7300af5 | 2007-05-13 00:59:52 +0000 | [diff] [blame] | 549 | </p> | 
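<p>The bookkeeping a reader needs for BLOCKINFO can be sketched as follows.
This is a minimal illustration that works on already-decoded records, not the
LLVM reader: the class <tt>BlockInfoTable</tt> and its methods are hypothetical
names, and abbreviations are treated as opaque objects.</p>

```python
from collections import defaultdict

SETBID = 1  # record code of SETBID, as listed above


class BlockInfoTable:
    """Abbreviations registered via BLOCKINFO, keyed by the block ID they apply to."""

    def __init__(self):
        self.abbrevs = defaultdict(list)
        self.current_bid = None   # no block ID is being described yet

    def on_record(self, code, operands):
        """Handle a data record found inside a BLOCKINFO block."""
        if code == SETBID:
            self.current_bid = operands[0]
        elif self.current_bid is None:
            raise ValueError("record seen before any SETBID record")
        # Other record codes are ignored in this sketch.

    def on_define_abbrev(self, abbrev):
        """Handle a DEFINE_ABBREV found inside a BLOCKINFO block."""
        if self.current_bid is None:
            raise ValueError("DEFINE_ABBREV seen before any SETBID record")
        # The abbreviation belongs to the described block, not to BLOCKINFO.
        self.abbrevs[self.current_bid].append(abbrev)

    def abbrevs_for(self, block_id):
        """Abbreviations to preload when entering a block with this ID."""
        return list(self.abbrevs.get(block_id, []))


table = BlockInfoTable()
table.on_record(SETBID, [8])
table.on_define_abbrev("abbrev-A")
table.on_record(SETBID, [12])
table.on_define_abbrev("abbrev-B")
table.on_define_abbrev("abbrev-C")
print(table.abbrevs_for(8), table.abbrevs_for(12))
```

<p>A reader that skipped the BLOCKINFO block would have an empty table here and
could not decode abbreviated records in blocks 8 and 12.</p>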
 | 550 |  | 
 | 551 | </div> | 
| Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame] | 552 |  | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 553 | <!-- *********************************************************************** --> | 
| Chris Lattner | 6fa6a32 | 2008-07-09 05:14:23 +0000 | [diff] [blame] | 554 | <div class="doc_section"> <a name="wrapper">Bitcode Wrapper Format</a></div> | 
 | 555 | <!-- *********************************************************************** --> | 
 | 556 |  | 
 | 557 | <div class="doc_text"> | 
 | 558 |  | 
<p>Bitcode files for LLVM IR may optionally be wrapped in a simple wrapper
structure.  This structure contains a header that indicates the offset
and size of the embedded bitcode file.  This allows additional information to be
stored alongside the bitcode file.  The structure of this file header is:
 | 563 | </p> | 
 | 564 |  | 
<pre>
[Magic<sub>32</sub>,
 Version<sub>32</sub>,
 Offset<sub>32</sub>,
 Size<sub>32</sub>,
 CPUType<sub>32</sub>]
</pre>
 | 573 |  | 
<p>Each field is a 32-bit value stored in little-endian form (as with
the rest of the bitcode file fields).  The Magic number is always
<tt>0x0B17C0DE</tt> and the Version is currently always <tt>0</tt>.  The Offset
field is the offset in bytes to the start of the bitcode stream in the file, and
the Size field is the size in bytes of the stream.  CPUType is a target-specific
value that can be used to encode the CPU of the target.</p>
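<p>As an illustration, the header can be read with Python's standard
<tt>struct</tt> module.  This is a sketch under the layout described above; the
function name <tt>parse_wrapper</tt>, the returned dictionary keys, and the
bounds check are assumptions made for the example, not LLVM behavior.</p>

```python
import struct

WRAPPER_MAGIC = 0x0B17C0DE
HEADER = struct.Struct("<5I")   # five little-endian unsigned 32-bit fields


def parse_wrapper(data):
    """Split a wrapped file into its header fields and the embedded bitcode."""
    if len(data) < HEADER.size:
        raise ValueError("file too short for a wrapper header")
    magic, version, offset, size, cputype = HEADER.unpack_from(data, 0)
    if magic != WRAPPER_MAGIC:
        raise ValueError("not a bitcode wrapper")
    if offset + size > len(data):
        raise ValueError("embedded stream extends past end of file")
    return {
        "version": version,
        "cputype": cputype,
        "bitcode": data[offset:offset + size],
    }


# Build a tiny wrapped file: a 20-byte header followed by a 4-byte payload.
payload = b"BC\xc0\xde"
wrapped = HEADER.pack(WRAPPER_MAGIC, 0, HEADER.size, len(payload), 7) + payload
info = parse_wrapper(wrapped)
print(info["version"], info["cputype"], info["bitcode"])
```

<p>Because the fields are little-endian, the first four bytes of a wrapped file
are <tt>DE C0 17 0B</tt>.</p>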
 | 580 | </div> | 
 | 581 |  | 
 | 582 |  | 
 | 583 | <!-- *********************************************************************** --> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 584 | <div class="doc_section"> <a name="llvmir">LLVM IR Encoding</a></div> | 
 | 585 | <!-- *********************************************************************** --> | 
 | 586 |  | 
 | 587 | <div class="doc_text"> | 
 | 588 |  | 
| Chris Lattner | 69b3e40 | 2007-05-13 01:39:44 +0000 | [diff] [blame] | 589 | <p>LLVM IR is encoded into a bitstream by defining blocks and records.  It uses | 
 | 590 | blocks for things like constant pools, functions, symbol tables, etc.  It uses | 
 | 591 | records for things like instructions, global variable descriptors, type | 
 | 592 | descriptions, etc.  This document does not describe the set of abbreviations | 
 | 593 | that the writer uses, as these are fully self-described in the file, and the | 
 | 594 | reader is not allowed to build in any knowledge of this.</p> | 
 | 595 |  | 
 | 596 | </div> | 
 | 597 |  | 
 | 598 | <!-- ======================================================================= --> | 
 | 599 | <div class="doc_subsection"><a name="basics">Basics</a> | 
 | 600 | </div> | 
 | 601 |  | 
 | 602 | <!-- _______________________________________________________________________ --> | 
 | 603 | <div class="doc_subsubsection"><a name="ir_magic">LLVM IR Magic Number</a></div> | 
 | 604 |  | 
 | 605 | <div class="doc_text"> | 
 | 606 |  | 
 | 607 | <p> | 
 | 608 | The magic number for LLVM IR files is: | 
 | 609 | </p> | 
 | 610 |  | 
| Chris Lattner | f19b8e4 | 2007-10-08 18:42:45 +0000 | [diff] [blame] | 611 | <p><tt>[0x0<sub>4</sub>, 0xC<sub>4</sub>, 0xE<sub>4</sub>, 0xD<sub>4</sub>]</tt></p> | 
| Chris Lattner | 69b3e40 | 2007-05-13 01:39:44 +0000 | [diff] [blame] | 612 |  | 
| Chris Lattner | f19b8e4 | 2007-10-08 18:42:45 +0000 | [diff] [blame] | 613 | <p>When combined with the bitcode magic number and viewed as bytes, this is "BC 0xC0DE".</p> | 
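<p>The byte view can be reproduced by packing the fields with the first field
in the lowest bits.  The sketch below assumes the bitcode magic number
<tt>['B'<sub>8</sub>, 'C'<sub>8</sub>]</tt> from the
<a href="#magic">Magic Numbers</a> section; <tt>pack_fields</tt> is a
hypothetical helper written for this example.</p>

```python
def pack_fields(fields):
    """Pack (value, width) pairs into bytes, first field in the lowest bits."""
    acc = 0
    nbits = 0
    for value, width in fields:
        acc |= (value & ((1 << width) - 1)) << nbits
        nbits += width
    return acc.to_bytes((nbits + 7) // 8, "little")


MAGIC_FIELDS = [
    (ord("B"), 8), (ord("C"), 8),            # bitcode magic number
    (0x0, 4), (0xC, 4), (0xE, 4), (0xD, 4),  # LLVM IR magic number
]

magic = pack_fields(MAGIC_FIELDS)
print(magic.hex())
```

<p>The two 4-bit pairs land in one byte each with the later field in the high
nibble, which is why <tt>0x0, 0xC</tt> reads as <tt>0xC0</tt> and
<tt>0xE, 0xD</tt> reads as <tt>0xDE</tt>.</p>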
| Chris Lattner | 69b3e40 | 2007-05-13 01:39:44 +0000 | [diff] [blame] | 614 |  | 
 | 615 | </div> | 
 | 616 |  | 
 | 617 | <!-- _______________________________________________________________________ --> | 
 | 618 | <div class="doc_subsubsection"><a name="ir_signed_vbr">Signed VBRs</a></div> | 
 | 619 |  | 
 | 620 | <div class="doc_text"> | 
 | 621 |  | 
 | 622 | <p> | 
<a href="#variablewidth">Variable Width Integers</a> are an efficient way to
encode arbitrary sized unsigned values, but are an extremely inefficient way to
encode signed values (as negative values are otherwise treated as maximally large
unsigned values).</p>
 | 627 |  | 
 | 628 | <p>As such, signed vbr values of a specific width are emitted as follows:</p> | 
 | 629 |  | 
 | 630 | <ul> | 
 | 631 | <li>Positive values are emitted as vbrs of the specified width, but with their | 
 | 632 |     value shifted left by one.</li> | 
 | 633 | <li>Negative values are emitted as vbrs of the specified width, but the negated | 
 | 634 |     value is shifted left by one, and the low bit is set.</li> | 
 | 635 | </ul> | 
 | 636 |  | 
 | 637 | <p>With this encoding, small positive and small negative values can both be | 
 | 638 | emitted efficiently.</p> | 
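<p>The two rules above can be sketched as a pair of functions, together with a
helper that splits a value into vbr chunks as described in
<a href="#variablewidth">Variable Width Integers</a>.  The function names are
hypothetical, and zero is treated here like a positive value, which is an
assumption of this sketch.</p>

```python
def encode_signed(value):
    """Map a signed integer to the unsigned value that is emitted as a vbr."""
    if value >= 0:
        return value << 1            # low bit clear: non-negative
    return ((-value) << 1) | 1       # low bit set: negative


def decode_signed(raw):
    """Inverse of encode_signed."""
    magnitude = raw >> 1
    return -magnitude if raw & 1 else magnitude


def vbr_chunks(raw, width):
    """Split an unsigned value into vbr chunks; the high bit marks continuation."""
    payload = width - 1
    chunks = []
    while True:
        chunk = raw & ((1 << payload) - 1)
        raw >>= payload
        if raw:
            chunks.append(chunk | (1 << payload))
        else:
            chunks.append(chunk)
            return chunks


# -1 treated as a 64-bit unsigned value versus -1 with the signed encoding.
print(len(vbr_chunks((1 << 64) - 1, 6)), len(vbr_chunks(encode_signed(-1), 6)))
```

<p>With a vbr6, the unsigned view of -1 needs thirteen chunks, while the signed
encoding needs one.</p>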
 | 639 |  | 
 | 640 | </div> | 
 | 641 |  | 
 | 642 |  | 
 | 643 | <!-- _______________________________________________________________________ --> | 
 | 644 | <div class="doc_subsubsection"><a name="ir_blocks">LLVM IR Blocks</a></div> | 
 | 645 |  | 
 | 646 | <div class="doc_text"> | 
 | 647 |  | 
 | 648 | <p> | 
 | 649 | LLVM IR is defined with the following blocks: | 
 | 650 | </p> | 
 | 651 |  | 
 | 652 | <ul> | 
 | 653 | <li>8  - MODULE_BLOCK - This is the top-level block that contains the | 
 | 654 |     entire module, and describes a variety of per-module information.</li> | 
 | 655 | <li>9  - PARAMATTR_BLOCK - This enumerates the parameter attributes.</li> | 
 | 656 | <li>10 - TYPE_BLOCK - This describes all of the types in the module.</li> | 
 | 657 | <li>11 - CONSTANTS_BLOCK - This describes constants for a module or | 
 | 658 |     function.</li> | 
 | 659 | <li>12 - FUNCTION_BLOCK - This describes a function body.</li> | 
 | 660 | <li>13 - TYPE_SYMTAB_BLOCK - This describes the type symbol table.</li> | 
 | 661 | <li>14 - VALUE_SYMTAB_BLOCK - This describes a value symbol table.</li> | 
 | 662 | </ul> | 
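<p>For reference, the block IDs listed above can be collected in an
enumeration.  This is only a convenience sketch: the names mirror the list, and
<tt>describe_block</tt> is a hypothetical helper that also accounts for the
reserved range 0-7 described under
<a href="#stdblocks">Standard Blocks</a>.</p>

```python
from enum import IntEnum

FIRST_APPLICATION_BLOCK_ID = 8   # IDs 0-7 are reserved for standard blocks


class BlockID(IntEnum):
    BLOCKINFO = 0
    MODULE_BLOCK = 8
    PARAMATTR_BLOCK = 9
    TYPE_BLOCK = 10
    CONSTANTS_BLOCK = 11
    FUNCTION_BLOCK = 12
    TYPE_SYMTAB_BLOCK = 13
    VALUE_SYMTAB_BLOCK = 14


def describe_block(block_id):
    """Return a readable name for a block ID found in the stream."""
    try:
        return BlockID(block_id).name
    except ValueError:
        if block_id < FIRST_APPLICATION_BLOCK_ID:
            return "reserved standard block"
        return "unknown block"


print(describe_block(12), describe_block(3), describe_block(99))
```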
 | 663 |  | 
 | 664 | </div> | 
 | 665 |  | 
 | 666 | <!-- ======================================================================= --> | 
 | 667 | <div class="doc_subsection"><a name="MODULE_BLOCK">MODULE_BLOCK Contents</a> | 
 | 668 | </div> | 
 | 669 |  | 
 | 670 | <div class="doc_text"> | 
 | 671 |  | 
 | 672 | <p> | 
 | 673 | </p> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 674 |  | 
 | 675 | </div> | 
 | 676 |  | 
 | 677 |  | 
| Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 678 | <!-- *********************************************************************** --> | 
 | 679 | <hr> | 
 | 680 | <address> <a href="http://jigsaw.w3.org/css-validator/check/referer"><img | 
 | 681 |  src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a> | 
 | 682 | <a href="http://validator.w3.org/check/referer"><img | 
 | 683 |  src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a> | 
| Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 684 |  <a href="mailto:sabre@nondot.org">Chris Lattner</a><br> | 
| Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 685 | <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br> | 
 | 686 | Last modified: $Date$ | 
 | 687 | </address> | 
 | 688 | </body> | 
 | 689 | </html> |