Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame^] | 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" |
| 2 | "http://www.w3.org/TR/html4/strict.dtd"> |
Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 3 | <html> |
| 4 | <head> |
| 5 | <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
| 6 | <title>LLVM Bitcode File Format</title> |
| 7 | <link rel="stylesheet" href="llvm.css" type="text/css"> |
Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 8 | </head> |
| 9 | <body> |
| 10 | <div class="doc_title"> LLVM Bitcode File Format </div> |
| 11 | <ol> |
| 12 | <li><a href="#abstract">Abstract</a></li> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 13 | <li><a href="#overview">Overview</a></li> |
| 14 | <li><a href="#bitstream">Bitstream Format</a> |
| 15 | <ol> |
| 16 | <li><a href="#magic">Magic Numbers</a></li> |
Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame^] | 17 | <li><a href="#primitives">Primitives</a></li> |
| 18 | <li><a href="#abbrevid">Abbreviation IDs</a></li> |
| 19 | <li><a href="#blocks">Blocks</a></li> |
| 20 | <li><a href="#datarecord">Data Records</a></li> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 21 | </ol> |
| 22 | </li> |
| 23 | <li><a href="#llvmir">LLVM IR Encoding</a></li> |
Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 24 | </ol> |
| 25 | <div class="doc_author"> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 26 | <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a>. |
Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 27 | </p> |
| 28 | </div> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 29 | |
Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 30 | <!-- *********************************************************************** --> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 31 | <div class="doc_section"> <a name="abstract">Abstract</a></div> |
Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 32 | <!-- *********************************************************************** --> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 33 | |
Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 34 | <div class="doc_text"> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 35 | |
| 36 | <p>This document describes the LLVM bitstream file format and the encoding of |
| 37 | the LLVM IR into it.</p> |
| 38 | |
Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 39 | </div> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 40 | |
Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 41 | <!-- *********************************************************************** --> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 42 | <div class="doc_section"> <a name="overview">Overview</a></div> |
Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 43 | <!-- *********************************************************************** --> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 44 | |
Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 45 | <div class="doc_text"> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 46 | |
| 47 | <p> |
| 48 | What is commonly known as the LLVM bitcode file format (also, sometimes |
| 49 | anachronistically known as bytecode) is actually two things: a <a |
| 50 | href="#bitstream">bitstream container format</a> |
| 51 | and an <a href="#llvmir">encoding of LLVM IR</a> into the container format.</p> |
| 52 | |
| 53 | <p> |
The bitstream format is an abstract encoding of structured data, similar to
XML in some ways. Like XML, bitstream files contain tags and nested
structures, and you can parse the file without having to understand the tags.
Unlike XML, the bitstream format is a binary encoding, and it
provides a mechanism for the file to self-describe "abbreviations", which are
effectively size optimizations for the content.</p>
| 60 | |
| 61 | <p>This document first describes the LLVM bitstream format, then describes the |
| 62 | record structure used by LLVM IR files. |
| 63 | </p> |
| 64 | |
Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 65 | </div> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 66 | |
| 67 | <!-- *********************************************************************** --> |
| 68 | <div class="doc_section"> <a name="bitstream">Bitstream Format</a></div> |
| 69 | <!-- *********************************************************************** --> |
| 70 | |
| 71 | <div class="doc_text"> |
| 72 | |
| 73 | <p> |
| 74 | The bitstream format is literally a stream of bits, with a very simple |
| 75 | structure. This structure consists of the following concepts: |
| 76 | </p> |
| 77 | |
| 78 | <ul> |
Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame^] | 79 | <li>A "<a href="#magic">magic number</a>" that identifies the contents of |
| 80 | the stream.</li> |
| 81 | <li>Encoding <a href="#primitives">primitives</a> like variable bit-rate |
| 82 | integers.</li> |
| 83 | <li><a href="#blocks">Blocks</a>, which define nested content.</li> |
| 84 | <li><a href="#datarecord">Data Records</a>, which describe entities within the |
| 85 | file.</li> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 86 | <li>Abbreviations, which specify compression optimizations for the file.</li> |
| 87 | </ul> |
| 88 | |
| 89 | <p>Note that the <a |
| 90 | href="CommandGuide/html/llvm-bcanalyzer.html">llvm-bcanalyzer</a> tool can be |
| 91 | used to dump and inspect arbitrary bitstreams, which is very useful for |
| 92 | understanding the encoding.</p> |
| 93 | |
| 94 | </div> |
| 95 | |
| 96 | <!-- ======================================================================= --> |
| 97 | <div class="doc_subsection"><a name="magic">Magic Numbers</a> |
| 98 | </div> |
| 99 | |
| 100 | <div class="doc_text"> |
| 101 | |
Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame^] | 102 | <p>The first four bytes of the stream identify the encoding of the file. This |
| 103 | is used by a reader to know what is contained in the file.</p> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 104 | |
| 105 | </div> |
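<div class="doc_text">

<p>As a rough illustration of how a reader might use the magic number, the
sketch below extracts the first four bytes of a stream and compares them with
an expected value. The function names are hypothetical, and the constant is an
assumption: this section does not state the value itself, and the bytes used
here (<tt>'B'</tt>, <tt>'C'</tt>, <tt>0xC0</tt>, <tt>0xDE</tt>) are the ones
used by LLVM IR bitcode files.</p>

```python
def read_magic(data: bytes) -> bytes:
    """Return the first four bytes of a stream, which identify its encoding."""
    if len(data) < 4:
        raise ValueError("stream is too short to contain a magic number")
    return data[:4]


# Assumption (not stated in this section): the magic of LLVM IR bitcode files.
ASSUMED_LLVM_IR_MAGIC = bytes([0x42, 0x43, 0xC0, 0xDE])  # 'B', 'C', 0xC0, 0xDE


def looks_like_llvm_ir(data: bytes) -> bool:
    """Check whether the stream starts with the assumed LLVM IR magic."""
    return read_magic(data) == ASSUMED_LLVM_IR_MAGIC
```

</div>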
| 106 | |
Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame^] | 107 | <!-- ======================================================================= --> |
| 108 | <div class="doc_subsection"><a name="primitives">Primitives</a> |
| 109 | </div> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 110 | |
| 111 | <div class="doc_text"> |
| 112 | |
Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame^] | 113 | <p> |
A bitstream literally consists of a stream of bits. This stream is made up of a
number of primitive values that encode a stream of integer values. These
integers are encoded in two ways: either as <a href="#fixedwidth">Fixed
Width Integers</a> or as <a href="#variablewidth">Variable Width
Integers</a>.
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 119 | </p> |
| 120 | |
| 121 | </div> |
| 122 | |
Chris Lattner | 3a1716d | 2007-05-12 05:37:42 +0000 | [diff] [blame^] | 123 | <!-- _______________________________________________________________________ --> |
| 124 | <div class="doc_subsubsection"> <a name="fixedwidth">Fixed Width Integers</a> |
| 125 | </div> |
| 126 | |
| 127 | <div class="doc_text"> |
| 128 | |
<p>Fixed-width integer values have their low bits emitted directly to the file.
For example, a 3-bit integer value encodes 1 as 001. Fixed-width integers
are used when a field has a well-known number of options. For
example, boolean values are usually encoded with a 1-bit wide integer.
| 133 | </p> |
| 134 | |
| 135 | </div> |
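<div class="doc_text">

<p>A minimal sketch of fixed-width emission, under some stated assumptions:
the stream is modeled as a plain Python list of bits, and each field is written
least-significant bit first. The helper names <tt>emit_fixed</tt> and
<tt>read_fixed</tt> are hypothetical and are not part of any LLVM API.</p>

```python
def emit_fixed(bits: list, value: int, width: int) -> None:
    """Append the low `width` bits of `value`, least-significant bit first."""
    if value < 0 or value >= (1 << width):
        raise ValueError("value does not fit in a %d-bit field" % width)
    for i in range(width):
        bits.append((value >> i) & 1)


def read_fixed(bits: list, pos: int, width: int):
    """Read a `width`-bit field starting at `pos`; return (value, new_pos)."""
    value = 0
    for i in range(width):
        value |= bits[pos + i] << i
    return value, pos + width


stream = []
emit_fixed(stream, 1, 3)  # the 3-bit value 1
emit_fixed(stream, 1, 1)  # a boolean "true" in a 1-bit field
```

<p>Reading the fields back requires knowing their widths in advance, which is
why this encoding suits fields whose range is fixed by the format.</p>

</div>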
| 136 | |
| 137 | <!-- _______________________________________________________________________ --> |
| 138 | <div class="doc_subsubsection"> <a name="variablewidth">Variable Width |
| 139 | Integers</a></div> |
| 140 | |
| 141 | <div class="doc_text"> |
| 142 | |
<p>Variable-width integer (VBR) values encode values of arbitrary size,
optimizing for the case where the values are small. Given a 4-bit VBR field,
any 3-bit value (0 through 7) is encoded directly, with the high bit set to
zero. More generally, for an N-bit VBR field, values that need more than N-1
bits are emitted as a series of N-1 bit chunks, lowest bits first, each in its
own N-bit field; every chunk except the last sets the high bit.</p>
| 148 | |
<p>For example, the value 27 (0x1B) is encoded as 1011 0011 when emitted as a
vbr4 value. The first group of four bits indicates the value 3 (011) with a
continuation piece (indicated by a high bit of 1). The next group of four bits
indicates a value of 24 (011 &lt;&lt; 3) with no continuation. The sum (3+24)
yields the value 27.
| 154 | </p> |
| 155 | |
| 156 | </div> |
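<div class="doc_text">

<p>The chunking rule can be sketched as follows. This is an illustrative
model rather than the LLVM implementation: <tt>encode_vbr</tt> and
<tt>decode_vbr</tt> are hypothetical names, and each chunk is returned as a
small integer instead of being packed into a bitstream.</p>

```python
def encode_vbr(value: int, n: int) -> list:
    """Split `value` into n-bit chunks that each carry n-1 payload bits."""
    if value < 0 or n < 2:
        raise ValueError("need a non-negative value and a width of at least 2")
    payload_bits = n - 1
    payload_mask = (1 << payload_bits) - 1
    continue_flag = 1 << payload_bits  # the high bit of each chunk
    chunks = []
    while True:
        payload = value & payload_mask
        value >>= payload_bits
        if value:
            chunks.append(payload | continue_flag)  # more chunks follow
        else:
            chunks.append(payload)  # last chunk: high bit clear
            return chunks


def decode_vbr(chunks: list, n: int) -> int:
    """Reassemble a value from chunks produced by encode_vbr."""
    payload_bits = n - 1
    payload_mask = (1 << payload_bits) - 1
    continue_flag = 1 << payload_bits
    value = 0
    shift = 0
    for chunk in chunks:
        value |= (chunk & payload_mask) << shift
        shift += payload_bits
        if not chunk & continue_flag:
            return value
    raise ValueError("chunk sequence ended while a continuation was pending")
```

<p>With these definitions, <tt>encode_vbr(27, 4)</tt> produces the two chunks
1011 and 0011 from the example above, and <tt>encode_vbr(7, 4)</tt> produces
the single chunk 0111.</p>

</div>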
| 157 | |
| 158 | <!-- _______________________________________________________________________ --> |
| 159 | <div class="doc_subsubsection"> <a name="char6">6-bit characters</a></div> |
| 160 | |
| 161 | <div class="doc_text"> |
| 162 | |
<p>6-bit characters encode common characters into a fixed 6-bit field. They
represent the following characters with the following 6-bit values:</p>
| 165 | |
| 166 | <ul> |
| 167 | <li>'a' .. 'z' - 0 .. 25</li> |
<li>'A' .. 'Z' - 26 .. 51</li>
<li>'0' .. '9' - 52 .. 61</li>
| 170 | <li>'.' - 62</li> |
| 171 | <li>'_' - 63</li> |
| 172 | </ul> |
| 173 | |
| 174 | <p>This encoding is only suitable for encoding characters and strings that |
| 175 | consist only of the above characters. It is completely incapable of encoding |
| 176 | characters not in the set.</p> |
| 177 | |
| 178 | </div> |
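<div class="doc_text">

<p>The mapping can be modeled as an index into a 64-character alphabet
(26 lowercase letters, 26 uppercase letters, 10 digits, then <tt>'.'</tt> and
<tt>'_'</tt>), which fills the values 0 through 63 exactly. The sketch below
uses hypothetical helper names and is only meant to illustrate the table.</p>

```python
import string

# 26 + 26 + 10 + 2 = 64 characters, one per 6-bit value.
CHAR6_ALPHABET = (
    string.ascii_lowercase + string.ascii_uppercase + string.digits + "._"
)


def encode_char6(ch: str) -> int:
    """Return the 6-bit value for a single character in the char6 set."""
    if len(ch) != 1 or ch not in CHAR6_ALPHABET:
        raise ValueError("character cannot be represented in char6: %r" % ch)
    return CHAR6_ALPHABET.index(ch)


def decode_char6(value: int) -> str:
    """Return the character for a 6-bit value."""
    if not 0 <= value < 64:
        raise ValueError("char6 values must be in the range 0..63")
    return CHAR6_ALPHABET[value]


def is_char6_string(text: str) -> bool:
    """Report whether every character of `text` can be encoded as char6."""
    return all(ch in CHAR6_ALPHABET for ch in text)
```

<p>A writer could use a check like <tt>is_char6_string</tt> to decide whether
a given string qualifies for this encoding or needs a wider one.</p>

</div>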
| 179 | |
| 180 | <!-- _______________________________________________________________________ --> |
| 181 | <div class="doc_subsubsection"> <a name="wordalign">Word Alignment</a></div> |
| 182 | |
| 183 | <div class="doc_text"> |
| 184 | |
| 185 | <p>Occasionally, it is useful to emit zero bits until the bitstream is a |
| 186 | multiple of 32 bits. This ensures that the bit position in the stream can be |
| 187 | represented as a multiple of 32-bit words.</p> |
| 188 | |
| 189 | </div> |
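<div class="doc_text">

<p>The amount of padding is simply the distance to the next multiple of 32.
A small sketch, again modeling the stream as a Python list of bits (an
assumption made for illustration; <tt>align_to_32</tt> is a hypothetical
name):</p>

```python
def align_to_32(bits: list) -> int:
    """Append zero bits until len(bits) is a multiple of 32.

    Returns the number of padding bits that were added.
    """
    padding = -len(bits) % 32  # 0 when the stream is already aligned
    bits.extend([0] * padding)
    return padding
```

</div>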
| 190 | |
| 191 | |
| 192 | <!-- ======================================================================= --> |
| 193 | <div class="doc_subsection"><a name="abbrevid">Abbreviation IDs</a> |
| 194 | </div> |
| 195 | |
| 196 | <div class="doc_text"> |
| 197 | |
| 198 | <p> |
A bitstream is a sequential series of <a href="#blocks">Blocks</a> and
<a href="#datarecord">Data Records</a>. Both of these start with an
abbreviation ID encoded as a fixed-bitwidth field. The width is specified by
the current block, as described below. The value of the abbreviation ID
specifies either one of the builtin IDs (which have special meanings, defined
below) or one of the abbreviation IDs defined by the stream itself.
| 205 | </p> |
| 206 | |
| 207 | <p> |
| 208 | The set of builtin abbrev IDs is: |
| 209 | </p> |
| 210 | |
| 211 | <ul> |
| 212 | <li>0 - <a href="#END_BLOCK">END_BLOCK</a> - This abbrev ID marks the end of the |
| 213 | current block.</li> |
| 214 | <li>1 - <a href="#ENTER_SUBBLOCK">ENTER_SUBBLOCK</a> - This abbrev ID marks the |
| 215 | beginning of a new block.</li> |
| 216 | <li>2 - DEFINE_ABBREV - This defines a new abbreviation.</li> |
| 217 | <li>3 - UNABBREV_RECORD - This ID specifies the definition of an unabbreviated |
| 218 | record.</li> |
| 219 | </ul> |
| 220 | |
| 221 | <p>Abbreviation IDs 4 and above are defined by the stream itself.</p> |
| 222 | |
| 223 | </div> |
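<div class="doc_text">

<p>For illustration, the builtin IDs listed above can be written as an
enumeration, with everything from 4 upward treated as stream-defined. The
class and function names below are hypothetical.</p>

```python
from enum import IntEnum


class BuiltinAbbrevID(IntEnum):
    END_BLOCK = 0
    ENTER_SUBBLOCK = 1
    DEFINE_ABBREV = 2
    UNABBREV_RECORD = 3


FIRST_STREAM_DEFINED_ID = 4


def describe_abbrev_id(abbrev_id: int) -> str:
    """Classify an abbreviation ID as builtin or stream-defined."""
    if abbrev_id < 0:
        raise ValueError("abbreviation IDs are non-negative")
    if abbrev_id < FIRST_STREAM_DEFINED_ID:
        return BuiltinAbbrevID(abbrev_id).name
    return "stream-defined abbreviation %d" % abbrev_id
```

<p>Note that with the initial abbrev id width of 2, only the four builtin IDs
can be expressed; a block needs a wider field before it can refer to
abbreviations of its own.</p>

</div>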
| 224 | |
| 225 | <!-- ======================================================================= --> |
| 226 | <div class="doc_subsection"><a name="blocks">Blocks</a> |
| 227 | </div> |
| 228 | |
| 229 | <div class="doc_text"> |
| 230 | |
| 231 | <p> |
Blocks in a bitstream denote nested regions of the stream, and are identified by
a content-specific id number (for example, LLVM IR uses an ID of 12 to represent
function bodies). Nested blocks capture the hierarchical structure of the data
encoded in the stream, and various properties are associated with blocks as the
file is parsed. Block definitions allow the reader to skip blocks
in constant time, which is useful if the reader only wants a summary of the
blocks or needs to pass over data it does not understand. The LLVM IR reader
uses this mechanism to skip function bodies, lazily reading them on demand.
| 240 | </p> |
| 241 | |
| 242 | <p> |
| 243 | When reading and encoding the stream, several properties are maintained for the |
| 244 | block. In particular, each block maintains: |
| 245 | </p> |
| 246 | |
| 247 | <ol> |
| 248 | <li>A current abbrev id width. This value starts at 2, and is set every time a |
| 249 | block record is entered. The block entry specifies the abbrev id width for |
| 250 | the body of the block.</li> |
| 251 | |
| 252 | <li>A set of abbreviations. Abbreviations may be defined within a block, or |
| 253 | they may be associated with all blocks of a particular ID. |
| 254 | </li> |
| 255 | </ol> |
| 256 | |
<p>As sub-blocks are entered, these properties are saved and the new sub-block
has its own set of abbreviations and its own abbrev id width. When a sub-block
is popped, the saved values are restored.</p>
| 260 | |
| 261 | </div> |
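<div class="doc_text">

<p>The save-and-restore behavior amounts to a stack of per-block state. The
sketch below is a simplified model, not the LLVM reader: the class name, its
methods, and the <tt>inherited_abbreviations</tt> parameter (standing in for
abbreviations associated with all blocks of a particular ID) are assumptions
made for illustration.</p>

```python
class BlockScopeTracker:
    """Tracks the abbrev id width and abbreviation set of the current block."""

    INITIAL_ABBREV_WIDTH = 2  # the width in effect before any block is entered

    def __init__(self):
        self.abbrev_width = self.INITIAL_ABBREV_WIDTH
        self.abbreviations = []
        self._saved = []  # state of the enclosing blocks, innermost last

    def enter_block(self, new_abbrev_width, inherited_abbreviations=()):
        """Save the current state and start a sub-block with fresh state."""
        self._saved.append((self.abbrev_width, self.abbreviations))
        self.abbrev_width = new_abbrev_width
        self.abbreviations = list(inherited_abbreviations)

    def define_abbreviation(self, abbreviation):
        """Record an abbreviation defined inside the current block."""
        self.abbreviations.append(abbreviation)

    def exit_block(self):
        """Restore the state that was saved when the block was entered."""
        if not self._saved:
            raise ValueError("no enclosing block to return to")
        self.abbrev_width, self.abbreviations = self._saved.pop()
```

<p>Abbreviations defined inside a sub-block disappear when it is popped, while
those of the enclosing block become visible again.</p>

</div>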
| 262 | |
| 263 | <!-- _______________________________________________________________________ --> |
| 264 | <div class="doc_subsubsection"> <a name="ENTER_SUBBLOCK">ENTER_SUBBLOCK |
| 265 | Encoding</a></div> |
| 266 | |
| 267 | <div class="doc_text"> |
| 268 | |
<p><tt>[ENTER_SUBBLOCK, blockid<sub>vbr8</sub>, newabbrevlen<sub>vbr4</sub>,
     &lt;align32bits&gt;, blocklen<sub>32</sub>]</tt></p>
| 271 | |
| 272 | <p> |
The ENTER_SUBBLOCK abbreviation ID specifies the start of a new block record.
The <tt>blockid</tt> value is encoded as an 8-bit VBR identifier, and indicates
the type of block being entered (which is application specific). The
<tt>newabbrevlen</tt> value is a 4-bit VBR which specifies the
abbrev id width for the sub-block. The <tt>blocklen</tt> is a 32-bit aligned
value that specifies the size of the sub-block, in 32-bit words. This value
allows the reader to skip over the entire block in one jump.
| 280 | </p> |
| 281 | |
| 282 | </div> |
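<div class="doc_text">

<p>The following sketch writes and reads an ENTER_SUBBLOCK header, to show how
the fields line up and how a reader could compute where the block ends. It
rests on several assumptions: the stream is a Python list of bits with each
field written least-significant bit first, <tt>blocklen</tt> counts the 32-bit
words that follow the <tt>blocklen</tt> field itself, and all function names
are hypothetical.</p>

```python
ENTER_SUBBLOCK = 1  # builtin abbreviation ID


def emit_fixed(bits, value, width):
    for i in range(width):
        bits.append((value >> i) & 1)


def emit_vbr(bits, value, n):
    payload_bits = n - 1
    while value >= (1 << payload_bits):
        low = value & ((1 << payload_bits) - 1)
        emit_fixed(bits, low | (1 << payload_bits), n)  # continuation chunk
        value >>= payload_bits
    emit_fixed(bits, value, n)  # final chunk, high bit clear


def emit_enter_subblock(bits, cur_width, block_id, new_width, block_len_words):
    emit_fixed(bits, ENTER_SUBBLOCK, cur_width)  # ID at the current width
    emit_vbr(bits, block_id, 8)
    emit_vbr(bits, new_width, 4)
    bits.extend([0] * (-len(bits) % 32))         # <align32bits>
    emit_fixed(bits, block_len_words, 32)


def read_fixed(bits, pos, width):
    value = 0
    for i in range(width):
        value |= bits[pos + i] << i
    return value, pos + width


def read_vbr(bits, pos, n):
    payload_bits = n - 1
    value = 0
    shift = 0
    while True:
        chunk, pos = read_fixed(bits, pos, n)
        value |= (chunk & ((1 << payload_bits) - 1)) << shift
        shift += payload_bits
        if not chunk >> payload_bits:  # high bit clear: last chunk
            return value, pos


def read_enter_subblock(bits, pos, cur_width):
    """Return (block_id, new_width, block_len, body_start, body_end)."""
    abbrev_id, pos = read_fixed(bits, pos, cur_width)
    if abbrev_id != ENTER_SUBBLOCK:
        raise ValueError("not an ENTER_SUBBLOCK record")
    block_id, pos = read_vbr(bits, pos, 8)
    new_width, pos = read_vbr(bits, pos, 4)
    pos += -pos % 32                   # skip the alignment padding
    block_len, pos = read_fixed(bits, pos, 32)
    # Assumed: the body starts right after blocklen and spans block_len words.
    return block_id, new_width, block_len, pos, pos + 32 * block_len
```

<p>A reader that does not care about the block can continue parsing at
<tt>body_end</tt> without decoding anything inside the body.</p>

</div>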
| 283 | |
| 284 | <!-- _______________________________________________________________________ --> |
| 285 | <div class="doc_subsubsection"> <a name="END_BLOCK">END_BLOCK |
| 286 | Encoding</a></div> |
| 287 | |
| 288 | <div class="doc_text"> |
| 289 | |
<p><tt>[END_BLOCK, &lt;align32bits&gt;]</tt></p>
| 291 | |
| 292 | <p> |
The END_BLOCK abbreviation ID specifies the end of the current block record.
Its end is aligned to 32 bits to ensure that the size of the block is a whole
multiple of 32 bits.</p>
| 296 | |
| 297 | </div> |
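<div class="doc_text">

<p>A short sketch of closing a block, assuming the stream is modeled as a
Python list of bits and that <tt>emit_end_block</tt> (a hypothetical name) is
given the abbrev id width of the block being closed:</p>

```python
END_BLOCK = 0  # builtin abbreviation ID


def emit_end_block(bits: list, cur_abbrev_width: int) -> None:
    """Append END_BLOCK at the current width, then pad to a 32-bit boundary."""
    for i in range(cur_abbrev_width):
        bits.append((END_BLOCK >> i) & 1)  # ID 0, so every bit is zero
    bits.extend([0] * (-len(bits) % 32))   # <align32bits>
```

<p>If the block body started on a 32-bit boundary, as it does after the
<tt>blocklen</tt> field, this padding makes the body a whole number of 32-bit
words.</p>

</div>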
| 298 | |
| 299 | |
| 300 | |
| 301 | <!-- ======================================================================= --> |
| 302 | <div class="doc_subsection"><a name="datarecord">Data Records</a> |
| 303 | </div> |
| 304 | |
| 305 | <div class="doc_text"> |
| 306 | |
| 307 | <p> |
| 308 | blah |
| 309 | </p> |
| 310 | |
| 311 | </div> |
| 312 | |
| 313 | |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 314 | <!-- *********************************************************************** --> |
| 315 | <div class="doc_section"> <a name="llvmir">LLVM IR Encoding</a></div> |
| 316 | <!-- *********************************************************************** --> |
| 317 | |
| 318 | <div class="doc_text"> |
| 319 | |
| 320 | <p></p> |
| 321 | |
| 322 | </div> |
| 323 | |
| 324 | |
Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 325 | <!-- *********************************************************************** --> |
| 326 | <hr> |
| 327 | <address> <a href="http://jigsaw.w3.org/css-validator/check/referer"><img |
| 328 | src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a> |
| 329 | <a href="http://validator.w3.org/check/referer"><img |
| 330 | src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a> |
Chris Lattner | e9ef457 | 2007-05-12 03:23:40 +0000 | [diff] [blame] | 331 | <a href="mailto:sabre@nondot.org">Chris Lattner</a><br> |
Reid Spencer | 2c1ce4f | 2007-01-20 23:21:08 +0000 | [diff] [blame] | 332 | <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br> |
| 333 | Last modified: $Date$ |
| 334 | </address> |
| 335 | </body> |
| 336 | </html> |