<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
                      "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>LLVM Bitcode File Format</title>
  <link rel="stylesheet" href="llvm.css" type="text/css">
</head>
<body>
<div class="doc_title"> LLVM Bitcode File Format </div>
<ol>
  <li><a href="#abstract">Abstract</a></li>
  <li><a href="#overview">Overview</a></li>
  <li><a href="#bitstream">Bitstream Format</a>
    <ol>
    <li><a href="#magic">Magic Numbers</a></li>
    <li><a href="#primitives">Primitives</a></li>
    <li><a href="#abbrevid">Abbreviation IDs</a></li>
    <li><a href="#blocks">Blocks</a></li>
    <li><a href="#datarecord">Data Records</a></li>
    </ol>
  </li>
  <li><a href="#llvmir">LLVM IR Encoding</a></li>
</ol>
<div class="doc_author">
  <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a>.</p>
</div>

<!-- *********************************************************************** -->
<div class="doc_section"> <a name="abstract">Abstract</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>This document describes the LLVM bitstream file format and the encoding of
the LLVM IR into it.</p>

</div>

<!-- *********************************************************************** -->
<div class="doc_section"> <a name="overview">Overview</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>
What is commonly known as the LLVM bitcode file format (also, sometimes
anachronistically known as bytecode) is actually two things: a <a
href="#bitstream">bitstream container format</a>
and an <a href="#llvmir">encoding of LLVM IR</a> into the container format.</p>

<p>
The bitstream format is an abstract encoding of structured data, very
similar to XML in some ways. Like XML, bitstream files contain tags and nested
structures, and you can parse the file without having to understand the tags.
Unlike XML, the bitstream format is a binary encoding, and it
provides a mechanism for the file to self-describe "abbreviations", which are
effectively size optimizations for the content.</p>

<p>This document first describes the LLVM bitstream format, then describes the
record structure used by LLVM IR files.
</p>

</div>

<!-- *********************************************************************** -->
<div class="doc_section"> <a name="bitstream">Bitstream Format</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>
The bitstream format is literally a stream of bits, with a very simple
structure. This structure consists of the following concepts:
</p>

<ul>
<li>A "<a href="#magic">magic number</a>" that identifies the contents of
    the stream.</li>
<li>Encoding <a href="#primitives">primitives</a> like variable bit-rate
    integers.</li>
<li><a href="#blocks">Blocks</a>, which define nested content.</li>
<li><a href="#datarecord">Data Records</a>, which describe entities within the
    file.</li>
<li>Abbreviations, which specify compression optimizations for the file.</li>
</ul>

<p>Note that the <a
href="CommandGuide/html/llvm-bcanalyzer.html">llvm-bcanalyzer</a> tool can be
used to dump and inspect arbitrary bitstreams, which is very useful for
understanding the encoding.</p>

</div>

<!-- ======================================================================= -->
<div class="doc_subsection"><a name="magic">Magic Numbers</a>
</div>

<div class="doc_text">

<p>The first four bytes of the stream identify the encoding of the file. A
reader uses them to determine what the file contains.</p>

</div>
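<div class="doc_text">

<p>As a minimal sketch, a reader could compare the first four bytes against
the magic number it expects before parsing anything else. The function name
<tt>has_magic</tt> is hypothetical, and the byte value used for LLVM IR files
('B', 'C', 0xC0, 0xDE) is an assumption of this example; this section does not
specify it.</p>

```python
def has_magic(data: bytes, expected: bytes) -> bool:
    """Return True if the stream starts with the four-byte magic number."""
    if len(expected) != 4:
        raise ValueError("a magic number is exactly four bytes")
    return data[:4] == expected


# Assumed value, for illustration only: the bytes 'B', 'C', 0xC0, 0xDE.
ASSUMED_LLVM_IR_MAGIC = b"BC\xc0\xde"

# A stream that is too short, or that starts with other bytes, is rejected.
looks_like_llvm_ir = has_magic(b"BC\xc0\xde\x00\x00\x00\x00", ASSUMED_LLVM_IR_MAGIC)
```

</div>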

<!-- ======================================================================= -->
<div class="doc_subsection"><a name="primitives">Primitives</a>
</div>

<div class="doc_text">

<p>
A bitstream literally consists of a stream of bits. This stream is made up of a
number of primitive values that encode a stream of integer values. These
integers are encoded in two ways: either as <a href="#fixedwidth">Fixed
Width Integers</a> or as <a href="#variablewidth">Variable Width
Integers</a>.
</p>

</div>

<!-- _______________________________________________________________________ -->
<div class="doc_subsubsection"> <a name="fixedwidth">Fixed Width Integers</a>
</div>

<div class="doc_text">

<p>Fixed-width integer values have their low bits emitted directly to the file.
   For example, a 3-bit integer value encodes 1 as 001. Fixed-width integers
   are used when there is a well-known number of options for a field. For
   example, boolean values are usually encoded with a 1-bit wide integer.
</p>

</div>
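<div class="doc_text">

<p>The following sketch shows one way to emit and read back a fixed-width
field. The stream is modelled as a Python list of 0/1 values, and the helper
names <tt>emit_fixed</tt> and <tt>read_fixed</tt> are hypothetical. The bit
order (least significant bit first) is an assumption of this example; the text
above only says that the low bits of the value are emitted.</p>

```python
def emit_fixed(bits: list, value: int, width: int) -> None:
    """Append the low `width` bits of `value` to `bits`.

    Assumption: the least significant bit is written first.
    """
    if value < 0 or width < 0:
        raise ValueError("value and width must be non-negative")
    for i in range(width):
        bits.append((value >> i) & 1)


def read_fixed(bits: list, pos: int, width: int) -> tuple:
    """Read a `width`-bit value starting at `pos`; return (value, new_pos)."""
    if pos + width > len(bits):
        raise EOFError("not enough bits left in the stream")
    value = 0
    for i in range(width):
        value |= bits[pos + i] << i
    return value, pos + width


stream = []
emit_fixed(stream, 1, 3)   # the 3-bit value 001, stored low bit first
emit_fixed(stream, 1, 1)   # a boolean 'true' in a 1-bit field
```

</div>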

<!-- _______________________________________________________________________ -->
<div class="doc_subsubsection"> <a name="variablewidth">Variable Width
Integers</a></div>

<div class="doc_text">

<p>Variable-width integer (VBR) values encode values of arbitrary size,
optimizing for the case where the values are small. Given a 4-bit VBR field,
any 3-bit value (0 through 7) is encoded directly, with the high bit set to
zero. In general, for an N-bit VBR field, values that need more than N-1 bits
are emitted as a series of N-1 bit chunks, lowest chunk first, where all but
the last chunk set the high bit.</p>

<p>For example, the value 27 (0x1B) is encoded as 1011 0011 when emitted as a
vbr4 value. The first set of four bits indicates the value 3 (011) with a
continuation piece (indicated by a high bit of 1). The next set of four bits
indicates a value of 24 (011 &lt;&lt; 3) with no continuation. The sum (3+24)
yields the value 27.
</p>

</div>
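<div class="doc_text">

<p>The chunking rule can be sketched as follows. The function names
<tt>encode_vbr</tt> and <tt>decode_vbr</tt> are hypothetical, and the chunks
are returned as a list of small integers rather than written to a real
bitstream, so this illustrates the arithmetic only.</p>

```python
def encode_vbr(value: int, n: int) -> list:
    """Split `value` into n-bit chunks: n-1 payload bits plus a high
    continuation bit that is set on every chunk except the last."""
    if value < 0 or n < 2:
        raise ValueError("value must be non-negative and n at least 2")
    payload_bits = n - 1
    high_bit = 1 << payload_bits
    chunks = []
    while value >= high_bit:
        chunks.append((value & (high_bit - 1)) | high_bit)
        value >>= payload_bits
    chunks.append(value)
    return chunks


def decode_vbr(chunks: list, n: int) -> int:
    """Reassemble a value from chunks produced by encode_vbr."""
    payload_bits = n - 1
    high_bit = 1 << payload_bits
    value = 0
    shift = 0
    for chunk in chunks:
        value |= (chunk & (high_bit - 1)) << shift
        shift += payload_bits
        if not chunk & high_bit:
            break
    return value


# The example from the text: 27 as vbr4 is 1011 followed by 0011.
chunks_for_27 = encode_vbr(27, 4)
```

<p>Small values cost a single chunk, which is the case the encoding optimizes
for; each additional N-1 bits of magnitude costs one more N-bit chunk.</p>

</div>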

<!-- _______________________________________________________________________ -->
<div class="doc_subsubsection"> <a name="char6">6-bit characters</a></div>

<div class="doc_text">

<p>6-bit characters encode common characters into a fixed 6-bit field. They
represent the following characters with the following 6-bit values:</p>

<ul>
<li>'a' .. 'z' - 0 .. 25</li>
<li>'A' .. 'Z' - 26 .. 51</li>
<li>'0' .. '9' - 52 .. 61</li>
<li>'.' - 62</li>
<li>'_' - 63</li>
</ul>

<p>This encoding is only suitable for characters and strings that
consist entirely of the above characters. It cannot encode any
character outside the set.</p>

</div>
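<div class="doc_text">

<p>A minimal sketch of the mapping, assuming the ranges are contiguous so that
the 64 available values are each used exactly once ('A' .. 'Z' occupying
26 .. 51 and '0' .. '9' occupying 52 .. 61). The names
<tt>CHAR6_ALPHABET</tt>, <tt>encode_char6</tt>, <tt>decode_char6</tt> and
<tt>is_char6_string</tt> are hypothetical.</p>

```python
import string

# Index in this string is the 6-bit value of the character.
CHAR6_ALPHABET = (
    string.ascii_lowercase + string.ascii_uppercase + string.digits + "._"
)


def encode_char6(ch: str) -> int:
    """Return the 6-bit value for a single character in the set."""
    if len(ch) != 1 or ch not in CHAR6_ALPHABET:
        raise ValueError("character cannot be encoded as char6: %r" % ch)
    return CHAR6_ALPHABET.index(ch)


def decode_char6(value: int) -> str:
    """Return the character for a 6-bit value."""
    if not 0 <= value < 64:
        raise ValueError("char6 values are in the range 0..63")
    return CHAR6_ALPHABET[value]


def is_char6_string(text: str) -> bool:
    """True if every character of `text` is in the char6 set."""
    return all(c in CHAR6_ALPHABET for c in text)
```

<p>A writer would typically check <tt>is_char6_string</tt> first and fall back
to a wider encoding when a string contains anything outside the set.</p>

</div>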

<!-- _______________________________________________________________________ -->
<div class="doc_subsubsection"> <a name="wordalign">Word Alignment</a></div>

<div class="doc_text">

<p>Occasionally, it is useful to emit zero bits until the bitstream is a
multiple of 32 bits. This ensures that the bit position in the stream can be
represented as a multiple of 32-bit words.</p>

</div>
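<div class="doc_text">

<p>The amount of padding follows directly from the current bit position. The
sketch below models the stream as a list of 0/1 values; the helper names
<tt>padding_to_word</tt> and <tt>align_to_word</tt> are hypothetical.</p>

```python
def padding_to_word(bit_pos: int, word_bits: int = 32) -> int:
    """Number of zero bits needed to reach the next multiple of word_bits.

    Returns 0 when the position is already aligned.
    """
    if bit_pos < 0:
        raise ValueError("bit position must be non-negative")
    return (-bit_pos) % word_bits


def align_to_word(bits: list, word_bits: int = 32) -> None:
    """Append zero bits until len(bits) is a multiple of word_bits."""
    bits.extend([0] * padding_to_word(len(bits), word_bits))


stream = [1, 0, 1]      # three bits written so far
align_to_word(stream)   # pad with 29 zero bits
```

</div>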


<!-- ======================================================================= -->
<div class="doc_subsection"><a name="abbrevid">Abbreviation IDs</a>
</div>

<div class="doc_text">

<p>
A bitstream is a sequential series of <a href="#blocks">Blocks</a> and
<a href="#datarecord">Data Records</a>. Both of these start with an
abbreviation ID encoded as a fixed-bitwidth field. The width is specified by
the current block, as described below. The value of the abbreviation ID
specifies either a builtin ID (which has a special meaning, defined below) or
one of the abbreviation IDs defined by the stream itself.
</p>

<p>
The set of builtin abbrev IDs is:
</p>

<ul>
<li>0 - <a href="#END_BLOCK">END_BLOCK</a> - This abbrev ID marks the end of the
    current block.</li>
<li>1 - <a href="#ENTER_SUBBLOCK">ENTER_SUBBLOCK</a> - This abbrev ID marks the
    beginning of a new block.</li>
<li>2 - DEFINE_ABBREV - This defines a new abbreviation.</li>
<li>3 - UNABBREV_RECORD - This ID specifies the definition of an unabbreviated
    record.</li>
</ul>

<p>Abbreviation IDs 4 and above are defined by the stream itself.</p>

</div>
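<div class="doc_text">

<p>One way a reader might dispatch on the abbreviation ID is sketched below.
The numeric values come from the list above; the names
<tt>BuiltinAbbrevID</tt>, <tt>FIRST_STREAM_DEFINED_ID</tt> and
<tt>classify_abbrev_id</tt> are hypothetical, as is the choice to report
stream-defined IDs by their index counted from 4.</p>

```python
from enum import IntEnum


class BuiltinAbbrevID(IntEnum):
    END_BLOCK = 0
    ENTER_SUBBLOCK = 1
    DEFINE_ABBREV = 2
    UNABBREV_RECORD = 3


FIRST_STREAM_DEFINED_ID = 4


def classify_abbrev_id(abbrev_id: int, width: int) -> str:
    """Describe an abbreviation ID read from a `width`-bit field."""
    if not 0 <= abbrev_id < (1 << width):
        raise ValueError("ID does not fit in the current abbrev id width")
    if abbrev_id < FIRST_STREAM_DEFINED_ID:
        return BuiltinAbbrevID(abbrev_id).name
    # Index of the abbreviation among those defined by the stream.
    return "stream-defined #%d" % (abbrev_id - FIRST_STREAM_DEFINED_ID)
```

<p>Note that a width of 2 can only express the four builtin IDs, so a block
that wants to use its own abbreviations needs a width of at least 3.</p>

</div>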

<!-- ======================================================================= -->
<div class="doc_subsection"><a name="blocks">Blocks</a>
</div>

<div class="doc_text">

<p>
Blocks in a bitstream denote nested regions of the stream, and are identified by
a content-specific id number (for example, LLVM IR uses an ID of 12 to represent
function bodies). Nested blocks capture the hierarchical structure of the
encoded data, and various properties are associated with blocks as the file is
parsed. Block definitions allow the reader to efficiently skip blocks
in constant time if the reader wants a summary of blocks, or if it wants to
efficiently skip data it does not understand. The LLVM IR reader uses this
mechanism to skip function bodies, lazily reading them on demand.
</p>

<p>
When reading and encoding the stream, several properties are maintained for the
block. In particular, each block maintains:
</p>

<ol>
<li>A current abbrev id width. This value starts at 2, and is set every time a
    block record is entered. The block entry specifies the abbrev id width for
    the body of the block.</li>

<li>A set of abbreviations. Abbreviations may be defined within a block, or
    they may be associated with all blocks of a particular ID.
</li>
</ol>

<p>As sub-blocks are entered, these properties are saved and the new sub-block
has its own set of abbreviations, and its own abbrev id width. When a sub-block
is popped, the saved values are restored.</p>

</div>
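<div class="doc_text">

<p>The save and restore behaviour described above maps naturally onto a stack.
The sketch below is one possible model, not a description of the LLVM reader:
the class names <tt>BlockScope</tt> and <tt>ScopeStack</tt> are hypothetical,
abbreviations are stored as opaque list entries, and abbreviations associated
with all blocks of a particular ID are not modelled.</p>

```python
from dataclasses import dataclass, field


@dataclass
class BlockScope:
    """Per-block reader state."""
    abbrev_id_width: int = 2                  # the stream starts at width 2
    abbreviations: list = field(default_factory=list)


class ScopeStack:
    def __init__(self):
        self.current = BlockScope()
        self._saved = []

    def enter_block(self, new_abbrev_id_width: int) -> None:
        """Save the enclosing state and start a fresh scope for the sub-block."""
        self._saved.append(self.current)
        self.current = BlockScope(abbrev_id_width=new_abbrev_id_width)

    def exit_block(self) -> None:
        """Restore the state saved when the block was entered."""
        if not self._saved:
            raise ValueError("END_BLOCK without a matching ENTER_SUBBLOCK")
        self.current = self._saved.pop()


scopes = ScopeStack()
scopes.current.abbreviations.append("outer abbreviation")
scopes.enter_block(3)        # sub-block body uses 3-bit abbreviation IDs
inner_width = scopes.current.abbrev_id_width
scopes.exit_block()          # back to the outer width and abbreviations
```

</div>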

<!-- _______________________________________________________________________ -->
<div class="doc_subsubsection"> <a name="ENTER_SUBBLOCK">ENTER_SUBBLOCK
Encoding</a></div>

<div class="doc_text">

<p><tt>[ENTER_SUBBLOCK, blockid<sub>vbr8</sub>, newabbrevlen<sub>vbr4</sub>,
     &lt;align32bits&gt;, blocklen<sub>32</sub>]</tt></p>

<p>
The ENTER_SUBBLOCK abbreviation ID specifies the start of a new block record.
The <tt>blockid</tt> value is encoded as an 8-bit VBR identifier, and indicates
the type of block being entered (which is application specific). The
<tt>newabbrevlen</tt> value is a 4-bit VBR which specifies the
abbrev id width for the sub-block. The <tt>blocklen</tt> is a 32-bit aligned
value that specifies the size of the sub-block, in 32-bit words. This value
allows the reader to skip over the entire block in one jump.
</p>

</div>
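<div class="doc_text">

<p>Putting the primitives together, the fields of an ENTER_SUBBLOCK record
could be read as sketched below. Everything here is illustrative: the
<tt>BitReader</tt> class and the function names are hypothetical, the stream is
a list of 0/1 values with the least significant bit of each field first (an
assumption), and <tt>blocklen</tt> is assumed to count the words that follow
the <tt>blocklen</tt> field itself.</p>

```python
class BitReader:
    """Reads fields from a list of 0/1 values, low bit of each field first."""

    def __init__(self, bits):
        self.bits = bits
        self.pos = 0

    def read_fixed(self, width):
        if self.pos + width > len(self.bits):
            raise EOFError("not enough bits left in the stream")
        value = 0
        for i in range(width):
            value |= self.bits[self.pos + i] << i
        self.pos += width
        return value

    def read_vbr(self, n):
        high_bit = 1 << (n - 1)
        value, shift = 0, 0
        while True:
            chunk = self.read_fixed(n)
            value |= (chunk & (high_bit - 1)) << shift
            if not chunk & high_bit:
                return value
            shift += n - 1

    def align32(self):
        self.pos += (-self.pos) % 32


def read_enter_subblock(reader):
    """Parse the fields that follow the ENTER_SUBBLOCK abbreviation ID."""
    block_id = reader.read_vbr(8)
    new_abbrev_len = reader.read_vbr(4)
    reader.align32()
    block_len_words = reader.read_fixed(32)
    return block_id, new_abbrev_len, block_len_words


def skip_block(reader, block_len_words):
    """Jump over the block body without parsing it (assumed length origin)."""
    reader.pos += block_len_words * 32


def to_bits(value, width):
    """Helper for building example streams, low bit first."""
    return [(value >> i) & 1 for i in range(width)]
```

<p>A reader that does not recognize <tt>block_id</tt> can call
<tt>skip_block</tt> immediately after <tt>read_enter_subblock</tt>, which is
the constant-time skip that the block length field makes possible.</p>

</div>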

<!-- _______________________________________________________________________ -->
<div class="doc_subsubsection"> <a name="END_BLOCK">END_BLOCK
Encoding</a></div>

<div class="doc_text">

<p><tt>[END_BLOCK, &lt;align32bits&gt;]</tt></p>

<p>
The END_BLOCK abbreviation ID specifies the end of the current block record.
Its end is aligned to 32 bits to ensure that the size of the block is a
multiple of 32 bits.</p>

</div>



<!-- ======================================================================= -->
<div class="doc_subsection"><a name="datarecord">Data Records</a>
</div>

<div class="doc_text">

<p>
blah
</p>

</div>


<!-- *********************************************************************** -->
<div class="doc_section"> <a name="llvmir">LLVM IR Encoding</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p></p>

</div>


<!-- *********************************************************************** -->
<hr>
<address> <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
<a href="http://validator.w3.org/check/referer"><img
 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
<a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
Last modified: $Date$
</address>
</body>
</html>