Design of the Subzero fast code generator
=========================================

Introduction
------------

The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the
PNaCl toolchain to compile their application to architecture-neutral PNaCl
bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
possible. The ``.pexe`` file is downloaded to the user's browser where the
PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
`sandboxed
<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
native code. The translator uses architecture-specific optimizations as much as
practical to generate good native code.

The native code can be cached by the browser to avoid repeating translation on
future page loads. However, first-time user experience is hampered by long
translation times. The LLVM-based PNaCl translator is pretty slow, even when
using ``-O0`` to minimize optimizations, so delays are especially noticeable on
slow browser platforms such as ARM-based Chromebooks.

Translator slowness can be mitigated or hidden in a number of ways.

- Parallel translation. However, slow machines where this matters most, e.g.
  ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
  and are likely to have less memory available for multiple translation threads
  to use.

- Streaming translation, i.e. start translating as soon as the download starts.
  This doesn't help much when translation speed is 10× slower than download
  speed, or the ``.pexe`` file is already cached while the translated binary was
  flushed from the cache.

- Arrange the web page such that translation is done in parallel with
  downloading large assets.

- Arrange the web page to distract the user with `cat videos
  <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
  progress.

Or, improve translator performance to something more reasonable.

This document describes Subzero's attempt to improve translation speed by an
order of magnitude while rivaling LLVM's code quality. Subzero does this
through minimal IR layering, lean data structures and passes, and a careful
selection of fast optimization passes. It has two optimization recipes: full
optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the
following (described in more detail below):

+-----------------------------------+-----------------------+
| O2 recipe                         | Om1 recipe            |
+===================================+=======================+
| Parse .pexe file                  | Parse .pexe file      |
+-----------------------------------+-----------------------+
| Loop nest analysis                |                       |
+-----------------------------------+-----------------------+
| Address mode inference            |                       |
+-----------------------------------+-----------------------+
| Read-modify-write (RMW) transform |                       |
+-----------------------------------+-----------------------+
| Basic liveness analysis           |                       |
+-----------------------------------+-----------------------+
| Load optimization                 |                       |
+-----------------------------------+-----------------------+
|                                   | Phi lowering (simple) |
+-----------------------------------+-----------------------+
| Target lowering                   | Target lowering       |
+-----------------------------------+-----------------------+
| Full liveness analysis            |                       |
+-----------------------------------+-----------------------+
| Register allocation               |                       |
+-----------------------------------+-----------------------+
| Phi lowering (advanced)           |                       |
+-----------------------------------+-----------------------+
| Post-phi register allocation      |                       |
+-----------------------------------+-----------------------+
| Branch optimization               |                       |
+-----------------------------------+-----------------------+
| Code emission                     | Code emission         |
+-----------------------------------+-----------------------+

Goals
=====

Translation speed
-----------------

We'd like to be able to translate a ``.pexe`` file as fast as download speed.
Any faster is in a sense wasted effort. Download speed varies greatly, but
we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a
slow machine. We observe a 3× single-thread performance difference between A15
and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
``.pexe`` file could be compressed to 50% on the web server using gzip transport
compression. At 1 MB/sec of compressed download, the uncompressed ``.pexe``
data arrives at 2 MB/sec, which is the rate the A15 would have to sustain.
Scaling by the 3× CPU difference, we set the translation speed goal to 6 MB/sec
on the high-end Xeon workstation.
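
The arithmetic behind the 6 MB/sec goal can be written out as a small sketch.
The struct and function names here are ours and purely illustrative, and the
inputs are the assumptions stated above, not measurements.

```cpp
#include <cassert>

// Illustrative inputs; all three are assumptions taken from the text.
struct SpeedAssumptions {
  double DownloadMBPerSec;    // assumed network rate for the compressed .pexe
  double CompressionRatio;    // compressed size / uncompressed size
  double SlowToFastCpuFactor; // fast-machine speed / slow-machine speed
};

// Uncompressed .pexe data arrives at Download / Ratio on the slow machine.
// Multiplying by the CPU factor expresses the same goal on the fast machine.
inline double translationGoalMBPerSec(const SpeedAssumptions &A) {
  assert(A.CompressionRatio > 0.0 && "compression ratio must be positive");
  const double UncompressedMBPerSec = A.DownloadMBPerSec / A.CompressionRatio;
  return UncompressedMBPerSec * A.SlowToFastCpuFactor;
}
```

With 1 MB/sec, a 0.5 compression ratio, and a 3× CPU factor, the function
returns 6 MB/sec. Changing any one assumption rescales the goal linearly.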

Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
⅒ the target rate. The ``-O2`` mode takes 3× as long as the ``-O0`` mode.

In other words, Subzero's goal is to improve over LLVM's translation speed by
10×.

Code quality
------------

Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
code quality. The stretch goal is to approach LLVM ``-O2`` code quality. On
average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.

It's important to note that the quality of Subzero-generated code depends on
target-neutral optimizations and simplifications being run beforehand in the
developer environment. The ``.pexe`` file reflects these optimizations. For
example, Subzero assumes that the basic blocks are ordered topologically where
possible (which makes liveness analysis converge fastest), and Subzero does not
do any function inlining because it should already have been done.

Translator size
---------------

The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
We think 1 MB is a more reasonable size -- especially for such a component that
is distributed to a billion Chrome users. Thus we target a 10× reduction in
binary size.

For development, Subzero can be built for all target architectures, with all
debugging and diagnostic options enabled. For a smaller translator, we restrict
to a single target architecture, and define a ``MINIMAL`` build where
unnecessary features are compiled out.

Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
include directories, which have little impact on translator size. It also uses
some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
with little size impact. In non-``MINIMAL`` builds, the translator size is much
larger due to including code for parsing text-format bitcode files and forming
LLVM IR.

Memory footprint
----------------

The current LLVM-based translator suffers from an issue in which some
function-specific data has to be retained in memory until all translation
completes, and therefore the memory footprint grows without bound. Large
``.pexe`` files can lead to the translator process holding hundreds of MB of
memory by the end. The translator runs in a separate process, so this memory
growth doesn't *directly* affect other processes, but it does dirty the physical
memory and contributes to a perception of bloat and sometimes a reality of
out-of-memory tab killing, especially noticeable on weaker systems.

Subzero should maintain a stable memory footprint throughout translation. It's
not really practical to set a specific limit, because there is not really a
practical limit on a single function's size, but the footprint should be
"reasonable" and be proportional to the largest input function size, not the
total ``.pexe`` file size. Simply put, Subzero should not have memory leaks or
inexorable memory growth. (We use ASAN builds to test for leaks.)

Multithreaded translation
-------------------------

It should be practical to translate different functions concurrently and see
good scalability. Some locking may be needed, such as accessing output buffers
or constant pools, but that should be fairly minimal. In contrast, LLVM was
only designed for module-level parallelism, and as such, the PNaCl translator
internally splits a ``.pexe`` file into several modules for concurrent
translation. All output needs to be deterministic regardless of the level of
multithreading, i.e. functions and data should always be output in the same
order.

Target architectures
--------------------

Initial target architectures are x86-32, x86-64, ARM32, and MIPS32. Future
targets include ARM64 and MIPS64, though these targets lack NaCl support,
including a sandbox model or a validator.

The first implementation is for x86-32, because it was expected to be
particularly challenging, and thus more likely to draw out any design problems
early:

- There are a number of special cases, asymmetries, and warts in the x86
  instruction set.

- Complex addressing modes may be leveraged for better code quality.

- 64-bit integer operations have to be lowered into longer sequences of 32-bit
  operations.

- Paucity of physical registers may reveal code quality issues early in the
  design.

Detailed design
===============

Intermediate representation - ICE
---------------------------------

Subzero's IR is called ICE. It is designed to be reasonably similar to LLVM's
IR, which is reflected in the ``.pexe`` file's bitcode structure. It has a
representation of global variables and initializers, and a set of functions.
Each function contains a list of basic blocks, and each basic block contains a
list of instructions. Instructions that operate on stack and register variables
do so using static single assignment (SSA) form.

The ``.pexe`` file is translated one function at a time (or in parallel by
multiple translation threads). The recipe for optimization passes depends on
the specific target and optimization level, and is described in detail below.
Global variables (types and initializers) are simply and directly translated to
object code, without any meaningful attempts at optimization.

A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
Its key contents include:

- A list of ``CfgNode`` pointers, generally held in topological order.

- A list of ``Variable`` pointers corresponding to local variables used in the
  function plus compiler-generated temporaries.

A basic block is represented by the ``Ice::CfgNode`` class. Its key contents
include:

- A linear list of instructions, in the same style as LLVM. The last
  instruction of the list is always a terminator instruction: branch, switch,
  return, unreachable.

- A list of Phi instructions, also in the same style as LLVM. They are held as
  a linear list for convenience, though per Phi semantics, they are executed "in
  parallel" without dependencies on each other.

- An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
  another list for outgoing edges.

- The node's unique, 0-based index into the CFG's node list.

An instruction is represented by the ``Ice::Inst`` class. Its key contents
include:

- A list of source operands.

- Its destination variable, if the instruction produces a result in an
  ``Ice::Variable``.

- A bitvector indicating which variables' live ranges this instruction ends.
  This is computed during liveness analysis.

Instruction kinds are divided into high-level ICE instructions and low-level
ICE instructions. High-level instructions consist of the PNaCl/LLVM bitcode
instruction kinds. Each target architecture implementation extends the
instruction space with its own set of low-level instructions. Generally,
low-level instructions correspond to individual machine instructions. The
high-level ICE instruction space includes a few additional instruction kinds
that are not part of LLVM but are generally useful (e.g., an Assignment
instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
instructions for sandboxing).

Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
the following:

- Alloca: allocate data on the stack

- Arithmetic: binary operations of the form ``A = B op C``

- Br: conditional or unconditional branch

- Call: function call

- Cast: unary type-conversion operations

- ExtractElement: extract a scalar element from a vector-type value

- Fcmp: floating-point comparison

- Icmp: integer comparison

- IntrinsicCall: call a known intrinsic

- InsertElement: insert a scalar element into a vector-type value

- Load: load a value from memory

- Phi: implement the SSA phi node

- Ret: return from the function

- Select: essentially the C language operation of the form ``X = C ? Y : Z``

- Store: store a value into memory

- Switch: generalized branch to multiple possible locations

- Unreachable: indicate that this portion of the code is unreachable

The additional high-level ICE instructions are the following:

- Assign: a simple ``A=B`` assignment. This is useful for e.g. lowering Phi
  instructions to non-SSA assignments, before lowering to machine code.

- BundleLock, BundleUnlock. These are markers used for sandboxing, but are
  common across all targets and so they are elevated to the high-level
  instruction set.

- FakeDef, FakeUse, FakeKill. These are tools used to preserve consistency in
  liveness analysis, elevated to the high-level instruction set because they
  are used by all targets. They are described in more detail at the end of
  this section.

- JumpTable: this represents the result of switch optimization analysis, where
  some switch instructions may use jump tables instead of cascading
  compare/branches.

An operand is represented by the ``Ice::Operand`` class. In high-level ICE, an
operand is either an ``Ice::Constant`` or an ``Ice::Variable``. Constants
include scalar integer constants, scalar floating point constants, Undef (an
unspecified constant of a particular scalar or vector type), and symbol
constants (essentially addresses of globals). Note that the PNaCl ABI does not
include vector-type constants besides Undef, and as such, Subzero (so far) has
no reason to represent vector-type constants internally. A variable represents
a value allocated on the stack (though not including alloca-derived storage).
Among other things, a variable holds its unique, 0-based index into the CFG's
variable list.

Each target can extend the ``Constant`` and ``Variable`` classes for its own
needs. In addition, the ``Operand`` class may be extended, e.g. to define an
x86 ``MemOperand`` that encodes a base register, an index register, an index
register shift amount, and a constant offset.

Register allocation and liveness analysis are restricted to Variable operands.
Because of the importance of register allocation to code quality, and the
translation-time cost of liveness analysis, Variable operands get some special
treatment in ICE. Most notably, a frequent pattern in Subzero is to iterate
across all the Variables of an instruction. An instruction holds a list of
operands, but an operand may contain 0, 1, or more Variables. As such, the
``Operand`` class specially holds a list of Variables contained within, for
quick access.

A Subzero transformation pass may work by deleting an existing instruction and
replacing it with zero or more new instructions. Instead of actually deleting
the existing instruction, we generally mark it as deleted and insert the new
instructions right after the deleted instruction. When printing the IR for
debugging, this is a big help because it makes it much clearer how the
non-deleted instructions came about.
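
The mark-as-deleted idiom can be sketched as follows. This is only an
illustration: the ``Inst`` struct, ``replaceInst``, and ``dump`` are
hypothetical stand-ins that hold instructions as plain strings, not Subzero's
actual ``Ice::Inst`` interface.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified stand-in for an ICE instruction.
struct Inst {
  explicit Inst(std::string T) : Text(std::move(T)) {}
  std::string Text;
  bool Deleted = false;
};

using InstList = std::list<Inst>;

// "Replaces" *Old: the old instruction stays in the list, flagged as deleted,
// and the replacements are inserted right after it in the given order.
inline void replaceInst(InstList &Insts, InstList::iterator Old,
                        const std::vector<std::string> &Replacements) {
  assert(Old != Insts.end() && "cannot replace the end iterator");
  Old->Deleted = true;
  const auto InsertPos = std::next(Old);
  for (const std::string &Text : Replacements)
    Insts.insert(InsertPos, Inst(Text)); // inserts before InsertPos
}

// Renders the list for debugging; deleted instructions remain visible,
// prefixed with "//", so the origin of the new instructions is apparent.
inline std::string dump(const InstList &Insts) {
  std::string Out;
  for (const Inst &I : Insts)
    Out += (I.Deleted ? "  //" : "    ") + I.Text + "\n";
  return Out;
}
```

Later passes would simply skip instructions whose ``Deleted`` flag is set, so
keeping them in the list costs little beyond the memory they occupy.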

Subzero has a few special instructions to help with liveness analysis
consistency.

- The FakeDef instruction gives a fake definition of some variable. For
  example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
  but an ICE instruction can represent only one destination variable. This is
  similar for multiply instructions, and for function calls that return a 64-bit
  integer result in the ``%edx:%eax`` pair. Also, using the ``xor %eax, %eax``
  trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.

- The FakeUse instruction registers a use of a variable, typically to prevent an
  earlier assignment to that variable from being dead-code eliminated. For
  example, lowering an operation like ``x=cc?y:z`` may be done using x86's
  conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``. Without a
  FakeUse of ``x`` between the two instructions, the liveness analysis pass may
  dead-code eliminate the first instruction.

- The FakeKill instruction is added after a call instruction, and is a quick way
  of indicating that caller-save registers are invalidated.
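
To see why the FakeUse in the cmov example matters, here is a minimal sketch
of a backward dead-instruction scan over a single block. It assumes a
simplified model in which every instruction has at most one destination and
is free of side effects, and in which ``cmov_cc y, x`` records only ``y`` as a
source. The ``Inst`` struct and ``markDeadInsts`` are hypothetical names, not
Subzero's API.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-destination instruction model.
struct Inst {
  Inst(std::string Op, std::string D, std::vector<std::string> S)
      : Opcode(std::move(Op)), Dest(std::move(D)), Srcs(std::move(S)) {}
  std::string Opcode;
  std::string Dest;              // empty if there is no destination
  std::vector<std::string> Srcs; // variables read by the instruction
  bool Dead = false;
};

// Walks the block backward. Live starts as the set of variables that are
// live after the block. An instruction whose destination is not live at
// that point is marked dead, and its sources then contribute no uses.
inline void markDeadInsts(std::vector<Inst> &Insts,
                          std::set<std::string> Live) {
  for (auto I = Insts.rbegin(); I != Insts.rend(); ++I) {
    if (!I->Dest.empty() && Live.erase(I->Dest) == 0) {
      I->Dead = true;
      continue;
    }
    for (const std::string &Src : I->Srcs) {
      assert(!Src.empty() && "source operands need a name");
      Live.insert(Src);
    }
  }
}
```

In this model, the sequence ``mov z, x; cmov_cc y, x`` with ``x`` live
afterward loses its ``mov``: the cmov's definition of ``x`` ends the backward
liveness of ``x`` before the scan reaches the ``mov``. Placing a FakeUse of
``x`` (an instruction with no destination and ``x`` as its only source)
between the two keeps ``x`` live at the ``mov``.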

Pexe parsing
------------

Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files. It
parses the ``.pexe`` file function by function, ultimately constructing an ICE
CFG for each function. After a function is parsed, its CFG is handed off to the
translation phase. The bitcode parser also parses global initializer data and
hands it off to be translated to data sections in the object file.

Subzero has another parsing strategy for testing/debugging. LLVM libraries can
be used to parse a module into LLVM IR (though very slowly relative to Subzero
native parsing). Then we iterate across the LLVM IR and construct high-level
ICE, handing off each CFG to the translation phase.

Overview of lowering
--------------------

In general, translation goes like this:

- Parse the next function from the ``.pexe`` file and construct a CFG consisting
  of high-level ICE.

- Do analysis passes and transformation passes on the high-level ICE, as
  desired.

- Lower each high-level ICE instruction into a sequence of zero or more
  low-level ICE instructions. Each high-level instruction is generally lowered
  independently, though the target lowering is allowed to look ahead in the
  CfgNode's instruction list if desired.

- Do more analysis and transformation passes on the low-level ICE, as desired.

- Assemble the low-level CFG into an ELF object file (alternatively, a textual
  assembly file that is later assembled by some external tool).

- Repeat for all functions, and also produce object code for data such as global
  initializers and internal constant pools.

Currently there are two optimization levels: ``O2`` and ``Om1``. For ``O2``,
the intention is to apply all available optimizations to get the best code
quality (though the initial code quality goal is measured against LLVM's ``O0``
code quality). For ``Om1``, the intention is to apply as few optimizations as
possible and produce code as quickly as possible, accepting poor code quality.
``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
"sub-zero".

High-level debuggability of generated code is so far not a design requirement.
Subzero doesn't really do transformations that would obfuscate debugging; the
main thing might be that register allocation (including stack slot coalescing
for stack-allocated variables whose live ranges don't overlap) may render a
variable's value unobtainable after its live range ends. This would not be an
issue for ``Om1`` since it doesn't register-allocate program-level variables,
nor does it coalesce stack slots. That said, fully supporting debuggability
would require a few additions:

- DWARF support would need to be added to Subzero's ELF file emitter. Subzero
  propagates global symbol names, local variable names, and function-internal
  label names that are present in the ``.pexe`` file. This would allow a
  debugger to map addresses back to symbols in the ``.pexe`` file.

- To map ``.pexe`` file symbols back to meaningful source-level symbol names,
  file names, line numbers, etc., Subzero would need to handle `LLVM bitcode
  metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg``
  `intrinsics <http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.

- The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so
  the toolchain would need to be modified to preserve it.

Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
produces code with one third the code quality. ``Om1`` is good for testing and
debugging -- during translation, it tends to expose errors in the basic lowering
that might otherwise have been hidden by the register allocator or other
optimization passes. It also helps determine whether a code correctness problem
is a fundamental problem in the basic lowering, or an error in another
optimization pass.

The implementation of target lowering also controls the recipe of passes used
for ``Om1`` and ``O2`` translation. For example, address mode inference may
only be relevant for x86.

Lowering strategy
-----------------

The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
each high-level instruction down to a sequence of low-level target-specific
instructions, in a largely context-free setting. That is, each high-level
instruction conceptually has a simple template expansion into low-level
instructions, and lowering can in theory be done in any order. This may sound
like a small effort, but quite a large number of templates may be needed because
of the number of PNaCl types and instruction variants. Furthermore, there may
be optimized templates, e.g. to take advantage of operator commutativity (for
example, ``x=x+1`` might allow a better lowering than ``x=1+x``). This is
similar to other template-based approaches in fast code generation or
interpretation, though some decisions are deferred until after some global
analysis passes, mostly related to register allocation, stack slot assignment,
and specific choice of instruction variant and addressing mode.

The key idea for a lowering template is to produce valid low-level instructions
that are guaranteed to meet address mode and other structural requirements of
the instruction set. For example, on x86, the source operand of an integer
store instruction must be an immediate or a physical register; a shift
instruction's shift amount must be an immediate or in register ``%cl``; a
function's integer return value is in ``%eax``; most x86 instructions are
two-operand, in contrast to corresponding three-operand high-level instructions;
etc.

Because target lowering runs before register allocation, there is no way to know
whether a given ``Ice::Variable`` operand lives on the stack or in a physical
register. When the low-level instruction calls for a physical register operand,
the target lowering can create an infinite-weight Variable. This tells the
register allocator to assign infinite weight when making decisions, effectively
guaranteeing some physical register. Variables can also be pre-colored to a
specific physical register (``cl`` in the shift example above), which also gives
infinite weight.

To illustrate, consider a high-level arithmetic instruction on 32-bit integer
operands::

    A = B + C

X86 target lowering might produce the following::

    T.inf  = B  // mov instruction
    T.inf += C  // add instruction
    A = T.inf   // mov instruction

Here, ``T.inf`` is an infinite-weight temporary. As long as ``T.inf`` has a
physical register, the three lowered instructions are all encodable regardless
of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
whether ``A`` is a physical register or in memory.

In this example, ``A`` must be a Variable and one may be tempted to simplify the
lowering sequence by setting ``A`` as infinite-weight and using::

    A = B   // mov instruction
    A += C  // add instruction

This has two problems. First, if the original instruction was actually ``A =
B + A``, the result would be incorrect. Second, assigning ``A`` a physical
register applies throughout ``A``'s entire live range. This is probably not
what is intended, and may ultimately lead to a failure to allocate a register
for an infinite-weight variable.

This style of lowering leads to many temporaries being generated, so in ``O2``
mode, we rely on the register allocator to clean things up. For example, in the
example above, if ``B`` ends up getting a physical register and its live range
ends at this instruction, the register allocator is likely to reuse that
register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register
copy, which is removed as an emission-time peephole optimization.
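
The two templates for ``A = B + C`` discussed above can be compared with a
small sketch. Everything here is illustrative: ``LowInst``, the two
``lowerAdd...`` functions, and the tiny ``run`` interpreter are hypothetical
helpers, and variables are modeled as named integers rather than registers or
stack slots.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class Op { Mov, Add };

// Hypothetical two-operand low-level instruction: Dest = Src or Dest += Src.
struct LowInst {
  Op Kind;
  std::string Dest;
  std::string Src;
};

// Template from the text: T.inf = B; T.inf += C; A = T.inf.
inline std::vector<LowInst> lowerAddWithTemp(const std::string &A,
                                             const std::string &B,
                                             const std::string &C) {
  const std::string T = "T.inf"; // stands for a fresh infinite-weight temp
  return {{Op::Mov, T, B}, {Op::Add, T, C}, {Op::Mov, A, T}};
}

// The tempting shortcut: A = B; A += C.
inline std::vector<LowInst> lowerAddInPlace(const std::string &A,
                                            const std::string &B,
                                            const std::string &C) {
  return {{Op::Mov, A, B}, {Op::Add, A, C}};
}

using Env = std::map<std::string, int>;

// Executes the lowered sequence over named integer variables.
inline void run(const std::vector<LowInst> &Insts, Env &Vars) {
  for (const LowInst &I : Insts) {
    assert(Vars.count(I.Src) && "source must be defined before use");
    const int Value = Vars.at(I.Src);
    if (I.Kind == Op::Mov)
      Vars[I.Dest] = Value;
    else
      Vars[I.Dest] += Value;
  }
}
```

For ``A = B + A`` with ``A`` = 10 and ``B`` = 1, the three-instruction
template leaves ``A`` = 11, while the shortcut overwrites ``A`` before reading
it and leaves ``A`` = 2. This is the first of the two problems noted above.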

O2 lowering
-----------

Currently, the ``O2`` lowering recipe is the following:

- Loop nest analysis

- Address mode inference

- Read-modify-write (RMW) transformation

- Basic liveness analysis

- Load optimization

- Target lowering

- Full liveness analysis

- Register allocation

- Phi instruction lowering (advanced)

- Post-phi lowering register allocation

- Branch optimization

These passes are described in more detail below.

Om1 lowering
------------

Currently, the ``Om1`` lowering recipe is the following:

- Phi instruction lowering (simple)

- Target lowering

- Register allocation (infinite-weight and pre-colored only)

Optimization passes
-------------------

Liveness analysis
^^^^^^^^^^^^^^^^^

Liveness analysis is a standard dataflow analysis, implemented as follows.
For each node (basic block), its live-out set is computed as the union of the
live-in sets of its successor nodes. Then the node's instructions are processed
in reverse order, updating the live set, until the beginning of the node is
reached, and the node's live-in set is recorded. If this iteration has changed
the node's live-in set, the node's predecessors are marked for reprocessing.
This continues until no more nodes need reprocessing. If nodes are processed in
reverse topological order, the number of iterations over the CFG is generally
equal to the maximum loop nest depth.

To implement this, each node records its live-in and live-out sets, initialized
to the empty set. Each instruction records which of its Variables' live ranges
end in that instruction, initialized to the empty set. A side effect of
liveness analysis is dead instruction elimination. Each instruction can be
marked as tentatively dead, and after the algorithm converges, the tentatively
dead instructions are permanently deleted.
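
The iteration described above can be sketched as a worklist algorithm. This
is a minimal illustration under simplifying assumptions: variables are plain
integer indices, sets are ``std::set`` rather than bit vectors, and Phi
handling and dead instruction elimination are left out. The ``Inst`` and
``Node`` structs are hypothetical, not Subzero's classes.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <set>
#include <utility>
#include <vector>

using VarSet = std::set<int>;

struct Inst {
  int Dest;              // defined variable index, or -1 if none
  std::vector<int> Srcs; // used variable indices
};

struct Node {
  std::vector<Inst> Insts;
  std::vector<std::size_t> Preds, Succs; // indices into the node list
  VarSet LiveIn, LiveOut;                // both start out empty
};

// Assumes Nodes is held in (roughly) topological order, so visiting indices
// from last to first approximates reverse topological order.
inline void computeLiveness(std::vector<Node> &Nodes) {
  std::deque<std::size_t> Worklist;
  std::vector<bool> Queued(Nodes.size(), true);
  for (std::size_t I = Nodes.size(); I-- > 0;)
    Worklist.push_back(I);
  while (!Worklist.empty()) {
    const std::size_t N = Worklist.front();
    Worklist.pop_front();
    Queued[N] = false;
    Node &Nd = Nodes[N];
    // Live-out is the union of the successors' live-in sets.
    Nd.LiveOut.clear();
    for (std::size_t S : Nd.Succs) {
      assert(S < Nodes.size() && "successor index out of range");
      Nd.LiveOut.insert(Nodes[S].LiveIn.begin(), Nodes[S].LiveIn.end());
    }
    // Process instructions in reverse: a definition kills, a use generates.
    VarSet Live = Nd.LiveOut;
    for (auto I = Nd.Insts.rbegin(); I != Nd.Insts.rend(); ++I) {
      if (I->Dest >= 0)
        Live.erase(I->Dest);
      Live.insert(I->Srcs.begin(), I->Srcs.end());
    }
    // A changed live-in set means the predecessors must be reprocessed.
    if (Live != Nd.LiveIn) {
      Nd.LiveIn = std::move(Live);
      for (std::size_t P : Nd.Preds)
        if (!Queued[P]) {
          Queued[P] = true;
          Worklist.push_back(P);
        }
    }
  }
}
```

On a three-node CFG where the middle node loops back to itself, the middle
node is visited twice: once to discover its live-in set and once more to
confirm that feeding that set around the back edge changes nothing.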

Optionally, after this liveness analysis completes, we can do live range
construction, in which we calculate the live range of each variable in terms of
instruction numbers. A live range is represented as a union of segments, where
the segment endpoints are instruction numbers. Instruction numbers are required
to be unique across the CFG, and monotonically increasing within a basic block.
As a union of segments, live ranges can contain "gaps" and are therefore
precise. Because of SSA properties, a variable's live range can start at most
once in a basic block, and can end at most once in a basic block. Liveness
analysis keeps track of which variable/instruction tuples begin live ranges and
end live ranges, and combined with live-in and live-out sets, we can efficiently
build up live ranges of all variables across all basic blocks.
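
A live range as a union of segments might look like the sketch below. The
class and its methods are illustrative only. In particular, treating each
segment as half-open, ``[Start, End)``, is our assumption for the example and
not necessarily the convention Subzero uses.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical live range: an ordered union of instruction-number segments.
class LiveRange {
public:
  // Segments must be added in increasing instruction-number order.
  void addSegment(int Start, int End) {
    assert(Start <= End && "segment endpoints out of order");
    assert((Segments.empty() || Segments.back().second <= Start) &&
           "segments must be appended in increasing order");
    if (!Segments.empty() && Segments.back().second == Start)
      Segments.back().second = End; // merge with the adjacent segment
    else
      Segments.emplace_back(Start, End);
  }

  // Two ranges overlap if any pair of their segments intersects. Both
  // segment lists are sorted, so one linear merge-style walk suffices.
  bool overlaps(const LiveRange &Other) const {
    auto I = Segments.begin();
    auto J = Other.Segments.begin();
    while (I != Segments.end() && J != Other.Segments.end()) {
      if (I->second <= J->first)
        ++I;
      else if (J->second <= I->first)
        ++J;
      else
        return true;
    }
    return false;
  }

  std::vector<std::pair<int, int>>::size_type numSegments() const {
    return Segments.size();
  }

private:
  std::vector<std::pair<int, int>> Segments; // each is [Start, End)
};
```

The benefit of the gaps shows up directly: a range made of ``[0, 10)`` and
``[20, 30)`` does not overlap a range ``[10, 20)``, whereas a single
start/end tuple ``[0, 30)`` would report a conflict.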

A lot of care is taken to try to make liveness analysis fast and efficient.
Because of the lowering strategy, the number of variables is generally
proportional to the number of instructions, leading to an O(N^2) complexity
algorithm if implemented naively. To improve things based on sparsity, we note
that most variables are "local" and referenced in at most one basic block (in
contrast to the "global" variables with multi-block usage), and therefore cannot
be live across basic blocks. Therefore, the live-in and live-out sets,
typically represented as bit vectors, can be limited to the set of global
variables, and the intra-block liveness bit vector can be compacted to hold the
global variables plus the local variables for that block.
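
The local/global split can be computed in one pass over the CFG. The sketch
below assumes each node is summarized as the list of variable indices its
instructions reference. ``findGlobalVars`` is a hypothetical helper, not a
Subzero function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns a flag per variable: true if the variable is referenced in more
// than one node ("global"), false if it is confined to a single node or is
// never referenced ("local").
inline std::vector<bool>
findGlobalVars(const std::vector<std::vector<int>> &RefsPerNode,
               std::size_t NumVars) {
  const int NoNode = -1;
  std::vector<int> FirstNode(NumVars, NoNode);
  std::vector<bool> IsGlobal(NumVars, false);
  for (std::size_t N = 0; N < RefsPerNode.size(); ++N) {
    for (int V : RefsPerNode[N]) {
      assert(V >= 0 && static_cast<std::size_t>(V) < NumVars);
      if (FirstNode[V] == NoNode)
        FirstNode[V] = static_cast<int>(N);
      else if (FirstNode[V] != static_cast<int>(N))
        IsGlobal[V] = true; // seen in a second, different node
    }
  }
  return IsGlobal;
}
```

Only the variables flagged as global need a bit in the per-node live-in and
live-out vectors. The locals of a node can be appended to that node's working
vector and discarded once the node has been processed.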

Register allocation
^^^^^^^^^^^^^^^^^^^

Subzero implements a simple linear-scan register allocator, based on the
allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan
Register Allocation in the Context of SSA Form and Register Constraints
<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. This allocator has
several nice features:

- Live ranges are represented as unions of segments, as described above, rather
  than a single start/end tuple.

- It allows pre-coloring of variables with specific physical registers.

- It applies equally well to pre-lowered Phi instructions.

The paper suggests an approach of aggressively coalescing variables across Phi
instructions (i.e., trying to force Phi source and destination variables to have
the same register assignment), but we reject that in favor of the more natural
preference mechanism described below.

We enhance the algorithm in the paper with the capability of automatic inference
of register preference, and with the capability of allowing overlapping live
ranges to safely share the same register in certain circumstances. If we are
considering register allocation for variable ``A``, and ``A`` has a single
defining instruction ``A=B+C``, then the preferred register for ``A``, if
available, would be the register assigned to ``B`` or ``C``, if any, provided
that ``B`` or ``C``'s live range does not overlap ``A``'s live range. In this
way we infer a good register preference for ``A``.
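
The preference rule can be sketched as follows. To keep the example short,
each live range is simplified to a single half-open interval rather than a
union of segments, and "if available" is modeled by a caller-supplied table of
free registers. ``Var``, ``overlaps``, and ``inferPreferredReg`` are
hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int NoReg = -1;

struct Var {
  int Reg;   // assigned physical register number, or NoReg
  int Start; // first instruction number of the live range (inclusive)
  int End;   // end of the live range (exclusive)
};

inline bool overlaps(const Var &X, const Var &Y) {
  return X.Start < Y.End && Y.Start < X.End;
}

// Given the sources of Dest's single defining instruction, returns the
// register of the first source that already has a register, does not
// overlap Dest's live range, and whose register is currently free.
inline int inferPreferredReg(const Var &Dest, const std::vector<Var> &Srcs,
                             const std::vector<bool> &RegIsFree) {
  for (const Var &Src : Srcs) {
    if (Src.Reg == NoReg || overlaps(Src, Dest))
      continue;
    assert(static_cast<std::size_t>(Src.Reg) < RegIsFree.size());
    if (RegIsFree[Src.Reg])
      return Src.Reg;
  }
  return NoReg;
}
```

For ``A=B+C`` where ``B`` holds register 3 and its range ends where ``A``'s
begins, while ``C`` stays live past the instruction, the function returns 3.
A ``NoReg`` result just means the allocator falls back to its normal choice.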
629
630We allow overlapping live ranges to get the same register in certain cases.
631Suppose a high-level instruction like::
632
633 A = unary_op(B)
634
635has been target-lowered like::
636
637 T.inf = B
638 A = unary_op(T.inf)
639
640Further, assume that ``B``'s live range continues beyond this instruction
641sequence, and that ``B`` has already been assigned some register. Normally, we
642might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
643turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have
644different registers. But ``T.inf`` is just a read-only copy of ``B`` that is
645guaranteed to be in a register, so in theory these overlapping live ranges could
646safely have the same register. Our implementation allows this overlap as long
647as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never
648modified within ``T.inf``'s live range.
649
650Subzero's register allocator can be run in 3 configurations.
651
652- Normal mode. All Variables are considered for register allocation. It
653 requires full liveness analysis and live range construction as a prerequisite.
654 This is used by ``O2`` lowering.
655
656- Minimal mode. Only infinite-weight or pre-colored Variables are considered.
657 All other Variables are stack-allocated. It does not require liveness
658 analysis; instead, it quickly scans the instructions and records first
659 definitions and last uses of all relevant Variables, using that to construct a
660 single-segment live range. Although this includes most of the Variables, the
661 live ranges are mostly simple, short, and rarely overlapping, which the
662 register allocator handles efficiently. This is used by ``Om1`` lowering.
663
664- Post-phi lowering mode. Advanced phi lowering is done after normal-mode
665 register allocation, and may result in new infinite-weight Variables that need
666 registers. One would like to just run something like minimal mode to assign
667 registers to the new Variables while respecting existing register allocation
668 decisions. However, it sometimes happens that there are no free registers.
669 In this case, some register needs to be forcibly spilled to the stack and
670 temporarily reassigned to the new Variable, and reloaded at the end of the new
671 Variable's live range. The register must be one that has no explicit
672 references during the Variable's live range. Since Subzero currently doesn't
673 track def/use chains (though it does record the CfgNode where a Variable is
674 defined), we just do a brute-force search across the CfgNode's instruction
675 list for the instruction numbers of interest. This situation happens very
676 rarely, so there's little point for now in improving its performance.
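
The minimal-mode scan described above can be sketched as follows. The instruction representation (a list of ``(dest, sources)`` pairs numbered by position) and the function name are assumptions for illustration only.

```python
def single_segment_ranges(insts, relevant):
    """Build one (first_def, last_use) interval per relevant variable.

    insts:    list of (dest, sources) pairs in program order
    relevant: variables to consider (e.g. infinite-weight or pre-colored)
    """
    ranges = {}
    for num, (dest, srcs) in enumerate(insts):
        for src in srcs:
            if src in ranges:
                ranges[src][1] = num      # extend to the latest use seen so far
        if dest in relevant and dest not in ranges:
            ranges[dest] = [num, num]     # record the first definition only
    return {var: tuple(r) for var, r in ranges.items()}
```

No liveness analysis is involved: a single linear pass over the instructions is enough, at the cost of treating each range as one contiguous segment.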
677
678The basic linear-scan algorithm may, as it proceeds, rescind an early register
allocation decision, leaving that Variable to be stack-allocated. In some of these
cases, it turns out that the Variable could have been given a different register
681without conflict, but by this time it's too late. The literature recognizes
682this situation and describes "second-chance bin-packing", which Subzero can do.
683We can rerun the register allocator in a mode that respects existing register
684allocation decisions, and sometimes it finds new non-conflicting opportunities.
685In fact, we can repeatedly run the register allocator until convergence.
686Unfortunately, in the current implementation, these subsequent register
687allocation passes end up being extremely expensive. This is because of the
688treatment of the "unhandled pre-colored" Variable set, which is normally very
689small but ends up being quite large on subsequent passes. Its performance can
690probably be made acceptable with a better choice of data structures, but for now
691this second-chance mechanism is disabled.
692
693Future work is to implement LLVM's `Greedy
694<http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
695register allocator as a replacement for the basic linear-scan algorithm, given
696LLVM's experience with its improvement in code quality. (The blog post claims
697that the Greedy allocator also improved maintainability because a lot of hacks
698could be removed, but Subzero is probably not yet to that level of hacks, and is
699less likely to see that particular benefit.)
700
701Basic phi lowering
702^^^^^^^^^^^^^^^^^^
703
704The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
705implements it). Consider this example::
706
  L1:
    ...
    br L3
  L2:
    ...
    br L3
  L3:
    A = phi [B, L1], [C, L2]
    X = phi [Y, L1], [Z, L2]
716
717For each destination of a phi instruction, we can create a temporary and insert
718the temporary's assignment at the end of the predecessor block::
719
  L1:
    ...
    A' = B
    X' = Y
    br L3
  L2:
    ...
    A' = C
    X' = Z
    br L3
  L3:
    A = A'
    X = X'
733
734This transformation is very simple and reliable. It can be done before target
735lowering and register allocation, and it easily avoids the classic lost-copy and
736related problems. ``Om1`` lowering uses this strategy.
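
As a rough illustration, the transformation can be written over a toy CFG. The tuple-based instruction format and the function name below are invented for the example, and each predecessor block is assumed to end in its branch.

```python
def lower_phis_basic(blocks):
    """blocks: dict of label -> list of instruction tuples.

    A phi is ("phi", dest, [(src, pred_label), ...]).
    Returns a new dict with phis replaced by plain assignments.
    """
    out = {}
    pending = {label: [] for label in blocks}  # copies to add to each predecessor
    for label, insts in blocks.items():
        phis = [i for i in insts if i[0] == "phi"]
        rest = [i for i in insts if i[0] != "phi"]
        for _, dest, incoming in phis:
            for src, pred in incoming:
                pending[pred].append(("assign", dest + "'", src))
        # The phi block reads the temporaries in place of the phis.
        out[label] = [("assign", dest, dest + "'") for _, dest, _ in phis] + rest
    for label, copies in pending.items():
        if copies:
            out[label][-1:-1] = copies  # insert just before the terminating branch
    return out
```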
737
738However, it has the disadvantage of initializing temporaries even for branches
739not taken, though that could be mitigated by splitting non-critical edges and
740putting assignments in the edge-split nodes. Another problem is that without
741extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a
742specific ordering even though phi semantics are that the assignments are
743parallel or unordered. This sometimes imposes false live range overlaps and
744leads to poorer register allocation.
745
746Advanced phi lowering
747^^^^^^^^^^^^^^^^^^^^^
748
749``O2`` lowering defers phi lowering until after register allocation to avoid the
750problem of false live range overlaps. It works as follows. We split each
751incoming edge and move the (parallel) phi assignments into the split nodes. We
752linearize each set of assignments by finding a safe, topological ordering of the
753assignments, respecting register assignments as well. For example::
754
  A = B
  X = Y
757
758Normally these assignments could be executed in either order, but if ``B`` and
759``X`` are assigned the same physical register, we would want to use the above
760ordering. Dependency cycles are broken by introducing a temporary. For
761example::
762
  A = B
  B = A
765
766Here, a temporary breaks the cycle::
767
  t = A
  A = B
  B = t
771
772Finally, we use the existing target lowering to lower the assignments in this
773basic block, and once that is done for all basic blocks, we run the post-phi
774variant of register allocation on the edge-split basic blocks.
775
776When computing a topological order, we try to first schedule assignments whose
777source has a physical register, and last schedule assignments whose destination
778has a physical register. This helps reduce register pressure.
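
A minimal sketch of the linearization step, assuming each parallel assignment is a ``(dest, src)`` pair over distinct destination locations (registers or stack slots) and that one scratch location ``t`` is available. It omits the register-pressure ordering heuristic and is not Subzero's actual implementation.

```python
def sequentialize(copies, temp="t"):
    """Order parallel copies so that no source is overwritten before it is read."""
    pending = [(d, s) for d, s in copies if d != s]
    out = []
    while pending:
        for i, (d, s) in enumerate(pending):
            # Safe to emit if no other pending copy still reads d.
            if all(d != s2 for j, (_, s2) in enumerate(pending) if j != i):
                out.append((d, s))
                del pending[i]
                break
        else:
            # Every destination is still needed as a source: a dependency cycle.
            d, s = pending[0]
            out.append((temp, d))  # save d's old value in the temporary
            pending = [(d2, temp if s2 == d else s2) for d2, s2 in pending]
    return out
```

Running it on the cycle ``A = B`` / ``B = A`` yields ``t = A``, ``A = B``, ``B = t``. If ``B`` and ``X`` share a register, the copies are expressed in terms of that register and ``A = B`` is emitted before ``X = Y``.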
779
780X86 address mode inference
781^^^^^^^^^^^^^^^^^^^^^^^^^^
782
783We try to take advantage of the x86 addressing mode that includes a base
784register, an index register, an index register scale amount, and an immediate
785offset. We do this through simple pattern matching. Starting with a load or
786store instruction where the address is a variable, we initialize the base
787register to that variable, and look up the instruction where that variable is
788defined. If that is an add instruction of two variables and the index register
789hasn't been set, we replace the base and index register with those two
790variables. If instead it is an add instruction of a variable and a constant, we
791replace the base register with the variable and add the constant to the
792immediate offset.
793
794There are several more patterns that can be matched. This pattern matching
795continues on the load or store instruction until no more matches are found.
796Because a program typically has few load and store instructions (not to be
797confused with instructions that manipulate stack variables), this address mode
798inference pass is fast.
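
The two patterns described above can be sketched as a small fixed-point loop. The ``defs`` map and the function name are hypothetical, constants are assumed to appear as the right-hand operand, and the index scale is left out for brevity.

```python
def infer_address_mode(addr_var, defs):
    """Return (base, index, offset) for a load/store whose address is addr_var.

    defs: dict of variable -> ("add", lhs, rhs); operands are variable
          names (str) or integer constants.
    """
    base, index, offset = addr_var, None, 0
    changed = True
    while changed:
        changed = False
        d = defs.get(base)
        if d is None or d[0] != "add":
            break
        _, lhs, rhs = d
        if isinstance(lhs, str) and isinstance(rhs, str):
            if index is None:  # variable + variable: base and index
                base, index = lhs, rhs
                changed = True
        elif isinstance(lhs, str) and isinstance(rhs, int):
            base, offset = lhs, offset + rhs  # variable + constant: fold offset
            changed = True
    return base, index, offset
```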
799
800X86 read-modify-write inference
801^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
802
803A reasonably common bitcode pattern is a non-atomic update of a memory
804location::
805
  x = load addr
  y = add x, 1
  store y, addr
809
810On x86, with good register allocation, the Subzero passes described above
generate code that is at best of this quality::
812
  mov [%ebx], %eax
  add $1, %eax
  mov %eax, [%ebx]
816
817However, x86 allows for this kind of code::
818
  add $1, [%ebx]
820
821which requires fewer instructions, but perhaps more importantly, requires fewer
822physical registers.
823
824It's also important to note that this transformation only makes sense if the
825store instruction ends ``x``'s live range.
826
827Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
828opportunities via simple pattern matching. The only problem is that it is run
829before liveness analysis, which is needed to determine whether ``x``'s live
830range ends after the RMW. Since liveness analysis is one of the most expensive
831passes, it's not attractive to run it an extra time just for RMW analysis.
832Instead, we essentially generate both the RMW and the non-RMW versions, and then
during lowering, the RMW version deletes itself if it finds ``x`` still live.
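
A simplified sketch of the pattern match and the deferred liveness check. The tuple-based IR, the restriction to three consecutive instructions, and both function names are assumptions made for this example.

```python
ARITH = {"add", "sub", "and", "or", "xor"}

def find_rmw(insts):
    """Find load/op/store triples that update one memory location.

    Instructions: ("load", dest, addr), (op, dest, lhs, rhs), ("store", value, addr).
    Returns a list of (index, op, operand, addr) candidates.
    """
    found = []
    for i in range(len(insts) - 2):
        ld, op, st = insts[i:i + 3]
        if (ld[0] == "load" and op[0] in ARITH and st[0] == "store"
                and op[2] == ld[1]      # the op consumes the loaded value
                and st[1] == op[1]      # the store writes the op's result
                and st[2] == ld[2]):    # both access the same address
            found.append((i, op[0], op[3], ld[2]))
    return found

def keep_rmw(candidate, x_live_after_store):
    # Deferred decision made during lowering, once liveness is known:
    # the RMW form survives only if the loaded value is dead afterwards.
    return None if x_live_after_store else candidate
```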
834
835X86 compare-branch inference
836^^^^^^^^^^^^^^^^^^^^^^^^^^^^
837
838In the LLVM instruction set, the compare/branch pattern works like this::
839
  cond = icmp eq a, b
  br cond, target
842
843The result of the icmp instruction is a single bit, and a conditional branch
844tests that bit. By contrast, most target architectures use this pattern::
845
  cmp a, b         // implicitly sets various bits of FLAGS register
  br eq, target    // branch on a particular FLAGS bit
848
849A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
850``cond`` and conditionally branches. Subzero has a pass that identifies
851boolean-based operations like this and folds them into a single
852compare/branch-like operation. It is set up for more than just cmp/br though.
853Boolean producers include icmp (integer compare), fcmp (floating-point compare),
854and trunc (integer truncation when the destination has bool type). Boolean
855consumers include branch, select (the ternary operator from the C language), and
856sign-extend and zero-extend when the source has bool type.
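
The simplest case, folding an ``icmp`` into the branch that is its only consumer, might look like the sketch below. The toy IR contains only ``icmp`` and ``br`` tuples, the names are hypothetical, and the sketch ignores the requirement that the compared operands are not modified between the producer and the consumer.

```python
def fold_compare_branch(insts):
    """Fold ("icmp", dest, pred, a, b) into a following ("br", cond, target)."""
    producers = {inst[1]: inst for inst in insts if inst[0] == "icmp"}
    uses = {}
    for inst in insts:
        operands = inst[3:] if inst[0] == "icmp" else inst[1:2]
        for op in operands:
            uses[op] = uses.get(op, 0) + 1
    # Only fold when the branch is the sole consumer of the boolean.
    foldable = {inst[1] for inst in insts
                if inst[0] == "br" and inst[1] in producers and uses[inst[1]] == 1}
    out = []
    for inst in insts:
        if inst[0] == "icmp" and inst[1] in foldable:
            continue  # the producer is absorbed into the branch
        if inst[0] == "br" and inst[1] in foldable:
            _, _, pred, a, b = producers[inst[1]]
            out.append(("cmp_br", pred, a, b, inst[2]))
        else:
            out.append(inst)
    return out
```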
857
858Sandboxing
859^^^^^^^^^^
860
861Native Client's sandbox model uses software fault isolation (SFI) to provide
862safety when running untrusted code in a browser or other environment. Subzero
863implements Native Client's `sandboxing
864<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
865to enable Subzero-translated executables to be run inside Chrome. Subzero also
866provides a fairly simple framework for investigating alternative sandbox models
867or other restrictions on the sandbox model.
868
869Sandboxing in Subzero is not actually implemented as a separate pass, but is
870integrated into lowering and assembly.
871
872- Indirect branches, including the ret instruction, are masked to a bundle
873 boundary and bundle-locked.
874
875- Call instructions are aligned to the end of the bundle so that the return
876 address is bundle-aligned.
877
878- Indirect branch targets, including function entry and targets in a switch
879 statement jump table, are bundle-aligned.
880
881- The intrinsic for reading the thread pointer is inlined appropriately.
882
883- For x86-64, non-stack memory accesses are with respect to the reserved sandbox
884 base register. We reduce the aggressiveness of address mode inference to
885 leave room for the sandbox base register during lowering. There are no memory
886 sandboxing changes for x86-32.
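
The bundle arithmetic behind the first two items can be illustrated as follows, assuming Native Client's 32-byte x86 bundles. The helper names are made up for this example.

```python
BUNDLE_SIZE = 32  # bytes; assumed here to be Native Client's x86 bundle size

def mask_to_bundle(addr):
    """Clear the low bits so an indirect branch target lands on a bundle boundary."""
    return addr & ~(BUNDLE_SIZE - 1)

def padding_before_call(offset, call_len):
    """Nop bytes to insert at `offset` so that a call of `call_len` bytes ends
    exactly at a bundle boundary, making the return address bundle-aligned."""
    return -(offset + call_len) % BUNDLE_SIZE
```

For example, a 5-byte call that would start at offset 10 needs 17 bytes of padding so that it occupies bytes 27 through 31 of the bundle.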
887
888Code emission
889-------------
890
Subzero's integrated assembler is derived from Dart's `assembler code
<https://github.com/dart-lang/sdk/tree/master/runtime/vm>`_. There is a pass
893that iterates through the low-level ICE instructions and invokes the relevant
894assembler functions. Placeholders are added for later fixup of branch target
895offsets. (Backward branches use short offsets if possible; forward branches
896generally use long offsets unless it is an intra-block branch of "known" short
897length.) The assembler emits into a staging buffer. Once emission into the
898staging buffer for a function is complete, the data is emitted to the output
899file as an ELF object file, and metadata such as relocations, symbol table, and
900string table, are accumulated for emission at the end. Global data initializers
are emitted similarly. A key point is that the staging buffer can now be
deallocated, and only a minimum of data needs to be held until the end.
903
904As a debugging alternative, Subzero can emit textual assembly code which can
905then be run through an external assembler. This is of course super slow, but
906quite valuable when bringing up a new target.
907
908As another debugging option, the staging buffer can be emitted as textual
909assembly, primarily in the form of ".byte" lines. This allows the assembler to
be tested separately from the ELF-related code.
911
912Memory management
913-----------------
914
915Where possible, we allocate from a ``CfgLocalAllocator`` which derives from
916LLVM's ``BumpPtrAllocator``. This is an arena-style allocator where objects
917allocated from the arena are never actually freed; instead, when the CFG
918translation completes and the CFG is deleted, the entire arena memory is
919reclaimed at once. This style of allocation works well in an environment like a
920compiler where there are distinct phases with only easily-identifiable objects
921living across phases. It frees the developer from having to manage object
922deletion, and it amortizes deletion costs across just a single arena deletion at
923the end of the phase. Furthermore, it helps scalability by allocating entirely
924from thread-local memory pools, and minimizing global locking of the heap.
925
926Instructions are probably the most heavily allocated complex class in Subzero.
927We represent an instruction list as an intrusive doubly linked list, allocate
928all instructions from the ``CfgLocalAllocator``, and we make sure each
929instruction subclass is basically `POD
930<http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a
931trivial destructor. This way, when the CFG is finished, we don't need to
individually deallocate every instruction. We do the same for Variables, which
are probably the second most heavily allocated complex class.
934
935There are some situations where passes need to use some `STL container class
936<http://en.cppreference.com/w/cpp/container>`_. Subzero has a way of using the
937``CfgLocalAllocator`` as the container allocator if this is needed.
938
939Multithreaded translation
940-------------------------
941
942Subzero is designed to be able to translate functions in parallel. With the
943``-threads=N`` command-line option, there is a 3-stage producer-consumer
944pipeline:
945
946- A single thread parses the ``.pexe`` file and produces a sequence of work
947 units. A work unit can be either a fully constructed CFG, or a set of global
948 initializers. The work unit includes its sequence number denoting its parse
949 order. Each work unit is added to the translation queue.
950
951- There are N translation threads that draw work units from the translation
952 queue and lower them into assembler buffers. Each assembler buffer is added
953 to the emitter queue, tagged with its sequence number. The CFG and its
954 ``CfgLocalAllocator`` are disposed of at this point.
955
956- A single thread draws assembler buffers from the emitter queue and appends to
957 the output file. It uses the sequence numbers to reintegrate the assembler
958 buffers according to the original parse order, such that output order is
959 always deterministic.
960
961This means that with ``-threads=N``, there are actually ``N+1`` spawned threads
962for a total of ``N+2`` execution threads, taking the parser and emitter threads
963into account. For the special case of ``N=0``, execution is entirely sequential
964-- the same thread parses, translates, and emits, one function at a time. This
965is useful for performance measurements.
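
The pipeline structure can be mimicked with Python's standard ``queue`` and ``threading`` modules. This is a sketch of the shape of the design only: ``translate`` stands in for lowering a work unit, and the function name and sentinel protocol are assumptions.

```python
import queue
import threading

def translate_pipeline(work_units, n_threads, translate):
    if n_threads == 0:
        # Entirely sequential: parse, translate, and emit on one thread.
        return [translate(unit) for unit in work_units]

    translation_q, emitter_q = queue.Queue(), queue.Queue()
    output = []

    def translator():
        while (item := translation_q.get()) is not None:
            seq, unit = item
            emitter_q.put((seq, translate(unit)))
        emitter_q.put(None)  # tell the emitter this translator is done

    def emitter():
        held, next_seq, finished = {}, 0, 0
        while finished < n_threads:
            item = emitter_q.get()
            if item is None:
                finished += 1
                continue
            held[item[0]] = item[1]
            while next_seq in held:  # reintegrate in original parse order
                output.append(held.pop(next_seq))
                next_seq += 1

    threads = [threading.Thread(target=translator) for _ in range(n_threads)]
    threads.append(threading.Thread(target=emitter))  # N+1 spawned threads
    for t in threads:
        t.start()
    for seq, unit in enumerate(work_units):  # the calling thread acts as parser
        translation_q.put((seq, unit))
    for _ in range(n_threads):
        translation_q.put(None)
    for t in threads:
        t.join()
    return output
```

Because the emitter holds back any buffer that arrives early, the output order matches the parse order regardless of how the translation threads interleave.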
966
967Ideally, we would like to get near-linear scalability as the number of
968translation threads increases. We expect that ``-threads=1`` should be slightly
969faster than ``-threads=0`` as the small amount of time spent parsing and
emitting is done largely in parallel with translation. With perfect
scalability, we would see ``-threads=N`` translating ``N`` times as fast as
972``-threads=1``, up until the point where parsing or emitting becomes the
973bottleneck, or ``N+2`` exceeds the number of CPU cores. In reality, memory
974performance would become a bottleneck and efficiency might peak at, say, 75%.
975
Currently, parsing takes about 11% of total sequential time. If translation
ever becomes so fast and scalable that parsing becomes a bottleneck, it should
be possible to make parsing multithreaded as well.
979
980Internally, all shared, mutable data is held in the GlobalContext object, and
981access to each field is guarded by a mutex.
982
983Security
984--------
985
986Subzero includes a number of security features in the generated code, as well as
987in the Subzero translator itself, which run on top of the existing Native Client
988sandbox as well as Chrome's OS-level sandbox.
989
990Sandboxed translator
991^^^^^^^^^^^^^^^^^^^^
992
993When running inside the browser, the Subzero translator executes as sandboxed,
994untrusted code that is initially checked by the validator, just like the
995LLVM-based ``pnacl-llc`` translator. As such, the Subzero binary should be no
996more or less secure than the translator it replaces, from the point of view of
997the Chrome sandbox. That said, Subzero is much smaller than ``pnacl-llc`` and
998was designed from the start with security in mind, so one expects fewer attacker
999opportunities here.
1000
1001Code diversification
1002^^^^^^^^^^^^^^^^^^^^
1003
1004`Return-oriented programming
1005<https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a
1006now-common technique for starting with e.g. a known buffer overflow situation
1007and launching it into a deeper exploit. The attacker scans the executable
1008looking for ROP gadgets, which are short sequences of code that happen to load
1009known values into known registers and then return. An attacker who manages to
1010overwrite parts of the stack can overwrite it with carefully chosen return
1011addresses such that certain ROP gadgets are effectively chained together to set
1012up the register state as desired, finally returning to some code that manages to
1013do something nasty based on those register values.
1014
1015If there is a popular ``.pexe`` with a large install base, the attacker could
1016run Subzero on it and scan the executable for suitable ROP gadgets to use as
1017part of a potential exploit. Note that if the trusted validator is working
1018correctly, these ROP gadgets are limited to starting at a bundle boundary and
1019cannot use the trick of finding a gadget that happens to begin inside another
1020instruction. All the same, gadgets with these constraints still exist and the
1021attacker has access to them. This is the attack model we focus most on --
1022protecting the user against misuse of a "trusted" developer's application, as
1023opposed to mischief from a malicious ``.pexe`` file.
1024
1025Subzero can mitigate these attacks to some degree through code diversification.
1026Specifically, we can apply some randomness to the code generation that makes ROP
1027gadgets less predictable. This randomness can have some compile-time cost, and
1028it can affect the code quality; and some diversifications may be more effective
1029than others. A more detailed treatment of hardening techniques may be found in
1030the Matasano report "`Attacking Clientside JIT Compilers
1031<https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_".
1032
1033To evaluate diversification effectiveness, we use a third-party ROP gadget
1034finder and limit its results to bundle-aligned addresses. For a given
1035diversification technique, we run it with a number of different random seeds,
1036find ROP gadgets for each version, and determine how persistent each ROP gadget
1037is across the different versions. A gadget is persistent if the same gadget is
1038found at the same code address. The best diversifications are ones with low
1039gadget persistence rates.
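
One way to turn this into a number is sketched below. The metric (the fraction of the first version's gadgets found at the same address in every other version) is an assumption chosen for illustration; the text above does not specify the exact formula used.

```python
def persistence_rate(versions):
    """versions: list of sets of (address, gadget_bytes), one set per random seed."""
    baseline = versions[0]
    if not baseline:
        return 0.0
    # A gadget persists if the same bytes appear at the same address everywhere.
    persistent = set.intersection(*versions)
    return len(persistent) / len(baseline)
```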
1040
1041Subzero implements 7 different diversification techniques. Below is a
1042discussion of each technique, its effectiveness, and its cost. The discussions
1043of cost and effectiveness are for a single diversification technique; the
1044translation-time costs for multiple techniques are additive, but the effects of
1045multiple techniques on code quality and effectiveness are not yet known.
1046
1047In Subzero's implementation, each randomization is "repeatable" in a sense.
1048Each pass that includes a randomization option gets its own private instance of
1049a random number generator (RNG). The RNG is seeded with a combination of a
1050global seed, the pass ID, and the function's sequence number. The global seed
1051is designed to be different across runs (perhaps based on the current time), but
1052for debugging, the global seed can be set to a specific value and the results
1053will be repeatable.
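
A sketch of the seeding scheme using Python's ``random`` module. Combining the three values into a string seed is an illustrative choice, not necessarily how Subzero mixes them.

```python
import random

def make_pass_rng(global_seed, pass_id, function_seq):
    """Private RNG for one pass on one function; identical inputs repeat exactly."""
    return random.Random(f"{global_seed}:{pass_id}:{function_seq}")
```

Two generators built from the same global seed, pass ID, and sequence number produce the same stream, which is what makes a fixed global seed useful for debugging.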
1054
1055Subzero-generated code is subject to diversification once per translation, and
1056then Chrome caches the diversified binary for subsequent executions. An
1057attacker may attempt to run the binary multiple times hoping for
1058higher-probability combinations of ROP gadgets. When the attacker guesses
1059wrong, a likely outcome is an application crash. Chrome throttles creation of
1060crashy processes which reduces the likelihood of the attacker eventually gaining
1061a foothold.
1062
1063Constant blinding
1064~~~~~~~~~~~~~~~~~
1065
1066Here, we prevent attackers from controlling large immediates in the text
1067(executable) section. A random cookie is generated for each function, and if
1068the constant exceeds a specified threshold, the constant is obfuscated with the
1069cookie and equivalent code is generated. For example, instead of this x86
1070instruction::
1071
  mov $0x11223344, <%Reg/Mem>
1073
1074the following code might be generated::
1075
  mov $(0x11223344+Cookie), %temp
  lea -Cookie(%temp), %temp
  mov %temp, <%Reg/Mem>
1079
1080The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to
1081prevent unintended effects on the flags register.
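
The arithmetic behind the blinded sequence is ordinary 32-bit wraparound addition, as this small sketch shows. The function names are invented for the example.

```python
MASK32 = 0xFFFFFFFF

def blind(constant, cookie):
    """The immediate that appears in the text section instead of `constant`."""
    return (constant + cookie) & MASK32

def unblind(blinded, cookie):
    """The value the lea instruction recovers at run time."""
    return (blinded - cookie) & MASK32
```

With the constant ``0x11223344`` and a cookie of ``0xDEADBEEF``, the emitted immediate is ``0xEFCFF233``, and subtracting the cookie restores the original value.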
1082
1083This transformation has almost no effect on translation time, and about 1%
1084impact on code quality, depending on the threshold chosen. It does little to
1085reduce gadget persistence, but it does remove a lot of potential opportunities
1086to construct intra-instruction ROP gadgets (which an attacker could use only if
1087a validator bug were discovered, since the Native Client sandbox and associated
1088validator force returns and other indirect branches to be to bundle-aligned
1089addresses).
1090
1091Constant pooling
1092~~~~~~~~~~~~~~~~
1093
1094This is similar to constant blinding, in that large immediates are removed from
1095the text section. In this case, each unique constant above the threshold is
1096stored in a read-only data section and the constant is accessed via a memory
1097load. For the above example, the following code might be generated::
1098
  mov $Label$1, %temp
  mov %temp, <%Reg/Mem>
1101
1102This has a similarly small impact on translation time and ROP gadget
1103persistence, and a smaller (better) impact on code quality. This is because it
1104uses fewer instructions, and in some cases good register allocation leads to no
1105increase in instruction count. Note that this still gives an attacker some
1106limited amount of control over some text section values, unless we randomize the
1107constant pool layout.
1108
1109Static data reordering
1110~~~~~~~~~~~~~~~~~~~~~~
1111
1112This transformation limits the attacker's ability to control bits in global data
1113address references. It simply permutes the order in memory of global variables
1114and internal constant pool entries. For the constant pool, we only permute
1115within a type (i.e., emit a randomized list of ints, followed by a randomized
1116list of floats, etc.) to maintain good packing in the face of alignment
1117constraints.
1118
1119As might be expected, this has no impact on code quality, translation time, or
1120ROP gadget persistence (though as above, it limits opportunities for
1121intra-instruction ROP gadgets with a broken validator).
1122
1123Basic block reordering
1124~~~~~~~~~~~~~~~~~~~~~~
1125
1126Here, we randomize the order of basic blocks within a function, with the
1127constraint that we still want to maintain a topological order as much as
1128possible, to avoid making the code too branchy.
1129
1130This has no impact on code quality, and about 1% impact on translation time, due
1131to a separate pass to recompute layout. It ends up having a huge effect on ROP
1132gadget persistence, tied for best with nop insertion, reducing ROP gadget
1133persistence to less than 5%.
1134
1135Function reordering
1136~~~~~~~~~~~~~~~~~~~
1137
Here, we permute the order in which functions are emitted, primarily to shift ROP
1139gadgets around to less predictable locations. It may also change call address
1140offsets in case the attacker was trying to control that offset in the code.
1141
1142To control latency and memory footprint, we don't arbitrarily permute functions.
1143Instead, for some relatively small value of N, we queue up N assembler buffers,
1144and then emit the N functions in random order, and repeat until all functions
1145are emitted.
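
The windowed shuffle can be sketched as follows. The function name is hypothetical and the RNG is passed in so that results are repeatable for a given seed.

```python
import random

def emit_order(functions, window, rng):
    """Shuffle functions within consecutive windows of `window` entries."""
    out = []
    for start in range(0, len(functions), window):
        chunk = list(functions[start:start + window])
        rng.shuffle(chunk)  # only the queued-up buffers are permuted
        out.extend(chunk)
    return out
```

A function never moves outside its window, so at most ``window`` assembler buffers need to be held at any time.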
1146
1147Function reordering has no impact on translation time or code quality.
1148Measurements indicate that it reduces ROP gadget persistence to about 15%.
1149
1150Nop insertion
1151~~~~~~~~~~~~~
1152
1153This diversification randomly adds a nop instruction after each regular
1154instruction, with some probability. Nop instructions of different lengths may
1155be selected. Nop instructions are never added inside a bundle_lock region.
1156Note that when sandboxing is enabled, nop instructions are already being added
1157for bundle alignment, so the diversification nop instructions may simply be
1158taking the place of alignment nop instructions, though distributed differently
1159through the bundle.
1160
In Subzero's current implementation, nop insertion adds 3-5% to the
1162translation time, but this is probably because it is implemented as a separate
1163pass that adds actual nop instructions to the IR. The overhead would probably
1164be a lot less if it were integrated into the assembler pass. The code quality
1165is also reduced by 3-5%, making nop insertion the most expensive of the
1166diversification techniques.
1167
Nop insertion is very effective in reducing ROP gadget persistence, at the same
level as basic block reordering (less than 5%). But given nop insertion's
impact on translation time and code quality, one would most likely prefer to use
basic block reordering instead (though the combined effects of the different
diversification techniques have not yet been studied).
1173
1174Register allocation randomization
1175~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1176
1177In this diversification, the register allocator tries to make different but
1178mostly functionally equivalent choices, while maintaining stable code quality.
1179
1180A naive approach would be the following. Whenever the allocator has more than
1181one choice for assigning a register, choose randomly among those options. And
1182whenever there are no registers available and there is a tie for the
1183lowest-weight variable, randomly select one of the lowest-weight variables to
1184evict. Because of the one-pass nature of the linear-scan algorithm, this
1185randomization strategy can have a large impact on which variables are ultimately
1186assigned registers, with a corresponding large impact on code quality.
1187
1188Instead, we choose an approach that tries to keep code quality stable regardless
1189of the random seed. We partition the set of physical registers into equivalence
1190classes. If a register is pre-colored in the function (i.e., referenced
1191explicitly by name), it forms its own equivalence class. The remaining
1192registers are partitioned according to their combination of attributes such as
integer versus floating-point, 8-bit versus 32-bit, caller-saved versus
1194callee-saved, etc. Each equivalence class is randomly permuted, and the
1195complete permutation is applied to the final register assignments.
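
A sketch of the equivalence-class permutation. The attribute tuples and the function name are illustrative assumptions; the returned mapping would be applied to the allocator's final register assignments.

```python
import random

def randomize_registers(registers, attrs, precolored, rng):
    """Return a dict mapping each register to its randomized replacement.

    attrs:      dict of register -> hashable attribute combination
    precolored: registers referenced explicitly by name in the function
    """
    classes = {}
    for reg in registers:
        # A pre-colored register forms its own class and so maps to itself.
        key = ("precolored", reg) if reg in precolored else ("class", attrs[reg])
        classes.setdefault(key, []).append(reg)
    mapping = {}
    for members in classes.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        mapping.update(zip(members, shuffled))
    return mapping
```

Since registers are only exchanged with others that have identical attributes, the set of Variables that receive registers is unchanged, which is why code quality stays stable across seeds.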
1196
1197Register randomization reduces ROP gadget persistence to about 10% on average,
1198though there tends to be fairly high variance across functions and applications.
1199This probably has to do with the set of restrictions in the x86-32 instruction
1200set and ABI, such as few general-purpose registers, ``%eax`` used for return
1201values, ``%edx`` used for division, ``%cl`` used for shifting, etc. As
1202intended, register randomization has no impact on code quality, and a slight
1203(0.5%) impact on translation time due to an extra scan over the variables to
1204identify pre-colored registers.
1205
1206Fuzzing
1207^^^^^^^
1208
1209We have started fuzz-testing the ``.pexe`` files input to Subzero, using a
1210combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer
1211<http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling. The purpose is to
1212find and fix cases where Subzero crashes or otherwise ungracefully fails on
1213unexpected inputs, and to do so automatically over a large range of unexpected
1214inputs. By fixing bugs that arise from fuzz testing, we reduce the possibility
1215of an attacker exploiting these bugs.
1216
1217Most of the problems found so far are ones most appropriately handled in the
1218parser. However, there have been a couple that have identified problems in the
1219lowering, or otherwise inappropriately triggered assertion failures and fatal
1220errors. We continue to dig into this area.
1221
1222Future security work
1223^^^^^^^^^^^^^^^^^^^^
1224
1225Subzero is well-positioned to explore other future security enhancements, e.g.:
1226
1227- Tightening the Native Client sandbox. ABI changes, such as the previous work
1228 on `hiding the sandbox base address
1229 <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_
1230 in x86-64, are easy to experiment with in Subzero.
1231
1232- Making the executable code section read-only. This would prevent a PNaCl
1233 application from inspecting its own binary and trying to find ROP gadgets even
1234 after code diversification has been performed. It may still be susceptible to
  `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks, but
  security is still improved overall.
1237
1238- Instruction selection diversification. It may be possible to lower a given
1239 instruction in several largely equivalent ways, which gives more opportunities
1240 for code randomization.
1241
1242Chrome integration
1243------------------
1244
1245Currently Subzero is available in Chrome for the x86-32 architecture, but under
1246a flag. When the flag is enabled, Subzero is used when the `manifest file
1247<https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_
1248linking to the ``.pexe`` file specifies the ``O0`` optimization level.
1249
1250The next step is to remove the flag, i.e. invoke Subzero as the only translator
1251for ``O0``-specified manifest files.
1252
1253Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
1254case Subzero could be used for all PNaCl translation.

Command line options
--------------------

Subzero has a number of command-line options for debugging and diagnostics.
Among the more interesting are the following.

- Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other
  diagnostic output, with various levels of detail after each pass. Instruction
  numbers can be printed or suppressed. Deleted instructions can be printed or
  suppressed (they are retained in the instruction list, as discussed earlier,
  because they can help explain how lower-level instructions originated).
  Liveness information can be printed when available. Details of register
  allocation can be printed as register allocator decisions are made. And more.

- Running Subzero with any level of verbosity produces an enormous amount of
  output. When debugging a single function, the ``-verbose-focus`` flag
  suppresses verbose output except for the specified function.

- Subzero has a ``-timing`` option that prints a breakdown of pass-level timing
  at exit. Timing markers can be placed in the Subzero source code to demarcate
  logical operations or passes of interest. Basic timing information plus
  call-stack type timing information is printed at the end.

- Along with ``-timing``, the user can instead get a report on the overall
  translation time for each function, to help focus on timing outliers. Also,
  ``-timing-focus`` limits the ``-timing`` reporting to a single function,
  instead of aggregating pass timing across all functions.

- The ``-szstats`` option reports various statistics on each function, such as
  stack frame size, static instruction count, etc. It may be helpful to track
  these stats over time as Subzero is improved, as an approximate measure of
  code quality.

- The ``-asm-verbose`` flag, in conjunction with emitting textual assembly
  output, annotates the assembly output with register-focused liveness
  information. In particular, each basic block is annotated with which
  registers are live-in and live-out, and each instruction is annotated with
  which registers' and stack locations' live ranges end at that instruction.
  This is useful when studying the generated code to find opportunities for
  code quality improvements.

Testing and debugging
---------------------

LLVM lit tests
^^^^^^^^^^^^^^

For basic testing, Subzero uses LLVM's `lit
<http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests. We
have a suite of hundreds of small functions where we test for particular
assembly code patterns across different target architectures.

Cross tests
^^^^^^^^^^^

Unfortunately, the lit tests don't do a great job of precisely testing the
correctness of the output. Much better are the cross tests, which are execution
tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide
variety of interesting inputs. Each cross test consists of a set of C, C++,
and/or low-level bitcode files. The C and C++ source files are compiled down to
bitcode. The bitcode files are translated by ``pnacl-llc`` and also by Subzero.
Subzero mangles global symbol names with a special prefix to avoid duplicate
symbol errors. A driver program invokes both versions on a large set of
interesting inputs, and reports when the Subzero and ``pnacl-llc`` results
differ. Cross tests turn out to be an excellent way of testing the basic
lowering patterns, but they are less useful for testing more global things like
liveness analysis and register allocation.
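
The cross-test idea can be sketched in a few lines of Python. This is only an
illustration under simplifying assumptions: the real cross tests link natively
translated object files into a driver program, whereas here two plain Python
callables stand in for the ``pnacl-llc`` and Subzero translations. The names
``cross_test``, ``reference_shl``, and ``buggy_shl`` are hypothetical and do not
come from the Subzero sources.

```python
from typing import Callable, Iterable, List, Tuple


def cross_test(reference: Callable[[int], int],
               candidate: Callable[[int], int],
               inputs: Iterable[int]) -> List[Tuple[int, int, int]]:
    """Run both versions on every input and collect the disagreements."""
    mismatches = []
    for value in inputs:
        expected = reference(value)
        actual = candidate(value)
        if expected != actual:
            mismatches.append((value, expected, actual))
    return mismatches


def reference_shl(x: int) -> int:
    # Stand-in for the trusted translation: a 32-bit left shift by 3.
    return (x << 3) & 0xFFFFFFFF


def buggy_shl(x: int) -> int:
    # Stand-in for a mis-lowered translation that forgets the 32-bit wraparound.
    return x << 3
```

Calling ``cross_test(reference_shl, buggy_shl, [0, 1, 0x7FFFFFFF])`` reports
only the last input, since the smaller values never overflow 32 bits. This is
why the inputs need to be chosen to exercise interesting corner cases.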

Bisection debugging
^^^^^^^^^^^^^^^^^^^

Sometimes with a new application, Subzero will end up producing incorrect code
that either crashes at runtime or otherwise produces the wrong results. When
this happens, we need to narrow it down to a single function (or small set of
functions) that yields incorrect behavior. For this, we have a bisection
debugging framework. Here, we initially translate the entire application once
with Subzero and once with ``pnacl-llc``. We then use ``objcopy`` to
selectively weaken symbols based on a whitelist or blacklist provided on the
command line. The two object files can then be linked together without link
errors, with the desired version of each function "winning". Then the binary is
tested, and bisection proceeds based on whether the binary produces correct
output.

When the bisection completes, we are left with a minimal set of
Subzero-translated functions that cause the failure. Usually it is a single
function, though sometimes it might require a combination of several functions
to cause a failure; this may be due to an incorrect call ABI, for example.
However, Murphy's Law implies that the single failing function is enormous and
impractical to debug. In that case, we can restart the bisection, explicitly
blacklisting the enormous function, and try to find another candidate to debug.
(Future work is to automate this to find all minimal sets of functions, so that
debugging can focus on the simplest example.)
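
The narrowing step can be sketched as follows. This is not the algorithm of the
actual bisection framework, just a simple chunk-removal reduction written for
illustration. The helper name ``bisect_failing_functions`` and the ``fails``
callback are hypothetical; ``fails(subset)`` is assumed to link a binary in
which only ``subset`` comes from Subzero and everything else from
``pnacl-llc``, run it, and return ``True`` if the binary misbehaves.

```python
from typing import Callable, List, Sequence


def bisect_failing_functions(functions: Sequence[str],
                             fails: Callable[[List[str]], bool]) -> List[str]:
    """Shrink `functions` to a smaller set for which `fails` is still True."""
    suspects = list(functions)
    if not fails(suspects):
        raise ValueError("the all-Subzero build does not fail")
    chunk = len(suspects) // 2
    while chunk >= 1:
        start = 0
        while start < len(suspects):
            # Try handing one chunk of suspects back to pnacl-llc.
            trial = suspects[:start] + suspects[start + chunk:]
            if trial and fails(trial):
                suspects = trial    # still failing: the chunk was innocent
            else:
                start += chunk      # the chunk is needed to reproduce the bug
        chunk //= 2
    return suspects
```

With a simulated failure such as ``lambda s: {"b", "e"} <= set(s)``, the helper
reduces ``["a", "b", "c", "d", "e"]`` to ``["b", "e"]``, which mirrors the case
where a combination of functions is needed to trigger the bug. Blacklisting an
enormous function corresponds to dropping it from ``functions`` before calling
the helper again.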

Fuzz testing
^^^^^^^^^^^^

As described above, we try to find internal Subzero bugs using fuzz testing
techniques.

Sanitizers
^^^^^^^^^^

Subzero can be built with `AddressSanitizer
<http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer
<http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support. This is
done using something as simple as ``make ASAN=1`` or ``make TSAN=1``. So far,
multithreading has been simple enough that TSan hasn't found any bugs, but ASan
has found at least one memory leak which was subsequently fixed.
`UndefinedBehaviorSanitizer
<http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_
(UBSan) support is in progress. `Control flow integrity sanitization
<http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under
consideration.

Current status
==============

Target architectures
--------------------

Subzero is currently more or less complete for the x86-32 target. It has been
refactored and extended to handle x86-64 as well, and that is mostly complete at
this point.

ARM32 work is in progress. It is currently less thoroughly tested than x86, at
least in part because Subzero's register allocator needs modifications to handle
ARM's aliasing of floating point and vector registers. Specifically, a 64-bit
register is actually a gang of two consecutive and aligned 32-bit registers, and
a 128-bit register is a gang of 4 consecutive and aligned 32-bit registers.
ARM64 work has not started; when it does, it will be native-only since the
Native Client sandbox model, validator, and other tools have never been defined
for ARM64.
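
The ARM32 aliasing rule above can be modeled by mapping each register to the
32-bit slots it occupies. The sketch below is an illustration, not Subzero's
register allocator code. The helper names are made up for the example, and it
assumes registers are written with ARM's ``S``/``D``/``Q`` prefixes for 32-,
64-, and 128-bit floating point and vector registers.

```python
# Width of each register class, measured in 32-bit slots.
WIDTH_IN_SLOTS = {"S": 1, "D": 2, "Q": 4}


def word_slots(reg):
    """Return the set of 32-bit slots covered by a register such as 'D1'."""
    kind, index = reg[0], int(reg[1:])
    width = WIDTH_IN_SLOTS[kind]
    # Register n of a given width starts at slot n * width, so it is
    # automatically aligned to its own size.
    return set(range(index * width, (index + 1) * width))


def registers_alias(a, b):
    """Two registers alias if they share at least one 32-bit slot."""
    return not word_slots(a).isdisjoint(word_slots(b))
```

With this model, ``registers_alias("Q0", "S3")`` is true while
``registers_alias("D1", "S1")`` is false, so a register allocator that assigns
``Q0`` must treat ``S0`` through ``S3`` (and ``D0`` and ``D1``) as occupied too.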

An external contributor is adding MIPS support, for the most part by following
the ARM work.

Translator performance
----------------------

Single-threaded translation speed is currently about 5× the ``pnacl-llc``
translation speed. For a large ``.pexe`` file, the time breaks down as:

- 11% for parsing and initial IR building

- 4% for emitting to ``/dev/null``

- 27% for liveness analysis (two liveness passes plus live range construction)

- 15% for linear-scan register allocation

- 9% for basic lowering

- 10% for advanced phi lowering

- ~11% for other minor analysis

- ~10% measurement overhead to acquire these numbers

Some improvements could undoubtedly be made, but it will be hard to increase the
speed to 10× that of ``pnacl-llc`` while keeping acceptable code quality. With
``-Om1`` (lack of) optimization, we do actually achieve roughly 10×
``pnacl-llc`` translation speed, but code quality drops by a factor of 3.
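
As a rough, back-of-the-envelope illustration (not a measurement), Amdahl's law
can be applied to the breakdown above to see why 10× is hard to reach. The
sketch below assumes that the listed percentages can be taken at face value and
that a phase could be removed entirely without affecting the others; the
dictionary keys and the function name are made up for the example.

```python
# Percentage of translation time per phase, copied from the list above.
BREAKDOWN = {
    "parsing": 11,
    "emission": 4,
    "liveness": 27,
    "regalloc": 15,
    "basic lowering": 9,
    "phi lowering": 10,
    "other": 11,
    "measurement": 10,
}


def projected_speed(current_speed, removed_phases):
    """Speed relative to pnacl-llc if the given phases cost nothing."""
    removed = sum(BREAKDOWN[phase] for phase in removed_phases)
    # Amdahl's law: only the remaining fraction of the time is still spent.
    return current_speed * 100 / (100 - removed)
```

Under these assumptions, ``projected_speed(5, ["liveness", "regalloc"])`` is
about 8.6, so even dropping liveness analysis and register allocation
altogether would not reach 10×.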

Code quality
------------

Measured across 16 components of spec2k, Subzero's code quality is uniformly
better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly
between ``pnacl-llc`` ``-O0`` and ``-O2``.

Translator size
---------------

When built in MINIMAL mode, the x86-64 native translator size for the x86-32
target is about 700 KB, not including the size of functions referenced in
dynamically-linked libraries. The sandboxed version of Subzero is a bit over 1
MB, and it is statically linked and also includes nop padding for bundling as
well as indirect branch masking.

Translator memory footprint
---------------------------

It's hard to draw firm conclusions about memory footprint, since the footprint
is at least proportional to the input function size, and there is no real limit
on the size of functions in the ``.pexe`` file.

That said, we looked at the memory footprint over time as Subzero translated
``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our
disposal. One of LLVM's libraries that Subzero uses can report the current
malloc heap usage. With single-threaded translation, Subzero tends to hover
around 15 MB of memory usage. There are a couple of monstrous functions where
Subzero grows to around 100 MB, but then it drops back down after those
functions finish translating. In contrast, ``pnacl-llc`` grows larger and
larger throughout translation, reaching several hundred MB by the time it
completes.

It's a bit more interesting when we enable multithreaded translation. When
there are N translation threads, Subzero implements a policy that limits the
size of the translation queue to N entries -- if it is "full" when the parser
tries to add a new CFG, the parser blocks until one of the translation threads
removes a CFG. This means the number of in-memory CFGs can (and generally does)
reach 2*N+1 (N waiting in the queue, N being translated, and one held by the
blocked parser), and so the memory footprint rises in proportion to the number
of threads. Adding to the pressure is the observation that the monstrous
functions also take proportionally longer to translate, so there's a good chance
many of the monstrous functions will be active at the same time with
multithreaded translation. As a result, for N=32, Subzero's memory footprint
peaks at about 260 MB, but drops back down as the large functions finish
translating.
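
The queue policy described above can be sketched with Python's standard
``queue.Queue``, whose ``put()`` blocks while the queue is full. This is a toy
model rather than Subzero's implementation: integers stand in for CFGs,
translation itself is elided, and names such as ``translate_all`` are
hypothetical. The function returns the peak number of CFGs alive at once, which
can never exceed 2*N+1.

```python
import queue
import threading


def translate_all(num_functions, num_threads):
    """Parse and translate with a bounded queue; return the peak live-CFG count."""
    work = queue.Queue(maxsize=num_threads)  # holds at most N pending CFGs
    lock = threading.Lock()
    live = 0
    peak = 0

    def adjust(delta):
        # Track how many CFGs exist right now and the largest value seen.
        nonlocal live, peak
        with lock:
            live += delta
            peak = max(peak, live)

    def translator():
        while True:
            cfg = work.get()
            if cfg is None:      # sentinel: no more work for this thread
                return
            # ... lowering, liveness analysis, register allocation ...
            adjust(-1)           # the CFG is freed once translation is done

    threads = [threading.Thread(target=translator) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for index in range(num_functions):
        adjust(+1)               # the parser builds a new CFG ...
        work.put(index)          # ... and blocks here while the queue is full
    for _ in threads:
        work.put(None)           # one sentinel per translation thread
    for thread in threads:
        thread.join()
    return peak
```

For example, ``translate_all(200, 4)`` returns a value no greater than 9; the
exact peak depends on thread scheduling.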

If this peak memory size becomes a problem, it might be possible for the parser
to resequence the functions to try to spread out the larger functions, or to
throttle the translation queue to prevent too many in-flight large functions.
It may also be possible to throttle based on memory pressure signaling from
Chrome.

Translator scalability
----------------------

Currently scalability is "not very good". Multiple translation threads lead to
faster translation, but not to the degree desired. We haven't dug in to
investigate yet.

There are a few areas to investigate. First, there may be contention on the
constant pool, which all threads access, and which requires locked access even
for reading. This could be mitigated by keeping a CFG-local cache of the most
common constants.

Second, there may be contention on memory allocation. While almost all CFG
objects are allocated from the CFG-local allocator, some passes use temporary
STL containers that use the default allocator, which may require global locking.
This could be mitigated by switching these to the CFG-local allocator.

Third, multithreading may make the default allocator strategy more expensive.
In a single-threaded environment, a pass will allocate its containers, run the
pass, and deallocate the containers. This results in stack-like allocation
behavior and makes the heap free list easier to manage, with less heap
fragmentation. But when multithreading is added, the allocations and
deallocations become much less stack-like, making allocation and deallocation
operations individually more expensive. Again, this could be mitigated by
switching these to the CFG-local allocator.