Blame - DESIGN.rst - platform/external/swiftshader

blob: 3d324f64ba81798b5a7a038b24f265bbe292254b [file] [log] [blame]

Jim Stichnoth	efb8971	2015-09-03 13:19:54 -0700	[diff] [blame^]	1	Design of the Subzero fast code generator
				2	=========================================
				3
				4	Introduction
				5	------------
				6
				7	The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
				8	compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the
				9	PNaCl toolchain to compile their application to architecture-neutral PNaCl
				10	bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
				11	possible. The ``.pexe`` file is downloaded to the user's browser where the
				12	PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
				13	`sandboxed
				14	<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
				15	native code. The translator uses architecture-specific optimizations as much as
				16	practical to generate good native code.
				17
				18	The native code can be cached by the browser to avoid repeating translation on
				19	future page loads. However, first-time user experience is hampered by long
				20	translation times. The LLVM-based PNaCl translator is pretty slow, even when
				21	using ``-O0`` to minimize optimizations, so delays are especially noticeable on
				22	slow browser platforms such as ARM-based Chromebooks.
				23
				24	Translator slowness can be mitigated or hidden in a number of ways.
				25
				26	- Parallel translation. However, slow machines where this matters most, e.g.
				27	ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
				28	and are likely to less memory available for multiple translation threads to
				29	use.
				30
				31	- Streaming translation, i.e. start translating as soon as the download starts.
				32	This doesn't help much when translation speed is 10× slower than download
				33	speed, or the ``.pexe`` file is already cached while the translated binary was
				34	flushed from the cache.
				35
				36	- Arrange the web page such that translation is done in parallel with
				37	downloading large assets.
				38
				39	- Arrange the web page to distract the user with `cat videos
				40	<https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
				41	progress.
				42
				43	Or, improve translator performance to something more reasonable.
				44
				45	This document describes Subzero's attempt to improve translation speed by an
				46	order of magnitude while rivaling LLVM's code quality. Subzero does this
				47	through minimal IR layering, lean data structures and passes, and a careful
				48	selection of fast optimization passes. It has two optimization recipes: full
				49	optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the
				50	following (described in more detail below):
				51
				52	+-----------------------------------+-----------------------+
				53	\| O2 recipe \| Om1 recipe \|
				54	+===================================+=======================+
				55	\| Parse .pexe file \| Parse .pexe file \|
				56	+-----------------------------------+-----------------------+
				57	\| Loop nest analysis \| \|
				58	+-----------------------------------+-----------------------+
				59	\| Address mode inference \| \|
				60	+-----------------------------------+-----------------------+
				61	\| Read-modify-write (RMW) transform \| \|
				62	+-----------------------------------+-----------------------+
				63	\| Basic liveness analysis \| \|
				64	+-----------------------------------+-----------------------+
				65	\| Load optimization \| \|
				66	+-----------------------------------+-----------------------+
				67	\| \| Phi lowering (simple) \|
				68	+-----------------------------------+-----------------------+
				69	\| Target lowering \| Target lowering \|
				70	+-----------------------------------+-----------------------+
				71	\| Full liveness analysis \| \|
				72	+-----------------------------------+-----------------------+
				73	\| Register allocation \| \|
				74	+-----------------------------------+-----------------------+
				75	\| Phi lowering (advanced) \| \|
				76	+-----------------------------------+-----------------------+
				77	\| Post-phi register allocation \| \|
				78	+-----------------------------------+-----------------------+
				79	\| Branch optimization \| \|
				80	+-----------------------------------+-----------------------+
				81	\| Code emission \| Code emission \|
				82	+-----------------------------------+-----------------------+
				83
				84	Goals
				85	=====
				86
				87	Translation speed
				88	-----------------
				89
				90	We'd like to be able to translate a ``.pexe`` file as fast as download speed.
				91	Any faster is in a sense wasted effort. Download speed varies greatly, but
				92	we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a
				93	slow machine. We observe a 3× single-thread performance difference between A15
				94	and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
				95	``.pexe`` file could be compressed to 50% on the web server using gzip transport
				96	compression, so we set the translation speed goal to 6 MB/sec on the high-end
				97	Xeon workstation.
				98
				99	Currently, at the ``-O0`` level, the LLVM-based PNaCl translation translates at
				100	⅒ the target rate. The ``-O2`` mode takes 3× as long as the ``-O0`` mode.
				101
				102	In other words, Subzero's goal is to improve over LLVM's translation speed by
				103	10×.
				104
				105	Code quality
				106	------------
				107
				108	Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
				109	code quality. The stretch goal is to approach LLVM ``-O2`` code quality. On
				110	average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.
				111
				112	It's important to note that the quality of Subzero-generated code depends on
				113	target-neutral optimizations and simplifications being run beforehand in the
				114	developer environment. The ``.pexe`` file reflects these optimizations. For
				115	example, Subzero assumes that the basic blocks are ordered topologically where
				116	possible (which makes liveness analysis converge fastest), and Subzero does not
				117	do any function inlining because it should already have been done.
				118
				119	Translator size
				120	---------------
				121
				122	The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
				123	We think 1 MB is a more reasonable size -- especially for such a component that
				124	is distributed to a billion Chrome users. Thus we target a 10× reduction in
				125	binary size.
				126
				127	For development, Subzero can be built for all target architectures, and all
				128	debugging and diagnostic options enabled. For a smaller translator, we restrict
				129	to a single target architecture, and define a ``MINIMAL`` build where
				130	unnecessary features are compiled out.
				131
				132	Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
				133	include directories, which have little impact on translator size. It also uses
				134	some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
				135	with little size impact. In non-``MINIMAL`` builds, the translator size is much
				136	larger due to including code for parsing text-format bitcode files and forming
				137	LLVM IR.
				138
				139	Memory footprint
				140	----------------
				141
				142	The current LLVM-based translator suffers from an issue in which some
				143	function-specific data has to be retained in memory until all translation
				144	completes, and therefore the memory footprint grows without bound. Large
				145	``.pexe`` files can lead to the translator process holding hundreds of MB of
				146	memory by the end. The translator runs in a separate process, so this memory
				147	growth doesn't directly affect other processes, but it does dirty the physical
				148	memory and contributes to a perception of bloat and sometimes a reality of
				149	out-of-memory tab killing, especially noticeable on weaker systems.
				150
				151	Subzero should maintain a stable memory footprint throughout translation. It's
				152	not really practical to set a specific limit, because there is not really a
				153	practical limit on a single function's size, but the footprint should be
				154	"reasonable" and be proportional to the largest input function size, not the
				155	total ``.pexe`` file size. Simply put, Subzero should not have memory leaks or
				156	inexorable memory growth. (We use ASAN builds to test for leaks.)
				157
				158	Multithreaded translation
				159	-------------------------
				160
				161	It should be practical to translate different functions concurrently and see
				162	good scalability. Some locking may be needed, such as accessing output buffers
				163	or constant pools, but that should be fairly minimal. In contrast, LLVM was
				164	only designed for module-level parallelism, and as such, the PNaCl translator
				165	internally splits a ``.pexe`` file into several modules for concurrent
				166	translation. All output needs to be deterministic regardless of the level of
				167	multithreading, i.e. functions and data should always be output in the same
				168	order.
				169
				170	Target architectures
				171	--------------------
				172
				173	Initial target architectures are x86-32, x86-64, ARM32, and MIPS32. Future
				174	targets include ARM64 and MIPS64, though these targets lack NaCl support
				175	including a sandbox model or a validator.
				176
				177	The first implementation is for x86-32, because it was expected to be
				178	particularly challenging, and thus more likely to draw out any design problems
				179	early:
				180
				181	- There are a number of special cases, asymmetries, and warts in the x86
				182	instruction set.
				183
				184	- Complex addressing modes may be leveraged for better code quality.
				185
				186	- 64-bit integer operations have to be lowered into longer sequences of 32-bit
				187	operations.
				188
				189	- Paucity of physical registers may reveal code quality issues early in the
				190	design.
				191
				192	Detailed design
				193	===============
				194
				195	Intermediate representation - ICE
				196	---------------------------------
				197
				198	Subzero's IR is called ICE. It is designed to be reasonably similar to LLVM's
				199	IR, which is reflected in the ``.pexe`` file's bitcode structure. It has a
				200	representation of global variables and initializers, and a set of functions.
				201	Each function contains a list of basic blocks, and each basic block constains a
				202	list of instructions. Instructions that operate on stack and register variables
				203	do so using static single assignment (SSA) form.
				204
				205	The ``.pexe`` file is translated one function at a time (or in parallel by
				206	multiple translation threads). The recipe for optimization passes depends on
				207	the specific target and optimization level, and is described in detail below.
				208	Global variables (types and initializers) are simply and directly translated to
				209	object code, without any meaningful attempts at optimization.
				210
				211	A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
				212	Its key contents include:
				213
				214	- A list of ``CfgNode`` pointers, generally held in topological order.
				215
				216	- A list of ``Variable`` pointers corresponding to local variables used in the
				217	function plus compiler-generated temporaries.
				218
				219	A basic block is represented by the ``Ice::CfgNode`` class. Its key contents
				220	include:
				221
				222	- A linear list of instructions, in the same style as LLVM. The last
				223	instruction of the list is always a terminator instruction: branch, switch,
				224	return, unreachable.
				225
				226	- A list of Phi instructions, also in the same style as LLVM. They are held as
				227	a linear list for convenience, though per Phi semantics, they are executed "in
				228	parallel" without dependencies on each other.
				229
				230	- An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
				231	another list for outgoing edges.
				232
				233	- The node's unique, 0-based index into the CFG's node list.
				234
				235	An instruction is represented by the ``Ice::Inst`` class. Its key contents
				236	include:
				237
				238	- A list of source operands.
				239
				240	- Its destination variable, if the instruction produces a result in an
				241	``Ice::Variable``.
				242
				243	- A bitvector indicating which variables' live ranges this instruction ends.
				244	This is computed during liveness analysis.
				245
				246	Instructions kinds are divided into high-level ICE instructions and low-level
				247	ICE instructions. High-level instructions consist of the PNaCl/LLVM bitcode
				248	instruction kinds. Each target architecture implementation extends the
				249	instruction space with its own set of low-level instructions. Generally,
				250	low-level instructions correspond to individual machine instructions. The
				251	high-level ICE instruction space includes a few additional instruction kinds
				252	that are not part of LLVM but are generally useful (e.g., an Assignment
				253	instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
				254	instructions for sandboxing).
				255
				256	Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
				257	ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
				258	<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
				259	the following:
				260
				261	- Alloca: allocate data on the stack
				262
				263	- Arithmetic: binary operations of the form ``A = B op C``
				264
				265	- Br: conditional or unconditional branch
				266
				267	- Call: function call
				268
				269	- Cast: unary type-conversion operations
				270
				271	- ExtractElement: extract a scalar element from a vector-type value
				272
				273	- Fcmp: floating-point comparison
				274
				275	- Icmp: integer comparison
				276
				277	- IntrinsicCall: call a known intrinsic
				278
				279	- InsertElement: insert a scalar element into a vector-type value
				280
				281	- Load: load a value from memory
				282
				283	- Phi: implement the SSA phi node
				284
				285	- Ret: return from the function
				286
				287	- Select: essentially the C language operation of the form ``X = C ? Y : Z``
				288
				289	- Store: store a value into memory
				290
				291	- Switch: generalized branch to multiple possible locations
				292
				293	- Unreachable: indicate that this portion of the code is unreachable
				294
				295	The additional high-level ICE instructions are the following:
				296
				297	- Assign: a simple ``A=B`` assignment. This is useful for e.g. lowering Phi
				298	instructions to non-SSA assignments, before lowering to machine code.
				299
				300	- BundleLock, BundleUnlock. These are markers used for sandboxing, but are
				301	common across all targets and so they are elevated to the high-level
				302	instruction set.
				303
				304	- FakeDef, FakeUse, FakeKill. These are tools used to preserve consistency in
				305	liveness analysis, elevated to the high-level because they are used by all
				306	targets. They are described in more detail at the end of this section.
				307
				308	- JumpTable: this represents the result of switch optimization analysis, where
				309	some switch instructions may use jump tables instead of cascading
				310	compare/branches.
				311
				312	An operand is represented by the ``Ice::Operand`` class. In high-level ICE, an
				313	operand is either an ``Ice::Constant`` or an ``Ice::Variable``. Constants
				314	include scalar integer constants, scalar floating point constants, Undef (an
				315	unspecified constant of a particular scalar or vector type), and symbol
				316	constants (essentially addresses of globals). Note that the PNaCl ABI does not
				317	include vector-type constants besides Undef, and as such, Subzero (so far) has
				318	no reason to represent vector-type constants internally. A variable represents
				319	a value allocated on the stack (though not including alloca-derived storage).
				320	Among other things, a variable holds its unique, 0-based index into the CFG's
				321	variable list.
				322
				323	Each target can extend the ``Constant`` and ``Variable`` classes for its own
				324	needs. In addition, the ``Operand`` class may be extended, e.g. to define an
				325	x86 ``MemOperand`` that encodes a base register, an index register, an index
				326	register shift amount, and a constant offset.
				327
				328	Register allocation and liveness analysis are restricted to Variable operands.
				329	Because of the importance of register allocation to code quality, and the
				330	translation-time cost of liveness analysis, Variable operands get some special
				331	treatment in ICE. Most notably, a frequent pattern in Subzero is to iterate
				332	across all the Variables of an instruction. An instruction holds a list of
				333	operands, but an operand may contain 0, 1, or more Variables. As such, the
				334	``Operand`` class specially holds a list of Variables contained within, for
				335	quick access.
				336
				337	A Subzero transformation pass may work by deleting an existing instruction and
				338	replacing it with zero or more new instructions. Instead of actually deleting
				339	the existing instruction, we generally mark it as deleted and insert the new
				340	instructions right after the deleted instruction. When printing the IR for
				341	debugging, this is a big help because it makes it much more clear how the
				342	non-deleted instructions came about.
				343
				344	Subzero has a few special instructions to help with liveness analysis
				345	consistency.
				346
				347	- The FakeDef instruction gives a fake definition of some variable. For
				348	example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
				349	but an ICE instruction can represent only one destination variable. This is
				350	similar for multiply instructions, and for function calls that return a 64-bit
				351	integer result in the ``%edx:%eax`` pair. Also, using the ``xor %eax, %eax``
				352	trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.
				353
				354	- The FakeUse instruction registers a use of a variable, typically to prevent an
				355	earlier assignment to that variable from being dead-code eliminated. For
				356	example, lowering an operation like ``x=cc?y:z`` may be done using x86's
				357	conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``. Without a
				358	FakeUse of ``x`` between the two instructions, the liveness analysis pass may
				359	dead-code eliminate the first instruction.
				360
				361	- The FakeKill instruction is added after a call instruction, and is a quick way
				362	of indicating that caller-save registers are invalidated.
				363
				364	Pexe parsing
				365	------------
				366
				367	Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files. It
				368	parses the ``.pexe`` file function by function, ultimately constructing an ICE
				369	CFG for each function. After a function is parsed, its CFG is handed off to the
				370	translation phase. The bitcode parser also parses global initializer data and
				371	hands it off to be translated to data sections in the object file.
				372
				373	Subzero has another parsing strategy for testing/debugging. LLVM libraries can
				374	be used to parse a module into LLVM IR (though very slowly relative to Subzero
				375	native parsing). Then we iterate across the LLVM IR and construct high-level
				376	ICE, handing off each CFG to the translation phase.
				377
				378	Overview of lowering
				379	--------------------
				380
				381	In general, translation goes like this:
				382
				383	- Parse the next function from the ``.pexe`` file and construct a CFG consisting
				384	of high-level ICE.
				385
				386	- Do analysis passes and transformation passes on the high-level ICE, as
				387	desired.
				388
				389	- Lower each high-level ICE instruction into a sequence of zero or more
				390	low-level ICE instructions. Each high-level instruction is generally lowered
				391	independently, though the target lowering is allowed to look ahead in the
				392	CfgNode's instruction list if desired.
				393
				394	- Do more analysis and transformation passes on the low-level ICE, as desired.
				395
				396	- Assemble the low-level CFG into an ELF object file (alternatively, a textual
				397	assembly file that is later assembled by some external tool).
				398
				399	- Repeat for all functions, and also produce object code for data such as global
				400	initializers and internal constant pools.
				401
				402	Currently there are two optimization levels: ``O2`` and ``Om1``. For ``O2``,
				403	the intention is to apply all available optimizations to get the best code
				404	quality (though the initial code quality goal is measured against LLVM's ``O0``
				405	code quality). For ``Om1``, the intention is to apply as few optimizations as
				406	possible and produce code as quickly as possible, accepting poor code quality.
				407	``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
				408	"sub-zero".
				409
				410	High-level debuggability of generated code is so far not a design requirement.
				411	Subzero doesn't really do transformations that would obfuscate debugging; the
				412	main thing might be that register allocation (including stack slot coalescing
				413	for stack-allocated variables whose live ranges don't overlap) may render a
				414	variable's value unobtainable after its live range ends. This would not be an
				415	issue for ``Om1`` since it doesn't register-allocate program-level variables,
				416	nor does it coalesce stack slots. That said, fully supporting debuggability
				417	would require a few additions:
				418
				419	- DWARF support would need to be added to Subzero's ELF file emitter. Subzero
				420	propagates global symbol names, local variable names, and function-internal
				421	label names that are present in the ``.pexe`` file. This would allow a
				422	debugger to map addresses back to symbols in the ``.pexe`` file.
				423
				424	- To map ``.pexe`` file symbols back to meaningful source-level symbol names,
				425	file names, line numbers, etc., Subzero would need to handle `LLVM bitcode
				426	metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg``
				427	`instrinsics<http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.
				428
				429	- The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so
				430	the toolchain would need to be modified to preserve it.
				431
				432	Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
				433	produces code with one third the code quality. ``Om1`` is good for testing and
				434	debugging -- during translation, it tends to expose errors in the basic lowering
				435	that might otherwise have been hidden by the register allocator or other
				436	optimization passes. It also helps determine whether a code correctness problem
				437	is a fundamental problem in the basic lowering, or an error in another
				438	optimization pass.
				439
				440	The implementation of target lowering also controls the recipe of passes used
				441	for ``Om1`` and ``O2`` translation. For example, address mode inference may
				442	only be relevant for x86.
				443
				444	Lowering strategy
				445	-----------------
				446
				447	The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
				448	each high-level instruction down to a sequence of low-level target-specific
				449	instructions, in a largely context-free setting. That is, each high-level
				450	instruction conceptually has a simple template expansion into low-level
				451	instructions, and lowering can in theory be done in any order. This may sound
				452	like a small effort, but quite a large number of templates may be needed because
				453	of the number of PNaCl types and instruction variants. Furthermore, there may
				454	be optimized templates, e.g. to take advantage of operator commutativity (for
				455	example, ``x=x+1`` might allow a bettern lowering than ``x=1+x``). This is
				456	similar to other template-based approaches in fast code generation or
				457	interpretation, though some decisions are deferred until after some global
				458	analysis passes, mostly related to register allocation, stack slot assignment,
				459	and specific choice of instruction variant and addressing mode.
				460
				461	The key idea for a lowering template is to produce valid low-level instructions
				462	that are guaranteed to meet address mode and other structural requirements of
				463	the instruction set. For example, on x86, the source operand of an integer
				464	store instruction must be an immediate or a physical register; a shift
				465	instruction's shift amount must be an immediate or in register ``%cl``; a
				466	function's integer return value is in ``%eax``; most x86 instructions are
				467	two-operand, in contrast to corresponding three-operand high-level instructions;
				468	etc.
				469
				470	Because target lowering runs before register allocation, there is no way to know
				471	whether a given ``Ice::Variable`` operand lives on the stack or in a physical
				472	register. When the low-level instruction calls for a physical register operand,
				473	the target lowering can create an infinite-weight Variable. This tells the
				474	register allocator to assign infinite weight when making decisions, effectively
				475	guaranteeing some physical register. Variables can also be pre-colored to a
				476	specific physical register (``cl`` in the shift example above), which also gives
				477	infinite weight.
				478
				479	To illustrate, consider a high-level arithmetic instruction on 32-bit integer
				480	operands::
				481
				482	A = B + C
				483
				484	X86 target lowering might produce the following::
				485
				486	T.inf = B // mov instruction
				487	T.inf += C // add instruction
				488	A = T.inf // mov instruction
				489
				490	Here, ``T.inf`` is an infinite-weight temporary. As long as ``T.inf`` has a
				491	physical register, the three lowered instructions are all encodable regardless
				492	of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
				493	whether ``A`` is a physical register or in memory.
				494
				495	In this example, ``A`` must be a Variable and one may be tempted to simplify the
				496	lowering sequence by setting ``A`` as infinite-weight and using::
				497
				498	A = B // mov instruction
				499	A += C // add instruction
				500
				501	This has two problems. First, if the original instruction was actually ``A =
				502	B + A``, the result would be incorrect. Second, assigning ``A`` a physical
				503	register applies throughout ``A``'s entire live range. This is probably not
				504	what is intended, and may ultimately lead to a failure to allocate a register
				505	for an infinite-weight variable.
				506
				507	This style of lowering leads to many temporaries being generated, so in ``O2``
				508	mode, we rely on the register allocator to clean things up. For example, in the
				509	example above, if ``B`` ends up getting a physical register and its live range
				510	ends at this instruction, the register allocator is likely to reuse that
				511	register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register
				512	copy, which is removed as an emission-time peephole optimization.
				513
				514	O2 lowering
				515	-----------
				516
				517	Currently, the ``O2`` lowering recipe is the following:
				518
				519	- Loop nest analysis
				520
				521	- Address mode inference
				522
				523	- Read-modify-write (RMW) transformation
				524
				525	- Basic liveness analysis
				526
				527	- Load optimization
				528
				529	- Target lowering
				530
				531	- Full liveness analysis
				532
				533	- Register allocation
				534
				535	- Phi instruction lowering (advanced)
				536
				537	- Post-phi lowering register allocation
				538
				539	- Branch optimization
				540
				541	These passes are described in more detail below.
				542
				543	Om1 lowering
				544	------------
				545
				546	Currently, the ``Om1`` lowering recipe is the following:
				547
				548	- Phi instruction lowering (simple)
				549
				550	- Target lowering
				551
				552	- Register allocation (infinite-weight and pre-colored only)
				553
				554	Optimization passes
				555	-------------------
				556
				557	Liveness analysis
				558	^^^^^^^^^^^^^^^^^
				559
				560	Liveness analysis is a standard dataflow optimization, implemented as follows.
				561	For each node (basic block), its live-out set is computed as the union of the
				562	live-in sets of its successor nodes. Then the node's instructions are processed
				563	in reverse order, updating the live set, until the beginning of the node is
				564	reached, and the node's live-in set is recorded. If this iteration has changed
				565	the node's live-in set, the node's predecessors are marked for reprocessing.
				566	This continues until no more nodes need reprocessing. If nodes are processed in
				567	reverse topological order, the number of iterations over the CFG is generally
				568	equal to the maximum loop nest depth.
				569
				570	To implement this, each node records its live-in and live-out sets, initialized
				571	to the empty set. Each instruction records which of its Variables' live ranges
				572	end in that instruction, initialized to the empty set. A side effect of
				573	liveness analysis is dead instruction elimination. Each instruction can be
				574	marked as tentatively dead, and after the algorithm converges, the tentatively
				575	dead instructions are permanently deleted.
				576
				577	Optionally, after this liveness analysis completes, we can do live range
				578	construction, in which we calculate the live range of each variable in terms of
				579	instruction numbers. A live range is represented as a union of segments, where
				580	the segment endpoints are instruction numbers. Instruction numbers are required
				581	to be unique across the CFG, and monotonically increasing within a basic block.
				582	As a union of segments, live ranges can contain "gaps" and are therefore
				583	precise. Because of SSA properties, a variable's live range can start at most
				584	once in a basic block, and can end at most once in a basic block. Liveness
				585	analysis keeps track of which variable/instruction tuples begin live ranges and
				586	end live ranges, and combined with live-in and live-out sets, we can efficiently
				587	build up live ranges of all variables across all basic blocks.
				588
				589	A lot of care is taken to try to make liveness analysis fast and efficient.
				590	Because of the lowering strategy, the number of variables is generally
				591	proportional to the number of instructions, leading to an O(N^2) complexity
				592	algorithm if implemented naively. To improve things based on sparsity, we note
				593	that most variables are "local" and referenced in at most one basic block (in
				594	contrast to the "global" variables with multi-block usage), and therefore cannot
				595	be live across basic blocks. Therefore, the live-in and live-out sets,
				596	typically represented as bit vectors, can be limited to the set of global
				597	variables, and the intra-block liveness bit vector can be compacted to hold the
				598	global variables plus the local variables for that block.
				599
				600	Register allocation
				601	^^^^^^^^^^^^^^^^^^^
				602
				603	Subzero implements a simple linear-scan register allocator, based on the
				604	allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan
				605	Register Allocation in the Context of SSA Form and Register Constraints
				606	<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. This allocator has
				607	several nice features:
				608
				609	- Live ranges are represented as unions of segments, as described above, rather
				610	than a single start/end tuple.
				611
				612	- It allows pre-coloring of variables with specific physical registers.
				613
				614	- It applies equally well to pre-lowered Phi instructions.
				615
				616	The paper suggests an approach of aggressively coalescing variables across Phi
				617	instructions (i.e., trying to force Phi source and destination variables to have
				618	the same register assignment), but we reject that in favor of the more natural
				619	preference mechanism described below.
				620
				621	We enhance the algorithm in the paper with the capability of automatic inference
				622	of register preference, and with the capability of allowing overlapping live
				623	ranges to safely share the same register in certain circumstances. If we are
				624	considering register allocation for variable ``A``, and ``A`` has a single
				625	defining instruction ``A=B+C``, then the preferred register for ``A``, if
				626	available, would be the register assigned to ``B`` or ``C``, if any, provided
				627	that ``B`` or ``C``'s live range does not overlap ``A``'s live range. In this
				628	way we infer a good register preference for ``A``.
				629
				630	We allow overlapping live ranges to get the same register in certain cases.
				631	Suppose a high-level instruction like::
				632
				633	A = unary_op(B)
				634
				635	has been target-lowered like::
				636
				637	T.inf = B
				638	A = unary_op(T.inf)
				639
				640	Further, assume that ``B``'s live range continues beyond this instruction
				641	sequence, and that ``B`` has already been assigned some register. Normally, we
				642	might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
				643	turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have
				644	different registers. But ``T.inf`` is just a read-only copy of ``B`` that is
				645	guaranteed to be in a register, so in theory these overlapping live ranges could
				646	safely have the same register. Our implementation allows this overlap as long
				647	as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never
				648	modified within ``T.inf``'s live range.
				649
				650	Subzero's register allocator can be run in 3 configurations.
				651
				652	- Normal mode. All Variables are considered for register allocation. It
				653	requires full liveness analysis and live range construction as a prerequisite.
				654	This is used by ``O2`` lowering.
				655
				656	- Minimal mode. Only infinite-weight or pre-colored Variables are considered.
				657	All other Variables are stack-allocated. It does not require liveness
				658	analysis; instead, it quickly scans the instructions and records first
				659	definitions and last uses of all relevant Variables, using that to construct a
				660	single-segment live range. Although this includes most of the Variables, the
				661	live ranges are mostly simple, short, and rarely overlapping, which the
				662	register allocator handles efficiently. This is used by ``Om1`` lowering.
				663
				664	- Post-phi lowering mode. Advanced phi lowering is done after normal-mode
				665	register allocation, and may result in new infinite-weight Variables that need
				666	registers. One would like to just run something like minimal mode to assign
				667	registers to the new Variables while respecting existing register allocation
				668	decisions. However, it sometimes happens that there are no free registers.
				669	In this case, some register needs to be forcibly spilled to the stack and
				670	temporarily reassigned to the new Variable, and reloaded at the end of the new
				671	Variable's live range. The register must be one that has no explicit
				672	references during the Variable's live range. Since Subzero currently doesn't
				673	track def/use chains (though it does record the CfgNode where a Variable is
				674	defined), we just do a brute-force search across the CfgNode's instruction
				675	list for the instruction numbers of interest. This situation happens very
				676	rarely, so there's little point for now in improving its performance.
				677
				678	The basic linear-scan algorithm may, as it proceeds, rescind an early register
				679	allocation decision, leaving that Variable to be stack-allocated. Some of these
				680	times, it turns out that the Variable could have been given a different register
				681	without conflict, but by this time it's too late. The literature recognizes
				682	this situation and describes "second-chance bin-packing", which Subzero can do.
				683	We can rerun the register allocator in a mode that respects existing register
				684	allocation decisions, and sometimes it finds new non-conflicting opportunities.
				685	In fact, we can repeatedly run the register allocator until convergence.
				686	Unfortunately, in the current implementation, these subsequent register
				687	allocation passes end up being extremely expensive. This is because of the
				688	treatment of the "unhandled pre-colored" Variable set, which is normally very
				689	small but ends up being quite large on subsequent passes. Its performance can
				690	probably be made acceptable with a better choice of data structures, but for now
				691	this second-chance mechanism is disabled.
				692
				693	Future work is to implement LLVM's `Greedy
				694	<http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
				695	register allocator as a replacement for the basic linear-scan algorithm, given
				696	LLVM's experience with its improvement in code quality. (The blog post claims
				697	that the Greedy allocator also improved maintainability because a lot of hacks
				698	could be removed, but Subzero is probably not yet to that level of hacks, and is
				699	less likely to see that particular benefit.)
				700
				701	Basic phi lowering
				702	^^^^^^^^^^^^^^^^^^
				703
				704	The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
				705	implements it). Consider this example::
				706
				707	L1:
				708	...
				709	br L3
				710	L2:
				711	...
				712	br L3
				713	L3:
				714	A = phi [B, L1], [C, L2]
				715	X = phi [Y, L1], [Z, L2]
				716
				717	For each destination of a phi instruction, we can create a temporary and insert
				718	the temporary's assignment at the end of the predecessor block::
				719
				720	L1:
				721	...
				722	A' = B
				723	X' = Y
				724	br L3
				725	L2:
				726	...
				727	A' = C
				728	X' = Z
				729	br L3
				730	L2:
				731	A = A'
				732	X = X'
				733
				734	This transformation is very simple and reliable. It can be done before target
				735	lowering and register allocation, and it easily avoids the classic lost-copy and
				736	related problems. ``Om1`` lowering uses this strategy.
				737
				738	However, it has the disadvantage of initializing temporaries even for branches
				739	not taken, though that could be mitigated by splitting non-critical edges and
				740	putting assignments in the edge-split nodes. Another problem is that without
				741	extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a
				742	specific ordering even though phi semantics are that the assignments are
				743	parallel or unordered. This sometimes imposes false live range overlaps and
				744	leads to poorer register allocation.
				745
				746	Advanced phi lowering
				747	^^^^^^^^^^^^^^^^^^^^^
				748
				749	``O2`` lowering defers phi lowering until after register allocation to avoid the
				750	problem of false live range overlaps. It works as follows. We split each
				751	incoming edge and move the (parallel) phi assignments into the split nodes. We
				752	linearize each set of assignments by finding a safe, topological ordering of the
				753	assignments, respecting register assignments as well. For example::
				754
				755	A = B
				756	X = Y
				757
				758	Normally these assignments could be executed in either order, but if ``B`` and
				759	``X`` are assigned the same physical register, we would want to use the above
				760	ordering. Dependency cycles are broken by introducing a temporary. For
				761	example::
				762
				763	A = B
				764	B = A
				765
				766	Here, a temporary breaks the cycle::
				767
				768	t = A
				769	A = B
				770	B = t
				771
				772	Finally, we use the existing target lowering to lower the assignments in this
				773	basic block, and once that is done for all basic blocks, we run the post-phi
				774	variant of register allocation on the edge-split basic blocks.
				775
				776	When computing a topological order, we try to first schedule assignments whose
				777	source has a physical register, and last schedule assignments whose destination
				778	has a physical register. This helps reduce register pressure.
				779
				780	X86 address mode inference
				781	^^^^^^^^^^^^^^^^^^^^^^^^^^
				782
				783	We try to take advantage of the x86 addressing mode that includes a base
				784	register, an index register, an index register scale amount, and an immediate
				785	offset. We do this through simple pattern matching. Starting with a load or
				786	store instruction where the address is a variable, we initialize the base
				787	register to that variable, and look up the instruction where that variable is
				788	defined. If that is an add instruction of two variables and the index register
				789	hasn't been set, we replace the base and index register with those two
				790	variables. If instead it is an add instruction of a variable and a constant, we
				791	replace the base register with the variable and add the constant to the
				792	immediate offset.
				793
				794	There are several more patterns that can be matched. This pattern matching
				795	continues on the load or store instruction until no more matches are found.
				796	Because a program typically has few load and store instructions (not to be
				797	confused with instructions that manipulate stack variables), this address mode
				798	inference pass is fast.
				799
				800	X86 read-modify-write inference
				801	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				802
				803	A reasonably common bitcode pattern is a non-atomic update of a memory
				804	location::
				805
				806	x = load addr
				807	y = add x, 1
				808	store y, addr
				809
				810	On x86, with good register allocation, the Subzero passes described above
				811	generate code with only this quality::
				812
				813	mov [%ebx], %eax
				814	add $1, %eax
				815	mov %eax, [%ebx]
				816
				817	However, x86 allows for this kind of code::
				818
				819	add $1, [%ebx]
				820
				821	which requires fewer instructions, but perhaps more importantly, requires fewer
				822	physical registers.
				823
				824	It's also important to note that this transformation only makes sense if the
				825	store instruction ends ``x``'s live range.
				826
				827	Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
				828	opportunities via simple pattern matching. The only problem is that it is run
				829	before liveness analysis, which is needed to determine whether ``x``'s live
				830	range ends after the RMW. Since liveness analysis is one of the most expensive
				831	passes, it's not attractive to run it an extra time just for RMW analysis.
				832	Instead, we essentially generate both the RMW and the non-RMW versions, and then
				833	during lowering, the RMW version deletes itself if it finds x still live.
				834
				835	X86 compare-branch inference
				836	^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				837
				838	In the LLVM instruction set, the compare/branch pattern works like this::
				839
				840	cond = icmp eq a, b
				841	br cond, target
				842
				843	The result of the icmp instruction is a single bit, and a conditional branch
				844	tests that bit. By contrast, most target architectures use this pattern::
				845
				846	cmp a, b // implicitly sets various bits of FLAGS register
				847	br eq, target // branch on a particular FLAGS bit
				848
				849	A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
				850	``cond`` and conditionally branches. Subzero has a pass that identifies
				851	boolean-based operations like this and folds them into a single
				852	compare/branch-like operation. It is set up for more than just cmp/br though.
				853	Boolean producers include icmp (integer compare), fcmp (floating-point compare),
				854	and trunc (integer truncation when the destination has bool type). Boolean
				855	consumers include branch, select (the ternary operator from the C language), and
				856	sign-extend and zero-extend when the source has bool type.
				857
				858	Sandboxing
				859	^^^^^^^^^^
				860
				861	Native Client's sandbox model uses software fault isolation (SFI) to provide
				862	safety when running untrusted code in a browser or other environment. Subzero
				863	implements Native Client's `sandboxing
				864	<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
				865	to enable Subzero-translated executables to be run inside Chrome. Subzero also
				866	provides a fairly simple framework for investigating alternative sandbox models
				867	or other restrictions on the sandbox model.
				868
				869	Sandboxing in Subzero is not actually implemented as a separate pass, but is
				870	integrated into lowering and assembly.
				871
				872	- Indirect branches, including the ret instruction, are masked to a bundle
				873	boundary and bundle-locked.
				874
				875	- Call instructions are aligned to the end of the bundle so that the return
				876	address is bundle-aligned.
				877
				878	- Indirect branch targets, including function entry and targets in a switch
				879	statement jump table, are bundle-aligned.
				880
				881	- The intrinsic for reading the thread pointer is inlined appropriately.
				882
				883	- For x86-64, non-stack memory accesses are with respect to the reserved sandbox
				884	base register. We reduce the aggressiveness of address mode inference to
				885	leave room for the sandbox base register during lowering. There are no memory
				886	sandboxing changes for x86-32.
				887
				888	Code emission
				889	-------------
				890
				891	Subzero's integrated assembler is derived from Dart's `assembler code
				892	<https://github.com/dart-lang/sdk/tree/master/runtime/vm>'_. There is a pass
				893	that iterates through the low-level ICE instructions and invokes the relevant
				894	assembler functions. Placeholders are added for later fixup of branch target
				895	offsets. (Backward branches use short offsets if possible; forward branches
				896	generally use long offsets unless it is an intra-block branch of "known" short
				897	length.) The assembler emits into a staging buffer. Once emission into the
				898	staging buffer for a function is complete, the data is emitted to the output
				899	file as an ELF object file, and metadata such as relocations, symbol table, and
				900	string table, are accumulated for emission at the end. Global data initializers
				901	are emitted similarly. A key point is that at this point, the staging buffer
				902	can be deallocated, and only a minimum of data needs to held until the end.
				903
				904	As a debugging alternative, Subzero can emit textual assembly code which can
				905	then be run through an external assembler. This is of course super slow, but
				906	quite valuable when bringing up a new target.
				907
				908	As another debugging option, the staging buffer can be emitted as textual
				909	assembly, primarily in the form of ".byte" lines. This allows the assembler to
				910	be tested separately from the ELF related code.
				911
				912	Memory management
				913	-----------------
				914
				915	Where possible, we allocate from a ``CfgLocalAllocator`` which derives from
				916	LLVM's ``BumpPtrAllocator``. This is an arena-style allocator where objects
				917	allocated from the arena are never actually freed; instead, when the CFG
				918	translation completes and the CFG is deleted, the entire arena memory is
				919	reclaimed at once. This style of allocation works well in an environment like a
				920	compiler where there are distinct phases with only easily-identifiable objects
				921	living across phases. It frees the developer from having to manage object
				922	deletion, and it amortizes deletion costs across just a single arena deletion at
				923	the end of the phase. Furthermore, it helps scalability by allocating entirely
				924	from thread-local memory pools, and minimizing global locking of the heap.
				925
				926	Instructions are probably the most heavily allocated complex class in Subzero.
				927	We represent an instruction list as an intrusive doubly linked list, allocate
				928	all instructions from the ``CfgLocalAllocator``, and we make sure each
				929	instruction subclass is basically `POD
				930	<http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a
				931	trivial destructor. This way, when the CFG is finished, we don't need to
				932	individually deallocate every instruction. We do similar for Variables, which
				933	is probably the second most popular complex class.
				934
				935	There are some situations where passes need to use some `STL container class
				936	<http://en.cppreference.com/w/cpp/container>`_. Subzero has a way of using the
				937	``CfgLocalAllocator`` as the container allocator if this is needed.
				938
				939	Multithreaded translation
				940	-------------------------
				941
				942	Subzero is designed to be able to translate functions in parallel. With the
				943	``-threads=N`` command-line option, there is a 3-stage producer-consumer
				944	pipeline:
				945
				946	- A single thread parses the ``.pexe`` file and produces a sequence of work
				947	units. A work unit can be either a fully constructed CFG, or a set of global
				948	initializers. The work unit includes its sequence number denoting its parse
				949	order. Each work unit is added to the translation queue.
				950
				951	- There are N translation threads that draw work units from the translation
				952	queue and lower them into assembler buffers. Each assembler buffer is added
				953	to the emitter queue, tagged with its sequence number. The CFG and its
				954	``CfgLocalAllocator`` are disposed of at this point.
				955
				956	- A single thread draws assembler buffers from the emitter queue and appends to
				957	the output file. It uses the sequence numbers to reintegrate the assembler
				958	buffers according to the original parse order, such that output order is
				959	always deterministic.
				960
				961	This means that with ``-threads=N``, there are actually ``N+1`` spawned threads
				962	for a total of ``N+2`` execution threads, taking the parser and emitter threads
				963	into account. For the special case of ``N=0``, execution is entirely sequential
				964	-- the same thread parses, translates, and emits, one function at a time. This
				965	is useful for performance measurements.
				966
				967	Ideally, we would like to get near-linear scalability as the number of
				968	translation threads increases. We expect that ``-threads=1`` should be slightly
				969	faster than ``-threads=0`` as the small amount of time spent parsing and
				970	emitting is done largely in parallel with translation. With perfect
				971	scalability, we see ``-threads=N`` translating ``N`` times as fast as
				972	``-threads=1``, up until the point where parsing or emitting becomes the
				973	bottleneck, or ``N+2`` exceeds the number of CPU cores. In reality, memory
				974	performance would become a bottleneck and efficiency might peak at, say, 75%.
				975
				976	Currently, parsing takes about 11% of total sequential time. If translation
				977	scalability ever gets so fast and awesomely scalable that parsing becomes a
				978	bottleneck, it should be possible to make parsing multithreaded as well.
				979
				980	Internally, all shared, mutable data is held in the GlobalContext object, and
				981	access to each field is guarded by a mutex.
				982
				983	Security
				984	--------
				985
				986	Subzero includes a number of security features in the generated code, as well as
				987	in the Subzero translator itself, which run on top of the existing Native Client
				988	sandbox as well as Chrome's OS-level sandbox.
				989
				990	Sandboxed translator
				991	^^^^^^^^^^^^^^^^^^^^
				992
				993	When running inside the browser, the Subzero translator executes as sandboxed,
				994	untrusted code that is initially checked by the validator, just like the
				995	LLVM-based ``pnacl-llc`` translator. As such, the Subzero binary should be no
				996	more or less secure than the translator it replaces, from the point of view of
				997	the Chrome sandbox. That said, Subzero is much smaller than ``pnacl-llc`` and
				998	was designed from the start with security in mind, so one expects fewer attacker
				999	opportunities here.
				1000
				1001	Code diversification
				1002	^^^^^^^^^^^^^^^^^^^^
				1003
				1004	`Return-oriented programming
				1005	<https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a
				1006	now-common technique for starting with e.g. a known buffer overflow situation
				1007	and launching it into a deeper exploit. The attacker scans the executable
				1008	looking for ROP gadgets, which are short sequences of code that happen to load
				1009	known values into known registers and then return. An attacker who manages to
				1010	overwrite parts of the stack can overwrite it with carefully chosen return
				1011	addresses such that certain ROP gadgets are effectively chained together to set
				1012	up the register state as desired, finally returning to some code that manages to
				1013	do something nasty based on those register values.
				1014
				1015	If there is a popular ``.pexe`` with a large install base, the attacker could
				1016	run Subzero on it and scan the executable for suitable ROP gadgets to use as
				1017	part of a potential exploit. Note that if the trusted validator is working
				1018	correctly, these ROP gadgets are limited to starting at a bundle boundary and
				1019	cannot use the trick of finding a gadget that happens to begin inside another
				1020	instruction. All the same, gadgets with these constraints still exist and the
				1021	attacker has access to them. This is the attack model we focus most on --
				1022	protecting the user against misuse of a "trusted" developer's application, as
				1023	opposed to mischief from a malicious ``.pexe`` file.
				1024
				1025	Subzero can mitigate these attacks to some degree through code diversification.
				1026	Specifically, we can apply some randomness to the code generation that makes ROP
				1027	gadgets less predictable. This randomness can have some compile-time cost, and
				1028	it can affect the code quality; and some diversifications may be more effective
				1029	than others. A more detailed treatment of hardening techniques may be found in
				1030	the Matasano report "`Attacking Clientside JIT Compilers
				1031	<https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_".
				1032
				1033	To evaluate diversification effectiveness, we use a third-party ROP gadget
				1034	finder and limit its results to bundle-aligned addresses. For a given
				1035	diversification technique, we run it with a number of different random seeds,
				1036	find ROP gadgets for each version, and determine how persistent each ROP gadget
				1037	is across the different versions. A gadget is persistent if the same gadget is
				1038	found at the same code address. The best diversifications are ones with low
				1039	gadget persistence rates.
				1040
				1041	Subzero implements 7 different diversification techniques. Below is a
				1042	discussion of each technique, its effectiveness, and its cost. The discussions
				1043	of cost and effectiveness are for a single diversification technique; the
				1044	translation-time costs for multiple techniques are additive, but the effects of
				1045	multiple techniques on code quality and effectiveness are not yet known.
				1046
				1047	In Subzero's implementation, each randomization is "repeatable" in a sense.
				1048	Each pass that includes a randomization option gets its own private instance of
				1049	a random number generator (RNG). The RNG is seeded with a combination of a
				1050	global seed, the pass ID, and the function's sequence number. The global seed
				1051	is designed to be different across runs (perhaps based on the current time), but
				1052	for debugging, the global seed can be set to a specific value and the results
				1053	will be repeatable.
				1054
				1055	Subzero-generated code is subject to diversification once per translation, and
				1056	then Chrome caches the diversified binary for subsequent executions. An
				1057	attacker may attempt to run the binary multiple times hoping for
				1058	higher-probability combinations of ROP gadgets. When the attacker guesses
				1059	wrong, a likely outcome is an application crash. Chrome throttles creation of
				1060	crashy processes which reduces the likelihood of the attacker eventually gaining
				1061	a foothold.
				1062
				1063	Constant blinding
				1064	~~~~~~~~~~~~~~~~~
				1065
				1066	Here, we prevent attackers from controlling large immediates in the text
				1067	(executable) section. A random cookie is generated for each function, and if
				1068	the constant exceeds a specified threshold, the constant is obfuscated with the
				1069	cookie and equivalent code is generated. For example, instead of this x86
				1070	instruction::
				1071
				1072	mov $0x11223344, <%Reg/Mem>
				1073
				1074	the following code might be generated::
				1075
				1076	mov $(0x11223344+Cookie), %temp
				1077	lea -Cookie(%temp), %temp
				1078	mov %temp, <%Reg/Mem>
				1079
				1080	The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to
				1081	prevent unintended effects on the flags register.
				1082
				1083	This transformation has almost no effect on translation time, and about 1%
				1084	impact on code quality, depending on the threshold chosen. It does little to
				1085	reduce gadget persistence, but it does remove a lot of potential opportunities
				1086	to construct intra-instruction ROP gadgets (which an attacker could use only if
				1087	a validator bug were discovered, since the Native Client sandbox and associated
				1088	validator force returns and other indirect branches to be to bundle-aligned
				1089	addresses).
				1090
				1091	Constant pooling
				1092	~~~~~~~~~~~~~~~~
				1093
				1094	This is similar to constant blinding, in that large immediates are removed from
				1095	the text section. In this case, each unique constant above the threshold is
				1096	stored in a read-only data section and the constant is accessed via a memory
				1097	load. For the above example, the following code might be generated::
				1098
				1099	mov $Label$1, %temp
				1100	mov %temp, <%Reg/Mem>
				1101
				1102	This has a similarly small impact on translation time and ROP gadget
				1103	persistence, and a smaller (better) impact on code quality. This is because it
				1104	uses fewer instructions, and in some cases good register allocation leads to no
				1105	increase in instruction count. Note that this still gives an attacker some
				1106	limited amount of control over some text section values, unless we randomize the
				1107	constant pool layout.
				1108
				1109	Static data reordering
				1110	~~~~~~~~~~~~~~~~~~~~~~
				1111
				1112	This transformation limits the attacker's ability to control bits in global data
				1113	address references. It simply permutes the order in memory of global variables
				1114	and internal constant pool entries. For the constant pool, we only permute
				1115	within a type (i.e., emit a randomized list of ints, followed by a randomized
				1116	list of floats, etc.) to maintain good packing in the face of alignment
				1117	constraints.
				1118
				1119	As might be expected, this has no impact on code quality, translation time, or
				1120	ROP gadget persistence (though as above, it limits opportunities for
				1121	intra-instruction ROP gadgets with a broken validator).
				1122
				1123	Basic block reordering
				1124	~~~~~~~~~~~~~~~~~~~~~~
				1125
				1126	Here, we randomize the order of basic blocks within a function, with the
				1127	constraint that we still want to maintain a topological order as much as
				1128	possible, to avoid making the code too branchy.
				1129
				1130	This has no impact on code quality, and about 1% impact on translation time, due
				1131	to a separate pass to recompute layout. It ends up having a huge effect on ROP
				1132	gadget persistence, tied for best with nop insertion, reducing ROP gadget
				1133	persistence to less than 5%.
				1134
				1135	Function reordering
				1136	~~~~~~~~~~~~~~~~~~~
				1137
				1138	Here, we permute the order that functions are emitted, primarily to shift ROP
				1139	gadgets around to less predictable locations. It may also change call address
				1140	offsets in case the attacker was trying to control that offset in the code.
				1141
				1142	To control latency and memory footprint, we don't arbitrarily permute functions.
				1143	Instead, for some relatively small value of N, we queue up N assembler buffers,
				1144	and then emit the N functions in random order, and repeat until all functions
				1145	are emitted.
				1146
				1147	Function reordering has no impact on translation time or code quality.
				1148	Measurements indicate that it reduces ROP gadget persistence to about 15%.
				1149
				1150	Nop insertion
				1151	~~~~~~~~~~~~~
				1152
				1153	This diversification randomly adds a nop instruction after each regular
				1154	instruction, with some probability. Nop instructions of different lengths may
				1155	be selected. Nop instructions are never added inside a bundle_lock region.
				1156	Note that when sandboxing is enabled, nop instructions are already being added
				1157	for bundle alignment, so the diversification nop instructions may simply be
				1158	taking the place of alignment nop instructions, though distributed differently
				1159	through the bundle.
				1160
				1161	In Subzero's currently implementation, nop insertion adds 3-5% to the
				1162	translation time, but this is probably because it is implemented as a separate
				1163	pass that adds actual nop instructions to the IR. The overhead would probably
				1164	be a lot less if it were integrated into the assembler pass. The code quality
				1165	is also reduced by 3-5%, making nop insertion the most expensive of the
				1166	diversification techniques.
				1167
				1168	Nop insertion is very effective in reducing ROP gadget persistence, at the same
				1169	level as basic block randomization (less than 5%). But given nop insertion's
				1170	impact on translation time and code quality, one would most likely prefer to use
				1171	basic block randomization instead (though the combined effects of the different
				1172	diversification techniques have not yet been studied).
				1173
				1174	Register allocation randomization
				1175	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				1176
				1177	In this diversification, the register allocator tries to make different but
				1178	mostly functionally equivalent choices, while maintaining stable code quality.
				1179
				1180	A naive approach would be the following. Whenever the allocator has more than
				1181	one choice for assigning a register, choose randomly among those options. And
				1182	whenever there are no registers available and there is a tie for the
				1183	lowest-weight variable, randomly select one of the lowest-weight variables to
				1184	evict. Because of the one-pass nature of the linear-scan algorithm, this
				1185	randomization strategy can have a large impact on which variables are ultimately
				1186	assigned registers, with a corresponding large impact on code quality.
				1187
				1188	Instead, we choose an approach that tries to keep code quality stable regardless
				1189	of the random seed. We partition the set of physical registers into equivalence
				1190	classes. If a register is pre-colored in the function (i.e., referenced
				1191	explicitly by name), it forms its own equivalence class. The remaining
				1192	registers are partitioned according to their combination of attributes such as
				1193	integer versus floating-point, 8-bit versus 32-bit, caller-save versus
				1194	callee-saved, etc. Each equivalence class is randomly permuted, and the
				1195	complete permutation is applied to the final register assignments.
				1196
				1197	Register randomization reduces ROP gadget persistence to about 10% on average,
				1198	though there tends to be fairly high variance across functions and applications.
				1199	This probably has to do with the set of restrictions in the x86-32 instruction
				1200	set and ABI, such as few general-purpose registers, ``%eax`` used for return
				1201	values, ``%edx`` used for division, ``%cl`` used for shifting, etc. As
				1202	intended, register randomization has no impact on code quality, and a slight
				1203	(0.5%) impact on translation time due to an extra scan over the variables to
				1204	identify pre-colored registers.
				1205
				1206	Fuzzing
				1207	^^^^^^^
				1208
				1209	We have started fuzz-testing the ``.pexe`` files input to Subzero, using a
				1210	combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer
				1211	<http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling. The purpose is to
				1212	find and fix cases where Subzero crashes or otherwise ungracefully fails on
				1213	unexpected inputs, and to do so automatically over a large range of unexpected
				1214	inputs. By fixing bugs that arise from fuzz testing, we reduce the possibility
				1215	of an attacker exploiting these bugs.
				1216
				1217	Most of the problems found so far are ones most appropriately handled in the
				1218	parser. However, there have been a couple that have identified problems in the
				1219	lowering, or otherwise inappropriately triggered assertion failures and fatal
				1220	errors. We continue to dig into this area.
				1221
				1222	Future security work
				1223	^^^^^^^^^^^^^^^^^^^^
				1224
				1225	Subzero is well-positioned to explore other future security enhancements, e.g.:
				1226
				1227	- Tightening the Native Client sandbox. ABI changes, such as the previous work
				1228	on `hiding the sandbox base address
				1229	<https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_
				1230	in x86-64, are easy to experiment with in Subzero.
				1231
				1232	- Making the executable code section read-only. This would prevent a PNaCl
				1233	application from inspecting its own binary and trying to find ROP gadgets even
				1234	after code diversification has been performed. It may still be susceptible to
				1235	`blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks,
				1236	security is still overall improved.
				1237
				1238	- Instruction selection diversification. It may be possible to lower a given
				1239	instruction in several largely equivalent ways, which gives more opportunities
				1240	for code randomization.
				1241
				1242	Chrome integration
				1243	------------------
				1244
				1245	Currently Subzero is available in Chrome for the x86-32 architecture, but under
				1246	a flag. When the flag is enabled, Subzero is used when the `manifest file
				1247	<https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_
				1248	linking to the ``.pexe`` file specifies the ``O0`` optimization level.
				1249
				1250	The next step is to remove the flag, i.e. invoke Subzero as the only translator
				1251	for ``O0``-specified manifest files.
				1252
				1253	Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
				1254	case Subzero could be used for all PNaCl translation.
				1255
				1256	Command line options
				1257	--------------------
				1258
				1259	Subzero has a number of command-line options for debugging and diagnostics.
				1260	Among the more interesting are the following.
				1261
				1262	- Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other
				1263	diagnostic output, with various levels of detail after each pass. Instruction
				1264	numbers can be printed or suppressed. Deleted instructions can be printed or
				1265	suppressed (they are retained in the instruction list, as discussed earlier,
				1266	because they can help explain how lower-level instructions originated).
				1267	Liveness information can be printed when available. Details of register
				1268	allocation can be printed as register allocator decisions are made. And more.
				1269
				1270	- Running Subzero with any level of verbosity produces an enormous amount of
				1271	output. When debugging a single function, verbose output can be suppressed
				1272	except for a particular function. The ``-verbose-focus`` flag suppresses
				1273	verbose output except for the specified function.
				1274
				1275	- Subzero has a ``-timing`` option that prints a breakdown of pass-level timing
				1276	at exit. Timing markers can be placed in the Subzero source code to demarcate
				1277	logical operations or passes of interest. Basic timing information plus
				1278	call-stack type timing information is printed at the end.
				1279
				1280	- Along with ``-timing``, the user can instead get a report on the overall
				1281	translation time for each function, to help focus on timing outliers. Also,
				1282	``-timing-focus`` limits the ``-timing`` reporting to a single function,
				1283	instead of aggregating pass timing across all functions.
				1284
				1285	- The ``-szstats`` option reports various statistics on each function, such as
				1286	stack frame size, static instruction count, etc. It may be helpful to track
				1287	these stats over time as Subzero is improved, as an approximate measure of
				1288	code quality.
				1289
				1290	- The flag ``-asm-verbose``, in conjunction with emitting textual assembly
				1291	output, annotate the assembly output with register-focused liveness
				1292	information. In particular, each basic block is annotated with which
				1293	registers are live-in and live-out, and each instruction is annotated with
				1294	which registers' and stack locations' live ranges end at that instruction.
				1295	This is really useful when studying the generated code to find opportunities
				1296	for code quality improvements.
				1297
				1298	Testing and debugging
				1299	---------------------
				1300
				1301	LLVM lit tests
				1302	^^^^^^^^^^^^^^
				1303
				1304	For basic testing, Subzero uses LLVM's `lit
				1305	<http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests. We
				1306	have a suite of hundreds of small functions where we test for particular
				1307	assembly code patterns across different target architectures.
				1308
				1309	Cross tests
				1310	^^^^^^^^^^^
				1311
				1312	Unfortunately, the lit tests don't do a great job of precisely testing the
				1313	correctness of the output. Much better are the cross tests, which are execution
				1314	tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide
				1315	variety of interesting inputs. Each cross test consists of a set of C, C++,
				1316	and/or low-level bitcode files. The C and C++ source files are compiled down to
				1317	bitcode. The bitcode files are translated by ``pnacl-llc`` and also by Subzero.
				1318	Subzero mangles global symbol names with a special prefix to avoid duplicate
				1319	symbol errors. A driver program invokes both versions on a large set of
				1320	interesting inputs, and reports when the Subzero and ``pnacl-llc`` results
				1321	differ. Cross tests turn out to be an excellent way of testing the basic
				1322	lowering patterns, but they are less useful for testing more global things like
				1323	liveness analysis and register allocation.
				1324
				1325	Bisection debugging
				1326	^^^^^^^^^^^^^^^^^^^
				1327
				1328	Sometimes with a new application, Subzero will end up producing incorrect code
				1329	that either crashes at runtime or otherwise produces the wrong results. When
				1330	this happens, we need to narrow it down to a single function (or small set of
				1331	functions) that yield incorrect behavior. For this, we have a bisection
				1332	debugging framework. Here, we initially translate the entire application once
				1333	with Subzero and once with ``pnacl-llc``. We then use ``objdump`` to
				1334	selectively weaken symbols based on a whitelist or blacklist provided on the
				1335	command line. The two object files can then be linked together without link
				1336	errors, with the desired version of each method "winning". Then the binary is
				1337	tested, and bisection proceeds based on whether the binary produces correct
				1338	output.
				1339
				1340	When the bisection completes, we are left with a minimal set of
				1341	Subzero-translated functions that cause the failure. Usually it is a single
				1342	function, though sometimes it might require a combination of several functions
				1343	to cause a failure; this may be due to an incorrect call ABI, for example.
				1344	However, Murphy's Law implies that the single failing function is enormous and
				1345	impractical to debug. In that case, we can restart the bisection, explicitly
				1346	blacklisting the enormous function, and try to find another candidate to debug.
				1347	(Future work is to automate this to find all minimal sets of functions, so that
				1348	debugging can focus on the simplest example.)
				1349
				1350	Fuzz testing
				1351	^^^^^^^^^^^^
				1352
				1353	As described above, we try to find internal Subzero bugs using fuzz testing
				1354	techniques.
				1355
				1356	Sanitizers
				1357	^^^^^^^^^^
				1358
				1359	Subzero can be built with `AddressSanitizer
				1360	<http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer
				1361	<http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support. This is
				1362	done using something as simple as ``make ASAN=1`` or ``make TSAN=1``. So far,
				1363	multithreading has been simple enough that TSan hasn't found any bugs, but ASan
				1364	has found at least one memory leak which was subsequently fixed.
				1365	`UndefinedBehaviorSanitizer
				1366	<http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_
				1367	(UBSan) support is in progress. `Control flow integrity sanitization
				1368	<http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under
				1369	consideration.
				1370
				1371	Current status
				1372	==============
				1373
				1374	Target architectures
				1375	--------------------
				1376
				1377	Subzero is currently more or less complete for the x86-32 target. It has been
				1378	refactored and extended to handle x86-64 as well, and that is mostly complete at
				1379	this point.
				1380
				1381	ARM32 work is in progress. It currently lacks the testing level of x86, at
				1382	least in part because Subzero's register allocator needs modifications to handle
				1383	ARM's aliasing of floating point and vector registers. Specifically, a 64-bit
				1384	register is actually a gang of two consecutive and aligned 32-bit registers, and
				1385	a 128-bit register is a gang of 4 consecutive and aligned 32-bit registers.
				1386	ARM64 work has not started; when it does, it will be native-only since the
				1387	Native Client sandbox model, validator, and other tools have never been defined.
				1388
				1389	An external contributor is adding MIPS support, in most part by following the
				1390	ARM work.
				1391
				1392	Translator performance
				1393	----------------------
				1394
				1395	Single-threaded translation speed is currently about 5× the ``pnacl-llc``
				1396	translation speed. For a large ``.pexe`` file, the time breaks down as:
				1397
				1398	- 11% for parsing and initial IR building
				1399
				1400	- 4% for emitting to /dev/null
				1401
				1402	- 27% for liveness analysis (two liveness passes plus live range construction)
				1403
				1404	- 15% for linear-scan register allocation
				1405
				1406	- 9% for basic lowering
				1407
				1408	- 10% for advanced phi lowering
				1409
				1410	- ~11% for other minor analysis
				1411
				1412	- ~10% measurement overhead to acquire these numbers
				1413
				1414	Some improvements could undoubtedly be made, but it will be hard to increase the
				1415	speed to 10× of ``pnacl-llc`` while keeping acceptable code quality. With
				1416	``-Om1`` (lack of) optimization, we do actually achieve roughly 10×
				1417	``pnacl-llc`` translation speed, but code quality drops by a factor of 3.
				1418
				1419	Code quality
				1420	------------
				1421
				1422	Measured across 16 components of spec2k, Subzero's code quality is uniformly
				1423	better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly
				1424	between ``pnacl-llc`` ``-O0`` and ``-O2``.
				1425
				1426	Translator size
				1427	---------------
				1428
				1429	When built in MINIMAL mode, the x86-64 native translator size for the x86-32
				1430	target is about 700 KB, not including the size of functions referenced in
				1431	dynamically-linked libraries. The sandboxed version of Subzero is a bit over 1
				1432	MB, and it is statically linked and also includes nop padding for bundling as
				1433	well as indirect branch masking.
				1434
				1435	Translator memory footprint
				1436	---------------------------
				1437
				1438	It's hard to draw firm conclusions about memory footprint, since the footprint
				1439	is at least proportional to the input function size, and there is no real limit
				1440	on the size of functions in the ``.pexe`` file.
				1441
				1442	That said, we looked at the memory footprint over time as Subzero translated
				1443	``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our
				1444	disposal. One of LLVM's libraries that Subzero uses can report the current
				1445	malloc heap usage. With single-threaded translation, Subzero tends to hover
				1446	around 15 MB of memory usage. There are a couple of monstrous functions where
				1447	Subzero grows to around 100 MB, but then it drops back down after those
				1448	functions finish translating. In contrast, ``pnacl-llc`` grows larger and
				1449	larger throughout translation, reaching several hundred MB by the time it
				1450	completes.
				1451
				1452	It's a bit more interesting when we enable multithreaded translation. When
				1453	there are N translation threads, Subzero implements a policy that limits the
				1454	size of the translation queue to N entries -- if it is "full" when the parser
				1455	tries to add a new CFG, the parser blocks until one of the translation threads
				1456	removes a CFG. This means the number of in-memory CFGs can (and generally does)
				1457	reach 2*N+1, and so the memory footprint rises in proportion to the number of
				1458	threads. Adding to the pressure is the observation that the monstrous functions
				1459	also take proportionally longer time to translate, so there's a good chance many
				1460	of the monstrous functions will be active at the same time with multithreaded
				1461	translation. As a result, for N=32, Subzero's memory footprint peaks at about
				1462	260 MB, but drops back down as the large functions finish translating.
				1463
				1464	If this peak memory size becomes a problem, it might be possible for the parser
				1465	to resequence the functions to try to spread out the larger functions, or to
				1466	throttle the translation queue to prevent too many in-flight large functions.
				1467	It may also be possible to throttle based on memory pressure signaling from
				1468	Chrome.
				1469
				1470	Translator scalability
				1471	----------------------
				1472
				1473	Currently scalability is "not very good". Multiple translation threads lead to
				1474	faster translation, but not to the degree desired. We haven't dug in to
				1475	investigate yet.
				1476
				1477	There are a few areas to investigate. First, there may be contention on the
				1478	constant pool, which all threads access, and which requires locked access even
				1479	for reading. This could be mitigated by keeping a CFG-local cache of the most
				1480	common constants.
				1481
				1482	Second, there may be contention on memory allocation. While almost all CFG
				1483	objects are allocated from the CFG-local allocator, some passes use temporary
				1484	STL containers that use the default allocator, which may require global locking.
				1485	This could be mitigated by switching these to the CFG-local allocator.
				1486
				1487	Third, multithreading may make the default allocator strategy more expensive.
				1488	In a single-threaded environment, a pass will allocate its containers, run the
				1489	pass, and deallocate the containers. This results in stack-like allocation
				1490	behavior and makes the heap free list easier to manage, with less heap
				1491	fragmentation. But when multithreading is added, the allocations and
				1492	deallocations become much less stack-like, making allocation and deallocation
				1493	operations individually more expensive. Again, this could be mitigated by
				1494	switching these to the CFG-local allocator.