Jim Stichnoth | efb8971 | 2015-09-03 13:19:54 -0700

Design of the Subzero fast code generator
=========================================

Introduction
------------

The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
compiler technology based on `LLVM <http://llvm.org/>`_.  The developer uses the
PNaCl toolchain to compile their application to architecture-neutral PNaCl
bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
possible.  The ``.pexe`` file is downloaded to the user's browser where the
PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
`sandboxed
<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
native code.  The translator uses architecture-specific optimizations as much as
practical to generate good native code.

The native code can be cached by the browser to avoid repeating translation on
future page loads.  However, first-time user experience is hampered by long
translation times.  The LLVM-based PNaCl translator is pretty slow, even when
using ``-O0`` to minimize optimizations, so delays are especially noticeable on
slow browser platforms such as ARM-based Chromebooks.

Translator slowness can be mitigated or hidden in a number of ways.

- Parallel translation.  However, slow machines where this matters most, e.g.
  ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
  and are likely to have less memory available for multiple translation threads
  to use.

- Streaming translation, i.e. start translating as soon as the download starts.
  This doesn't help much when translation speed is 10× slower than download
  speed, or when the ``.pexe`` file is already cached while the translated
  binary was flushed from the cache.

- Arrange the web page such that translation is done in parallel with
  downloading large assets.

- Arrange the web page to distract the user with `cat videos
  <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
  progress.

Or, improve translator performance to something more reasonable.

This document describes Subzero's attempt to improve translation speed by an
order of magnitude while rivaling LLVM's code quality.  Subzero does this
through minimal IR layering, lean data structures and passes, and a careful
selection of fast optimization passes.  It has two optimization recipes: full
optimizations (``O2``) and minimal optimizations (``Om1``).  The recipes are the
following (described in more detail below):

+----------------------------------------+-----------------------------+
| O2 recipe                              | Om1 recipe                  |
+========================================+=============================+
| Parse .pexe file                       | Parse .pexe file            |
+----------------------------------------+-----------------------------+
| Loop nest analysis                     |                             |
+----------------------------------------+-----------------------------+
| Local common subexpression elimination |                             |
+----------------------------------------+-----------------------------+
| Address mode inference                 |                             |
+----------------------------------------+-----------------------------+
| Read-modify-write (RMW) transform      |                             |
+----------------------------------------+-----------------------------+
| Basic liveness analysis                |                             |
+----------------------------------------+-----------------------------+
| Load optimization                      |                             |
+----------------------------------------+-----------------------------+
|                                        | Phi lowering (simple)       |
+----------------------------------------+-----------------------------+
| Target lowering                        | Target lowering             |
+----------------------------------------+-----------------------------+
| Full liveness analysis                 |                             |
+----------------------------------------+-----------------------------+
| Register allocation                    | Minimal register allocation |
+----------------------------------------+-----------------------------+
| Phi lowering (advanced)                |                             |
+----------------------------------------+-----------------------------+
| Post-phi register allocation           |                             |
+----------------------------------------+-----------------------------+
| Branch optimization                    |                             |
+----------------------------------------+-----------------------------+
| Code emission                          | Code emission               |
+----------------------------------------+-----------------------------+

Goals
=====

Translation speed
-----------------

We'd like to be able to translate a ``.pexe`` file as fast as download speed.
Any faster is in a sense wasted effort.  Download speed varies greatly, but
we'll arbitrarily say 1 MB/sec.  We'll pick the ARM A15 CPU as the example of a
slow machine.  We observe a 3× single-thread performance difference between A15
and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
``.pexe`` file could be compressed to 50% on the web server using gzip transport
compression, so we set the translation speed goal to 6 MB/sec on the high-end
Xeon workstation.

Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
⅒ the target rate.  The ``-O2`` mode takes 3× as long as the ``-O0`` mode.

In other words, Subzero's goal is to improve over LLVM's translation speed by
10×.

Code quality
------------

Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
code quality.  The stretch goal is to approach LLVM ``-O2`` code quality.  On
average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.

It's important to note that the quality of Subzero-generated code depends on
target-neutral optimizations and simplifications being run beforehand in the
developer environment.  The ``.pexe`` file reflects these optimizations.  For
example, Subzero assumes that the basic blocks are ordered topologically where
possible (which makes liveness analysis converge fastest), and Subzero does not
do any function inlining because it should already have been done.

Translator size
---------------

The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
We think 1 MB is a more reasonable size -- especially for a component that is
distributed to a billion Chrome users.  Thus we target a 10× reduction in
binary size.

For development, Subzero can be built for all target architectures, with all
debugging and diagnostic options enabled.  For a smaller translator, we restrict
it to a single target architecture, and define a ``MINIMAL`` build where
unnecessary features are compiled out.

Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
include directories, which have little impact on translator size.  It also uses
some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
with little size impact.  In non-``MINIMAL`` builds, the translator size is much
larger due to including code for parsing text-format bitcode files and forming
LLVM IR.

Memory footprint
----------------

The current LLVM-based translator suffers from an issue in which some
function-specific data has to be retained in memory until all translation
completes, and therefore the memory footprint grows without bound.  Large
``.pexe`` files can lead to the translator process holding hundreds of MB of
memory by the end.  The translator runs in a separate process, so this memory
growth doesn't *directly* affect other processes, but it does dirty the physical
memory and contributes to a perception of bloat and sometimes a reality of
out-of-memory tab killing, especially noticeable on weaker systems.

Subzero should maintain a stable memory footprint throughout translation.  It's
not practical to set a specific limit, because there is no practical limit on a
single function's size, but the footprint should be "reasonable" and be
proportional to the largest input function size, not the total ``.pexe`` file
size.  Simply put, Subzero should not have memory leaks or inexorable memory
growth.  (We use ASAN builds to test for leaks.)

Multithreaded translation
-------------------------

It should be practical to translate different functions concurrently and see
good scalability.  Some locking may be needed, such as accessing output buffers
or constant pools, but that should be fairly minimal.  In contrast, LLVM was
only designed for module-level parallelism, and as such, the PNaCl translator
internally splits a ``.pexe`` file into several modules for concurrent
translation.  All output needs to be deterministic regardless of the level of
multithreading, i.e. functions and data should always be output in the same
order.

Target architectures
--------------------

Initial target architectures are x86-32, x86-64, ARM32, and MIPS32.  Future
targets include ARM64 and MIPS64, though these targets lack NaCl support
including a sandbox model or a validator.

The first implementation is for x86-32, because it was expected to be
particularly challenging, and thus more likely to draw out any design problems
early:

- There are a number of special cases, asymmetries, and warts in the x86
  instruction set.

- Complex addressing modes may be leveraged for better code quality.

- 64-bit integer operations have to be lowered into longer sequences of 32-bit
  operations.

- Paucity of physical registers may reveal code quality issues early in the
  design.

Detailed design
===============

Intermediate representation - ICE
---------------------------------

Subzero's IR is called ICE.  It is designed to be reasonably similar to LLVM's
IR, which is reflected in the ``.pexe`` file's bitcode structure.  It has a
representation of global variables and initializers, and a set of functions.
Each function contains a list of basic blocks, and each basic block contains a
list of instructions.  Instructions that operate on stack and register variables
do so using static single assignment (SSA) form.

The ``.pexe`` file is translated one function at a time (or in parallel by
multiple translation threads).  The recipe for optimization passes depends on
the specific target and optimization level, and is described in detail below.
Global variables (types and initializers) are simply and directly translated to
object code, without any meaningful attempts at optimization.

A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
Its key contents include:

- A list of ``CfgNode`` pointers, generally held in topological order.

- A list of ``Variable`` pointers corresponding to local variables used in the
  function plus compiler-generated temporaries.

A basic block is represented by the ``Ice::CfgNode`` class.  Its key contents
include:

- A linear list of instructions, in the same style as LLVM.  The last
  instruction of the list is always a terminator instruction: branch, switch,
  return, unreachable.

- A list of Phi instructions, also in the same style as LLVM.  They are held as
  a linear list for convenience, though per Phi semantics, they are executed "in
  parallel" without dependencies on each other.

- An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
  another list for outgoing edges.

- The node's unique, 0-based index into the CFG's node list.

An instruction is represented by the ``Ice::Inst`` class.  Its key contents
include:

- A list of source operands.

- Its destination variable, if the instruction produces a result in an
  ``Ice::Variable``.

- A bitvector indicating which variables' live ranges this instruction ends.
  This is computed during liveness analysis.

Instruction kinds are divided into high-level ICE instructions and low-level
ICE instructions.  High-level instructions consist of the PNaCl/LLVM bitcode
instruction kinds.  Each target architecture implementation extends the
instruction space with its own set of low-level instructions.  Generally,
low-level instructions correspond to individual machine instructions.  The
high-level ICE instruction space includes a few additional instruction kinds
that are not part of LLVM but are generally useful (e.g., an Assignment
instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
instructions for sandboxing).

Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
the following:

- Alloca: allocate data on the stack

- Arithmetic: binary operations of the form ``A = B op C``

- Br: conditional or unconditional branch

- Call: function call

- Cast: unary type-conversion operations

- ExtractElement: extract a scalar element from a vector-type value

- Fcmp: floating-point comparison

- Icmp: integer comparison

- IntrinsicCall: call a known intrinsic

- InsertElement: insert a scalar element into a vector-type value

- Load: load a value from memory

- Phi: implement the SSA phi node

- Ret: return from the function

- Select: essentially the C language operation of the form ``X = C ? Y : Z``

- Store: store a value into memory

- Switch: generalized branch to multiple possible locations

- Unreachable: indicate that this portion of the code is unreachable

The additional high-level ICE instructions are the following:

- Assign: a simple ``A=B`` assignment.  This is useful for e.g. lowering Phi
  instructions to non-SSA assignments, before lowering to machine code.

- BundleLock, BundleUnlock.  These are markers used for sandboxing, but are
  common across all targets and so they are elevated to the high-level
  instruction set.

- FakeDef, FakeUse, FakeKill.  These are tools used to preserve consistency in
  liveness analysis, elevated to the high-level because they are used by all
  targets.  They are described in more detail at the end of this section.

- JumpTable: this represents the result of switch optimization analysis, where
  some switch instructions may use jump tables instead of cascading
  compare/branches.

An operand is represented by the ``Ice::Operand`` class.  In high-level ICE, an
operand is either an ``Ice::Constant`` or an ``Ice::Variable``.  Constants
include scalar integer constants, scalar floating point constants, Undef (an
unspecified constant of a particular scalar or vector type), and symbol
constants (essentially addresses of globals).  Note that the PNaCl ABI does not
include vector-type constants besides Undef, and as such, Subzero (so far) has
no reason to represent vector-type constants internally.  A variable represents
a value allocated on the stack (though not including alloca-derived storage).
Among other things, a variable holds its unique, 0-based index into the CFG's
variable list.

Each target can extend the ``Constant`` and ``Variable`` classes for its own
needs.  In addition, the ``Operand`` class may be extended, e.g. to define an
x86 ``MemOperand`` that encodes a base register, an index register, an index
register shift amount, and a constant offset.

Register allocation and liveness analysis are restricted to Variable operands.
Because of the importance of register allocation to code quality, and the
translation-time cost of liveness analysis, Variable operands get some special
treatment in ICE.  Most notably, a frequent pattern in Subzero is to iterate
across all the Variables of an instruction.  An instruction holds a list of
operands, but an operand may contain 0, 1, or more Variables.  As such, the
``Operand`` class specially holds a list of Variables contained within, for
quick access.

A Subzero transformation pass may work by deleting an existing instruction and
replacing it with zero or more new instructions.  Instead of actually deleting
the existing instruction, we generally mark it as deleted and insert the new
instructions right after the deleted instruction.  When printing the IR for
debugging, this is a big help because it makes it much clearer how the
non-deleted instructions came about.

Subzero has a few special instructions to help with liveness analysis
consistency.

- The FakeDef instruction gives a fake definition of some variable.  For
  example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
  but an ICE instruction can represent only one destination variable.  This is
  similar for multiply instructions, and for function calls that return a 64-bit
  integer result in the ``%edx:%eax`` pair.  Also, using the ``xor %eax, %eax``
  trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.

- The FakeUse instruction registers a use of a variable, typically to prevent an
  earlier assignment to that variable from being dead-code eliminated.  For
  example, lowering an operation like ``x=cc?y:z`` may be done using x86's
  conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``.  Without a
  FakeUse of ``x`` between the two instructions, the liveness analysis pass may
  dead-code eliminate the first instruction.

- The FakeKill instruction is added after a call instruction, and is a quick way
  of indicating that caller-save registers are invalidated.

Pexe parsing
------------

Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files.  It
parses the ``.pexe`` file function by function, ultimately constructing an ICE
CFG for each function.  After a function is parsed, its CFG is handed off to the
translation phase.  The bitcode parser also parses global initializer data and
hands it off to be translated to data sections in the object file.

Subzero has another parsing strategy for testing/debugging.  LLVM libraries can
be used to parse a module into LLVM IR (though very slowly relative to Subzero
native parsing).  Then we iterate across the LLVM IR and construct high-level
ICE, handing off each CFG to the translation phase.

Overview of lowering
--------------------

In general, translation goes like this:

- Parse the next function from the ``.pexe`` file and construct a CFG consisting
  of high-level ICE.

- Do analysis passes and transformation passes on the high-level ICE, as
  desired.

- Lower each high-level ICE instruction into a sequence of zero or more
  low-level ICE instructions.  Each high-level instruction is generally lowered
  independently, though the target lowering is allowed to look ahead in the
  CfgNode's instruction list if desired.

- Do more analysis and transformation passes on the low-level ICE, as desired.

- Assemble the low-level CFG into an ELF object file (alternatively, a textual
  assembly file that is later assembled by some external tool).

- Repeat for all functions, and also produce object code for data such as global
  initializers and internal constant pools.

Currently there are two optimization levels: ``O2`` and ``Om1``.  For ``O2``,
the intention is to apply all available optimizations to get the best code
quality (though the initial code quality goal is measured against LLVM's ``O0``
code quality).  For ``Om1``, the intention is to apply as few optimizations as
possible and produce code as quickly as possible, accepting poor code quality.
``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
"sub-zero".

High-level debuggability of generated code is so far not a design requirement.
Subzero doesn't really do transformations that would obfuscate debugging; the
main thing might be that register allocation (including stack slot coalescing
for stack-allocated variables whose live ranges don't overlap) may render a
variable's value unobtainable after its live range ends.  This would not be an
issue for ``Om1`` since it doesn't register-allocate program-level variables,
nor does it coalesce stack slots.  That said, fully supporting debuggability
would require a few additions:

- DWARF support would need to be added to Subzero's ELF file emitter.  Subzero
  propagates global symbol names, local variable names, and function-internal
  label names that are present in the ``.pexe`` file.  This would allow a
  debugger to map addresses back to symbols in the ``.pexe`` file.

- To map ``.pexe`` file symbols back to meaningful source-level symbol names,
  file names, line numbers, etc., Subzero would need to handle `LLVM bitcode
  metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg``
  `intrinsics <http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.

- The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so
  the toolchain would need to be modified to preserve it.

Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
produces code with one third the code quality.  ``Om1`` is good for testing and
debugging -- during translation, it tends to expose errors in the basic lowering
that might otherwise have been hidden by the register allocator or other
optimization passes.  It also helps determine whether a code correctness problem
is a fundamental problem in the basic lowering, or an error in another
optimization pass.

The implementation of target lowering also controls the recipe of passes used
for ``Om1`` and ``O2`` translation.  For example, address mode inference may
only be relevant for x86.

Lowering strategy
-----------------

The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
each high-level instruction down to a sequence of low-level target-specific
instructions, in a largely context-free setting.  That is, each high-level
instruction conceptually has a simple template expansion into low-level
instructions, and lowering can in theory be done in any order.  This may sound
like a small effort, but quite a large number of templates may be needed because
of the number of PNaCl types and instruction variants.  Furthermore, there may
be optimized templates, e.g. to take advantage of operator commutativity (for
example, ``x=x+1`` might allow a better lowering than ``x=1+x``).  This is
similar to other template-based approaches in fast code generation or
interpretation, though some decisions are deferred until after some global
analysis passes, mostly related to register allocation, stack slot assignment,
and specific choice of instruction variant and addressing mode.

The key idea for a lowering template is to produce valid low-level instructions
that are guaranteed to meet address mode and other structural requirements of
the instruction set.  For example, on x86, the source operand of an integer
store instruction must be an immediate or a physical register; a shift
instruction's shift amount must be an immediate or in register ``%cl``; a
function's integer return value is in ``%eax``; most x86 instructions are
two-operand, in contrast to corresponding three-operand high-level instructions;
etc.

Because target lowering runs before register allocation, there is no way to know
whether a given ``Ice::Variable`` operand lives on the stack or in a physical
register.  When the low-level instruction calls for a physical register operand,
the target lowering can create an infinite-weight Variable.  This tells the
register allocator to assign infinite weight when making decisions, effectively
guaranteeing some physical register.  Variables can also be pre-colored to a
specific physical register (``cl`` in the shift example above), which also gives
infinite weight.

To illustrate, consider a high-level arithmetic instruction on 32-bit integer
operands::

    A = B + C

X86 target lowering might produce the following::

    T.inf = B  // mov instruction
    T.inf += C // add instruction
    A = T.inf  // mov instruction

Here, ``T.inf`` is an infinite-weight temporary.  As long as ``T.inf`` has a
physical register, the three lowered instructions are all encodable regardless
of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
whether ``A`` is a physical register or in memory.

In this example, ``A`` must be a Variable and one may be tempted to simplify the
lowering sequence by setting ``A`` as infinite-weight and using::

    A = B  // mov instruction
    A += C // add instruction

This has two problems.  First, if the original instruction was actually ``A =
B + A``, the result would be incorrect.  Second, assigning ``A`` a physical
register applies throughout ``A``'s entire live range.  This is probably not
what is intended, and may ultimately lead to a failure to allocate a register
for an infinite-weight variable.

This style of lowering leads to many temporaries being generated, so in ``O2``
mode, we rely on the register allocator to clean things up.  For example, in the
example above, if ``B`` ends up getting a physical register and its live range
ends at this instruction, the register allocator is likely to reuse that
register for ``T.inf``.  This leads to ``T.inf=B`` being a redundant register
copy, which is removed as an emission-time peephole optimization.

O2 lowering
-----------

Currently, the ``O2`` lowering recipe is the following:

- Loop nest analysis

- Local common subexpression elimination

- Address mode inference

- Read-modify-write (RMW) transformation

- Basic liveness analysis

- Load optimization

- Target lowering

- Full liveness analysis

- Register allocation

- Phi instruction lowering (advanced)

- Post-phi lowering register allocation

- Branch optimization

These passes are described in more detail below.

Om1 lowering
------------

Currently, the ``Om1`` lowering recipe is the following:

- Phi instruction lowering (simple)

- Target lowering

- Register allocation (infinite-weight and pre-colored only)

Optimization passes
-------------------

Liveness analysis
^^^^^^^^^^^^^^^^^

Liveness analysis is a standard dataflow optimization, implemented as follows.
For each node (basic block), its live-out set is computed as the union of the
live-in sets of its successor nodes.  Then the node's instructions are processed
in reverse order, updating the live set, until the beginning of the node is
reached, and the node's live-in set is recorded.  If this iteration has changed
the node's live-in set, the node's predecessors are marked for reprocessing.
This continues until no more nodes need reprocessing.  If nodes are processed in
reverse topological order, the number of iterations over the CFG is generally
equal to the maximum loop nest depth.

| 574 | To implement this, each node records its live-in and live-out sets, initialized |
| 575 | to the empty set. Each instruction records which of its Variables' live ranges |
| 576 | end in that instruction, initialized to the empty set. A side effect of |
| 577 | liveness analysis is dead instruction elimination. Each instruction can be |
| 578 | marked as tentatively dead, and after the algorithm converges, the tentatively |
| 579 | dead instructions are permanently deleted. |
| 580 | |
| 581 | Optionally, after this liveness analysis completes, we can do live range |
| 582 | construction, in which we calculate the live range of each variable in terms of |
| 583 | instruction numbers. A live range is represented as a union of segments, where |
| 584 | the segment endpoints are instruction numbers. Instruction numbers are required |
| 585 | to be unique across the CFG, and monotonically increasing within a basic block. |
| 586 | As a union of segments, live ranges can contain "gaps" and are therefore |
| 587 | precise. Because of SSA properties, a variable's live range can start at most |
| 588 | once in a basic block, and can end at most once in a basic block. Liveness |
| 589 | analysis keeps track of which variable/instruction tuples begin live ranges and |
| 590 | end live ranges, and combined with live-in and live-out sets, we can efficiently |
| 591 | build up live ranges of all variables across all basic blocks. |
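To make the segment-union representation concrete, here is a minimal Python sketch of a live range built from ``[start, end]`` instruction-number segments. The class and method names are illustrative, not Subzero's.

```python
class LiveRange:
    def __init__(self):
        self.segments = []  # sorted, disjoint [start, end] pairs

    def add_segment(self, start, end):
        # Instruction numbers increase monotonically, so appending keeps
        # order; merge with the previous segment when they touch.
        if self.segments and self.segments[-1][1] >= start:
            self.segments[-1][1] = max(self.segments[-1][1], end)
        else:
            self.segments.append([start, end])

    def overlaps(self, other):
        # Walk both sorted segment lists in tandem.
        i = j = 0
        while i < len(self.segments) and j < len(other.segments):
            a, b = self.segments[i], other.segments[j]
            if a[0] <= b[1] and b[0] <= a[1]:
                return True
            if a[1] < b[1]:
                i += 1
            else:
                j += 1
        return False

# A variable live during instructions 10..14 and again 30..32, with a gap.
r1 = LiveRange(); r1.add_segment(10, 14); r1.add_segment(30, 32)
r2 = LiveRange(); r2.add_segment(15, 29)  # fits entirely inside r1's gap
print(r1.overlaps(r2))  # the gap makes the ranges precise: no overlap
```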
| 592 | |
| 593 | A lot of care is taken to try to make liveness analysis fast and efficient. |
| 594 | Because of the lowering strategy, the number of variables is generally |
| 595 | proportional to the number of instructions, leading to an O(N^2) complexity |
| 596 | algorithm if implemented naively. To improve things based on sparsity, we note |
| 597 | that most variables are "local" and referenced in at most one basic block (in |
| 598 | contrast to the "global" variables with multi-block usage), and therefore cannot |
| 599 | be live across basic blocks. Therefore, the live-in and live-out sets, |
| 600 | typically represented as bit vectors, can be limited to the set of global |
| 601 | variables, and the intra-block liveness bit vector can be compacted to hold the |
| 602 | global variables plus the local variables for that block. |
| 603 | |
| 604 | Register allocation |
| 605 | ^^^^^^^^^^^^^^^^^^^ |
| 606 | |
| 607 | Subzero implements a simple linear-scan register allocator, based on the |
| 608 | allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan |
| 609 | Register Allocation in the Context of SSA Form and Register Constraints |
| 610 | <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. This allocator has |
| 611 | several nice features: |
| 612 | |
| 613 | - Live ranges are represented as unions of segments, as described above, rather |
| 614 | than a single start/end tuple. |
| 615 | |
| 616 | - It allows pre-coloring of variables with specific physical registers. |
| 617 | |
| 618 | - It applies equally well to pre-lowered Phi instructions. |
| 619 | |
| 620 | The paper suggests an approach of aggressively coalescing variables across Phi |
| 621 | instructions (i.e., trying to force Phi source and destination variables to have |
| 622 | the same register assignment), but we reject that in favor of the more natural |
| 623 | preference mechanism described below. |
| 624 | |
| 625 | We enhance the algorithm in the paper with the capability of automatic inference |
| 626 | of register preference, and with the capability of allowing overlapping live |
| 627 | ranges to safely share the same register in certain circumstances. If we are |
| 628 | considering register allocation for variable ``A``, and ``A`` has a single |
| 629 | defining instruction ``A=B+C``, then the preferred register for ``A``, if |
| 630 | available, would be the register assigned to ``B`` or ``C``, if any, provided |
| 631 | that ``B`` or ``C``'s live range does not overlap ``A``'s live range. In this |
| 632 | way we infer a good register preference for ``A``. |
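The preference inference for ``A = B + C`` can be sketched as follows. The names, the single-segment ranges, and the data layout are all hypothetical simplifications, not Subzero's implementation.

```python
def preferred_register(dest, sources, assignment, ranges, overlaps):
    """Return a preferred physical register for `dest`, or None."""
    for src in sources:
        reg = assignment.get(src)
        # Prefer a source's register only if the live ranges don't overlap.
        if reg is not None and not overlaps(ranges[dest], ranges[src]):
            return reg
    return None

# Single-segment live ranges as (start, end) instruction numbers.
overlaps = lambda r1, r2: r1[0] <= r2[1] and r2[0] <= r1[1]
ranges = {"A": (5, 9), "B": (1, 4), "C": (2, 7)}
assignment = {"B": "eax", "C": "ecx"}
# B's range ends before A's begins, so A can inherit eax; C overlaps A.
print(preferred_register("A", ["B", "C"], assignment, ranges, overlaps))
```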
| 633 | |
| 634 | We allow overlapping live ranges to get the same register in certain cases. |
| 635 | Suppose a high-level instruction like:: |
| 636 | |
| 637 | A = unary_op(B) |
| 638 | |
| 639 | has been target-lowered like:: |
| 640 | |
| 641 | T.inf = B |
| 642 | A = unary_op(T.inf) |
| 643 | |
| 644 | Further, assume that ``B``'s live range continues beyond this instruction |
| 645 | sequence, and that ``B`` has already been assigned some register. Normally, we |
| 646 | might want to infer ``B``'s register as a good candidate for ``T.inf``, but it |
| 647 | turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have |
| 648 | different registers. But ``T.inf`` is just a read-only copy of ``B`` that is |
| 649 | guaranteed to be in a register, so in theory these overlapping live ranges could |
| 650 | safely have the same register. Our implementation allows this overlap as long |
| 651 | as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never |
| 652 | modified within ``T.inf``'s live range. |
| 653 | |
| 654 | Subzero's register allocator can be run in 3 configurations. |
| 655 | |
| 656 | - Normal mode. All Variables are considered for register allocation. It |
| 657 | requires full liveness analysis and live range construction as a prerequisite. |
| 658 | This is used by ``O2`` lowering. |
| 659 | |
| 660 | - Minimal mode. Only infinite-weight or pre-colored Variables are considered. |
| 661 | All other Variables are stack-allocated. It does not require liveness |
| 662 | analysis; instead, it quickly scans the instructions and records first |
| 663 | definitions and last uses of all relevant Variables, using that to construct a |
| 664 | single-segment live range. Although this includes most of the Variables, the |
| 665 | live ranges are mostly simple, short, and rarely overlapping, which the |
| 666 | register allocator handles efficiently. This is used by ``Om1`` lowering. |
| 667 | |
| 668 | - Post-phi lowering mode. Advanced phi lowering is done after normal-mode |
| 669 | register allocation, and may result in new infinite-weight Variables that need |
| 670 | registers. One would like to just run something like minimal mode to assign |
| 671 | registers to the new Variables while respecting existing register allocation |
| 672 | decisions. However, it sometimes happens that there are no free registers. |
| 673 | In this case, some register needs to be forcibly spilled to the stack and |
| 674 | temporarily reassigned to the new Variable, and reloaded at the end of the new |
| 675 | Variable's live range. The register must be one that has no explicit |
| 676 | references during the Variable's live range. Since Subzero currently doesn't |
| 677 | track def/use chains (though it does record the CfgNode where a Variable is |
| 678 | defined), we just do a brute-force search across the CfgNode's instruction |
| 679 | list for the instruction numbers of interest. This situation happens very |
| 680 | rarely, so there's little point for now in improving its performance. |
| 681 | |
| 682 | The basic linear-scan algorithm may, as it proceeds, rescind an early register |
| 683 | allocation decision, leaving that Variable to be stack-allocated. Some of these |
| 684 | times, it turns out that the Variable could have been given a different register |
| 685 | without conflict, but by this time it's too late. The literature recognizes |
| 686 | this situation and describes "second-chance bin-packing", which Subzero can do. |
| 687 | We can rerun the register allocator in a mode that respects existing register |
| 688 | allocation decisions, and sometimes it finds new non-conflicting opportunities. |
| 689 | In fact, we can repeatedly run the register allocator until convergence. |
| 690 | Unfortunately, in the current implementation, these subsequent register |
| 691 | allocation passes end up being extremely expensive. This is because of the |
| 692 | treatment of the "unhandled pre-colored" Variable set, which is normally very |
| 693 | small but ends up being quite large on subsequent passes. Its performance can |
| 694 | probably be made acceptable with a better choice of data structures, but for now |
| 695 | this second-chance mechanism is disabled. |
| 696 | |
| 697 | Future work is to implement LLVM's `Greedy |
| 698 | <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_ |
| 699 | register allocator as a replacement for the basic linear-scan algorithm, given |
| 700 | LLVM's experience with its improvement in code quality. (The blog post claims |
| 701 | that the Greedy allocator also improved maintainability because a lot of hacks |
| 702 | could be removed, but Subzero is probably not yet to that level of hacks, and is |
| 703 | less likely to see that particular benefit.) |
| 704 | |
Manasij Mukherjee | a41e9a1 | 2016-08-05 12:29:56 -0700 | [diff] [blame] | 705 | Local common subexpression elimination |
| 706 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 707 | |
| 708 | The Local CSE implementation goes through each instruction and records a portion |
| 709 | of each ``Seen`` instruction in a hashset-like container. That portion consists |
| 710 | of the entire instruction except for any dest variable. That means ``A = X + Y`` |
| 711 | and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows |
| 712 | us to detect common subexpressions. |
| 713 | |
| 714 | Whenever a repetition is detected, the redundant variables are stored in a |
| 715 | container mapping the replacee to the replacement. In the case above, it would |
| 716 | be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``. |
| 717 | |
| 718 | At any point if a variable that has an entry in the replacement table is |
| 719 | encountered, it is replaced with the variable it is mapped to. This ensures that |
| 720 | the redundant variables will not have any uses in the basic block, allowing |
| 721 | dead code elimination to clean up the redundant instruction. |
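The mechanism above can be sketched in a few lines of Python. The ``(op, dest, srcs)`` instruction encoding is hypothetical, and where the actual pass leaves the redundant instruction behind for dead-code elimination to clean up, this sketch simply drops it.

```python
def local_cse(block):
    seen = {}     # (op, srcs) -> dest of the first occurrence
    replace = {}  # redundant dest -> canonical dest
    out = []
    for op, dest, srcs in block:
        # Rewrite sources through the replacement table first.
        srcs = tuple(replace.get(s, s) for s in srcs)
        key = (op, srcs)  # the instruction minus its dest variable
        if key in seen:
            replace[dest] = seen[key]  # B = X + Y becomes a use of A
            continue                   # redundant assignment dropped here
        seen[key] = dest
        out.append((op, dest, srcs))
    return out

block = [
    ("add", "A", ("X", "Y")),
    ("add", "B", ("X", "Y")),  # redundant: same op and sources as A
    ("mul", "D", ("B", "Z")),  # this use of B is rewritten to A
]
print(local_cse(block))  # [('add', 'A', ('X', 'Y')), ('mul', 'D', ('A', 'Z'))]
```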
| 722 | |
| 723 | With SSA, the information stored is never invalidated. However, non-SSA input is |
| 724 | supported with the ``-lcse=no-ssa`` option. This has to maintain some extra |
| 725 | dependency information to ensure proper invalidation on variable assignment. |
This mode is not rigorously tested, because the pass is normally run at an early
stage where it is safe to assume variables have a single definition.  It is also
not enabled by default, because it bumps the compile-time overhead from 2% to 6%.
| 729 | |
| 730 | Loop-invariant code motion |
| 731 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 732 | |
| 733 | This pass utilizes the loop analysis information to hoist invariant instructions |
| 734 | to loop pre-headers. A loop must have a single entry node (header) and that node |
| 735 | must have a single external predecessor for this optimization to work, as it is |
| 736 | currently implemented. |
| 737 | |
| 738 | The pass works by iterating over all instructions in the loop until the set of |
invariant instructions converges.  In each iteration, any instruction not yet
marked invariant whose operands are all constants or variables already known to
be invariant is added to the
| 741 | result set. The destination variable of that instruction is added to the set of |
| 742 | variables known to be invariant (which is initialized with the constant args). |
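The fixed-point iteration just described can be sketched as follows. The ``(dest, srcs)`` instruction encoding is hypothetical, with constants modeled as strings starting with ``#``.

```python
def find_invariants(loop_insts):
    invariant_vars = set()   # grows from the constant args upward
    invariant_insts = set()
    changed = True
    while changed:           # iterate until the result set converges
        changed = False
        for i, (dest, srcs) in enumerate(loop_insts):
            if i in invariant_insts:
                continue
            # Invariant if every operand is a constant or already invariant.
            if all(s.startswith("#") or s in invariant_vars for s in srcs):
                invariant_insts.add(i)
                invariant_vars.add(dest)
                changed = True
    return [loop_insts[i] for i in sorted(invariant_insts)]

loop = [
    ("t1", ("#4", "#8")),   # invariant immediately
    ("t2", ("t1", "#1")),   # invariant once t1 is known invariant
    ("t3", ("t2", "n")),    # n is defined in the loop: not invariant
]
print(find_invariants(loop))  # t1 and t2 are hoistable
```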
| 743 | |
| 744 | Improving the loop-analysis infrastructure is likely to have significant impact |
| 745 | on this optimization. Inserting an extra node to act as the pre-header when the |
| 746 | header has multiple incoming edges from outside could also be a good idea. |
| 747 | Expanding the initial invariant variable set to contain all variables that do |
| 748 | not have definitions inside the loop does not seem to improve anything. |
| 749 | |
| 750 | Short circuit evaluation |
| 751 | ^^^^^^^^^^^^^^^^^^^^^^^^ |
| 752 | |
| 753 | Short circuit evaluation splits nodes and introduces early jumps when the result |
| 754 | of a logical operation can be determined early and there are no observable side |
effects of skipping the rest of the computation.  The instructions are
considered backwards from the end of the basic block.  When a definition of a variable
| 757 | involved in a conditional jump is found, an extra jump can be inserted in that |
| 758 | location, moving the rest of the instructions in the node to a newly inserted |
| 759 | node. Consider this example:: |
| 760 | |
| 761 | __N : |
| 762 | a = <something> |
| 763 | Instruction 1 without side effect |
| 764 | ... b = <something> ... |
| 765 | Instruction N without side effect |
| 766 | t1 = or a b |
| 767 | br t1 __X __Y |
| 768 | |
| 769 | is transformed to:: |
| 770 | |
| 771 | __N : |
| 772 | a = <something> |
| 773 | br a __X __N_ext |
| 774 | |
| 775 | __N_ext : |
| 776 | Instruction 1 without side effect |
| 777 | ... b = <something> ... |
| 778 | Instruction N without side effect |
| 779 | br b __X __Y |
| 780 | |
The logic for AND is analogous; the only difference is that the early jump is
taken on a ``false`` value instead of ``true``.
| 783 | |
Global variable splitting
^^^^^^^^^^^^^^^^^^^^^^^^^
| 786 | |
| 787 | Global variable splitting (``-split-global-vars``) is run after register |
| 788 | allocation. It works on the variables that did not manage to get registers (but |
| 789 | are allowed to) and decomposes their live ranges into the individual segments |
| 790 | (which span a single node at most). New variables are created (but not yet used) |
with these smaller live ranges, and the register allocator is run again.  This
is not as inefficient as it may sound, because the old variables that already
had registers are now considered pre-colored.
| 794 | |
Each new variable that gets a register replaces its parent variable for its
portion of the parent's live range.  A copy from the old variable to the new
one is introduced before the first use, and the reverse copy after the last def
in the live range.
| 799 | |
Basic phi lowering
| 801 | ^^^^^^^^^^^^^^^^^^ |
| 802 | |
| 803 | The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` |
| 804 | implements it). Consider this example:: |
| 805 | |
| 806 | L1: |
| 807 | ... |
| 808 | br L3 |
| 809 | L2: |
| 810 | ... |
| 811 | br L3 |
| 812 | L3: |
| 813 | A = phi [B, L1], [C, L2] |
| 814 | X = phi [Y, L1], [Z, L2] |
| 815 | |
| 816 | For each destination of a phi instruction, we can create a temporary and insert |
| 817 | the temporary's assignment at the end of the predecessor block:: |
| 818 | |
| 819 | L1: |
| 820 | ... |
| 821 | A' = B |
| 822 | X' = Y |
| 823 | br L3 |
| 824 | L2: |
| 825 | ... |
| 826 | A' = C |
| 827 | X' = Z |
| 828 | br L3 |
L3:
| 830 | A = A' |
| 831 | X = X' |
| 832 | |
| 833 | This transformation is very simple and reliable. It can be done before target |
| 834 | lowering and register allocation, and it easily avoids the classic lost-copy and |
| 835 | related problems. ``Om1`` lowering uses this strategy. |
| 836 | |
| 837 | However, it has the disadvantage of initializing temporaries even for branches |
| 838 | not taken, though that could be mitigated by splitting non-critical edges and |
| 839 | putting assignments in the edge-split nodes. Another problem is that without |
| 840 | extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a |
| 841 | specific ordering even though phi semantics are that the assignments are |
| 842 | parallel or unordered. This sometimes imposes false live range overlaps and |
| 843 | leads to poorer register allocation. |
| 844 | |
| 845 | Advanced phi lowering |
| 846 | ^^^^^^^^^^^^^^^^^^^^^ |
| 847 | |
| 848 | ``O2`` lowering defers phi lowering until after register allocation to avoid the |
| 849 | problem of false live range overlaps. It works as follows. We split each |
| 850 | incoming edge and move the (parallel) phi assignments into the split nodes. We |
| 851 | linearize each set of assignments by finding a safe, topological ordering of the |
| 852 | assignments, respecting register assignments as well. For example:: |
| 853 | |
| 854 | A = B |
| 855 | X = Y |
| 856 | |
| 857 | Normally these assignments could be executed in either order, but if ``B`` and |
| 858 | ``X`` are assigned the same physical register, we would want to use the above |
| 859 | ordering. Dependency cycles are broken by introducing a temporary. For |
| 860 | example:: |
| 861 | |
| 862 | A = B |
| 863 | B = A |
| 864 | |
| 865 | Here, a temporary breaks the cycle:: |
| 866 | |
| 867 | t = A |
| 868 | A = B |
| 869 | B = t |
| 870 | |
| 871 | Finally, we use the existing target lowering to lower the assignments in this |
| 872 | basic block, and once that is done for all basic blocks, we run the post-phi |
| 873 | variant of register allocation on the edge-split basic blocks. |
| 874 | |
| 875 | When computing a topological order, we try to first schedule assignments whose |
| 876 | source has a physical register, and last schedule assignments whose destination |
| 877 | has a physical register. This helps reduce register pressure. |
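The linearization with cycle breaking can be sketched as follows. This illustrative version ignores the register-based scheduling heuristics (they only affect which ready copy is picked first) and uses a single hypothetical temporary name.

```python
def sequentialize(copies):
    """copies: list of (dest, src) pairs with parallel phi semantics."""
    pending = {d: s for d, s in copies if d != s}  # drop no-op self-copies
    out = []
    while pending:
        # A copy is safe to emit when no pending copy still reads its dest.
        ready = [d for d in pending if d not in pending.values()]
        if ready:
            d = ready[0]
            out.append((d, pending.pop(d)))
        else:
            # Every destination is still read somewhere: a dependency cycle.
            d = next(iter(pending))
            out.append(("t", d))  # save d's value in a temporary...
            # ...and redirect the remaining reads of d to the temporary.
            pending = {k: ("t" if v == d else v) for k, v in pending.items()}
    return out

# The A = B / B = A swap from the text becomes t = A; A = B; B = t.
print(sequentialize([("A", "B"), ("B", "A")]))
```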
| 878 | |
| 879 | X86 address mode inference |
| 880 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 881 | |
| 882 | We try to take advantage of the x86 addressing mode that includes a base |
| 883 | register, an index register, an index register scale amount, and an immediate |
| 884 | offset. We do this through simple pattern matching. Starting with a load or |
| 885 | store instruction where the address is a variable, we initialize the base |
| 886 | register to that variable, and look up the instruction where that variable is |
| 887 | defined. If that is an add instruction of two variables and the index register |
| 888 | hasn't been set, we replace the base and index register with those two |
| 889 | variables. If instead it is an add instruction of a variable and a constant, we |
| 890 | replace the base register with the variable and add the constant to the |
| 891 | immediate offset. |
| 892 | |
| 893 | There are several more patterns that can be matched. This pattern matching |
| 894 | continues on the load or store instruction until no more matches are found. |
| 895 | Because a program typically has few load and store instructions (not to be |
| 896 | confused with instructions that manipulate stack variables), this address mode |
| 897 | inference pass is fast. |
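The two patterns described above might be sketched like this. The instruction encoding (a map from a variable to its defining ``("add", a, b)`` instruction) is hypothetical, and only the constant-second-operand form is matched for brevity.

```python
def infer_addr_mode(addr, defs, is_const):
    base, index, offset = addr, None, 0
    changed = True
    while changed:  # keep matching until no more patterns apply
        changed = False
        inst = defs.get(base)
        if inst and inst[0] == "add":
            _, a, b = inst
            if is_const(b):
                # base = a + const: fold the constant into the offset.
                base, offset = a, offset + b
                changed = True
            elif index is None:
                # base = a + b: split into base and index registers.
                base, index = a, b
                changed = True
    return base, index, offset

defs = {
    "t2": ("add", "t1", 8),     # t2 = t1 + 8
    "t1": ("add", "arr", "i"),  # t1 = arr + i
}
# A load from t2 becomes a load from [arr + i + 8].
print(infer_addr_mode("t2", defs, lambda x: isinstance(x, int)))
```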
| 898 | |
| 899 | X86 read-modify-write inference |
| 900 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 901 | |
| 902 | A reasonably common bitcode pattern is a non-atomic update of a memory |
| 903 | location:: |
| 904 | |
| 905 | x = load addr |
| 906 | y = add x, 1 |
| 907 | store y, addr |
| 908 | |
| 909 | On x86, with good register allocation, the Subzero passes described above |
generate code no better than this::
| 911 | |
| 912 | mov [%ebx], %eax |
| 913 | add $1, %eax |
| 914 | mov %eax, [%ebx] |
| 915 | |
| 916 | However, x86 allows for this kind of code:: |
| 917 | |
| 918 | add $1, [%ebx] |
| 919 | |
| 920 | which requires fewer instructions, but perhaps more importantly, requires fewer |
| 921 | physical registers. |
| 922 | |
| 923 | It's also important to note that this transformation only makes sense if the |
| 924 | store instruction ends ``x``'s live range. |
| 925 | |
| 926 | Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW) |
| 927 | opportunities via simple pattern matching. The only problem is that it is run |
| 928 | before liveness analysis, which is needed to determine whether ``x``'s live |
| 929 | range ends after the RMW. Since liveness analysis is one of the most expensive |
| 930 | passes, it's not attractive to run it an extra time just for RMW analysis. |
| 931 | Instead, we essentially generate both the RMW and the non-RMW versions, and then |
during lowering, the RMW version deletes itself if it finds ``x`` still live.
| 933 | |
| 934 | X86 compare-branch inference |
| 935 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 936 | |
| 937 | In the LLVM instruction set, the compare/branch pattern works like this:: |
| 938 | |
| 939 | cond = icmp eq a, b |
| 940 | br cond, target |
| 941 | |
| 942 | The result of the icmp instruction is a single bit, and a conditional branch |
| 943 | tests that bit. By contrast, most target architectures use this pattern:: |
| 944 | |
| 945 | cmp a, b // implicitly sets various bits of FLAGS register |
| 946 | br eq, target // branch on a particular FLAGS bit |
| 947 | |
| 948 | A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests |
| 949 | ``cond`` and conditionally branches. Subzero has a pass that identifies |
| 950 | boolean-based operations like this and folds them into a single |
| 951 | compare/branch-like operation. It is set up for more than just cmp/br though. |
| 952 | Boolean producers include icmp (integer compare), fcmp (floating-point compare), |
| 953 | and trunc (integer truncation when the destination has bool type). Boolean |
| 954 | consumers include branch, select (the ternary operator from the C language), and |
| 955 | sign-extend and zero-extend when the source has bool type. |
| 956 | |
| 957 | Sandboxing |
| 958 | ^^^^^^^^^^ |
| 959 | |
| 960 | Native Client's sandbox model uses software fault isolation (SFI) to provide |
| 961 | safety when running untrusted code in a browser or other environment. Subzero |
| 962 | implements Native Client's `sandboxing |
| 963 | <https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_ |
| 964 | to enable Subzero-translated executables to be run inside Chrome. Subzero also |
| 965 | provides a fairly simple framework for investigating alternative sandbox models |
| 966 | or other restrictions on the sandbox model. |
| 967 | |
| 968 | Sandboxing in Subzero is not actually implemented as a separate pass, but is |
| 969 | integrated into lowering and assembly. |
| 970 | |
| 971 | - Indirect branches, including the ret instruction, are masked to a bundle |
| 972 | boundary and bundle-locked. |
| 973 | |
| 974 | - Call instructions are aligned to the end of the bundle so that the return |
| 975 | address is bundle-aligned. |
| 976 | |
| 977 | - Indirect branch targets, including function entry and targets in a switch |
| 978 | statement jump table, are bundle-aligned. |
| 979 | |
| 980 | - The intrinsic for reading the thread pointer is inlined appropriately. |
| 981 | |
| 982 | - For x86-64, non-stack memory accesses are with respect to the reserved sandbox |
| 983 | base register. We reduce the aggressiveness of address mode inference to |
| 984 | leave room for the sandbox base register during lowering. There are no memory |
| 985 | sandboxing changes for x86-32. |
| 986 | |
| 987 | Code emission |
| 988 | ------------- |
| 989 | |
| 990 | Subzero's integrated assembler is derived from Dart's `assembler code |
<https://github.com/dart-lang/sdk/tree/master/runtime/vm>`_.  There is a pass
| 992 | that iterates through the low-level ICE instructions and invokes the relevant |
| 993 | assembler functions. Placeholders are added for later fixup of branch target |
| 994 | offsets. (Backward branches use short offsets if possible; forward branches |
| 995 | generally use long offsets unless it is an intra-block branch of "known" short |
| 996 | length.) The assembler emits into a staging buffer. Once emission into the |
| 997 | staging buffer for a function is complete, the data is emitted to the output |
| 998 | file as an ELF object file, and metadata such as relocations, symbol table, and |
| 999 | string table, are accumulated for emission at the end. Global data initializers |
| 1000 | are emitted similarly. A key point is that at this point, the staging buffer |
can be deallocated, and only a minimum of data needs to be held until the end.
| 1002 | |
| 1003 | As a debugging alternative, Subzero can emit textual assembly code which can |
| 1004 | then be run through an external assembler. This is of course super slow, but |
| 1005 | quite valuable when bringing up a new target. |
| 1006 | |
| 1007 | As another debugging option, the staging buffer can be emitted as textual |
| 1008 | assembly, primarily in the form of ".byte" lines. This allows the assembler to |
| 1009 | be tested separately from the ELF related code. |
| 1010 | |
| 1011 | Memory management |
| 1012 | ----------------- |
| 1013 | |
| 1014 | Where possible, we allocate from a ``CfgLocalAllocator`` which derives from |
| 1015 | LLVM's ``BumpPtrAllocator``. This is an arena-style allocator where objects |
| 1016 | allocated from the arena are never actually freed; instead, when the CFG |
| 1017 | translation completes and the CFG is deleted, the entire arena memory is |
| 1018 | reclaimed at once. This style of allocation works well in an environment like a |
| 1019 | compiler where there are distinct phases with only easily-identifiable objects |
| 1020 | living across phases. It frees the developer from having to manage object |
| 1021 | deletion, and it amortizes deletion costs across just a single arena deletion at |
| 1022 | the end of the phase. Furthermore, it helps scalability by allocating entirely |
| 1023 | from thread-local memory pools, and minimizing global locking of the heap. |
| 1024 | |
| 1025 | Instructions are probably the most heavily allocated complex class in Subzero. |
| 1026 | We represent an instruction list as an intrusive doubly linked list, allocate |
| 1027 | all instructions from the ``CfgLocalAllocator``, and we make sure each |
| 1028 | instruction subclass is basically `POD |
| 1029 | <http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a |
| 1030 | trivial destructor. This way, when the CFG is finished, we don't need to |
individually deallocate every instruction.  We do the same for Variables, which
are probably the second most heavily allocated complex class.
| 1033 | |
| 1034 | There are some situations where passes need to use some `STL container class |
| 1035 | <http://en.cppreference.com/w/cpp/container>`_. Subzero has a way of using the |
| 1036 | ``CfgLocalAllocator`` as the container allocator if this is needed. |
| 1037 | |
| 1038 | Multithreaded translation |
| 1039 | ------------------------- |
| 1040 | |
| 1041 | Subzero is designed to be able to translate functions in parallel. With the |
| 1042 | ``-threads=N`` command-line option, there is a 3-stage producer-consumer |
| 1043 | pipeline: |
| 1044 | |
| 1045 | - A single thread parses the ``.pexe`` file and produces a sequence of work |
| 1046 | units. A work unit can be either a fully constructed CFG, or a set of global |
| 1047 | initializers. The work unit includes its sequence number denoting its parse |
| 1048 | order. Each work unit is added to the translation queue. |
| 1049 | |
| 1050 | - There are N translation threads that draw work units from the translation |
| 1051 | queue and lower them into assembler buffers. Each assembler buffer is added |
| 1052 | to the emitter queue, tagged with its sequence number. The CFG and its |
| 1053 | ``CfgLocalAllocator`` are disposed of at this point. |
| 1054 | |
| 1055 | - A single thread draws assembler buffers from the emitter queue and appends to |
| 1056 | the output file. It uses the sequence numbers to reintegrate the assembler |
| 1057 | buffers according to the original parse order, such that output order is |
| 1058 | always deterministic. |
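The reintegration step in the emitter can be sketched as follows. The thread and queue machinery is elided; this only shows how sequence numbers restore deterministic parse order when translated buffers arrive out of order.

```python
def emit_in_order(arrivals):
    """arrivals: (sequence_number, buffer) pairs in completion order."""
    pending = {}   # seq -> buffer, for units that finished early
    next_seq = 0
    output = []
    for seq, buf in arrivals:
        pending[seq] = buf
        # Flush every buffer that is now contiguous with the output.
        while next_seq in pending:
            output.append(pending.pop(next_seq))
            next_seq += 1
    return output

# Translation threads finish out of order; emission order is deterministic.
print(emit_in_order([(1, "f1"), (0, "f0"), (3, "f3"), (2, "f2")]))
```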
| 1059 | |
| 1060 | This means that with ``-threads=N``, there are actually ``N+1`` spawned threads |
| 1061 | for a total of ``N+2`` execution threads, taking the parser and emitter threads |
| 1062 | into account. For the special case of ``N=0``, execution is entirely sequential |
| 1063 | -- the same thread parses, translates, and emits, one function at a time. This |
| 1064 | is useful for performance measurements. |
| 1065 | |
| 1066 | Ideally, we would like to get near-linear scalability as the number of |
| 1067 | translation threads increases. We expect that ``-threads=1`` should be slightly |
| 1068 | faster than ``-threads=0`` as the small amount of time spent parsing and |
| 1069 | emitting is done largely in parallel with translation. With perfect |
scalability, we would see ``-threads=N`` translating ``N`` times as fast as
| 1071 | ``-threads=1``, up until the point where parsing or emitting becomes the |
| 1072 | bottleneck, or ``N+2`` exceeds the number of CPU cores. In reality, memory |
| 1073 | performance would become a bottleneck and efficiency might peak at, say, 75%. |
| 1074 | |
| 1075 | Currently, parsing takes about 11% of total sequential time. If translation |
| 1076 | scalability ever gets so fast and awesomely scalable that parsing becomes a |
| 1077 | bottleneck, it should be possible to make parsing multithreaded as well. |
| 1078 | |
| 1079 | Internally, all shared, mutable data is held in the GlobalContext object, and |
| 1080 | access to each field is guarded by a mutex. |
| 1081 | |
| 1082 | Security |
| 1083 | -------- |
| 1084 | |
| 1085 | Subzero includes a number of security features in the generated code, as well as |
| 1086 | in the Subzero translator itself, which run on top of the existing Native Client |
| 1087 | sandbox as well as Chrome's OS-level sandbox. |
| 1088 | |
| 1089 | Sandboxed translator |
| 1090 | ^^^^^^^^^^^^^^^^^^^^ |
| 1091 | |
| 1092 | When running inside the browser, the Subzero translator executes as sandboxed, |
| 1093 | untrusted code that is initially checked by the validator, just like the |
| 1094 | LLVM-based ``pnacl-llc`` translator. As such, the Subzero binary should be no |
| 1095 | more or less secure than the translator it replaces, from the point of view of |
| 1096 | the Chrome sandbox. That said, Subzero is much smaller than ``pnacl-llc`` and |
| 1097 | was designed from the start with security in mind, so one expects fewer attacker |
| 1098 | opportunities here. |
| 1099 | |
| 1100 | Code diversification |
| 1101 | ^^^^^^^^^^^^^^^^^^^^ |
| 1102 | |
| 1103 | `Return-oriented programming |
| 1104 | <https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a |
| 1105 | now-common technique for starting with e.g. a known buffer overflow situation |
| 1106 | and launching it into a deeper exploit. The attacker scans the executable |
| 1107 | looking for ROP gadgets, which are short sequences of code that happen to load |
| 1108 | known values into known registers and then return. An attacker who manages to |
| 1109 | overwrite parts of the stack can overwrite it with carefully chosen return |
| 1110 | addresses such that certain ROP gadgets are effectively chained together to set |
| 1111 | up the register state as desired, finally returning to some code that manages to |
| 1112 | do something nasty based on those register values. |
| 1113 | |
| 1114 | If there is a popular ``.pexe`` with a large install base, the attacker could |
| 1115 | run Subzero on it and scan the executable for suitable ROP gadgets to use as |
| 1116 | part of a potential exploit. Note that if the trusted validator is working |
| 1117 | correctly, these ROP gadgets are limited to starting at a bundle boundary and |
| 1118 | cannot use the trick of finding a gadget that happens to begin inside another |
| 1119 | instruction. All the same, gadgets with these constraints still exist and the |
| 1120 | attacker has access to them. This is the attack model we focus most on -- |
| 1121 | protecting the user against misuse of a "trusted" developer's application, as |
| 1122 | opposed to mischief from a malicious ``.pexe`` file. |
| 1123 | |
| 1124 | Subzero can mitigate these attacks to some degree through code diversification. |
| 1125 | Specifically, we can apply some randomness to the code generation that makes ROP |
| 1126 | gadgets less predictable. This randomness can have some compile-time cost, and |
| 1127 | it can affect the code quality; and some diversifications may be more effective |
| 1128 | than others. A more detailed treatment of hardening techniques may be found in |
| 1129 | the Matasano report "`Attacking Clientside JIT Compilers |
| 1130 | <https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_". |
| 1131 | |
| 1132 | To evaluate diversification effectiveness, we use a third-party ROP gadget |
| 1133 | finder and limit its results to bundle-aligned addresses. For a given |
| 1134 | diversification technique, we run it with a number of different random seeds, |
| 1135 | find ROP gadgets for each version, and determine how persistent each ROP gadget |
| 1136 | is across the different versions. A gadget is persistent if the same gadget is |
| 1137 | found at the same code address. The best diversifications are ones with low |
| 1138 | gadget persistence rates. |
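
The persistence measurement described above can be sketched as follows. This is an illustrative model, not the actual third-party gadget finder: each diversified build is represented as a map from code address to the gadget found there.

```python
# Sketch of the gadget-persistence metric: a gadget is persistent if the
# same gadget bytes are found at the same address in every variant.

def persistence_rate(baseline, variants):
    """Fraction of baseline gadgets found unchanged, at the same
    address, in all diversified variants."""
    if not baseline:
        return 0.0
    persistent = sum(
        1
        for addr, gadget in baseline.items()
        if all(v.get(addr) == gadget for v in variants)
    )
    return persistent / len(baseline)

# Example: two gadgets; one survives diversification, one moves.
base = {0x1000: "pop eax; ret", 0x1020: "pop ebx; ret"}
seed1 = {0x1000: "pop eax; ret", 0x1040: "pop ebx; ret"}
seed2 = {0x1000: "pop eax; ret", 0x1060: "pop ebx; ret"}
rate = persistence_rate(base, [seed1, seed2])  # 0.5
```

The best diversifications drive this rate toward zero across many seeds.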
| 1139 | |
| 1140 | Subzero implements 7 different diversification techniques. Below is a |
| 1141 | discussion of each technique, its effectiveness, and its cost. The discussions |
| 1142 | of cost and effectiveness are for a single diversification technique; the |
| 1143 | translation-time costs for multiple techniques are additive, but the effects of |
| 1144 | multiple techniques on code quality and effectiveness are not yet known. |
| 1145 | |
| 1146 | In Subzero's implementation, each randomization is "repeatable" in a sense. |
| 1147 | Each pass that includes a randomization option gets its own private instance of |
| 1148 | a random number generator (RNG). The RNG is seeded with a combination of a |
| 1149 | global seed, the pass ID, and the function's sequence number. The global seed |
| 1150 | is designed to be different across runs (perhaps based on the current time), but |
| 1151 | for debugging, the global seed can be set to a specific value and the results |
| 1152 | will be repeatable. |
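
The per-pass seeding scheme might be sketched as follows. The mixing function and names here are illustrative, not Subzero's actual code; any mixing that is deterministic in its inputs preserves repeatability.

```python
import hashlib
import random

def derive_seed(global_seed, pass_id, func_sequence_number):
    """Mix the global seed, pass ID, and function sequence number into
    one 64-bit seed (illustrative mixing via SHA-256)."""
    key = f"{global_seed}/{pass_id}/{func_sequence_number}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def make_pass_rng(global_seed, pass_id, func_sequence_number):
    """Each pass gets its own private RNG instance, so one pass's
    consumption of random numbers cannot perturb another pass."""
    return random.Random(derive_seed(global_seed, pass_id, func_sequence_number))

# Pinning the global seed makes every decision sequence repeatable,
# which is what makes a randomization-dependent bug debuggable.
a = make_pass_rng(1234, "nop-insertion", 42)
b = make_pass_rng(1234, "nop-insertion", 42)
assert [a.random() for _ in range(5)] == [b.random() for _ in range(5)]
```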
| 1153 | |
| 1154 | Subzero-generated code is subject to diversification once per translation, and |
| 1155 | then Chrome caches the diversified binary for subsequent executions. An |
| 1156 | attacker may attempt to run the binary multiple times hoping for |
| 1157 | higher-probability combinations of ROP gadgets. When the attacker guesses |
| 1158 | wrong, a likely outcome is an application crash. Chrome throttles creation of |
| 1159 | crashy processes, which reduces the likelihood of the attacker eventually gaining
| 1160 | a foothold. |
| 1161 | |
| 1162 | Constant blinding |
| 1163 | ~~~~~~~~~~~~~~~~~ |
| 1164 | |
| 1165 | Here, we prevent attackers from controlling large immediates in the text |
| 1166 | (executable) section. A random cookie is generated for each function, and if |
| 1167 | the constant exceeds a specified threshold, the constant is obfuscated with the |
| 1168 | cookie and equivalent code is generated. For example, instead of this x86 |
| 1169 | instruction:: |
| 1170 | |
| 1171 | mov $0x11223344, <%Reg/Mem> |
| 1172 | |
| 1173 | the following code might be generated:: |
| 1174 | |
| 1175 | mov $(0x11223344+Cookie), %temp |
| 1176 | lea -Cookie(%temp), %temp |
| 1177 | mov %temp, <%Reg/Mem> |
| 1178 | |
| 1179 | The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to |
| 1180 | prevent unintended effects on the flags register. |
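
The arithmetic of the transformation can be checked with a small model (32-bit wraparound as on x86; the cookie value here is illustrative):

```python
MASK32 = 0xFFFFFFFF  # x86-32 registers wrap modulo 2**32

def blind(constant, cookie):
    """Model the blinded sequence: the add and the compensating
    lea-style subtraction round-trip exactly under 32-bit wraparound."""
    blinded = (constant + cookie) & MASK32   # mov $(C+Cookie), %temp
    recovered = (blinded - cookie) & MASK32  # lea -Cookie(%temp), %temp
    return blinded, recovered

cookie = 0x9E3779B9  # illustrative per-function random cookie
blinded, recovered = blind(0x11223344, cookie)
assert recovered == 0x11223344   # the program sees the original value
assert blinded != 0x11223344     # the text section no longer does
```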
| 1181 | |
| 1182 | This transformation has almost no effect on translation time, and about 1% |
| 1183 | impact on code quality, depending on the threshold chosen. It does little to |
| 1184 | reduce gadget persistence, but it does remove a lot of potential opportunities |
| 1185 | to construct intra-instruction ROP gadgets (which an attacker could use only if |
| 1186 | a validator bug were discovered, since the Native Client sandbox and associated |
| 1187 | validator force returns and other indirect branches to be to bundle-aligned |
| 1188 | addresses). |
| 1189 | |
| 1190 | Constant pooling |
| 1191 | ~~~~~~~~~~~~~~~~ |
| 1192 | |
| 1193 | This is similar to constant blinding, in that large immediates are removed from |
| 1194 | the text section. In this case, each unique constant above the threshold is |
| 1195 | stored in a read-only data section and the constant is accessed via a memory |
| 1196 | load. For the above example, the following code might be generated:: |
| 1197 | |
| 1198 | mov $Label$1, %temp |
| 1199 | mov %temp, <%Reg/Mem> |
| 1200 | |
| 1201 | This has a similarly small impact on translation time and ROP gadget |
| 1202 | persistence, and a smaller (better) impact on code quality. This is because it |
| 1203 | uses fewer instructions, and in some cases good register allocation leads to no |
| 1204 | increase in instruction count. Note that this still gives an attacker
| 1205 | limited control over some text section values, unless we randomize the
| 1206 | constant pool layout. |
| 1207 | |
| 1208 | Static data reordering |
| 1209 | ~~~~~~~~~~~~~~~~~~~~~~ |
| 1210 | |
| 1211 | This transformation limits the attacker's ability to control bits in global data |
| 1212 | address references. It simply permutes the order in memory of global variables |
| 1213 | and internal constant pool entries. For the constant pool, we only permute |
| 1214 | within a type (i.e., emit a randomized list of ints, followed by a randomized |
| 1215 | list of floats, etc.) to maintain good packing in the face of alignment |
| 1216 | constraints. |
| 1217 | |
| 1218 | As might be expected, this has no impact on code quality, translation time, or |
| 1219 | ROP gadget persistence (though as above, it limits opportunities for |
| 1220 | intra-instruction ROP gadgets with a broken validator). |
| 1221 | |
| 1222 | Basic block reordering |
| 1223 | ~~~~~~~~~~~~~~~~~~~~~~ |
| 1224 | |
| 1225 | Here, we randomize the order of basic blocks within a function, with the |
| 1226 | constraint that we still want to maintain a topological order as much as |
| 1227 | possible, to avoid making the code too branchy. |
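
A randomized layout that still respects topological order can be sketched as follows (the CFG representation is hypothetical, and back edges are ignored for brevity):

```python
import random

def randomized_layout(succs, entry, rng):
    """Emit blocks in a random order that respects a topological order
    of the CFG's forward edges, so fall-through mostly survives.
    `succs` maps block -> list of successors (assumed acyclic here)."""
    indeg = {b: 0 for b in succs}
    for outs in succs.values():
        for s in outs:
            indeg[s] += 1
    ready = [entry]
    order = []
    while ready:
        # The randomness: pick any block whose predecessors are placed.
        b = ready.pop(rng.randrange(len(ready)))
        order.append(b)
        for s in succs[b]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return order

cfg = {"entry": ["a", "b"], "a": ["exit"], "b": ["exit"], "exit": []}
order = randomized_layout(cfg, "entry", random.Random(7))
assert order[0] == "entry"       # entry block stays first
assert set(order) == set(cfg)    # every block is placed exactly once
```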
| 1228 | |
| 1229 | This has no impact on code quality, and about 1% impact on translation time, due |
| 1230 | to a separate pass to recompute layout. It has a huge effect on ROP gadget
| 1231 | persistence, tied with nop insertion for the best result, reducing persistence
| 1232 | to less than 5%.
| 1233 | |
| 1234 | Function reordering |
| 1235 | ~~~~~~~~~~~~~~~~~~~ |
| 1236 | |
| 1237 | Here, we permute the order that functions are emitted, primarily to shift ROP |
| 1238 | gadgets around to less predictable locations. It may also change call address |
| 1239 | offsets in case the attacker was trying to control that offset in the code. |
| 1240 | |
| 1241 | To control latency and memory footprint, we don't arbitrarily permute functions. |
| 1242 | Instead, for some relatively small value of N, we queue up N assembler buffers, |
| 1243 | and then emit the N functions in random order, and repeat until all functions |
| 1244 | are emitted. |
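
The windowed permutation can be sketched as follows (the value of N and the shuffle source are illustrative):

```python
import random

def emit_in_windows(functions, n, rng):
    """Queue up to `n` finished assembler buffers, emit them in random
    order, and repeat.  Memory stays bounded: at most `n` buffers are
    ever held back, yet function positions become unpredictable."""
    emitted = []
    for i in range(0, len(functions), n):
        window = functions[i:i + n]
        rng.shuffle(window)
        emitted.extend(window)
    return emitted

funcs = [f"f{i}" for i in range(10)]
out = emit_in_windows(funcs, 4, random.Random(1))
assert sorted(out) == sorted(funcs)   # nothing lost or duplicated
# A function can move only within its window of 4.
assert all(abs(out.index(f) - funcs.index(f)) < 4 for f in funcs)
```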
| 1245 | |
| 1246 | Function reordering has no impact on translation time or code quality. |
| 1247 | Measurements indicate that it reduces ROP gadget persistence to about 15%. |
| 1248 | |
| 1249 | Nop insertion |
| 1250 | ~~~~~~~~~~~~~ |
| 1251 | |
| 1252 | This diversification randomly adds a nop instruction after each regular |
| 1253 | instruction, with some probability. Nop instructions of different lengths may |
| 1254 | be selected. Nop instructions are never added inside a bundle_lock region. |
| 1255 | Note that when sandboxing is enabled, nop instructions are already being added |
| 1256 | for bundle alignment, so the diversification nop instructions may simply be |
| 1257 | taking the place of alignment nop instructions, though distributed differently |
| 1258 | through the bundle. |
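
A sketch of probabilistic nop insertion follows. The probability, nop lengths, and string representation are illustrative; Subzero's real pass operates on its IR, not on text.

```python
import random

def insert_nops(instructions, prob, rng, in_bundle_lock=lambda i: False):
    """After each instruction, with probability `prob`, insert a nop of
    a randomly chosen length -- never inside a bundle_lock region."""
    out = []
    for idx, insn in enumerate(instructions):
        out.append(insn)
        if not in_bundle_lock(idx) and rng.random() < prob:
            out.append(f"nop{rng.choice([1, 2, 3])}")  # 1- to 3-byte nop
    return out

code = ["mov", "add", "ret"]
diversified = insert_nops(code, 0.5, random.Random(3))
# Original instructions survive in order; only nops were added.
assert [i for i in diversified if not i.startswith("nop")] == code
```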
| 1259 | |
| 1260 | In Subzero's current implementation, nop insertion adds 3-5% to the
| 1261 | translation time, but this is probably because it is implemented as a separate |
| 1262 | pass that adds actual nop instructions to the IR. The overhead would probably |
| 1263 | be a lot less if it were integrated into the assembler pass. The code quality |
| 1264 | is also reduced by 3-5%, making nop insertion the most expensive of the |
| 1265 | diversification techniques. |
| 1266 | |
| 1267 | Nop insertion is very effective in reducing ROP gadget persistence, at the same |
| 1268 | level as basic block randomization (less than 5%). But given nop insertion's |
| 1269 | impact on translation time and code quality, one would most likely prefer to use |
| 1270 | basic block randomization instead (though the combined effects of the different |
| 1271 | diversification techniques have not yet been studied). |
| 1272 | |
| 1273 | Register allocation randomization |
| 1274 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 1275 | |
| 1276 | In this diversification, the register allocator tries to make different but |
| 1277 | mostly functionally equivalent choices, while maintaining stable code quality. |
| 1278 | |
| 1279 | A naive approach would be the following. Whenever the allocator has more than |
| 1280 | one choice for assigning a register, choose randomly among those options. And |
| 1281 | whenever there are no registers available and there is a tie for the |
| 1282 | lowest-weight variable, randomly select one of the lowest-weight variables to |
| 1283 | evict. Because of the one-pass nature of the linear-scan algorithm, this |
| 1284 | randomization strategy can have a large impact on which variables are ultimately |
| 1285 | assigned registers, with a corresponding large impact on code quality. |
| 1286 | |
| 1287 | Instead, we choose an approach that tries to keep code quality stable regardless |
| 1288 | of the random seed. We partition the set of physical registers into equivalence |
| 1289 | classes. If a register is pre-colored in the function (i.e., referenced |
| 1290 | explicitly by name), it forms its own equivalence class. The remaining |
| 1291 | registers are partitioned according to their combination of attributes such as |
| 1292 | integer versus floating-point, 8-bit versus 32-bit, caller-save versus |
| 1293 | callee-saved, etc. Each equivalence class is randomly permuted, and the |
| 1294 | complete permutation is applied to the final register assignments. |
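
The permutation step might be sketched as follows, with illustrative x86-32-like register names and equivalence classes:

```python
import random

def permute_assignments(assignment, classes, precolored, rng):
    """Permute final register assignments within equivalence classes.
    `assignment` maps variable -> register; `classes` lists the
    equivalence classes; pre-colored registers form singleton classes
    and therefore never move."""
    perm = {r: r for r in precolored}
    for cls in classes:
        shuffled = cls[:]
        rng.shuffle(shuffled)
        perm.update(zip(cls, shuffled))
    return {var: perm[reg] for var, reg in assignment.items()}

# %eax is pre-colored here (say, holding a return value), so it stays
# put; registers with matching attributes may swap within their class.
classes = [["ecx", "edx"], ["esi", "edi"]]
assign = {"v1": "eax", "v2": "ecx", "v3": "esi"}
out = permute_assignments(assign, classes, ["eax"], random.Random(5))
assert out["v1"] == "eax"             # pre-colored: unchanged
assert out["v2"] in ("ecx", "edx")    # moved only within its class
assert out["v3"] in ("esi", "edi")
```

Because the permutation is applied after allocation completes, the number of spills and evictions is identical for every seed, which is what keeps code quality stable.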
| 1295 | |
| 1296 | Register randomization reduces ROP gadget persistence to about 10% on average, |
| 1297 | though there tends to be fairly high variance across functions and applications. |
| 1298 | This probably has to do with the set of restrictions in the x86-32 instruction |
| 1299 | set and ABI, such as few general-purpose registers, ``%eax`` used for return |
| 1300 | values, ``%edx`` used for division, ``%cl`` used for shifting, etc. As |
| 1301 | intended, register randomization has no impact on code quality, and a slight |
| 1302 | (0.5%) impact on translation time due to an extra scan over the variables to |
| 1303 | identify pre-colored registers. |
| 1304 | |
| 1305 | Fuzzing |
| 1306 | ^^^^^^^ |
| 1307 | |
| 1308 | We have started fuzz-testing the ``.pexe`` files input to Subzero, using a |
| 1309 | combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer |
| 1310 | <http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling. The purpose is to |
| 1311 | find and fix cases where Subzero crashes or otherwise ungracefully fails on |
| 1312 | unexpected inputs, and to do so automatically over a large range of unexpected |
| 1313 | inputs. By fixing bugs that arise from fuzz testing, we reduce the possibility |
| 1314 | of an attacker exploiting these bugs. |
| 1315 | |
| 1316 | Most of the problems found so far are ones most appropriately handled in the |
| 1317 | parser. However, there have been a couple that have identified problems in the |
| 1318 | lowering, or otherwise inappropriately triggered assertion failures and fatal |
| 1319 | errors. We continue to dig into this area. |
| 1320 | |
| 1321 | Future security work |
| 1322 | ^^^^^^^^^^^^^^^^^^^^ |
| 1323 | |
| 1324 | Subzero is well-positioned to explore other future security enhancements, e.g.: |
| 1325 | |
| 1326 | - Tightening the Native Client sandbox. ABI changes, such as the previous work |
| 1327 | on `hiding the sandbox base address |
| 1328 | <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_ |
| 1329 | in x86-64, are easy to experiment with in Subzero. |
| 1330 | |
| 1331 | - Making the executable code section read-only. This would prevent a PNaCl |
| 1332 | application from inspecting its own binary and trying to find ROP gadgets even |
| 1333 | after code diversification has been performed. It may still be susceptible to |
| 1334 | `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks, but
| 1335 | security is still improved overall.
| 1336 | |
| 1337 | - Instruction selection diversification. It may be possible to lower a given |
| 1338 | instruction in several largely equivalent ways, which gives more opportunities |
| 1339 | for code randomization. |
| 1340 | |
| 1341 | Chrome integration |
| 1342 | ------------------ |
| 1343 | |
| 1344 | Currently Subzero is available in Chrome for the x86-32 architecture, but under |
| 1345 | a flag. When the flag is enabled, Subzero is used when the `manifest file |
| 1346 | <https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_ |
| 1347 | linking to the ``.pexe`` file specifies the ``O0`` optimization level. |
| 1348 | |
| 1349 | The next step is to remove the flag, i.e. invoke Subzero as the only translator |
| 1350 | for ``O0``-specified manifest files. |
| 1351 | |
| 1352 | Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which |
| 1353 | case Subzero could be used for all PNaCl translation. |
| 1354 | |
| 1355 | Command line options |
| 1356 | -------------------- |
| 1357 | |
| 1358 | Subzero has a number of command-line options for debugging and diagnostics. |
| 1359 | Among the more interesting are the following. |
| 1360 | |
| 1361 | - Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other |
| 1362 | diagnostic output, with various levels of detail after each pass. Instruction |
| 1363 | numbers can be printed or suppressed. Deleted instructions can be printed or |
| 1364 | suppressed (they are retained in the instruction list, as discussed earlier, |
| 1365 | because they can help explain how lower-level instructions originated). |
| 1366 | Liveness information can be printed when available. Details of register |
| 1367 | allocation can be printed as register allocator decisions are made. And more. |
| 1368 | |
| 1369 | - Running Subzero with any level of verbosity produces an enormous amount of
| 1370 | output. When debugging a single function, the ``-verbose-focus`` flag
| 1371 | suppresses verbose output except for the specified function.
| 1373 | |
| 1374 | - Subzero has a ``-timing`` option that prints a breakdown of pass-level timing |
| 1375 | at exit. Timing markers can be placed in the Subzero source code to demarcate |
| 1376 | logical operations or passes of interest. Basic timing information plus |
| 1377 | call-stack type timing information is printed at the end. |
| 1378 | |
| 1379 | - Along with ``-timing``, the user can instead get a report on the overall |
| 1380 | translation time for each function, to help focus on timing outliers. Also, |
| 1381 | ``-timing-focus`` limits the ``-timing`` reporting to a single function, |
| 1382 | instead of aggregating pass timing across all functions. |
| 1383 | |
| 1384 | - The ``-szstats`` option reports various statistics on each function, such as |
| 1385 | stack frame size, static instruction count, etc. It may be helpful to track |
| 1386 | these stats over time as Subzero is improved, as an approximate measure of |
| 1387 | code quality. |
| 1388 | |
| 1389 | - The flag ``-asm-verbose``, in conjunction with emitting textual assembly |
| 1390 | output, annotates the assembly output with register-focused liveness
| 1391 | information. In particular, each basic block is annotated with which |
| 1392 | registers are live-in and live-out, and each instruction is annotated with |
| 1393 | which registers' and stack locations' live ranges end at that instruction. |
| 1394 | This is really useful when studying the generated code to find opportunities |
| 1395 | for code quality improvements. |
| 1396 | |
| 1397 | Testing and debugging |
| 1398 | --------------------- |
| 1399 | |
| 1400 | LLVM lit tests |
| 1401 | ^^^^^^^^^^^^^^ |
| 1402 | |
| 1403 | For basic testing, Subzero uses LLVM's `lit |
| 1404 | <http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests. We |
| 1405 | have a suite of hundreds of small functions where we test for particular |
| 1406 | assembly code patterns across different target architectures. |
| 1407 | |
| 1408 | Cross tests |
| 1409 | ^^^^^^^^^^^ |
| 1410 | |
| 1411 | Unfortunately, the lit tests don't do a great job of precisely testing the |
| 1412 | correctness of the output. Much better are the cross tests, which are execution |
| 1413 | tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide |
| 1414 | variety of interesting inputs. Each cross test consists of a set of C, C++, |
| 1415 | and/or low-level bitcode files. The C and C++ source files are compiled down to |
| 1416 | bitcode. The bitcode files are translated by ``pnacl-llc`` and also by Subzero. |
| 1417 | Subzero mangles global symbol names with a special prefix to avoid duplicate |
| 1418 | symbol errors. A driver program invokes both versions on a large set of |
| 1419 | interesting inputs, and reports when the Subzero and ``pnacl-llc`` results |
| 1420 | differ. Cross tests turn out to be an excellent way of testing the basic |
| 1421 | lowering patterns, but they are less useful for testing more global things like |
| 1422 | liveness analysis and register allocation. |
| 1423 | |
| 1424 | Bisection debugging |
| 1425 | ^^^^^^^^^^^^^^^^^^^ |
| 1426 | |
| 1427 | Sometimes with a new application, Subzero will end up producing incorrect code |
| 1428 | that either crashes at runtime or otherwise produces the wrong results. When |
| 1429 | this happens, we need to narrow it down to a single function (or small set of |
| 1430 | functions) that yield incorrect behavior. For this, we have a bisection |
| 1431 | debugging framework. Here, we initially translate the entire application once |
| 1432 | with Subzero and once with ``pnacl-llc``. We then use ``objcopy`` to
| 1433 | selectively weaken symbols based on a whitelist or blacklist provided on the |
| 1434 | command line. The two object files can then be linked together without link |
| 1435 | errors, with the desired version of each method "winning". Then the binary is |
| 1436 | tested, and bisection proceeds based on whether the binary produces correct |
| 1437 | output. |
| 1438 | |
| 1439 | When the bisection completes, we are left with a minimal set of |
| 1440 | Subzero-translated functions that cause the failure. Usually it is a single |
| 1441 | function, though sometimes it might require a combination of several functions |
| 1442 | to cause a failure; this may be due to an incorrect call ABI, for example. |
| 1443 | However, Murphy's Law implies that the single failing function is enormous and |
| 1444 | impractical to debug. In that case, we can restart the bisection, explicitly |
| 1445 | blacklisting the enormous function, and try to find another candidate to debug. |
| 1446 | (Future work is to automate this to find all minimal sets of functions, so that |
| 1447 | debugging can focus on the simplest example.) |
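
The search itself is an ordinary bisection over sets of functions; a sketch follows, where the ``fails`` predicate stands in for the whole weaken-link-run cycle:

```python
def bisect_failures(functions, fails):
    """Narrow a failure down to a small set of Subzero-translated
    functions.  `fails(subset)` stands in for: weaken all other
    symbols, let Subzero's copies of `subset` win at link time, run
    the binary, and report whether the output is wrong."""
    assert fails(functions), "full Subzero build must reproduce the bug"
    culprits = functions
    while len(culprits) > 1:
        mid = len(culprits) // 2
        left, right = culprits[:mid], culprits[mid:]
        if fails(left):
            culprits = left
        elif fails(right):
            culprits = right
        else:
            break  # failure needs functions from both halves; stop here
    return culprits

# Example: pretend function "f13" alone is miscompiled.
funcs = [f"f{i}" for i in range(32)]
bad = bisect_failures(funcs, lambda s: "f13" in s)
assert bad == ["f13"]
```

The early ``break`` models the multi-function case described above, e.g. a mismatched call ABI, where no single half reproduces the failure on its own.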
| 1448 | |
| 1449 | Fuzz testing |
| 1450 | ^^^^^^^^^^^^ |
| 1451 | |
| 1452 | As described above, we try to find internal Subzero bugs using fuzz testing |
| 1453 | techniques. |
| 1454 | |
| 1455 | Sanitizers |
| 1456 | ^^^^^^^^^^ |
| 1457 | |
| 1458 | Subzero can be built with `AddressSanitizer |
| 1459 | <http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer |
| 1460 | <http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support. This is |
| 1461 | done using something as simple as ``make ASAN=1`` or ``make TSAN=1``. So far, |
| 1462 | multithreading has been simple enough that TSan hasn't found any bugs, but ASan |
| 1463 | has found at least one memory leak which was subsequently fixed. |
| 1464 | `UndefinedBehaviorSanitizer |
| 1465 | <http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_ |
| 1466 | (UBSan) support is in progress. `Control flow integrity sanitization |
| 1467 | <http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under |
| 1468 | consideration. |
| 1469 | |
| 1470 | Current status |
| 1471 | ============== |
| 1472 | |
| 1473 | Target architectures |
| 1474 | -------------------- |
| 1475 | |
| 1476 | Subzero is currently more or less complete for the x86-32 target. It has been |
| 1477 | refactored and extended to handle x86-64 as well, and that is mostly complete at |
| 1478 | this point. |
| 1479 | |
| 1480 | ARM32 work is in progress. It currently lacks the testing level of x86, at |
| 1481 | least in part because Subzero's register allocator needs modifications to handle |
| 1482 | ARM's aliasing of floating point and vector registers. Specifically, a 64-bit |
| 1483 | register is actually a gang of two consecutive and aligned 32-bit registers, and |
| 1484 | a 128-bit register is a gang of four consecutive and aligned 32-bit registers.
| 1485 | ARM64 work has not started; when it does, it will be native-only since the |
| 1486 | Native Client sandbox model, validator, and other tools have never been defined. |
| 1487 | |
| 1488 | An external contributor is adding MIPS support, for the most part by following
| 1489 | ARM work. |
| 1490 | |
| 1491 | Translator performance |
| 1492 | ---------------------- |
| 1493 | |
| 1494 | Single-threaded translation speed is currently about 5× the ``pnacl-llc`` |
| 1495 | translation speed. For a large ``.pexe`` file, the time breaks down as: |
| 1496 | |
| 1497 | - 11% for parsing and initial IR building |
| 1498 | |
| 1499 | - 4% for emitting to /dev/null |
| 1500 | |
| 1501 | - 27% for liveness analysis (two liveness passes plus live range construction) |
| 1502 | |
| 1503 | - 15% for linear-scan register allocation |
| 1504 | |
| 1505 | - 9% for basic lowering |
| 1506 | |
| 1507 | - 10% for advanced phi lowering |
| 1508 | |
| 1509 | - ~11% for other minor analysis |
| 1510 | |
| 1511 | - ~10% measurement overhead to acquire these numbers |
| 1512 | |
| 1513 | Some improvements could undoubtedly be made, but it will be hard to increase the |
| 1514 | speed to 10× that of ``pnacl-llc`` while keeping acceptable code quality. With
| 1515 | ``-Om1`` (lack of) optimization, we do actually achieve roughly 10× |
| 1516 | ``pnacl-llc`` translation speed, but code quality drops by a factor of 3. |
| 1517 | |
| 1518 | Code quality |
| 1519 | ------------ |
| 1520 | |
| 1521 | Measured across 16 components of spec2k, Subzero's code quality is uniformly |
| 1522 | better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly |
| 1523 | between ``pnacl-llc`` ``-O0`` and ``-O2``. |
| 1524 | |
| 1525 | Translator size |
| 1526 | --------------- |
| 1527 | |
| 1528 | When built in MINIMAL mode, the x86-64 native translator size for the x86-32 |
| 1529 | target is about 700 KB, not including the size of functions referenced in |
| 1530 | dynamically-linked libraries. The sandboxed version of Subzero is a bit over 1 |
| 1531 | MB, and it is statically linked and also includes nop padding for bundling as |
| 1532 | well as indirect branch masking. |
| 1533 | |
| 1534 | Translator memory footprint |
| 1535 | --------------------------- |
| 1536 | |
| 1537 | It's hard to draw firm conclusions about memory footprint, since the footprint |
| 1538 | is at least proportional to the input function size, and there is no real limit |
| 1539 | on the size of functions in the ``.pexe`` file. |
| 1540 | |
| 1541 | That said, we looked at the memory footprint over time as Subzero translated |
| 1542 | ``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our |
| 1543 | disposal. One of LLVM's libraries that Subzero uses can report the current |
| 1544 | malloc heap usage. With single-threaded translation, Subzero tends to hover |
| 1545 | around 15 MB of memory usage. There are a couple of monstrous functions where |
| 1546 | Subzero grows to around 100 MB, but then it drops back down after those |
| 1547 | functions finish translating. In contrast, ``pnacl-llc`` grows larger and |
| 1548 | larger throughout translation, reaching several hundred MB by the time it |
| 1549 | completes. |
| 1550 | |
| 1551 | It's a bit more interesting when we enable multithreaded translation. When |
| 1552 | there are N translation threads, Subzero implements a policy that limits the |
| 1553 | size of the translation queue to N entries -- if it is "full" when the parser |
| 1554 | tries to add a new CFG, the parser blocks until one of the translation threads |
| 1555 | removes a CFG. This means the number of in-memory CFGs can (and generally does) |
| 1556 | reach 2*N+1, and so the memory footprint rises in proportion to the number of |
| 1557 | threads. Adding to the pressure is the observation that the monstrous functions |
| 1558 | also take proportionally longer time to translate, so there's a good chance many |
| 1559 | of the monstrous functions will be active at the same time with multithreaded |
| 1560 | translation. As a result, for N=32, Subzero's memory footprint peaks at about |
| 1561 | 260 MB, but drops back down as the large functions finish translating. |
| 1562 | |
| 1563 | If this peak memory size becomes a problem, it might be possible for the parser |
| 1564 | to resequence the functions to try to spread out the larger functions, or to |
| 1565 | throttle the translation queue to prevent too many in-flight large functions. |
| 1566 | It may also be possible to throttle based on memory pressure signaling from |
| 1567 | Chrome. |
| 1568 | |
| 1569 | Translator scalability |
| 1570 | ---------------------- |
| 1571 | |
| 1572 | Currently scalability is "not very good". Multiple translation threads lead to |
| 1573 | faster translation, but not to the degree desired. We haven't dug in to |
| 1574 | investigate yet. |
| 1575 | |
| 1576 | There are a few areas to investigate. First, there may be contention on the |
| 1577 | constant pool, which all threads access, and which requires locked access even |
| 1578 | for reading. This could be mitigated by keeping a CFG-local cache of the most |
| 1579 | common constants. |
| 1580 | |
| 1581 | Second, there may be contention on memory allocation. While almost all CFG |
| 1582 | objects are allocated from the CFG-local allocator, some passes use temporary |
| 1583 | STL containers that use the default allocator, which may require global locking. |
| 1584 | This could be mitigated by switching these to the CFG-local allocator. |
| 1585 | |
| 1586 | Third, multithreading may make the default allocator strategy more expensive. |
| 1587 | In a single-threaded environment, a pass will allocate its containers, run the |
| 1588 | pass, and deallocate the containers. This results in stack-like allocation |
| 1589 | behavior and makes the heap free list easier to manage, with less heap |
| 1590 | fragmentation. But when multithreading is added, the allocations and |
| 1591 | deallocations become much less stack-like, making allocation and deallocation |
| 1592 | operations individually more expensive. Again, this could be mitigated by |
| 1593 | switching these to the CFG-local allocator. |