Jim Stichnoth | efb8971 | 2015-09-03 13:19:54 -0700

Design of the Subzero fast code generator
=========================================

Introduction
------------

The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
compiler technology based on `LLVM <http://llvm.org/>`_.  The developer uses the
PNaCl toolchain to compile their application to architecture-neutral PNaCl
bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
possible.  The ``.pexe`` file is downloaded to the user's browser where the
PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
`sandboxed
<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
native code.  The translator uses architecture-specific optimizations as much as
practical to generate good native code.

The native code can be cached by the browser to avoid repeating translation on
future page loads.  However, first-time user experience is hampered by long
translation times.  The LLVM-based PNaCl translator is pretty slow, even when
using ``-O0`` to minimize optimizations, so delays are especially noticeable on
slow browser platforms such as ARM-based Chromebooks.

Translator slowness can be mitigated or hidden in a number of ways.

- Parallel translation.  However, slow machines where this matters most, e.g.
  ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
  and are likely to have less memory available for multiple translation threads
  to use.

- Streaming translation, i.e. start translating as soon as the download starts.
  This doesn't help much when translation speed is 10× slower than download
  speed, or when the ``.pexe`` file is already cached while the translated
  binary was flushed from the cache.

- Arrange the web page such that translation is done in parallel with
  downloading large assets.

- Arrange the web page to distract the user with `cat videos
  <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
  progress.

Or, improve translator performance to something more reasonable.

This document describes Subzero's attempt to improve translation speed by an
order of magnitude while rivaling LLVM's code quality.  Subzero does this
through minimal IR layering, lean data structures and passes, and a careful
selection of fast optimization passes.  It has two optimization recipes: full
optimizations (``O2``) and minimal optimizations (``Om1``).  The recipes are the
following (described in more detail below):

+----------------------------------------+-----------------------------+
| O2 recipe                              | Om1 recipe                  |
+========================================+=============================+
| Parse .pexe file                       | Parse .pexe file            |
+----------------------------------------+-----------------------------+
| Loop nest analysis                     |                             |
+----------------------------------------+-----------------------------+
| Local common subexpression elimination |                             |
+----------------------------------------+-----------------------------+
| Address mode inference                 |                             |
+----------------------------------------+-----------------------------+
| Read-modify-write (RMW) transform      |                             |
+----------------------------------------+-----------------------------+
| Basic liveness analysis                |                             |
+----------------------------------------+-----------------------------+
| Load optimization                      |                             |
+----------------------------------------+-----------------------------+
|                                        | Phi lowering (simple)       |
+----------------------------------------+-----------------------------+
| Target lowering                        | Target lowering             |
+----------------------------------------+-----------------------------+
| Full liveness analysis                 |                             |
+----------------------------------------+-----------------------------+
| Register allocation                    | Minimal register allocation |
+----------------------------------------+-----------------------------+
| Phi lowering (advanced)                |                             |
+----------------------------------------+-----------------------------+
| Post-phi register allocation           |                             |
+----------------------------------------+-----------------------------+
| Branch optimization                    |                             |
+----------------------------------------+-----------------------------+
| Code emission                          | Code emission               |
+----------------------------------------+-----------------------------+

Goals
=====

Translation speed
-----------------

We'd like to be able to translate a ``.pexe`` file as fast as download speed.
Any faster is in a sense wasted effort.  Download speed varies greatly, but
we'll arbitrarily say 1 MB/sec.  We'll pick the ARM A15 CPU as the example of a
slow machine.  We observe a 3× single-thread performance difference between A15
and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
``.pexe`` file could be compressed to 50% on the web server using gzip transport
compression, so we set the translation speed goal to 6 MB/sec on the high-end
Xeon workstation.

Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
⅒ the target rate.  The ``-O2`` mode takes 3× as long as the ``-O0`` mode.

In other words, Subzero's goal is to improve over LLVM's translation speed by
10×.

Code quality
------------

Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
code quality.  The stretch goal is to approach LLVM ``-O2`` code quality.  On
average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.

It's important to note that the quality of Subzero-generated code depends on
target-neutral optimizations and simplifications being run beforehand in the
developer environment.  The ``.pexe`` file reflects these optimizations.  For
example, Subzero assumes that the basic blocks are ordered topologically where
possible (which makes liveness analysis converge fastest), and Subzero does not
do any function inlining because it should already have been done.

Translator size
---------------

The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
We think 1 MB is a more reasonable size -- especially for a component that is
distributed to a billion Chrome users.  Thus we target a 10× reduction in
binary size.

For development, Subzero can be built for all target architectures, with all
debugging and diagnostic options enabled.  For a smaller translator, we restrict
it to a single target architecture, and define a ``MINIMAL`` build where
unnecessary features are compiled out.

Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
include directories, which have little impact on translator size.  It also uses
some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
with little size impact.  In non-``MINIMAL`` builds, the translator size is much
larger due to including code for parsing text-format bitcode files and forming
LLVM IR.

Memory footprint
----------------

The current LLVM-based translator suffers from an issue in which some
function-specific data has to be retained in memory until all translation
completes, and therefore the memory footprint grows without bound.  Large
``.pexe`` files can lead to the translator process holding hundreds of MB of
memory by the end.  The translator runs in a separate process, so this memory
growth doesn't *directly* affect other processes, but it does dirty the physical
memory and contributes to a perception of bloat and sometimes a reality of
out-of-memory tab killing, especially noticeable on weaker systems.

Subzero should maintain a stable memory footprint throughout translation.  It's
not practical to set a specific limit, because there is no practical limit on a
single function's size, but the footprint should be "reasonable" and be
proportional to the largest input function size, not the total ``.pexe`` file
size.  Simply put, Subzero should not have memory leaks or inexorable memory
growth.  (We use ASAN builds to test for leaks.)

Multithreaded translation
-------------------------

It should be practical to translate different functions concurrently and see
good scalability.  Some locking may be needed, such as accessing output buffers
or constant pools, but that should be fairly minimal.  In contrast, LLVM was
only designed for module-level parallelism, and as such, the PNaCl translator
internally splits a ``.pexe`` file into several modules for concurrent
translation.  All output needs to be deterministic regardless of the level of
multithreading, i.e. functions and data should always be output in the same
order.

Target architectures
--------------------

Initial target architectures are x86-32, x86-64, ARM32, and MIPS32.  Future
targets include ARM64 and MIPS64, though these targets lack NaCl support
including a sandbox model or a validator.

The first implementation is for x86-32, because it was expected to be
particularly challenging, and thus more likely to draw out any design problems
early:

- There are a number of special cases, asymmetries, and warts in the x86
  instruction set.

- Complex addressing modes may be leveraged for better code quality.

- 64-bit integer operations have to be lowered into longer sequences of 32-bit
  operations.

- Paucity of physical registers may reveal code quality issues early in the
  design.

Detailed design
===============

Intermediate representation - ICE
---------------------------------

Subzero's IR is called ICE.  It is designed to be reasonably similar to LLVM's
IR, which is reflected in the ``.pexe`` file's bitcode structure.  It has a
representation of global variables and initializers, and a set of functions.
Each function contains a list of basic blocks, and each basic block contains a
list of instructions.  Instructions that operate on stack and register variables
do so using static single assignment (SSA) form.

The ``.pexe`` file is translated one function at a time (or in parallel by
multiple translation threads).  The recipe for optimization passes depends on
the specific target and optimization level, and is described in detail below.
Global variables (types and initializers) are simply and directly translated to
object code, without any meaningful attempts at optimization.

A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
Its key contents include:

- A list of ``CfgNode`` pointers, generally held in topological order.

- A list of ``Variable`` pointers corresponding to local variables used in the
  function plus compiler-generated temporaries.

A basic block is represented by the ``Ice::CfgNode`` class.  Its key contents
include:

- A linear list of instructions, in the same style as LLVM.  The last
  instruction of the list is always a terminator instruction: branch, switch,
  return, unreachable.

- A list of Phi instructions, also in the same style as LLVM.  They are held as
  a linear list for convenience, though per Phi semantics, they are executed "in
  parallel" without dependencies on each other.

- An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
  another list for outgoing edges.

- The node's unique, 0-based index into the CFG's node list.

An instruction is represented by the ``Ice::Inst`` class.  Its key contents
include:

- A list of source operands.

- Its destination variable, if the instruction produces a result in an
  ``Ice::Variable``.

- A bitvector indicating which variables' live ranges this instruction ends.
  This is computed during liveness analysis.

Instruction kinds are divided into high-level ICE instructions and low-level
ICE instructions.  High-level instructions consist of the PNaCl/LLVM bitcode
instruction kinds.  Each target architecture implementation extends the
instruction space with its own set of low-level instructions.  Generally,
low-level instructions correspond to individual machine instructions.  The
high-level ICE instruction space includes a few additional instruction kinds
that are not part of LLVM but are generally useful (e.g., an Assignment
instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
instructions for sandboxing).

Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
the following:

- Alloca: allocate data on the stack

- Arithmetic: binary operations of the form ``A = B op C``

- Br: conditional or unconditional branch

- Call: function call

- Cast: unary type-conversion operations

- ExtractElement: extract a scalar element from a vector-type value

- Fcmp: floating-point comparison

- Icmp: integer comparison

- IntrinsicCall: call a known intrinsic

- InsertElement: insert a scalar element into a vector-type value

- Load: load a value from memory

- Phi: implement the SSA phi node

- Ret: return from the function

- Select: essentially the C language operation of the form ``X = C ? Y : Z``

- Store: store a value into memory

- Switch: generalized branch to multiple possible locations

- Unreachable: indicate that this portion of the code is unreachable

The additional high-level ICE instructions are the following:

- Assign: a simple ``A=B`` assignment.  This is useful for e.g. lowering Phi
  instructions to non-SSA assignments, before lowering to machine code.

- BundleLock, BundleUnlock.  These are markers used for sandboxing, but are
  common across all targets and so they are elevated to the high-level
  instruction set.

- FakeDef, FakeUse, FakeKill.  These are tools used to preserve consistency in
  liveness analysis, elevated to the high-level because they are used by all
  targets.  They are described in more detail at the end of this section.

- JumpTable: this represents the result of switch optimization analysis, where
  some switch instructions may use jump tables instead of cascading
  compare/branches.

An operand is represented by the ``Ice::Operand`` class.  In high-level ICE, an
operand is either an ``Ice::Constant`` or an ``Ice::Variable``.  Constants
include scalar integer constants, scalar floating point constants, Undef (an
unspecified constant of a particular scalar or vector type), and symbol
constants (essentially addresses of globals).  Note that the PNaCl ABI does not
include vector-type constants besides Undef, and as such, Subzero (so far) has
no reason to represent vector-type constants internally.  A variable represents
a value allocated on the stack (though not including alloca-derived storage).
Among other things, a variable holds its unique, 0-based index into the CFG's
variable list.

Each target can extend the ``Constant`` and ``Variable`` classes for its own
needs.  In addition, the ``Operand`` class may be extended, e.g. to define an
x86 ``MemOperand`` that encodes a base register, an index register, an index
register shift amount, and a constant offset.

Register allocation and liveness analysis are restricted to Variable operands.
Because of the importance of register allocation to code quality, and the
translation-time cost of liveness analysis, Variable operands get some special
treatment in ICE.  Most notably, a frequent pattern in Subzero is to iterate
across all the Variables of an instruction.  An instruction holds a list of
operands, but an operand may contain 0, 1, or more Variables.  As such, the
``Operand`` class specially holds a list of Variables contained within, for
quick access.

A Subzero transformation pass may work by deleting an existing instruction and
replacing it with zero or more new instructions.  Instead of actually deleting
the existing instruction, we generally mark it as deleted and insert the new
instructions right after the deleted instruction.  When printing the IR for
debugging, this is a big help because it makes it much clearer how the
non-deleted instructions came about.

Subzero has a few special instructions to help with liveness analysis
consistency.

- The FakeDef instruction gives a fake definition of some variable.  For
  example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
  but an ICE instruction can represent only one destination variable.  This is
  similar for multiply instructions, and for function calls that return a 64-bit
  integer result in the ``%edx:%eax`` pair.  Also, using the ``xor %eax, %eax``
  trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.

- The FakeUse instruction registers a use of a variable, typically to prevent an
  earlier assignment to that variable from being dead-code eliminated.  For
  example, lowering an operation like ``x=cc?y:z`` may be done using x86's
  conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``.  Without a
  FakeUse of ``x`` between the two instructions, the liveness analysis pass may
  dead-code eliminate the first instruction.

- The FakeKill instruction is added after a call instruction, and is a quick way
  of indicating that caller-save registers are invalidated.

Pexe parsing
------------

Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files.  It
parses the ``.pexe`` file function by function, ultimately constructing an ICE
CFG for each function.  After a function is parsed, its CFG is handed off to the
translation phase.  The bitcode parser also parses global initializer data and
hands it off to be translated to data sections in the object file.

Subzero has another parsing strategy for testing/debugging.  LLVM libraries can
be used to parse a module into LLVM IR (though very slowly relative to Subzero
native parsing).  Then we iterate across the LLVM IR and construct high-level
ICE, handing off each CFG to the translation phase.

Overview of lowering
--------------------

In general, translation goes like this:

- Parse the next function from the ``.pexe`` file and construct a CFG consisting
  of high-level ICE.

- Do analysis passes and transformation passes on the high-level ICE, as
  desired.

- Lower each high-level ICE instruction into a sequence of zero or more
  low-level ICE instructions.  Each high-level instruction is generally lowered
  independently, though the target lowering is allowed to look ahead in the
  CfgNode's instruction list if desired.

- Do more analysis and transformation passes on the low-level ICE, as desired.

- Assemble the low-level CFG into an ELF object file (alternatively, a textual
  assembly file that is later assembled by some external tool).

- Repeat for all functions, and also produce object code for data such as global
  initializers and internal constant pools.

Currently there are two optimization levels: ``O2`` and ``Om1``.  For ``O2``,
the intention is to apply all available optimizations to get the best code
quality (though the initial code quality goal is measured against LLVM's ``O0``
code quality).  For ``Om1``, the intention is to apply as few optimizations as
possible and produce code as quickly as possible, accepting poor code quality.
``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
"sub-zero".

High-level debuggability of generated code is so far not a design requirement.
Subzero doesn't really do transformations that would obfuscate debugging; the
main thing might be that register allocation (including stack slot coalescing
for stack-allocated variables whose live ranges don't overlap) may render a
variable's value unobtainable after its live range ends.  This would not be an
issue for ``Om1`` since it doesn't register-allocate program-level variables,
nor does it coalesce stack slots.  That said, fully supporting debuggability
would require a few additions:

- DWARF support would need to be added to Subzero's ELF file emitter.  Subzero
  propagates global symbol names, local variable names, and function-internal
  label names that are present in the ``.pexe`` file.  This would allow a
  debugger to map addresses back to symbols in the ``.pexe`` file.

- To map ``.pexe`` file symbols back to meaningful source-level symbol names,
  file names, line numbers, etc., Subzero would need to handle `LLVM bitcode
  metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg``
  `intrinsics <http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.

- The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so
  the toolchain would need to be modified to preserve it.

Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
produces code with one third the code quality.  ``Om1`` is good for testing and
debugging -- during translation, it tends to expose errors in the basic lowering
that might otherwise have been hidden by the register allocator or other
optimization passes.  It also helps determine whether a code correctness problem
is a fundamental problem in the basic lowering, or an error in another
optimization pass.

The implementation of target lowering also controls the recipe of passes used
for ``Om1`` and ``O2`` translation.  For example, address mode inference may
only be relevant for x86.

Lowering strategy
-----------------

The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
each high-level instruction down to a sequence of low-level target-specific
instructions, in a largely context-free setting.  That is, each high-level
instruction conceptually has a simple template expansion into low-level
instructions, and lowering can in theory be done in any order.  This may sound
like a small effort, but quite a large number of templates may be needed because
of the number of PNaCl types and instruction variants.  Furthermore, there may
be optimized templates, e.g. to take advantage of operator commutativity (for
example, ``x=x+1`` might allow a better lowering than ``x=1+x``).  This is
similar to other template-based approaches in fast code generation or
interpretation, though some decisions are deferred until after some global
analysis passes, mostly related to register allocation, stack slot assignment,
and specific choice of instruction variant and addressing mode.

The key idea for a lowering template is to produce valid low-level instructions
that are guaranteed to meet address mode and other structural requirements of
the instruction set.  For example, on x86, the source operand of an integer
store instruction must be an immediate or a physical register; a shift
instruction's shift amount must be an immediate or in register ``%cl``; a
function's integer return value is in ``%eax``; most x86 instructions are
two-operand, in contrast to corresponding three-operand high-level instructions;
etc.

Because target lowering runs before register allocation, there is no way to know
whether a given ``Ice::Variable`` operand lives on the stack or in a physical
register.  When the low-level instruction calls for a physical register operand,
the target lowering can create an infinite-weight Variable.  This tells the
register allocator to assign infinite weight when making decisions, effectively
guaranteeing some physical register.  Variables can also be pre-colored to a
specific physical register (``cl`` in the shift example above), which also gives
infinite weight.

To illustrate, consider a high-level arithmetic instruction on 32-bit integer
operands::

    A = B + C

X86 target lowering might produce the following::

    T.inf = B  // mov instruction
    T.inf += C // add instruction
    A = T.inf  // mov instruction

Here, ``T.inf`` is an infinite-weight temporary.  As long as ``T.inf`` has a
physical register, the three lowered instructions are all encodable regardless
of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
whether ``A`` is a physical register or in memory.

In this example, ``A`` must be a Variable and one may be tempted to simplify the
lowering sequence by setting ``A`` as infinite-weight and using::

    A = B  // mov instruction
    A += C // add instruction

This has two problems.  First, if the original instruction was actually ``A =
B + A``, the result would be incorrect.  Second, assigning ``A`` a physical
register applies throughout ``A``'s entire live range.  This is probably not
what is intended, and may ultimately lead to a failure to allocate a register
for an infinite-weight variable.

This style of lowering leads to many temporaries being generated, so in ``O2``
mode, we rely on the register allocator to clean things up.  For example, in the
example above, if ``B`` ends up getting a physical register and its live range
ends at this instruction, the register allocator is likely to reuse that
register for ``T.inf``.  This leads to ``T.inf=B`` being a redundant register
copy, which is removed as an emission-time peephole optimization.

O2 lowering
-----------

Currently, the ``O2`` lowering recipe is the following:

- Loop nest analysis

- Local common subexpression elimination

- Address mode inference

- Read-modify-write (RMW) transformation

- Basic liveness analysis

- Load optimization

- Target lowering

- Full liveness analysis

- Register allocation

- Phi instruction lowering (advanced)

- Post-phi lowering register allocation

- Branch optimization

These passes are described in more detail below.

Om1 lowering
------------

Currently, the ``Om1`` lowering recipe is the following:

- Phi instruction lowering (simple)

- Target lowering

- Register allocation (infinite-weight and pre-colored only)

Optimization passes
-------------------

Liveness analysis
^^^^^^^^^^^^^^^^^

Liveness analysis is a standard dataflow optimization, implemented as follows.
For each node (basic block), its live-out set is computed as the union of the
live-in sets of its successor nodes.  Then the node's instructions are processed
in reverse order, updating the live set, until the beginning of the node is
reached, and the node's live-in set is recorded.  If this iteration has changed
the node's live-in set, the node's predecessors are marked for reprocessing.
This continues until no more nodes need reprocessing.  If nodes are processed in
reverse topological order, the number of iterations over the CFG is generally
equal to the maximum loop nest depth.

| 574 | To implement this, each node records its live-in and live-out sets, initialized |
| 575 | to the empty set. Each instruction records which of its Variables' live ranges |
| 576 | end in that instruction, initialized to the empty set. A side effect of |
| 577 | liveness analysis is dead instruction elimination. Each instruction can be |
| 578 | marked as tentatively dead, and after the algorithm converges, the tentatively |
| 579 | dead instructions are permanently deleted. |
| 580 | |
| 581 | Optionally, after this liveness analysis completes, we can do live range |
| 582 | construction, in which we calculate the live range of each variable in terms of |
| 583 | instruction numbers. A live range is represented as a union of segments, where |
| 584 | the segment endpoints are instruction numbers. Instruction numbers are required |
| 585 | to be unique across the CFG, and monotonically increasing within a basic block. |
| 586 | As a union of segments, live ranges can contain "gaps" and are therefore |
| 587 | precise. Because of SSA properties, a variable's live range can start at most |
| 588 | once in a basic block, and can end at most once in a basic block. Liveness |
| 589 | analysis keeps track of which variable/instruction tuples begin live ranges and |
| 590 | end live ranges, and combined with live-in and live-out sets, we can efficiently |
| 591 | build up live ranges of all variables across all basic blocks. |
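To make the segment-union representation concrete, here is a minimal Python sketch of a live range built from ``[start, end]`` instruction-number segments. The class and method names are illustrative, not Subzero's.

```python
class LiveRange:
    def __init__(self):
        self.segments = []  # sorted, disjoint [start, end] pairs

    def add_segment(self, start, end):
        # Instruction numbers increase monotonically, so appending keeps
        # order; merge with the previous segment when they touch.
        if self.segments and self.segments[-1][1] >= start:
            self.segments[-1][1] = max(self.segments[-1][1], end)
        else:
            self.segments.append([start, end])

    def overlaps(self, other):
        # Walk both sorted segment lists in tandem.
        i = j = 0
        while i < len(self.segments) and j < len(other.segments):
            a, b = self.segments[i], other.segments[j]
            if a[0] <= b[1] and b[0] <= a[1]:
                return True
            if a[1] < b[1]:
                i += 1
            else:
                j += 1
        return False

# A variable live during instructions 10..14 and again 30..32, with a gap.
r1 = LiveRange(); r1.add_segment(10, 14); r1.add_segment(30, 32)
r2 = LiveRange(); r2.add_segment(15, 29)  # fits entirely inside r1's gap
print(r1.overlaps(r2))  # the gap makes the ranges precise: no overlap
```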
| 592 | |
| 593 | A lot of care is taken to try to make liveness analysis fast and efficient. |
| 594 | Because of the lowering strategy, the number of variables is generally |
| 595 | proportional to the number of instructions, leading to an O(N^2) complexity |
| 596 | algorithm if implemented naively. To improve things based on sparsity, we note |
| 597 | that most variables are "local" and referenced in at most one basic block (in |
| 598 | contrast to the "global" variables with multi-block usage), and therefore cannot |
| 599 | be live across basic blocks. Therefore, the live-in and live-out sets, |
| 600 | typically represented as bit vectors, can be limited to the set of global |
| 601 | variables, and the intra-block liveness bit vector can be compacted to hold the |
| 602 | global variables plus the local variables for that block. |
| 603 | |
| 604 | Register allocation |
| 605 | ^^^^^^^^^^^^^^^^^^^ |
| 606 | |
| 607 | Subzero implements a simple linear-scan register allocator, based on the |
| 608 | allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan |
| 609 | Register Allocation in the Context of SSA Form and Register Constraints |
| 610 | <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. This allocator has |
| 611 | several nice features: |
| 612 | |
| 613 | - Live ranges are represented as unions of segments, as described above, rather |
| 614 | than a single start/end tuple. |
| 615 | |
| 616 | - It allows pre-coloring of variables with specific physical registers. |
| 617 | |
| 618 | - It applies equally well to pre-lowered Phi instructions. |
| 619 | |
| 620 | The paper suggests an approach of aggressively coalescing variables across Phi |
| 621 | instructions (i.e., trying to force Phi source and destination variables to have |
| 622 | the same register assignment), but we reject that in favor of the more natural |
| 623 | preference mechanism described below. |
| 624 | |
| 625 | We enhance the algorithm in the paper with the capability of automatic inference |
| 626 | of register preference, and with the capability of allowing overlapping live |
| 627 | ranges to safely share the same register in certain circumstances. If we are |
| 628 | considering register allocation for variable ``A``, and ``A`` has a single |
| 629 | defining instruction ``A=B+C``, then the preferred register for ``A``, if |
| 630 | available, would be the register assigned to ``B`` or ``C``, if any, provided |
| 631 | that ``B`` or ``C``'s live range does not overlap ``A``'s live range. In this |
| 632 | way we infer a good register preference for ``A``. |
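The preference inference for ``A = B + C`` can be sketched as follows. The names, the single-segment ranges, and the data layout are all hypothetical simplifications, not Subzero's implementation.

```python
def preferred_register(dest, sources, assignment, ranges, overlaps):
    """Return a preferred physical register for `dest`, or None."""
    for src in sources:
        reg = assignment.get(src)
        # Prefer a source's register only if the live ranges don't overlap.
        if reg is not None and not overlaps(ranges[dest], ranges[src]):
            return reg
    return None

# Single-segment live ranges as (start, end) instruction numbers.
overlaps = lambda r1, r2: r1[0] <= r2[1] and r2[0] <= r1[1]
ranges = {"A": (5, 9), "B": (1, 4), "C": (2, 7)}
assignment = {"B": "eax", "C": "ecx"}
# B's range ends before A's begins, so A can inherit eax; C overlaps A.
print(preferred_register("A", ["B", "C"], assignment, ranges, overlaps))
```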
| 633 | |
| 634 | We allow overlapping live ranges to get the same register in certain cases. |
| 635 | Suppose a high-level instruction like:: |
| 636 | |
| 637 | A = unary_op(B) |
| 638 | |
| 639 | has been target-lowered like:: |
| 640 | |
| 641 | T.inf = B |
| 642 | A = unary_op(T.inf) |
| 643 | |
| 644 | Further, assume that ``B``'s live range continues beyond this instruction |
| 645 | sequence, and that ``B`` has already been assigned some register. Normally, we |
| 646 | might want to infer ``B``'s register as a good candidate for ``T.inf``, but it |
| 647 | turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have |
| 648 | different registers. But ``T.inf`` is just a read-only copy of ``B`` that is |
| 649 | guaranteed to be in a register, so in theory these overlapping live ranges could |
| 650 | safely have the same register. Our implementation allows this overlap as long |
| 651 | as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never |
| 652 | modified within ``T.inf``'s live range. |
| 653 | |
| 654 | Subzero's register allocator can be run in 3 configurations. |
| 655 | |
| 656 | - Normal mode. All Variables are considered for register allocation. It |
| 657 | requires full liveness analysis and live range construction as a prerequisite. |
| 658 | This is used by ``O2`` lowering. |
| 659 | |
| 660 | - Minimal mode. Only infinite-weight or pre-colored Variables are considered. |
| 661 | All other Variables are stack-allocated. It does not require liveness |
| 662 | analysis; instead, it quickly scans the instructions and records first |
| 663 | definitions and last uses of all relevant Variables, using that to construct a |
| 664 | single-segment live range. Although this includes most of the Variables, the |
| 665 | live ranges are mostly simple, short, and rarely overlapping, which the |
| 666 | register allocator handles efficiently. This is used by ``Om1`` lowering. |
| 667 | |
| 668 | - Post-phi lowering mode. Advanced phi lowering is done after normal-mode |
| 669 | register allocation, and may result in new infinite-weight Variables that need |
| 670 | registers. One would like to just run something like minimal mode to assign |
| 671 | registers to the new Variables while respecting existing register allocation |
| 672 | decisions. However, it sometimes happens that there are no free registers. |
| 673 | In this case, some register needs to be forcibly spilled to the stack and |
| 674 | temporarily reassigned to the new Variable, and reloaded at the end of the new |
| 675 | Variable's live range. The register must be one that has no explicit |
| 676 | references during the Variable's live range. Since Subzero currently doesn't |
| 677 | track def/use chains (though it does record the CfgNode where a Variable is |
| 678 | defined), we just do a brute-force search across the CfgNode's instruction |
| 679 | list for the instruction numbers of interest. This situation happens very |
| 680 | rarely, so there's little point for now in improving its performance. |
| 681 | |
| 682 | The basic linear-scan algorithm may, as it proceeds, rescind an early register |
| 683 | allocation decision, leaving that Variable to be stack-allocated. Some of these |
| 684 | times, it turns out that the Variable could have been given a different register |
| 685 | without conflict, but by this time it's too late. The literature recognizes |
| 686 | this situation and describes "second-chance bin-packing", which Subzero can do. |
| 687 | We can rerun the register allocator in a mode that respects existing register |
| 688 | allocation decisions, and sometimes it finds new non-conflicting opportunities. |
| 689 | In fact, we can repeatedly run the register allocator until convergence. |
| 690 | Unfortunately, in the current implementation, these subsequent register |
| 691 | allocation passes end up being extremely expensive. This is because of the |
| 692 | treatment of the "unhandled pre-colored" Variable set, which is normally very |
| 693 | small but ends up being quite large on subsequent passes. Its performance can |
| 694 | probably be made acceptable with a better choice of data structures, but for now |
| 695 | this second-chance mechanism is disabled. |
| 696 | |
| 697 | Future work is to implement LLVM's `Greedy |
| 698 | <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_ |
| 699 | register allocator as a replacement for the basic linear-scan algorithm, given |
| 700 | LLVM's experience with its improvement in code quality. (The blog post claims |
| 701 | that the Greedy allocator also improved maintainability because a lot of hacks |
| 702 | could be removed, but Subzero is probably not yet to that level of hacks, and is |
| 703 | less likely to see that particular benefit.) |
| 704 | |
Manasij Mukherjee | a41e9a1 | 2016-08-05 12:29:56 -0700 | [diff] [blame] | 705 | Local common subexpression elimination |
| 706 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 707 | |
| 708 | The Local CSE implementation goes through each instruction and records a portion |
| 709 | of each ``Seen`` instruction in a hashset-like container. That portion consists |
| 710 | of the entire instruction except for any dest variable. That means ``A = X + Y`` |
| 711 | and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows |
| 712 | us to detect common subexpressions. |
| 713 | |
| 714 | Whenever a repetition is detected, the redundant variables are stored in a |
| 715 | container mapping the replacee to the replacement. In the case above, it would |
| 716 | be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``. |
| 717 | |
| 718 | At any point if a variable that has an entry in the replacement table is |
| 719 | encountered, it is replaced with the variable it is mapped to. This ensures that |
| 720 | the redundant variables will not have any uses in the basic block, allowing |
| 721 | dead code elimination to clean up the redundant instruction. |
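The mechanism above can be sketched in a few lines of Python. The ``(op, dest, srcs)`` instruction encoding is hypothetical, and where the actual pass leaves the redundant instruction behind for dead-code elimination to clean up, this sketch simply drops it.

```python
def local_cse(block):
    seen = {}     # (op, srcs) -> dest of the first occurrence
    replace = {}  # redundant dest -> canonical dest
    out = []
    for op, dest, srcs in block:
        # Rewrite sources through the replacement table first.
        srcs = tuple(replace.get(s, s) for s in srcs)
        key = (op, srcs)  # the instruction minus its dest variable
        if key in seen:
            replace[dest] = seen[key]  # B = X + Y becomes a use of A
            continue                   # redundant assignment dropped here
        seen[key] = dest
        out.append((op, dest, srcs))
    return out

block = [
    ("add", "A", ("X", "Y")),
    ("add", "B", ("X", "Y")),  # redundant: same op and sources as A
    ("mul", "D", ("B", "Z")),  # this use of B is rewritten to A
]
print(local_cse(block))  # [('add', 'A', ('X', 'Y')), ('mul', 'D', ('A', 'Z'))]
```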
| 722 | |
| 723 | With SSA, the information stored is never invalidated. However, non-SSA input is |
| 724 | supported with the ``-lcse=no-ssa`` option. This has to maintain some extra |
| 725 | dependency information to ensure proper invalidation on variable assignment. |
This mode is not rigorously tested, because the pass is normally run at an early
stage where it is safe to assume variables have a single definition.  It is also
not enabled by default, because it bumps the compile-time overhead from 2% to 6%.
| 729 | |
| 730 | Loop-invariant code motion |
| 731 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 732 | |
| 733 | This pass utilizes the loop analysis information to hoist invariant instructions |
| 734 | to loop pre-headers. A loop must have a single entry node (header) and that node |
| 735 | must have a single external predecessor for this optimization to work, as it is |
| 736 | currently implemented. |
| 737 | |
| 738 | The pass works by iterating over all instructions in the loop until the set of |
invariant instructions converges.  In each iteration, any instruction not yet
marked invariant whose operands are all constants or variables already known to
be invariant is added to the
| 741 | result set. The destination variable of that instruction is added to the set of |
| 742 | variables known to be invariant (which is initialized with the constant args). |
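The fixed-point iteration just described can be sketched as follows. The ``(dest, srcs)`` instruction encoding is hypothetical, with constants modeled as strings starting with ``#``.

```python
def find_invariants(loop_insts):
    invariant_vars = set()   # grows from the constant args upward
    invariant_insts = set()
    changed = True
    while changed:           # iterate until the result set converges
        changed = False
        for i, (dest, srcs) in enumerate(loop_insts):
            if i in invariant_insts:
                continue
            # Invariant if every operand is a constant or already invariant.
            if all(s.startswith("#") or s in invariant_vars for s in srcs):
                invariant_insts.add(i)
                invariant_vars.add(dest)
                changed = True
    return [loop_insts[i] for i in sorted(invariant_insts)]

loop = [
    ("t1", ("#4", "#8")),   # invariant immediately
    ("t2", ("t1", "#1")),   # invariant once t1 is known invariant
    ("t3", ("t2", "n")),    # n is defined in the loop: not invariant
]
print(find_invariants(loop))  # t1 and t2 are hoistable
```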
| 743 | |
| 744 | Improving the loop-analysis infrastructure is likely to have significant impact |
| 745 | on this optimization. Inserting an extra node to act as the pre-header when the |
| 746 | header has multiple incoming edges from outside could also be a good idea. |
| 747 | Expanding the initial invariant variable set to contain all variables that do |
| 748 | not have definitions inside the loop does not seem to improve anything. |
| 749 | |
| 750 | Short circuit evaluation |
| 751 | ^^^^^^^^^^^^^^^^^^^^^^^^ |
| 752 | |
| 753 | Short circuit evaluation splits nodes and introduces early jumps when the result |
| 754 | of a logical operation can be determined early and there are no observable side |
effects of skipping the rest of the computation.  The instructions are
considered backwards from the end of the basic block.  When a definition of a variable
| 757 | involved in a conditional jump is found, an extra jump can be inserted in that |
| 758 | location, moving the rest of the instructions in the node to a newly inserted |
| 759 | node. Consider this example:: |
| 760 | |
| 761 | __N : |
| 762 | a = <something> |
| 763 | Instruction 1 without side effect |
| 764 | ... b = <something> ... |
| 765 | Instruction N without side effect |
| 766 | t1 = or a b |
| 767 | br t1 __X __Y |
| 768 | |
| 769 | is transformed to:: |
| 770 | |
| 771 | __N : |
| 772 | a = <something> |
| 773 | br a __X __N_ext |
| 774 | |
| 775 | __N_ext : |
| 776 | Instruction 1 without side effect |
| 777 | ... b = <something> ... |
| 778 | Instruction N without side effect |
| 779 | br b __X __Y |
| 780 | |
The logic for AND is analogous; the only difference is that the early jump is
taken on a ``false`` value instead of ``true``.
| 783 | |
Global variable splitting
^^^^^^^^^^^^^^^^^^^^^^^^^
| 786 | |
| 787 | Global variable splitting (``-split-global-vars``) is run after register |
| 788 | allocation. It works on the variables that did not manage to get registers (but |
| 789 | are allowed to) and decomposes their live ranges into the individual segments |
| 790 | (which span a single node at most). New variables are created (but not yet used) |
with these smaller live ranges, and the register allocator is run again.  This
is not as inefficient as it may sound, because the old variables that already
had registers are now considered pre-colored.
| 794 | |
Each new variable that gets a register replaces its parent variable for its
portion of the parent's live range.  A copy from the old variable to the new
one is introduced before the first use, and the reverse copy after the last def
in the live range.
| 799 | |
Basic phi lowering
| 801 | ^^^^^^^^^^^^^^^^^^ |
| 802 | |
| 803 | The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` |
| 804 | implements it). Consider this example:: |
| 805 | |
| 806 | L1: |
| 807 | ... |
| 808 | br L3 |
| 809 | L2: |
| 810 | ... |
| 811 | br L3 |
| 812 | L3: |
| 813 | A = phi [B, L1], [C, L2] |
| 814 | X = phi [Y, L1], [Z, L2] |
| 815 | |
| 816 | For each destination of a phi instruction, we can create a temporary and insert |
| 817 | the temporary's assignment at the end of the predecessor block:: |
| 818 | |
| 819 | L1: |
| 820 | ... |
| 821 | A' = B |
| 822 | X' = Y |
| 823 | br L3 |
| 824 | L2: |
| 825 | ... |
| 826 | A' = C |
| 827 | X' = Z |
| 828 | br L3 |
L3:
| 830 | A = A' |
| 831 | X = X' |
| 832 | |
| 833 | This transformation is very simple and reliable. It can be done before target |
| 834 | lowering and register allocation, and it easily avoids the classic lost-copy and |
| 835 | related problems. ``Om1`` lowering uses this strategy. |
| 836 | |
| 837 | However, it has the disadvantage of initializing temporaries even for branches |
| 838 | not taken, though that could be mitigated by splitting non-critical edges and |
| 839 | putting assignments in the edge-split nodes. Another problem is that without |
| 840 | extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a |
| 841 | specific ordering even though phi semantics are that the assignments are |
| 842 | parallel or unordered. This sometimes imposes false live range overlaps and |
| 843 | leads to poorer register allocation. |
| 844 | |
| 845 | Advanced phi lowering |
| 846 | ^^^^^^^^^^^^^^^^^^^^^ |
| 847 | |
| 848 | ``O2`` lowering defers phi lowering until after register allocation to avoid the |
| 849 | problem of false live range overlaps. It works as follows. We split each |
| 850 | incoming edge and move the (parallel) phi assignments into the split nodes. We |
| 851 | linearize each set of assignments by finding a safe, topological ordering of the |
| 852 | assignments, respecting register assignments as well. For example:: |
| 853 | |
| 854 | A = B |
| 855 | X = Y |
| 856 | |
| 857 | Normally these assignments could be executed in either order, but if ``B`` and |
| 858 | ``X`` are assigned the same physical register, we would want to use the above |
| 859 | ordering. Dependency cycles are broken by introducing a temporary. For |
| 860 | example:: |
| 861 | |
| 862 | A = B |
| 863 | B = A |
| 864 | |
| 865 | Here, a temporary breaks the cycle:: |
| 866 | |
| 867 | t = A |
| 868 | A = B |
| 869 | B = t |
| 870 | |
| 871 | Finally, we use the existing target lowering to lower the assignments in this |
| 872 | basic block, and once that is done for all basic blocks, we run the post-phi |
| 873 | variant of register allocation on the edge-split basic blocks. |
| 874 | |
| 875 | When computing a topological order, we try to first schedule assignments whose |
| 876 | source has a physical register, and last schedule assignments whose destination |
| 877 | has a physical register. This helps reduce register pressure. |
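The linearization with cycle breaking can be sketched as follows. This illustrative version ignores the register-based scheduling heuristics (they only affect which ready copy is picked first) and uses a single hypothetical temporary name.

```python
def sequentialize(copies):
    """copies: list of (dest, src) pairs with parallel phi semantics."""
    pending = {d: s for d, s in copies if d != s}  # drop no-op self-copies
    out = []
    while pending:
        # A copy is safe to emit when no pending copy still reads its dest.
        ready = [d for d in pending if d not in pending.values()]
        if ready:
            d = ready[0]
            out.append((d, pending.pop(d)))
        else:
            # Every destination is still read somewhere: a dependency cycle.
            d = next(iter(pending))
            out.append(("t", d))  # save d's value in a temporary...
            # ...and redirect the remaining reads of d to the temporary.
            pending = {k: ("t" if v == d else v) for k, v in pending.items()}
    return out

# The A = B / B = A swap from the text becomes t = A; A = B; B = t.
print(sequentialize([("A", "B"), ("B", "A")]))
```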
| 878 | |
| 879 | X86 address mode inference |
| 880 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 881 | |
| 882 | We try to take advantage of the x86 addressing mode that includes a base |
| 883 | register, an index register, an index register scale amount, and an immediate |
| 884 | offset. We do this through simple pattern matching. Starting with a load or |
| 885 | store instruction where the address is a variable, we initialize the base |
| 886 | register to that variable, and look up the instruction where that variable is |
| 887 | defined. If that is an add instruction of two variables and the index register |
| 888 | hasn't been set, we replace the base and index register with those two |
| 889 | variables. If instead it is an add instruction of a variable and a constant, we |
| 890 | replace the base register with the variable and add the constant to the |
| 891 | immediate offset. |
| 892 | |
| 893 | There are several more patterns that can be matched. This pattern matching |
| 894 | continues on the load or store instruction until no more matches are found. |
| 895 | Because a program typically has few load and store instructions (not to be |
| 896 | confused with instructions that manipulate stack variables), this address mode |
| 897 | inference pass is fast. |
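The two patterns described above might be sketched like this. The instruction encoding (a map from a variable to its defining ``("add", a, b)`` instruction) is hypothetical, and only the constant-second-operand form is matched for brevity.

```python
def infer_addr_mode(addr, defs, is_const):
    base, index, offset = addr, None, 0
    changed = True
    while changed:  # keep matching until no more patterns apply
        changed = False
        inst = defs.get(base)
        if inst and inst[0] == "add":
            _, a, b = inst
            if is_const(b):
                # base = a + const: fold the constant into the offset.
                base, offset = a, offset + b
                changed = True
            elif index is None:
                # base = a + b: split into base and index registers.
                base, index = a, b
                changed = True
    return base, index, offset

defs = {
    "t2": ("add", "t1", 8),     # t2 = t1 + 8
    "t1": ("add", "arr", "i"),  # t1 = arr + i
}
# A load from t2 becomes a load from [arr + i + 8].
print(infer_addr_mode("t2", defs, lambda x: isinstance(x, int)))
```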
| 898 | |
| 899 | X86 read-modify-write inference |
| 900 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 901 | |
| 902 | A reasonably common bitcode pattern is a non-atomic update of a memory |
| 903 | location:: |
| 904 | |
| 905 | x = load addr |
| 906 | y = add x, 1 |
| 907 | store y, addr |
| 908 | |
| 909 | On x86, with good register allocation, the Subzero passes described above |
generate code no better than this::
| 911 | |
| 912 | mov [%ebx], %eax |
| 913 | add $1, %eax |
| 914 | mov %eax, [%ebx] |
| 915 | |
| 916 | However, x86 allows for this kind of code:: |
| 917 | |
| 918 | add $1, [%ebx] |
| 919 | |
| 920 | which requires fewer instructions, but perhaps more importantly, requires fewer |
| 921 | physical registers. |
| 922 | |
| 923 | It's also important to note that this transformation only makes sense if the |
| 924 | store instruction ends ``x``'s live range. |
| 925 | |
| 926 | Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW) |
| 927 | opportunities via simple pattern matching. The only problem is that it is run |
| 928 | before liveness analysis, which is needed to determine whether ``x``'s live |
| 929 | range ends after the RMW. Since liveness analysis is one of the most expensive |
| 930 | passes, it's not attractive to run it an extra time just for RMW analysis. |
| 931 | Instead, we essentially generate both the RMW and the non-RMW versions, and then |
during lowering, the RMW version deletes itself if it finds ``x`` still live.
| 933 | |
| 934 | X86 compare-branch inference |
| 935 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 936 | |
| 937 | In the LLVM instruction set, the compare/branch pattern works like this:: |
| 938 | |
| 939 | cond = icmp eq a, b |
| 940 | br cond, target |
| 941 | |
| 942 | The result of the icmp instruction is a single bit, and a conditional branch |
| 943 | tests that bit. By contrast, most target architectures use this pattern:: |
| 944 | |
| 945 | cmp a, b // implicitly sets various bits of FLAGS register |
| 946 | br eq, target // branch on a particular FLAGS bit |
| 947 | |
| 948 | A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests |
| 949 | ``cond`` and conditionally branches. Subzero has a pass that identifies |
| 950 | boolean-based operations like this and folds them into a single |
| 951 | compare/branch-like operation. It is set up for more than just cmp/br though. |
| 952 | Boolean producers include icmp (integer compare), fcmp (floating-point compare), |
| 953 | and trunc (integer truncation when the destination has bool type). Boolean |
| 954 | consumers include branch, select (the ternary operator from the C language), and |
| 955 | sign-extend and zero-extend when the source has bool type. |
| 956 | |
| 957 | Sandboxing |
| 958 | ^^^^^^^^^^ |
| 959 | |
| 960 | Native Client's sandbox model uses software fault isolation (SFI) to provide |
| 961 | safety when running untrusted code in a browser or other environment. Subzero |
| 962 | implements Native Client's `sandboxing |
| 963 | <https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_ |
| 964 | to enable Subzero-translated executables to be run inside Chrome. Subzero also |
| 965 | provides a fairly simple framework for investigating alternative sandbox models |
| 966 | or other restrictions on the sandbox model. |
| 967 | |
| 968 | Sandboxing in Subzero is not actually implemented as a separate pass, but is |
| 969 | integrated into lowering and assembly. |
| 970 | |
| 971 | - Indirect branches, including the ret instruction, are masked to a bundle |
| 972 | boundary and bundle-locked. |
| 973 | |
| 974 | - Call instructions are aligned to the end of the bundle so that the return |
| 975 | address is bundle-aligned. |
| 976 | |
| 977 | - Indirect branch targets, including function entry and targets in a switch |
| 978 | statement jump table, are bundle-aligned. |
| 979 | |
| 980 | - The intrinsic for reading the thread pointer is inlined appropriately. |
| 981 | |
| 982 | - For x86-64, non-stack memory accesses are with respect to the reserved sandbox |
| 983 | base register. We reduce the aggressiveness of address mode inference to |
| 984 | leave room for the sandbox base register during lowering. There are no memory |
| 985 | sandboxing changes for x86-32. |
| 986 | |
| 987 | Code emission |
| 988 | ------------- |
| 989 | |
| 990 | Subzero's integrated assembler is derived from Dart's `assembler code |
<https://github.com/dart-lang/sdk/tree/master/runtime/vm>`_.  There is a pass
| 992 | that iterates through the low-level ICE instructions and invokes the relevant |
| 993 | assembler functions. Placeholders are added for later fixup of branch target |
| 994 | offsets. (Backward branches use short offsets if possible; forward branches |
| 995 | generally use long offsets unless it is an intra-block branch of "known" short |
| 996 | length.) The assembler emits into a staging buffer. Once emission into the |
| 997 | staging buffer for a function is complete, the data is emitted to the output |
| 998 | file as an ELF object file, and metadata such as relocations, symbol table, and |
| 999 | string table, are accumulated for emission at the end. Global data initializers |
| 1000 | are emitted similarly. A key point is that at this point, the staging buffer |
can be deallocated, and only a minimum of data needs to be held until the end.
| 1002 | |
| 1003 | As a debugging alternative, Subzero can emit textual assembly code which can |
| 1004 | then be run through an external assembler. This is of course super slow, but |
| 1005 | quite valuable when bringing up a new target. |
| 1006 | |
| 1007 | As another debugging option, the staging buffer can be emitted as textual |
| 1008 | assembly, primarily in the form of ".byte" lines. This allows the assembler to |
| 1009 | be tested separately from the ELF related code. |
| 1010 | |
| 1011 | Memory management |
| 1012 | ----------------- |
| 1013 | |
| 1014 | Where possible, we allocate from a ``CfgLocalAllocator`` which derives from |
| 1015 | LLVM's ``BumpPtrAllocator``. This is an arena-style allocator where objects |
| 1016 | allocated from the arena are never actually freed; instead, when the CFG |
| 1017 | translation completes and the CFG is deleted, the entire arena memory is |
| 1018 | reclaimed at once. This style of allocation works well in an environment like a |
| 1019 | compiler where there are distinct phases with only easily-identifiable objects |
| 1020 | living across phases. It frees the developer from having to manage object |
| 1021 | deletion, and it amortizes deletion costs across just a single arena deletion at |
| 1022 | the end of the phase. Furthermore, it helps scalability by allocating entirely |
| 1023 | from thread-local memory pools, and minimizing global locking of the heap. |
| 1024 | |
| 1025 | Instructions are probably the most heavily allocated complex class in Subzero. |
| 1026 | We represent an instruction list as an intrusive doubly linked list, allocate |
| 1027 | all instructions from the ``CfgLocalAllocator``, and we make sure each |
| 1028 | instruction subclass is basically `POD |
| 1029 | <http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a |
| 1030 | trivial destructor. This way, when the CFG is finished, we don't need to |
individually deallocate every instruction.  We do the same for Variables, which
are probably the second most heavily allocated complex class.
| 1033 | |
| 1034 | There are some situations where passes need to use some `STL container class |
| 1035 | <http://en.cppreference.com/w/cpp/container>`_. Subzero has a way of using the |
| 1036 | ``CfgLocalAllocator`` as the container allocator if this is needed. |
| 1037 | |
| 1038 | Multithreaded translation |
| 1039 | ------------------------- |
| 1040 | |
| 1041 | Subzero is designed to be able to translate functions in parallel. With the |
| 1042 | ``-threads=N`` command-line option, there is a 3-stage producer-consumer |
| 1043 | pipeline: |
| 1044 | |
| 1045 | - A single thread parses the ``.pexe`` file and produces a sequence of work |
| 1046 | units. A work unit can be either a fully constructed CFG, or a set of global |
| 1047 | initializers. The work unit includes its sequence number denoting its parse |
| 1048 | order. Each work unit is added to the translation queue. |
| 1049 | |
| 1050 | - There are N translation threads that draw work units from the translation |
| 1051 | queue and lower them into assembler buffers. Each assembler buffer is added |
| 1052 | to the emitter queue, tagged with its sequence number. The CFG and its |
| 1053 | ``CfgLocalAllocator`` are disposed of at this point. |
| 1054 | |
| 1055 | - A single thread draws assembler buffers from the emitter queue and appends to |
| 1056 | the output file. It uses the sequence numbers to reintegrate the assembler |
| 1057 | buffers according to the original parse order, such that output order is |
| 1058 | always deterministic. |
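The reintegration step in the emitter can be sketched as follows. The thread and queue machinery is elided; this only shows how sequence numbers restore deterministic parse order when translated buffers arrive out of order.

```python
def emit_in_order(arrivals):
    """arrivals: (sequence_number, buffer) pairs in completion order."""
    pending = {}   # seq -> buffer, for units that finished early
    next_seq = 0
    output = []
    for seq, buf in arrivals:
        pending[seq] = buf
        # Flush every buffer that is now contiguous with the output.
        while next_seq in pending:
            output.append(pending.pop(next_seq))
            next_seq += 1
    return output

# Translation threads finish out of order; emission order is deterministic.
print(emit_in_order([(1, "f1"), (0, "f0"), (3, "f3"), (2, "f2")]))
```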
| 1059 | |
| 1060 | This means that with ``-threads=N``, there are actually ``N+1`` spawned threads |
| 1061 | for a total of ``N+2`` execution threads, taking the parser and emitter threads |
| 1062 | into account. For the special case of ``N=0``, execution is entirely sequential |
| 1063 | -- the same thread parses, translates, and emits, one function at a time. This |
| 1064 | is useful for performance measurements. |
| 1065 | |
| 1066 | Ideally, we would like to get near-linear scalability as the number of |
| 1067 | translation threads increases. We expect that ``-threads=1`` should be slightly |
| 1068 | faster than ``-threads=0`` as the small amount of time spent parsing and |
| 1069 | emitting is done largely in parallel with translation. With perfect |
scalability, we would see ``-threads=N`` translating ``N`` times as fast as
| 1071 | ``-threads=1``, up until the point where parsing or emitting becomes the |
| 1072 | bottleneck, or ``N+2`` exceeds the number of CPU cores. In reality, memory |
| 1073 | performance would become a bottleneck and efficiency might peak at, say, 75%. |
| 1074 | |
| 1075 | Currently, parsing takes about 11% of total sequential time. If translation |
| 1076 | scalability ever gets so fast and awesomely scalable that parsing becomes a |
| 1077 | bottleneck, it should be possible to make parsing multithreaded as well. |
| 1078 | |
| 1079 | Internally, all shared, mutable data is held in the GlobalContext object, and |
| 1080 | access to each field is guarded by a mutex. |
| 1081 | |
| 1082 | Security |
| 1083 | -------- |
| 1084 | |
| 1085 | Subzero includes a number of security features in the generated code, as well as |
| 1086 | in the Subzero translator itself, which run on top of the existing Native Client |
| 1087 | sandbox as well as Chrome's OS-level sandbox. |
| 1088 | |
| 1089 | Sandboxed translator |
| 1090 | ^^^^^^^^^^^^^^^^^^^^ |
| 1091 | |
| 1092 | When running inside the browser, the Subzero translator executes as sandboxed, |
| 1093 | untrusted code that is initially checked by the validator, just like the |
| 1094 | LLVM-based ``pnacl-llc`` translator. As such, the Subzero binary should be no |
| 1095 | more or less secure than the translator it replaces, from the point of view of |
| 1096 | the Chrome sandbox. That said, Subzero is much smaller than ``pnacl-llc`` and |
| 1097 | was designed from the start with security in mind, so one expects fewer attacker |
| 1098 | opportunities here. |
| 1099 | |
| 1100 | Code diversification |
| 1101 | ^^^^^^^^^^^^^^^^^^^^ |
| 1102 | |
| 1103 | `Return-oriented programming |
| 1104 | <https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a |
| 1105 | now-common technique for starting with e.g. a known buffer overflow situation |
| 1106 | and launching it into a deeper exploit. The attacker scans the executable |
| 1107 | looking for ROP gadgets, which are short sequences of code that happen to load |
| 1108 | known values into known registers and then return. An attacker who manages to |
| 1109 | overwrite parts of the stack can overwrite it with carefully chosen return |
| 1110 | addresses such that certain ROP gadgets are effectively chained together to set |
| 1111 | up the register state as desired, finally returning to some code that manages to |
| 1112 | do something nasty based on those register values. |
| 1113 | |
| 1114 | If there is a popular ``.pexe`` with a large install base, the attacker could |
| 1115 | run Subzero on it and scan the executable for suitable ROP gadgets to use as |
| 1116 | part of a potential exploit. Note that if the trusted validator is working |
| 1117 | correctly, these ROP gadgets are limited to starting at a bundle boundary and |
| 1118 | cannot use the trick of finding a gadget that happens to begin inside another |
| 1119 | instruction. All the same, gadgets with these constraints still exist and the |
| 1120 | attacker has access to them. This is the attack model we focus most on -- |
| 1121 | protecting the user against misuse of a "trusted" developer's application, as |
| 1122 | opposed to mischief from a malicious ``.pexe`` file. |
| 1123 | |
| 1124 | Subzero can mitigate these attacks to some degree through code diversification. |
| 1125 | Specifically, we can apply some randomness to the code generation that makes ROP |
| 1126 | gadgets less predictable. This randomness can have some compile-time cost, and |
| 1127 | it can affect the code quality; and some diversifications may be more effective |
| 1128 | than others. A more detailed treatment of hardening techniques may be found in |
| 1129 | the Matasano report "`Attacking Clientside JIT Compilers |
| 1130 | <https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_". |
| 1131 | |
| 1132 | To evaluate diversification effectiveness, we use a third-party ROP gadget |
| 1133 | finder and limit its results to bundle-aligned addresses. For a given |
| 1134 | diversification technique, we run it with a number of different random seeds, |
| 1135 | find ROP gadgets for each version, and determine how persistent each ROP gadget |
| 1136 | is across the different versions. A gadget is persistent if the same gadget is |
| 1137 | found at the same code address. The best diversifications are ones with low |
| 1138 | gadget persistence rates. |
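
The persistence measurement described above can be sketched as follows. This is an illustrative model, not the actual third-party gadget finder: each diversified build is represented as a map from code address to the gadget found there.

```python
# Sketch of the gadget-persistence metric: a gadget is persistent if the
# same gadget bytes are found at the same address in every variant.

def persistence_rate(baseline, variants):
    """Fraction of baseline gadgets found unchanged, at the same
    address, in all diversified variants."""
    if not baseline:
        return 0.0
    persistent = sum(
        1
        for addr, gadget in baseline.items()
        if all(v.get(addr) == gadget for v in variants)
    )
    return persistent / len(baseline)

# Example: two gadgets; one survives diversification, one moves.
base = {0x1000: "pop eax; ret", 0x1020: "pop ebx; ret"}
seed1 = {0x1000: "pop eax; ret", 0x1040: "pop ebx; ret"}
seed2 = {0x1000: "pop eax; ret", 0x1060: "pop ebx; ret"}
rate = persistence_rate(base, [seed1, seed2])  # 0.5
```

The best diversifications drive this rate toward zero across many seeds.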
| 1139 | |
| 1140 | Subzero implements 7 different diversification techniques. Below is a |
| 1141 | discussion of each technique, its effectiveness, and its cost. The discussions |
| 1142 | of cost and effectiveness are for a single diversification technique; the |
| 1143 | translation-time costs for multiple techniques are additive, but the effects of |
| 1144 | multiple techniques on code quality and effectiveness are not yet known. |
| 1145 | |
| 1146 | In Subzero's implementation, each randomization is "repeatable" in a sense. |
| 1147 | Each pass that includes a randomization option gets its own private instance of |
| 1148 | a random number generator (RNG). The RNG is seeded with a combination of a |
| 1149 | global seed, the pass ID, and the function's sequence number. The global seed |
| 1150 | is designed to be different across runs (perhaps based on the current time), but |
| 1151 | for debugging, the global seed can be set to a specific value and the results |
| 1152 | will be repeatable. |
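
The per-pass seeding scheme might be sketched as follows. The mixing function and names here are illustrative, not Subzero's actual code; any mixing that is deterministic in its inputs preserves repeatability.

```python
import hashlib
import random

def derive_seed(global_seed, pass_id, func_sequence_number):
    """Mix the global seed, pass ID, and function sequence number into
    one 64-bit seed (illustrative mixing via SHA-256)."""
    key = f"{global_seed}/{pass_id}/{func_sequence_number}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def make_pass_rng(global_seed, pass_id, func_sequence_number):
    """Each pass gets its own private RNG instance, so one pass's
    consumption of random numbers cannot perturb another pass."""
    return random.Random(derive_seed(global_seed, pass_id, func_sequence_number))

# Pinning the global seed makes every decision sequence repeatable,
# which is what makes a randomization-dependent bug debuggable.
a = make_pass_rng(1234, "nop-insertion", 42)
b = make_pass_rng(1234, "nop-insertion", 42)
assert [a.random() for _ in range(5)] == [b.random() for _ in range(5)]
```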
| 1153 | |
| 1154 | Subzero-generated code is subject to diversification once per translation, and |
| 1155 | then Chrome caches the diversified binary for subsequent executions. An |
| 1156 | attacker may attempt to run the binary multiple times hoping for |
| 1157 | higher-probability combinations of ROP gadgets. When the attacker guesses |
| 1158 | wrong, a likely outcome is an application crash. Chrome throttles creation of |
| 1159 | crashy processes, which reduces the likelihood of the attacker eventually gaining
| 1160 | a foothold. |
| 1161 | |
| 1162 | Constant blinding |
| 1163 | ~~~~~~~~~~~~~~~~~ |
| 1164 | |
| 1165 | Here, we prevent attackers from controlling large immediates in the text |
| 1166 | (executable) section. A random cookie is generated for each function, and if |
| 1167 | the constant exceeds a specified threshold, the constant is obfuscated with the |
| 1168 | cookie and equivalent code is generated. For example, instead of this x86 |
| 1169 | instruction:: |
| 1170 | |
| 1171 | mov $0x11223344, <%Reg/Mem> |
| 1172 | |
| 1173 | the following code might be generated:: |
| 1174 | |
| 1175 | mov $(0x11223344+Cookie), %temp |
| 1176 | lea -Cookie(%temp), %temp |
| 1177 | mov %temp, <%Reg/Mem> |
| 1178 | |
| 1179 | The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to |
| 1180 | prevent unintended effects on the flags register. |
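
The arithmetic of the transformation can be checked with a small model (32-bit wraparound as on x86; the cookie value here is illustrative):

```python
MASK32 = 0xFFFFFFFF  # x86-32 registers wrap modulo 2**32

def blind(constant, cookie):
    """Model the blinded sequence: the add and the compensating
    lea-style subtraction round-trip exactly under 32-bit wraparound."""
    blinded = (constant + cookie) & MASK32   # mov $(C+Cookie), %temp
    recovered = (blinded - cookie) & MASK32  # lea -Cookie(%temp), %temp
    return blinded, recovered

cookie = 0x9E3779B9  # illustrative per-function random cookie
blinded, recovered = blind(0x11223344, cookie)
assert recovered == 0x11223344   # the program sees the original value
assert blinded != 0x11223344     # the text section no longer does
```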
| 1181 | |
| 1182 | This transformation has almost no effect on translation time, and about 1% |
| 1183 | impact on code quality, depending on the threshold chosen. It does little to |
| 1184 | reduce gadget persistence, but it does remove a lot of potential opportunities |
| 1185 | to construct intra-instruction ROP gadgets (which an attacker could use only if |
| 1186 | a validator bug were discovered, since the Native Client sandbox and associated |
| 1187 | validator force returns and other indirect branches to be to bundle-aligned |
| 1188 | addresses). |
| 1189 | |
| 1190 | Constant pooling |
| 1191 | ~~~~~~~~~~~~~~~~ |
| 1192 | |
| 1193 | This is similar to constant blinding, in that large immediates are removed from |
| 1194 | the text section. In this case, each unique constant above the threshold is |
| 1195 | stored in a read-only data section and the constant is accessed via a memory |
| 1196 | load. For the above example, the following code might be generated:: |
| 1197 | |
| 1198 | mov $Label$1, %temp |
| 1199 | mov %temp, <%Reg/Mem> |
| 1200 | |
| 1201 | This has a similarly small impact on translation time and ROP gadget |
| 1202 | persistence, and a smaller (better) impact on code quality. This is because it |
| 1203 | uses fewer instructions, and in some cases good register allocation leads to no |
| 1204 | increase in instruction count. Note that this still gives an attacker
| 1205 | limited control over some text section values, unless we randomize the
| 1206 | constant pool layout. |
| 1207 | |
| 1208 | Static data reordering |
| 1209 | ~~~~~~~~~~~~~~~~~~~~~~ |
| 1210 | |
| 1211 | This transformation limits the attacker's ability to control bits in global data |
| 1212 | address references. It simply permutes the order in memory of global variables |
| 1213 | and internal constant pool entries. For the constant pool, we only permute |
| 1214 | within a type (i.e., emit a randomized list of ints, followed by a randomized |
| 1215 | list of floats, etc.) to maintain good packing in the face of alignment |
| 1216 | constraints. |
| 1217 | |
| 1218 | As might be expected, this has no impact on code quality, translation time, or |
| 1219 | ROP gadget persistence (though as above, it limits opportunities for |
| 1220 | intra-instruction ROP gadgets with a broken validator). |
| 1221 | |
| 1222 | Basic block reordering |
| 1223 | ~~~~~~~~~~~~~~~~~~~~~~ |
| 1224 | |
| 1225 | Here, we randomize the order of basic blocks within a function, with the |
| 1226 | constraint that we still want to maintain a topological order as much as |
| 1227 | possible, to avoid making the code too branchy. |
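
A randomized layout that still respects topological order can be sketched as follows (the CFG representation is hypothetical, and back edges are ignored for brevity):

```python
import random

def randomized_layout(succs, entry, rng):
    """Emit blocks in a random order that respects a topological order
    of the CFG's forward edges, so fall-through mostly survives.
    `succs` maps block -> list of successors (assumed acyclic here)."""
    indeg = {b: 0 for b in succs}
    for outs in succs.values():
        for s in outs:
            indeg[s] += 1
    ready = [entry]
    order = []
    while ready:
        # The randomness: pick any block whose predecessors are placed.
        b = ready.pop(rng.randrange(len(ready)))
        order.append(b)
        for s in succs[b]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return order

cfg = {"entry": ["a", "b"], "a": ["exit"], "b": ["exit"], "exit": []}
order = randomized_layout(cfg, "entry", random.Random(7))
assert order[0] == "entry"       # entry block stays first
assert set(order) == set(cfg)    # every block is placed exactly once
```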
| 1228 | |
| 1229 | This has no impact on code quality, and about 1% impact on translation time, due |
| 1230 | to a separate pass to recompute layout. It has a huge effect on ROP gadget
| 1231 | persistence, tied with nop insertion for the best result, reducing persistence
| 1232 | to less than 5%.
| 1233 | |
| 1234 | Function reordering |
| 1235 | ~~~~~~~~~~~~~~~~~~~ |
| 1236 | |
| 1237 | Here, we permute the order that functions are emitted, primarily to shift ROP |
| 1238 | gadgets around to less predictable locations. It may also change call address |
| 1239 | offsets in case the attacker was trying to control that offset in the code. |
| 1240 | |
| 1241 | To control latency and memory footprint, we don't arbitrarily permute functions. |
| 1242 | Instead, for some relatively small value of N, we queue up N assembler buffers, |
| 1243 | and then emit the N functions in random order, and repeat until all functions |
| 1244 | are emitted. |
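
The windowed permutation can be sketched as follows (the value of N and the shuffle source are illustrative):

```python
import random

def emit_in_windows(functions, n, rng):
    """Queue up to `n` finished assembler buffers, emit them in random
    order, and repeat.  Memory stays bounded: at most `n` buffers are
    ever held back, yet function positions become unpredictable."""
    emitted = []
    for i in range(0, len(functions), n):
        window = functions[i:i + n]
        rng.shuffle(window)
        emitted.extend(window)
    return emitted

funcs = [f"f{i}" for i in range(10)]
out = emit_in_windows(funcs, 4, random.Random(1))
assert sorted(out) == sorted(funcs)   # nothing lost or duplicated
# A function can move only within its window of 4.
assert all(abs(out.index(f) - funcs.index(f)) < 4 for f in funcs)
```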
| 1245 | |
| 1246 | Function reordering has no impact on translation time or code quality. |
| 1247 | Measurements indicate that it reduces ROP gadget persistence to about 15%. |
| 1248 | |
| 1249 | Nop insertion |
| 1250 | ~~~~~~~~~~~~~ |
| 1251 | |
| 1252 | This diversification randomly adds a nop instruction after each regular |
| 1253 | instruction, with some probability. Nop instructions of different lengths may |
| 1254 | be selected. Nop instructions are never added inside a bundle_lock region. |
| 1255 | Note that when sandboxing is enabled, nop instructions are already being added |
| 1256 | for bundle alignment, so the diversification nop instructions may simply be |
| 1257 | taking the place of alignment nop instructions, though distributed differently |
| 1258 | through the bundle. |
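
A sketch of probabilistic nop insertion follows. The probability, nop lengths, and string representation are illustrative; Subzero's real pass operates on its IR, not on text.

```python
import random

def insert_nops(instructions, prob, rng, in_bundle_lock=lambda i: False):
    """After each instruction, with probability `prob`, insert a nop of
    a randomly chosen length -- never inside a bundle_lock region."""
    out = []
    for idx, insn in enumerate(instructions):
        out.append(insn)
        if not in_bundle_lock(idx) and rng.random() < prob:
            out.append(f"nop{rng.choice([1, 2, 3])}")  # 1- to 3-byte nop
    return out

code = ["mov", "add", "ret"]
diversified = insert_nops(code, 0.5, random.Random(3))
# Original instructions survive in order; only nops were added.
assert [i for i in diversified if not i.startswith("nop")] == code
```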
| 1259 | |
| 1260 | In Subzero's current implementation, nop insertion adds 3-5% to the
| 1261 | translation time, but this is probably because it is implemented as a separate |
| 1262 | pass that adds actual nop instructions to the IR. The overhead would probably |
| 1263 | be a lot less if it were integrated into the assembler pass. The code quality |
| 1264 | is also reduced by 3-5%, making nop insertion the most expensive of the |
| 1265 | diversification techniques. |
| 1266 | |
| 1267 | Nop insertion is very effective in reducing ROP gadget persistence, at the same |
| 1268 | level as basic block randomization (less than 5%). But given nop insertion's |
| 1269 | impact on translation time and code quality, one would most likely prefer to use |
| 1270 | basic block randomization instead (though the combined effects of the different |
| 1271 | diversification techniques have not yet been studied). |
| 1272 | |
| 1273 | Register allocation randomization |
| 1274 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 1275 | |
| 1276 | In this diversification, the register allocator tries to make different but |
| 1277 | mostly functionally equivalent choices, while maintaining stable code quality. |
| 1278 | |
| 1279 | A naive approach would be the following. Whenever the allocator has more than |
| 1280 | one choice for assigning a register, choose randomly among those options. And |
| 1281 | whenever there are no registers available and there is a tie for the |
| 1282 | lowest-weight variable, randomly select one of the lowest-weight variables to |
| 1283 | evict. Because of the one-pass nature of the linear-scan algorithm, this |
| 1284 | randomization strategy can have a large impact on which variables are ultimately |
| 1285 | assigned registers, with a corresponding large impact on code quality. |
| 1286 | |
| 1287 | Instead, we choose an approach that tries to keep code quality stable regardless |
| 1288 | of the random seed. We partition the set of physical registers into equivalence |
| 1289 | classes. If a register is pre-colored in the function (i.e., referenced |
| 1290 | explicitly by name), it forms its own equivalence class. The remaining |
| 1291 | registers are partitioned according to their combination of attributes such as |
| 1292 | integer versus floating-point, 8-bit versus 32-bit, caller-save versus |
| 1293 | callee-saved, etc. Each equivalence class is randomly permuted, and the |
| 1294 | complete permutation is applied to the final register assignments. |
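
The permutation step might be sketched as follows, with illustrative x86-32-like register names and equivalence classes:

```python
import random

def permute_assignments(assignment, classes, precolored, rng):
    """Permute final register assignments within equivalence classes.
    `assignment` maps variable -> register; `classes` lists the
    equivalence classes; pre-colored registers form singleton classes
    and therefore never move."""
    perm = {r: r for r in precolored}
    for cls in classes:
        shuffled = cls[:]
        rng.shuffle(shuffled)
        perm.update(zip(cls, shuffled))
    return {var: perm[reg] for var, reg in assignment.items()}

# %eax is pre-colored here (say, holding a return value), so it stays
# put; registers with matching attributes may swap within their class.
classes = [["ecx", "edx"], ["esi", "edi"]]
assign = {"v1": "eax", "v2": "ecx", "v3": "esi"}
out = permute_assignments(assign, classes, ["eax"], random.Random(5))
assert out["v1"] == "eax"             # pre-colored: unchanged
assert out["v2"] in ("ecx", "edx")    # moved only within its class
assert out["v3"] in ("esi", "edi")
```

Because the permutation is applied after allocation completes, the number of spills and evictions is identical for every seed, which is what keeps code quality stable.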
| 1295 | |
| 1296 | Register randomization reduces ROP gadget persistence to about 10% on average, |
| 1297 | though there tends to be fairly high variance across functions and applications. |
| 1298 | This probably has to do with the set of restrictions in the x86-32 instruction |
| 1299 | set and ABI, such as few general-purpose registers, ``%eax`` used for return |
| 1300 | values, ``%edx`` used for division, ``%cl`` used for shifting, etc. As |
| 1301 | intended, register randomization has no impact on code quality, and a slight |
| 1302 | (0.5%) impact on translation time due to an extra scan over the variables to |
| 1303 | identify pre-colored registers. |
| 1304 | |
| 1305 | Fuzzing |
| 1306 | ^^^^^^^ |
| 1307 | |
| 1308 | We have started fuzz-testing the ``.pexe`` files input to Subzero, using a |
| 1309 | combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer |
| 1310 | <http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling. The purpose is to |
| 1311 | find and fix cases where Subzero crashes or otherwise ungracefully fails on |
| 1312 | unexpected inputs, and to do so automatically over a large range of unexpected |
| 1313 | inputs. By fixing bugs that arise from fuzz testing, we reduce the possibility |
| 1314 | of an attacker exploiting these bugs. |
| 1315 | |
| 1316 | Most of the problems found so far are ones most appropriately handled in the |
| 1317 | parser. However, there have been a couple that have identified problems in the |
| 1318 | lowering, or otherwise inappropriately triggered assertion failures and fatal |
| 1319 | errors. We continue to dig into this area. |
| 1320 | |
| 1321 | Future security work |
| 1322 | ^^^^^^^^^^^^^^^^^^^^ |
| 1323 | |
| 1324 | Subzero is well-positioned to explore other future security enhancements, e.g.: |
| 1325 | |
| 1326 | - Tightening the Native Client sandbox. ABI changes, such as the previous work |
| 1327 | on `hiding the sandbox base address |
| 1328 | <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_ |
| 1329 | in x86-64, are easy to experiment with in Subzero. |
| 1330 | |
| 1331 | - Making the executable code section read-only. This would prevent a PNaCl |
| 1332 | application from inspecting its own binary and trying to find ROP gadgets even |
| 1333 | after code diversification has been performed. It may still be susceptible to |
| 1334 | `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks, but
| 1335 | security is still improved overall.
| 1336 | |
| 1337 | - Instruction selection diversification. It may be possible to lower a given |
| 1338 | instruction in several largely equivalent ways, which gives more opportunities |
| 1339 | for code randomization. |
| 1340 | |
| 1341 | Chrome integration |
| 1342 | ------------------ |
| 1343 | |
| 1344 | Currently Subzero is available in Chrome for the x86-32 architecture, but under |
| 1345 | a flag. When the flag is enabled, Subzero is used when the `manifest file |
| 1346 | <https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_ |
| 1347 | linking to the ``.pexe`` file specifies the ``O0`` optimization level. |
| 1348 | |
| 1349 | The next step is to remove the flag, i.e. invoke Subzero as the only translator |
| 1350 | for ``O0``-specified manifest files. |
| 1351 | |
| 1352 | Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which |
| 1353 | case Subzero could be used for all PNaCl translation. |
| 1354 | |
| 1355 | Command line options |
| 1356 | -------------------- |
| 1357 | |
| 1358 | Subzero has a number of command-line options for debugging and diagnostics. |
| 1359 | Among the more interesting are the following. |
| 1360 | |
| 1361 | - Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other |
| 1362 | diagnostic output, with various levels of detail after each pass. Instruction |
| 1363 | numbers can be printed or suppressed. Deleted instructions can be printed or |
| 1364 | suppressed (they are retained in the instruction list, as discussed earlier, |
| 1365 | because they can help explain how lower-level instructions originated). |
| 1366 | Liveness information can be printed when available. Details of register |
| 1367 | allocation can be printed as register allocator decisions are made. And more. |
| 1368 | |
| 1369 | - Running Subzero with any level of verbosity produces an enormous amount of
| 1370 | output. When debugging a single function, the ``-verbose-focus`` flag
| 1371 | suppresses verbose output except for the specified function.
| 1373 | |
| 1374 | - Subzero has a ``-timing`` option that prints a breakdown of pass-level timing |
| 1375 | at exit. Timing markers can be placed in the Subzero source code to demarcate |
| 1376 | logical operations or passes of interest. Basic timing information plus |
| 1377 | call-stack type timing information is printed at the end. |
| 1378 | |
| 1379 | - Along with ``-timing``, the user can instead get a report on the overall |
| 1380 | translation time for each function, to help focus on timing outliers. Also, |
| 1381 | ``-timing-focus`` limits the ``-timing`` reporting to a single function, |
| 1382 | instead of aggregating pass timing across all functions. |
| 1383 | |
| 1384 | - The ``-szstats`` option reports various statistics on each function, such as |
| 1385 | stack frame size, static instruction count, etc. It may be helpful to track |
| 1386 | these stats over time as Subzero is improved, as an approximate measure of |
| 1387 | code quality. |
| 1388 | |
| 1389 | - The flag ``-asm-verbose``, in conjunction with emitting textual assembly |
| 1390 | output, annotates the assembly output with register-focused liveness
| 1391 | information. In particular, each basic block is annotated with which |
| 1392 | registers are live-in and live-out, and each instruction is annotated with |
| 1393 | which registers' and stack locations' live ranges end at that instruction. |
| 1394 | This is really useful when studying the generated code to find opportunities |
| 1395 | for code quality improvements. |
| 1396 | |
| 1397 | Testing and debugging |
| 1398 | --------------------- |
| 1399 | |
| 1400 | LLVM lit tests |
| 1401 | ^^^^^^^^^^^^^^ |
| 1402 | |
| 1403 | For basic testing, Subzero uses LLVM's `lit |
| 1404 | <http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests. We |
| 1405 | have a suite of hundreds of small functions where we test for particular |
| 1406 | assembly code patterns across different target architectures. |
| 1407 | |
| 1408 | Cross tests |
| 1409 | ^^^^^^^^^^^ |
| 1410 | |
| 1411 | Unfortunately, the lit tests don't do a great job of precisely testing the |
| 1412 | correctness of the output. Much better are the cross tests, which are execution |
| 1413 | tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide |
| 1414 | variety of interesting inputs. Each cross test consists of a set of C, C++, |
| 1415 | and/or low-level bitcode files. The C and C++ source files are compiled down to |
| 1416 | bitcode. The bitcode files are translated by ``pnacl-llc`` and also by Subzero. |
| 1417 | Subzero mangles global symbol names with a special prefix to avoid duplicate |
| 1418 | symbol errors. A driver program invokes both versions on a large set of |
| 1419 | interesting inputs, and reports when the Subzero and ``pnacl-llc`` results |
| 1420 | differ. Cross tests turn out to be an excellent way of testing the basic |
| 1421 | lowering patterns, but they are less useful for testing more global things like |
| 1422 | liveness analysis and register allocation. |
| 1423 | |
| 1424 | Bisection debugging |
| 1425 | ^^^^^^^^^^^^^^^^^^^ |
| 1426 | |
| 1427 | Sometimes with a new application, Subzero will end up producing incorrect code |
| 1428 | that either crashes at runtime or otherwise produces the wrong results. When |
| 1429 | this happens, we need to narrow it down to a single function (or small set of |
| 1430 | functions) that yield incorrect behavior. For this, we have a bisection |
| 1431 | debugging framework. Here, we initially translate the entire application once |
| 1432 | with Subzero and once with ``pnacl-llc``. We then use ``objcopy`` to
| 1433 | selectively weaken symbols based on a whitelist or blacklist provided on the |
| 1434 | command line. The two object files can then be linked together without link |
| 1435 | errors, with the desired version of each method "winning". Then the binary is |
| 1436 | tested, and bisection proceeds based on whether the binary produces correct |
| 1437 | output. |
| 1438 | |
| 1439 | When the bisection completes, we are left with a minimal set of |
| 1440 | Subzero-translated functions that cause the failure. Usually it is a single |
| 1441 | function, though sometimes it might require a combination of several functions |
| 1442 | to cause a failure; this may be due to an incorrect call ABI, for example. |
| 1443 | However, Murphy's Law implies that the single failing function is enormous and |
| 1444 | impractical to debug. In that case, we can restart the bisection, explicitly |
| 1445 | blacklisting the enormous function, and try to find another candidate to debug. |
| 1446 | (Future work is to automate this to find all minimal sets of functions, so that |
| 1447 | debugging can focus on the simplest example.) |
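
The search itself is an ordinary bisection over sets of functions; a sketch follows, where the ``fails`` predicate stands in for the whole weaken-link-run cycle:

```python
def bisect_failures(functions, fails):
    """Narrow a failure down to a small set of Subzero-translated
    functions.  `fails(subset)` stands in for: weaken all other
    symbols, let Subzero's copies of `subset` win at link time, run
    the binary, and report whether the output is wrong."""
    assert fails(functions), "full Subzero build must reproduce the bug"
    culprits = functions
    while len(culprits) > 1:
        mid = len(culprits) // 2
        left, right = culprits[:mid], culprits[mid:]
        if fails(left):
            culprits = left
        elif fails(right):
            culprits = right
        else:
            break  # failure needs functions from both halves; stop here
    return culprits

# Example: pretend function "f13" alone is miscompiled.
funcs = [f"f{i}" for i in range(32)]
bad = bisect_failures(funcs, lambda s: "f13" in s)
assert bad == ["f13"]
```

The early ``break`` models the multi-function case described above, e.g. a mismatched call ABI, where no single half reproduces the failure on its own.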
| 1448 | |
| 1449 | Fuzz testing |
| 1450 | ^^^^^^^^^^^^ |
| 1451 | |
| 1452 | As described above, we try to find internal Subzero bugs using fuzz testing |
| 1453 | techniques. |
| 1454 | |
| 1455 | Sanitizers |
| 1456 | ^^^^^^^^^^ |
| 1457 | |
| 1458 | Subzero can be built with `AddressSanitizer |
| 1459 | <http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer |
| 1460 | <http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support. This is |
| 1461 | done using something as simple as ``make ASAN=1`` or ``make TSAN=1``. So far, |
| 1462 | multithreading has been simple enough that TSan hasn't found any bugs, but ASan |
| 1463 | has found at least one memory leak which was subsequently fixed. |
| 1464 | `UndefinedBehaviorSanitizer |
| 1465 | <http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_ |
| 1466 | (UBSan) support is in progress. `Control flow integrity sanitization |
| 1467 | <http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under |
| 1468 | consideration. |
| 1469 | |
| 1470 | Current status |
| 1471 | ============== |
| 1472 | |
| 1473 | Target architectures |
| 1474 | -------------------- |
| 1475 | |
| 1476 | Subzero is currently more or less complete for the x86-32 target. It has been |
| 1477 | refactored and extended to handle x86-64 as well, and that is mostly complete at |
| 1478 | this point. |
| 1479 | |
| 1480 | ARM32 work is in progress. It currently lacks the testing level of x86, at |
| 1481 | least in part because Subzero's register allocator needs modifications to handle |
| 1482 | ARM's aliasing of floating point and vector registers. Specifically, a 64-bit |
| 1483 | register is actually a gang of two consecutive and aligned 32-bit registers, and |
| 1484 | a 128-bit register is a gang of four consecutive and aligned 32-bit registers.
| 1485 | ARM64 work has not started; when it does, it will be native-only since the |
| 1486 | Native Client sandbox model, validator, and other tools have never been defined. |
| 1487 | |
| 1488 | An external contributor is adding MIPS support, for the most part by following
| 1489 | ARM work. |
| 1490 | |
| 1491 | Translator performance |
| 1492 | ---------------------- |
| 1493 | |
| 1494 | Single-threaded translation speed is currently about 5× the ``pnacl-llc`` |
| 1495 | translation speed. For a large ``.pexe`` file, the time breaks down as: |
| 1496 | |
| 1497 | - 11% for parsing and initial IR building |
| 1498 | |
| 1499 | - 4% for emitting to /dev/null |
| 1500 | |
| 1501 | - 27% for liveness analysis (two liveness passes plus live range construction) |
| 1502 | |
| 1503 | - 15% for linear-scan register allocation |
| 1504 | |
| 1505 | - 9% for basic lowering |
| 1506 | |
| 1507 | - 10% for advanced phi lowering |
| 1508 | |
| 1509 | - ~11% for other minor analysis |
| 1510 | |
| 1511 | - ~10% measurement overhead to acquire these numbers |
| 1512 | |
| 1513 | Some improvements could undoubtedly be made, but it will be hard to increase the |
| 1514 | speed to 10× that of ``pnacl-llc`` while keeping acceptable code quality. With
| 1515 | ``-Om1`` (lack of) optimization, we do actually achieve roughly 10× |
| 1516 | ``pnacl-llc`` translation speed, but code quality drops by a factor of 3. |
| 1517 | |
| 1518 | Code quality |
| 1519 | ------------ |
| 1520 | |
| 1521 | Measured across 16 components of spec2k, Subzero's code quality is uniformly |
| 1522 | better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly |
| 1523 | between ``pnacl-llc`` ``-O0`` and ``-O2``. |
| 1524 | |
| 1525 | Translator size |
| 1526 | --------------- |
| 1527 | |
| 1528 | When built in MINIMAL mode, the x86-64 native translator size for the x86-32 |
| 1529 | target is about 700 KB, not including the size of functions referenced in |
| 1530 | dynamically-linked libraries. The sandboxed version of Subzero is a bit over 1 |
| 1531 | MB, and it is statically linked and also includes nop padding for bundling as |
| 1532 | well as indirect branch masking. |
| 1533 | |
| 1534 | Translator memory footprint |
| 1535 | --------------------------- |
| 1536 | |
| 1537 | It's hard to draw firm conclusions about memory footprint, since the footprint |
| 1538 | is at least proportional to the input function size, and there is no real limit |
| 1539 | on the size of functions in the ``.pexe`` file. |
| 1540 | |
| 1541 | That said, we looked at the memory footprint over time as Subzero translated |
| 1542 | ``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our |
| 1543 | disposal. One of LLVM's libraries that Subzero uses can report the current |
| 1544 | malloc heap usage. With single-threaded translation, Subzero tends to hover |
| 1545 | around 15 MB of memory usage. There are a couple of monstrous functions where |
| 1546 | Subzero grows to around 100 MB, but then it drops back down after those |
| 1547 | functions finish translating. In contrast, ``pnacl-llc`` grows larger and |
| 1548 | larger throughout translation, reaching several hundred MB by the time it |
| 1549 | completes. |
| 1550 | |
| 1551 | It's a bit more interesting when we enable multithreaded translation. When |
| 1552 | there are N translation threads, Subzero implements a policy that limits the |
| 1553 | size of the translation queue to N entries -- if it is "full" when the parser |
| 1554 | tries to add a new CFG, the parser blocks until one of the translation threads |
| 1555 | removes a CFG. This means the number of in-memory CFGs can (and generally does) |
| 1556 | reach 2*N+1, and so the memory footprint rises in proportion to the number of |
| 1557 | threads. Adding to the pressure is the observation that the monstrous functions |
| 1558 | also take proportionally longer time to translate, so there's a good chance many |
| 1559 | of the monstrous functions will be active at the same time with multithreaded |
| 1560 | translation. As a result, for N=32, Subzero's memory footprint peaks at about |
| 1561 | 260 MB, but drops back down as the large functions finish translating. |
| 1562 | |
| 1563 | If this peak memory size becomes a problem, it might be possible for the parser |
| 1564 | to resequence the functions to try to spread out the larger functions, or to |
| 1565 | throttle the translation queue to prevent too many in-flight large functions. |
| 1566 | It may also be possible to throttle based on memory pressure signaling from |
| 1567 | Chrome. |
| 1568 | |
| 1569 | Translator scalability |
| 1570 | ---------------------- |
| 1571 | |
| 1572 | Currently scalability is "not very good". Multiple translation threads lead to |
| 1573 | faster translation, but not to the degree desired. We haven't dug in to |
| 1574 | investigate yet. |
| 1575 | |
| 1576 | There are a few areas to investigate. First, there may be contention on the |
| 1577 | constant pool, which all threads access, and which requires locked access even |
| 1578 | for reading. This could be mitigated by keeping a CFG-local cache of the most |
| 1579 | common constants. |
| 1580 | |
| 1581 | Second, there may be contention on memory allocation. While almost all CFG |
| 1582 | objects are allocated from the CFG-local allocator, some passes use temporary |
| 1583 | STL containers that use the default allocator, which may require global locking. |
| 1584 | This could be mitigated by switching these to the CFG-local allocator. |
| 1585 | |
| 1586 | Third, multithreading may make the default allocator strategy more expensive. |
| 1587 | In a single-threaded environment, a pass will allocate its containers, run the |
| 1588 | pass, and deallocate the containers. This results in stack-like allocation |
| 1589 | behavior and makes the heap free list easier to manage, with less heap |
| 1590 | fragmentation. But when multithreading is added, the allocations and |
| 1591 | deallocations become much less stack-like, making allocation and deallocation |
| 1592 | operations individually more expensive. Again, this could be mitigated by |
| 1593 | switching these to the CFG-local allocator. |