# Speculative Load Hardening

### A Spectre Variant #1 Mitigation Technique

Author: Chandler Carruth - [chandlerc@google.com](mailto:chandlerc@google.com)

## Problem Statement

Recently, Google Project Zero and other researchers have found information leak
vulnerabilities by exploiting speculative execution in modern CPUs. These
exploits are currently broken down into three variants:
* GPZ Variant #1 (a.k.a. Spectre Variant #1): Bounds check (or predicate) bypass
* GPZ Variant #2 (a.k.a. Spectre Variant #2): Branch target injection
* GPZ Variant #3 (a.k.a. Meltdown): Rogue data cache load

For more details, see the Google Project Zero blog post and the Spectre research
paper:
* https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
* https://spectreattack.com/spectre.pdf

The core problem of GPZ Variant #1 is that speculative execution uses branch
prediction to select the path of instructions speculatively executed. This path
is speculatively executed with the available data, and may load from memory and
leak the loaded values through various side channels that survive even when the
speculative execution is unwound due to being incorrect. Mispredicted paths can
cause code to be executed with data inputs that never occur in correct
executions, making checks against malicious inputs ineffective and allowing
attackers to use malicious data inputs to leak secret data. Here is an example,
extracted and simplified from the Project Zero paper:
```
struct array {
  unsigned long length;
  unsigned char data[];
};
struct array *arr1 = ...; // small array
struct array *arr2 = ...; // array of size 0x400
unsigned long untrusted_offset_from_caller = ...;
if (untrusted_offset_from_caller < arr1->length) {
  unsigned char value = arr1->data[untrusted_offset_from_caller];
  unsigned long index2 = ((value&1)*0x100)+0x200;
  unsigned char value2 = arr2->data[index2];
}
```

The key to the attack is to call this with an `untrusted_offset_from_caller`
that is far outside of the bounds while the branch predictor predicts that it
will be in-bounds. In that case, the body of the `if` will be executed
speculatively, and may read secret data into `value` and leak it via a
cache-timing side channel when a dependent access is made to populate `value2`.
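To make the leak concrete: the address of the second access is a pure function of one bit of the (possibly secret) loaded byte, so probing which of two cache lines became hot reveals that bit. A minimal sketch of the index computation from the gadget above (the `probe_index` helper name is mine, not from the paper):

```c
#include <stdint.h>

/* Mirrors `((value&1)*0x100)+0x200` from the gadget: one secret bit
 * selects between two distinct cache lines inside arr2->data. The
 * attacker later times accesses to both lines to recover the bit. */
static unsigned long probe_index(unsigned char value) {
    return ((value & 1) * 0x100) + 0x200;
}
```

Since the two candidate indices differ by 0x100 bytes, they fall on different cache lines on any common line size, which is what makes the timing probe unambiguous.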

## High Level Mitigation Approach

While several approaches are being actively pursued to mitigate specific
branches and/or loads inside especially risky software (most notably various OS
kernels), these approaches require manual and/or static analysis aided auditing
of code and explicit source changes to apply the mitigation. They are unlikely
to scale well to large applications. We are proposing a comprehensive
mitigation approach that would apply automatically across an entire program
rather than through manual changes to the code. While this is likely to have a
high performance cost, some applications may be in a good position to take this
performance / security tradeoff.

The specific technique we propose is to cause loads to be checked using
branchless code to ensure that they are executing along a valid control flow
path. Consider the following C-pseudo-code representing the core idea of a
predicate guarding potentially invalid loads:
```
void leak(int data);
void example(int* pointer1, int* pointer2) {
  if (condition) {
    // ... lots of code ...
    leak(*pointer1);
  } else {
    // ... more code ...
    leak(*pointer2);
  }
}
```

This would get transformed into something resembling the following:
```
uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
uintptr_t all_zeros_mask = 0;
void leak(int data);
void example(int* pointer1, int* pointer2) {
  uintptr_t predicate_state = all_ones_mask;
  if (condition) {
    // Assuming ?: is implemented using branchless logic...
    predicate_state = !condition ? all_zeros_mask : predicate_state;
    // ... lots of code ...
    //
    // Harden the pointer so it can't be loaded
    pointer1 &= predicate_state;
    leak(*pointer1);
  } else {
    predicate_state = condition ? all_zeros_mask : predicate_state;
    // ... more code ...
    //
    // Alternative: Harden the loaded value
    int value2 = *pointer2 & predicate_state;
    leak(value2);
  }
}
```

The result should be that if the `if (condition) {` branch is mis-predicted,
there is a *data* dependency on the condition used to zero out any pointers
prior to loading through them or to zero out all of the loaded bits. Even
though this code pattern may still execute speculatively, *invalid* speculative
executions are prevented from leaking secret data from memory (but note that
this data might still be loaded in safe ways, and some regions of memory are
required to not hold secrets; see below for detailed limitations). This
approach only requires that the underlying hardware provide a way to implement
a branchless and unpredicted conditional update of a register's value. All
modern architectures have support for this, and in fact such support is
necessary to correctly implement constant-time cryptographic primitives.
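As a concrete illustration, the conditional update can be written without branches by deriving a mask directly from the condition. A minimal sketch (the `update_state` name is mine), assuming the compiler lowers the expression to a conditional move or mask arithmetic rather than reintroducing a branch:

```c
#include <stdint.h>

/* Branchless predicate-state update matching the pseudo-code above:
 * state starts all-ones and must become (and stay) all-zeros on any
 * path whose guarding condition did not actually hold. */
static uintptr_t update_state(uintptr_t state, int cond_holds) {
    /* all-ones when the condition held, all-zeros when it did not */
    uintptr_t keep_mask = -(uintptr_t)(cond_holds != 0);
    /* ANDing accumulates: once zero, state stays zero forever */
    return state & keep_mask;
}
```

Note the accumulation property: once the state has been zeroed by one failed predicate, later updates cannot restore it, which is what protects nested control flow.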

Crucial properties of this approach:
* It is not preventing any particular side-channel from working. This is
  important as there are an unknown number of potential side channels and we
  expect to continue discovering more. Instead, it prevents the observation of
  secret data in the first place.
* It accumulates the predicate state, protecting even in the face of nested
  *correctly* predicted control flows.
* It passes this predicate state across function boundaries to provide
  [interprocedural protection](#interprocedural-checking).
* When hardening the address of a load, it uses a *destructive* or
  *non-reversible* modification of the address to prevent an attacker from
  reversing the check using attacker-controlled inputs.
* It does not completely block speculative execution, and merely prevents
  *mis*-speculated paths from leaking secrets from memory (and stalls
  speculation until this can be determined).
* It is completely general and makes no fundamental assumptions about the
  underlying architecture other than the ability to do branchless conditional
  data updates and a lack of value prediction.
* It does not require programmers to identify all possible secret data using
  static source code annotations or code vulnerable to a variant #1 style
  attack.

Limitations of this approach:
* It requires re-compiling source code to insert hardening instruction
  sequences. Only software compiled in this mode is protected.
* The performance is heavily dependent on a particular architecture's
  implementation strategy. We outline a potential x86 implementation below and
  characterize its performance.
* It does not defend against secret data already loaded from memory and
  residing in registers or leaked through other side-channels in
  non-speculative execution. Code dealing with this, e.g. cryptographic
  routines, already uses constant-time algorithms and code to prevent
  side-channels. Such code should also scrub registers of secret data following
  [these
  guidelines](https://github.com/HACS-workshop/spectre-mitigations/blob/master/crypto_guidelines.md).
* To achieve reasonable performance, many loads may not be checked, such as
  those with compile-time fixed addresses. This primarily consists of accesses
  at compile-time constant offsets of global and local variables. Code which
  needs this protection and intentionally stores secret data must ensure the
  memory regions used for secret data are necessarily dynamic mappings or heap
  allocations. This is an area which can be tuned to provide more comprehensive
  protection at the cost of performance.
* [Hardened loads](#hardening-the-address-of-the-load) may still load data from
  _valid_ addresses, just not from _attacker-controlled_ addresses. To prevent
  these from reading secret data, the low 2GB of the address space and 2GB
  above and below any executable pages should be protected.

Credit:
* The core idea of tracing misspeculation through data and marking pointers to
  block misspeculated loads was developed as part of a HACS 2018 discussion
  between Chandler Carruth, Paul Kocher, Thomas Pornin, and several other
  individuals.
* The core idea of masking out loaded bits was part of the original mitigation
  suggested by Jann Horn when these attacks were reported.


### Indirect Branches, Calls, and Returns

It is possible to attack control flow other than conditional branches with
variant #1 style mispredictions.
* A prediction towards a hot call target of a virtual method can lead to it
  being speculatively executed when an object of an unexpected type is used
  (often called "type confusion").
* A hot case may be speculatively executed due to prediction instead of the
  correct case for a switch statement implemented as a jump table.
* A hot common return address may be predicted incorrectly when returning from
  a function.

These code patterns are also vulnerable to Spectre variant #2, and as such are
best mitigated with a
[retpoline](https://support.google.com/faqs/answer/7625886) on x86 platforms.
When a mitigation technique like retpoline is used, speculation simply cannot
proceed through an indirect control flow edge (or it cannot be mispredicted in
the case of a filled RSB) and so it is also protected from variant #1 style
attacks. However, some architectures, micro-architectures, or vendors do not
employ the retpoline mitigation, and on future x86 hardware (both Intel and
AMD) it is expected to become unnecessary due to hardware-based mitigation.

When not using a retpoline, these edges will need independent protection from
variant #1 style attacks. The analogous approach to that used for conditional
control flow should work:
```
uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
uintptr_t all_zeros_mask = 0;
void leak(int data);
void example(int* pointer1, int* pointer2) {
  uintptr_t predicate_state = all_ones_mask;
  switch (condition) {
  case 0:
    // Assuming ?: is implemented using branchless logic...
    predicate_state = (condition != 0) ? all_zeros_mask : predicate_state;
    // ... lots of code ...
    //
    // Harden the pointer so it can't be loaded
    pointer1 &= predicate_state;
    leak(*pointer1);
    break;

  case 1:
    predicate_state = (condition != 1) ? all_zeros_mask : predicate_state;
    // ... more code ...
    //
    // Alternative: Harden the loaded value
    int value2 = *pointer2 & predicate_state;
    leak(value2);
    break;

    // ...
  }
}
```

The core idea remains the same: validate the control flow using data-flow and
use that validation to check that loads cannot leak information along
misspeculated paths. Typically this involves passing the desired target of such
control flow across the edge and checking that it is correct afterwards. Note
that while it is tempting to think that this mitigates variant #2 attacks, it
does not. Those attacks go to arbitrary gadgets that don't include the checks.
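The "pass the intended target across the edge and check it afterwards" idea can be sketched in C. Names here (`expected_id`, `my_id`, `check_edge`) are illustrative only; this models the suggestion, not an implemented scheme:

```c
#include <stdint.h>

/* Branchless validation of an indirect edge: the source of the edge
 * passes the identifier of the destination it intended to reach, and
 * each destination poisons the predicate state (to all-zeros here,
 * matching the pseudo-code polarity) unless that identifier matches
 * its own. A misspeculated indirect edge thus carries poisoned state. */
static uintptr_t check_edge(uintptr_t state, uintptr_t expected_id,
                            uintptr_t my_id) {
    return state & -(uintptr_t)(expected_id == my_id);
}
```

Each case of a switch (or each possible indirect target) would call this with its own identifier, mirroring the per-case `(condition != N)` updates in the example above.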


### Variant #1.1 and #1.2 attacks: "Bounds Check Bypass Store"

Beyond the core variant #1 attack, there are techniques to extend this attack.
The primary technique is known as "Bounds Check Bypass Store" and is discussed
in this research paper: https://people.csail.mit.edu/vlk/spectre11.pdf

We will analyze these two variants independently. First, variant #1.1 works by
speculatively storing over the return address after a bounds check bypass. This
speculative store then ends up being used by the CPU during speculative
execution of the return, potentially directing speculative execution to
arbitrary gadgets in the binary. Let's look at an example.
```
unsigned char local_buffer[4];
unsigned char *untrusted_data_from_caller = ...;
unsigned long untrusted_size_from_caller = ...;
if (untrusted_size_from_caller < sizeof(local_buffer)) {
  // Speculative execution enters here with a too-large size.
  memcpy(local_buffer, untrusted_data_from_caller,
         untrusted_size_from_caller);
  // The stack has now been smashed, writing an attacker-controlled
  // address over the return address.
  minor_processing(local_buffer);
  return;
  // Control will speculate to the attacker-written address.
}
```

However, this can be mitigated by hardening the load of the return address just
like any other load. This is sometimes complicated because x86 for example
*implicitly* loads the return address off the stack. However, the
implementation technique below is specifically designed to mitigate this
implicit load by using the stack pointer to communicate misspeculation between
functions. This additionally causes a misspeculation to have an invalid stack
pointer and never be able to read the speculatively stored return address. See
the detailed discussion below.

For variant #1.2, the attacker speculatively stores into the vtable or jump
table used to implement an indirect call or indirect jump. Because this is
speculative, this will often be possible even when these are stored in
read-only pages. For example:
```
class FancyObject : public BaseObject {
public:
  void DoSomething() override;
};
void f(unsigned long attacker_offset, unsigned long attacker_data) {
  FancyObject object = getMyObject();
  unsigned long *arr[4] = getFourDataPointers();
  if (attacker_offset < 4) {
    // We have bypassed the bounds check speculatively.
    unsigned long *data = arr[attacker_offset];
    // Now we have computed a pointer inside of `object`, the vptr.
    *data = attacker_data;
    // The vptr points to the virtual table and we speculatively clobber that.
    g(object); // Hand the object to some other routine.
  }
}
// In another file, we call a method on the object.
void g(BaseObject &object) {
  object.DoSomething();
  // This speculatively calls the address stored over the vtable.
}
```

Mitigating this requires hardening loads from these locations, or mitigating
the indirect call or indirect jump. Any of these are sufficient to block the
call or jump from using a speculatively stored value that has been read back.

For both of these, using retpolines would be equally sufficient. One possible
hybrid approach is to use retpolines for indirect call and jump, while relying
on SLH to mitigate returns.

Another approach that is sufficient for both of these is to harden all of the
speculative stores. However, as most stores aren't interesting and don't
inherently leak data, this is expected to be prohibitively expensive given the
attack it is defending against.


## Implementation Details

There are a number of complex details impacting the implementation of this
technique, both on a particular architecture and within a particular compiler.
We discuss proposed implementation techniques for the x86 architecture and the
LLVM compiler. These are primarily to serve as an example, as other
implementation techniques are very possible.


### x86 Implementation Details

On the x86 platform we break down the implementation into three core
components: accumulating the predicate state through the control flow graph,
checking the loads, and checking control transfers between procedures.


#### Accumulating Predicate State

Consider baseline x86 instructions like the following, which test three
conditions and if all pass, loads data from memory and potentially leaks it
through some side channel:
```
# %bb.0:        # %entry
  pushq %rax
  testl %edi, %edi
  jne .LBB0_4
# %bb.1:        # %then1
  testl %esi, %esi
  jne .LBB0_4
# %bb.2:        # %then2
  testl %edx, %edx
  je .LBB0_3
.LBB0_4:        # %exit
  popq %rax
  retq
.LBB0_3:        # %danger
  movl (%rcx), %edi
  callq leak
  popq %rax
  retq
```

When we go to speculatively execute the load, we want to know whether any of
the dynamically executed predicates have been misspeculated. To track that,
along each conditional edge, we need to track the data which would allow that
edge to be taken. On x86, this data is stored in the flags register used by the
conditional jump instruction. Along both edges after this fork in control flow,
the flags register remains live and contains data that we can use to build up
our accumulated predicate state. We accumulate it using the x86 conditional
move instruction, which also reads the flags register where the state resides.
These conditional move instructions are known to not be predicted on any x86
processors, making them immune to misprediction that could reintroduce the
vulnerability. When we insert the conditional moves, the code ends up looking
like the following:
```
# %bb.0:        # %entry
  pushq %rax
  xorl %eax, %eax         # Zero out initial predicate state.
  movq $-1, %r8           # Put all-ones mask into a register.
  testl %edi, %edi
  jne .LBB0_1
# %bb.2:        # %then1
  cmovneq %r8, %rax       # Conditionally update predicate state.
  testl %esi, %esi
  jne .LBB0_1
# %bb.3:        # %then2
  cmovneq %r8, %rax       # Conditionally update predicate state.
  testl %edx, %edx
  je .LBB0_4
.LBB0_1:
  cmoveq %r8, %rax        # Conditionally update predicate state.
  popq %rax
  retq
.LBB0_4:        # %danger
  cmovneq %r8, %rax       # Conditionally update predicate state.
  ...
```

Here we create the "empty" or "correct execution" predicate state by zeroing
`%rax`, and we create a constant "incorrect execution" predicate value by
putting `-1` into `%r8`. Then, along each edge coming out of a conditional
branch, we do a conditional move that in a correct execution will be a no-op,
but if misspeculated, will replace `%rax` with the value of `%r8`.
Misspeculating any one of the three predicates will cause `%rax` to hold the
"incorrect execution" value from `%r8`, as we preserve incoming values when
execution is correct rather than overwriting them.

We now have a value in `%rax` in each basic block that indicates if at some
point previously a predicate was mispredicted. And we have arranged for that
value to be particularly effective when used below to harden loads.
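Note that this x86 scheme uses the inverse polarity of the earlier high-level example: here zero in `%rax` means "correct execution" and all-ones means "misspeculated". A C model of a single per-edge update (the `accumulate` name is mine; the ternary stands in for the `cmovCC`, assuming it is lowered branchlessly):

```c
#include <stdint.h>

/* Models one cmov along a conditional edge: if the flags say this edge
 * should not have been taken, merge in the all-ones "incorrect
 * execution" value from %r8; otherwise preserve the incoming state.
 * The poisoned value is sticky across subsequent correct edges. */
static uint64_t accumulate(uint64_t state, int edge_was_correct) {
    return edge_was_correct ? state : (uint64_t)-1;
}
```

Chaining `accumulate` once per traversed edge reproduces the property described above: misspeculating any one predicate leaves `-1` in the state for every later block.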


##### Indirect Call, Branch, and Return Predicates

There is no analogous flag to use when tracing indirect calls, branches, and
returns. The predicate state must be accumulated through some other means.
Fundamentally, this is the reverse of the problem posed in CFI: we need to
check where we came from rather than where we are going. For function-local
jump tables, this is easily arranged by testing the input to the jump table
within each destination (not yet implemented, use retpolines):
```
  pushq %rax
  xorl %eax, %eax         # Zero out initial predicate state.
  movq $-1, %r8           # Put all-ones mask into a register.
  jmpq *.LJTI0_0(,%rdi,8) # Indirect jump through table.
.LBB0_2:        # %sw.bb
  testq $0, %rdi          # Validate index used for jump table.
  cmovneq %r8, %rax       # Conditionally update predicate state.
  ...
  jmp _Z4leaki            # TAILCALL

.LBB0_3:        # %sw.bb1
  testq $1, %rdi          # Validate index used for jump table.
  cmovneq %r8, %rax       # Conditionally update predicate state.
  ...
  jmp _Z4leaki            # TAILCALL

.LBB0_5:        # %sw.bb10
  testq $2, %rdi          # Validate index used for jump table.
  cmovneq %r8, %rax       # Conditionally update predicate state.
  ...
  jmp _Z4leaki            # TAILCALL
  ...

  .section .rodata,"a",@progbits
  .p2align 3
.LJTI0_0:
  .quad .LBB0_2
  .quad .LBB0_3
  .quad .LBB0_5
  ...
```

Returns have a simple mitigation technique on x86-64 (or other ABIs which have
what is called a "red zone" region beyond the end of the stack). This region is
guaranteed to be preserved across interrupts and context switches, making the
return address used in returning to the current code remain on the stack and
valid to read. We can emit code in the caller to verify that a return edge was
not mispredicted:
```
  callq other_function
return_addr:
  testq -8(%rsp), return_addr # Validate return address.
  cmovneq %r8, %rax           # Update predicate state.
```

For an ABI without a "red zone" (and thus unable to read the return address
from the stack), we can compute the expected return address prior to the call
into a register preserved across the call and use that similarly to the above.

Indirect calls (and returns in the absence of a red zone ABI) pose the most
significant challenge to propagate. The simplest technique would be to define a
new ABI such that the intended call target is passed into the called function
and checked in the entry. Unfortunately, new ABIs are quite expensive to deploy
in C and C++. While the target function could be passed in TLS, we would still
require complex logic to handle a mixture of functions compiled with and
without this extra logic (essentially, making the ABI backwards compatible).
Currently, we suggest using retpolines here and will continue to investigate
ways of mitigating this.


##### Optimizations, Alternatives, and Tradeoffs

Merely accumulating predicate state involves significant cost. There are
several key optimizations we employ to minimize this and various alternatives
that present different tradeoffs in the generated code.

First, we work to reduce the number of instructions used to track the state:
* Rather than inserting a `cmovCC` instruction along every conditional edge in
  the original program, we track each set of condition flags we need to capture
  prior to entering each basic block and reuse a common `cmovCC` sequence for
  those.
  * We could further reuse suffixes when there are multiple `cmovCC`
    instructions required to capture the set of flags. Currently this is
    believed to not be worth the cost as paired flags are relatively rare and
    suffixes of them are exceedingly rare.
* A common pattern in x86 is to have multiple conditional jump instructions
  that use the same flags but handle different conditions. Naively, we could
  consider each fallthrough between them an "edge" but this causes a much more
  complex control flow graph. Instead, we accumulate the set of conditions
  necessary for fallthrough and use a sequence of `cmovCC` instructions in a
  single fallthrough edge to track it.

Second, we trade register pressure for simpler `cmovCC` instructions by
allocating a register for the "bad" state. We could read that value from memory
as part of the conditional move instruction; however, this creates more
micro-ops and requires the load-store unit to be involved. Currently, we place
the value into a virtual register and allow the register allocator to decide
when the register pressure is sufficient to make it worth spilling to memory
and reloading.


#### Hardening Loads

Once we have the predicate accumulated into a special value for correct vs.
misspeculated execution, we need to apply this to loads in a way that ensures
they do not leak secret data. There are two primary techniques for this: we can
either harden the loaded value to prevent observation, or we can harden the
address itself to prevent the load from occurring. These have significantly
different performance tradeoffs.


##### Hardening loaded values

The most appealing way to harden loads is to mask out all of the bits loaded.
The key requirement is that for each bit loaded, along the misspeculated path
that bit is always fixed at either 0 or 1 regardless of the value of the bit
loaded. The most obvious implementation uses either an `and` instruction with
an all-zero mask along misspeculated paths and an all-one mask along correct
paths, or an `or` instruction with an all-one mask along misspeculated paths
and an all-zero mask along correct paths. Other options become less appealing,
such as multiplying by zero or using multiple shift instructions. For reasons
we elaborate on below, we end up suggesting you use `or` with an all-ones
mask, making the x86 instruction sequence look like the following:
```
  ...

.LBB0_4:        # %danger
  cmovneq %r8, %rax       # Conditionally update predicate state.
  movl (%rsi), %edi       # Load potentially secret data from %rsi.
  orl %eax, %edi
```

Other useful patterns may be to fold the load into the `or` instruction itself
at the cost of a register-to-register copy.
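A C model of this value-hardening form (the `harden_load` name is mine) shows why the `or` is a no-op on correct paths and forces every loaded bit to 1 on misspeculated ones:

```c
#include <stdint.h>

/* Value hardening with the "or" form: in the x86 scheme, state is 0
 * along correct paths and all-ones when misspeculated, so the or
 * either preserves the loaded value or pins every bit to 1. */
static uint32_t harden_load(const uint32_t *p, uint64_t state) {
    uint32_t v = *p;             /* load of potentially secret data */
    return v | (uint32_t)state;  /* models `orl %eax, %edi` above */
}
```

Pinning the bits to 1 (rather than 0) is what the `or` form buys, and the document returns to why that choice matters below.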

There are some challenges with deploying this approach:
1. Many loads on x86 are folded into other instructions. Separating them would
   add very significant and costly register pressure with prohibitive
   performance cost.
1. Loads may not target a general purpose register, requiring extra
   instructions to map the state value into the correct register class, and
   potentially more expensive instructions to mask the value in some way.
1. The flags registers on x86 are very likely to be live, and challenging to
   preserve cheaply.
1. There are many more values loaded than pointers & indices used for loads. As
   a consequence, hardening the result of a load requires substantially more
   instructions than hardening the address of the load (see below).

Despite these challenges, hardening the result of the load critically allows
the load to proceed and thus has dramatically less impact on the total
speculative / out-of-order potential of the execution. There are also several
interesting techniques to try and mitigate these challenges and make hardening
the results of loads viable in at least some cases. However, we generally
expect to fall back when unprofitable from hardening the loaded value to the
next approach of hardening the address itself.


###### Loads folded into data-invariant operations can be hardened after the operation

The first key to making this feasible is to recognize that many operations on
x86 are "data-invariant". That is, they have no (known) observable behavior
differences due to the particular input data. These instructions are often used
when implementing cryptographic primitives dealing with private key data
because they are not believed to provide any side-channels. Similarly, we can
defer hardening until after them as they will not in-and-of-themselves
introduce a speculative execution side-channel. This results in code sequences
that look like:
```
  ...

.LBB0_4:        # %danger
  cmovneq %r8, %rax       # Conditionally update predicate state.
  addl (%rsi), %edi       # Load and accumulate without leaking.
  orl %eax, %edi
```

While an addition happens to the loaded (potentially secret) value, that
doesn't leak any data and we then immediately harden it.
| 585 | |
| 586 | |
| 587 | ###### Hardening of loaded values deferred down the data-invariant expression graph |
| 588 | |
| 589 | We can generalize the previous idea and sink the hardening down the expression |
| 590 | graph across as many data-invariant operations as desirable. This can use very |
| 591 | conservative rules for whether something is data-invariant. The primary goal |
| 592 | should be to handle multiple loads with a single hardening instruction: |
| 593 | ``` |
| 594 | ... |
| 595 | |
| 596 | .LBB0_4: # %danger |
| 597 | cmovneq %r8, %rax # Conditionally update predicate state. |
| 598 | addl (%rsi), %edi # Load and accumulate without leaking. |
| 599 | addl 4(%rsi), %edi # Continue without leaking. |
| 600 | addl 8(%rsi), %edi |
| 601 | orl %eax, %edi # Mask out bits from all three loads. |
| 602 | ``` |
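The effect of a single deferred hardening instruction covering several data-invariant loads can be sketched with a small Python model of the masking arithmetic (a simulation, not compiler output; `hardened_sum` is a hypothetical name):

```python
MASK64 = (1 << 64) - 1  # All-ones 64-bit mask.

def hardened_sum(values, misspeculating):
    """Model the asm above: several data-invariant adds, one trailing `or`.

    `state` plays the role of %rax: all ones on a misspeculated path
    (after the cmov), zero on the correct path.
    """
    state = MASK64 if misspeculating else 0
    acc = 0
    for v in values:                 # addl (%rsi), %edi; addl 4(%rsi), ...
        acc = (acc + v) & MASK64     # Addition is data-invariant.
    return acc | state               # orl %eax, %edi masks all the loads.

# Correct path: the sum is unchanged. Misspeculated path: every bit is
# forced to one, so none of the (potentially secret) loaded bits survive.
assert hardened_sum([1, 2, 3], misspeculating=False) == 6
assert hardened_sum([1, 2, 3], misspeculating=True) == MASK64
```

The point of the model is that the intermediate sums are never observable through a speculative side channel, so a single `or` at the end suffices for all three loads.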
| 603 | |
| 604 | |
| 605 | ###### Preserving the flags while hardening loaded values on Haswell, Zen, and newer processors |
| 606 | |
| 607 | Sadly, there are no useful instructions on x86 that apply a mask to all 64 bits |
| 608 | without touching the flag registers. However, we can harden loaded values that |
are narrower than a word (fewer than 32 bits on 32-bit systems and fewer than
64 bits on 64-bit systems) by zero-extending the value to the full word size
| 611 | and then shifting right by at least the number of original bits using the BMI2 |
| 612 | `shrx` instruction: |
| 613 | ``` |
| 614 | ... |
| 615 | |
| 616 | .LBB0_4: # %danger |
| 617 | cmovneq %r8, %rax # Conditionally update predicate state. |
| 618 | addl (%rsi), %edi # Load and accumulate 32 bits of data. |
| 619 | shrxq %rax, %rdi, %rdi # Shift out all 32 bits loaded. |
| 620 | ``` |
| 621 | |
| 622 | Because on x86 the zero-extend is free, this can efficiently harden the loaded |
| 623 | value. |
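The behavior of this sequence can be sketched in Python (a model of the shift arithmetic; note that 64-bit `shrx` masks its count register to the low 6 bits, so a predicate state of `-1` yields a shift of 63):

```python
MASK64 = (1 << 64) - 1

def harden_narrow_load(value32, misspeculating):
    """Model hardening a 32-bit load with `shrxq`.

    The 32-bit load zero-extends into a 64-bit register for free. The
    shift count comes from the predicate state: 0 on the correct path
    (the value is untouched), 63 when misspeculating (every loaded bit
    is shifted out, since the value occupies only the low 32 bits).
    """
    state = MASK64 if misspeculating else 0
    count = state & 63                     # shrx uses only the low 6 bits.
    return (value32 & 0xFFFFFFFF) >> count

assert harden_narrow_load(0xDEADBEEF, misspeculating=False) == 0xDEADBEEF
assert harden_narrow_load(0xDEADBEEF, misspeculating=True) == 0
```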
| 624 | |
| 625 | |
| 626 | ##### Hardening the address of the load |
| 627 | |
| 628 | When hardening the loaded value is inapplicable, most often because the |
| 629 | instruction directly leaks information (like `cmp` or `jmpq`), we switch to |
| 630 | hardening the _address_ of the load instead of the loaded value. This avoids |
| 631 | increasing register pressure by unfolding the load or paying some other high |
| 632 | cost. |
| 633 | |
| 634 | To understand how this works in practice, we need to examine the exact |
| 635 | semantics of the x86 addressing modes which, in its fully general form, looks |
| 636 | like `(%base,%index,scale)offset`. Here `%base` and `%index` are 64-bit |
| 637 | registers that can potentially be any value, and may be attacker controlled, |
| 638 | and `scale` and `offset` are fixed immediate values. `scale` must be `1`, `2`, |
| 639 | `4`, or `8`, and `offset` can be any 32-bit sign extended value. The exact |
| 640 | computation performed to find the address is then: `%base + (scale * %index) + |
| 641 | offset` under 64-bit 2's complement modular arithmetic. |
| 642 | |
| 643 | One issue with this approach is that, after hardening, the `%base + (scale * |
| 644 | %index)` subexpression will compute a value near zero (`-1 + (scale * -1)`) and |
| 645 | then a large, positive `offset` will index into memory within the first two |
| 646 | gigabytes of address space. While these offsets are not attacker controlled, |
the attacker could choose to attack a load which happens to have the desired
| 648 | offset and then successfully read memory in that region. This significantly |
| 649 | raises the burden on the attacker and limits the scope of attack but does not |
| 650 | eliminate it. To fully close the attack we must work with the operating system |
| 651 | to preclude mapping memory in the low two gigabytes of address space. |
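The consequence of masking both registers can be checked numerically (a Python model of the modular addressing arithmetic; the scale and offset below are illustrative):

```python
MASK64 = (1 << 64) - 1

def effective_address(base, index, scale, offset):
    """x86 addressing: %base + (scale * %index) + offset, mod 2**64."""
    return (base + scale * index + offset) & MASK64

# After hardening on a misspeculated path, %base and %index are both -1.
poisoned = MASK64
addr = effective_address(poisoned, poisoned, 8, 0x100)
# -1 + (8 * -1) + 0x100 == 0xf7: a small positive address, which is why
# the OS must keep the low 2gb of address space unmapped to fully close
# the attack.
assert addr == 0xF7
```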
| 652 | |
| 653 | |
| 654 | ###### 64-bit load checking instructions |
| 655 | |
| 656 | We can use the following instruction sequences to check loads. We set up `%r8` |
| 657 | in these examples to hold the special value of `-1` which will be `cmov`ed over |
| 658 | `%rax` in misspeculated paths. |
| 659 | |
| 660 | Single register addressing mode: |
| 661 | ``` |
| 662 | ... |
| 663 | |
| 664 | .LBB0_4: # %danger |
| 665 | cmovneq %r8, %rax # Conditionally update predicate state. |
| 666 | orq %rax, %rsi # Mask the pointer if misspeculating. |
| 667 | movl (%rsi), %edi |
| 668 | ``` |
| 669 | |
| 670 | Two register addressing mode: |
| 671 | ``` |
| 672 | ... |
| 673 | |
| 674 | .LBB0_4: # %danger |
| 675 | cmovneq %r8, %rax # Conditionally update predicate state. |
| 676 | orq %rax, %rsi # Mask the pointer if misspeculating. |
| 677 | orq %rax, %rcx # Mask the index if misspeculating. |
| 678 | movl (%rsi,%rcx), %edi |
| 679 | ``` |
| 680 | |
| 681 | This will result in a negative address near zero or in `offset` wrapping the |
| 682 | address space back to a small positive address. Small, negative addresses will |
| 683 | fault in user-mode for most operating systems, but targets which need the high |
| 684 | address space to be user accessible may need to adjust the exact sequence used |
| 685 | above. Additionally, the low addresses will need to be marked unreadable by the |
| 686 | OS to fully harden the load. |
| 687 | |
| 688 | |
| 689 | ###### RIP-relative addressing is even easier to break |
| 690 | |
| 691 | There is a common addressing mode idiom that is substantially harder to check: |
| 692 | addressing relative to the instruction pointer. We cannot change the value of |
| 693 | the instruction pointer register and so we have the harder problem of forcing |
| 694 | `%base + scale * %index + offset` to be an invalid address, by *only* changing |
| 695 | `%index`. The only advantage we have is that the attacker also cannot modify |
| 696 | `%base`. If we use the fast instruction sequence above, but only apply it to |
the index, we will always access `%rip + (scale * -1) + offset`. If the
attacker can find a load for which this address happens to point to secret
data, then they can reach it. However, the loader and base libraries can also
simply refuse to map the heap, data segments, or stack within 2gb of any of the
text in the program, much like they can reserve the low 2gb of address space.
| 702 | |
| 703 | |
| 704 | ###### The flag registers again make everything hard |
| 705 | |
| 706 | Unfortunately, the technique of using `orq`-instructions has a serious flaw on |
| 707 | x86. The very thing that makes it easy to accumulate state, the flag registers |
| 708 | containing predicates, causes serious problems here because they may be alive |
| 709 | and used by the loading instruction or subsequent instructions. On x86, the |
| 710 | `orq` instruction **sets** the flags and will override anything already there. |
| 711 | This makes inserting them into the instruction stream very hazardous. |
| 712 | Unfortunately, unlike when hardening the loaded value, we have no fallback here |
| 713 | and so we must have a fully general approach available. |
| 714 | |
| 715 | The first thing we must do when generating these sequences is try to analyze |
| 716 | the surrounding code to prove that the flags are not in fact alive or being |
| 717 | used. Typically, it has been set by some other instruction which just happens |
| 718 | to set the flags register (much like ours!) with no actual dependency. In those |
| 719 | cases, it is safe to directly insert these instructions. Alternatively we may |
| 720 | be able to move them earlier to avoid clobbering the used value. |
| 721 | |
| 722 | However, this may ultimately be impossible. In that case, we need to preserve |
| 723 | the flags around these instructions: |
| 724 | ``` |
| 725 | ... |
| 726 | |
| 727 | .LBB0_4: # %danger |
| 728 | cmovneq %r8, %rax # Conditionally update predicate state. |
| 729 | pushfq |
| 730 | orq %rax, %rcx # Mask the pointer if misspeculating. |
| 731 | orq %rax, %rdx # Mask the index if misspeculating. |
| 732 | popfq |
| 733 | movl (%rcx,%rdx), %edi |
| 734 | ``` |
| 735 | |
Using the `pushfq` and `popfq` instructions saves the flags register around our
inserted code, but comes at a high cost. First, we must store the flags to the
| 738 | stack and reload them. Second, this causes the stack pointer to be adjusted |
| 739 | dynamically, requiring a frame pointer be used for referring to temporaries |
| 740 | spilled to the stack, etc. |
| 741 | |
| 742 | On newer x86 processors we can use the `lahf` and `sahf` instructions to save |
| 743 | all of the flags besides the overflow flag in a register rather than on the |
| 744 | stack. We can then use `seto` and `add` to save and restore the overflow flag |
| 745 | in a register. Combined, this will save and restore flags in the same manner as |
above, but using two registers rather than the stack. That is still very
expensive, if slightly less so than `pushfq` and `popfq` in most cases.
| 748 | |
| 749 | |
| 750 | ###### A flag-less alternative on Haswell, Zen and newer processors |
| 751 | |
| 752 | Starting with the BMI2 x86 instruction set extensions available on Haswell and |
| 753 | Zen processors, there is an instruction for shifting that does not set any |
| 754 | flags: `shrx`. We can use this and the `lea` instruction to implement analogous |
| 755 | code sequences to the above ones. However, these are still very marginally |
| 756 | slower, as there are fewer ports able to dispatch shift instructions in most |
| 757 | modern x86 processors than there are for `or` instructions. |
| 758 | |
| 759 | Fast, single register addressing mode: |
| 760 | ``` |
| 761 | ... |
| 762 | |
| 763 | .LBB0_4: # %danger |
| 764 | cmovneq %r8, %rax # Conditionally update predicate state. |
| 765 | shrxq %rax, %rsi, %rsi # Shift away bits if misspeculating. |
| 766 | movl (%rsi), %edi |
| 767 | ``` |
| 768 | |
This will collapse the register to zero or one, leaving everything but the
offset in the addressing mode less than or equal to 9. This means the full
address can only be guaranteed to be less than `(1 << 31) + 9`. The OS may wish
to protect an extra page of the low address space to account for this.
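The `(1 << 31) + 9` bound follows from simple arithmetic, sketched here (assuming the worst case of a two-register addressing mode with both registers shifted down to at most one):

```python
# After `shrx` by 63, each register collapses to its top bit: 0 or 1.
max_base = max_index = 1
max_scale = 8                       # Largest legal x86 scale.
max_offset = (1 << 31) - 1          # Largest 32-bit sign-extended offset.

max_addr = max_base + max_scale * max_index + max_offset
assert max_addr < (1 << 31) + 9     # The bound quoted in the text.
```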
| 773 | |
| 774 | |
| 775 | ##### Optimizations |
| 776 | |
| 777 | A very large portion of the cost for this approach comes from checking loads in |
| 778 | this way, so it is important to work to optimize this. However, beyond making |
| 779 | the instruction sequences to *apply* the checks efficient (for example by |
| 780 | avoiding `pushfq` and `popfq` sequences), the only significant optimization is |
| 781 | to check fewer loads without introducing a vulnerability. We apply several |
| 782 | techniques to accomplish that. |
| 783 | |
| 784 | |
| 785 | ###### Don't check loads from compile-time constant stack offsets |
| 786 | |
| 787 | We implement this optimization on x86 by skipping the checking of loads which |
| 788 | use a fixed frame pointer offset. |
| 789 | |
| 790 | The result of this optimization is that patterns like reloading a spilled |
| 791 | register or accessing a global field don't get checked. This is a very |
| 792 | significant performance win. |
| 793 | |
| 794 | |
| 795 | ###### Don't check dependent loads |
| 796 | |
| 797 | A core part of why this mitigation strategy works is that it establishes a |
| 798 | data-flow check on the loaded address. However, this means that if the address |
| 799 | itself was already loaded using a checked load, there is no need to check a |
| 800 | dependent load provided it is within the same basic block as the checked load, |
| 801 | and therefore has no additional predicates guarding it. Consider code like the |
| 802 | following: |
| 803 | ``` |
| 804 | ... |
| 805 | |
| 806 | .LBB0_4: # %danger |
| 807 | movq (%rcx), %rdi |
| 808 | movl (%rdi), %edx |
| 809 | ``` |
| 810 | |
| 811 | This will get transformed into: |
| 812 | ``` |
| 813 | ... |
| 814 | |
| 815 | .LBB0_4: # %danger |
| 816 | cmovneq %r8, %rax # Conditionally update predicate state. |
| 817 | orq %rax, %rcx # Mask the pointer if misspeculating. |
| 818 | movq (%rcx), %rdi # Hardened load. |
| 819 | movl (%rdi), %edx # Unhardened load due to dependent addr. |
| 820 | ``` |
| 821 | |
| 822 | This doesn't check the load through `%rdi` as that pointer is dependent on a |
| 823 | checked load already. |
| 824 | |
| 825 | |
| 826 | ###### Protect large, load-heavy blocks with a single lfence |
| 827 | |
| 828 | It may be worth using a single `lfence` instruction at the start of a block |
| 829 | which begins with a (very) large number of loads that require independent |
| 830 | protection *and* which require hardening the address of the load. However, this |
| 831 | is unlikely to be profitable in practice. The latency hit of the hardening |
| 832 | would need to exceed that of an `lfence` when *correctly* speculatively |
| 833 | executed. But in that case, the `lfence` cost is a complete loss of speculative |
| 834 | execution (at a minimum). So far, the evidence we have of the performance cost |
| 835 | of using `lfence` indicates few if any hot code patterns where this trade off |
| 836 | would make sense. |
| 837 | |
| 838 | |
| 839 | ###### Tempting optimizations that break the security model |
| 840 | |
| 841 | Several optimizations were considered which didn't pan out due to failure to |
| 842 | uphold the security model. One in particular is worth discussing as many others |
| 843 | will reduce to it. |
| 844 | |
| 845 | We wondered whether only the *first* load in a basic block could be checked. If |
| 846 | the check works as intended, it forms an invalid pointer that doesn't even |
| 847 | virtual-address translate in the hardware. It should fault very early on in its |
| 848 | processing. Maybe that would stop things in time for the misspeculated path to |
| 849 | fail to leak any secrets. This doesn't end up working because the processor is |
| 850 | fundamentally out-of-order, even in its speculative domain. As a consequence, |
| 851 | the attacker could cause the initial address computation itself to stall and |
| 852 | allow an arbitrary number of unrelated loads (including attacked loads of |
| 853 | secret data) to pass through. |
| 854 | |
| 855 | |
| 856 | #### Interprocedural Checking |
| 857 | |
| 858 | Modern x86 processors may speculate into called functions and out of functions |
| 859 | to their return address. As a consequence, we need a way to check loads that |
| 860 | occur after a misspeculated predicate but where the load and the misspeculated |
| 861 | predicate are in different functions. In essence, we need some interprocedural |
| 862 | generalization of the predicate state tracking. A primary challenge to passing |
| 863 | the predicate state between functions is that we would like to not require a |
| 864 | change to the ABI or calling convention in order to make this mitigation more |
| 865 | deployable, and further would like code mitigated in this way to be easily |
| 866 | mixed with code not mitigated in this way and without completely losing the |
| 867 | value of the mitigation. |
| 868 | |
| 869 | |
| 870 | ##### Embed the predicate state into the high bit(s) of the stack pointer |
| 871 | |
| 872 | We can use the same technique that allows hardening pointers to pass the |
| 873 | predicate state into and out of functions. The stack pointer is trivially |
| 874 | passed between functions and we can test for it having the high bits set to |
| 875 | detect when it has been marked due to misspeculation. The callsite instruction |
| 876 | sequence looks like (assuming a misspeculated state value of `-1`): |
| 877 | ``` |
| 878 | ... |
| 879 | |
| 880 | .LBB0_4: # %danger |
| 881 | cmovneq %r8, %rax # Conditionally update predicate state. |
| 882 | shlq $47, %rax |
| 883 | orq %rax, %rsp |
| 884 | callq other_function |
| 885 | movq %rsp, %rax |
sarq $63, %rax # Sign extend the high bit to all bits.
| 887 | ``` |
| 888 | |
| 889 | This first puts the predicate state into the high bits of `%rsp` before calling |
| 890 | the function and then reads it back out of high bits of `%rsp` afterward. When |
| 891 | correctly executing (speculatively or not), these are all no-ops. When |
| 892 | misspeculating, the stack pointer will end up negative. We arrange for it to |
| 893 | remain a canonical address, but otherwise leave the low bits alone to allow |
| 894 | stack adjustments to proceed normally without disrupting this. Within the |
| 895 | called function, we can extract this predicate state and then reset it on |
| 896 | return: |
| 897 | ``` |
| 898 | other_function: |
# prolog
movq %rsp, %rax
sarq $63, %rax # Sign extend the high bit to all bits.
| 903 | # ... |
| 904 | |
| 905 | .LBB0_N: |
| 906 | cmovneq %r8, %rax # Conditionally update predicate state. |
| 907 | shlq $47, %rax |
| 908 | orq %rax, %rsp |
| 909 | retq |
| 910 | ``` |
| 911 | |
| 912 | This approach is effective when all code is mitigated in this fashion, and can |
| 913 | even survive very limited reaches into unmitigated code (the state will |
| 914 | round-trip in and back out of an unmitigated function, it just won't be |
| 915 | updated). But it does have some limitations. There is a cost to merging the |
| 916 | state into `%rsp` and it doesn't insulate mitigated code from misspeculation in |
| 917 | an unmitigated caller. |
| 918 | |
| 919 | There is also an advantage to using this form of interprocedural mitigation: by |
| 920 | forming these invalid stack pointer addresses we can prevent speculative |
| 921 | returns from successfully reading speculatively written values to the actual |
stack. This works first by forming a data dependency between computing the
address of the return address on the stack and our predicate state. Even when
that dependency is satisfied, if a misprediction causes the state to be
poisoned, the resulting stack pointer will be invalid.
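The round trip of the predicate state through the high bits of `%rsp` can be modeled in Python (a simulation of the 64-bit shift/or/sar arithmetic, not actual codegen; the sample stack pointer value is illustrative):

```python
MASK64 = (1 << 64) - 1

def merge_state_into_rsp(rsp, state):
    """shlq $47, %rax; orq %rax, %rsp.

    With a state of all ones this sets the high bits of %rsp while
    leaving the low bits (and ordinary stack adjustments) alone.
    """
    return (rsp | ((state << 47) & MASK64)) & MASK64

def extract_state_from_rsp(rsp):
    """movq %rsp, %rax; sarq $63, %rax.

    The arithmetic shift broadcasts the sign bit, recovering all ones
    or zero.
    """
    return MASK64 if (rsp >> 63) & 1 else 0

rsp = 0x00007FFDDEAD0000            # A plausible user-space stack pointer.
for state in (0, MASK64):
    merged = merge_state_into_rsp(rsp, state)
    assert extract_state_from_rsp(merged) == state
    if state == 0:
        assert merged == rsp        # Correct execution: a complete no-op.
```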
| 926 | |
| 927 | |
| 928 | ##### Rewrite API of internal functions to directly propagate predicate state |
| 929 | |
| 930 | (Not yet implemented.) |
| 931 | |
| 932 | We have the option with internal functions to directly adjust their API to |
| 933 | accept the predicate as an argument and return it. This is likely to be |
| 934 | marginally cheaper than embedding into `%rsp` for entering functions. |
| 935 | |
| 936 | |
| 937 | ##### Use `lfence` to guard function transitions |
| 938 | |
| 939 | An `lfence` instruction can be used to prevent subsequent loads from |
| 940 | speculatively executing until all prior mispredicted predicates have resolved. |
We can use this broader barrier to block speculative loads from executing
between functions. We emit it in the entry block to handle calls, and prior to each
| 943 | return. This approach also has the advantage of providing the strongest degree |
| 944 | of mitigation when mixed with unmitigated code by halting all misspeculation |
entering a function which is mitigated, regardless of what occurred in the
| 946 | caller. However, such a mixture is inherently more risky. Whether this kind of |
| 947 | mixture is a sufficient mitigation requires careful analysis. |
| 948 | |
| 949 | Unfortunately, experimental results indicate that the performance overhead of |
| 950 | this approach is very high for certain patterns of code. A classic example is |
| 951 | any form of recursive evaluation engine. The hot, rapid call and return |
| 952 | sequences exhibit dramatic performance loss when mitigated with `lfence`. This |
| 953 | component alone can regress performance by 2x or more, making it an unpleasant |
| 954 | tradeoff even when only used in a mixture of code. |
| 955 | |
| 956 | |
| 957 | ##### Use an internal TLS location to pass predicate state |
| 958 | |
| 959 | We can define a special thread-local value to hold the predicate state between |
| 960 | functions. This avoids direct ABI implications by using a side channel between |
| 961 | callers and callees to communicate the predicate state. It also allows implicit |
| 962 | zero-initialization of the state, which allows non-checked code to be the first |
| 963 | code executed. |
| 964 | |
| 965 | However, this requires a load from TLS in the entry block, a store to TLS |
| 966 | before every call and every ret, and a load from TLS after every call. As a |
| 967 | consequence it is expected to be substantially more expensive even than using |
| 968 | `%rsp` and potentially `lfence` within the function entry block. |
| 969 | |
| 970 | |
| 971 | ##### Define a new ABI and/or calling convention |
| 972 | |
| 973 | We could define a new ABI and/or calling convention to explicitly pass the |
| 974 | predicate state in and out of functions. This may be interesting if none of the |
| 975 | alternatives have adequate performance, but it makes deployment and adoption |
| 976 | dramatically more complex, and potentially infeasible. |
| 977 | |
| 978 | |
| 979 | ## High-Level Alternative Mitigation Strategies |
| 980 | |
| 981 | There are completely different alternative approaches to mitigating variant 1 |
| 982 | attacks. [Most](https://lwn.net/Articles/743265/) |
| 983 | [discussion](https://lwn.net/Articles/744287/) so far focuses on mitigating |
| 984 | specific known attackable components in the Linux kernel (or other kernels) by |
| 985 | manually rewriting the code to contain an instruction sequence that is not |
| 986 | vulnerable. For x86 systems this is done by either injecting an `lfence` |
| 987 | instruction along the code path which would leak data if executed speculatively |
| 988 | or by rewriting memory accesses to have branch-less masking to a known safe |
| 989 | region. On Intel systems, `lfence` [will prevent the speculative load of secret |
| 990 | data](https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf). |
| 991 | On AMD systems `lfence` is currently a no-op, but can be made |
| 992 | dispatch-serializing by setting an MSR, and thus preclude misspeculation of the |
| 993 | code path ([mitigation G-2 + |
| 994 | V1-1](https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf)). |
| 995 | |
| 996 | However, this relies on finding and enumerating all possible points in code |
| 997 | which could be attacked to leak information. While in some cases static |
| 998 | analysis is effective at doing this at scale, in many cases it still relies on |
| 999 | human judgement to evaluate whether code might be vulnerable. Especially for |
| 1000 | software systems which receive less detailed scrutiny but remain sensitive to |
| 1001 | these attacks, this seems like an impractical security model. We need an |
| 1002 | automatic and systematic mitigation strategy. |
| 1003 | |
| 1004 | |
| 1005 | ### Automatic `lfence` on Conditional Edges |
| 1006 | |
| 1007 | A natural way to scale up the existing hand-coded mitigations is simply to |
| 1008 | inject an `lfence` instruction into both the target and fallthrough |
| 1009 | destinations of every conditional branch. This ensures that no predicate or |
| 1010 | bounds check can be bypassed speculatively. However, the performance overhead |
| 1011 | of this approach is, simply put, catastrophic. Yet it remains the only truly |
| 1012 | "secure by default" approach known prior to this effort and serves as the |
| 1013 | baseline for performance. |
| 1014 | |
| 1015 | One attempt to address the performance overhead of this and make it more |
| 1016 | realistic to deploy is [MSVC's /Qspectre |
| 1017 | switch](https://blogs.msdn.microsoft.com/vcblog/2018/01/15/spectre-mitigations-in-msvc/). |
| 1018 | Their technique is to use static analysis within the compiler to only insert |
| 1019 | `lfence` instructions into conditional edges at risk of attack. However, |
| 1020 | [initial](https://arstechnica.com/gadgets/2018/02/microsofts-compiler-level-spectre-fix-shows-how-hard-this-problem-will-be-to-solve/) |
| 1021 | [analysis](https://www.paulkocher.com/doc/MicrosoftCompilerSpectreMitigation.html) |
| 1022 | has shown that this approach is incomplete and only catches a small and limited |
| 1023 | subset of attackable patterns which happen to resemble very closely the initial |
| 1024 | proofs of concept. As such, while its performance is acceptable, it does not |
| 1025 | appear to be an adequate systematic mitigation. |
| 1026 | |
| 1027 | |
| 1028 | ## Performance Overhead |
| 1029 | |
| 1030 | The performance overhead of this style of comprehensive mitigation is very |
| 1031 | high. However, it compares very favorably with previously recommended |
| 1032 | approaches such as the `lfence` instruction. Just as users can restrict the |
| 1033 | scope of `lfence` to control its performance impact, this mitigation technique |
| 1034 | could be restricted in scope as well. |
| 1035 | |
| 1036 | However, it is important to understand what it would cost to get a fully |
mitigated baseline. Here we assume targeting a Haswell (or newer) processor and
using all of the tricks to improve performance (so this leaves the low 2gb
unprotected, as well as +/- 2gb surrounding any PC in the program). We ran both
| 1040 | Google's microbenchmark suite and a large highly-tuned server built using |
| 1041 | ThinLTO and PGO. All were built with `-march=haswell` to give access to BMI2 |
| 1042 | instructions, and benchmarks were run on large Haswell servers. We collected |
| 1043 | data both with an `lfence`-based mitigation and load hardening as presented |
| 1044 | here. The summary is that mitigating with load hardening is 1.77x faster than |
| 1045 | mitigating with `lfence`, and the overhead of load hardening compared to a |
| 1046 | normal program is likely between a 10% overhead and a 50% overhead with most |
| 1047 | large applications seeing a 30% overhead or less. |
| 1048 | |
| 1049 | | Benchmark | `lfence` | Load Hardening | Mitigated Speedup | |
| 1050 | | -------------------------------------- | -------: | -------------: | ----------------: | |
| 1051 | | Google microbenchmark suite | -74.8% | -36.4% | **2.5x** | |
| 1052 | | Large server QPS (using ThinLTO & PGO) | -62% | -29% | **1.8x** | |
| 1053 | |
| 1054 | Below is a visualization of the microbenchmark suite results which helps show |
| 1055 | the distribution of results that is somewhat lost in the summary. The y-axis is |
| 1056 | a log-scale speedup ratio of load hardening relative to `lfence` (up -> faster |
| 1057 | -> better). Each box-and-whiskers represents one microbenchmark which may have |
| 1058 | many different metrics measured. The red line marks the median, the box marks |
| 1059 | the first and third quartiles, and the whiskers mark the min and max. |
| 1060 | |
| 1061 |  |
| 1062 | |
| 1063 | We don't yet have benchmark data on SPEC or the LLVM test suite, but we can |
| 1064 | work on getting that. Still, the above should give a pretty clear |
| 1065 | characterization of the performance, and specific benchmarks are unlikely to |
| 1066 | reveal especially interesting properties. |
| 1067 | |
| 1068 | |
| 1069 | ### Future Work: Fine Grained Control and API-Integration |
| 1070 | |
| 1071 | The performance overhead of this technique is likely to be very significant and |
| 1072 | something users wish to control or reduce. There are interesting options here |
| 1073 | that impact the implementation strategy used. |
| 1074 | |
| 1075 | One particularly appealing option is to allow both opt-in and opt-out of this |
| 1076 | mitigation at reasonably fine granularity such as on a per-function basis, |
| 1077 | including intelligent handling of inlining decisions -- protected code can be |
| 1078 | prevented from inlining into unprotected code, and unprotected code will become |
| 1079 | protected when inlined into protected code. For systems where only a limited |
| 1080 | set of code is reachable by externally controlled inputs, it may be possible to |
| 1081 | limit the scope of mitigation through such mechanisms without compromising the |
| 1082 | application's overall security. The performance impact may also be focused in a |
| 1083 | few key functions that can be hand-mitigated in ways that have lower |
| 1084 | performance overhead while the remainder of the application receives automatic |
| 1085 | protection. |
| 1086 | |
| 1087 | For both limiting the scope of mitigation or manually mitigating hot functions, |
| 1088 | there needs to be some support for mixing mitigated and unmitigated code |
| 1089 | without completely defeating the mitigation. For the first use case, it would |
| 1090 | be particularly desirable that mitigated code remains safe when being called |
| 1091 | during misspeculation from unmitigated code. |
| 1092 | |
| 1093 | For the second use case, it may be important to connect the automatic |
| 1094 | mitigation technique to explicit mitigation APIs such as what is described in |
| 1095 | http://wg21.link/p0928 (or any other eventual API) so that there is a clean way |
| 1096 | to switch from automatic to manual mitigation without immediately exposing a |
| 1097 | hole. However, the design for how to do this is hard to come up with until the |
| 1098 | APIs are better established. We will revisit this as those APIs mature. |