Philip Reames | f8bf9dd | 2015-02-27 23:14:50 +0000 | [diff] [blame] | 1 | ===================================== |
| 2 | Performance Tips for Frontend Authors |
| 3 | ===================================== |
| 4 | |
| 5 | .. contents:: |
| 6 | :local: |
| 7 | :depth: 2 |
| 8 | |
| 9 | Abstract |
| 10 | ======== |
| 11 | |
| 12 | The intended audience of this document is developers of language frontends |
| 13 | targeting LLVM IR. This document is home to a collection of tips on how to |
Philip Reames | 92aa8d6 | 2015-08-24 18:16:02 +0000 | [diff] [blame] | 14 | generate IR that optimizes well. |
Philip Reames | f8bf9dd | 2015-02-27 23:14:50 +0000 | [diff] [blame] | 15 | |
Philip Reames | a3bf52c | 2015-08-24 17:19:18 +0000 | [diff] [blame] | 16 | IR Best Practices |
| 17 | ================= |
| 18 | |
Philip Reames | 92aa8d6 | 2015-08-24 18:16:02 +0000 | [diff] [blame] | 19 | As with any optimizer, LLVM has its strengths and weaknesses. In some cases, |
| 20 | surprisingly small changes in the source IR can have a large effect on the |
| 21 | generated code. |
| 22 | |
| 23 | Beyond the specific items on the list below, it's worth noting that the most |
| 24 | mature frontend for LLVM is Clang. As a result, the further your IR gets from what Clang might emit, the less likely it is to be effectively optimized. It |
| 25 | can often be useful to write a quick C program with the semantics you're trying |
| 26 | to model and see what decisions Clang's IRGen makes about what IR to emit. |
| 27 | Studying Clang's CodeGen directory can also be a good source of ideas. Note |
| 28 | that Clang and LLVM are explicitly version locked so you'll need to make sure |
| 29 | you're using a Clang built from the same svn revision or release as the LLVM |
| 30 | library you're using. As always, it's *strongly* recommended that you track |
| 31 | tip of tree development, particularly during bring up of a new project. |
| 32 | |
| 33 | The Basics |
| 34 | ^^^^^^^^^^^ |
| 35 | |
| 36 | #. Make sure that your Modules contain both a data layout specification and |
| 37 | target triple. Without these pieces, non of the target specific optimization |
| 38 | will be enabled. This can have a major effect on the generated code quality. |
| 39 | |
| 40 | #. For each function or global emitted, use the most private linkage type |
| 41 | possible (private, internal or linkonce_odr preferably). Doing so will |
| 42 | make LLVM's inter-procedural optimizations much more effective. |
| 43 | |
| 44 | #. Avoid high in-degree basic blocks (e.g. basic blocks with dozens or hundreds |
| 45 | of predecessors). Among other issues, the register allocator is known to |
| 46 | perform badly with confronted with such structures. The only exception to |
| 47 | this guidance is that a unified return block with high in-degree is fine. |
| 48 | |
Philip Reames | fba81bc | 2015-09-10 17:03:10 +0000 | [diff] [blame] | 49 | Use of allocas |
| 50 | ^^^^^^^^^^^^^^ |
| 51 | |
| 52 | An alloca instruction can be used to represent a function scoped stack slot, |
| 53 | but can also represent dynamic frame expansion. When representing function |
| 54 | scoped variables or locations, placing alloca instructions at the beginning of |
| 55 | the entry block should be preferred. In particular, place them before any |
| 56 | call instructions. Call instructions might get inlined and replaced with |
| 57 | multiple basic blocks. The end result is that a following alloca instruction |
| 58 | would no longer be in the entry basic block afterward. |
| 59 | |
| 60 | The SROA (Scalar Replacement Of Aggregates) and Mem2Reg passes only attempt |
| 61 | to eliminate alloca instructions that are in the entry basic block. Given |
| 62 | SSA is the canonical form expected by much of the optimizer; if allocas can |
| 63 | not be eliminated by Mem2Reg or SROA, the optimizer is likely to be less |
| 64 | effective than it could be. |
Philip Reames | 92aa8d6 | 2015-08-24 18:16:02 +0000 | [diff] [blame] | 65 | |
Philip Reames | f8bf9dd | 2015-02-27 23:14:50 +0000 | [diff] [blame] | 66 | Avoid loads and stores of large aggregate type |
Philip Reames | a3bf52c | 2015-08-24 17:19:18 +0000 | [diff] [blame] | 67 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
Philip Reames | f8bf9dd | 2015-02-27 23:14:50 +0000 | [diff] [blame] | 68 | |
| 69 | LLVM currently does not optimize well loads and stores of large :ref:`aggregate |
| 70 | types <t_aggregate>` (i.e. structs and arrays). As an alternative, consider |
| 71 | loading individual fields from memory. |
| 72 | |
| 73 | Aggregates that are smaller than the largest (performant) load or store |
| 74 | instruction supported by the targeted hardware are well supported. These can |
| 75 | be an effective way to represent collections of small packed fields. |
| 76 | |
| 77 | Prefer zext over sext when legal |
Philip Reames | a3bf52c | 2015-08-24 17:19:18 +0000 | [diff] [blame] | 78 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
Philip Reames | f8bf9dd | 2015-02-27 23:14:50 +0000 | [diff] [blame] | 79 | |
| 80 | On some architectures (X86_64 is one), sign extension can involve an extra |
| 81 | instruction whereas zero extension can be folded into a load. LLVM will try to |
| 82 | replace a sext with a zext when it can be proven safe, but if you have |
| 83 | information in your source language about the range of a integer value, it can |
| 84 | be profitable to use a zext rather than a sext. |
| 85 | |
| 86 | Alternatively, you can :ref:`specify the range of the value using metadata |
| 87 | <range-metadata>` and LLVM can do the sext to zext conversion for you. |
| 88 | |
| 89 | Zext GEP indices to machine register width |
Philip Reames | a3bf52c | 2015-08-24 17:19:18 +0000 | [diff] [blame] | 90 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
Philip Reames | f8bf9dd | 2015-02-27 23:14:50 +0000 | [diff] [blame] | 91 | |
| 92 | Internally, LLVM often promotes the width of GEP indices to machine register |
| 93 | width. When it does so, it will default to using sign extension (sext) |
| 94 | operations for safety. If your source language provides information about |
| 95 | the range of the index, you may wish to manually extend indices to machine |
| 96 | register width using a zext instruction. |
| 97 | |
Philip Reames | fba81bc | 2015-09-10 17:03:10 +0000 | [diff] [blame] | 98 | When to specify alignment |
| 99 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 100 | LLVM will always generate correct code if you don’t specify alignment, but may |
| 101 | generate inefficient code. For example, if you are targeting MIPS (or older |
| 102 | ARM ISAs) then the hardware does not handle unaligned loads and stores, and |
| 103 | so you will enter a trap-and-emulate path if you do a load or store with |
| 104 | lower-than-natural alignment. To avoid this, LLVM will emit a slower |
| 105 | sequence of loads, shifts and masks (or load-right + load-left on MIPS) for |
| 106 | all cases where the load / store does not have a sufficiently high alignment |
| 107 | in the IR. |
| 108 | |
| 109 | The alignment is used to guarantee the alignment on allocas and globals, |
| 110 | though in most cases this is unnecessary (most targets have a sufficiently |
| 111 | high default alignment that they’ll be fine). It is also used to provide a |
| 112 | contract to the back end saying ‘either this load/store has this alignment, or |
| 113 | it is undefined behavior’. This means that the back end is free to emit |
| 114 | instructions that rely on that alignment (and mid-level optimizers are free to |
| 115 | perform transforms that require that alignment). For x86, it doesn’t make |
| 116 | much difference, as almost all instructions are alignment-independent. For |
| 117 | MIPS, it can make a big difference. |
| 118 | |
| 119 | Note that if your loads and stores are atomic, the backend will be unable to |
| 120 | lower an under aligned access into a sequence of natively aligned accesses. |
| 121 | As a result, alignment is mandatory for atomic loads and stores. |
| 122 | |
Philip Reames | a3bf52c | 2015-08-24 17:19:18 +0000 | [diff] [blame] | 123 | Other Things to Consider |
| 124 | ^^^^^^^^^^^^^^^^^^^^^^^^ |
Philip Reames | dd323ac | 2015-03-02 19:19:04 +0000 | [diff] [blame] | 125 | |
Philip Reames | dd323ac | 2015-03-02 19:19:04 +0000 | [diff] [blame] | 126 | #. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing |
| 127 | analysis), prefer GEPs |
| 128 | |
Philip Reames | dd323ac | 2015-03-02 19:19:04 +0000 | [diff] [blame] | 129 | #. Prefer globals over inttoptr of a constant address - this gives you |
| 130 | dereferencability information. In MCJIT, use getSymbolAddress to provide |
| 131 | actual address. |
| 132 | |
| 133 | #. Be wary of ordered and atomic memory operations. They are hard to optimize |
| 134 | and may not be well optimized by the current optimizer. Depending on your |
| 135 | source language, you may consider using fences instead. |
| 136 | |
Philip Reames | 34843ae | 2015-03-05 05:55:55 +0000 | [diff] [blame] | 137 | #. If calling a function which is known to throw an exception (unwind), use |
| 138 | an invoke with a normal destination which contains an unreachable |
| 139 | instruction. This form conveys to the optimizer that the call returns |
| 140 | abnormally. For an invoke which neither returns normally or requires unwind |
| 141 | code in the current function, you can use a noreturn call instruction if |
| 142 | desired. This is generally not required because the optimizer will convert |
| 143 | an invoke with an unreachable unwind destination to a call instruction. |
| 144 | |
Philip Reames | 34843ae | 2015-03-05 05:55:55 +0000 | [diff] [blame] | 145 | #. Use profile metadata to indicate statically known cold paths, even if |
| 146 | dynamic profiling information is not available. This can make a large |
| 147 | difference in code placement and thus the performance of tight loops. |
| 148 | |
| 149 | #. When generating code for loops, try to avoid terminating the header block of |
| 150 | the loop earlier than necessary. If the terminator of the loop header |
| 151 | block is a loop exiting conditional branch, the effectiveness of LICM will |
| 152 | be limited for loads not in the header. (This is due to the fact that LLVM |
| 153 | may not know such a load is safe to speculatively execute and thus can't |
| 154 | lift an otherwise loop invariant load unless it can prove the exiting |
| 155 | condition is not taken.) It can be profitable, in some cases, to emit such |
| 156 | instructions into the header even if they are not used along a rarely |
| 157 | executed path that exits the loop. This guidance specifically does not |
| 158 | apply if the condition which terminates the loop header is itself invariant, |
| 159 | or can be easily discharged by inspecting the loop index variables. |
| 160 | |
| 161 | #. In hot loops, consider duplicating instructions from small basic blocks |
| 162 | which end in highly predictable terminators into their successor blocks. |
| 163 | If a hot successor block contains instructions which can be vectorized |
| 164 | with the duplicated ones, this can provide a noticeable throughput |
| 165 | improvement. Note that this is not always profitable and does involve a |
| 166 | potentially large increase in code size. |
| 167 | |
Philip Reames | 65f3359 | 2015-04-26 22:15:18 +0000 | [diff] [blame] | 168 | #. When checking a value against a constant, emit the check using a consistent |
Philip Reames | 5b07572 | 2015-04-26 22:25:29 +0000 | [diff] [blame] | 169 | comparison type. The GVN pass *will* optimize redundant equalities even if |
Philip Reames | 65f3359 | 2015-04-26 22:15:18 +0000 | [diff] [blame] | 170 | the type of comparison is inverted, but GVN only runs late in the pipeline. |
Philip Reames | e0e9083 | 2015-04-26 22:23:12 +0000 | [diff] [blame] | 171 | As a result, you may miss the opportunity to run other important |
Philip Reames | 65f3359 | 2015-04-26 22:15:18 +0000 | [diff] [blame] | 172 | optimizations. Improvements to EarlyCSE to remove this issue are tracked in |
| 173 | Bug 23333. |
| 174 | |
Philip Reames | 5b07572 | 2015-04-26 22:25:29 +0000 | [diff] [blame] | 175 | #. Avoid using arithmetic intrinsics unless you are *required* by your source |
Philip Reames | 65f3359 | 2015-04-26 22:15:18 +0000 | [diff] [blame] | 176 | language specification to emit a particular code sequence. The optimizer |
| 177 | is quite good at reasoning about general control flow and arithmetic, it is |
| 178 | not anywhere near as strong at reasoning about the various intrinsics. If |
| 179 | profitable for code generation purposes, the optimizer will likely form the |
Philip Reames | 5b07572 | 2015-04-26 22:25:29 +0000 | [diff] [blame] | 180 | intrinsics itself late in the optimization pipeline. It is *very* rarely |
Philip Reames | 65f3359 | 2015-04-26 22:15:18 +0000 | [diff] [blame] | 181 | profitable to emit these directly in the language frontend. This item |
| 182 | explicitly includes the use of the :ref:`overflow intrinsics <int_overflow>`. |
| 183 | |
Philip Reames | e0e9083 | 2015-04-26 22:23:12 +0000 | [diff] [blame] | 184 | #. Avoid using the :ref:`assume intrinsic <int_assume>` until you've |
| 185 | established that a) there's no other way to express the given fact and b) |
| 186 | that fact is critical for optimization purposes. Assumes are a great |
| 187 | prototyping mechanism, but they can have negative effects on both compile |
| 188 | time and optimization effectiveness. The former is fixable with enough |
| 189 | effort, but the later is fairly fundamental to their designed purpose. |
| 190 | |
Philip Reames | dd323ac | 2015-03-02 19:19:04 +0000 | [diff] [blame] | 191 | |
Philip Reames | a3bf52c | 2015-08-24 17:19:18 +0000 | [diff] [blame] | 192 | Describing Language Specific Properties |
| 193 | ======================================= |
| 194 | |
Philip Reames | aa297ea | 2015-08-24 17:38:58 +0000 | [diff] [blame] | 195 | When translating a source language to LLVM, finding ways to express concepts |
| 196 | and guarantees available in your source language which are not natively |
| 197 | provided by LLVM IR will greatly improve LLVM's ability to optimize your code. |
| 198 | As an example, C/C++'s ability to mark every add as "no signed wrap (nsw)" goes |
| 199 | a long way to assisting the optimizer in reasoning about loop induction |
| 200 | variables and thus generating more optimal code for loops. |
Philip Reames | a3bf52c | 2015-08-24 17:19:18 +0000 | [diff] [blame] | 201 | |
Philip Reames | aa297ea | 2015-08-24 17:38:58 +0000 | [diff] [blame] | 202 | The LLVM LangRef includes a number of mechanisms for annotating the IR with |
| 203 | additional semantic information. It is *strongly* recommended that you become |
| 204 | highly familiar with this document. The list below is intended to highlight a |
| 205 | couple of items of particular interest, but is by no means exhaustive. |
Philip Reames | a3bf52c | 2015-08-24 17:19:18 +0000 | [diff] [blame] | 206 | |
Philip Reames | aa297ea | 2015-08-24 17:38:58 +0000 | [diff] [blame] | 207 | Restricted Operation Semantics |
| 208 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
Philip Reames | a3bf52c | 2015-08-24 17:19:18 +0000 | [diff] [blame] | 209 | #. Add nsw/nuw flags as appropriate. Reasoning about overflow is |
| 210 | generally hard for an optimizer so providing these facts from the frontend |
| 211 | can be very impactful. |
| 212 | |
| 213 | #. Use fast-math flags on floating point operations if legal. If you don't |
| 214 | need strict IEEE floating point semantics, there are a number of additional |
| 215 | optimizations that can be performed. This can be highly impactful for |
| 216 | floating point intensive computations. |
| 217 | |
Philip Reames | aa297ea | 2015-08-24 17:38:58 +0000 | [diff] [blame] | 218 | Describing Aliasing Properties |
| 219 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
Philip Reames | a3bf52c | 2015-08-24 17:19:18 +0000 | [diff] [blame] | 220 | |
| 221 | #. Add noalias/align/dereferenceable/nonnull to function arguments and return |
| 222 | values as appropriate |
| 223 | |
Philip Reames | aa297ea | 2015-08-24 17:38:58 +0000 | [diff] [blame] | 224 | #. Use pointer aliasing metadata, especially tbaa metadata, to communicate |
| 225 | otherwise-non-deducible pointer aliasing facts |
| 226 | |
| 227 | #. Use inbounds on geps. This can help to disambiguate some aliasing queries. |
| 228 | |
| 229 | |
| 230 | Modeling Memory Effects |
| 231 | ^^^^^^^^^^^^^^^^^^^^^^^^ |
| 232 | |
| 233 | #. Mark functions as readnone/readonly/argmemonly or noreturn/nounwind when |
| 234 | known. The optimizer will try to infer these flags, but may not always be |
| 235 | able to. Manual annotations are particularly important for external |
| 236 | functions that the optimizer can not analyze. |
Philip Reames | a3bf52c | 2015-08-24 17:19:18 +0000 | [diff] [blame] | 237 | |
| 238 | #. Use the lifetime.start/lifetime.end and invariant.start/invariant.end |
| 239 | intrinsics where possible. Common profitable uses are for stack like data |
| 240 | structures (thus allowing dead store elimination) and for describing |
| 241 | life times of allocas (thus allowing smaller stack sizes). |
| 242 | |
Philip Reames | a3bf52c | 2015-08-24 17:19:18 +0000 | [diff] [blame] | 243 | #. Mark invariant locations using !invariant.load and TBAA's constant flags |
| 244 | |
Philip Reames | aa297ea | 2015-08-24 17:38:58 +0000 | [diff] [blame] | 245 | Pass Ordering |
| 246 | ^^^^^^^^^^^^^ |
| 247 | |
| 248 | One of the most common mistakes made by new language frontend projects is to |
| 249 | use the existing -O2 or -O3 pass pipelines as is. These pass pipelines make a |
| 250 | good starting point for an optimizing compiler for any language, but they have |
| 251 | been carefully tuned for C and C++, not your target language. You will almost |
| 252 | certainly need to use a custom pass order to achieve optimal performance. A |
| 253 | couple specific suggestions: |
Philip Reames | a3bf52c | 2015-08-24 17:19:18 +0000 | [diff] [blame] | 254 | |
| 255 | #. For languages with numerous rarely executed guard conditions (e.g. null |
| 256 | checks, type checks, range checks) consider adding an extra execution or |
| 257 | two of LoopUnswith and LICM to your pass order. The standard pass order, |
| 258 | which is tuned for C and C++ applications, may not be sufficient to remove |
| 259 | all dischargeable checks from loops. |
| 260 | |
Philip Reames | aa297ea | 2015-08-24 17:38:58 +0000 | [diff] [blame] | 261 | #. If you language uses range checks, consider using the IRCE pass. It is not |
| 262 | currently part of the standard pass order. |
| 263 | |
| 264 | #. A useful sanity check to run is to run your optimized IR back through the |
| 265 | -O2 pipeline again. If you see noticeable improvement in the resulting IR, |
| 266 | you likely need to adjust your pass order. |
| 267 | |
| 268 | |
| 269 | I Still Can't Find What I'm Looking For |
| 270 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 271 | |
| 272 | If you didn't find what you were looking for above, consider proposing an piece |
| 273 | of metadata which provides the optimization hint you need. Such extensions are |
| 274 | relatively common and are generally well received by the community. You will |
| 275 | need to ensure that your proposal is sufficiently general so that it benefits |
| 276 | others if you wish to contribute it upstream. |
Philip Reames | f8bf9dd | 2015-02-27 23:14:50 +0000 | [diff] [blame] | 277 | |
Philip Reames | 7223a7f | 2015-08-24 17:46:11 +0000 | [diff] [blame] | 278 | You should also consider describing the problem you're facing on `llvm-dev |
| 279 | <http://lists.llvm.org/mailman/listinfo/llvm-dev>`_ and asking for advice. |
| 280 | It's entirely possible someone has encountered your problem before and can |
| 281 | give good advice. If there are multiple interested parties, that also |
| 282 | increases the chances that a metadata extension would be well received by the |
| 283 | community as a whole. |
| 284 | |
Philip Reames | f8bf9dd | 2015-02-27 23:14:50 +0000 | [diff] [blame] | 285 | Adding to this document |
| 286 | ======================= |
| 287 | |
| 288 | If you run across a case that you feel deserves to be covered here, please send |
| 289 | a patch to `llvm-commits |
Tanya Lattner | 0d28f80 | 2015-08-05 03:51:17 +0000 | [diff] [blame] | 290 | <http://lists.llvm.org/mailman/listinfo/llvm-commits>`_ for review. |
Philip Reames | f8bf9dd | 2015-02-27 23:14:50 +0000 | [diff] [blame] | 291 | |
Tanya Lattner | 0d28f80 | 2015-08-05 03:51:17 +0000 | [diff] [blame] | 292 | If you have questions on these items, please direct them to `llvm-dev |
| 293 | <http://lists.llvm.org/mailman/listinfo/llvm-dev>`_. The more relevant |
Philip Reames | f8bf9dd | 2015-02-27 23:14:50 +0000 | [diff] [blame] | 294 | context you are able to give to your question, the more likely it is to be |
| 295 | answered. |
| 296 | |