=====================================
Performance Tips for Frontend Authors
=====================================

.. contents::
   :local:
   :depth: 2

Abstract
========

The intended audience of this document is developers of language frontends
targeting LLVM IR. This document is home to a collection of tips on how to
generate IR that optimizes well.

IR Best Practices
=================

As with any optimizer, LLVM has its strengths and weaknesses. In some cases,
surprisingly small changes in the source IR can have a large effect on the
generated code.

Beyond the specific items on the list below, it's worth noting that the most
mature frontend for LLVM is Clang. As a result, the further your IR gets from
what Clang might emit, the less likely it is to be effectively optimized. It
can often be useful to write a quick C program with the semantics you're trying
to model and see what decisions Clang's IRGen makes about what IR to emit.
Studying Clang's CodeGen directory can also be a good source of ideas. Note
that Clang and LLVM are explicitly version locked, so you'll need to make sure
you're using a Clang built from the same svn revision or release as the LLVM
library you're using. As always, it's *strongly* recommended that you track
tip of tree development, particularly during bring-up of a new project.

The Basics
^^^^^^^^^^^

#. Make sure that your Modules contain both a data layout specification and
   target triple. Without these pieces, none of the target-specific
   optimizations will be enabled. This can have a major effect on the
   generated code quality.

#. For each function or global emitted, use the most private linkage type
   possible (private, internal or linkonce_odr preferably). Doing so will
   make LLVM's inter-procedural optimizations much more effective.

#. Avoid high in-degree basic blocks (e.g. basic blocks with dozens or hundreds
   of predecessors). Among other issues, the register allocator is known to
   perform badly when confronted with such structures. The only exception to
   this guidance is that a unified return block with high in-degree is fine.
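
As a minimal sketch of the first two items, a module header and a
module-private helper might look like the following. The layout string and
triple are illustrative values for an x86-64 Linux target, and ``@helper`` and
``@entry`` are hypothetical names; in practice, use the data layout reported by
the target machine you are compiling for rather than copying this one.

.. code-block:: llvm

   ; Illustrative values only; take the real ones from your target machine.
   target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
   target triple = "x86_64-unknown-linux-gnu"

   ; Hypothetical helper that is only called from within this module, so it
   ; gets internal linkage instead of the default external linkage.
   define internal i32 @helper(i32 %x) {
     %r = add i32 %x, 1
     ret i32 %r
   }

   ; Only the function that other modules must see stays external.
   define i32 @entry(i32 %x) {
     %r = call i32 @helper(i32 %x)
     ret i32 %r
   }
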


Avoid loads and stores of large aggregate type
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

LLVM currently does not optimize loads and stores of large :ref:`aggregate
types <t_aggregate>` (i.e. structs and arrays) well. As an alternative, consider
loading individual fields from memory.

Aggregates that are smaller than the largest (performant) load or store
instruction supported by the targeted hardware are well supported. These can
be an effective way to represent collections of small packed fields.
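
The sketch below contrasts the two forms. ``%struct.Big`` and ``@sum_fields``
are hypothetical names chosen for illustration, and the IR is written with the
opaque ``ptr`` type of recent LLVM releases (older releases spell pointer types
as, for example, ``i64*``).

.. code-block:: llvm

   %struct.Big = type { i64, i64, [64 x i8] }

   ; Discouraged: a first-class load of the whole aggregate.
   ;   %all = load %struct.Big, ptr %p
   ;   %a   = extractvalue %struct.Big %all, 0

   ; Preferred: address and load only the fields that are actually needed.
   define i64 @sum_fields(ptr %p) {
     %a.addr = getelementptr inbounds %struct.Big, ptr %p, i32 0, i32 0
     %a = load i64, ptr %a.addr, align 8
     %b.addr = getelementptr inbounds %struct.Big, ptr %p, i32 0, i32 1
     %b = load i64, ptr %b.addr, align 8
     %sum = add i64 %a, %b
     ret i64 %sum
   }
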

Prefer zext over sext when legal
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On some architectures (X86_64 is one), sign extension can involve an extra
instruction whereas zero extension can be folded into a load. LLVM will try to
replace a sext with a zext when it can be proven safe, but if you have
information in your source language about the range of an integer value, it can
be profitable to use a zext rather than a sext.

Alternatively, you can :ref:`specify the range of the value using metadata
<range-metadata>` and LLVM can do the sext to zext conversion for you.
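
Both options can be sketched as follows, under the assumption that the frontend
knows the loaded byte always lies in the half-open range [0, 100). The function
names are hypothetical, and the IR uses the opaque ``ptr`` type of recent LLVM
releases.

.. code-block:: llvm

   ; Option 1: the frontend emits the zext itself, which is legal here
   ; because the value is known to be non-negative.
   define i64 @widen_zext(ptr %p) {
     %v = load i8, ptr %p, align 1
     %w = zext i8 %v to i64
     ret i64 %w
   }

   ; Option 2: keep the sext, but attach range metadata to the load so the
   ; optimizer has the fact it needs to perform the conversion.
   define i64 @widen_ranged(ptr %p) {
     %v = load i8, ptr %p, align 1, !range !0
     %w = sext i8 %v to i64
     ret i64 %w
   }

   ; Half-open range [0, 100) for an i8 value.
   !0 = !{i8 0, i8 100}
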

Zext GEP indices to machine register width
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Internally, LLVM often promotes the width of GEP indices to machine register
width. When it does so, it will default to using sign extension (sext)
operations for safety. If your source language provides information about
the range of the index, you may wish to manually extend indices to machine
register width using a zext instruction.
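
A minimal sketch, assuming a target with 64-bit registers and a source language
that guarantees the 32-bit index is never negative; ``@get`` is a hypothetical
accessor, and the IR uses the opaque ``ptr`` type of recent LLVM releases.

.. code-block:: llvm

   define i32 @get(ptr %base, i32 %i) {
     ; Widen the index explicitly with zext rather than passing the i32
     ; straight to the GEP and leaving the extension to LLVM.
     %idx = zext i32 %i to i64
     %addr = getelementptr inbounds i32, ptr %base, i64 %idx
     %v = load i32, ptr %addr, align 4
     ret i32 %v
   }
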

Other Things to Consider
^^^^^^^^^^^^^^^^^^^^^^^^

#. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing
   analysis); prefer GEPs.

#. Prefer globals over inttoptr of a constant address. This gives you
   dereferenceability information. In MCJIT, use getSymbolAddress to provide
   the actual address.

#. Be wary of ordered and atomic memory operations. They are hard to optimize
   and may not be well optimized by the current optimizer. Depending on your
   source language, you may consider using fences instead.

#. If calling a function which is known to throw an exception (unwind), use
   an invoke with a normal destination which contains an unreachable
   instruction. This form conveys to the optimizer that the call returns
   abnormally. For an invoke which neither returns normally nor requires unwind
   code in the current function, you can use a noreturn call instruction if
   desired. This is generally not required because the optimizer will convert
   an invoke with an unreachable unwind destination to a call instruction.

#. Use profile metadata to indicate statically known cold paths, even if
   dynamic profiling information is not available. This can make a large
   difference in code placement and thus the performance of tight loops.

#. When generating code for loops, try to avoid terminating the header block of
   the loop earlier than necessary. If the terminator of the loop header
   block is a loop exiting conditional branch, the effectiveness of LICM will
   be limited for loads not in the header. (This is due to the fact that LLVM
   may not know such a load is safe to speculatively execute and thus can't
   lift an otherwise loop invariant load unless it can prove the exiting
   condition is not taken.) It can be profitable, in some cases, to emit such
   instructions into the header even if they are not used along a rarely
   executed path that exits the loop. This guidance specifically does not
   apply if the condition which terminates the loop header is itself invariant,
   or can be easily discharged by inspecting the loop index variables.

#. In hot loops, consider duplicating instructions from small basic blocks
   which end in highly predictable terminators into their successor blocks.
   If a hot successor block contains instructions which can be vectorized
   with the duplicated ones, this can provide a noticeable throughput
   improvement. Note that this is not always profitable and does involve a
   potentially large increase in code size.

#. When checking a value against a constant, emit the check using a consistent
   comparison type. The GVN pass *will* optimize redundant equalities even if
   the type of comparison is inverted, but GVN only runs late in the pipeline.
   As a result, you may miss the opportunity to run other important
   optimizations. Improvements to EarlyCSE to remove this issue are tracked in
   Bug 23333.

#. Avoid using arithmetic intrinsics unless you are *required* by your source
   language specification to emit a particular code sequence. The optimizer
   is quite good at reasoning about general control flow and arithmetic; it is
   not anywhere near as strong at reasoning about the various intrinsics. If
   profitable for code generation purposes, the optimizer will likely form the
   intrinsics itself late in the optimization pipeline. It is *very* rarely
   profitable to emit these directly in the language frontend. This item
   explicitly includes the use of the :ref:`overflow intrinsics <int_overflow>`.

#. Avoid using the :ref:`assume intrinsic <int_assume>` until you've
   established that a) there's no other way to express the given fact and b)
   that fact is critical for optimization purposes. Assumes are a great
   prototyping mechanism, but they can have negative effects on both compile
   time and optimization effectiveness. The former is fixable with enough
   effort, but the latter is fairly fundamental to their designed purpose.
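
To illustrate the profile metadata item above, the following sketch marks a
null-check failure path as cold using ``branch_weights``. The function name and
the weights are made up for illustration; only their ratio matters. The IR uses
the opaque ``ptr`` type of recent LLVM releases.

.. code-block:: llvm

   define i32 @checked_load(ptr %p) {
   entry:
     %isnull = icmp eq ptr %p, null
     ; The first weight applies to the true successor (%slow), the second
     ; to the false successor (%fast).
     br i1 %isnull, label %slow, label %fast, !prof !0

   fast:
     %v = load i32, ptr %p, align 4
     ret i32 %v

   slow:
     ; Statically known to be rare, even without dynamic profile data.
     ret i32 -1
   }

   !0 = !{!"branch_weights", i32 1, i32 2000}
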


Describing Language Specific Properties
=======================================

When translating a source language to LLVM, finding ways to express concepts
and guarantees available in your source language which are not natively
provided by LLVM IR will greatly improve LLVM's ability to optimize your code.
As an example, C/C++'s ability to mark every add as "no signed wrap (nsw)" goes
a long way toward assisting the optimizer in reasoning about loop induction
variables and thus generating better code for loops.

The LLVM LangRef includes a number of mechanisms for annotating the IR with
additional semantic information. It is *strongly* recommended that you become
highly familiar with this document. The list below is intended to highlight a
couple of items of particular interest, but is by no means exhaustive.

Restricted Operation Semantics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Add nsw/nuw flags as appropriate. Reasoning about overflow is
   generally hard for an optimizer so providing these facts from the frontend
   can be very impactful.

#. Use fast-math flags on floating point operations if legal. If you don't
   need strict IEEE floating point semantics, there are a number of additional
   optimizations that can be performed. This can be highly impactful for
   floating point intensive computations.
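
As a short sketch, the flags are written directly on the instructions. The
function names are hypothetical, and each flag is only valid if the source
language really guarantees the corresponding property (no signed overflow, no
unsigned overflow, relaxed floating point semantics).

.. code-block:: llvm

   ; Signed index increment that the source language says cannot overflow.
   define i32 @next_index(i32 %i) {
     %n = add nsw i32 %i, 1
     ret i32 %n
   }

   ; Unsigned offset computation that is known not to wrap.
   define i32 @next_offset(i32 %off) {
     %n = add nuw i32 %off, 4
     ret i32 %n
   }

   ; Floating point math where strict IEEE semantics are not required.
   define float @dot2(float %a0, float %b0, float %a1, float %b1) {
     %p0 = fmul fast float %a0, %b0
     %p1 = fmul fast float %a1, %b1
     %s = fadd fast float %p0, %p1
     ret float %s
   }
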

Describing Aliasing Properties
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Add noalias/align/dereferenceable/nonnull to function arguments and return
   values as appropriate.

#. Use pointer aliasing metadata, especially tbaa metadata, to communicate
   otherwise-non-deducible pointer aliasing facts.

#. Use inbounds on GEPs. This can help to disambiguate some aliasing queries.
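
The argument attributes and ``inbounds`` can be sketched as below (TBAA
metadata is omitted to keep the example short). ``@copy_second`` is a
hypothetical function, and the attributes assume the source language
guarantees that both arguments are non-null, 4-byte aligned, non-overlapping,
and point to at least 8 readable bytes. The IR uses the opaque ``ptr`` type of
recent LLVM releases.

.. code-block:: llvm

   define void @copy_second(ptr noalias nonnull align 4 dereferenceable(8) %dst,
                            ptr noalias nonnull align 4 dereferenceable(8) %src) {
     ; Element 1 occupies bytes 4..7, inside the dereferenceable(8) region,
     ; so both GEPs can be marked inbounds.
     %s1 = getelementptr inbounds i32, ptr %src, i64 1
     %d1 = getelementptr inbounds i32, ptr %dst, i64 1
     %v = load i32, ptr %s1, align 4
     store i32 %v, ptr %d1, align 4
     ret void
   }
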


Modeling Memory Effects
^^^^^^^^^^^^^^^^^^^^^^^^

#. Mark functions as readnone/readonly/argmemonly or noreturn/nounwind when
   known. The optimizer will try to infer these flags, but may not always be
   able to. Manual annotations are particularly important for external
   functions that the optimizer cannot analyze.

#. Use the lifetime.start/lifetime.end and invariant.start/invariant.end
   intrinsics where possible. Common profitable uses are for stack-like data
   structures (thus allowing dead store elimination) and for describing
   lifetimes of allocas (thus allowing smaller stack sizes).

#. Mark invariant locations using !invariant.load and TBAA's constant flags.
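
A small sketch of the first and last items. The runtime routines ``@rt_abort``
and ``@rt_hash`` and the function ``@array_length`` are hypothetical, and the
example assumes a language in which an array's length field never changes
after construction. The IR uses the opaque ``ptr`` type of recent LLVM
releases.

.. code-block:: llvm

   ; External runtime routines the optimizer cannot see into, so the
   ; frontend states what it knows about them.
   declare void @rt_abort(ptr) noreturn nounwind
   declare i32 @rt_hash(ptr) nounwind

   define i32 @array_length(ptr %arr) {
     ; The length field is assumed never to be written after construction,
     ; so the load is marked as reading an invariant location.
     %len = load i32, ptr %arr, align 4, !invariant.load !0
     ret i32 %len
   }

   ; !invariant.load takes an empty metadata node.
   !0 = !{}
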

Pass Ordering
^^^^^^^^^^^^^

One of the most common mistakes made by new language frontend projects is to
use the existing -O2 or -O3 pass pipelines as is. These pass pipelines make a
good starting point for an optimizing compiler for any language, but they have
been carefully tuned for C and C++, not your target language. You will almost
certainly need to use a custom pass order to achieve optimal performance. A
couple of specific suggestions:

#. For languages with numerous rarely executed guard conditions (e.g. null
   checks, type checks, range checks) consider adding an extra execution or
   two of LoopUnswitch and LICM to your pass order. The standard pass order,
   which is tuned for C and C++ applications, may not be sufficient to remove
   all dischargeable checks from loops.

#. If your language uses range checks, consider using the IRCE pass. It is not
   currently part of the standard pass order.

#. A useful sanity check is to run your optimized IR back through the
   -O2 pipeline again. If you see noticeable improvement in the resulting IR,
   you likely need to adjust your pass order.


I Still Can't Find What I'm Looking For
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you didn't find what you were looking for above, consider proposing a piece
of metadata which provides the optimization hint you need. Such extensions are
relatively common and are generally well received by the community. You will
need to ensure that your proposal is sufficiently general so that it benefits
others if you wish to contribute it upstream.

You should also consider describing the problem you're facing on `llvm-dev
<http://lists.llvm.org/mailman/listinfo/llvm-dev>`_ and asking for advice.
It's entirely possible someone has encountered your problem before and can
give good advice. If there are multiple interested parties, that also
increases the chances that a metadata extension would be well received by the
community as a whole.

Adding to this document
=======================

If you run across a case that you feel deserves to be covered here, please send
a patch to `llvm-commits
<http://lists.llvm.org/mailman/listinfo/llvm-commits>`_ for review.

If you have questions on these items, please direct them to `llvm-dev
<http://lists.llvm.org/mailman/listinfo/llvm-dev>`_. The more relevant
context you are able to give to your question, the more likely it is to be
answered.
