=====================================
Performance Tips for Frontend Authors
=====================================

.. contents::
   :local:
   :depth: 2

Abstract
========

The intended audience of this document is developers of language frontends
targeting LLVM IR. This document is home to a collection of tips on how to
generate IR that optimizes well. As with any optimizer, LLVM has its strengths
and weaknesses. In some cases, surprisingly small changes in the source IR
can have a large effect on the generated code.

Avoid loads and stores of large aggregate type
================================================

LLVM currently does not optimize loads and stores of large :ref:`aggregate
types <t_aggregate>` (i.e. structs and arrays) well. As an alternative, consider
loading individual fields from memory.

Aggregates that are smaller than the largest (performant) load or store
instruction supported by the targeted hardware are well supported. These can
be an effective way to represent collections of small packed fields.

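As a rough illustration of the advice above, the sketch below contrasts a
whole-aggregate load with a load of a single field. The type ``%struct.Big``
and both function names are hypothetical, and the IR uses the opaque ``ptr``
syntax of recent LLVM releases, which is an assumption about the version in
use.

.. code-block:: llvm

   ; Hypothetical large aggregate, used only for illustration.
   %struct.Big = type { [16 x i64], double, i32 }

   ; Discouraged: load the entire aggregate just to read one field.
   define double @get_scale_whole(ptr %p) {
     %agg = load %struct.Big, ptr %p
     %scale = extractvalue %struct.Big %agg, 1
     ret double %scale
   }

   ; Suggested: compute the field address and load only that field.
   define double @get_scale_field(ptr %p) {
     %scale.addr = getelementptr inbounds %struct.Big, ptr %p, i32 0, i32 1
     %scale = load double, ptr %scale.addr
     ret double %scale
   }
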
Prefer zext over sext when legal
==================================

On some architectures (X86_64 is one), sign extension can involve an extra
instruction whereas zero extension can be folded into a load. LLVM will try to
replace a sext with a zext when it can be proven safe, but if you have
information in your source language about the range of an integer value, it can
be profitable to use a zext rather than a sext.

Alternatively, you can :ref:`specify the range of the value using metadata
<range-metadata>` and LLVM can do the sext to zext conversion for you.

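The two options above might look as follows. This is a minimal sketch that
assumes the frontend knows the loaded value lies in ``[0, 256)``. The function
names are hypothetical, and the opaque ``ptr`` syntax assumes a recent LLVM
release.

.. code-block:: llvm

   ; Option 1: the frontend knows the value is non-negative, so it emits
   ; zext directly.
   define i64 @widen_zext(ptr %p) {
     %v = load i32, ptr %p
     %w = zext i32 %v to i64
     ret i64 %w
   }

   ; Option 2: keep the sext but attach !range metadata (a half-open
   ; interval), leaving it to the optimizer to decide whether to convert.
   define i64 @widen_range(ptr %p) {
     %v = load i32, ptr %p, !range !0
     %w = sext i32 %v to i64
     ret i64 %w
   }

   ; Assumed range for this example: 0 <= value < 256.
   !0 = !{i32 0, i32 256}
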
Zext GEP indices to machine register width
============================================

Internally, LLVM often promotes the width of GEP indices to machine register
width. When it does so, it will default to using sign extension (sext)
operations for safety. If your source language provides information about
the range of the index, you may wish to manually extend indices to machine
register width using a zext instruction.

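A minimal sketch of this pattern, assuming a target with 64-bit registers and
a source language that guarantees the index is non-negative. The function name
is hypothetical, and the opaque ``ptr`` syntax assumes a recent LLVM release.

.. code-block:: llvm

   define i32 @load_element(ptr %base, i32 %i) {
     ; Assumption: the frontend knows %i >= 0, so zero extension is safe.
     %idx = zext i32 %i to i64
     ; The index is already register width, so no implicit sext is needed.
     %addr = getelementptr inbounds i32, ptr %base, i64 %idx
     %v = load i32, ptr %addr
     ret i32 %v
   }
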
Other things to consider
=========================

#. Make sure that a DataLayout is provided (this will likely become required in
   the near future, but is certainly important for optimization).

#. Add nsw/nuw flags as appropriate. Reasoning about overflow is
   generally hard for an optimizer, so providing these facts from the frontend
   can be very impactful.

#. Use fast-math flags on floating point operations if legal. If you don't
   need strict IEEE floating point semantics, there are a number of additional
   optimizations that can be performed. This can be highly impactful for
   floating point intensive computations.

#. Use inbounds on geps. This can help to disambiguate some aliasing queries.

#. Add noalias/align/dereferenceable/nonnull to function arguments and return
   values as appropriate.

#. Mark functions as readnone/readonly or noreturn/nounwind when known. The
   optimizer will try to infer these flags, but may not always be able to.
   Manual annotations are particularly important for external functions that
   the optimizer cannot analyze.

#. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing
   analysis); prefer GEPs.

#. Use the lifetime.start/lifetime.end and invariant.start/invariant.end
   intrinsics where possible. Common profitable uses are for stack-like data
   structures (thus allowing dead store elimination) and for describing
   lifetimes of allocas (thus allowing smaller stack sizes).

#. Use pointer aliasing metadata, especially tbaa metadata, to communicate
   otherwise-non-deducible pointer aliasing facts.

#. Use the "most-private" possible linkage types for the functions being defined
   (private, internal or linkonce_odr preferably).

#. Mark invariant locations using !invariant.load and TBAA's constant flags.

#. Prefer globals over inttoptr of a constant address - this gives you
   dereferenceability information. In MCJIT, use getSymbolAddress to provide
   the actual address.

#. Be wary of ordered and atomic memory operations. They are hard to optimize
   and may not be well optimized by the current optimizer. Depending on your
   source language, you may consider using fences instead.

#. If calling a function which is known to throw an exception (unwind), use
   an invoke with a normal destination which contains an unreachable
   instruction. This form conveys to the optimizer that the call returns
   abnormally. For an invoke which neither returns normally nor requires unwind
   code in the current function, you can use a noreturn call instruction if
   desired. This is generally not required because the optimizer will convert
   an invoke with an unreachable unwind destination to a call instruction.

#. If your language uses range checks, consider using the IRCE pass. It is not
   currently part of the standard pass order.

#. For languages with numerous rarely executed guard conditions (e.g. null
   checks, type checks, range checks) consider adding an extra execution or
   two of LoopUnswitch and LICM to your pass order. The standard pass order,
   which is tuned for C and C++ applications, may not be sufficient to remove
   all dischargeable checks from loops.

#. Use profile metadata to indicate statically known cold paths, even if
   dynamic profiling information is not available. This can make a large
   difference in code placement and thus the performance of tight loops.

#. When generating code for loops, try to avoid terminating the header block of
   the loop earlier than necessary. If the terminator of the loop header
   block is a loop exiting conditional branch, the effectiveness of LICM will
   be limited for loads not in the header. (This is due to the fact that LLVM
   may not know such a load is safe to speculatively execute and thus can't
   lift an otherwise loop invariant load unless it can prove the exiting
   condition is not taken.) It can be profitable, in some cases, to emit such
   instructions into the header even if they are not used along a rarely
   executed path that exits the loop. This guidance specifically does not
   apply if the condition which terminates the loop header is itself invariant,
   or can be easily discharged by inspecting the loop index variables.

#. In hot loops, consider duplicating instructions from small basic blocks
   which end in highly predictable terminators into their successor blocks.
   If a hot successor block contains instructions which can be vectorized
   with the duplicated ones, this can provide a noticeable throughput
   improvement. Note that this is not always profitable and does involve a
   potentially large increase in code size.

#. Avoid high in-degree basic blocks (e.g. basic blocks with dozens or hundreds
   of predecessors). Among other issues, the register allocator is known to
   perform badly when confronted with such structures. The only exception to
   this guidance is that a unified return block with high in-degree is fine.

#. When checking a value against a constant, emit the check using a consistent
   comparison type. The GVN pass *will* optimize redundant equalities even if
   the type of comparison is inverted, but GVN only runs late in the pipeline.
   As a result, you may miss the opportunity to run other important
   optimizations. Improvements to EarlyCSE to remove this issue are tracked in
   Bug 23333.

#. Avoid using arithmetic intrinsics unless you are *required* by your source
   language specification to emit a particular code sequence. The optimizer
   is quite good at reasoning about general control flow and arithmetic, but it
   is not anywhere near as strong at reasoning about the various intrinsics. If
   profitable for code generation purposes, the optimizer will likely form the
   intrinsics itself late in the optimization pipeline. It is *very* rarely
   profitable to emit these directly in the language frontend. This item
   explicitly includes the use of the :ref:`overflow intrinsics <int_overflow>`.

#. Avoid using the :ref:`assume intrinsic <int_assume>` until you've
   established that a) there's no other way to express the given fact and b)
   that fact is critical for optimization purposes. Assumes are a great
   prototyping mechanism, but they can have negative effects on both compile
   time and optimization effectiveness. The former is fixable with enough
   effort, but the latter is fairly fundamental to their designed purpose.

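Several of the items above can be combined in one small function. The
following is an illustrative sketch only: the function names, the attribute
choices and the branch weights are assumptions made for the example, and the
opaque ``ptr`` syntax assumes a recent LLVM release (which may print the
function-level ``readonly`` attribute as ``memory(read)``).

.. code-block:: llvm

   ; Hypothetical external helper. The optimizer cannot see its body, so the
   ; frontend states the facts it knows: it only reads memory and never unwinds.
   declare i32 @hypothetical_checksum(ptr) readonly nounwind

   ; Internal linkage, plus argument attributes the frontend is assumed to know.
   define internal float @sum(ptr noalias nonnull align 4 %data, i64 %n) nounwind {
   entry:
     %empty = icmp eq i64 %n, 0
     ; Branch weights mark the empty-input path as statically cold.
     br i1 %empty, label %exit, label %loop, !prof !0

   loop:
     %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
     %acc = phi float [ 0.0, %entry ], [ %acc.next, %loop ]
     ; inbounds: the frontend guarantees the address stays within the object.
     %addr = getelementptr inbounds float, ptr %data, i64 %i
     %x = load float, ptr %addr, align 4
     ; fast: assumes the source language does not need strict IEEE semantics.
     %acc.next = fadd fast float %acc, %x
     ; nuw: %i < %n on entry to each iteration, so %i + 1 cannot wrap unsigned.
     %i.next = add nuw i64 %i, 1
     %more = icmp ult i64 %i.next, %n
     br i1 %more, label %loop, label %exit

   exit:
     %result = phi float [ 0.0, %entry ], [ %acc.next, %loop ]
     ret float %result
   }

   ; First weight is for the taken (cold) edge, second for the loop edge.
   !0 = !{!"branch_weights", i32 1, i32 1000}
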
p.s. If you want to help improve this document, patches expanding any of the
above items into standalone sections of their own with a more complete
discussion would be very welcome.


Adding to this document
=======================

If you run across a case that you feel deserves to be covered here, please send
a patch to `llvm-commits
<http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits>`_ for review.

If you have questions on these items, please direct them to `llvmdev
<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_. The more relevant
context you are able to give to your question, the more likely it is to be
answered.
