Ayal Zaks | 4c4baf5 | 2017-05-29 15:36:23 +0000 | [diff] [blame^] | 1 | ================== |
| 2 | Vectorization Plan |
| 3 | ================== |
| 4 | |
| 5 | .. contents:: |
| 6 | :local: |
| 7 | |
| 8 | Abstract |
| 9 | ======== |
| 10 | The vectorization transformation can be rather complicated, involving several |
| 11 | potential alternatives, especially for outer-loops [1]_ but also possibly for |
| 12 | innermost loops. These alternatives may have significant performance impact, |
| 13 | both positive and negative. A cost model is therefore employed to identify the |
| 14 | best alternative, including the alternative of avoiding any transformation |
| 15 | altogether. |
| 16 | |
| 17 | The Vectorization Plan is an explicit model for describing vectorization |
| 18 | candidates. It serves for both optimizing candidates including estimating their |
| 19 | cost reliably, and for performing their final translation into IR. This |
| 20 | facilitates dealing with multiple vectorization candidates. |
| 21 | |
| 22 | High-level Design |
| 23 | ================= |
| 24 | |
| 25 | Vectorization Workflow |
| 26 | ---------------------- |
| 27 | VPlan-based vectorization involves three major steps, taking a "scenario-based |
| 28 | approach" to vectorization planning: |
| 29 | |
| 30 | 1. Legal Step: check if a loop can be legally vectorized; encode contraints and |
| 31 | artifacts if so. |
| 32 | 2. Plan Step: |
| 33 | |
| 34 | a. Build initial VPlans following the constraints and decisions taken by |
| 35 | Legal Step 1, and compute their cost. |
| 36 | b. Apply optimizations to the VPlans, possibly forking additional VPlans. |
| 37 | Prune sub-optimal VPlans having relatively high cost. |
| 38 | 3. Execute Step: materialize the best VPlan. Note that this is the only step |
| 39 | that modifies the IR. |
| 40 | |
| 41 | Design Guidelines |
| 42 | ----------------- |
| 43 | In what follows, the term "input IR" refers to code that is fed into the |
| 44 | vectorizer whereas the term "output IR" refers to code that is generated by the |
| 45 | vectorizer. The output IR contains code that has been vectorized or "widened" |
| 46 | according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed |
| 47 | according to an Unroll Factor (UF). |
| 48 | The design of VPlan follows several high-level guidelines: |
| 49 | |
| 50 | 1. Analysis-like: building and manipulating VPlans must not modify the input IR. |
| 51 | In particular, if the best option is not to vectorize at all, the |
| 52 | vectorization process terminates before reaching Step 3, and compilation |
| 53 | should proceed as if VPlans had not been built. |
| 54 | |
| 55 | 2. Align Cost & Execute: each VPlan must support both estimating the cost and |
| 56 | generating the output IR code, such that the cost estimation evaluates the |
| 57 | to-be-generated code reliably. |
| 58 | |
| 59 | 3. Support vectorizing additional constructs: |
| 60 | |
| 61 | a. Outer-loop vectorization. In particular, VPlan must be able to model the |
| 62 | control-flow of the output IR which may include multiple basic-blocks and |
| 63 | nested loops. |
| 64 | b. SLP vectorization. |
| 65 | c. Combinations of the above, including nested vectorization: vectorizing |
| 66 | both an inner loop and an outer-loop at the same time (each with its own |
| 67 | VF and UF), mixed vectorization: vectorizing a loop with SLP patterns |
| 68 | inside [4]_, (re)vectorizing input IR containing vector code. |
| 69 | d. Function vectorization [2]_. |
| 70 | |
| 71 | 4. Support multiple candidates efficiently. In particular, similar candidates |
| 72 | related to a range of possible VF's and UF's must be represented efficiently. |
| 73 | Potential versioning needs to be supported efficiently. |
| 74 | |
| 75 | 5. Support vectorizing idioms, such as interleaved groups of strided loads or |
| 76 | stores. This is achieved by modeling a sequence of output instructions using |
| 77 | a "Recipe", which is responsible for computing its cost and generating its |
| 78 | code. |
| 79 | |
| 80 | 6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization |
| 81 | such regions may need to be, for example, predicated and linearized, or |
| 82 | replicated VF*UF times to handle scalarized and predicated instructions. |
| 83 | Innerloops are also modelled as SESE regions. |
| 84 | |
| 85 | Low-level Design |
| 86 | ================ |
| 87 | The low-level design of VPlan comprises of the following classes. |
| 88 | |
| 89 | :LoopVectorizationPlanner: |
| 90 | A LoopVectorizationPlanner is designed to handle the vectorization of a loop |
| 91 | or a loop nest. It can construct, optimize and discard one or more VPlans, |
| 92 | each VPlan modelling a distinct way to vectorize the loop or the loop nest. |
| 93 | Once the best VPlan is determined, including the best VF and UF, this VPlan |
| 94 | drives the generation of output IR. |
| 95 | |
| 96 | :VPlan: |
| 97 | A model of a vectorized candidate for a given input IR loop or loop nest. This |
| 98 | candidate is represented using a Hierarchical CFG. VPlan supports estimating |
| 99 | the cost and driving the generation of the output IR code it represents. |
| 100 | |
| 101 | :Hierarchical CFG: |
| 102 | A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The |
| 103 | Hierarchical CFG data structure is similar to the Tile Tree [5]_, where |
| 104 | cross-Tile edges are lifted to connect Tiles instead of the original |
| 105 | basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms |
| 106 | Region and Block are used rather than Tile [5]_ to avoid confusion with loop |
| 107 | tiling. |
| 108 | |
| 109 | :VPBlockBase: |
| 110 | The building block of the Hierarchical CFG. A pure-virtual base-class of |
| 111 | VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical |
| 112 | control-flow relations with other VPBlocks. Note that in contrast to the IR |
| 113 | BasicBlock, a VPBlockBase models its control-flow successors and predecessors |
| 114 | directly, rather than through a Terminator branch or through predecessor |
| 115 | branches that "use" the VPBlockBase. |
| 116 | |
| 117 | :VPBasicBlock: |
| 118 | VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the |
| 119 | Hierarchical CFG. It represents a sequence of output IR instructions that will |
| 120 | appear consecutively in an output IR basic-block. The instructions of this |
| 121 | basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a |
| 122 | sequence of zero or more VPRecipes that model the cost and generation of the |
| 123 | output IR instructions. |
| 124 | |
| 125 | :VPRegionBlock: |
| 126 | VPRegionBlock is a subclass of VPBlockBase. It models a collection of |
| 127 | VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR |
| 128 | CFG. A VPRegionBlock may indicate that its contents are to be replicated a |
| 129 | constant number of times when output IR is generated, effectively representing |
| 130 | a loop with constant trip-count that will be completely unrolled. This is used |
| 131 | to support scalarized and predicated instructions with a single model for |
| 132 | multiple candidate VF's and UF's. |
| 133 | |
| 134 | :VPRecipeBase: |
| 135 | A pure-virtual base class modeling a sequence of one or more output IR |
| 136 | instructions, possibly based on one or more input IR instructions. These |
| 137 | input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe |
| 138 | may specify how its ingredients are to be transformed to produce the output IR |
| 139 | instructions; e.g., cloned once, replicated multiple times or widened |
| 140 | according to selected VF. |
| 141 | |
| 142 | :VPTransformState: |
| 143 | Stores information used for generating output IR, passed from |
| 144 | LoopVectorizationPlanner to its selected VPlan for execution, and used to pass |
| 145 | additional information down to VPBlocks and VPRecipes. |
| 146 | |
| 147 | Related LLVM components |
| 148 | ----------------------- |
| 149 | 1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP |
| 150 | tree, where TSLP [3]_ adds Plan Step 2.b. |
| 151 | |
| 152 | 2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by |
| 153 | Polly [7]_. |
| 154 | |
| 155 | References |
| 156 | ---------- |
| 157 | .. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit |
| 158 | Nuzman and Ayal Zaks, PACT 2008. |
| 159 | |
| 160 | .. [2] "Proposal for function vectorization and loop vectorization with function |
| 161 | calls", Xinmin Tian, [`cfe-dev |
| 162 | <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_]., |
| 163 | March 2, 2016. |
| 164 | See also `review <https://reviews.llvm.org/D22792>`_. |
| 165 | |
| 166 | .. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios |
| 167 | Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015. |
| 168 | |
| 169 | .. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization |
| 170 | overhead", Hao Zhou and Jingling Xue, CGO 2016. |
| 171 | |
| 172 | .. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and |
| 173 | Brian Koblenz, PLDI 1991 |
| 174 | |
| 175 | .. [6] "Structural analysis: A new approach to flow analysis in optimizing |
| 176 | compilers", M. Sharir, Journal of Computer Languages, Jan. 1980 |
| 177 | |
| 178 | .. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma |
| 179 | thesis, 2011. |
| 180 | |
| 181 | .. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks, |
| 182 | European LLVM Developers' Meeting 2017. |