Ayal Zaks | 4c4baf5 | 2017-05-29 15:36:23 +0000 | [diff] [blame] | 1 | ================== |
| 2 | Vectorization Plan |
| 3 | ================== |
| 4 | |
| 5 | .. contents:: |
| 6 | :local: |
| 7 | |
| 8 | Abstract |
| 9 | ======== |
| 10 | The vectorization transformation can be rather complicated, involving several |
| 11 | potential alternatives, especially for outer-loops [1]_ but also possibly for |
| 12 | innermost loops. These alternatives may have significant performance impact, |
| 13 | both positive and negative. A cost model is therefore employed to identify the |
| 14 | best alternative, including the alternative of avoiding any transformation |
| 15 | altogether. |
| 16 | |
| 17 | The Vectorization Plan is an explicit model for describing vectorization |
| 18 | candidates. It serves for both optimizing candidates including estimating their |
| 19 | cost reliably, and for performing their final translation into IR. This |
| 20 | facilitates dealing with multiple vectorization candidates. |
| 21 | |
| 22 | High-level Design |
| 23 | ================= |
| 24 | |
| 25 | Vectorization Workflow |
| 26 | ---------------------- |
| 27 | VPlan-based vectorization involves three major steps, taking a "scenario-based |
| 28 | approach" to vectorization planning: |
| 29 | |
Sylvestre Ledru | e3fdbae | 2017-06-26 02:45:39 +0000 | [diff] [blame] | 30 | 1. Legal Step: check if a loop can be legally vectorized; encode constraints and |
Ayal Zaks | 4c4baf5 | 2017-05-29 15:36:23 +0000 | [diff] [blame] | 31 | artifacts if so. |
| 32 | 2. Plan Step: |
| 33 | |
| 34 | a. Build initial VPlans following the constraints and decisions taken by |
| 35 | Legal Step 1, and compute their cost. |
| 36 | b. Apply optimizations to the VPlans, possibly forking additional VPlans. |
| 37 | Prune sub-optimal VPlans having relatively high cost. |
| 38 | 3. Execute Step: materialize the best VPlan. Note that this is the only step |
| 39 | that modifies the IR. |
| 40 | |
| 41 | Design Guidelines |
| 42 | ----------------- |
| 43 | In what follows, the term "input IR" refers to code that is fed into the |
| 44 | vectorizer whereas the term "output IR" refers to code that is generated by the |
| 45 | vectorizer. The output IR contains code that has been vectorized or "widened" |
| 46 | according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed |
| 47 | according to an Unroll Factor (UF). |
| 48 | The design of VPlan follows several high-level guidelines: |
| 49 | |
| 50 | 1. Analysis-like: building and manipulating VPlans must not modify the input IR. |
| 51 | In particular, if the best option is not to vectorize at all, the |
| 52 | vectorization process terminates before reaching Step 3, and compilation |
| 53 | should proceed as if VPlans had not been built. |
| 54 | |
| 55 | 2. Align Cost & Execute: each VPlan must support both estimating the cost and |
| 56 | generating the output IR code, such that the cost estimation evaluates the |
| 57 | to-be-generated code reliably. |
| 58 | |
| 59 | 3. Support vectorizing additional constructs: |
| 60 | |
| 61 | a. Outer-loop vectorization. In particular, VPlan must be able to model the |
| 62 | control-flow of the output IR which may include multiple basic-blocks and |
| 63 | nested loops. |
| 64 | b. SLP vectorization. |
| 65 | c. Combinations of the above, including nested vectorization: vectorizing |
| 66 | both an inner loop and an outer-loop at the same time (each with its own |
| 67 | VF and UF), mixed vectorization: vectorizing a loop with SLP patterns |
| 68 | inside [4]_, (re)vectorizing input IR containing vector code. |
| 69 | d. Function vectorization [2]_. |
| 70 | |
| 71 | 4. Support multiple candidates efficiently. In particular, similar candidates |
| 72 | related to a range of possible VF's and UF's must be represented efficiently. |
| 73 | Potential versioning needs to be supported efficiently. |
| 74 | |
| 75 | 5. Support vectorizing idioms, such as interleaved groups of strided loads or |
| 76 | stores. This is achieved by modeling a sequence of output instructions using |
| 77 | a "Recipe", which is responsible for computing its cost and generating its |
| 78 | code. |
| 79 | |
| 80 | 6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization |
| 81 | such regions may need to be, for example, predicated and linearized, or |
| 82 | replicated VF*UF times to handle scalarized and predicated instructions. |
| 83 | Innerloops are also modelled as SESE regions. |
| 84 | |
Gil Rapaport | 8b9d1f3 | 2017-11-20 12:01:47 +0000 | [diff] [blame] | 85 | 7. Support instruction-level analysis and transformation, as part of Planning |
| 86 | Step 2.b: During vectorization instructions may need to be traversed, moved, |
| 87 | replaced by other instructions or be created. For example, vector idiom |
| 88 | detection and formation involves searching for and optimizing instruction |
| 89 | patterns. |
| 90 | |
| 91 | Definitions |
| 92 | =========== |
Ayal Zaks | 4c4baf5 | 2017-05-29 15:36:23 +0000 | [diff] [blame] | 93 | The low-level design of VPlan comprises of the following classes. |
| 94 | |
| 95 | :LoopVectorizationPlanner: |
| 96 | A LoopVectorizationPlanner is designed to handle the vectorization of a loop |
| 97 | or a loop nest. It can construct, optimize and discard one or more VPlans, |
| 98 | each VPlan modelling a distinct way to vectorize the loop or the loop nest. |
| 99 | Once the best VPlan is determined, including the best VF and UF, this VPlan |
| 100 | drives the generation of output IR. |
| 101 | |
| 102 | :VPlan: |
| 103 | A model of a vectorized candidate for a given input IR loop or loop nest. This |
| 104 | candidate is represented using a Hierarchical CFG. VPlan supports estimating |
| 105 | the cost and driving the generation of the output IR code it represents. |
| 106 | |
| 107 | :Hierarchical CFG: |
| 108 | A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The |
| 109 | Hierarchical CFG data structure is similar to the Tile Tree [5]_, where |
| 110 | cross-Tile edges are lifted to connect Tiles instead of the original |
| 111 | basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms |
| 112 | Region and Block are used rather than Tile [5]_ to avoid confusion with loop |
| 113 | tiling. |
| 114 | |
| 115 | :VPBlockBase: |
| 116 | The building block of the Hierarchical CFG. A pure-virtual base-class of |
| 117 | VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical |
| 118 | control-flow relations with other VPBlocks. Note that in contrast to the IR |
| 119 | BasicBlock, a VPBlockBase models its control-flow successors and predecessors |
| 120 | directly, rather than through a Terminator branch or through predecessor |
| 121 | branches that "use" the VPBlockBase. |
| 122 | |
| 123 | :VPBasicBlock: |
| 124 | VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the |
| 125 | Hierarchical CFG. It represents a sequence of output IR instructions that will |
| 126 | appear consecutively in an output IR basic-block. The instructions of this |
| 127 | basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a |
| 128 | sequence of zero or more VPRecipes that model the cost and generation of the |
| 129 | output IR instructions. |
| 130 | |
| 131 | :VPRegionBlock: |
| 132 | VPRegionBlock is a subclass of VPBlockBase. It models a collection of |
| 133 | VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR |
| 134 | CFG. A VPRegionBlock may indicate that its contents are to be replicated a |
| 135 | constant number of times when output IR is generated, effectively representing |
| 136 | a loop with constant trip-count that will be completely unrolled. This is used |
| 137 | to support scalarized and predicated instructions with a single model for |
| 138 | multiple candidate VF's and UF's. |
| 139 | |
| 140 | :VPRecipeBase: |
| 141 | A pure-virtual base class modeling a sequence of one or more output IR |
| 142 | instructions, possibly based on one or more input IR instructions. These |
| 143 | input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe |
| 144 | may specify how its ingredients are to be transformed to produce the output IR |
| 145 | instructions; e.g., cloned once, replicated multiple times or widened |
| 146 | according to selected VF. |
| 147 | |
Gil Rapaport | 8b9d1f3 | 2017-11-20 12:01:47 +0000 | [diff] [blame] | 148 | :VPValue: |
| 149 | The base of VPlan's def-use relations class hierarchy. When instantiated, it |
| 150 | models a constant or a live-in Value in VPlan. It has users, which are of type |
| 151 | VPUser, but no operands. |
| 152 | |
| 153 | :VPUser: |
| 154 | A VPValue representing a general vertex in the def-use graph of VPlan. It has |
| 155 | operands which are of type VPValue. When instantiated, it represents a |
| 156 | live-out Instruction that exists outside VPlan. VPUser is similar in some |
| 157 | aspects to LLVM's User class. |
| 158 | |
| 159 | :VPInstruction: |
| 160 | A VPInstruction is both a VPRecipe and a VPUser. It models a single |
| 161 | VPlan-level instruction to be generated if the VPlan is executed, including |
| 162 | its opcode and possibly additional characteristics. It is the basis for |
| 163 | writing instruction-level analyses and optimizations in VPlan as creating, |
| 164 | replacing or moving VPInstructions record both def-use and scheduling |
| 165 | decisions. VPInstructions also extend LLVM IR's opcodes with idiomatic |
| 166 | operations that enrich the Vectorizer's semantics. |
| 167 | |
Ayal Zaks | 4c4baf5 | 2017-05-29 15:36:23 +0000 | [diff] [blame] | 168 | :VPTransformState: |
| 169 | Stores information used for generating output IR, passed from |
| 170 | LoopVectorizationPlanner to its selected VPlan for execution, and used to pass |
| 171 | additional information down to VPBlocks and VPRecipes. |
| 172 | |
Gil Rapaport | 8b9d1f3 | 2017-11-20 12:01:47 +0000 | [diff] [blame] | 173 | The Planning Process and VPlan Roadmap |
| 174 | ====================================== |
| 175 | |
| 176 | Transforming the Loop Vectorizer to use VPlan follows a staged approach. First, |
| 177 | VPlan is used to record the final vectorization decisions, and to execute them: |
| 178 | the Hierarchical CFG models the planned control-flow, and Recipes capture |
| 179 | decisions taken inside basic-blocks. Next, VPlan will be used also as the basis |
| 180 | for taking these decisions, effectively turning them into a series of |
| 181 | VPlan-to-VPlan algorithms. Finally, VPlan will support the planning process |
| 182 | itself including cost-based analyses for making these decisions, to fully |
| 183 | support compositional and iterative decision making. |
| 184 | |
| 185 | Some decisions are local to an instruction in the loop, such as whether to widen |
| 186 | it into a vector instruction or replicate it, keeping the generated instructions |
| 187 | in place. Other decisions, however, involve moving instructions, replacing them |
| 188 | with other instructions, and/or introducing new instructions. For example, a |
| 189 | cast may sink past a later instruction and be widened to handle first-order |
| 190 | recurrence; an interleave group of strided gathers or scatters may effectively |
| 191 | move to one place where they are replaced with shuffles and a common wide vector |
| 192 | load or store; new instructions may be introduced to compute masks, shuffle the |
| 193 | elements of vectors, and pack scalar values into vectors or vice-versa. |
| 194 | |
| 195 | In order for VPlan to support making instruction-level decisions and analyses, |
| 196 | it needs to model the relevant instructions along with their def/use relations. |
| 197 | This too follows a staged approach: first, the new instructions that compute |
| 198 | masks are modeled as VPInstructions, along with their induced def/use subgraph. |
| 199 | This effectively models masks in VPlan, facilitating VPlan-based predication. |
| 200 | Next, the logic embedded within each Recipe for generating its instructions at |
| 201 | VPlan execution time, will instead take part in the planning process by modeling |
| 202 | them as VPInstructions. Finally, only logic that applies to instructions as a |
| 203 | group will remain in Recipes, such as interleave groups and potentially other |
| 204 | idiom groups having synergistic cost. |
| 205 | |
Ayal Zaks | 4c4baf5 | 2017-05-29 15:36:23 +0000 | [diff] [blame] | 206 | Related LLVM components |
| 207 | ----------------------- |
| 208 | 1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP |
| 209 | tree, where TSLP [3]_ adds Plan Step 2.b. |
| 210 | |
| 211 | 2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by |
| 212 | Polly [7]_. |
| 213 | |
Gil Rapaport | 8b9d1f3 | 2017-11-20 12:01:47 +0000 | [diff] [blame] | 214 | 3. Loop Vectorizer: the Vectorization Plan aims to upgrade the infrastructure of |
Tim Hammerquist | 91078c4 | 2018-01-05 00:24:55 +0000 | [diff] [blame] | 215 | the Loop Vectorizer and extend it to handle outer loops [8]_, [9]_. |
Gil Rapaport | 8b9d1f3 | 2017-11-20 12:01:47 +0000 | [diff] [blame] | 216 | |
Ayal Zaks | 4c4baf5 | 2017-05-29 15:36:23 +0000 | [diff] [blame] | 217 | References |
| 218 | ---------- |
| 219 | .. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit |
| 220 | Nuzman and Ayal Zaks, PACT 2008. |
| 221 | |
| 222 | .. [2] "Proposal for function vectorization and loop vectorization with function |
| 223 | calls", Xinmin Tian, [`cfe-dev |
| 224 | <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_]., |
| 225 | March 2, 2016. |
| 226 | See also `review <https://reviews.llvm.org/D22792>`_. |
| 227 | |
| 228 | .. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios |
| 229 | Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015. |
| 230 | |
| 231 | .. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization |
| 232 | overhead", Hao Zhou and Jingling Xue, CGO 2016. |
| 233 | |
| 234 | .. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and |
| 235 | Brian Koblenz, PLDI 1991 |
| 236 | |
| 237 | .. [6] "Structural analysis: A new approach to flow analysis in optimizing |
| 238 | compilers", M. Sharir, Journal of Computer Languages, Jan. 1980 |
| 239 | |
| 240 | .. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma |
| 241 | thesis, 2011. |
| 242 | |
| 243 | .. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks, |
| 244 | European LLVM Developers' Meeting 2017. |
Gil Rapaport | 8b9d1f3 | 2017-11-20 12:01:47 +0000 | [diff] [blame] | 245 | |
| 246 | .. [9] "Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop |
| 247 | Auto-Vectorization", Intel Vectorizer Team, LLVM Developers' Meeting 2016. |