Welcome to Mesa's GLSL compiler. A brief overview of how things flow:

1) The lex- and yacc-based preprocessor takes the incoming shader
string and produces a new string containing the preprocessed shader.
This takes care of things like #if, #ifdef, #define, and preprocessor
macro invocations. Note that #version, #extension, and some others are
passed straight through. See glcpp/*.

2) The lex- and yacc-based parser takes the preprocessed string and
generates the AST (abstract syntax tree). Almost no checking is
performed in this stage. See glsl_lexer.lpp and glsl_parser.ypp.

3) The AST is converted to "HIR". This is the intermediate
representation of the compiler. Constructors are generated, function
calls are resolved to particular function signatures, and all the
semantic checking is performed. See ast_*.cpp for the conversion, and
ir.h for the IR structures.

4) The driver (Mesa, or main.cpp for the standalone binary) performs
optimizations. These include copy propagation, dead code elimination,
constant folding, and others. Generally the driver will call the
optimization passes in a loop, as each may open up opportunities for
other optimizations to do additional work (see the sketch following
this list). See most files called ir_*.cpp.

5) Linking is performed. This does checking to ensure that the
outputs of the vertex shader match the inputs of the fragment shader,
and assigns locations to uniforms, attributes, and varyings. See
linker.cpp.

6) The driver may perform additional optimization at this point,
since, for example, dead code elimination previously couldn't remove
functions or global variable usage when we didn't know what other
code would be linked in.

7) The driver performs code generation out of the IR, taking a linked
shader program and producing a compiled program for each stage. See
ir_to_mesa.cpp for Mesa IR code generation.

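Concretely, the optimization loop mentioned in step 4 tends to look
like the hedged sketch below. The pass names and signatures are
illustrative stand-ins for whatever ir_optimization.h actually
declares; the only real point is that each pass returns whether it
changed the IR, and the driver loops until a full round makes no
progress.

   #include "ir_optimization.h"   /* assumed home of the do_*() passes */

   static void
   optimize(exec_list *ir)
   {
      bool progress;
      do {
         /* Each pass returns true if it made a change; loop until a
          * whole round runs without making any progress. */
         progress = false;
         progress = do_constant_folding(ir) || progress;
         progress = do_copy_propagation(ir) || progress;
         progress = do_dead_code(ir)        || progress;
      } while (progress);
   }
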
FAQ:

Q: What is HIR versus IR versus LIR?

A: The idea behind the naming was that ast_to_hir would produce a
high-level IR ("HIR"), with things like matrix operations, structure
assignments, etc., present. A series of lowering passes would then
break matrix multiplication into a series of dot products/MADs, turn
structure assignment into a series of per-component assignments,
flatten if statements into conditional moves, and so on, producing a
low-level IR ("LIR").

However, it now appears that each driver will have different
requirements from a LIR. A 915-generation chipset wants all functions
inlined, all loops unrolled, all ifs flattened, no variable array
accesses, and matrix multiplication broken down. The Mesa IR backend
for swrast would like matrices and structure assignment broken down,
but it can support function calls and dynamic branching. A 965 vertex
shader IR backend could potentially even handle some matrix operations
without breaking them down, but the 965 fragment shader IR backend
would want to break (almost) all operations down channel-wise and
perform optimization on that. As a result, there's no single
low-level IR that will make everyone happy. So that usage has fallen
out of favor, and each driver will perform a series of lowering passes
to take the HIR down to whatever restrictions it wants to impose
before doing codegen.

Q: How is the IR structured?

A: The best way to get started seeing it would be to run the
standalone compiler against a shader:

./glsl_compiler --dump-lir \
    ~/src/piglit/tests/shaders/glsl-orangebook-ch06-bump.frag

So for example one of the ir_instructions in main() contains:

(assign (constant bool (1)) (var_ref litColor) (expression vec3 *
 (var_ref SurfaceColor) (var_ref __retval) ) )

Or more visually:
                     (assign)
                    /    |    \
           (var_ref)  (expression *)  (constant bool 1)
              /         /       \
      (litColor)  (var_ref)   (var_ref)
                     /              \
             (SurfaceColor)     (__retval)

which came from:

litColor = SurfaceColor * max(dot(normDelta, LightDir), 0.0);

(the max call is not represented in this expression tree: it was a
function call that got inlined rather than brought into the tree, and
its result is referenced here through __retval)

Each of those nodes is a subclass of ir_instruction. A particular
ir_instruction instance may only appear once in the whole IR tree,
with the exception of ir_variables, which appear once as variable
declarations:

(declare () vec3 normDelta)

and multiple times as the targets of variable dereferences:
...
(assign (constant bool (1)) (var_ref __retval) (expression float dot
 (var_ref normDelta) (var_ref LightDir) ) )
...
(assign (constant bool (1)) (var_ref __retval) (expression vec3 -
 (var_ref LightDir) (expression vec3 * (constant float (2.000000))
 (expression vec3 * (expression float dot (var_ref normDelta) (var_ref
 LightDir) ) (var_ref normDelta) ) ) ) )
...

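Code that needs to walk these nodes usually does it with a visitor
rather than by hand. As a rough sketch only (print_derefs_visitor and
print_derefs are made-up names, and the exact interface is defined in
ir_hierarchical_visitor.h), a pass that reports every variable
dereference might look like:

   #include <stdio.h>
   #include "ir.h"
   #include "ir_hierarchical_visitor.h"

   /* Sketch only: print the name of every ir_variable that is
    * dereferenced anywhere in an instruction list. */
   class print_derefs_visitor : public ir_hierarchical_visitor {
   public:
      virtual ir_visitor_status visit(ir_dereference_variable *ir)
      {
         printf("deref of %s\n", ir->var->name);
         return visit_continue;
      }
   };

   /* Usage: "instructions" is the exec_list of a shader's IR. */
   void
   print_derefs(exec_list *instructions)
   {
      print_derefs_visitor v;
      v.run(instructions);
   }
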
Each node has a type. Expressions may involve several different types:

(declare (uniform ) mat4 gl_ModelViewMatrix)
(assign (constant bool (1)) (var_ref constructor_tmp) (expression
 vec4 * (var_ref gl_ModelViewMatrix) (var_ref gl_Vertex) ) )

An expression tree can be arbitrarily deep, and the compiler tries to
keep them structured like that so that things like algebraic
optimizations ((color * 1.0 == color) and ((mat1 * mat2) * vec == mat1
* (mat2 * vec))) or recognizing operation patterns for code generation
(vec1 * vec2 + vec3 == mad(vec1, vec2, vec3)) are easier. This comes
at the expense of additional trickery in implementing some
optimizations like CSE where one must navigate an expression tree.

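For example, a backend that wants to emit mad for the
(vec1 * vec2 + vec3) pattern above only has to look at the shape of
the tree. This is a hedged sketch assuming the ir_expression layout
in ir.h (an operation enum plus an operands[] array); is_mad_pattern
is a made-up name, and a real backend would also check the second
operand of the add, since addition is commutative:

   #include "ir.h"

   /* Sketch: does this expression look like (a * b) + c? */
   static bool
   is_mad_pattern(ir_expression *expr)
   {
      if (expr->operation != ir_binop_add)
         return false;

      ir_expression *mul = expr->operands[0]->as_expression();
      return mul != NULL && mul->operation == ir_binop_mul;
   }
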
Q: Why no SSA representation?

A: Converting an IR tree to SSA form makes dead code elimination,
common subexpression elimination, and many other optimizations much
easier. However, in our primarily vector-based language, there are
some major questions as to how it would work. Do we do SSA on the
scalar or vector level? If we do it at the vector level, we're going
to end up with many different versions of the variable when
encountering code like:

(assign (constant bool (1)) (swiz x (var_ref __retval) ) (var_ref a) )
(assign (constant bool (1)) (swiz y (var_ref __retval) ) (var_ref b) )
(assign (constant bool (1)) (swiz z (var_ref __retval) ) (var_ref c) )

If every masked update of a component relies on the previous value of
the variable, then we're probably going to be quite limited in our
dead code elimination wins, and recognizing common expressions may
just not happen. On the other hand, if we operate channel-wise, then
we'll be prone to optimizing the operation on one of the channels at
the expense of making its instruction flow different from the other
channels, and a vector-based GPU would end up with worse code than if
we didn't optimize operations on that channel!

Once again, it appears that our optimization requirements are driven
significantly by the target architecture. For now, targeting the Mesa
IR backend, SSA does not appear to be that important to producing
excellent code, but we do expect to do some SSA-based optimizations
for the 965 fragment shader backend when that is developed.

Q: How should I expand instructions that take multiple backend instructions?

A: Sometimes you'll have to do the expansion in your code generation --
see, for example, ir_to_mesa.cpp's handling of ir_unop_sqrt. However,
in many cases you'll want to do a pass over the IR to convert
non-native instructions to a series of native instructions. For
example, for the Mesa backend we have ir_div_to_mul_rcp.cpp because
Mesa IR (and many hardware backends) only have a reciprocal
instruction, not a divide. Implementing non-native instructions this
way gives constant folding a chance to occur, so (a / 2.0) becomes
(a * 0.5) after codegen instead of (a * (1.0 / 2.0)).

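The heart of such a pass is a small in-place rewrite of the expression
node. Here is a hedged sketch of the a / b -> a * rcp(b) transform
(lower_div is a made-up name; the real pass is ir_div_to_mul_rcp.cpp
and runs from a visitor over the IR, and mem_ctx stands for whatever
talloc context new nodes are allocated from):

   #include "ir.h"

   /* Sketch: rewrite a division in place as a multiply by a reciprocal.
    * mem_ctx is the talloc context new IR nodes should be allocated from. */
   static void
   lower_div(void *mem_ctx, ir_expression *expr)
   {
      if (expr->operation != ir_binop_div)
         return;

      ir_rvalue *rcp = new(mem_ctx) ir_expression(ir_unop_rcp,
                                                  expr->operands[1]->type,
                                                  expr->operands[1],
                                                  NULL);
      expr->operation = ir_binop_mul;
      expr->operands[1] = rcp;
   }
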
Q: How should I handle my special hardware instructions with respect to IR?

A: Our current theory is that if multiple targets have an instruction for
some operation, then we should probably be able to represent that in
the IR. Generally this is in the form of an ir_{bin,un}op expression
type. For example, we initially implemented fract() using (a -
floor(a)), but both 945 and 965 have instructions to give that result,
and it would also simplify the implementation of mod(), so
ir_unop_fract was added. The following areas need updating to add a
new expression type:

ir.h (new enum)
ir.cpp:operator_strs (used for ir_reader)
ir_constant_expression.cpp (you probably want to be able to constant fold;
    see the sketch below)
ir_validate.cpp (check users have the right types)

You may also need to update the backends if they will see the new expr type:

../mesa/shaders/ir_to_mesa.cpp

You can then use the new expression from builtins (if all backends
would rather see it), or scan the IR and convert to use your new
expression type (see ir_mod_to_fract, for example).

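To give a feel for the ir_constant_expression.cpp entry in the list
above, the constant-folding case for something like ir_unop_fract
looks roughly like the sketch below. It assumes the shape of
ir_expression::constant_expression_value() (op[] holding the
already-folded operands, data being the ir_constant_data under
construction); the real code also switches on the base type instead
of assuming float.

   case ir_unop_fract:
      /* Sketch, float-only: fract(x) = x - floor(x), per component. */
      for (unsigned c = 0; c < op[0]->type->components(); c++)
         data.f[c] = op[0]->value.f[c] - floorf(op[0]->value.f[c]);
      break;
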
Q: How is memory management handled in the compiler?

A: The hierarchical memory allocator "talloc" developed for the Samba
project is used, so that things like optimization passes don't have to
worry about their garbage collection so much. It has a few nice
features, including low performance overhead and good debugging
support that's trivially available.

Generally, each stage of the compile creates a talloc context and
allocates its memory out of that or children of it. At the end of the
stage, the pieces that are still live are stolen to a new context and
the old one freed, or the whole context is kept for use by the next
stage.

For IR transformations, a temporary context is used, then at the end
of all transformations, reparent_ir reparents all live nodes under the
shader's IR list, and the old context full of dead nodes is freed.
When developing a single IR transformation pass, this means that you
want to allocate instruction nodes out of the temporary context, so if
an instruction becomes dead it doesn't live on as the child of a live
node. At the moment, optimization passes aren't passed that temporary
context, so they find it by calling talloc_parent() on a nearby IR
node. The talloc_parent() call is expensive, so many passes will
cache the result of the first talloc_parent(). Cleaning up all the
optimization passes to take a context argument and not call
talloc_parent() is left as an exercise.

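A hedged sketch of that per-pass pattern, using the talloc calls
mentioned above (shader->ir here stands for the shader's IR exec_list,
which doubles as a talloc context; see ir.h for the actual reparent_ir
declaration):

   void *mem_ctx = talloc_new(NULL);      /* temporary context for the pass */

   /* ... run the pass, allocating any new ir_instruction nodes with
    * new(mem_ctx) so that dead ones stay parented to mem_ctx ... */

   reparent_ir(shader->ir, shader->ir);   /* steal the nodes still in use */
   talloc_free(mem_ctx);                  /* everything left here is dead */
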
Q: What is the file naming convention in this directory?

A: Initially, there really wasn't one. We have since adopted one:

- Files that implement code lowering passes should be named lower_*
  (e.g., lower_noise.cpp).
- Files that implement optimization passes should be named opt_*.
- Files that implement a class that is used throughout the code should
  take the name of that class (e.g., ir_hierarchical_visitor.cpp).
- Files that contain code not fitting in one of the previous
  categories should have a sensible name (e.g., glsl_parser.ypp).