| # Bufferization on MLIR |
| |
The general task of bufferization is to move SSA values (like tensors) into
allocated memory buffers that have to be freed when they are no longer needed.
This also involves the placement of copies to clone the contents of allocated
buffers at specific locations (similar to register allocation). On the one
hand, these copies are needed to ensure correct program behavior; on the other
hand, introducing several aliases for a buffer can lead to incorrect freeing
of buffers. Therefore, the transformation has to take aliasing and the program
structure into account, and the introduction of copies solves this problem.
Unnecessary copies introduced during this process can be eliminated
afterwards.
| |
| ```mlir |
| func @func_on_tensors(%source: tensor<4xf32>) -> tensor<4xf32> { |
| %0 = mhlo.exp %source : (tensor<4xf32>) -> (tensor<4xf32>) |
| return %0 : tensor<4xf32> |
| } |
| ``` |
| |
| Will be transformed to: |
| |
| ```mlir |
| func @func_on_buffers(%source: memref<4xf32>, %target: memref<4xf32>) { |
| %0 = alloc() : memref<4xf32> |
| lmhlo.exp %source, %0 : (memref<4xf32>, memref<4xf32>) -> () |
| lmhlo.copy %0, %target : (memref<4xf32>, memref<4xf32>) -> () |
| dealloc %0 : memref<4xf32> |
| return |
| } |
| ``` |
| |
In general, bufferization is split into three separate phases: a preparation
phase, a bufferization phase, and a post-processing phase. The assignment
process happens during dialect conversion and allocates buffers for each value
that should be moved into a memory buffer. This has to be implemented by each
dialect using the following tools and patterns. In the process, all operations
working on tensor values have to be rewritten to work on `memref<T>` types
(see Preparation Phase). Afterwards, the placement transformation (see
BufferDeallocation) automatically adds all required deallocation operations,
temporary buffers, and copy operations.
| |
| ## Preparation Phase |
| |
In order to apply the BufferDeallocation code transformation, the input MLIR
program needs to work on allocations (buffers in general) and `memref<T>`
types (as outlined above). If your input program does not work on buffers, you
need to perform this preparation step in order to port it to the “buffer
world”. This is a user-defined preparation step that is intended to be applied
during dialect conversion. The user has to take care of the correct conversion
by providing conversion patterns that rely on a type converter to assign
buffers.
| |
| A necessary step is to apply type and function signature conversion. |
| Furthermore, after changing all function signatures, the associated return and |
| call operations must comply with the new corresponding function signatures. For |
| this purpose, three operation conversion patterns are introduced: |
| |
* `BufferizeFuncOpConverter`
* `BufferizeReturnOpConverter`
* `BufferizeCallOpConverter`
| |
| In order to use these conversion patterns, the user needs to define a custom |
| BufferizeTypeConverter implementation. |
| |
| ### BufferizeTypeConverter |
| |
The `BufferizeTypeConverter` is an extension of the `TypeConverter` class that
provides additional functionality for dialect developers to decide on the
signature of a function. The extra features include the ability to:

* use `setResultConversionKind` to decide whether a function result with a
specific type should be appended to the function argument list as an output
argument or remain a function result after the conversion.
* define custom callback functions for type or value unpacking.
| |
| ### ResultConversionKind |
| |
`ResultConversionKind` is an enum with two values,

* `AppendToArgumentsList`
* `KeepAsFunctionResult`

that defines how `BufferizeFuncOpConverter` should handle function results in
order to convert the signature of the function. The other two operation
conversion patterns also use `ResultConversionKind` to adapt to the new
function signature.
| |
`ResultConversionKind` can be set using `setResultConversionKind`, which takes
two template parameters that correspond to the types before and after type
conversion. This mapping specifies whether the resulting type should remain a
function result or should be appended to the argument list after the
conversion is done. Note that the default value for unspecified mappings is
`KeepAsFunctionResult`. For instance, the following call updates a
`BufferizeTypeConverter` instance so that all `MemRefType` function results
(converted from `RankedTensorType`s) are appended to the function argument
list by `BufferizeFuncOpConverter`:
| |
```c++
| converter.setResultConversionKind<RankedTensorType, MemRefType>( |
| BufferizeTypeConverter::AppendToArgumentsList); |
| ``` |
| |
### Callbacks for Unpacking Types
| |
| ```mlir |
| func @func_on_tensors(%arg0: tuple<i1,f32>) -> (tuple<tensor<10xf32>, tensor<5xf32>>) { |
| ... |
| } |
| ``` |
| |
| Will be transformed to: |
| |
| ```mlir |
| func @func_on_buffers(%arg0: i1, %arg1: f32, %target0: memref<10xf32>, %target1: memref<5xf32>) { |
| ... |
| } |
| ``` |
| |
BufferizeFuncOpConverter is also able to unpack the types of arguments and
results of a function during function signature conversion. In the example
above, it unwraps the tuple type and converts the type of each constituent.

For this purpose, users can provide custom callbacks via
`addDecomposeTypeConversion` on the `BufferizeTypeConverter` instance that
specify how a particular type (e.g. `TupleType`) can be unpacked. However,
when a type decomposition is provided, two additional callbacks have to be
defined as well. They specify how to pack or unpack values of that particular
type. These two callbacks can be provided via the `addArgumentMaterialization`
(packing) and `addDecomposeValueConversion` (unpacking) functions.
| |
| The following piece of code demonstrates this functionality by flattening a |
| `TupleType`. |
| |
```c++
converter.addDecomposeTypeConversion(
    [](TupleType tupleType, SmallVectorImpl<Type> &types) {
      tupleType.getFlattenedTypes(types);
      return success();
    });

converter.addArgumentMaterialization(
    [](OpBuilder &builder, TupleType resultType, ValueRange inputs,
       Location loc) -> Optional<Value> {
      if (inputs.size() == 1)
        return llvm::None;
      TypeRange typeRange = inputs.getTypes();
      SmallVector<Type, 2> types(typeRange.begin(), typeRange.end());
      TupleType tuple = TupleType::get(types, builder.getContext());
      mlir::Value value = builder.create<MakeTupleOp>(loc, tuple, inputs);
      return value;
    });

converter.addDecomposeValueConversion([](OpBuilder &builder, Location loc,
                                         TupleType resultType, Value value,
                                         SmallVectorImpl<Value> &values) {
  for (unsigned i = 0, e = resultType.size(); i < e; ++i) {
    Value res = builder.create<GetTupleElementOp>(
        loc, resultType.getType(i), value, builder.getI32IntegerAttr(i));
    values.push_back(res);
  }
  return success();
});
```
| |
In the scope of these callback functions, the elements of a tuple value can be
decomposed using `GetTupleElementOp`. Conversely, `MakeTupleOp` is used to pack
a list of values into a single tuple value.
| |
| ### Bufferization Operation Conversion Patterns |
| |
The following conversion patterns can be used to conveniently transform the
signature of a function as well as its return and call operations:
| |
| * `BufferizeFuncOpConverter` |
| * `BufferizeReturnOpConverter` |
| * `BufferizeCallOpConverter` |
| |
| Any combination of these conversion patterns can be specified by the user. If |
| you need to apply all of these operation converters, you can use |
| `populateWithBufferizeOpConversionPatterns` which sets up all converters. |
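
As a rough sketch, a preparation pass could wire these patterns into a dialect
conversion as follows. This is not a verbatim API listing: the pass name
`TensorToBufferPass` is hypothetical, the legality rules are only
illustrative, and the exact conversion-framework and
`populateWithBufferizeOpConversionPatterns` signatures differ between MLIR
versions.

```c++
struct TensorToBufferPass
    : public PassWrapper<TensorToBufferPass, OperationPass<ModuleOp>> {
  void runOnOperation() override {
    MLIRContext &context = getContext();

    // Configure the type converter as described above.
    BufferizeTypeConverter converter;
    converter.addConversion([](RankedTensorType type) {
      return (Type)MemRefType::get(type.getShape(), type.getElementType());
    });

    // Functions are legal once their signature contains no tensor types.
    ConversionTarget target(context);
    target.addDynamicallyLegalOp<FuncOp>([&](FuncOp op) {
      return converter.isSignatureLegal(op.getType());
    });

    // Set up all three conversion patterns in one go.
    OwningRewritePatternList patterns;
    populateWithBufferizeOpConversionPatterns<mlir::ReturnOp, mlir::ReturnOp,
                                              linalg::CopyOp>(
        &context, converter, patterns);
    if (failed(applyFullConversion(getOperation(), target,
                                   std::move(patterns))))
      signalPassFailure();
  }
};
```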
| |
| ### BufferizeFuncOpConverter |
| |
| The BufferizeFuncOpConverter is the actual function operation converter that |
| applies signature conversion by using a previously defined |
| `BufferizeTypeConverter`. |
| |
| In the following example, we configure a `BufferizeTypeConverter` instance such |
| that |
| |
| * all RankedTensorTypes should be converted to MemRefTypes. |
| * all function results that are results of type conversion from |
| RankedTensorTypes to MemRefTypes should be appended to the function argument |
| list. |
* all TupleTypes should be flattened and decomposed into their constituents.
| |
```c++
| converter.addConversion([](RankedTensorType type) { |
| return (Type)MemRefType::get(type.getShape(), type.getElementType()); |
| }); |
| converter.setResultConversionKind<RankedTensorType, MemRefType>( |
| BufferizeTypeConverter::AppendToArgumentsList); |
| converter.addDecomposeTypeConversion( |
| [](TupleType tupleType, SmallVectorImpl<Type> &types) { |
| tupleType.getFlattenedTypes(types); |
| return success(); |
| }); |
| ``` |
| |
| Consider the following signature conversion: |
| |
| ```mlir |
func @on_tensors(%arg0: tuple<i1,f32>) -> (tuple<memref<10xf32>, tensor<5xf32>>) {
| ... |
| } |
| ``` |
| |
| Will be transformed to: |
| |
| ```mlir |
| func @on_buffers(%arg0: i1, %arg1: f32, %out: memref<5xf32>) -> memref<10xf32> { |
| ... |
| } |
| ``` |
| |
Using the presented converter setup, all `TupleType` arguments and results are
decomposed first. The `tensor<5xf32>` result is converted to a `memref<5xf32>`
type and appended to the argument list. There is no conversion for the types
`memref<10xf32>`, `i1`, and `f32`. Therefore, the `memref<10xf32>` result is
kept as it is and will also be kept as a function result since there is no
`ResultConversionKind` mapping from a `MemRefType` to a `MemRefType`. However,
if we change the result-conversion behavior via
| |
```c++
| converter.setResultConversionKind<RankedTensorType, MemRefType>( |
| BufferizeTypeConverter::KeepAsFunctionResult); |
| ``` |
| |
| the output will be: |
| |
| ```mlir |
| func @on_buffers(%arg0: i1, %arg1: f32) -> (memref<10xf32>, memref<5xf32>) { |
| ... |
| } |
| ``` |
| |
| ### BufferizeReturnOpConverter |
| |
When changing the signature of a function, the return operands must match the
results of the corresponding function if buffer-typed results have been
configured to be appended to the function argument list. This matching
consists of two separate steps. First, we have to remove the operands that
have been appended to the argument list as output arguments. Second, we have
to introduce additional copies for these operands. However, since each dialect
has its own dialect-dependent return and copy operations, this conversion
pattern comes with three template parameters: the original return operation,
the target return operation, and the copy operation kind.
| |
| In the following example, two conversion patterns are inserted into the pattern |
| list. The `BufferizeReturnOpConverter` is set to replace a standard return |
| operation with the same operation type. |
| |
```c++
| patterns->insert< |
| BufferizeFuncOpConverter, |
| BufferizeReturnOpConverter |
| <mlir::ReturnOp, mlir::ReturnOp, linalg::CopyOp> |
| >(...) |
| ``` |
| |
| Consider the following input/output program using a single return: |
| |
| ```mlir |
| func @on_tensors(%arg0: tensor<5xf32>, %arg1: i1) -> (tensor<5xf32>, i1) { |
| return %arg0, %arg1 : tensor<5xf32>, i1 |
| } |
| ``` |
| |
| Will be transformed to: |
| |
| ```mlir |
| func @on_buffers(%arg0: memref<5xf32>, %arg1: i1, %out: memref<5xf32>) -> i1 { |
| linalg.copy(%arg0, %out) : memref<5xf32>, memref<5xf32> |
| return %arg1 : i1 |
| } |
| ``` |
| |
Based on our previously configured `BufferizeTypeConverter` instance, which
requires buffer-typed function results to be appended to the function argument
list, the new `on_buffers` function signature is created. The operands of the
return operation must be adapted to the new function signature. Therefore, the
buffer-typed operand is removed from the operand list of the new return
operation. Instead, a copy operation is inserted right before the return
operation to copy the content of the operand buffer to the target buffer,
which yields the output shown above.
| |
| ### BufferizeCallOpConverter |
| |
The BufferizeCallOpConverter is a call operation converter that transforms and
matches the operands and results of a call operation with the arguments and
results of the callee. Besides converting operand and result types, it
allocates a buffer for each buffer-based result of the called function that is
appended to the argument list (if buffer-typed results have been configured to
be appended to the function argument list).

The following piece of code shows a sample call site, based on our previously
configured `BufferizeTypeConverter`:
| |
| ```mlir |
| func @callee(%arg0: tensor<5xf32>) -> (tensor<5xf32>) { |
| return %arg0 : tensor<5xf32> |
| } |
| |
| func @caller(%arg0: tensor<5xf32>) -> tensor<5xf32> { |
| %x = call @callee(%arg0) : (tensor<5xf32>) -> tensor<5xf32> |
| return %x : tensor<5xf32> |
| } |
| ``` |
| |
| Will be transformed to: |
| |
| ```mlir |
| func @callee(%arg0: memref<5xf32>, %out: memref<5xf32>) { |
| linalg.copy(%arg0, %out) : memref<5xf32>, memref<5xf32> |
| return |
| } |
| |
| func @caller(%arg0: memref<5xf32>, %out: memref<5xf32>) { |
| %0 = alloc() : memref<5xf32> |
| call @callee(%arg0, %0) : (memref<5xf32>, memref<5xf32>) -> () |
| linalg.copy(%0, %out) : memref<5xf32>, memref<5xf32> |
| return |
| } |
| ``` |
| |
| ### Summarizing Example |
| |
| To summarize all preparation converters, the following sample is a complete |
| listing of an input IR program and its output after applying all converters: |
| |
| ```mlir |
| func @callee(%arg0: tuple<tensor<5xf32>,i1>) -> tuple<tensor<5xf32>,i1> { |
| return %arg0 : tuple<tensor<5xf32>,i1> |
| } |
| |
| func @caller(%arg0: tuple<tensor<5xf32>,i1>) -> tuple<tensor<5xf32>,i1> { |
| %x = call @callee(%arg0) : (tuple<tensor<5xf32>,i1>) -> tuple<tensor<5xf32>,i1> |
| return %x : tuple<tensor<5xf32>,i1> |
| } |
| ``` |
| |
| Will be transformed to: |
| |
| ```mlir |
| func @callee(%arg0: memref<5xf32>, %arg1: i1, %arg2: memref<5xf32>) -> i1 { |
| %0 = "test.make_tuple"(%arg0, %arg1) : (memref<5xf32>, i1) -> tuple<memref<5xf32>, i1> |
| %1 = "test.get_tuple_element"(%0) {index = 0 : i32} : (tuple<memref<5xf32>, i1>) -> memref<5xf32> |
| %2 = "test.get_tuple_element"(%0) {index = 1 : i32} : (tuple<memref<5xf32>, i1>) -> i1 |
| linalg.copy(%1, %arg2) : memref<5xf32>, memref<5xf32> |
| return %2 : i1 |
| } |
| func @caller(%arg0: memref<5xf32>, %arg1: i1, %arg2: memref<5xf32>) -> i1 { |
| %0 = "test.make_tuple"(%arg0, %arg1) : (memref<5xf32>, i1) -> tuple<memref<5xf32>, i1> |
| %1 = "test.get_tuple_element"(%0) {index = 0 : i32} : (tuple<memref<5xf32>, i1>) -> memref<5xf32> |
| %2 = "test.get_tuple_element"(%0) {index = 1 : i32} : (tuple<memref<5xf32>, i1>) -> i1 |
| %3 = alloc() : memref<5xf32> |
| %4 = call @callee(%1, %2, %3) : (memref<5xf32>, i1, memref<5xf32>) -> i1 |
| %5 = "test.make_tuple"(%3, %4) : (memref<5xf32>, i1) -> tuple<memref<5xf32>, i1> |
| %6 = "test.get_tuple_element"(%5) {index = 0 : i32} : (tuple<memref<5xf32>, i1>) -> memref<5xf32> |
| %7 = "test.get_tuple_element"(%5) {index = 1 : i32} : (tuple<memref<5xf32>, i1>) -> i1 |
| linalg.copy(%6, %arg2) : memref<5xf32>, memref<5xf32> |
| return %7 : i1 |
| } |
| ``` |
| |
| ## Buffer Deallocation - Internal Functionality |
| |
This section covers the internal functionality of the BufferDeallocation
transformation. The transformation consists of several passes. The main pass,
called BufferDeallocation, can be applied via `-buffer-deallocation` on MLIR
programs. Currently, there are three optimization passes that move allocs and
convert AllocOps to AllocaOps, if possible. The first and second pass can be
applied using `-buffer-hoisting` or `-buffer-loop-hoisting`, the third one
using `-promote-buffers-to-stack`. However, these optimizations must be
applied before running the BufferDeallocation pass.
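
The pass ordering can be expressed directly in a pass pipeline. The following
sketch assumes the pass factory functions of the time (e.g.
`createBufferDeallocationPass`); their exact names and parameters may vary
across MLIR versions, and `buildBufferDeallocationPipeline` is a hypothetical
helper:

```c++
#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/Passes.h"

void buildBufferDeallocationPipeline(mlir::PassManager &pm) {
  using namespace mlir;
  // The optimization passes have to run before the deallocation pass.
  pm.addNestedPass<FuncOp>(createBufferHoistingPass());
  pm.addNestedPass<FuncOp>(createBufferLoopHoistingPass());
  pm.addNestedPass<FuncOp>(createPromoteBuffersToStackPass());
  // Insert dealloc operations (and the required temporary copies).
  pm.addNestedPass<FuncOp>(createBufferDeallocationPass());
  // Post phase: remove unnecessary copies and dead allocations.
  pm.addNestedPass<FuncOp>(createCopyRemovalPass());
}
```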
| |
| ### Requirements |
| |
| In order to use BufferDeallocation on an arbitrary dialect, several |
| control-flow interfaces have to be implemented when using custom operations. |
| This is particularly important to understand the implicit control-flow |
| dependencies between different parts of the input program. Without implementing |
| the following interfaces, control-flow relations cannot be discovered properly |
| and the resulting program can become invalid: |
| |
| * Branch-like terminators should implement the `BranchOpInterface` to query and |
| manipulate associated operands. |
| * Operations involving structured control flow have to implement the |
| `RegionBranchOpInterface` to model inter-region control flow. |
* Terminators yielding values to their parent operation (in particular in the
scope of nested regions within `RegionBranchOpInterface`-based operations)
should implement the `ReturnLike` trait to represent logical “value returns”.
| |
The “std” and “scf” dialects are examples of dialects that implement all of
these interfaces and are therefore fully compatible.
| |
| ### Detection of Buffer Allocations |
| |
The first step of the BufferDeallocation transformation is to identify
manageable allocation operations that implement the `SideEffects` interface.
Furthermore, these ops need to apply the effect `MemoryEffects::Allocate` to a
particular result value while not using the resource
`SideEffects::AutomaticAllocationScopeResource` (since it is currently reserved
for allocations, like `Alloca`, that will be automatically deallocated by a
parent scope). Allocations that have not been detected in this phase will not
be tracked internally and will thus not be deallocated automatically. However,
BufferDeallocation is fully compatible with “hybrid” setups in which tracked
and untracked allocations are mixed:
| |
| ```mlir |
| func @mixedAllocation(%arg0: i1) { |
| %0 = alloca() : memref<2xf32> // aliases: %2 |
| %1 = alloc() : memref<2xf32> // aliases: %2 |
| cond_br %arg0, ^bb1, ^bb2 |
| ^bb1: |
| use(%0) |
| br ^bb3(%0 : memref<2xf32>) |
| ^bb2: |
| use(%1) |
| br ^bb3(%1 : memref<2xf32>) |
| ^bb3(%2: memref<2xf32>): |
| ... |
| } |
| ``` |
| |
| Example of using a conditional branch with alloc and alloca. BufferDeallocation |
| can detect and handle the different allocation types that might be intermixed. |
| |
| Note: the current version does not support allocation operations returning |
| multiple result buffers. |
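
To make the detection criterion concrete, the following predicate is a
simplified sketch (assuming MLIR's `MemoryEffectOpInterface`;
`getManagedAllocation` is a hypothetical helper, not the pass's actual
implementation):

```c++
/// Returns the allocated result if `op` is a manageable allocation, or a null
/// Value otherwise.
static Value getManagedAllocation(Operation *op) {
  auto effectInterface = dyn_cast<MemoryEffectOpInterface>(op);
  if (!effectInterface)
    return Value();
  SmallVector<MemoryEffects::EffectInstance, 2> effects;
  effectInterface.getEffects(effects);
  for (MemoryEffects::EffectInstance &effect : effects) {
    Value value = effect.getValue();
    // Look for an Allocate effect on a result of this operation that does not
    // use the automatic allocation scope resource (reserved for alloca-like
    // allocations that a parent scope frees automatically).
    if (value && value.getDefiningOp() == op &&
        isa<MemoryEffects::Allocate>(effect.getEffect()) &&
        effect.getResource() !=
            SideEffects::AutomaticAllocationScopeResource::get())
      return value;
  }
  return Value();
}
```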
| |
| ### Conversion from AllocOp to AllocaOp |
| |
The PromoteBuffersToStack pass converts AllocOps to AllocaOps, if possible. In
some cases, it can be useful to use such stack-based buffers instead of
heap-based buffers. The conversion is subject to several constraints:

* Control flow
* Buffer size
* Dynamic size

If a buffer leaves a block, we are not allowed to convert it into an alloca.
If the size of the buffer is large, we could convert it, but to avoid stack
overflows it makes sense to limit the size of these buffers and only convert
small ones. The maximal size can be set via a pass option; the current default
value is 1KB. Furthermore, we cannot convert buffers with dynamic size, since
their dimensions are not known a priori.
| |
| ### Movement and Placement of Allocations |
| |
Using the buffer hoisting pass, all buffer allocations are moved as far
upwards as possible in order to group them and make upcoming optimizations
easier by limiting the search space. Such a movement is shown in the following
graphs. In addition, we are able to statically free an alloc if we move it
into a dominator of all of its uses. This simplifies further optimizations
(e.g. buffer fusion) in the future. However, the movement of allocations is
limited by external data dependencies (in particular in the case of
allocations of dynamically shaped types). Furthermore, allocations can be
moved out of nested regions, if necessary. In order to move allocations to
valid locations with respect to their uses only, we leverage liveness
information.
| |
The following code snippet shows a conditional branch before running the
BufferHoisting pass:
| |
|  |
| |
| ```mlir |
| func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) { |
| cond_br %arg0, ^bb1, ^bb2 |
| ^bb1: |
| br ^bb3(%arg1 : memref<2xf32>) |
| ^bb2: |
| %0 = alloc() : memref<2xf32> // aliases: %1 |
| use(%0) |
| br ^bb3(%0 : memref<2xf32>) |
| ^bb3(%1: memref<2xf32>): // %1 could be %0 or %arg1 |
| "linalg.copy"(%1, %arg2) : (memref<2xf32>, memref<2xf32>) -> () |
| return |
| } |
| ``` |
| |
| Applying the BufferHoisting pass on this program results in the following piece |
| of code: |
| |
|  |
| |
| ```mlir |
| func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) { |
| %0 = alloc() : memref<2xf32> // moved to bb0 |
| cond_br %arg0, ^bb1, ^bb2 |
| ^bb1: |
| br ^bb3(%arg1 : memref<2xf32>) |
| ^bb2: |
| use(%0) |
| br ^bb3(%0 : memref<2xf32>) |
| ^bb3(%1: memref<2xf32>): |
| "linalg.copy"(%1, %arg2) : (memref<2xf32>, memref<2xf32>) -> () |
| return |
| } |
| ``` |
| |
The alloc is moved from bb2 to the beginning (bb0) and is passed as an
argument to bb3.
| |
The following example demonstrates an allocation using dynamically shaped
types. Due to the data dependency of the allocation on %0, we cannot move the
allocation out of bb2 in this case:
| |
| ```mlir |
| func @condBranchDynamicType( |
| %arg0: i1, |
| %arg1: memref<?xf32>, |
| %arg2: memref<?xf32>, |
| %arg3: index) { |
| cond_br %arg0, ^bb1, ^bb2(%arg3: index) |
| ^bb1: |
| br ^bb3(%arg1 : memref<?xf32>) |
| ^bb2(%0: index): |
  %1 = alloc(%0) : memref<?xf32> // cannot be moved upwards due to the data
                                 // dependency on %0
| use(%1) |
| br ^bb3(%1 : memref<?xf32>) |
| ^bb3(%2: memref<?xf32>): |
| "linalg.copy"(%2, %arg2) : (memref<?xf32>, memref<?xf32>) -> () |
| return |
| } |
| ``` |
| |
| ### Introduction of Copies |
| |
| In order to guarantee that all allocated buffers are freed properly, we have to |
| pay attention to the control flow and all potential aliases a buffer allocation |
| can have. Since not all allocations can be safely freed with respect to their |
| aliases (see the following code snippet), it is often required to introduce |
| copies to eliminate them. Consider the following example in which the |
| allocations have already been placed: |
| |
| ```mlir |
| func @branch(%arg0: i1) { |
| %0 = alloc() : memref<2xf32> // aliases: %2 |
| cond_br %arg0, ^bb1, ^bb2 |
| ^bb1: |
| %1 = alloc() : memref<2xf32> // resides here for demonstration purposes |
| // aliases: %2 |
| br ^bb3(%1 : memref<2xf32>) |
| ^bb2: |
| use(%0) |
| br ^bb3(%0 : memref<2xf32>) |
| ^bb3(%2: memref<2xf32>): |
| … |
| return |
| } |
| ``` |
| |
The first alloc can be safely freed after the live range of its post-dominator
block (bb3). The alloc in bb1 has an alias %2 in bb3 that also keeps this
buffer alive until the end of bb3. Since we cannot determine the actual
branches that will be taken at runtime, we have to ensure that all buffers are
freed correctly in bb3 regardless of the branches we take to reach the exit
block. This makes it necessary to introduce a copy for %2, which allows us to
free the allocs %0 (in bb0) and %1 (in bb1). Afterwards, we can continue
processing all aliases of %2 (none in this case) and we can safely free %2 at
the end of the sample program. This sample demonstrates that not all
allocations can be safely freed in their associated post-dominator blocks.
Instead, we have to pay attention to all of their aliases.
| |
| Applying the BufferDeallocation pass to the program above yields the following |
| result: |
| |
| ```mlir |
| func @branch(%arg0: i1) { |
| %0 = alloc() : memref<2xf32> |
| cond_br %arg0, ^bb1, ^bb2 |
| ^bb1: |
| %1 = alloc() : memref<2xf32> |
| %3 = alloc() : memref<2xf32> // temp copy for %1 |
| "linalg.copy"(%1, %3) : (memref<2xf32>, memref<2xf32>) -> () |
| dealloc %1 : memref<2xf32> // %1 can be safely freed here |
| br ^bb3(%3 : memref<2xf32>) |
| ^bb2: |
| use(%0) |
| %4 = alloc() : memref<2xf32> // temp copy for %0 |
| "linalg.copy"(%0, %4) : (memref<2xf32>, memref<2xf32>) -> () |
| br ^bb3(%4 : memref<2xf32>) |
| ^bb3(%2: memref<2xf32>): |
| … |
| dealloc %2 : memref<2xf32> // free temp buffer %2 |
| dealloc %0 : memref<2xf32> // %0 can be safely freed here |
| return |
| } |
| ``` |
| |
Note that a temporary buffer for %2 was introduced to free all allocations
properly. Note further that the unnecessary allocation of %3 can easily be
removed using one of the post-pass transformations.
| |
| Reconsider the previously introduced sample demonstrating dynamically shaped |
| types: |
| |
| ```mlir |
| func @condBranchDynamicType( |
| %arg0: i1, |
| %arg1: memref<?xf32>, |
| %arg2: memref<?xf32>, |
| %arg3: index) { |
| cond_br %arg0, ^bb1, ^bb2(%arg3: index) |
| ^bb1: |
| br ^bb3(%arg1 : memref<?xf32>) |
| ^bb2(%0: index): |
| %1 = alloc(%0) : memref<?xf32> // aliases: %2 |
| use(%1) |
| br ^bb3(%1 : memref<?xf32>) |
| ^bb3(%2: memref<?xf32>): |
| "linalg.copy"(%2, %arg2) : (memref<?xf32>, memref<?xf32>) -> () |
| return |
| } |
| ``` |
| |
In the presence of dynamically shaped types, we have to parameterize the
allocations with additional dimension information of the source buffers we
want to copy from. BufferDeallocation automatically introduces all required
operations to extract dimension specifications and wires them with the
associated allocations:
| |
| ```mlir |
| func @condBranchDynamicType( |
| %arg0: i1, |
| %arg1: memref<?xf32>, |
| %arg2: memref<?xf32>, |
| %arg3: index) { |
| cond_br %arg0, ^bb1, ^bb2(%arg3 : index) |
| ^bb1: |
| %c0 = constant 0 : index |
| %0 = dim %arg1, %c0 : memref<?xf32> // dimension operation to parameterize |
| // the following temp allocation |
| %1 = alloc(%0) : memref<?xf32> |
| "linalg.copy"(%arg1, %1) : (memref<?xf32>, memref<?xf32>) -> () |
| br ^bb3(%1 : memref<?xf32>) |
| ^bb2(%2: index): |
| %3 = alloc(%2) : memref<?xf32> |
| use(%3) |
| %c0_0 = constant 0 : index |
| %4 = dim %3, %c0_0 : memref<?xf32> // dimension operation to parameterize |
| // the following temp allocation |
| %5 = alloc(%4) : memref<?xf32> |
| "linalg.copy"(%3, %5) : (memref<?xf32>, memref<?xf32>) -> () |
| dealloc %3 : memref<?xf32> // %3 can be safely freed here |
| br ^bb3(%5 : memref<?xf32>) |
| ^bb3(%6: memref<?xf32>): |
| "linalg.copy"(%6, %arg2) : (memref<?xf32>, memref<?xf32>) -> () |
| dealloc %6 : memref<?xf32> // %6 can be safely freed here |
| return |
| } |
| ``` |
| |
| BufferDeallocation performs a fix-point iteration taking all aliases of all |
| tracked allocations into account. We initialize the general iteration process |
| using all tracked allocations and their associated aliases. As soon as we |
| encounter an alias that is not properly dominated by our allocation, we mark |
| this alias as _critical_ (needs to be freed and tracked by the internal |
| fix-point iteration). The following sample demonstrates the presence of |
| critical and non-critical aliases: |
| |
|  |
| |
| ```mlir |
| func @condBranchDynamicTypeNested( |
| %arg0: i1, |
| %arg1: memref<?xf32>, // aliases: %3, %4 |
| %arg2: memref<?xf32>, |
| %arg3: index) { |
| cond_br %arg0, ^bb1, ^bb2(%arg3: index) |
| ^bb1: |
| br ^bb6(%arg1 : memref<?xf32>) |
| ^bb2(%0: index): |
  %1 = alloc(%0) : memref<?xf32> // cannot be moved upwards due to the data
                                 // dependency on %0
                                 // aliases: %2, %3, %4
| use(%1) |
| cond_br %arg0, ^bb3, ^bb4 |
| ^bb3: |
| br ^bb5(%1 : memref<?xf32>) |
| ^bb4: |
| br ^bb5(%1 : memref<?xf32>) |
| ^bb5(%2: memref<?xf32>): // non-crit. alias of %1, since %1 dominates %2 |
| br ^bb6(%2 : memref<?xf32>) |
| ^bb6(%3: memref<?xf32>): // crit. alias of %arg1 and %2 (in other words %1) |
| br ^bb7(%3 : memref<?xf32>) |
| ^bb7(%4: memref<?xf32>): // non-crit. alias of %3, since %3 dominates %4 |
| "linalg.copy"(%4, %arg2) : (memref<?xf32>, memref<?xf32>) -> () |
| return |
| } |
| ``` |
| |
| Applying BufferDeallocation yields the following output: |
| |
|  |
| |
| ```mlir |
| func @condBranchDynamicTypeNested( |
| %arg0: i1, |
| %arg1: memref<?xf32>, |
| %arg2: memref<?xf32>, |
| %arg3: index) { |
| cond_br %arg0, ^bb1, ^bb2(%arg3 : index) |
| ^bb1: |
| %c0 = constant 0 : index |
| %d0 = dim %arg1, %c0 : memref<?xf32> |
| %5 = alloc(%d0) : memref<?xf32> // temp buffer required due to alias %3 |
| "linalg.copy"(%arg1, %5) : (memref<?xf32>, memref<?xf32>) -> () |
| br ^bb6(%5 : memref<?xf32>) |
| ^bb2(%0: index): |
| %1 = alloc(%0) : memref<?xf32> |
| use(%1) |
| cond_br %arg0, ^bb3, ^bb4 |
| ^bb3: |
| br ^bb5(%1 : memref<?xf32>) |
| ^bb4: |
| br ^bb5(%1 : memref<?xf32>) |
| ^bb5(%2: memref<?xf32>): |
| %c0_0 = constant 0 : index |
| %d1 = dim %2, %c0_0 : memref<?xf32> |
| %6 = alloc(%d1) : memref<?xf32> // temp buffer required due to alias %3 |
| "linalg.copy"(%1, %6) : (memref<?xf32>, memref<?xf32>) -> () |
| dealloc %1 : memref<?xf32> |
| br ^bb6(%6 : memref<?xf32>) |
| ^bb6(%3: memref<?xf32>): |
| br ^bb7(%3 : memref<?xf32>) |
| ^bb7(%4: memref<?xf32>): |
| "linalg.copy"(%4, %arg2) : (memref<?xf32>, memref<?xf32>) -> () |
| dealloc %3 : memref<?xf32> // free %3, since %4 is a non-crit. alias of %3 |
| return |
| } |
| ``` |
| |
| Since %3 is a critical alias, BufferDeallocation introduces an additional |
| temporary copy in all predecessor blocks. %3 has an additional (non-critical) |
| alias %4 that extends the live range until the end of bb7. Therefore, we can |
| free %3 after its last use, while taking all aliases into account. Note that %4 |
| does not need to be freed, since we did not introduce a copy for it. |
| |
| The actual introduction of buffer copies is done after the fix-point iteration |
| has been terminated and all critical aliases have been detected. A critical |
| alias can be either a block argument or another value that is returned by an |
| operation. Copies for block arguments are handled by analyzing all predecessor |
| blocks. This is primarily done by querying the `BranchOpInterface` of the |
| associated branch terminators that can jump to the current block. Consider the |
| following example which involves a simple branch and the critical block |
| argument %2: |
| |
| ```mlir |
custom.br ^bb1(..., %0 : ...)
...
custom.br ^bb1(..., %1 : ...)
| ... |
| ^bb1(%2: memref<2xf32>): |
| ... |
| ``` |
| |
| The `BranchOpInterface` allows us to determine the actual values that will be |
| passed to block bb1 and its argument %2 by analyzing its predecessor blocks. |
| Once we have resolved the values %0 and %1 (that are associated with %2 in this |
| sample), we can introduce a temporary buffer and clone its contents into the |
| new buffer. Afterwards, we rewire the branch operands to use the newly |
allocated buffer instead. However, blocks can have implicit predecessors
defined by parent ops that implement the `RegionBranchOpInterface`. This can
be the case if a block argument belongs to the entry block of a region. In
this setting, we have to identify all predecessor regions defined by the
parent operation. For every region, we need to get all terminator operations
implementing the `ReturnLike` trait, indicating that they can branch to our
current block. Finally, we can use similar functionality as described above to
add the temporary copy. This time, we can modify the terminator operands
directly without touching a high-level interface.
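
For the simple-branch case, resolving the values that flow into a critical
block argument can be sketched as follows. This assumes the
`BranchOpInterface` of the time (its signature has changed across MLIR
versions); `resolvePredecessorValues` is a hypothetical helper, not the pass's
actual implementation:

```c++
/// Collects all values that predecessor terminators pass to `blockArg`.
static void resolvePredecessorValues(BlockArgument blockArg,
                                     SmallVectorImpl<Value> &values) {
  Block *block = blockArg.getOwner();
  for (Block *pred : block->getPredecessors()) {
    Operation *terminator = pred->getTerminator();
    auto branchInterface = dyn_cast<BranchOpInterface>(terminator);
    if (!branchInterface)
      continue;
    // Check every successor edge of the terminator that targets our block and
    // record the operand that is forwarded to `blockArg`.
    for (unsigned i = 0, e = terminator->getNumSuccessors(); i < e; ++i) {
      if (terminator->getSuccessor(i) != block)
        continue;
      if (auto operands = branchInterface.getSuccessorOperands(i))
        values.push_back((*operands)[blockArg.getArgNumber()]);
    }
  }
}
```

Once these values are known, the temporary buffer can be allocated and the
branch operands rewired as described above.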
| |
| Consider the following inner-region control-flow sample that uses an imaginary |
| “custom.region_if” operation. It either executes the “then” or “else” region |
| and always continues to the “join” region. The “custom.region_if_yield” |
| operation returns a result to the parent operation. This sample demonstrates |
| the use of the `RegionBranchOpInterface` to determine predecessors in order to |
| infer the high-level control flow: |
| |
| ```mlir |
| func @inner_region_control_flow( |
| %arg0 : index, |
| %arg1 : index) -> memref<?x?xf32> { |
| %0 = alloc(%arg0, %arg0) : memref<?x?xf32> |
| %1 = custom.region_if %0 : memref<?x?xf32> -> (memref<?x?xf32>) |
| then(%arg2 : memref<?x?xf32>) { // aliases: %arg4, %1 |
| custom.region_if_yield %arg2 : memref<?x?xf32> |
| } else(%arg3 : memref<?x?xf32>) { // aliases: %arg4, %1 |
| custom.region_if_yield %arg3 : memref<?x?xf32> |
| } join(%arg4 : memref<?x?xf32>) { // aliases: %1 |
| custom.region_if_yield %arg4 : memref<?x?xf32> |
| } |
| return %1 : memref<?x?xf32> |
| } |
| ``` |
| |
|  |
| |
Non-block arguments (other values) can become aliases when they are returned
by dialect-specific operations. BufferDeallocation supports this behavior via
the `RegionBranchOpInterface`. Consider the following example that uses an
“scf.if” operation to determine the value of %2 at runtime, which creates an
alias:
| |
| ```mlir |
| func @nested_region_control_flow(%arg0 : index, %arg1 : index) -> memref<?x?xf32> { |
| %0 = cmpi "eq", %arg0, %arg1 : index |
| %1 = alloc(%arg0, %arg0) : memref<?x?xf32> |
| %2 = scf.if %0 -> (memref<?x?xf32>) { |
| scf.yield %1 : memref<?x?xf32> // %2 will be an alias of %1 |
| } else { |
| %3 = alloc(%arg0, %arg1) : memref<?x?xf32> // nested allocation in a div. |
| // branch |
| use(%3) |
| scf.yield %1 : memref<?x?xf32> // %2 will be an alias of %1 |
| } |
| return %2 : memref<?x?xf32> |
| } |
| ``` |
| |
In this example, a dealloc is inserted to release the buffer within the else
block, since it cannot be accessed by the remainder of the program. Accessing
the `RegionBranchOpInterface` allows us to infer that %2 is a non-critical
alias of %1 which does not need to be tracked.
| |
| ```mlir |
| func @nested_region_control_flow(%arg0: index, %arg1: index) -> memref<?x?xf32> { |
| %0 = cmpi "eq", %arg0, %arg1 : index |
| %1 = alloc(%arg0, %arg0) : memref<?x?xf32> |
| %2 = scf.if %0 -> (memref<?x?xf32>) { |
| scf.yield %1 : memref<?x?xf32> |
| } else { |
| %3 = alloc(%arg0, %arg1) : memref<?x?xf32> |
| use(%3) |
| dealloc %3 : memref<?x?xf32> // %3 can be safely freed here |
| scf.yield %1 : memref<?x?xf32> |
| } |
| return %2 : memref<?x?xf32> |
| } |
| ``` |
| |
Analogous to the previous case, we have to detect all terminator operations in
all attached regions of “scf.if” that provide a value to their parent
operation (in this sample via scf.yield). Querying the
`RegionBranchOpInterface` allows us to determine the regions that “return” a
result to their parent operation. Like before, we have to update all
`ReturnLike` terminators as described above. Reconsider a slightly adapted
version of the “custom.region_if” example from above that uses a nested
allocation:
| |
| ```mlir |
| func @inner_region_control_flow_div( |
| %arg0 : index, |
| %arg1 : index) -> memref<?x?xf32> { |
| %0 = alloc(%arg0, %arg0) : memref<?x?xf32> |
| %1 = custom.region_if %0 : memref<?x?xf32> -> (memref<?x?xf32>) |
| then(%arg2 : memref<?x?xf32>) { // aliases: %arg4, %1 |
| custom.region_if_yield %arg2 : memref<?x?xf32> |
| } else(%arg3 : memref<?x?xf32>) { |
| %2 = alloc(%arg0, %arg1) : memref<?x?xf32> // aliases: %arg4, %1 |
| custom.region_if_yield %2 : memref<?x?xf32> |
| } join(%arg4 : memref<?x?xf32>) { // aliases: %1 |
| custom.region_if_yield %arg4 : memref<?x?xf32> |
| } |
| return %1 : memref<?x?xf32> |
| } |
| ``` |
| |
| Since the allocation %2 happens in a divergent branch and cannot be safely |
| deallocated in a post-dominator, %arg4 will be considered a critical alias. |
| Furthermore, %arg4 is returned to its parent operation and has an alias %1. |
| This causes BufferDeallocation to introduce additional copies: |
| |
| ```mlir |
| func @inner_region_control_flow_div( |
| %arg0 : index, |
| %arg1 : index) -> memref<?x?xf32> { |
| %0 = alloc(%arg0, %arg0) : memref<?x?xf32> |
| %1 = custom.region_if %0 : memref<?x?xf32> -> (memref<?x?xf32>) |
| then(%arg2 : memref<?x?xf32>) { |
| %c0 = constant 0 : index // determine dimension extents for temp allocation |
| %2 = dim %arg2, %c0 : memref<?x?xf32> |
| %c1 = constant 1 : index |
| %3 = dim %arg2, %c1 : memref<?x?xf32> |
| %4 = alloc(%2, %3) : memref<?x?xf32> // temp buffer required due to critic. |
| // alias %arg4 |
| linalg.copy(%arg2, %4) : memref<?x?xf32>, memref<?x?xf32> |
| custom.region_if_yield %4 : memref<?x?xf32> |
| } else(%arg3 : memref<?x?xf32>) { |
| %2 = alloc(%arg0, %arg1) : memref<?x?xf32> |
| %c0 = constant 0 : index // determine dimension extents for temp allocation |
| %3 = dim %2, %c0 : memref<?x?xf32> |
| %c1 = constant 1 : index |
| %4 = dim %2, %c1 : memref<?x?xf32> |
| %5 = alloc(%3, %4) : memref<?x?xf32> // temp buffer required due to critic. |
| // alias %arg4 |
| linalg.copy(%2, %5) : memref<?x?xf32>, memref<?x?xf32> |
| dealloc %2 : memref<?x?xf32> |
| custom.region_if_yield %5 : memref<?x?xf32> |
| } join(%arg4: memref<?x?xf32>) { |
| %c0 = constant 0 : index // determine dimension extents for temp allocation |
| %2 = dim %arg4, %c0 : memref<?x?xf32> |
| %c1 = constant 1 : index |
| %3 = dim %arg4, %c1 : memref<?x?xf32> |
| %4 = alloc(%2, %3) : memref<?x?xf32> // this allocation will be removed by |
| // applying the copy removal pass |
| linalg.copy(%arg4, %4) : memref<?x?xf32>, memref<?x?xf32> |
| dealloc %arg4 : memref<?x?xf32> |
| custom.region_if_yield %4 : memref<?x?xf32> |
| } |
| dealloc %0 : memref<?x?xf32> // %0 can be safely freed here |
| return %1 : memref<?x?xf32> |
| } |
| ``` |
| |
| ### Placement of Deallocs |
| |
After introducing allocs and copies, deallocs have to be placed to free
allocated memory and avoid memory leaks. The deallocation needs to take place
after the last use of the given value. The position can be determined by
calculating the common post-dominator of all values using their remaining
non-critical aliases. A special case is the presence of back edges, since such
edges can cause memory leaks when a newly allocated buffer flows back to
another part of the program. In these cases, we need to free the associated
buffer instances from the previous iteration by inserting additional deallocs.
| |
| Consider the following “scf.for” use case containing a nested structured |
| control-flow if: |
| |
| ```mlir |
| func @loop_nested_if( |
| %lb: index, |
| %ub: index, |
| %step: index, |
| %buf: memref<2xf32>, |
| %res: memref<2xf32>) { |
| %0 = scf.for %i = %lb to %ub step %step |
| iter_args(%iterBuf = %buf) -> memref<2xf32> { |
| %1 = cmpi "eq", %i, %ub : index |
| %2 = scf.if %1 -> (memref<2xf32>) { |
| %3 = alloc() : memref<2xf32> // makes %2 a critical alias due to a |
| // divergent allocation |
| use(%3) |
| scf.yield %3 : memref<2xf32> |
| } else { |
| scf.yield %iterBuf : memref<2xf32> |
| } |
| scf.yield %2 : memref<2xf32> |
| } |
| "linalg.copy"(%0, %res) : (memref<2xf32>, memref<2xf32>) -> () |
| return |
| } |
| ``` |
| |
| In this example, the _then_ branch of the nested “scf.if” operation returns a |
| newly allocated buffer. |
| |
| Since this allocation happens in the scope of a divergent branch, %2 becomes a |
| critical alias that needs to be handled. As before, we have to insert |
| additional copies to eliminate this alias using copies of %3 and %iterBuf. This |
| guarantees that %2 will be a newly allocated buffer that is returned in each |
| iteration. However, “returning” %2 to its alias %iterBuf turns %iterBuf into a |
| critical alias as well. In other words, we have to create a copy of %2 to pass |
| it to %iterBuf. Since this jump represents a back edge, and %2 will always be a |
| new buffer, we have to free the buffer from the previous iteration to avoid |
| memory leaks: |
| |
| ```mlir |
| func @loop_nested_if( |
| %lb: index, |
| %ub: index, |
| %step: index, |
| %buf: memref<2xf32>, |
| %res: memref<2xf32>) { |
| %4 = alloc() : memref<2xf32> |
| "linalg.copy"(%buf, %4) : (memref<2xf32>, memref<2xf32>) -> () |
| %0 = scf.for %i = %lb to %ub step %step |
| iter_args(%iterBuf = %4) -> memref<2xf32> { |
| %1 = cmpi "eq", %i, %ub : index |
| %2 = scf.if %1 -> (memref<2xf32>) { |
| %3 = alloc() : memref<2xf32> // makes %2 a critical alias |
| use(%3) |
| %5 = alloc() : memref<2xf32> // temp copy due to crit. alias %2 |
| "linalg.copy"(%3, %5) : memref<2xf32>, memref<2xf32> |
| dealloc %3 : memref<2xf32> |
| scf.yield %5 : memref<2xf32> |
| } else { |
| %6 = alloc() : memref<2xf32> // temp copy due to crit. alias %2 |
| "linalg.copy"(%iterBuf, %6) : memref<2xf32>, memref<2xf32> |
| scf.yield %6 : memref<2xf32> |
| } |
| %7 = alloc() : memref<2xf32> // temp copy due to crit. alias %iterBuf |
| "linalg.copy"(%2, %7) : memref<2xf32>, memref<2xf32> |
| dealloc %2 : memref<2xf32> |
| dealloc %iterBuf : memref<2xf32> // free backedge iteration variable |
| scf.yield %7 : memref<2xf32> |
| } |
| "linalg.copy"(%0, %res) : (memref<2xf32>, memref<2xf32>) -> () |
| dealloc %0 : memref<2xf32> // free temp copy %0 |
| return |
| } |
| ``` |
| |
This is an example of loop-like control flow. The CFG contains back edges that
have to be handled to avoid memory leaks. The bufferization is able to free
the backedge iteration variable %iterBuf.
| |
| ### Private Analyses Implementations |
| |
| The BufferDeallocation transformation relies on one primary control-flow |
| analysis: BufferPlacementAliasAnalysis. Furthermore, we also use dominance and |
| liveness to place and move nodes. The liveness analysis determines the live |
| range of a given value. Within this range, a value is alive and can or will be |
| used in the course of the program. After this range, the value is dead and can |
| be discarded - in our case, the buffer can be freed. To place the allocs, we |
| need to know from which position a value will be alive. The allocs have to be |
| placed in front of this position. However, the most important analysis is the |
| alias analysis that is needed to introduce copies and to place all |
| deallocations. |
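
To illustrate the liveness queries involved, the following sketch (assuming
MLIR's `Liveness` analysis and the standard `DeallocOp`; the helper
`insertDeallocAfterLastUse` is hypothetical and deliberately ignores aliases
and control flow) places a dealloc after the last use of an allocation within
its defining block:

```c++
#include "mlir/Analysis/Liveness.h"

/// Inserts a dealloc for `alloc` right after its last use in its own block.
static void insertDeallocAfterLastUse(Value alloc, OpBuilder &builder) {
  Block *block = alloc.getParentBlock();
  Liveness liveness(block->getParentOp());
  const LivenessBlockInfo *info = liveness.getLiveness(block);
  // Determine the last operation in this block after which `alloc` is dead.
  Operation *endOfLiveRange =
      info->getEndOperation(alloc, alloc.getDefiningOp());
  builder.setInsertionPointAfter(endOfLiveRange);
  builder.create<DeallocOp>(alloc.getLoc(), alloc);
}
```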
| |
| ## Post Phase |
| |
| In order to limit the complexity of the BufferDeallocation transformation, some |
| tiny code-polishing/optimization transformations are not applied on-the-fly |
| during placement. Currently, there is only the CopyRemoval transformation to |
| remove unnecessary copy and allocation operations. |
| |
| Note: further transformations might be added to the post-pass phase in the |
| future. |
| |
| ### CopyRemoval Pass |
| |
| A common pattern that arises during placement is the introduction of |
| unnecessary temporary copies that are used instead of the original source |
| buffer. For this reason, there is a post-pass transformation that removes these |
| allocations and copies via `-copy-removal`. This pass, besides removing |
| unnecessary copy operations, will also remove the dead allocations and their |
corresponding deallocation operations. The CopyRemoval pass can currently be
applied to operations that implement the `CopyOpInterface` in either of the
following two situations:
| |
| * reusing the source buffer of the copy operation. |
| * reusing the target buffer of the copy operation. |
| |
| ### Reusing the Source Buffer of the Copy Operation |
| |
In this case, the source of the copy operation can be used instead of its
target. The unused allocation and deallocation operations that are defined for
this copy operation are also removed. Below is a working example generated by
the BufferDeallocation pass that allocates a buffer with dynamic size. A
closer analysis of this sample reveals that the allocation of %10, the copy,
and the dealloc of %7 are redundant and can be removed.
| |
| ```mlir |
| func @dynamic_allocation(%arg0: index, %arg1: index) -> memref<?x?xf32> { |
| %7 = alloc(%arg0, %arg1) : memref<?x?xf32> |
| %c0_0 = constant 0 : index |
| %8 = dim %7, %c0_0 : memref<?x?xf32> |
| %c1_1 = constant 1 : index |
| %9 = dim %7, %c1_1 : memref<?x?xf32> |
| %10 = alloc(%8, %9) : memref<?x?xf32> |
| linalg.copy(%7, %10) : memref<?x?xf32>, memref<?x?xf32> |
| dealloc %7 : memref<?x?xf32> |
| return %10 : memref<?x?xf32> |
| } |
| ``` |
| |
| Will be transformed to: |
| |
| ```mlir |
| func @dynamic_allocation(%arg0: index, %arg1: index) -> memref<?x?xf32> { |
| %7 = alloc(%arg0, %arg1) : memref<?x?xf32> |
| %c0_0 = constant 0 : index |
| %8 = dim %7, %c0_0 : memref<?x?xf32> |
| %c1_1 = constant 1 : index |
| %9 = dim %7, %c1_1 : memref<?x?xf32> |
| return %7 : memref<?x?xf32> |
| } |
| ``` |
| |
In this case, the additional copy %10 can be replaced with its original source
buffer %7. The associated allocation of %10 and the dealloc of %7 are removed
as well.
| |
To limit the complexity of this transformation, it only removes copy
operations when the following constraints are met (a simplified check is
sketched after this list):
| |
| * The copy operation, the defining operation for the target value, and the |
| deallocation of the source value lie in the same block. |
| * There are no users/aliases of the target value between the defining operation |
| of the target value and its copy operation. |
| * There are no users/aliases of the source value between its associated copy |
| operation and the deallocation of the source value. |
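
The simplified check sketched below captures the spirit of the last two
constraints; `hasUsersBetween` is a hypothetical helper, and the actual pass
additionally has to take aliases of the involved values into account:

```c++
/// Returns true if any operation strictly between `start` and `end` (both in
/// the same block) uses `value` directly. Aliases are ignored in this sketch.
static bool hasUsersBetween(Value value, Operation *start, Operation *end) {
  for (Operation *op = start->getNextNode(); op && op != end;
       op = op->getNextNode())
    for (Value operand : op->getOperands())
      if (operand == value)
        return true;
  return false;
}
```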
| |
| ### Reusing the Target Buffer of the Copy Operation |
| |
| In this case, the target buffer of the copy operation can be used instead of |
| its source. The unused allocation and deallocation operations that are defined |
| for this copy operation are also removed. |
| |
Consider the following example, in which a generic linalg operation writes its
result to %temp and then copies %temp to %result. The two operations can be
merged into a single step: copy removal removes the copy operation and %temp,
and replaces the uses of %temp with %result:
| |
| ```mlir |
func @reuseTarget(%arg0: memref<2xf32>, %result: memref<2xf32>) {
| %temp = alloc() : memref<2xf32> |
| linalg.generic { |
| args_in = 1 : i64, |
| args_out = 1 : i64, |
| indexing_maps = [#map0, #map0], |
| iterator_types = ["parallel"]} %arg0, %temp { |
| ^bb0(%gen2_arg0: f32, %gen2_arg1: f32): |
| %tmp2 = exp %gen2_arg0 : f32 |
| linalg.yield %tmp2 : f32 |
| }: memref<2xf32>, memref<2xf32> |
| "linalg.copy"(%temp, %result) : (memref<2xf32>, memref<2xf32>) -> () |
| dealloc %temp : memref<2xf32> |
| return |
| } |
| ``` |
| |
| Will be transformed to: |
| |
| ```mlir |
func @reuseTarget(%arg0: memref<2xf32>, %result: memref<2xf32>) {
| linalg.generic { |
| args_in = 1 : i64, |
| args_out = 1 : i64, |
| indexing_maps = [#map0, #map0], |
| iterator_types = ["parallel"]} %arg0, %result { |
| ^bb0(%gen2_arg0: f32, %gen2_arg1: f32): |
| %tmp2 = exp %gen2_arg0 : f32 |
| linalg.yield %tmp2 : f32 |
| }: memref<2xf32>, memref<2xf32> |
| return |
| } |
| ``` |
| |
As before, the transformation is only applied if several constraints are met:
| |
| * The copy operation, the defining operation of the source value, and the |
| deallocation of the source value lie in the same block. |
| * There are no users/aliases of the target value between the defining operation |
| of the source value and the copy operation. |
| * There are no users/aliases of the source value between the copy operation and |
| the deallocation of the source value. |
| |
| ## Known Limitations |
| |
BufferDeallocation introduces additional copies using allocations from the
“std” dialect (“std.alloc”). Analogously, all deallocations use the “std”
dialect operation “std.dealloc”. The actual copy process is realized using
“linalg.copy”. Furthermore, buffers are essentially immutable after their
creation in a block. Other limitations are known in the case of unstructured
control flow.