Jordan Rose | eaf9c74 | 2013-02-05 17:31:34 +0000 | [diff] [blame] | 1 | The analyzer "Store" represents the contents of memory regions. It is an opaque |
| 2 | functional data structure stored in each ProgramState; the only class that can |
| 3 | modify the store is its associated StoreManager. |
| 4 | |
| 5 | Currently (Feb. 2013), the only StoreManager implementation being used is |
| 6 | RegionStoreManager. This store records bindings to memory regions using a "base |
| 7 | region + offset" key. (This allows `*p` and `p[0]` to map to the same location, |
| 8 | among other benefits.) |
| 9 | |
| 10 | Regions are grouped into "clusters", which roughly correspond to "regions with |
| 11 | the same base region". This allows certain operations to be more efficient, |
| 12 | such as invalidation. |
| 13 | |
| 14 | Regions that do not have a known offset use a special "symbolic" offset. These |
| 15 | keys store both the original region, and the "concrete offset region" -- the |
| 16 | last region whose offset is entirely concrete. (For example, in the expression |
| 17 | `foo.bar[1][i].baz`, the concrete offset region is the array `foo.bar[1]`, |
| 18 | since that has a known offset from the start of the top-level `foo` struct.) |
| 19 | |
| 20 | |
| 21 | Binding Invalidation |
| 22 | ==================== |
| 23 | |
| 24 | Supporting both concrete and symbolic offsets makes things a bit tricky. Here's |
| 25 | an example: |
| 26 | |
| 27 | foo[0] = 0; |
| 28 | foo[1] = 1; |
| 29 | foo[i] = i; |
| 30 | |
| 31 | After the third assignment, nothing can be said about the value of `foo[0]`, |
| 32 | because `foo[i]` may have overwritten it! Thus, *binding to a region with a |
| 33 | symbolic offset invalidates the entire concrete offset region.* We know |
| 34 | `foo[i]` is somewhere within `foo`, so we don't have to invalidate anything |
| 35 | else, but we do have to be conservative about all other bindings within `foo`. |
| 36 | |
| 37 | Continuing the example: |
| 38 | |
| 39 | foo[i] = i; |
| 40 | foo[0] = 0; |
| 41 | |
| 42 | After this latest assignment, nothing can be said about the value of `foo[i]`, |
| 43 | because `foo[0]` may have overwritten it! *Binding to a region R with a |
| 44 | concrete offset invalidates any symbolic offset bindings whose concrete offset |
| 45 | region is a super-region **or** sub-region of R.* All we know about `foo[i]` is |
| 46 | that it is somewhere within `foo`, so changing *anything* within `foo` might |
| 47 | change `foo[i]`, and changing *all* of `foo` (or its base region) will |
| 48 | *definitely* change `foo[i]`. |
| 49 | |
| 50 | This logic could be improved by using the current constraints on `i`, at the |
| 51 | cost of speed. The latter case could also be improved by matching region kinds, |
| 52 | i.e. changing `foo[0].a` is unlikely to affect `foo[i].b`, no matter what `i` |
| 53 | is. |
| 54 | |
| 55 | For more detail, read through RegionStoreManager::removeSubRegionBindings in |
| 56 | RegionStore.cpp. |
| 57 | |
| 58 | |
| 59 | ObjCIvarRegions |
| 60 | =============== |
| 61 | |
| 62 | Objective-C instance variables require a bit of special handling. Like struct |
| 63 | fields, they are not base regions, and when their parent object region is |
| 64 | invalidated, all the instance variables must be invalidated as well. However, |
| 65 | they have no concrete compile-time offsets (in the modern, "non-fragile" |
| 66 | runtime), and so cannot easily be represented as an offset from the start of |
| 67 | the object in the analyzer. Moreover, this means that invalidating a single |
| 68 | instance variable should *not* invalidate the rest of the object, since unlike |
| 69 | struct fields or array elements there is no way to perform pointer arithmetic |
| 70 | to access another instance variable. |
| 71 | |
| 72 | Consequently, although the base region of an ObjCIvarRegion is the entire |
| 73 | object, RegionStore offsets are computed from the start of the instance |
| 74 | variable. Thus it is not valid to assume that all bindings with non-symbolic |
| 75 | offsets start from the base region! |
| 76 | |
| 77 | |
| 78 | Region Invalidation |
| 79 | =================== |
| 80 | |
| 81 | Unlike binding invalidation, region invalidation occurs when the entire |
| 82 | contents of a region may have changed---say, because it has been passed to a |
| 83 | function the analyzer can model, like memcpy, or because its address has |
| 84 | escaped, usually as an argument to an opaque function call. In these cases we |
| 85 | need to throw away not just all bindings within the region itself, but within |
| 86 | its entire cluster, since neighboring regions may be accessed via pointer |
| 87 | arithmetic. |
| 88 | |
| 89 | Region invalidation typically does even more than this, however. Because it |
| 90 | usually represents the complete escape of a region from the analyzer's model, |
| 91 | its *contents* must also be transitively invalidated. (For example, if a region |
| 92 | 'p' of type 'int **' is invalidated, the contents of '*p' and '**p' may have |
| 93 | changed as well.) The algorithm that traverses this transitive closure of |
| 94 | accessible regions is known as ClusterAnalysis, and is also used for finding |
| 95 | all live bindings in the store (in order to throw away the dead ones). The name |
| 96 | "ClusterAnalysis" predates the cluster-based organization of bindings, but |
| 97 | refers to the same concept: during invalidation and liveness analysis, all |
| 98 | bindings within a cluster must be treated in the same way for a conservative |
| 99 | model of program behavior. |
| 100 | |
| 101 | |
| 102 | Default Bindings |
| 103 | ================ |
| 104 | |
| 105 | Most bindings in RegionStore are simple scalar values -- integers and pointers. |
| 106 | These are known as "Direct" bindings. However, RegionStore supports a second |
| 107 | type of binding called a "Default" binding. These are used to provide values to |
| 108 | all the elements of an aggregate type (struct or array) without having to |
| 109 | explicitly specify a binding for each individual element. |
| 110 | |
| 111 | When there is no Direct binding for a particular region, the store manager |
| 112 | looks at each super-region in turn to see if there is a Default binding. If so, |
| 113 | this value is used as the value of the original region. The search ends when |
| 114 | the base region is reached, at which point the RegionStore will pick an |
| 115 | appropriate default value for the region (usually a symbolic value, but |
| 116 | sometimes zero, for static data, or "uninitialized", for stack variables). |
| 117 | |
| 118 | int manyInts[10]; |
| 119 | manyInts[1] = 42; // Creates a Direct binding for manyInts[1]. |
| 120 | print(manyInts[1]); // Retrieves the Direct binding for manyInts[1]; |
| 121 | print(manyInts[0]); // There is no Direct binding for manyInts[1]. |
| 122 | // Is there a Default binding for the entire array? |
| 123 | // There is not, but it is a stack variable, so we use |
| 124 | // "uninitialized" as the default value (and emit a |
| 125 | // diagnostic!). |
| 126 | |
| 127 | NOTE: The fact that bindings are stored as a base region plus an offset limits |
| 128 | the Default Binding strategy, because in C aggregates can contain other |
| 129 | aggregates. In the current implementation of RegionStore, there is no way to |
| 130 | distinguish a Default binding for an entire aggregate from a Default binding |
| 131 | for the sub-aggregate at offset 0. |
| 132 | |
| 133 | |
| 134 | Lazy Bindings (LazyCompoundVal) |
| 135 | =============================== |
| 136 | |
| 137 | RegionStore implements an optimization for copying aggregates (structs and |
| 138 | arrays) called "lazy bindings", implemented using a special SVal called |
| 139 | LazyCompoundVal. When the store is asked for the "binding" for an entire |
| 140 | aggregate (i.e. for an lvalue-to-rvalue conversion), it returns a |
| 141 | LazyCompoundVal instead. When this value is then stored into a variable, it is |
| 142 | bound as a Default value. This makes copying arrays and structs much cheaper |
| 143 | than if they had required memberwise access. |
| 144 | |
| 145 | Under the hood, a LazyCompoundVal is implemented as a uniqued pair of (region, |
| 146 | store), representing "the value of the region during this 'snapshot' of the |
| 147 | store". This has important implications for any sort of liveness or |
| 148 | reachability analysis, which must take the bindings in the old store into |
| 149 | account. |
| 150 | |
| 151 | Retrieving a value from a lazy binding happens in the same way as any other |
| 152 | Default binding: since there is no direct binding, the store manager falls back |
| 153 | to super-regions to look for an appropriate default binding. LazyCompoundVal |
| 154 | differs from a normal default binding, however, in that it contains several |
| 155 | different values, instead of one value that will appear several times. Because |
| 156 | of this, the store manager has to reconstruct the subregion chain on top of the |
| 157 | LazyCompoundVal region, and look up *that* region in the previous store. |
| 158 | |
| 159 | Here's a concrete example: |
| 160 | |
| 161 | CGPoint p; |
| 162 | p.x = 42; // A Direct binding is made to the FieldRegion 'p.x'. |
| 163 | CGPoint p2 = p; // A LazyCompoundVal is created for 'p', along with a |
| 164 | // snapshot of the current store state. This value is then |
| 165 | // used as a Default binding for the VarRegion 'p2'. |
| 166 | return p2.x; // The binding for FieldRegion 'p2.x' is requested. |
| 167 | // There is no Direct binding, so we look for a Default |
| 168 | // binding to 'p2' and find the LCV. |
| 169 | // Because it's an LCV, we look at our requested region |
| 170 | // and see that it's the '.x' field. We ask for the value |
| 171 | // of 'p.x' within the snapshot, and get back 42. |