Ted Kremenek | 9d9963e | 2009-03-26 16:19:54 +0000 | [diff] [blame] | 1 | Static Analyzer: 'Regions' |
| 2 | -------------------------- |
Zhongxing Xu | 7fddc33 | 2009-03-26 08:23:58 +0000 | [diff] [blame] | 3 | |
Ted Kremenek | 9d9963e | 2009-03-26 16:19:54 +0000 | [diff] [blame] | 4 | INTRODUCTION |
Zhongxing Xu | 7fddc33 | 2009-03-26 08:23:58 +0000 | [diff] [blame] | 5 | |
Ted Kremenek | 9d9963e | 2009-03-26 16:19:54 +0000 | [diff] [blame] | 6 | The path-sensitive analysis engine in libAnalysis employs an extensible API |
| 7 | for abstractly modeling the memory of an analyzed program. This API employs |
| 8 | the concept of "memory regions" to abstractly model chunks of program memory |
| 9 | such as program variables and dynamically allocated memory such as those |
| 10 | returned from 'malloc' and 'alloca'. Regions are hierarchical, with subregions |
| 11 | modeling subtyping relationships, field and array offsets into larger chunks |
| 12 | of memory, and so on. |
Zhongxing Xu | 7fddc33 | 2009-03-26 08:23:58 +0000 | [diff] [blame] | 13 | |
Ted Kremenek | 9d9963e | 2009-03-26 16:19:54 +0000 | [diff] [blame] | 14 | The region API consists of two components. The first is the taxonomy and |
| 15 | representation of regions themselves within the analyzer engine. The primary |
| 16 | definitions and interfaces are described in |
| 17 | 'include/clang/Analysis/PathSensitive/MemRegion.h'. At the root of the region |
| 18 | hierarchy is the class 'MemRegion' with specific subclasses refining the |
| 19 | region concept for variables, heap allocated memory, and so forth. |
Zhongxing Xu | 7fddc33 | 2009-03-26 08:23:58 +0000 | [diff] [blame] | 20 | |
Ted Kremenek | 9d9963e | 2009-03-26 16:19:54 +0000 | [diff] [blame] | 21 | The second component in the region API is the modeling of the binding of |
| 22 | values to regions. For example, modeling the value stored to a local variable |
| 23 | 'x' consists of recording the binding between the region for 'x' (which |
| 24 | represents the raw memory associated with 'x') and the value stored to 'x'. |
| 25 | This binding relationship is captured with the notion of "symbolic stores." |
| 26 | |
| 27 | Symbolic stores, which can be thought of as representing the relation 'regions |
| 28 | -> values', are implemented by subclasses of the StoreManager class (Store.h). |
| 29 | A particular StoreManager implementation has complete flexibility concerning |
| 30 | (a) *how* to model the binding between regions and values and (b) *what* |
| 31 | bindings are recorded. Together, both points allow different StoreManagers to |
| 32 | tradeoff between different levels of analysis precision and scalability |
| 33 | concerning the reasoning of program memory. Meanwhile, the core path-sensitive |
| 34 | engine makes no assumptions about (a) or (b), and queries a StoreManager about |
| 35 | the bindings to a memory region through a generic interface that all |
| 36 | StoreManagers share. If a particular StoreManager cannot reason about the |
| 37 | potential bindings of a given memory region (e.g., 'BasicStoreManager' does |
| 38 | not reason about fields of structures) then the StoreManager can simply return |
| 39 | 'unknown' (represented by 'UnknownVal') for a particular region-binding. This |
| 40 | separation of concerns not only isolates the core analysis engine from the |
| 41 | details of reasoning about program memory but also facilities the option of a |
| 42 | client of the path-sensitive engine to easily swap in different StoreManager |
| 43 | implementations that internally reason about program memory in very different |
| 44 | ways. |
| 45 | |
| 46 | The rest of this document is divided into two parts. We first discuss region |
| 47 | taxonomy and the semantics of regions. We then discuss the StoreManager |
| 48 | interface, and details of how the currently available StoreManager classes |
| 49 | implement region bindings. |
| 50 | |
| 51 | MEMORY REGIONS and REGION TAXONOMY |
| 52 | |
Zhongxing Xu | 59eb5f9 | 2009-04-01 05:05:22 +0000 | [diff] [blame] | 53 | POINTERS |
| 54 | |
| 55 | Before talking about the memory regions, we would talk about the pointers |
| 56 | since memory regions are essentially used to represent pointer values. |
| 57 | |
| 58 | The pointer is a type of values. Pointer values have two semantic aspects. One |
| 59 | is its physical value, which is an address or location. The other is the type |
| 60 | of the memory object residing in the address. |
| 61 | |
| 62 | Memory regions are designed to abstract these two properties of the |
| 63 | pointer. The physical value of a pointer is represented by MemRegion |
| 64 | pointers. The rvalue type of the region corresponds to the type of the pointee |
| 65 | object. |
| 66 | |
| 67 | One complication is that we could have different view regions on the same |
| 68 | memory chunk. They represent the same memory location, but have different |
| 69 | abstract location, i.e., MemRegion pointers. Thus we need to canonicalize |
| 70 | the abstract locations to get a unique abstract location for one physical |
| 71 | location. |
| 72 | |
| 73 | Furthermore, these different view regions may or may not represent memory |
| 74 | objects of different types. Some different types are semantically the same, |
| 75 | for example, 'struct s' and 'my_type' are the same type. |
| 76 | struct s; |
| 77 | typedef struct s my_type; |
| 78 | |
| 79 | But 'char' and 'int' are not the same type in the code below: |
| 80 | void *p; |
| 81 | int *q = (int*) p; |
| 82 | char *r = (char*) p; |
| 83 | |
| 84 | Thus we need to canonicalize the MemRegion which is used in binding and |
| 85 | retrieving. |
| 86 | |
Ted Kremenek | 9d9963e | 2009-03-26 16:19:54 +0000 | [diff] [blame] | 87 | SYMBOLIC REGIONS |
| 88 | |
| 89 | A symbolic region is a map of the concept of symbolic values into the domain |
| 90 | of regions. It is the way that we represent symbolic pointers. Whenever a |
| 91 | symbolic pointer value is needed, a symbolic region is created to represent |
| 92 | it. |
| 93 | |
| 94 | A symbolic region has no type. It wraps a SymbolData. But sometimes we have |
| 95 | type information associated with a symbolic region. For this case, a |
| 96 | TypedViewRegion is created to layer the type information on top of the |
| 97 | symbolic region. The reason we do not carry type information with the symbolic |
| 98 | region is that the symbolic regions can have no type. To be consistent, we |
| 99 | don't let them to carry type information. |
| 100 | |
| 101 | Like a symbolic pointer, a symbolic region may be NULL, has unknown extent, |
| 102 | and represents a generic chunk of memory. |
| 103 | |
| 104 | NOTE: We plan not to use loc::SymbolVal in RegionStore and remove it |
| 105 | gradually. |
Zhongxing Xu | 7fddc33 | 2009-03-26 08:23:58 +0000 | [diff] [blame] | 106 | |
Zhongxing Xu | 113cc14 | 2009-04-01 03:23:38 +0000 | [diff] [blame] | 107 | Symbolic regions get their rvalue types through the following ways: |
| 108 | * through the parameter or global variable that points to it, e.g.: |
| 109 | |
| 110 | void f(struct s* p) { |
| 111 | ... |
| 112 | } |
| 113 | |
| 114 | The symbolic region pointed to by 'p' has type 'struct s'. |
| 115 | |
| 116 | * through explicit or implicit casts, e.g.: |
| 117 | void f(void* p) { |
| 118 | struct s* q = (struct s*) p; |
| 119 | ... |
| 120 | } |
| 121 | |
| 122 | We attach the type information to the symbolic region lazily. For the first |
| 123 | case above, we create the TypedViewRegion only when the pointer is actually |
| 124 | used to access the pointee memory object, that is when the element or field |
| 125 | region is created. For the cast case, the TypedViewRegion is created when |
| 126 | visiting the CastExpr. |
| 127 | |
| 128 | The reason for doing lazy typing is that symbolic regions are sometimes only |
| 129 | used to do location comparison. |
| 130 | |
Zhongxing Xu | 7fddc33 | 2009-03-26 08:23:58 +0000 | [diff] [blame] | 131 | Pointer Casts |
| 132 | |
Zhongxing Xu | 7fe3e05 | 2009-04-01 05:26:39 +0000 | [diff] [blame] | 133 | Pointer casts allow people to impose different 'views' onto a chunk of memory. |
| 134 | |
| 135 | Usually we have two kinds of casts. One kind of casts cast down with in the |
| 136 | type hierarchy. It imposes more specific views onto more generic memory |
| 137 | regions. The other kind of casts cast up with in the type hierarchy. It strips |
| 138 | away more specific views on top of the more generic memory regions. |
| 139 | |
| 140 | We simulate the down casts by layering another TypedViewRegion on top of the |
| 141 | original region. We simulate the up casts by striping away the top |
| 142 | TypedViewRegion. Down casts is usually simple. For up casts, if the there is |
| 143 | no TypedViewRegion to be stripped, we return the original region. If the |
| 144 | underlying region is of the different type than the cast-to type, we flag an |
| 145 | error state. |
| 146 | |
| 147 | For toll-free bridging casts, we return the original region. |
Zhongxing Xu | 7fddc33 | 2009-03-26 08:23:58 +0000 | [diff] [blame] | 148 | |
Zhongxing Xu | a564789 | 2009-04-01 06:01:08 +0000 | [diff] [blame] | 149 | We can set up a lattice for pointer types, with the most general type 'void*' |
| 150 | at the top. The lattice enforces a partial order among types. |
| 151 | |
| 152 | Every MemRegion has a root position in the type lattice. For example, the |
| 153 | pointee region of 'void *p' has its root position at the top of the lattice. |
| 154 | VarRegion of 'int x' has its root position at the 'int type' node. |
| 155 | |
| 156 | TypedViewRegion is used to move the region down or up in the lattice. Moving |
| 157 | down in the lattice adds a TypedViewRegion. Moving up in the lattice removes a |
| 158 | TypedViewRegion. |
| 159 | |
| 160 | Do we want to allow moving up beyond the root position? This happens when: |
| 161 | int x; |
| 162 | void *p = &x; |
| 163 | |
| 164 | The region of 'x' has its root position at 'int*' node. the cast to void* |
| 165 | moves that region up to the 'void*' node. I propose to not allow such casts, |
| 166 | and assign the region of 'x' for 'p'. |
| 167 | |
Zhongxing Xu | 7fddc33 | 2009-03-26 08:23:58 +0000 | [diff] [blame] | 168 | Region Bindings |
| 169 | |
Zhongxing Xu | 7fe3e05 | 2009-04-01 05:26:39 +0000 | [diff] [blame] | 170 | The following region kinds are boundable: VarRegion, CompoundLiteralRegion, |
| 171 | StringRegion, ElementRegion, FieldRegion, and ObjCIvarRegion. |
Zhongxing Xu | 7fddc33 | 2009-03-26 08:23:58 +0000 | [diff] [blame] | 172 | |
Zhongxing Xu | 7fe3e05 | 2009-04-01 05:26:39 +0000 | [diff] [blame] | 173 | When binding regions, we perform canonicalization on element regions and field |
| 174 | regions. This is because we can have different views on the same region, some |
| 175 | of which are essentially the same view with different sugar type names. |
Zhongxing Xu | 7fddc33 | 2009-03-26 08:23:58 +0000 | [diff] [blame] | 176 | |
Zhongxing Xu | 7fe3e05 | 2009-04-01 05:26:39 +0000 | [diff] [blame] | 177 | To canonicalize a region, we get the canonical types for all TypedViewRegions |
| 178 | along the way up to the root region, and make new TypedViewRegions with those |
| 179 | canonical types. |
Zhongxing Xu | a564789 | 2009-04-01 06:01:08 +0000 | [diff] [blame] | 180 | |
| 181 | For ObjC and C++, perhaps another canonicalization rule should be added: for |
| 182 | FieldRegion, the least derived class that has the field is used as the type |
| 183 | of the super region of the FieldRegion. |
Zhongxing Xu | 7fe3e05 | 2009-04-01 05:26:39 +0000 | [diff] [blame] | 184 | |
| 185 | All bindings and retrievings are done on the canonicalized regions. |
| 186 | |
| 187 | Canonicalization is transparent outside the region store manager, and more |
| 188 | specifically, unaware outside the Bind() and Retrieve() method. We don't need |
| 189 | to consider region canonicalization when doing pointer cast. |
Zhongxing Xu | a564789 | 2009-04-01 06:01:08 +0000 | [diff] [blame] | 190 | |
| 191 | Constraint Manager |
| 192 | |
| 193 | The constraint manager reasons about the abstract location of memory |
| 194 | objects. We can have different views on a region, but none of these views |
| 195 | changes the location of that object. Thus we should get the same abstract |
| 196 | location for those regions. |