| Ted Kremenek | 5d71204 | 2009-04-03 01:38:55 +0000 | [diff] [blame] | 1 | <html> | 
|  | 2 | <head> | 
|  | 3 | <title>Static Analyzer Design Document: Memory Regions</title> | 
|  | 4 | </head> | 
|  | 5 | <body> | 
|  | 6 |  | 
|  | 7 | <h1>Static Analyzer Design Document: Memory Regions</h1> | 
|  | 8 |  | 
|  | 9 | <h3>Authors</h3> | 
|  | 10 |  | 
|  | 11 | <p>Ted Kremenek, <tt>kremenek at apple</tt><br> | 
|  | 12 | Zhongxing Xu, <tt>xuzhongzhing at gmail</tt></p> | 
|  | 13 |  | 
|  | 14 | <h2 id="intro">Introduction</h2> | 
|  | 15 |  | 
|  | 16 | <p>The path-sensitive analysis engine in libAnalysis employs an extensible API | 
|  | 17 | for abstractly modeling the memory of an analyzed program. This API employs the | 
|  | 18 | concept of "memory regions" to abstractly model chunks of program memory such as | 
|  | 19 | program variables and dynamically allocated memory such as those returned from | 
|  | 20 | 'malloc' and 'alloca'. Regions are hierarchical, with subregions modeling | 
|  | 21 | subtyping relationships, field and array offsets into larger chunks of memory, | 
|  | 22 | and so on.</p> | 
|  | 23 |  | 
|  | 24 | <p>The region API consists of two components:</p> | 
|  | 25 |  | 
|  | 26 | <ul> <li>A taxonomy and representation of regions themselves within the analyzer | 
|  | 27 | engine. The primary definitions and interfaces are described in <tt><a | 
|  | 28 | href="http://clang.llvm.org/doxygen/MemRegion_8h-source.html">MemRegion.h</a></tt>. | 
|  | 29 | At the root of the region hierarchy is the class <tt>MemRegion</tt> with | 
|  | 30 | specific subclasses refining the region concept for variables, heap allocated | 
|  | 31 | memory, and so forth.</li> <li>The modeling of binding of values to regions. For | 
|  | 32 | example, modeling the value stored to a local variable <tt>x</tt> consists of | 
|  | 33 | recording the binding between the region for <tt>x</tt> (which represents the | 
|  | 34 | raw memory associated with <tt>x</tt>) and the value stored to <tt>x</tt>. This | 
|  | 35 | binding relationship is captured with the notion of "symbolic | 
|  | 36 | stores."</li> </ul> | 
|  | 37 |  | 
|  | 38 | <p>Symbolic stores, which can be thought of as representing the relation | 
|  | 39 | <tt>regions -> values</tt>, are implemented by subclasses of the | 
|  | 40 | <tt>StoreManager</tt> class (<tt><a | 
|  | 41 | href="http://clang.llvm.org/doxygen/Store_8h-source.html">Store.h</a></tt>). A | 
|  | 42 | particular StoreManager implementation has complete flexibility concerning the | 
|  | 43 | following: | 
|  | 44 |  | 
|  | 45 | <ul> | 
|  | 46 | <li><em>How</em> to model the binding between regions and values</li> | 
|  | 47 | <li><em>What</em> bindings are recorded | 
|  | 48 | </ul> | 
|  | 49 |  | 
|  | 50 | <p>Together, both points allow different StoreManagers to tradeoff between | 
|  | 51 | different levels of analysis precision and scalability concerning the reasoning | 
|  | 52 | of program memory. Meanwhile, the core path-sensitive engine makes no | 
|  | 53 | assumptions about either points, and queries a StoreManager about the bindings | 
|  | 54 | to a memory region through a generic interface that all StoreManagers share. If | 
|  | 55 | a particular StoreManager cannot reason about the potential bindings of a given | 
|  | 56 | memory region (e.g., '<tt>BasicStoreManager</tt>' does not reason about fields | 
|  | 57 | of structures) then the StoreManager can simply return 'unknown' (represented by | 
|  | 58 | '<tt>UnknownVal</tt>') for a particular region-binding. This separation of | 
|  | 59 | concerns not only isolates the core analysis engine from the details of | 
|  | 60 | reasoning about program memory but also facilities the option of a client of the | 
|  | 61 | path-sensitive engine to easily swap in different StoreManager implementations | 
|  | 62 | that internally reason about program memory in very different ways.</pp> | 
|  | 63 |  | 
|  | 64 | <p>The rest of this document is divided into two parts. We first discuss region | 
|  | 65 | taxonomy and the semantics of regions. We then discuss the StoreManager | 
|  | 66 | interface, and details of how the currently available StoreManager classes | 
|  | 67 | implement region bindings.</p> | 
|  | 68 |  | 
|  | 69 | <h2 id="regions">Memory Regions and Region Taxonomy</h2> | 
|  | 70 |  | 
|  | 71 | <h3>Pointers</h3> | 
|  | 72 |  | 
|  | 73 | <p>Before talking about the memory regions, we would talk about the pointers | 
|  | 74 | since memory regions are essentially used to represent pointer values.</p> | 
|  | 75 |  | 
|  | 76 | <p>The pointer is a type of values. Pointer values have two semantic aspects. | 
|  | 77 | One is its physical value, which is an address or location. The other is the | 
|  | 78 | type of the memory object residing in the address.</p> | 
|  | 79 |  | 
|  | 80 | <p>Memory regions are designed to abstract these two properties of the pointer. | 
|  | 81 | The physical value of a pointer is represented by MemRegion pointers. The rvalue | 
|  | 82 | type of the region corresponds to the type of the pointee object.</p> | 
|  | 83 |  | 
|  | 84 | <p>One complication is that we could have different view regions on the same | 
|  | 85 | memory chunk. They represent the same memory location, but have different | 
|  | 86 | abstract location, i.e., MemRegion pointers. Thus we need to canonicalize the | 
|  | 87 | abstract locations to get a unique abstract location for one physical | 
|  | 88 | location.</p> | 
|  | 89 |  | 
|  | 90 | <p>Furthermore, these different view regions may or may not represent memory | 
|  | 91 | objects of different types. Some different types are semantically the same, | 
|  | 92 | for example, 'struct s' and 'my_type' are the same type.</p> | 
|  | 93 |  | 
|  | 94 | <pre> | 
|  | 95 | struct s; | 
|  | 96 | typedef struct s my_type; | 
|  | 97 | </pre> | 
|  | 98 |  | 
|  | 99 | <p>But <tt>char</tt> and <tt>int</tt> are not the same type in the code below:</p> | 
|  | 100 |  | 
|  | 101 | <pre> | 
|  | 102 | void *p; | 
|  | 103 | int *q = (int*) p; | 
|  | 104 | char *r = (char*) p; | 
|  | 105 | </pre | 
|  | 106 |  | 
|  | 107 | <p>Thus we need to canonicalize the MemRegion which is used in binding and | 
|  | 108 | retrieving.</p> | 
|  | 109 |  | 
| Zhongxing Xu | a720c8b | 2009-04-10 06:52:49 +0000 | [diff] [blame] | 110 | <h3>Regions</h3> | 
|  | 111 | <p>Region is the entity used to model pointer values. A Region has the following | 
|  | 112 | properties:</p> | 
|  | 113 |  | 
|  | 114 | <ul> | 
|  | 115 | <li>Kind</li> | 
|  | 116 |  | 
|  | 117 | <li>ObjectType: the type of the object residing on the region.</li> | 
|  | 118 |  | 
|  | 119 | <li>LocationType: the type of the pointer value that the region corresponds to. | 
|  | 120 | Usually this is the pointer to the ObjectType. But sometimes we want to cache | 
|  | 121 | this type explicitly, for example, for a CodeTextRegion.</li> | 
|  | 122 |  | 
|  | 123 | <li>StartLocation</li> | 
|  | 124 |  | 
|  | 125 | <li>EndLocation</li> | 
|  | 126 | </ul> | 
|  | 127 |  | 
| Ted Kremenek | 5d71204 | 2009-04-03 01:38:55 +0000 | [diff] [blame] | 128 | <h3>Symbolic Regions</h3> | 
|  | 129 |  | 
|  | 130 | <p>A symbolic region is a map of the concept of symbolic values into the domain | 
|  | 131 | of regions. It is the way that we represent symbolic pointers. Whenever a | 
|  | 132 | symbolic pointer value is needed, a symbolic region is created to represent | 
|  | 133 | it.</p> | 
|  | 134 |  | 
|  | 135 | <p>A symbolic region has no type. It wraps a SymbolData. But sometimes we have | 
|  | 136 | type information associated with a symbolic region. For this case, a | 
|  | 137 | TypedViewRegion is created to layer the type information on top of the symbolic | 
|  | 138 | region. The reason we do not carry type information with the symbolic region is | 
|  | 139 | that the symbolic regions can have no type. To be consistent, we don't let them | 
|  | 140 | to carry type information.</p> | 
|  | 141 |  | 
|  | 142 | <p>Like a symbolic pointer, a symbolic region may be NULL, has unknown extent, | 
|  | 143 | and represents a generic chunk of memory.</p> | 
|  | 144 |  | 
|  | 145 | <p><em><b>NOTE</b>: We plan not to use loc::SymbolVal in RegionStore and remove it | 
|  | 146 | gradually.</em></p> | 
|  | 147 |  | 
|  | 148 | <p>Symbolic regions get their rvalue types through the following ways:</p> | 
|  | 149 |  | 
|  | 150 | <ul> | 
|  | 151 | <li>Through the parameter or global variable that points to it, e.g.: | 
|  | 152 | <pre> | 
|  | 153 | void f(struct s* p) { | 
|  | 154 | ... | 
|  | 155 | } | 
|  | 156 | </pre> | 
|  | 157 |  | 
|  | 158 | <p>The symbolic region pointed to by <tt>p</tt> has type <tt>struct | 
|  | 159 | s</tt>.</p></li> | 
|  | 160 |  | 
|  | 161 | <li>Through explicit or implicit casts, e.g.: | 
|  | 162 | <pre> | 
|  | 163 | void f(void* p) { | 
|  | 164 | struct s* q = (struct s*) p; | 
|  | 165 | ... | 
|  | 166 | } | 
|  | 167 | </pre> | 
|  | 168 | </li> | 
|  | 169 | </ul> | 
|  | 170 |  | 
|  | 171 | <p>We attach the type information to the symbolic region lazily. For the first | 
|  | 172 | case above, we create the <tt>TypedViewRegion</tt> only when the pointer is | 
|  | 173 | actually used to access the pointee memory object, that is when the element or | 
|  | 174 | field region is created. For the cast case, the <tt>TypedViewRegion</tt> is | 
|  | 175 | created when visiting the <tt>CastExpr</tt>.</p> | 
|  | 176 |  | 
|  | 177 | <p>The reason for doing lazy typing is that symbolic regions are sometimes only | 
|  | 178 | used to do location comparison.</p> | 
|  | 179 |  | 
|  | 180 | <h3>Pointer Casts</h3> | 
|  | 181 |  | 
|  | 182 | <p>Pointer casts allow people to impose different 'views' onto a chunk of | 
|  | 183 | memory.</p> | 
|  | 184 |  | 
|  | 185 | <p>Usually we have two kinds of casts. One kind of casts cast down with in the | 
|  | 186 | type hierarchy. It imposes more specific views onto more generic memory regions. | 
|  | 187 | The other kind of casts cast up with in the type hierarchy. It strips away more | 
|  | 188 | specific views on top of the more generic memory regions.</p> | 
|  | 189 |  | 
|  | 190 | <p>We simulate the down casts by layering another <tt>TypedViewRegion</tt> on | 
|  | 191 | top of the original region. We simulate the up casts by striping away the top | 
|  | 192 | <tt>TypedViewRegion</tt>. Down casts is usually simple. For up casts, if the | 
|  | 193 | there is no <tt>TypedViewRegion</tt> to be stripped, we return the original | 
|  | 194 | region. If the underlying region is of the different type than the cast-to type, | 
|  | 195 | we flag an error state.</p> | 
|  | 196 |  | 
|  | 197 | <p>For toll-free bridging casts, we return the original region.</p> | 
|  | 198 |  | 
|  | 199 | <p>We can set up a partial order for pointer types, with the most general type | 
|  | 200 | <tt>void*</tt> at the top. The partial order forms a tree with <tt>void*</tt> as | 
|  | 201 | its root node.</p> | 
|  | 202 |  | 
|  | 203 | <p>Every <tt>MemRegion</tt> has a root position in the type tree. For example, | 
|  | 204 | the pointee region of <tt>void *p</tt> has its root position at the root node of | 
|  | 205 | the tree. <tt>VarRegion</tt> of <tt>int x</tt> has its root position at the 'int | 
|  | 206 | type' node.</p> | 
|  | 207 |  | 
|  | 208 | <p><tt>TypedViewRegion</tt> is used to move the region down or up in the tree. | 
|  | 209 | Moving down in the tree adds a <tt>TypedViewRegion</tt>. Moving up in the tree | 
|  | 210 | removes a <Tt>TypedViewRegion</tt>.</p> | 
|  | 211 |  | 
|  | 212 | <p>Do we want to allow moving up beyond the root position? This happens | 
|  | 213 | when:</p> <pre> int x; void *p = &x; </pre> | 
|  | 214 |  | 
|  | 215 | <p>The region of <tt>x</tt> has its root position at 'int*' node. the cast to | 
|  | 216 | void* moves that region up to the 'void*' node. I propose to not allow such | 
| Zhongxing Xu | 2f8198f | 2009-04-20 10:09:10 +0000 | [diff] [blame] | 217 | casts, and assign the region of <tt>x</tt> for <tt>p</tt>.</p> | 
|  | 218 |  | 
|  | 219 | <p>Another non-ideal case is that people might cast to a non-generic pointer | 
|  | 220 | from another non-generic pointer instead of first casting it back to the generic | 
|  | 221 | pointer. Direct handling of this case would result in multiple layers of | 
|  | 222 | TypedViewRegions. This enforces an incorrect semantic view to the region, | 
|  | 223 | because we can only have one typed view on a region at a time. To avoid this | 
|  | 224 | inconsistency, before casting the region, we strip the TypedViewRegion, then do | 
|  | 225 | the cast. In summary, we only allow one layer of TypedViewRegion.</p> | 
| Ted Kremenek | 5d71204 | 2009-04-03 01:38:55 +0000 | [diff] [blame] | 226 |  | 
|  | 227 | <h3>Region Bindings</h3> | 
|  | 228 |  | 
|  | 229 | <p>The following region kinds are boundable: VarRegion, CompoundLiteralRegion, | 
|  | 230 | StringRegion, ElementRegion, FieldRegion, and ObjCIvarRegion.</p> | 
|  | 231 |  | 
|  | 232 | <p>When binding regions, we perform canonicalization on element regions and field | 
|  | 233 | regions. This is because we can have different views on the same region, some | 
|  | 234 | of which are essentially the same view with different sugar type names.</p> | 
|  | 235 |  | 
|  | 236 | <p>To canonicalize a region, we get the canonical types for all TypedViewRegions | 
|  | 237 | along the way up to the root region, and make new TypedViewRegions with those | 
|  | 238 | canonical types.</p> | 
|  | 239 |  | 
|  | 240 | <p>For Objective-C and C++, perhaps another canonicalization rule should be | 
|  | 241 | added: for FieldRegion, the least derived class that has the field is used as | 
|  | 242 | the type of the super region of the FieldRegion.</p> | 
|  | 243 |  | 
|  | 244 | <p>All bindings and retrievings are done on the canonicalized regions.</p> | 
|  | 245 |  | 
|  | 246 | <p>Canonicalization is transparent outside the region store manager, and more | 
|  | 247 | specifically, unaware outside the Bind() and Retrieve() method. We don't need to | 
|  | 248 | consider region canonicalization when doing pointer cast.</p> | 
|  | 249 |  | 
|  | 250 | <h3>Constraint Manager</h3> | 
|  | 251 |  | 
|  | 252 | <p>The constraint manager reasons about the abstract location of memory objects. | 
|  | 253 | We can have different views on a region, but none of these views changes the | 
|  | 254 | location of that object. Thus we should get the same abstract location for those | 
|  | 255 | regions.</p> | 
|  | 256 |  | 
|  | 257 | </body> | 
|  | 258 | </html> |