| Static Analyzer: 'Regions' |
| -------------------------- |
| |
| INTRODUCTION |
| |
| The path-sensitive analysis engine in libAnalysis employs an extensible API |
| for abstractly modeling the memory of an analyzed program. This API employs |
| the concept of "memory regions" to abstractly model chunks of program memory |
| such as program variables and dynamically allocated memory such as those |
| returned from 'malloc' and 'alloca'. Regions are hierarchical, with subregions |
| modeling subtyping relationships, field and array offsets into larger chunks |
| of memory, and so on. |
| |
| The region API consists of two components. The first is the taxonomy and |
| representation of regions themselves within the analyzer engine. The primary |
| definitions and interfaces are described in |
| 'include/clang/Analysis/PathSensitive/MemRegion.h'. At the root of the region |
| hierarchy is the class 'MemRegion' with specific subclasses refining the |
| region concept for variables, heap allocated memory, and so forth. |
| |
| The second component in the region API is the modeling of the binding of |
| values to regions. For example, modeling the value stored to a local variable |
| 'x' consists of recording the binding between the region for 'x' (which |
| represents the raw memory associated with 'x') and the value stored to 'x'. |
| This binding relationship is captured with the notion of "symbolic stores." |
| |
| Symbolic stores, which can be thought of as representing the relation 'regions |
| -> values', are implemented by subclasses of the StoreManager class (Store.h). |
| A particular StoreManager implementation has complete flexibility concerning |
| (a) *how* to model the binding between regions and values and (b) *what* |
| bindings are recorded. Together, both points allow different StoreManagers to |
| tradeoff between different levels of analysis precision and scalability |
| concerning the reasoning of program memory. Meanwhile, the core path-sensitive |
| engine makes no assumptions about (a) or (b), and queries a StoreManager about |
| the bindings to a memory region through a generic interface that all |
| StoreManagers share. If a particular StoreManager cannot reason about the |
| potential bindings of a given memory region (e.g., 'BasicStoreManager' does |
| not reason about fields of structures) then the StoreManager can simply return |
| 'unknown' (represented by 'UnknownVal') for a particular region-binding. This |
| separation of concerns not only isolates the core analysis engine from the |
| details of reasoning about program memory but also facilities the option of a |
| client of the path-sensitive engine to easily swap in different StoreManager |
| implementations that internally reason about program memory in very different |
| ways. |
| |
| The rest of this document is divided into two parts. We first discuss region |
| taxonomy and the semantics of regions. We then discuss the StoreManager |
| interface, and details of how the currently available StoreManager classes |
| implement region bindings. |
| |
| MEMORY REGIONS and REGION TAXONOMY |
| |
| POINTERS |
| |
| Before talking about the memory regions, we would talk about the pointers |
| since memory regions are essentially used to represent pointer values. |
| |
| The pointer is a type of values. Pointer values have two semantic aspects. One |
| is its physical value, which is an address or location. The other is the type |
| of the memory object residing in the address. |
| |
| Memory regions are designed to abstract these two properties of the |
| pointer. The physical value of a pointer is represented by MemRegion |
| pointers. The rvalue type of the region corresponds to the type of the pointee |
| object. |
| |
| One complication is that we could have different view regions on the same |
| memory chunk. They represent the same memory location, but have different |
| abstract location, i.e., MemRegion pointers. Thus we need to canonicalize |
| the abstract locations to get a unique abstract location for one physical |
| location. |
| |
| Furthermore, these different view regions may or may not represent memory |
| objects of different types. Some different types are semantically the same, |
| for example, 'struct s' and 'my_type' are the same type. |
| struct s; |
| typedef struct s my_type; |
| |
| But 'char' and 'int' are not the same type in the code below: |
| void *p; |
| int *q = (int*) p; |
| char *r = (char*) p; |
| |
| Thus we need to canonicalize the MemRegion which is used in binding and |
| retrieving. |
| |
| SYMBOLIC REGIONS |
| |
| A symbolic region is a map of the concept of symbolic values into the domain |
| of regions. It is the way that we represent symbolic pointers. Whenever a |
| symbolic pointer value is needed, a symbolic region is created to represent |
| it. |
| |
| A symbolic region has no type. It wraps a SymbolData. But sometimes we have |
| type information associated with a symbolic region. For this case, a |
| TypedViewRegion is created to layer the type information on top of the |
| symbolic region. The reason we do not carry type information with the symbolic |
| region is that the symbolic regions can have no type. To be consistent, we |
| don't let them to carry type information. |
| |
| Like a symbolic pointer, a symbolic region may be NULL, has unknown extent, |
| and represents a generic chunk of memory. |
| |
| NOTE: We plan not to use loc::SymbolVal in RegionStore and remove it |
| gradually. |
| |
| Symbolic regions get their rvalue types through the following ways: |
| * through the parameter or global variable that points to it, e.g.: |
| |
| void f(struct s* p) { |
| ... |
| } |
| |
| The symbolic region pointed to by 'p' has type 'struct s'. |
| |
| * through explicit or implicit casts, e.g.: |
| void f(void* p) { |
| struct s* q = (struct s*) p; |
| ... |
| } |
| |
| We attach the type information to the symbolic region lazily. For the first |
| case above, we create the TypedViewRegion only when the pointer is actually |
| used to access the pointee memory object, that is when the element or field |
| region is created. For the cast case, the TypedViewRegion is created when |
| visiting the CastExpr. |
| |
| The reason for doing lazy typing is that symbolic regions are sometimes only |
| used to do location comparison. |
| |
| Pointer Casts |
| |
| Pointer casts allow people to impose different 'views' onto a chunk of memory. |
| |
| Usually we have two kinds of casts. One kind of casts cast down with in the |
| type hierarchy. It imposes more specific views onto more generic memory |
| regions. The other kind of casts cast up with in the type hierarchy. It strips |
| away more specific views on top of the more generic memory regions. |
| |
| We simulate the down casts by layering another TypedViewRegion on top of the |
| original region. We simulate the up casts by striping away the top |
| TypedViewRegion. Down casts is usually simple. For up casts, if the there is |
| no TypedViewRegion to be stripped, we return the original region. If the |
| underlying region is of the different type than the cast-to type, we flag an |
| error state. |
| |
| For toll-free bridging casts, we return the original region. |
| |
| We can set up a partial order for pointer types, with the most general type |
| 'void*' at the top. The partial order forms a tree with 'void*' as its root |
| node. |
| |
| Every MemRegion has a root position in the type tree. For example, the pointee |
| region of 'void *p' has its root position at the root node of the tree. |
| VarRegion of 'int x' has its root position at the 'int type' node. |
| |
| TypedViewRegion is used to move the region down or up in the tree. Moving |
| down in the tree adds a TypedViewRegion. Moving up in the tree removes a |
| TypedViewRegion. |
| |
| Do we want to allow moving up beyond the root position? This happens when: |
| int x; |
| void *p = &x; |
| |
| The region of 'x' has its root position at 'int*' node. the cast to void* |
| moves that region up to the 'void*' node. I propose to not allow such casts, |
| and assign the region of 'x' for 'p'. |
| |
| Region Bindings |
| |
| The following region kinds are boundable: VarRegion, CompoundLiteralRegion, |
| StringRegion, ElementRegion, FieldRegion, and ObjCIvarRegion. |
| |
| When binding regions, we perform canonicalization on element regions and field |
| regions. This is because we can have different views on the same region, some |
| of which are essentially the same view with different sugar type names. |
| |
| To canonicalize a region, we get the canonical types for all TypedViewRegions |
| along the way up to the root region, and make new TypedViewRegions with those |
| canonical types. |
| |
| For ObjC and C++, perhaps another canonicalization rule should be added: for |
| FieldRegion, the least derived class that has the field is used as the type |
| of the super region of the FieldRegion. |
| |
| All bindings and retrievings are done on the canonicalized regions. |
| |
| Canonicalization is transparent outside the region store manager, and more |
| specifically, unaware outside the Bind() and Retrieve() method. We don't need |
| to consider region canonicalization when doing pointer cast. |
| |
| Constraint Manager |
| |
| The constraint manager reasons about the abstract location of memory |
| objects. We can have different views on a region, but none of these views |
| changes the location of that object. Thus we should get the same abstract |
| location for those regions. |