blob: c9c4ab30dff64d79485afa7dddf22e03ba377d63 [file] [log] [blame]
Ted Kremenek9d9963e2009-03-26 16:19:54 +00001Static Analyzer: 'Regions'
2--------------------------
Zhongxing Xu7fddc332009-03-26 08:23:58 +00003
Ted Kremenek9d9963e2009-03-26 16:19:54 +00004INTRODUCTION
Zhongxing Xu7fddc332009-03-26 08:23:58 +00005
Ted Kremenek9d9963e2009-03-26 16:19:54 +00006 The path-sensitive analysis engine in libAnalysis employs an extensible API
7 for abstractly modeling the memory of an analyzed program. This API employs
8 the concept of "memory regions" to abstractly model chunks of program memory
9 such as program variables and dynamically allocated memory such as those
10 returned from 'malloc' and 'alloca'. Regions are hierarchical, with subregions
11 modeling subtyping relationships, field and array offsets into larger chunks
12 of memory, and so on.
Zhongxing Xu7fddc332009-03-26 08:23:58 +000013
Ted Kremenek9d9963e2009-03-26 16:19:54 +000014 The region API consists of two components. The first is the taxonomy and
15 representation of regions themselves within the analyzer engine. The primary
16 definitions and interfaces are described in
17 'include/clang/Analysis/PathSensitive/MemRegion.h'. At the root of the region
18 hierarchy is the class 'MemRegion' with specific subclasses refining the
19 region concept for variables, heap allocated memory, and so forth.
Zhongxing Xu7fddc332009-03-26 08:23:58 +000020
Ted Kremenek9d9963e2009-03-26 16:19:54 +000021 The second component in the region API is the modeling of the binding of
22 values to regions. For example, modeling the value stored to a local variable
23 'x' consists of recording the binding between the region for 'x' (which
24 represents the raw memory associated with 'x') and the value stored to 'x'.
25 This binding relationship is captured with the notion of "symbolic stores."
26
27 Symbolic stores, which can be thought of as representing the relation 'regions
28 -> values', are implemented by subclasses of the StoreManager class (Store.h).
29 A particular StoreManager implementation has complete flexibility concerning
30 (a) *how* to model the binding between regions and values and (b) *what*
31 bindings are recorded. Together, both points allow different StoreManagers to
32 tradeoff between different levels of analysis precision and scalability
33 concerning the reasoning of program memory. Meanwhile, the core path-sensitive
34 engine makes no assumptions about (a) or (b), and queries a StoreManager about
35 the bindings to a memory region through a generic interface that all
36 StoreManagers share. If a particular StoreManager cannot reason about the
37 potential bindings of a given memory region (e.g., 'BasicStoreManager' does
38 not reason about fields of structures) then the StoreManager can simply return
39 'unknown' (represented by 'UnknownVal') for a particular region-binding. This
40 separation of concerns not only isolates the core analysis engine from the
41 details of reasoning about program memory but also facilities the option of a
42 client of the path-sensitive engine to easily swap in different StoreManager
43 implementations that internally reason about program memory in very different
44 ways.
45
46 The rest of this document is divided into two parts. We first discuss region
47 taxonomy and the semantics of regions. We then discuss the StoreManager
48 interface, and details of how the currently available StoreManager classes
49 implement region bindings.
50
51MEMORY REGIONS and REGION TAXONOMY
52
Zhongxing Xu59eb5f92009-04-01 05:05:22 +000053 POINTERS
54
55 Before talking about the memory regions, we would talk about the pointers
56 since memory regions are essentially used to represent pointer values.
57
58 The pointer is a type of values. Pointer values have two semantic aspects. One
59 is its physical value, which is an address or location. The other is the type
60 of the memory object residing in the address.
61
62 Memory regions are designed to abstract these two properties of the
63 pointer. The physical value of a pointer is represented by MemRegion
64 pointers. The rvalue type of the region corresponds to the type of the pointee
65 object.
66
67 One complication is that we could have different view regions on the same
68 memory chunk. They represent the same memory location, but have different
69 abstract location, i.e., MemRegion pointers. Thus we need to canonicalize
70 the abstract locations to get a unique abstract location for one physical
71 location.
72
73 Furthermore, these different view regions may or may not represent memory
74 objects of different types. Some different types are semantically the same,
75 for example, 'struct s' and 'my_type' are the same type.
76 struct s;
77 typedef struct s my_type;
78
79 But 'char' and 'int' are not the same type in the code below:
80 void *p;
81 int *q = (int*) p;
82 char *r = (char*) p;
83
84 Thus we need to canonicalize the MemRegion which is used in binding and
85 retrieving.
86
Ted Kremenek9d9963e2009-03-26 16:19:54 +000087 SYMBOLIC REGIONS
88
89 A symbolic region is a map of the concept of symbolic values into the domain
90 of regions. It is the way that we represent symbolic pointers. Whenever a
91 symbolic pointer value is needed, a symbolic region is created to represent
92 it.
93
94 A symbolic region has no type. It wraps a SymbolData. But sometimes we have
95 type information associated with a symbolic region. For this case, a
96 TypedViewRegion is created to layer the type information on top of the
97 symbolic region. The reason we do not carry type information with the symbolic
98 region is that the symbolic regions can have no type. To be consistent, we
99 don't let them to carry type information.
100
101 Like a symbolic pointer, a symbolic region may be NULL, has unknown extent,
102 and represents a generic chunk of memory.
103
104 NOTE: We plan not to use loc::SymbolVal in RegionStore and remove it
105 gradually.
Zhongxing Xu7fddc332009-03-26 08:23:58 +0000106
Zhongxing Xu113cc142009-04-01 03:23:38 +0000107 Symbolic regions get their rvalue types through the following ways:
108 * through the parameter or global variable that points to it, e.g.:
109
110 void f(struct s* p) {
111 ...
112 }
113
114 The symbolic region pointed to by 'p' has type 'struct s'.
115
116 * through explicit or implicit casts, e.g.:
117 void f(void* p) {
118 struct s* q = (struct s*) p;
119 ...
120 }
121
122 We attach the type information to the symbolic region lazily. For the first
123 case above, we create the TypedViewRegion only when the pointer is actually
124 used to access the pointee memory object, that is when the element or field
125 region is created. For the cast case, the TypedViewRegion is created when
126 visiting the CastExpr.
127
128 The reason for doing lazy typing is that symbolic regions are sometimes only
129 used to do location comparison.
130
Zhongxing Xu7fddc332009-03-26 08:23:58 +0000131Pointer Casts
132
Zhongxing Xu7fe3e052009-04-01 05:26:39 +0000133 Pointer casts allow people to impose different 'views' onto a chunk of memory.
134
135 Usually we have two kinds of casts. One kind of casts cast down with in the
136 type hierarchy. It imposes more specific views onto more generic memory
137 regions. The other kind of casts cast up with in the type hierarchy. It strips
138 away more specific views on top of the more generic memory regions.
139
140 We simulate the down casts by layering another TypedViewRegion on top of the
141 original region. We simulate the up casts by striping away the top
142 TypedViewRegion. Down casts is usually simple. For up casts, if the there is
143 no TypedViewRegion to be stripped, we return the original region. If the
144 underlying region is of the different type than the cast-to type, we flag an
145 error state.
146
147 For toll-free bridging casts, we return the original region.
Zhongxing Xu7fddc332009-03-26 08:23:58 +0000148
Zhongxing Xu75ae2f12009-04-02 01:14:56 +0000149 We can set up a partial order for pointer types, with the most general type
150 'void*' at the top. The partial order forms a tree with 'void*' as its root
151 node.
Zhongxing Xua5647892009-04-01 06:01:08 +0000152
Zhongxing Xu75ae2f12009-04-02 01:14:56 +0000153 Every MemRegion has a root position in the type tree. For example, the pointee
154 region of 'void *p' has its root position at the root node of the tree.
Zhongxing Xua5647892009-04-01 06:01:08 +0000155 VarRegion of 'int x' has its root position at the 'int type' node.
156
Zhongxing Xu75ae2f12009-04-02 01:14:56 +0000157 TypedViewRegion is used to move the region down or up in the tree. Moving
158 down in the tree adds a TypedViewRegion. Moving up in the tree removes a
Zhongxing Xua5647892009-04-01 06:01:08 +0000159 TypedViewRegion.
160
161 Do we want to allow moving up beyond the root position? This happens when:
162 int x;
163 void *p = &x;
164
165 The region of 'x' has its root position at 'int*' node. the cast to void*
166 moves that region up to the 'void*' node. I propose to not allow such casts,
167 and assign the region of 'x' for 'p'.
168
Zhongxing Xu7fddc332009-03-26 08:23:58 +0000169Region Bindings
170
Zhongxing Xu7fe3e052009-04-01 05:26:39 +0000171 The following region kinds are boundable: VarRegion, CompoundLiteralRegion,
172 StringRegion, ElementRegion, FieldRegion, and ObjCIvarRegion.
Zhongxing Xu7fddc332009-03-26 08:23:58 +0000173
Zhongxing Xu7fe3e052009-04-01 05:26:39 +0000174 When binding regions, we perform canonicalization on element regions and field
175 regions. This is because we can have different views on the same region, some
176 of which are essentially the same view with different sugar type names.
Zhongxing Xu7fddc332009-03-26 08:23:58 +0000177
Zhongxing Xu7fe3e052009-04-01 05:26:39 +0000178 To canonicalize a region, we get the canonical types for all TypedViewRegions
179 along the way up to the root region, and make new TypedViewRegions with those
180 canonical types.
Zhongxing Xua5647892009-04-01 06:01:08 +0000181
182 For ObjC and C++, perhaps another canonicalization rule should be added: for
183 FieldRegion, the least derived class that has the field is used as the type
184 of the super region of the FieldRegion.
Zhongxing Xu7fe3e052009-04-01 05:26:39 +0000185
186 All bindings and retrievings are done on the canonicalized regions.
187
188 Canonicalization is transparent outside the region store manager, and more
189 specifically, unaware outside the Bind() and Retrieve() method. We don't need
190 to consider region canonicalization when doing pointer cast.
Zhongxing Xua5647892009-04-01 06:01:08 +0000191
192Constraint Manager
193
194 The constraint manager reasons about the abstract location of memory
195 objects. We can have different views on a region, but none of these views
196 changes the location of that object. Thus we should get the same abstract
197 location for those regions.