blob: f9d33379204527e94294b58d9f4998f5a6f7480e [file] [log] [blame]
Benjamin Kramer665a8dc2012-01-15 15:26:07 +00001<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
Ted Kremenekd76e0a62009-04-03 01:38:55 +00003<html>
4<head>
5<title>Static Analyzer Design Document: Memory Regions</title>
6</head>
7<body>
8
9<h1>Static Analyzer Design Document: Memory Regions</h1>
10
11<h3>Authors</h3>
12
13<p>Ted Kremenek, <tt>kremenek at apple</tt><br>
14Zhongxing Xu, <tt>xuzhongzhing at gmail</tt></p>
15
16<h2 id="intro">Introduction</h2>
17
18<p>The path-sensitive analysis engine in libAnalysis employs an extensible API
19for abstractly modeling the memory of an analyzed program. This API employs the
20concept of "memory regions" to abstractly model chunks of program memory such as
21program variables and dynamically allocated memory such as those returned from
22'malloc' and 'alloca'. Regions are hierarchical, with subregions modeling
23subtyping relationships, field and array offsets into larger chunks of memory,
24and so on.</p>
25
26<p>The region API consists of two components:</p>
27
28<ul> <li>A taxonomy and representation of regions themselves within the analyzer
29engine. The primary definitions and interfaces are described in <tt><a
30href="http://clang.llvm.org/doxygen/MemRegion_8h-source.html">MemRegion.h</a></tt>.
31At the root of the region hierarchy is the class <tt>MemRegion</tt> with
32specific subclasses refining the region concept for variables, heap allocated
33memory, and so forth.</li> <li>The modeling of binding of values to regions. For
34example, modeling the value stored to a local variable <tt>x</tt> consists of
35recording the binding between the region for <tt>x</tt> (which represents the
36raw memory associated with <tt>x</tt>) and the value stored to <tt>x</tt>. This
37binding relationship is captured with the notion of &quot;symbolic
38stores.&quot;</li> </ul>
39
40<p>Symbolic stores, which can be thought of as representing the relation
41<tt>regions -> values</tt>, are implemented by subclasses of the
42<tt>StoreManager</tt> class (<tt><a
43href="http://clang.llvm.org/doxygen/Store_8h-source.html">Store.h</a></tt>). A
44particular StoreManager implementation has complete flexibility concerning the
45following:
46
47<ul>
48<li><em>How</em> to model the binding between regions and values</li>
49<li><em>What</em> bindings are recorded
50</ul>
51
52<p>Together, both points allow different StoreManagers to tradeoff between
53different levels of analysis precision and scalability concerning the reasoning
54of program memory. Meanwhile, the core path-sensitive engine makes no
55assumptions about either points, and queries a StoreManager about the bindings
56to a memory region through a generic interface that all StoreManagers share. If
57a particular StoreManager cannot reason about the potential bindings of a given
58memory region (e.g., '<tt>BasicStoreManager</tt>' does not reason about fields
59of structures) then the StoreManager can simply return 'unknown' (represented by
60'<tt>UnknownVal</tt>') for a particular region-binding. This separation of
61concerns not only isolates the core analysis engine from the details of
62reasoning about program memory but also facilities the option of a client of the
63path-sensitive engine to easily swap in different StoreManager implementations
Benjamin Kramer665a8dc2012-01-15 15:26:07 +000064that internally reason about program memory in very different ways.</p>
Ted Kremenekd76e0a62009-04-03 01:38:55 +000065
66<p>The rest of this document is divided into two parts. We first discuss region
67taxonomy and the semantics of regions. We then discuss the StoreManager
68interface, and details of how the currently available StoreManager classes
69implement region bindings.</p>
70
71<h2 id="regions">Memory Regions and Region Taxonomy</h2>
72
73<h3>Pointers</h3>
74
75<p>Before talking about the memory regions, we would talk about the pointers
76since memory regions are essentially used to represent pointer values.</p>
77
78<p>The pointer is a type of values. Pointer values have two semantic aspects.
79One is its physical value, which is an address or location. The other is the
80type of the memory object residing in the address.</p>
81
82<p>Memory regions are designed to abstract these two properties of the pointer.
83The physical value of a pointer is represented by MemRegion pointers. The rvalue
84type of the region corresponds to the type of the pointee object.</p>
85
86<p>One complication is that we could have different view regions on the same
87memory chunk. They represent the same memory location, but have different
88abstract location, i.e., MemRegion pointers. Thus we need to canonicalize the
89abstract locations to get a unique abstract location for one physical
90location.</p>
91
92<p>Furthermore, these different view regions may or may not represent memory
93objects of different types. Some different types are semantically the same,
94for example, 'struct s' and 'my_type' are the same type.</p>
95
96<pre>
97struct s;
98typedef struct s my_type;
99</pre>
100
101<p>But <tt>char</tt> and <tt>int</tt> are not the same type in the code below:</p>
102
103<pre>
104void *p;
105int *q = (int*) p;
106char *r = (char*) p;
Benjamin Kramer665a8dc2012-01-15 15:26:07 +0000107</pre>
Ted Kremenekd76e0a62009-04-03 01:38:55 +0000108
109<p>Thus we need to canonicalize the MemRegion which is used in binding and
110retrieving.</p>
111
Zhongxing Xucf665e12009-04-10 06:52:49 +0000112<h3>Regions</h3>
113<p>Region is the entity used to model pointer values. A Region has the following
114properties:</p>
115
116<ul>
117<li>Kind</li>
118
119<li>ObjectType: the type of the object residing on the region.</li>
120
121<li>LocationType: the type of the pointer value that the region corresponds to.
122 Usually this is the pointer to the ObjectType. But sometimes we want to cache
123 this type explicitly, for example, for a CodeTextRegion.</li>
124
125<li>StartLocation</li>
126
127<li>EndLocation</li>
128</ul>
129
Ted Kremenekd76e0a62009-04-03 01:38:55 +0000130<h3>Symbolic Regions</h3>
131
132<p>A symbolic region is a map of the concept of symbolic values into the domain
133of regions. It is the way that we represent symbolic pointers. Whenever a
134symbolic pointer value is needed, a symbolic region is created to represent
135it.</p>
136
137<p>A symbolic region has no type. It wraps a SymbolData. But sometimes we have
138type information associated with a symbolic region. For this case, a
139TypedViewRegion is created to layer the type information on top of the symbolic
140region. The reason we do not carry type information with the symbolic region is
141that the symbolic regions can have no type. To be consistent, we don't let them
142to carry type information.</p>
143
144<p>Like a symbolic pointer, a symbolic region may be NULL, has unknown extent,
145and represents a generic chunk of memory.</p>
146
147<p><em><b>NOTE</b>: We plan not to use loc::SymbolVal in RegionStore and remove it
148 gradually.</em></p>
149
150<p>Symbolic regions get their rvalue types through the following ways:</p>
151
152<ul>
153<li>Through the parameter or global variable that points to it, e.g.:
154<pre>
155void f(struct s* p) {
156 ...
157}
158</pre>
159
160<p>The symbolic region pointed to by <tt>p</tt> has type <tt>struct
161s</tt>.</p></li>
162
163<li>Through explicit or implicit casts, e.g.:
164<pre>
165void f(void* p) {
166 struct s* q = (struct s*) p;
167 ...
168}
169</pre>
170</li>
171</ul>
172
173<p>We attach the type information to the symbolic region lazily. For the first
174case above, we create the <tt>TypedViewRegion</tt> only when the pointer is
175actually used to access the pointee memory object, that is when the element or
176field region is created. For the cast case, the <tt>TypedViewRegion</tt> is
177created when visiting the <tt>CastExpr</tt>.</p>
178
179<p>The reason for doing lazy typing is that symbolic regions are sometimes only
180used to do location comparison.</p>
181
182<h3>Pointer Casts</h3>
183
184<p>Pointer casts allow people to impose different 'views' onto a chunk of
185memory.</p>
186
187<p>Usually we have two kinds of casts. One kind of casts cast down with in the
188type hierarchy. It imposes more specific views onto more generic memory regions.
189The other kind of casts cast up with in the type hierarchy. It strips away more
190specific views on top of the more generic memory regions.</p>
191
192<p>We simulate the down casts by layering another <tt>TypedViewRegion</tt> on
193top of the original region. We simulate the up casts by striping away the top
194<tt>TypedViewRegion</tt>. Down casts is usually simple. For up casts, if the
195there is no <tt>TypedViewRegion</tt> to be stripped, we return the original
196region. If the underlying region is of the different type than the cast-to type,
197we flag an error state.</p>
198
199<p>For toll-free bridging casts, we return the original region.</p>
200
201<p>We can set up a partial order for pointer types, with the most general type
202<tt>void*</tt> at the top. The partial order forms a tree with <tt>void*</tt> as
203its root node.</p>
204
205<p>Every <tt>MemRegion</tt> has a root position in the type tree. For example,
206the pointee region of <tt>void *p</tt> has its root position at the root node of
207the tree. <tt>VarRegion</tt> of <tt>int x</tt> has its root position at the 'int
208type' node.</p>
209
210<p><tt>TypedViewRegion</tt> is used to move the region down or up in the tree.
211Moving down in the tree adds a <tt>TypedViewRegion</tt>. Moving up in the tree
212removes a <Tt>TypedViewRegion</tt>.</p>
213
214<p>Do we want to allow moving up beyond the root position? This happens
215when:</p> <pre> int x; void *p = &amp;x; </pre>
216
217<p>The region of <tt>x</tt> has its root position at 'int*' node. the cast to
218void* moves that region up to the 'void*' node. I propose to not allow such
Zhongxing Xu4a115512009-04-20 10:09:10 +0000219casts, and assign the region of <tt>x</tt> for <tt>p</tt>.</p>
220
221<p>Another non-ideal case is that people might cast to a non-generic pointer
222from another non-generic pointer instead of first casting it back to the generic
223pointer. Direct handling of this case would result in multiple layers of
224TypedViewRegions. This enforces an incorrect semantic view to the region,
225because we can only have one typed view on a region at a time. To avoid this
226inconsistency, before casting the region, we strip the TypedViewRegion, then do
227the cast. In summary, we only allow one layer of TypedViewRegion.</p>
Ted Kremenekd76e0a62009-04-03 01:38:55 +0000228
229<h3>Region Bindings</h3>
230
231<p>The following region kinds are boundable: VarRegion, CompoundLiteralRegion,
232StringRegion, ElementRegion, FieldRegion, and ObjCIvarRegion.</p>
233
234<p>When binding regions, we perform canonicalization on element regions and field
235regions. This is because we can have different views on the same region, some
236of which are essentially the same view with different sugar type names.</p>
237
238<p>To canonicalize a region, we get the canonical types for all TypedViewRegions
239along the way up to the root region, and make new TypedViewRegions with those
240canonical types.</p>
241
242<p>For Objective-C and C++, perhaps another canonicalization rule should be
243added: for FieldRegion, the least derived class that has the field is used as
244the type of the super region of the FieldRegion.</p>
245
246<p>All bindings and retrievings are done on the canonicalized regions.</p>
247
248<p>Canonicalization is transparent outside the region store manager, and more
249specifically, unaware outside the Bind() and Retrieve() method. We don't need to
250consider region canonicalization when doing pointer cast.</p>
251
252<h3>Constraint Manager</h3>
253
254<p>The constraint manager reasons about the abstract location of memory objects.
255We can have different views on a region, but none of these views changes the
256location of that object. Thus we should get the same abstract location for those
257regions.</p>
258
259</body>
260</html>