Blame - clang/docs/AnalyzerRegions.html - toolchain/llvm-project

blob: 35708d57c9701f189e622ff9956d602f38139823 [file] [log] [blame]

Ted Kremenek	5d71204	2009-04-03 01:38:55 +0000	[diff] [blame]	1	<html>
				2	<head>
				3	<title>Static Analyzer Design Document: Memory Regions</title>
				4	</head>
				5	<body>
				6
				7	<h1>Static Analyzer Design Document: Memory Regions</h1>
				8
				9	<h3>Authors</h3>
				10
				11	<p>Ted Kremenek, <tt>kremenek at apple</tt><br>
				12	Zhongxing Xu, <tt>xuzhongzhing at gmail</tt></p>
				13
				14	<h2 id="intro">Introduction</h2>
				15
				16	<p>The path-sensitive analysis engine in libAnalysis employs an extensible API
				17	for abstractly modeling the memory of an analyzed program. This API employs the
				18	concept of "memory regions" to abstractly model chunks of program memory such as
				19	program variables and dynamically allocated memory such as those returned from
				20	'malloc' and 'alloca'. Regions are hierarchical, with subregions modeling
				21	subtyping relationships, field and array offsets into larger chunks of memory,
				22	and so on.</p>
				23
				24	<p>The region API consists of two components:</p>
				25
				26	<ul> <li>A taxonomy and representation of regions themselves within the analyzer
				27	engine. The primary definitions and interfaces are described in <tt><a
				28	href="http://clang.llvm.org/doxygen/MemRegion_8h-source.html">MemRegion.h</a></tt>.
				29	At the root of the region hierarchy is the class <tt>MemRegion</tt> with
				30	specific subclasses refining the region concept for variables, heap allocated
				31	memory, and so forth.</li> <li>The modeling of binding of values to regions. For
				32	example, modeling the value stored to a local variable <tt>x</tt> consists of
				33	recording the binding between the region for <tt>x</tt> (which represents the
				34	raw memory associated with <tt>x</tt>) and the value stored to <tt>x</tt>. This
				35	binding relationship is captured with the notion of "symbolic
				36	stores."</li> </ul>
				37
				38	<p>Symbolic stores, which can be thought of as representing the relation
				39	<tt>regions -> values</tt>, are implemented by subclasses of the
				40	<tt>StoreManager</tt> class (<tt><a
				41	href="http://clang.llvm.org/doxygen/Store_8h-source.html">Store.h</a></tt>). A
				42	particular StoreManager implementation has complete flexibility concerning the
				43	following:
				44
				45	<ul>
				46	<li><em>How</em> to model the binding between regions and values</li>
				47	<li><em>What</em> bindings are recorded
				48	</ul>
				49
				50	<p>Together, both points allow different StoreManagers to tradeoff between
				51	different levels of analysis precision and scalability concerning the reasoning
				52	of program memory. Meanwhile, the core path-sensitive engine makes no
				53	assumptions about either points, and queries a StoreManager about the bindings
				54	to a memory region through a generic interface that all StoreManagers share. If
				55	a particular StoreManager cannot reason about the potential bindings of a given
				56	memory region (e.g., '<tt>BasicStoreManager</tt>' does not reason about fields
				57	of structures) then the StoreManager can simply return 'unknown' (represented by
				58	'<tt>UnknownVal</tt>') for a particular region-binding. This separation of
				59	concerns not only isolates the core analysis engine from the details of
				60	reasoning about program memory but also facilities the option of a client of the
				61	path-sensitive engine to easily swap in different StoreManager implementations
				62	that internally reason about program memory in very different ways.</pp>
				63
				64	<p>The rest of this document is divided into two parts. We first discuss region
				65	taxonomy and the semantics of regions. We then discuss the StoreManager
				66	interface, and details of how the currently available StoreManager classes
				67	implement region bindings.</p>
				68
				69	<h2 id="regions">Memory Regions and Region Taxonomy</h2>
				70
				71	<h3>Pointers</h3>
				72
				73	<p>Before talking about the memory regions, we would talk about the pointers
				74	since memory regions are essentially used to represent pointer values.</p>
				75
				76	<p>The pointer is a type of values. Pointer values have two semantic aspects.
				77	One is its physical value, which is an address or location. The other is the
				78	type of the memory object residing in the address.</p>
				79
				80	<p>Memory regions are designed to abstract these two properties of the pointer.
				81	The physical value of a pointer is represented by MemRegion pointers. The rvalue
				82	type of the region corresponds to the type of the pointee object.</p>
				83
				84	<p>One complication is that we could have different view regions on the same
				85	memory chunk. They represent the same memory location, but have different
				86	abstract location, i.e., MemRegion pointers. Thus we need to canonicalize the
				87	abstract locations to get a unique abstract location for one physical
				88	location.</p>
				89
				90	<p>Furthermore, these different view regions may or may not represent memory
				91	objects of different types. Some different types are semantically the same,
				92	for example, 'struct s' and 'my_type' are the same type.</p>
				93
				94	<pre>
				95	struct s;
				96	typedef struct s my_type;
				97	</pre>
				98
				99	<p>But <tt>char</tt> and <tt>int</tt> are not the same type in the code below:</p>
				100
				101	<pre>
				102	void *p;
				103	int q = (int) p;
				104	char r = (char) p;
				105	</pre
				106
				107	<p>Thus we need to canonicalize the MemRegion which is used in binding and
				108	retrieving.</p>
				109
Zhongxing Xu	a720c8b	2009-04-10 06:52:49 +0000	[diff] [blame]	110	<h3>Regions</h3>
				111	<p>Region is the entity used to model pointer values. A Region has the following
				112	properties:</p>
				113
				114	<ul>
				115	<li>Kind</li>
				116
				117	<li>ObjectType: the type of the object residing on the region.</li>
				118
				119	<li>LocationType: the type of the pointer value that the region corresponds to.
				120	Usually this is the pointer to the ObjectType. But sometimes we want to cache
				121	this type explicitly, for example, for a CodeTextRegion.</li>
				122
				123	<li>StartLocation</li>
				124
				125	<li>EndLocation</li>
				126	</ul>
				127
Ted Kremenek	5d71204	2009-04-03 01:38:55 +0000	[diff] [blame]	128	<h3>Symbolic Regions</h3>
				129
				130	<p>A symbolic region is a map of the concept of symbolic values into the domain
				131	of regions. It is the way that we represent symbolic pointers. Whenever a
				132	symbolic pointer value is needed, a symbolic region is created to represent
				133	it.</p>
				134
				135	<p>A symbolic region has no type. It wraps a SymbolData. But sometimes we have
				136	type information associated with a symbolic region. For this case, a
				137	TypedViewRegion is created to layer the type information on top of the symbolic
				138	region. The reason we do not carry type information with the symbolic region is
				139	that the symbolic regions can have no type. To be consistent, we don't let them
				140	to carry type information.</p>
				141
				142	<p>Like a symbolic pointer, a symbolic region may be NULL, has unknown extent,
				143	and represents a generic chunk of memory.</p>
				144
				145	<p><em><b>NOTE</b>: We plan not to use loc::SymbolVal in RegionStore and remove it
				146	gradually.</em></p>
				147
				148	<p>Symbolic regions get their rvalue types through the following ways:</p>
				149
				150	<ul>
				151	<li>Through the parameter or global variable that points to it, e.g.:
				152	<pre>
				153	void f(struct s* p) {
				154	...
				155	}
				156	</pre>
				157
				158	<p>The symbolic region pointed to by <tt>p</tt> has type <tt>struct
				159	s</tt>.</p></li>
				160
				161	<li>Through explicit or implicit casts, e.g.:
				162	<pre>
				163	void f(void* p) {
				164	struct s* q = (struct s*) p;
				165	...
				166	}
				167	</pre>
				168	</li>
				169	</ul>
				170
				171	<p>We attach the type information to the symbolic region lazily. For the first
				172	case above, we create the <tt>TypedViewRegion</tt> only when the pointer is
				173	actually used to access the pointee memory object, that is when the element or
				174	field region is created. For the cast case, the <tt>TypedViewRegion</tt> is
				175	created when visiting the <tt>CastExpr</tt>.</p>
				176
				177	<p>The reason for doing lazy typing is that symbolic regions are sometimes only
				178	used to do location comparison.</p>
				179
				180	<h3>Pointer Casts</h3>
				181
				182	<p>Pointer casts allow people to impose different 'views' onto a chunk of
				183	memory.</p>
				184
				185	<p>Usually we have two kinds of casts. One kind of casts cast down with in the
				186	type hierarchy. It imposes more specific views onto more generic memory regions.
				187	The other kind of casts cast up with in the type hierarchy. It strips away more
				188	specific views on top of the more generic memory regions.</p>
				189
				190	<p>We simulate the down casts by layering another <tt>TypedViewRegion</tt> on
				191	top of the original region. We simulate the up casts by striping away the top
				192	<tt>TypedViewRegion</tt>. Down casts is usually simple. For up casts, if the
				193	there is no <tt>TypedViewRegion</tt> to be stripped, we return the original
				194	region. If the underlying region is of the different type than the cast-to type,
				195	we flag an error state.</p>
				196
				197	<p>For toll-free bridging casts, we return the original region.</p>
				198
				199	<p>We can set up a partial order for pointer types, with the most general type
				200	<tt>void</tt> at the top. The partial order forms a tree with <tt>void</tt> as
				201	its root node.</p>
				202
				203	<p>Every <tt>MemRegion</tt> has a root position in the type tree. For example,
				204	the pointee region of <tt>void *p</tt> has its root position at the root node of
				205	the tree. <tt>VarRegion</tt> of <tt>int x</tt> has its root position at the 'int
				206	type' node.</p>
				207
				208	<p><tt>TypedViewRegion</tt> is used to move the region down or up in the tree.
				209	Moving down in the tree adds a <tt>TypedViewRegion</tt>. Moving up in the tree
				210	removes a <Tt>TypedViewRegion</tt>.</p>
				211
				212	<p>Do we want to allow moving up beyond the root position? This happens
				213	when:</p> <pre> int x; void *p = &x; </pre>
				214
				215	<p>The region of <tt>x</tt> has its root position at 'int*' node. the cast to
				216	void* moves that region up to the 'void*' node. I propose to not allow such
Zhongxing Xu	2f8198f	2009-04-20 10:09:10 +0000	[diff] [blame]	217	casts, and assign the region of <tt>x</tt> for <tt>p</tt>.</p>
				218
				219	<p>Another non-ideal case is that people might cast to a non-generic pointer
				220	from another non-generic pointer instead of first casting it back to the generic
				221	pointer. Direct handling of this case would result in multiple layers of
				222	TypedViewRegions. This enforces an incorrect semantic view to the region,
				223	because we can only have one typed view on a region at a time. To avoid this
				224	inconsistency, before casting the region, we strip the TypedViewRegion, then do
				225	the cast. In summary, we only allow one layer of TypedViewRegion.</p>
Ted Kremenek	5d71204	2009-04-03 01:38:55 +0000	[diff] [blame]	226
				227	<h3>Region Bindings</h3>
				228
				229	<p>The following region kinds are boundable: VarRegion, CompoundLiteralRegion,
				230	StringRegion, ElementRegion, FieldRegion, and ObjCIvarRegion.</p>
				231
				232	<p>When binding regions, we perform canonicalization on element regions and field
				233	regions. This is because we can have different views on the same region, some
				234	of which are essentially the same view with different sugar type names.</p>
				235
				236	<p>To canonicalize a region, we get the canonical types for all TypedViewRegions
				237	along the way up to the root region, and make new TypedViewRegions with those
				238	canonical types.</p>
				239
				240	<p>For Objective-C and C++, perhaps another canonicalization rule should be
				241	added: for FieldRegion, the least derived class that has the field is used as
				242	the type of the super region of the FieldRegion.</p>
				243
				244	<p>All bindings and retrievings are done on the canonicalized regions.</p>
				245
				246	<p>Canonicalization is transparent outside the region store manager, and more
				247	specifically, unaware outside the Bind() and Retrieve() method. We don't need to
				248	consider region canonicalization when doing pointer cast.</p>
				249
				250	<h3>Constraint Manager</h3>
				251
				252	<p>The constraint manager reasons about the abstract location of memory objects.
				253	We can have different views on a region, but none of these views changes the
				254	location of that object. Thus we should get the same abstract location for those
				255	regions.</p>
				256
				257	</body>
				258	</html>