Blame - docs/AnalyzerRegions.txt - platform/external/clang

blob: 62f23e4618c18361b50c825c86b741ef2138f7d1 [file] [log] [blame]

Ted Kremenek	9d9963e	2009-03-26 16:19:54 +0000	[diff] [blame]	1	Static Analyzer: 'Regions'
				2	--------------------------
Zhongxing Xu	7fddc33	2009-03-26 08:23:58 +0000	[diff] [blame]	3
Ted Kremenek	9d9963e	2009-03-26 16:19:54 +0000	[diff] [blame]	4	INTRODUCTION
Zhongxing Xu	7fddc33	2009-03-26 08:23:58 +0000	[diff] [blame]	5
Ted Kremenek	9d9963e	2009-03-26 16:19:54 +0000	[diff] [blame]	6	The path-sensitive analysis engine in libAnalysis employs an extensible API
				7	for abstractly modeling the memory of an analyzed program. This API employs
				8	the concept of "memory regions" to abstractly model chunks of program memory
				9	such as program variables and dynamically allocated memory such as those
				10	returned from 'malloc' and 'alloca'. Regions are hierarchical, with subregions
				11	modeling subtyping relationships, field and array offsets into larger chunks
				12	of memory, and so on.
Zhongxing Xu	7fddc33	2009-03-26 08:23:58 +0000	[diff] [blame]	13
Ted Kremenek	9d9963e	2009-03-26 16:19:54 +0000	[diff] [blame]	14	The region API consists of two components. The first is the taxonomy and
				15	representation of regions themselves within the analyzer engine. The primary
				16	definitions and interfaces are described in
				17	'include/clang/Analysis/PathSensitive/MemRegion.h'. At the root of the region
				18	hierarchy is the class 'MemRegion' with specific subclasses refining the
				19	region concept for variables, heap allocated memory, and so forth.
Zhongxing Xu	7fddc33	2009-03-26 08:23:58 +0000	[diff] [blame]	20
Ted Kremenek	9d9963e	2009-03-26 16:19:54 +0000	[diff] [blame]	21	The second component in the region API is the modeling of the binding of
				22	values to regions. For example, modeling the value stored to a local variable
				23	'x' consists of recording the binding between the region for 'x' (which
				24	represents the raw memory associated with 'x') and the value stored to 'x'.
				25	This binding relationship is captured with the notion of "symbolic stores."
				26
				27	Symbolic stores, which can be thought of as representing the relation 'regions
				28	-> values', are implemented by subclasses of the StoreManager class (Store.h).
				29	A particular StoreManager implementation has complete flexibility concerning
				30	(a) how to model the binding between regions and values and (b) what
				31	bindings are recorded. Together, both points allow different StoreManagers to
				32	tradeoff between different levels of analysis precision and scalability
				33	concerning the reasoning of program memory. Meanwhile, the core path-sensitive
				34	engine makes no assumptions about (a) or (b), and queries a StoreManager about
				35	the bindings to a memory region through a generic interface that all
				36	StoreManagers share. If a particular StoreManager cannot reason about the
				37	potential bindings of a given memory region (e.g., 'BasicStoreManager' does
				38	not reason about fields of structures) then the StoreManager can simply return
				39	'unknown' (represented by 'UnknownVal') for a particular region-binding. This
				40	separation of concerns not only isolates the core analysis engine from the
				41	details of reasoning about program memory but also facilities the option of a
				42	client of the path-sensitive engine to easily swap in different StoreManager
				43	implementations that internally reason about program memory in very different
				44	ways.
				45
				46	The rest of this document is divided into two parts. We first discuss region
				47	taxonomy and the semantics of regions. We then discuss the StoreManager
				48	interface, and details of how the currently available StoreManager classes
				49	implement region bindings.
				50
				51	MEMORY REGIONS and REGION TAXONOMY
				52
Zhongxing Xu	59eb5f9	2009-04-01 05:05:22 +0000	[diff] [blame]	53	POINTERS
				54
				55	Before talking about the memory regions, we would talk about the pointers
				56	since memory regions are essentially used to represent pointer values.
				57
				58	The pointer is a type of values. Pointer values have two semantic aspects. One
				59	is its physical value, which is an address or location. The other is the type
				60	of the memory object residing in the address.
				61
				62	Memory regions are designed to abstract these two properties of the
				63	pointer. The physical value of a pointer is represented by MemRegion
				64	pointers. The rvalue type of the region corresponds to the type of the pointee
				65	object.
				66
				67	One complication is that we could have different view regions on the same
				68	memory chunk. They represent the same memory location, but have different
				69	abstract location, i.e., MemRegion pointers. Thus we need to canonicalize
				70	the abstract locations to get a unique abstract location for one physical
				71	location.
				72
				73	Furthermore, these different view regions may or may not represent memory
				74	objects of different types. Some different types are semantically the same,
				75	for example, 'struct s' and 'my_type' are the same type.
				76	struct s;
				77	typedef struct s my_type;
				78
				79	But 'char' and 'int' are not the same type in the code below:
				80	void *p;
				81	int q = (int) p;
				82	char r = (char) p;
				83
				84	Thus we need to canonicalize the MemRegion which is used in binding and
				85	retrieving.
				86
Ted Kremenek	9d9963e	2009-03-26 16:19:54 +0000	[diff] [blame]	87	SYMBOLIC REGIONS
				88
				89	A symbolic region is a map of the concept of symbolic values into the domain
				90	of regions. It is the way that we represent symbolic pointers. Whenever a
				91	symbolic pointer value is needed, a symbolic region is created to represent
				92	it.
				93
				94	A symbolic region has no type. It wraps a SymbolData. But sometimes we have
				95	type information associated with a symbolic region. For this case, a
				96	TypedViewRegion is created to layer the type information on top of the
				97	symbolic region. The reason we do not carry type information with the symbolic
				98	region is that the symbolic regions can have no type. To be consistent, we
				99	don't let them to carry type information.
				100
				101	Like a symbolic pointer, a symbolic region may be NULL, has unknown extent,
				102	and represents a generic chunk of memory.
				103
				104	NOTE: We plan not to use loc::SymbolVal in RegionStore and remove it
				105	gradually.
Zhongxing Xu	7fddc33	2009-03-26 08:23:58 +0000	[diff] [blame]	106
Zhongxing Xu	113cc14	2009-04-01 03:23:38 +0000	[diff] [blame]	107	Symbolic regions get their rvalue types through the following ways:
				108	* through the parameter or global variable that points to it, e.g.:
				109
				110	void f(struct s* p) {
				111	...
				112	}
				113
				114	The symbolic region pointed to by 'p' has type 'struct s'.
				115
				116	* through explicit or implicit casts, e.g.:
				117	void f(void* p) {
				118	struct s* q = (struct s*) p;
				119	...
				120	}
				121
				122	We attach the type information to the symbolic region lazily. For the first
				123	case above, we create the TypedViewRegion only when the pointer is actually
				124	used to access the pointee memory object, that is when the element or field
				125	region is created. For the cast case, the TypedViewRegion is created when
				126	visiting the CastExpr.
				127
				128	The reason for doing lazy typing is that symbolic regions are sometimes only
				129	used to do location comparison.
				130
Zhongxing Xu	7fddc33	2009-03-26 08:23:58 +0000	[diff] [blame]	131	Pointer Casts
				132
Zhongxing Xu	7fe3e05	2009-04-01 05:26:39 +0000	[diff] [blame]	133	Pointer casts allow people to impose different 'views' onto a chunk of memory.
				134
				135	Usually we have two kinds of casts. One kind of casts cast down with in the
				136	type hierarchy. It imposes more specific views onto more generic memory
				137	regions. The other kind of casts cast up with in the type hierarchy. It strips
				138	away more specific views on top of the more generic memory regions.
				139
				140	We simulate the down casts by layering another TypedViewRegion on top of the
				141	original region. We simulate the up casts by striping away the top
				142	TypedViewRegion. Down casts is usually simple. For up casts, if the there is
				143	no TypedViewRegion to be stripped, we return the original region. If the
				144	underlying region is of the different type than the cast-to type, we flag an
				145	error state.
				146
				147	For toll-free bridging casts, we return the original region.
Zhongxing Xu	7fddc33	2009-03-26 08:23:58 +0000	[diff] [blame]	148
Zhongxing Xu	a564789	2009-04-01 06:01:08 +0000	[diff] [blame]	149	We can set up a lattice for pointer types, with the most general type 'void*'
				150	at the top. The lattice enforces a partial order among types.
				151
				152	Every MemRegion has a root position in the type lattice. For example, the
				153	pointee region of 'void *p' has its root position at the top of the lattice.
				154	VarRegion of 'int x' has its root position at the 'int type' node.
				155
				156	TypedViewRegion is used to move the region down or up in the lattice. Moving
				157	down in the lattice adds a TypedViewRegion. Moving up in the lattice removes a
				158	TypedViewRegion.
				159
				160	Do we want to allow moving up beyond the root position? This happens when:
				161	int x;
				162	void *p = &x;
				163
				164	The region of 'x' has its root position at 'int' node. the cast to void
				165	moves that region up to the 'void*' node. I propose to not allow such casts,
				166	and assign the region of 'x' for 'p'.
				167
Zhongxing Xu	7fddc33	2009-03-26 08:23:58 +0000	[diff] [blame]	168	Region Bindings
				169
Zhongxing Xu	7fe3e05	2009-04-01 05:26:39 +0000	[diff] [blame]	170	The following region kinds are boundable: VarRegion, CompoundLiteralRegion,
				171	StringRegion, ElementRegion, FieldRegion, and ObjCIvarRegion.
Zhongxing Xu	7fddc33	2009-03-26 08:23:58 +0000	[diff] [blame]	172
Zhongxing Xu	7fe3e05	2009-04-01 05:26:39 +0000	[diff] [blame]	173	When binding regions, we perform canonicalization on element regions and field
				174	regions. This is because we can have different views on the same region, some
				175	of which are essentially the same view with different sugar type names.
Zhongxing Xu	7fddc33	2009-03-26 08:23:58 +0000	[diff] [blame]	176
Zhongxing Xu	7fe3e05	2009-04-01 05:26:39 +0000	[diff] [blame]	177	To canonicalize a region, we get the canonical types for all TypedViewRegions
				178	along the way up to the root region, and make new TypedViewRegions with those
				179	canonical types.
Zhongxing Xu	a564789	2009-04-01 06:01:08 +0000	[diff] [blame]	180
				181	For ObjC and C++, perhaps another canonicalization rule should be added: for
				182	FieldRegion, the least derived class that has the field is used as the type
				183	of the super region of the FieldRegion.
Zhongxing Xu	7fe3e05	2009-04-01 05:26:39 +0000	[diff] [blame]	184
				185	All bindings and retrievings are done on the canonicalized regions.
				186
				187	Canonicalization is transparent outside the region store manager, and more
				188	specifically, unaware outside the Bind() and Retrieve() method. We don't need
				189	to consider region canonicalization when doing pointer cast.
Zhongxing Xu	a564789	2009-04-01 06:01:08 +0000	[diff] [blame]	190
				191	Constraint Manager
				192
				193	The constraint manager reasons about the abstract location of memory
				194	objects. We can have different views on a region, but none of these views
				195	changes the location of that object. Thus we should get the same abstract
				196	location for those regions.