docs/AnalyzerRegions.txt - platform/external/clang - Gitiles

 Static Analyzer: 'Regions'
 --------------------------

 INTRODUCTION

   The path-sensitive analysis engine in libAnalysis employs an extensible API
   for abstractly modeling the memory of an analyzed program. This API employs
   the concept of "memory regions" to abstractly model chunks of program memory
   such as program variables and dynamically allocated memory such as those
   returned from 'malloc' and 'alloca'. Regions are hierarchical, with subregions
   modeling subtyping relationships, field and array offsets into larger chunks
   of memory, and so on.

   The region API consists of two components. The first is the taxonomy and
   representation of regions themselves within the analyzer engine. The primary
   definitions and interfaces are described in
   'include/clang/Analysis/PathSensitive/MemRegion.h'. At the root of the region
   hierarchy is the class 'MemRegion' with specific subclasses refining the
   region concept for variables, heap allocated memory, and so forth.

   The second component in the region API is the modeling of the binding of
   values to regions. For example, modeling the value stored to a local variable
   'x' consists of recording the binding between the region for 'x' (which
   represents the raw memory associated with 'x') and the value stored to 'x'.
   This binding relationship is captured with the notion of "symbolic stores."

   Symbolic stores, which can be thought of as representing the relation 'regions
   -> values', are implemented by subclasses of the StoreManager class (Store.h).
   A particular StoreManager implementation has complete flexibility concerning
   (a) *how* to model the binding between regions and values and (b) *what*
   bindings are recorded. Together, both points allow different StoreManagers to
   tradeoff between different levels of analysis precision and scalability
   concerning the reasoning of program memory. Meanwhile, the core path-sensitive
   engine makes no assumptions about (a) or (b), and queries a StoreManager about
   the bindings to a memory region through a generic interface that all
   StoreManagers share. If a particular StoreManager cannot reason about the
   potential bindings of a given memory region (e.g., 'BasicStoreManager' does
   not reason about fields of structures) then the StoreManager can simply return
   'unknown' (represented by 'UnknownVal') for a particular region-binding. This
   separation of concerns not only isolates the core analysis engine from the
   details of reasoning about program memory but also facilities the option of a
   client of the path-sensitive engine to easily swap in different StoreManager
   implementations that internally reason about program memory in very different
   ways.

   The rest of this document is divided into two parts. We first discuss region
   taxonomy and the semantics of regions. We then discuss the StoreManager
   interface, and details of how the currently available StoreManager classes
   implement region bindings.

 MEMORY REGIONS and REGION TAXONOMY

   POINTERS

   Before talking about the memory regions, we would talk about the pointers
   since memory regions are essentially used to represent pointer values.

   The pointer is a type of values. Pointer values have two semantic aspects. One
   is its physical value, which is an address or location. The other is the type
   of the memory object residing in the address.

   Memory regions are designed to abstract these two properties of the
   pointer. The physical value of a pointer is represented by MemRegion
   pointers. The rvalue type of the region corresponds to the type of the pointee
   object.

   One complication is that we could have different view regions on the same
   memory chunk. They represent the same memory location, but have different
   abstract location, i.e., MemRegion pointers. Thus we need to canonicalize
   the abstract locations to get a unique abstract location for one physical
   location.

   Furthermore, these different view regions may or may not represent memory
   objects of different types. Some different types are semantically the same,
   for example, 'struct s' and 'my_type' are the same type.
     struct s;
     typedef struct s my_type;

   But 'char' and 'int' are not the same type in the code below:
     void *p;
     int *q = (int*) p;
     char *r = (char*) p;

   Thus we need to canonicalize the MemRegion which is used in binding and
   retrieving.

   SYMBOLIC REGIONS

   A symbolic region is a map of the concept of symbolic values into the domain
   of regions. It is the way that we represent symbolic pointers. Whenever a
   symbolic pointer value is needed, a symbolic region is created to represent
   it.

   A symbolic region has no type. It wraps a SymbolData. But sometimes we have
   type information associated with a symbolic region. For this case, a
   TypedViewRegion is created to layer the type information on top of the
   symbolic region. The reason we do not carry type information with the symbolic
   region is that the symbolic regions can have no type. To be consistent, we
   don't let them to carry type information.

   Like a symbolic pointer, a symbolic region may be NULL, has unknown extent,
   and represents a generic chunk of memory.

   NOTE: We plan not to use loc::SymbolVal in RegionStore and remove it
   gradually.

   Symbolic regions get their rvalue types through the following ways:
   * through the parameter or global variable that points to it, e.g.:

     void f(struct s* p) {
       ...
     }

     The symbolic region pointed to by 'p' has type 'struct s'.

   * through explicit or implicit casts, e.g.:
     void f(void* p) {
       struct s* q = (struct s*) p;
       ...
     }

   We attach the type information to the symbolic region lazily. For the first
   case above, we create the TypedViewRegion only when the pointer is actually
   used to access the pointee memory object, that is when the element or field
   region is created. For the cast case, the TypedViewRegion is created when
   visiting the CastExpr.

   The reason for doing lazy typing is that symbolic regions are sometimes only
   used to do location comparison.

 Pointer Casts

   Pointer casts allow people to impose different 'views' onto a chunk of memory.

   Usually we have two kinds of casts. One kind of casts cast down with in the
   type hierarchy. It imposes more specific views onto more generic memory
   regions. The other kind of casts cast up with in the type hierarchy. It strips
   away more specific views on top of the more generic memory regions.

   We simulate the down casts by layering another TypedViewRegion on top of the
   original region. We simulate the up casts by striping away the top
   TypedViewRegion. Down casts is usually simple. For up casts, if the there is
   no TypedViewRegion to be stripped, we return the original region. If the
   underlying region is of the different type than the cast-to type, we flag an
   error state.

   For toll-free bridging casts, we return the original region.

   We can set up a partial order for pointer types, with the most general type
   'void*' at the top. The partial order forms a tree with 'void*' as its root
   node.

   Every MemRegion has a root position in the type tree. For example, the pointee
   region of 'void *p' has its root position at the root node of the tree.
   VarRegion of 'int x' has its root position at the 'int type' node.

   TypedViewRegion is used to move the region down or up in the tree. Moving
   down in the tree adds a TypedViewRegion. Moving up in the tree removes a
   TypedViewRegion.

   Do we want to allow moving up beyond the root position? This happens when:
     int x;
     void *p = &x;

   The region of 'x' has its root position at 'int*' node. the cast to void*
   moves that region up to the 'void*' node. I propose to not allow such casts,
   and assign the region of 'x' for 'p'.

 Region Bindings

   The following region kinds are boundable: VarRegion, CompoundLiteralRegion,
   StringRegion, ElementRegion, FieldRegion, and ObjCIvarRegion.

   When binding regions, we perform canonicalization on element regions and field
   regions. This is because we can have different views on the same region, some
   of which are essentially the same view with different sugar type names.

   To canonicalize a region, we get the canonical types for all TypedViewRegions
   along the way up to the root region, and make new TypedViewRegions with those
   canonical types.

   For ObjC and C++, perhaps another canonicalization rule should be added: for
   FieldRegion, the least derived class that has the field is used as the type
   of the super region of the FieldRegion.

   All bindings and retrievings are done on the canonicalized regions.

   Canonicalization is transparent outside the region store manager, and more
   specifically, unaware outside the Bind() and Retrieve() method. We don't need
   to consider region canonicalization when doing pointer cast.

 Constraint Manager

   The constraint manager reasons about the abstract location of memory
   objects. We can have different views on a region, but none of these views
   changes the location of that object. Thus we should get the same abstract
   location for those regions.
	Static Analyzer: 'Regions'
	--------------------------

	INTRODUCTION

	The path-sensitive analysis engine in libAnalysis employs an extensible API
	for abstractly modeling the memory of an analyzed program. This API employs
	the concept of "memory regions" to abstractly model chunks of program memory
	such as program variables and dynamically allocated memory such as those
	returned from 'malloc' and 'alloca'. Regions are hierarchical, with subregions
	modeling subtyping relationships, field and array offsets into larger chunks
	of memory, and so on.

	The region API consists of two components. The first is the taxonomy and
	representation of regions themselves within the analyzer engine. The primary
	definitions and interfaces are described in
	'include/clang/Analysis/PathSensitive/MemRegion.h'. At the root of the region
	hierarchy is the class 'MemRegion' with specific subclasses refining the
	region concept for variables, heap allocated memory, and so forth.

	The second component in the region API is the modeling of the binding of
	values to regions. For example, modeling the value stored to a local variable
	'x' consists of recording the binding between the region for 'x' (which
	represents the raw memory associated with 'x') and the value stored to 'x'.
	This binding relationship is captured with the notion of "symbolic stores."

	Symbolic stores, which can be thought of as representing the relation 'regions
	-> values', are implemented by subclasses of the StoreManager class (Store.h).
	A particular StoreManager implementation has complete flexibility concerning
	(a) how to model the binding between regions and values and (b) what
	bindings are recorded. Together, both points allow different StoreManagers to
	tradeoff between different levels of analysis precision and scalability
	concerning the reasoning of program memory. Meanwhile, the core path-sensitive
	engine makes no assumptions about (a) or (b), and queries a StoreManager about
	the bindings to a memory region through a generic interface that all
	StoreManagers share. If a particular StoreManager cannot reason about the
	potential bindings of a given memory region (e.g., 'BasicStoreManager' does
	not reason about fields of structures) then the StoreManager can simply return
	'unknown' (represented by 'UnknownVal') for a particular region-binding. This
	separation of concerns not only isolates the core analysis engine from the
	details of reasoning about program memory but also facilities the option of a
	client of the path-sensitive engine to easily swap in different StoreManager
	implementations that internally reason about program memory in very different
	ways.

	The rest of this document is divided into two parts. We first discuss region
	taxonomy and the semantics of regions. We then discuss the StoreManager
	interface, and details of how the currently available StoreManager classes
	implement region bindings.

	MEMORY REGIONS and REGION TAXONOMY

	POINTERS

	Before talking about the memory regions, we would talk about the pointers
	since memory regions are essentially used to represent pointer values.

	The pointer is a type of values. Pointer values have two semantic aspects. One
	is its physical value, which is an address or location. The other is the type
	of the memory object residing in the address.

	Memory regions are designed to abstract these two properties of the
	pointer. The physical value of a pointer is represented by MemRegion
	pointers. The rvalue type of the region corresponds to the type of the pointee
	object.

	One complication is that we could have different view regions on the same
	memory chunk. They represent the same memory location, but have different
	abstract location, i.e., MemRegion pointers. Thus we need to canonicalize
	the abstract locations to get a unique abstract location for one physical
	location.

	Furthermore, these different view regions may or may not represent memory
	objects of different types. Some different types are semantically the same,
	for example, 'struct s' and 'my_type' are the same type.
	struct s;
	typedef struct s my_type;

	But 'char' and 'int' are not the same type in the code below:
	void *p;
	int q = (int) p;
	char r = (char) p;

	Thus we need to canonicalize the MemRegion which is used in binding and
	retrieving.

	SYMBOLIC REGIONS

	A symbolic region is a map of the concept of symbolic values into the domain
	of regions. It is the way that we represent symbolic pointers. Whenever a
	symbolic pointer value is needed, a symbolic region is created to represent
	it.

	A symbolic region has no type. It wraps a SymbolData. But sometimes we have
	type information associated with a symbolic region. For this case, a
	TypedViewRegion is created to layer the type information on top of the
	symbolic region. The reason we do not carry type information with the symbolic
	region is that the symbolic regions can have no type. To be consistent, we
	don't let them to carry type information.

	Like a symbolic pointer, a symbolic region may be NULL, has unknown extent,
	and represents a generic chunk of memory.

	NOTE: We plan not to use loc::SymbolVal in RegionStore and remove it
	gradually.

	Symbolic regions get their rvalue types through the following ways:
	* through the parameter or global variable that points to it, e.g.:

	void f(struct s* p) {
	...
	}

	The symbolic region pointed to by 'p' has type 'struct s'.

	* through explicit or implicit casts, e.g.:
	void f(void* p) {
	struct s* q = (struct s*) p;
	...
	}

	We attach the type information to the symbolic region lazily. For the first
	case above, we create the TypedViewRegion only when the pointer is actually
	used to access the pointee memory object, that is when the element or field
	region is created. For the cast case, the TypedViewRegion is created when
	visiting the CastExpr.

	The reason for doing lazy typing is that symbolic regions are sometimes only
	used to do location comparison.

	Pointer Casts

	Pointer casts allow people to impose different 'views' onto a chunk of memory.

	Usually we have two kinds of casts. One kind of casts cast down with in the
	type hierarchy. It imposes more specific views onto more generic memory
	regions. The other kind of casts cast up with in the type hierarchy. It strips
	away more specific views on top of the more generic memory regions.

	We simulate the down casts by layering another TypedViewRegion on top of the
	original region. We simulate the up casts by striping away the top
	TypedViewRegion. Down casts is usually simple. For up casts, if the there is
	no TypedViewRegion to be stripped, we return the original region. If the
	underlying region is of the different type than the cast-to type, we flag an
	error state.

	For toll-free bridging casts, we return the original region.

	We can set up a partial order for pointer types, with the most general type
	'void' at the top. The partial order forms a tree with 'void' as its root
	node.

	Every MemRegion has a root position in the type tree. For example, the pointee
	region of 'void *p' has its root position at the root node of the tree.
	VarRegion of 'int x' has its root position at the 'int type' node.

	TypedViewRegion is used to move the region down or up in the tree. Moving
	down in the tree adds a TypedViewRegion. Moving up in the tree removes a
	TypedViewRegion.

	Do we want to allow moving up beyond the root position? This happens when:
	int x;
	void *p = &x;

	The region of 'x' has its root position at 'int' node. the cast to void
	moves that region up to the 'void*' node. I propose to not allow such casts,
	and assign the region of 'x' for 'p'.

	Region Bindings

	The following region kinds are boundable: VarRegion, CompoundLiteralRegion,
	StringRegion, ElementRegion, FieldRegion, and ObjCIvarRegion.

	When binding regions, we perform canonicalization on element regions and field
	regions. This is because we can have different views on the same region, some
	of which are essentially the same view with different sugar type names.

	To canonicalize a region, we get the canonical types for all TypedViewRegions
	along the way up to the root region, and make new TypedViewRegions with those
	canonical types.

	For ObjC and C++, perhaps another canonicalization rule should be added: for
	FieldRegion, the least derived class that has the field is used as the type
	of the super region of the FieldRegion.

	All bindings and retrievings are done on the canonicalized regions.

	Canonicalization is transparent outside the region store manager, and more
	specifically, unaware outside the Bind() and Retrieve() method. We don't need
	to consider region canonicalization when doing pointer cast.

	Constraint Manager

	The constraint manager reasons about the abstract location of memory
	objects. We can have different views on a region, but none of these views
	changes the location of that object. Thus we should get the same abstract
	location for those regions.