Blame - docs/HistoricalNotes/2003-06-25-Reoptimizer1.txt - fp2-dev/platform/external/llvm

blob: a7457846395b5e370924f993935e9bf959b06cfc [file] [log] [blame]

Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	1	Wed Jun 25 15:13:51 CDT 2003
				2
				3	First-level instrumentation
				4	---------------------------
				5
				6	We use opt to do Bytecode-to-bytecode instrumentation. Look at
				7	back-edges and insert llvm_first_trigger() function call which takes
				8	no arguments and no return value. This instrumentation is designed to
				9	be easy to remove, for instance by writing a NOP over the function
				10	call instruction.
				11
				12	Keep count of every call to llvm_first_trigger(), and maintain
				13	counters in a map indexed by return address. If the trigger count
				14	exceeds a threshold, we identify a hot loop and perform second-level
				15	instrumentation on the hot loop region (the instructions between the
				16	target of the back-edge and the branch that causes the back-edge). We
				17	do not move code across basic-block boundaries.
				18
				19
				20	Second-level instrumentation
				21	---------------------------
				22
				23	We remove the first-level instrumentation by overwriting the CALL to
				24	llvm_first_trigger() with a NOP.
				25
				26	The reoptimizer maintains a map between machine-code basic blocks and
				27	LLVM BasicBlock*s. We only keep track of paths that start at the
				28	first machine-code basic block of the hot loop region.
				29
				30	How do we keep track of which edges to instrument, and which edges are
				31	exits from the hot region? 3 step process.
				32
				33	1) Do a DFS from the first machine-code basic block of the hot loop
				34	region and mark reachable edges.
				35
				36	2) Do a DFS from the last machine-code basic block of the hot loop
				37	region IGNORING back edges, and mark the edges which are reachable in
				38	1) and also in 2) (i.e., must be reachable from both the start BB and
				39	the end BB of the hot region).
				40
				41	3) Mark BBs which end in edges that exit the hot region; we need to
				42	instrument these differently.
				43
				44	Assume that there is 1 free register. On SPARC we use %g1, which LLC
				45	has agreed not to use. Shift a 1 into it at the beginning. At every
				46	edge which corresponds to a conditional branch, we shift 0 for not
				47	taken and 1 for taken into a register. This uniquely numbers the paths
				48	through the hot region. Silently fail if we need more than 64 bits.
				49
				50	At the end BB we call countPath and increment the counter based on %g1
				51	and the return address of the countPath call. We keep track of the
				52	number of iterations and the number of paths. We only run this
				53	version 30 or 40 times.
				54
				55	Find the BBs that total 90% or more of execution, and aggregate them
				56	together to form our trace. But we do not allow more than 5 paths; if
				57	we have more than 5 we take the ones that are executed the most. We
				58	verify our assumption that we picked a hot back-edge in first-level
				59	instrumentation, by making sure that the number of times we took an
				60	exit edge from the hot trace is less than 10% of the number of
				61	iterations.
				62
				63	LLC has been taught to recognize llvm_first_trigger() calls and NOT
				64	generate saves and restores of caller-saved registers around these
				65	calls.
				66
				67
				68	Phase behavior
				69	--------------
				70
				71	We turn off llvm_first_trigger() calls with NOPs, but this would hide
				72	phase behavior from us (when some funcs/traces stop being hot and
				73	others become hot.)
				74
				75	We have a SIGALRM timer that counts time for us. Every time we get a
				76	SIGALRM we look at our priority queue of locations where we have
				77	removed llvm_first_trigger() calls. Each location is inserted along
				78	with a time when we will next turn instrumentation back on for that
				79	call site. If the time has arrived for a particular call site, we pop
				80	that off the prio. queue and turn instrumentation back on for that
				81	call site.
				82
				83
				84	Generating traces
				85	-----------------
				86
				87	When we finally generate an optimized trace we first copy the code
				88	into the trace cache. This leaves us with 3 copies of the code: the
				89	original code, the instrumented code, and the optimized trace. The
				90	optimized trace does not have instrumentation. The original code and
				91	the instrumented code are modified to have a branch to the trace
				92	cache, where the optimized traces are kept.
				93
				94	We copy the code from the original to the instrumentation version
				95	by tracing the LLVM-to-Machine code basic block map and then copying
				96	each machine code basic block we think is in the hot region into the
				97	trace cache. Then we instrument that code. The process is similar for
				98	generating the final optimized trace; we copy the same basic blocks
				99	because we might need to put in fixup code for exit BBs.
				100
				101	LLVM basic blocks are not typically used in the Reoptimizer except
				102	for the mapping information.
				103
				104	We are restricted to using single instructions to branch between the
				105	original code, trace, and instrumented code. So we have to keep the
				106	code copies in memory near the original code (they can't be far enough
				107	away that a single pc-relative branch would not work.) Malloc() or
				108	data region space is too far away. this impacts the design of the
				109	trace cache.
				110
				111	We use a dummy function that is full of a bunch of for loops which we
				112	overwrite with trace-cache code. The trace manager keeps track of
				113	whether or not we have enough space in the trace cache, etc.
				114
				115	The trace insertion routine takes an original start address, a vector
				116	of machine instructions representing the trace, index of branches and
				117	their corresponding absolute targets, and index of calls and their
				118	corresponding absolute targets.
				119
				120	The trace insertion routine is responsible for inserting branches from
				121	the beginning of the original code to the beginning of the optimized
				122	trace. This is because at some point the trace cache may run out of
				123	space and it may have to evict a trace, at which point the branch to
				124	the trace would also have to be removed. It uses a round-robin
				125	replacement policy; we have found that this is almost as good as LRU
				126	and better than random (especially because of problems fitting the new
				127	trace in.)
				128
				129	We cannot deal with discontiguous trace cache areas. The trace cache
				130	is supposed to be cache-line-aligned, but it is not page-aligned.
				131
				132	We generate instrumentation traces and optimized traces into separate
				133	trace caches. We keep the instrumented code around because you don't
				134	want to delete a trace when you still might have to return to it
				135	(i.e., return from a llvm_first_trigger() or countPath() call.)
				136
				137