Merge in a somewhat modified version of Jeremy Fitzhardinge's
translation chaining patch.

47-chained-bb

This implements basic-block chaining. Rather than always going through
the dispatch loop, a BB may jump directly to a successor BB if it is
present in the translation cache.
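
For contrast, here is a rough sketch of the unchained flow; the helper
names (fast_cache_lookup, translate_and_insert, run_translation) are
illustrative stand-ins, not the real dispatcher internals.

   typedef unsigned int Addr;

   extern Addr fast_cache_lookup    ( Addr orig_addr );   /* 0 if absent */
   extern Addr translate_and_insert ( Addr orig_addr );
   extern Addr run_translation      ( Addr trans_addr );  /* next orig addr */

   void dispatch_loop_sketch ( Addr eip )
   {
      while (1) {
         Addr trans = fast_cache_lookup(eip);
         if (trans == 0)
            trans = translate_and_insert(eip);
         /* Without chaining, every BB exit comes back here and pays
            for a lookup before the next BB can run. */
         eip = run_translation(trans);
      }
   }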

When the BB's code is first generated, the jumps to the successor BBs
are filled with undefined instructions. When the BB is inserted into
the translation cache, the undefined instructions are replaced with a
call to VG_(patch_me). When VG_(patch_me) is called, it looks up the
desired target address in the fast translation cache. If the target is
present, it backpatches the call to VG_(patch_me) with a direct jump to
the translated target BB. If the fast lookup fails, it falls back to
the normal dispatch loop.
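
A hedged sketch of what the patching step amounts to follows; the
helper names (fast_cache_lookup, backpatch_call_site,
jump_to_translation, enter_dispatch_loop) are made up for illustration
and do not match the real Valgrind internals.

   typedef unsigned int Addr;

   extern Addr fast_cache_lookup   ( Addr orig_addr );   /* 0 if absent */
   extern void backpatch_call_site ( Addr call_site, Addr trans_addr );
   extern void jump_to_translation ( Addr trans_addr );
   extern void enter_dispatch_loop ( Addr orig_addr );

   void patch_me_sketch ( Addr call_site, Addr orig_target )
   {
      Addr trans = fast_cache_lookup(orig_target);
      if (trans != 0) {
         /* Target already translated: rewrite the call site into a
            direct jump and continue there; later executions of this
            BB exit bypass the lookup entirely. */
         backpatch_call_site(call_site, trans);
         jump_to_translation(trans);
      } else {
         /* Not yet translated: take the slow path through the normal
            dispatch loop, leaving the call in place for next time. */
         enter_dispatch_loop(orig_target);
      }
   }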

When parts of the translation cache are discarded, all translations
are unchained, to ensure there are no direct jumps left into code which
has been thrown away.
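
A minimal sketch of the unchaining step, assuming the translation table
records each translation's chained jump sites as offsets (the structure
and helper below are illustrative; the (UShort)-1 "unused slot" sentinel
mirrors the jumps[] array in the diff, but the VG_MAX_JUMPS value here
is an assumption):

   typedef unsigned int   Addr;
   typedef unsigned short UShort;

   #define VG_MAX_JUMPS 8              /* assumed value, for the sketch */

   typedef struct {
      Addr   trans_addr;               /* start of the translated code */
      UShort jumps[VG_MAX_JUMPS];      /* offsets of chained jump sites;
                                          (UShort)-1 marks an unused slot */
   } TTEntrySketch;

   extern void rewrite_jump_as_call_to_patch_me ( Addr jump_site );

   void unchain_translation_sketch ( TTEntrySketch* tte )
   {
      int i;
      for (i = 0; i < VG_MAX_JUMPS; i++) {
         if (tte->jumps[i] == (UShort)-1)
            continue;                  /* slot never used */
         /* Turn the direct jump back into a call to VG_(patch_me), so
            it can never land in code that has been thrown away. */
         rewrite_jump_as_call_to_patch_me(tte->trans_addr + tte->jumps[i]);
      }
   }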

This optimisation affects only direct jumps; indirect jumps (including
returns) still go through the dispatch loop.  The -v stats indicate a
worst-case rate of about 16% of jumps having to go via the slow
mechanism.  This will be a combination of function returns and genuine
indirect jumps.

Certain actions of the dispatch loop have to be moved into each basic
block, namely updating the virtual EIP and keeping track of the
basic-block counter.
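
In effect, each chained BB's generated code now ends with the
equivalent of the following (the field names are illustrative, not the
real guest-state layout):

   typedef unsigned int       Addr;
   typedef unsigned long long ULong;

   typedef struct {
      Addr  virtual_eip;   /* guest EIP, previously updated only by the
                              dispatch loop between BBs */
      ULong bbs_done;      /* basic-block counter, likewise */
   } GuestStateSketch;

   void bb_epilogue_sketch ( GuestStateSketch* st, Addr next_eip )
   {
      st->virtual_eip = next_eip;   /* keep EIP correct even when we
                                       jump straight to the next BB */
      st->bbs_done++;               /* keep the BB count correct too */
   }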

At present, basic-block chaining seems to improve performance by up to
25% with --skin=none.  Gains for skins adding more instrumentation
will be correspondingly smaller.

There is a command line option: --chain-bb=yes|no (defaults to yes).
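
For example, to compare against the unchained case, one might run
something like the following (the target program here is arbitrary):

   valgrind --skin=none --chain-bb=no -v ls -l

and compare the -v jump statistics with those from a default
--chain-bb=yes run.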


git-svn-id: svn://svn.valgrind.org/valgrind/trunk@1336 a5019735-40e9-0310-863c-91ae7b9d1cf9
diff --git a/coregrind/vg_scheduler.c b/coregrind/vg_scheduler.c
index f11ed79..4e6561b 100644
--- a/coregrind/vg_scheduler.c
+++ b/coregrind/vg_scheduler.c
@@ -316,12 +316,17 @@
 static
 void create_translation_for ( ThreadId tid, Addr orig_addr )
 {
-   Addr trans_addr;
-   Int  orig_size, trans_size;
+   Addr   trans_addr;
+   Int    orig_size, trans_size;
+   UShort jumps[VG_MAX_JUMPS];
+   Int    i;
+
+   for(i = 0; i < VG_MAX_JUMPS; i++)
+      jumps[i] = (UShort)-1;
 
    /* Make a translation, into temporary storage. */
    VG_(translate)( &VG_(threads)[tid],
-                   orig_addr, &orig_size, &trans_addr, &trans_size );
+                   orig_addr, &orig_size, &trans_addr, &trans_size, jumps );
 
    /* Copy data at trans_addr into the translation cache. */
    /* Since the .orig_size and .trans_size fields are
@@ -329,7 +334,7 @@
    vg_assert(orig_size > 0 && orig_size < 65536);
    vg_assert(trans_size > 0 && trans_size < 65536);
 
-   VG_(add_to_trans_tab)( orig_addr, orig_size, trans_addr, trans_size );
+   VG_(add_to_trans_tab)( orig_addr, orig_size, trans_addr, trans_size, jumps );
 
    /* Free the intermediary -- was allocated by VG_(emit_code). */
    VG_(arena_free)( VG_AR_JITTER, (void*)trans_addr );
@@ -1579,7 +1584,7 @@
    VG_(printf)(
       "======vvvvvvvv====== LAST TRANSLATION ======vvvvvvvv======\n");
    VG_(translate)( &VG_(threads)[tid], 
-                   VG_(threads)[tid].m_eip, NULL, NULL, NULL );
+                   VG_(threads)[tid].m_eip, NULL, NULL, NULL, NULL );
    VG_(printf)("\n");
    VG_(printf)(
       "======^^^^^^^^====== LAST TRANSLATION ======^^^^^^^^======\n");