ART: JNI thread state transition optimization

This patch improves JNI performance by removing the explicit acquisition and
release of the mutator lock when a thread transitions between the suspended
and runnable states.

The functions responsible for changing the thread state were found to be the
costliest part of JNI. Previously, a thread had to acquire the shared mutator
lock with a CAS instruction when entering the runnable state, and had to
release the lock with another CAS when entering the native state from
runnable. This patch removes these CAS operations from transitions between the
suspended and runnable states. A thread in the runnable state is now considered
to hold shared ownership of the mutator lock, so transitions into and out of
the runnable state implicitly acquire and release that ownership. In addition,
a barrier is added to coordinate suspending all running threads.
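
The idea can be illustrated with a toy model. This is only a sketch, not the
code in this patch: ToyThread, SuspendBarrier and every function and constant
name below are hypothetical, and it assumes a single suspender at a time. The
thread's state word doubles as its implicit share of the mutator lock, and the
barrier counts the runnable threads the suspender still has to wait for.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Toy model only: names and bit layout are illustrative, not ART's definitions.
constexpr uint32_t kNative = 0;
constexpr uint32_t kRunnable = 1;
constexpr uint32_t kStateMask = 0xffffu;
constexpr uint32_t kSuspendRequest = 1u << 16;  // flag set by the suspender

struct ToyThread {
  std::atomic<uint32_t> state_and_flags{kNative};
};

// Counts runnable threads that the suspender is still waiting for.
struct SuspendBarrier {
  std::atomic<int32_t> pending{0};
};

// Native -> runnable: a single CAS on the thread's own word. Being runnable
// implies shared ownership; there is no second CAS on a lock word. The CAS
// fails while a suspend request is pending.
inline bool TryEnterRunnable(ToyThread& t) {
  uint32_t expected = kNative;  // native state with no flags set
  return t.state_and_flags.compare_exchange_strong(
      expected, kRunnable, std::memory_order_acquire);
}

// Runnable -> native: clear the state bits, keep the flags. Only when a
// suspend was requested does the thread touch the shared barrier.
inline void LeaveRunnable(ToyThread& t, SuspendBarrier& b) {
  uint32_t old =
      t.state_and_flags.fetch_and(~kStateMask, std::memory_order_release);
  if (old & kSuspendRequest) {
    b.pending.fetch_sub(1, std::memory_order_release);
  }
}

// Suspender side: flag the thread; if it was runnable, it must pass the barrier.
inline void RequestSuspend(ToyThread& t, SuspendBarrier& b) {
  uint32_t old =
      t.state_and_flags.fetch_or(kSuspendRequest, std::memory_order_acq_rel);
  if ((old & kStateMask) == kRunnable) {
    b.pending.fetch_add(1, std::memory_order_relaxed);
  }
}

// True once every flagged runnable thread has left the runnable state.
inline bool AllSuspended(const SuspendBarrier& b) {
  return b.pending.load(std::memory_order_acquire) == 0;
}

// Clear the suspend request so the thread may become runnable again.
inline void Resume(ToyThread& t) {
  t.state_and_flags.fetch_and(~kSuspendRequest, std::memory_order_release);
}
```

In this model the fast path costs one atomic operation per transition, and the
barrier is touched only while a suspend request is outstanding.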

This patch reduces JNI transition overhead by 25% on IA and by 17% on ARM,
with little impact on GC pause time (measured with the "suspend all
histogram").

Change-Id: Icee95d8ffff1bbfc95309a41cc48836536fec689
Signed-off-by: Yu, Li <yu.l.li@intel.com>
Signed-off-by: Haitao, Feng <haitao.feng@intel.com>
Signed-off-by: Lei, Li <lei.l.li@intel.com>
diff --git a/runtime/entrypoints_order_test.cc b/runtime/entrypoints_order_test.cc
index 0a5ebfa..656944a 100644
--- a/runtime/entrypoints_order_test.cc
+++ b/runtime/entrypoints_order_test.cc
@@ -116,7 +116,7 @@
     EXPECT_OFFSET_DIFFP(Thread, tlsPtr_, last_no_thread_suspension_cause, checkpoint_functions,
                         sizeof(void*));
     EXPECT_OFFSET_DIFFP(Thread, tlsPtr_, checkpoint_functions, interpreter_entrypoints,
-                        sizeof(void*) * 3);
+                        sizeof(void*) * 6);
 
     // Skip across the entrypoints structures.