Improve performance by removing a DMB and forcing inlining.

Make the CAS used by Mutex::Lock an acquire operation rather than a release.
Don't issue a memory barrier in thread transitions when a barrier is already
associated with the mutator lock.
Force inlining of the hot thread and shared lock code, which is heavily used
by down calls and JNI.
Force inlining of hot mirror routines that are used by runtime support.
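
To illustrate why taking a lock wants acquire ordering on the CAS, here is a
minimal sketch built on std::atomic. It is not ART's Mutex: SpinLock and
CountWithLock are hypothetical names used only for this example, and the
spin loop stands in for whatever blocking the real lock does.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical minimal lock used for illustration only; not ART's Mutex.
class SpinLock {
 public:
  void Lock() {
    int32_t expected = 0;
    // Acquire on success: reads and writes inside the critical section
    // cannot be reordered before the point where the lock is taken.
    while (!state_.compare_exchange_weak(expected, 1,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed)) {
      // On failure, compare_exchange_weak stores the observed value in
      // expected, so reset it before retrying.
      expected = 0;
    }
  }

  void Unlock() {
    // Release: writes made in the critical section become visible to the
    // next thread whose acquire CAS observes this store.
    state_.store(0, std::memory_order_release);
  }

 private:
  std::atomic<int32_t> state_{0};
};

// Hypothetical helper: increments a plain counter under the lock from
// several threads and returns the final value.
int64_t CountWithLock(int num_threads, int iterations) {
  SpinLock lock;
  int64_t counter = 0;
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&lock, &counter, iterations] {
      for (int i = 0; i < iterations; ++i) {
        lock.Lock();
        ++counter;
        lock.Unlock();
      }
    });
  }
  for (auto& th : threads) {
    th.join();
  }
  return counter;
}
```

With the acquire CAS paired with a release store on unlock, each increment
happens after the previous holder's write, so CountWithLock(4, 10000) is
expected to return 40000.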

Performance was measured and improved using perf and maps.

Change-Id: I012580e337143236d8b6d06c1e270183ae51083c
diff --git a/src/runtime_support.h b/src/runtime_support.h
index a504237..09ca0aa 100644
--- a/src/runtime_support.h
+++ b/src/runtime_support.h
@@ -25,6 +25,7 @@
 #include "jni_internal.h"
 #include "mirror/abstract_method.h"
 #include "mirror/array.h"
+#include "mirror/class-inl.h"
 #include "mirror/throwable.h"
 #include "object_utils.h"
 #include "thread.h"