Use CLREX in ARM/ARM64 CAS intrinsic Baker read barrier slow paths.

Follow clang's implementation, which executes CLREX on the
failure path of compare-and-exchange operations, i.e. when
the value read by the LDREX (ARM) or LDXR (ARM64)
instruction is not the expected value, in order to clear
the local exclusive monitor.  The previous implementation
was correct, but this one may improve performance on some
micro-architectures.

This change only affects the
art::arm::ReadBarrierMarkAndUpdateFieldSlowPathARM and
art::arm64::ReadBarrierMarkAndUpdateFieldSlowPathARM64 slow
paths.
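
For illustration only, the control flow of the updated loop
can be sketched in portable C++ against a toy model of the
exclusive monitor.  ToyMonitor and ToyCompareAndSwap are
hypothetical names invented for this sketch; they are not
ART or VIXL code, and the model ignores other observers and
memory ordering.

```cpp
#include <cstdint>

// Toy, single-threaded model of a local exclusive monitor (hypothetical).
struct ToyMonitor {
  bool armed = false;

  // Models LDREX/LDXR: read the value and arm the monitor.
  uint32_t LoadExclusive(const uint32_t* addr) {
    armed = true;
    return *addr;
  }

  // Models STREX/STXR: store only if the monitor is armed.
  // Returns 0 on success and 1 on failure, like the status register.
  uint32_t StoreExclusive(uint32_t* addr, uint32_t value) {
    if (!armed) {
      return 1;
    }
    *addr = value;
    armed = false;
    return 0;
  }

  // Models CLREX: disarm the monitor without storing anything.
  void ClearExclusive() { armed = false; }
};

// Mirrors the branch structure of the patched loop.
// Returns true if `*addr` was updated to `new_value`.
bool ToyCompareAndSwap(ToyMonitor& monitor,
                       uint32_t* addr,
                       uint32_t expected,
                       uint32_t new_value) {
  for (;;) {                                          // loop_head
    uint32_t old_value = monitor.LoadExclusive(addr);
    if (old_value != expected) {                      // comparison_failed
      monitor.ClearExclusive();                       // the newly added CLREX
      return false;
    }
    if (monitor.StoreExclusive(addr, new_value) == 0) {
      return true;                                    // exit_loop
    }
    // The exclusive store failed: retry from loop_head.
  }
}
```

In this sketch the monitor is left disarmed on both exits;
before this change, the failure exit left it armed.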

Test: make test-art-target-run-test-004-UnsafeTest
Bug: 29516905
Bug: 12687968
Change-Id: I99edd1ae6489dcec4a0089bfef52736114c6cd48
diff --git a/compiler/optimizing/code_generator_arm64.cc b/compiler/optimizing/code_generator_arm64.cc
index 60d7faf..642d883 100644
--- a/compiler/optimizing/code_generator_arm64.cc
+++ b/compiler/optimizing/code_generator_arm64.cc
@@ -794,13 +794,16 @@
     //   tmp_value = [tmp_ptr] - expected;
     // } while (tmp_value == 0 && failure([tmp_ptr] <- r_new_value));
 
-    vixl::aarch64::Label loop_head, exit_loop;
+    vixl::aarch64::Label loop_head, comparison_failed, exit_loop;
     __ Bind(&loop_head);
     __ Ldxr(tmp_value, MemOperand(tmp_ptr));
     __ Cmp(tmp_value, expected);
-    __ B(&exit_loop, ne);
+    __ B(&comparison_failed, ne);
     __ Stxr(tmp_value, value, MemOperand(tmp_ptr));
     __ Cbnz(tmp_value, &loop_head);
+    __ B(&exit_loop);
+    __ Bind(&comparison_failed);
+    __ Clrex();
     __ Bind(&exit_loop);
 
     if (kPoisonHeapReferences) {