Parallel mark stack processing

Enabled parallel mark stack processing by using a thread pool.
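
For illustration only, here is a minimal sketch of the general idea: the mark stack is split into chunks and each worker drains its chunk with a thread-local stack. It is not the code in this change. The Object struct, the index-based heap, and ProcessMarkStackParallel are hypothetical names, and plain std::thread workers stand in for the thread pool. Entries already on the mark stack are assumed to be marked.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical object model: each object lists the heap indices it references.
struct Object {
  std::vector<std::size_t> refs;
};

// Marks everything reachable from the entries of mark_stack. The stack is
// split into contiguous chunks, one per worker. Returns the number of
// objects that were newly marked.
std::size_t ProcessMarkStackParallel(const std::vector<Object>& heap,
                                     const std::vector<std::size_t>& mark_stack,
                                     std::vector<std::atomic<bool>>& marked,
                                     std::size_t thread_count) {
  thread_count = std::max<std::size_t>(1, thread_count);
  const std::size_t chunk = (mark_stack.size() + thread_count - 1) / thread_count;
  std::atomic<std::size_t> newly_marked{0};
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < thread_count; ++t) {
    const std::size_t begin = std::min(mark_stack.size(), t * chunk);
    const std::size_t end = std::min(mark_stack.size(), begin + chunk);
    workers.emplace_back([&heap, &mark_stack, &marked, &newly_marked, begin, end] {
      // Thread-local stack seeded with this worker's chunk.
      std::vector<std::size_t> local(mark_stack.begin() + begin,
                                     mark_stack.begin() + end);
      std::size_t count = 0;
      while (!local.empty()) {
        const std::size_t obj = local.back();
        local.pop_back();
        for (std::size_t ref : heap[obj].refs) {
          // exchange() returns the previous value, so only the thread that
          // flips the mark bit pushes the object for scanning.
          if (!marked[ref].exchange(true, std::memory_order_relaxed)) {
            ++count;
            local.push_back(ref);
          }
        }
      }
      newly_marked.fetch_add(count, std::memory_order_relaxed);
    });
  }
  for (std::thread& worker : workers) {
    worker.join();
  }
  return newly_marked.load();
}
```

The atomic exchange on the mark bit ensures that each object is scanned by exactly one worker, even when several workers reach it at the same time. A real collector would also need to balance work between threads, which this sketch leaves out.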

Optimized object scanning by removing dependent loads for IsClass.

Performance:
Prime: ~10% speedup of partial GC.
Nakasi: ~50% speedup of partial GC.

Change-Id: I43256a068efc47cb52d93108458ea18d4e02fccc
diff --git a/src/gc/heap_bitmap.h b/src/gc/heap_bitmap.h
index 1610df8..666fcc7 100644
--- a/src/gc/heap_bitmap.h
+++ b/src/gc/heap_bitmap.h
@@ -38,9 +38,9 @@
         EXCLUSIVE_LOCKS_REQUIRED(Locks::heap_bitmap_lock_) {
       SpaceBitmap* bitmap = GetSpaceBitmap(obj);
       if (LIKELY(bitmap != NULL)) {
-        return bitmap->Clear(obj);
+        bitmap->Clear(obj);
       } else {
-        return large_objects_->Clear(obj);
+        large_objects_->Clear(obj);
       }
     }