More parallel GC, rewritten parallel mark stack processing.

Card scanning may now be done in parallel. This speeds up sticky GC and
reduces pause times for all GC types.
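
A minimal sketch of the idea (not the runtime's actual code): the card table
is split into per-thread chunks and each worker scans its chunk for dirty
cards. The byte-per-card layout, kCardDirty value, and visitor callback below
are illustrative assumptions.

  // Illustrative sketch only: splits a byte-per-card table into chunks and
  // scans them on std::threads; the real runtime uses its own thread pool
  // and card table classes.
  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <thread>
  #include <vector>

  static constexpr uint8_t kCardDirty = 0x70;  // Assumed dirty-card value.

  // Hypothetical callback invoked for every dirty card found.
  using DirtyCardVisitor = void (*)(size_t card_index);

  void ScanCardsParallel(const std::vector<uint8_t>& cards,
                         DirtyCardVisitor visit,
                         size_t num_threads) {
    num_threads = std::max<size_t>(1, num_threads);
    const size_t chunk = (cards.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (size_t t = 0; t < num_threads; ++t) {
      const size_t begin = t * chunk;
      const size_t end = std::min(cards.size(), begin + chunk);
      workers.emplace_back([&cards, visit, begin, end] {
        for (size_t i = begin; i < end; ++i) {
          if (cards[i] == kCardDirty) {
            visit(i);  // Scan the objects covered by this card.
          }
        }
      });
    }
    for (std::thread& w : workers) {
      w.join();
    }
  }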

Speedup on my mako (ritz perf):
Average pause time for sticky GC (~250 samples):
Without parallel card scanning: 2.524904215ms
With parallel card scanning (num_gc_threads_): 1.552123552ms
Throughput (~250 samples):
Sticky GC throughput with parallel card scanning: 69MB/s
Sticky GC throughput without parallel card scanning: 51MB/s

Rewrote the mark stack processing to be LIFO and use a prefetch queue
like the non-parallel version.
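
Roughly, the pattern looks like the following sketch (a simplification; the
Object type, ScanObject, and kPrefetchDepth names are placeholders, not the
runtime's own):

  #include <cstddef>
  #include <deque>
  #include <vector>

  struct Object;                 // Opaque heap object.
  void ScanObject(Object* obj);  // Visits the object's reference fields.

  static constexpr size_t kPrefetchDepth = 8;  // Assumed window size.

  void ProcessMarkStack(std::vector<Object*>& mark_stack) {
    std::deque<Object*> prefetch_queue;
    for (;;) {
      // Refill the prefetch window from the top of the stack (LIFO order).
      while (prefetch_queue.size() < kPrefetchDepth && !mark_stack.empty()) {
        Object* obj = mark_stack.back();
        mark_stack.pop_back();
        __builtin_prefetch(obj);  // Warm the cache before the object is scanned.
        prefetch_queue.push_back(obj);
      }
      if (prefetch_queue.empty()) {
        break;
      }
      Object* obj = prefetch_queue.front();
      prefetch_queue.pop_front();
      ScanObject(obj);  // May push newly marked objects back onto mark_stack.
    }
  }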

Cleaned up some of the logcat printing for the activity manager
process state listening.

Added unlikely hints to object scanning since arrays and classes are
scanned much less often than normal objects.
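
For illustration, the shape of the hint (the UNLIKELY macro is spelled out
here via __builtin_expect; the Object helpers are stand-ins):

  #define UNLIKELY(x) __builtin_expect(!!(x), 0)

  struct Object {
    bool IsClass() const;
    bool IsArrayInstance() const;
  };

  void VisitClass(Object* obj);
  void VisitArray(Object* obj);
  void VisitInstanceFields(Object* obj);

  void ScanObjectVisit(Object* obj) {
    if (UNLIKELY(obj->IsClass())) {
      VisitClass(obj);           // Rare: classes are a small fraction of objects.
    } else if (UNLIKELY(obj->IsArrayInstance())) {
      VisitArray(obj);           // Rare relative to plain instances.
    } else {
      VisitInstanceFields(obj);  // The common, hot path.
    }
  }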

Fixed a bug where the number of GC threads was clamped to 1 due to a
bool instead of a size_t.
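
The bug pattern in miniature (the field name and the "- 1" sizing are
illustrative, not the exact runtime code):

  #include <unistd.h>
  #include <cstddef>

  // Before (buggy): holding the count in a bool collapses every value above
  // one down to one, so the GC thread pool was never sized larger.
  //   bool num_gc_threads_ = sysconf(_SC_NPROCESSORS_CONF) - 1;
  // After: a size_t preserves the real processor count.
  size_t NumGcThreads() {
    const long cpus = sysconf(_SC_NPROCESSORS_CONF);
    return cpus > 1 ? static_cast<size_t>(cpus - 1) : 1u;
  }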

Fixed a race condition when we added references to the reference
queues. Sharded the reference queue lock into one lock for each reference
type (weak, soft, phantom, finalizer).
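
A minimal sketch of the sharded layout, assuming a simple vector-backed
queue (the class and field names are placeholders):

  #include <mutex>
  #include <vector>

  struct Reference;  // Opaque java.lang.ref.Reference mirror.

  class ReferenceQueue {
   public:
    void Enqueue(Reference* ref) {
      std::lock_guard<std::mutex> lock(lock_);  // Contention is now per type.
      refs_.push_back(ref);
    }

   private:
    std::mutex lock_;
    std::vector<Reference*> refs_;
  };

  // One independent queue (and lock) per reference strength, so GC threads
  // enqueuing different types no longer serialize on a single lock.
  ReferenceQueue soft_reference_queue_;
  ReferenceQueue weak_reference_queue_;
  ReferenceQueue finalizer_reference_queue_;
  ReferenceQueue phantom_reference_queue_;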

Changed timing splits to be different for processing gray objects with
and without mutators paused since sticky GC does both.
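
Roughly (the Timings interface and split names below are stand-ins for the
real timing logger):

  struct Timings {
    void StartSplit(const char* name);
    void EndSplit();
  };

  void ScanGrayObjects(Timings& timings, bool paused) {
    // Record the phase under a different split name depending on whether
    // mutators are suspended, so the two cases show up separately in logs.
    timings.StartSplit(paused ? "(Paused)ScanGrayObjects" : "ScanGrayObjects");
    // ... scan dirty cards / gray objects ...
    timings.EndSplit();
  }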

Masked out the class bit when visiting fields as an optimization; this is
valid since classes are held live by the class linker.
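
A sketch of that mask, assuming a reference-offsets bitmap where one bit
(kClassBit here, purely illustrative) covers the class field:

  #include <cstdint>

  struct Object;
  void VisitReferenceAtIndex(Object* obj, uint32_t field_index);

  static constexpr uint32_t kClassBit = 1u;  // Assumed position of the class field.

  void VisitInstanceFieldsReferences(Object* obj, uint32_t ref_offsets) {
    // Skip the class pointer: the class linker roots all classes, so it never
    // needs to be traced from instances.
    ref_offsets &= ~kClassBit;
    while (ref_offsets != 0) {
      const uint32_t field_index = static_cast<uint32_t>(__builtin_ctz(ref_offsets));
      VisitReferenceAtIndex(obj, field_index);
      ref_offsets &= ref_offsets - 1u;  // Clear the lowest set bit.
    }
  }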

Partially completed: Parallel recursive mark + finger.

Bug: 10245302
Bug: 9969166
Bug: 9986532
Bug: 9961698

Change-Id: I142d09718c4609b7c2387cb28f517a6983c73288
diff --git a/runtime/thread_pool.cc b/runtime/thread_pool.cc
index 067ef2d..f7fdcfb 100644
--- a/runtime/thread_pool.cc
+++ b/runtime/thread_pool.cc
@@ -23,6 +23,8 @@
 
 namespace art {
 
+static const bool kMeasureWaitTime = false;
+
 ThreadPoolWorker::ThreadPoolWorker(ThreadPool* thread_pool, const std::string& name,
                                    size_t stack_size)
     : thread_pool_(thread_pool),
@@ -64,7 +66,7 @@
   MutexLock mu(self, task_queue_lock_);
   tasks_.push_back(task);
   // If we have any waiters, signal one.
-  if (waiting_count_ != 0) {
+  if (started_ && waiting_count_ != 0) {
     task_queue_condition_.Signal(self);
   }
 }
@@ -129,11 +131,13 @@
       // We may be done, lets broadcast to the completion condition.
       completion_condition_.Broadcast(self);
     }
-    const uint64_t wait_start = NanoTime();
+    const uint64_t wait_start = kMeasureWaitTime ? NanoTime() : 0;
     task_queue_condition_.Wait(self);
-    const uint64_t wait_end = NanoTime();
-    total_wait_time_ += wait_end - std::max(wait_start, start_time_);
-    waiting_count_--;
+    if (kMeasureWaitTime) {
+      const uint64_t wait_end = NanoTime();
+      total_wait_time_ += wait_end - std::max(wait_start, start_time_);
+    }
+    --waiting_count_;
   }
 
   // We are shutting down, return NULL to tell the worker thread to stop looping.