Thread local bump pointer allocator.

Added a thread local allocator to the heap. Each thread has three
pointers which specify its thread local buffer: start, cur, and
end. When the remaining space in the thread local buffer isn't large
enough for an allocation, the allocator obtains a new thread
local buffer from the bump pointer allocator.
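
As an illustration of the fast path and refill described above, here is a
minimal single-threaded sketch. BumpSpace, ThreadLocalBuffer, AllocBlock and
the buffer_size parameter are hypothetical names chosen for this example and
are not the ones used in the patch; a real shared space would also need an
atomic bump, which is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the shared bump pointer space (illustrative only).
class BumpSpace {
 public:
  explicit BumpSpace(size_t capacity) : storage_(capacity), pos_(0) {}

  // Hands out the next `bytes` bytes, or nullptr if the space is exhausted.
  // Assumption: single-threaded; a shared space would bump atomically.
  uint8_t* AllocBlock(size_t bytes) {
    if (bytes > storage_.size() - pos_) {
      return nullptr;
    }
    uint8_t* result = storage_.data() + pos_;
    pos_ += bytes;
    return result;
  }

 private:
  std::vector<uint8_t> storage_;
  size_t pos_;
};

// The three pointers that describe one thread's buffer.
struct ThreadLocalBuffer {
  uint8_t* start = nullptr;
  uint8_t* cur = nullptr;
  uint8_t* end = nullptr;

  size_t Remaining() const { return static_cast<size_t>(end - cur); }

  // Bumps `cur` when the request fits; otherwise fetches a fresh buffer
  // from the shared space first. Returns nullptr if the space is full.
  void* Alloc(size_t bytes, BumpSpace* space, size_t buffer_size) {
    if (bytes > Remaining()) {
      size_t request = bytes > buffer_size ? bytes : buffer_size;
      uint8_t* block = space->AllocBlock(request);
      if (block == nullptr) {
        return nullptr;
      }
      start = block;
      cur = block;
      end = block + request;
    }
    void* result = cur;
    cur += bytes;
    return result;
  }
};
```

In this sketch the tail of the old buffer is simply abandoned on refill,
which corresponds to the wasted space at the end of a block.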

The bump pointer space had to be modified to accommodate thread
local buffers. These buffers are called "blocks", where a block
is a buffer which contains a set of adjacent objects. Blocks
aren't necessarily full and may have wasted memory towards the
end. Blocks have an 8 byte header which specifies their size and is
required for traversing bump pointer spaces.
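
To show why a size header makes traversal possible, here is a minimal sketch
that lays out blocks back to back and hops from header to header. BlockHeader,
AppendBlock and WalkBlocks are hypothetical names for this example, and the
assumption that the header stores the payload size (excluding the header) is
an illustration choice, not taken from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical 8-byte block header (illustrative layout).
struct BlockHeader {
  uint64_t size;  // Payload bytes that follow the header.
};
static_assert(sizeof(BlockHeader) == 8, "header is expected to be 8 bytes");

// Appends a header plus a zeroed payload to the space; returns the block offset.
size_t AppendBlock(std::vector<uint8_t>* space, size_t payload_bytes) {
  size_t offset = space->size();
  BlockHeader header;
  header.size = static_cast<uint64_t>(payload_bytes);
  space->resize(offset + sizeof(header) + payload_bytes);
  std::memcpy(space->data() + offset, &header, sizeof(header));
  return offset;
}

// Visits every block by reading its header and skipping over its payload.
// Without the size, the walker could not find where the next block begins,
// because a block may be only partially filled with objects.
std::vector<size_t> WalkBlocks(const std::vector<uint8_t>& space) {
  std::vector<size_t> payload_sizes;
  size_t offset = 0;
  while (offset + sizeof(BlockHeader) <= space.size()) {
    BlockHeader header;
    std::memcpy(&header, space.data() + offset, sizeof(header));
    payload_sizes.push_back(static_cast<size_t>(header.size));
    offset += sizeof(header) + static_cast<size_t>(header.size);
  }
  return payload_sizes;
}
```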

Memory usage falls between that of the full bump pointer allocator
and ROSAlloc, since madvised memory limits wasted RAM to an average
of 1/2 page per block.

Added a runtime option, -XX:UseTLAB, which specifies whether or
not to use the thread local allocator. It is a no-op if the garbage
collector is not the semispace collector.

TODO: Smarter block accounting so that we don't have to read objects
until we either hit the end of the block or GetClass() == null, which
signifies that the block isn't 100% full. This would provide a
slight speedup to BumpPointerSpace::Walk.

Timings: -XX:HeapMinFree=4m -XX:HeapMaxFree=8m -Xmx48m
ritzperf memalloc:
Dalvik -Xgc:concurrent: 11678
Dalvik -Xgc:noconcurrent: 6697
-Xgc:MS: 5978
-Xgc:SS: 4271
-Xgc:CMS: 4150
-Xgc:SS -XX:UseTLAB: 3255

Bug: 9986565
Bug: 12042213

Change-Id: Ib7e1d4b199a8199f3b1de94b0a7b6e1730689cad
diff --git a/runtime/thread.h b/runtime/thread.h
index 44b2186..b01ec94 100644
--- a/runtime/thread.h
+++ b/runtime/thread.h
@@ -577,10 +577,8 @@
   ~Thread() LOCKS_EXCLUDED(Locks::mutator_lock_,
                            Locks::thread_suspend_count_lock_);
   void Destroy();
-  friend class ThreadList;  // For ~Thread and Destroy.
 
   void CreatePeer(const char* name, bool as_daemon, jobject thread_group);
-  friend class Runtime;  // For CreatePeer.
 
   // Avoid use, callers should use SetState. Used only by SignalCatcher::HandleSigQuit, ~Thread and
   // Dbg::Disconnected.
@@ -589,8 +587,6 @@
     state_and_flags_.as_struct.state = new_state;
     return old_state;
   }
-  friend class SignalCatcher;  // For SetStateUnsafe.
-  friend class Dbg;  // F or SetStateUnsafe.
 
   void VerifyStackImpl() SHARED_LOCKS_REQUIRED(Locks::mutator_lock_);
 
@@ -731,9 +727,6 @@
   // If we're blocked in MonitorEnter, this is the object we're trying to lock.
   mirror::Object* monitor_enter_object_;
 
-  friend class Monitor;
-  friend class MonitorInfo;
-
   // Top of linked list of stack indirect reference tables or NULL for none
   StackIndirectReferenceTable* top_sirt_;
 
@@ -799,13 +792,20 @@
   PortableEntryPoints portable_entrypoints_;
   QuickEntryPoints quick_entrypoints_;
 
- private:
   // How many times has our pthread key's destructor been called?
   uint32_t thread_exit_check_count_;
 
-  friend class ScopedThreadStateChange;
+  // Thread-local allocation pointer.
+  byte* thread_local_start_;
+  byte* thread_local_pos_;
+  byte* thread_local_end_;
+  size_t thread_local_objects_;
+  // Returns the remaining space in the TLAB.
+  size_t TLABSize() const;
+  // Doesn't check that there is room.
+  mirror::Object* AllocTLAB(size_t bytes);
+  void SetTLAB(byte* start, byte* end);
 
- public:
   // Thread-local rosalloc runs. There are 34 size brackets in rosalloc
   // runs (RosAlloc::kNumOfSizeBrackets). We can't refer to the
   // RosAlloc class due to a header file circular dependency issue.
@@ -814,6 +814,15 @@
   static const size_t kRosAllocNumOfSizeBrackets = 34;
   void* rosalloc_runs_[kRosAllocNumOfSizeBrackets];
 
+ private:
+  friend class Dbg;  // For SetStateUnsafe.
+  friend class Monitor;
+  friend class MonitorInfo;
+  friend class Runtime;  // For CreatePeer.
+  friend class ScopedThreadStateChange;
+  friend class SignalCatcher;  // For SetStateUnsafe.
+  friend class ThreadList;  // For ~Thread and Destroy.
+
   DISALLOW_COPY_AND_ASSIGN(Thread);
 };