Use partial TLAB regions

Instead of handing out whole 256K TLAB regions, split each 256K TLAB
into 16K regions. This fixes pathological cases with multithreaded
allocation that caused many GCs: each thread reserved 256K at a time,
which would often bump the counter past the GC start threshold. Now
threads only bump the counter in 16K steps.

System-wide results (average of 5 samples on N6P):
Total GC time 60s after starting shell: 45s -> 24s
Average .Heap PSS 60s after starting shell: 57900k -> 58682k

BinaryTrees gets around 5% slower, but the numbers are noisy.

Boot time: 13.302 -> 12.899 (average of 100 runs)

Bug: 35872915
Bug: 36216292

Test: test-art-host

(cherry picked from commit bf48003fa32d2845f2213c0ba31af6677715662d)

Change-Id: I5ab22420124eeadc0a53519c70112274101dfb39
12 files changed