Greatly reduce BenchmarkState overhead

This CL reduces BenchmarkState's overhead to a minimum. It also adds
a warmup loop that runs before measurements start.
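A minimal sketch of a low-overhead measurement loop with a warmup phase,
in Java (the language of the Android perf tests). The class name
WarmupBenchmarkState, its keepRunning() loop shape, and the iteration
counts are hypothetical and chosen for illustration; this is not the
actual BenchmarkState code in this CL.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only: not the actual BenchmarkState from this CL.
// Iteration counts below are assumed values chosen for illustration.
final class WarmupBenchmarkState {
    static final int WARMUP_ITERATIONS = 1_000;
    static final int ITERATIONS_PER_SAMPLE = 10_000;
    static final int SAMPLES = 5;

    private int remaining = WARMUP_ITERATIONS; // iterations left in the current phase
    private boolean warmingUp = true;
    private boolean done = false;
    private long sampleStartNs;
    private final List<Long> sampleMeansNs = new ArrayList<>();

    /** Returns true while the caller should run one more iteration of the code under test. */
    boolean keepRunning() {
        // Hot path: a single compare and decrement, no clock reads.
        if (remaining > 0) {
            remaining--;
            return true;
        }
        return endPhase();
    }

    // Slow path, taken once per phase: ends warmup or records a finished sample.
    private boolean endPhase() {
        if (done) {
            return false;
        }
        if (warmingUp) {
            warmingUp = false; // warmup iterations are run but never timed
        } else {
            long elapsedNs = System.nanoTime() - sampleStartNs;
            sampleMeansNs.add(elapsedNs / ITERATIONS_PER_SAMPLE);
            if (sampleMeansNs.size() == SAMPLES) {
                done = true;
                return false;
            }
        }
        // The iteration granted by this call counts as the first one of the new sample.
        remaining = ITERATIONS_PER_SAMPLE - 1;
        sampleStartNs = System.nanoTime();
        return true;
    }

    /** Mean time per iteration (ns) for each completed sample. */
    List<Long> sampleMeansNs() {
        return Collections.unmodifiableList(sampleMeansNs);
    }
}
```

A caller drives it with `while (state.keepRunning()) { ... }`. The clock
is read only at phase boundaries, so the per-iteration cost added by the
harness is one compare and one decrement, and warmup iterations are run
but never timed.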

With this change, on bullhead with clocks /not/ locked, the
RenderNodeJniOverhead test shows a stable result of 54ns
(0ns std dev), which is approximately the expected amount.

Test: Ran a few perf benchmarks

Change-Id: If01e455884711ebd9cfb89f076efa19dc0b5436d