Use __rdtsc on Windows.

This seems to give ~100x higher resolution than QueryPerformanceCounter.  AFAIK, all our Windows perf bots have constant_tsc, so we can use rdtsc directly: it'll always tick at the max CPU frequency.

Now the question remains: what max CPU frequency do we divide through by?  It looks like QueryPerformanceFrequency actually gives the CPU frequency in kHz, suspiciously exactly what we need to divide by to get elapsed milliseconds.  That was a freebie.

I did some before/after comparison on slow benchmarks.  Timings look the same.  Going to land this without review tonight to see what happens on the bots; happy to review carefully tomorrow.

R=mtklein@google.com
TBR=bungeman

BUG=skia:

Review URL: https://codereview.chromium.org/394363003
diff --git a/tools/Stats.h b/tools/Stats.h
index 4fddc9b..8487a94 100644
--- a/tools/Stats.h
+++ b/tools/Stats.h
@@ -1,8 +1,6 @@
 #ifndef Stats_DEFINED
 #define Stats_DEFINED
 
-#include <math.h>
-
 #include "SkString.h"
 #include "SkTSort.h"
 
@@ -50,7 +48,7 @@
             s -= min;
             s /= (max - min);
             s *= (SK_ARRAY_COUNT(kBars) - 1);
-            const size_t bar = (size_t)round(s);
+            const size_t bar = (size_t)(s + 0.5);
             SK_ALWAYSBREAK(bar < SK_ARRAY_COUNT(kBars));
             plot.append(kBars[bar]);
         }