Add sk_parallel_for()

This should be a drop-in replacement for most for-loops to make them run in parallel:
   for (int i = 0; i < N; i++) { code... }
   ~~~>
   sk_parallel_for(N, [&](int i) { code... });

This is just syntactic sugar over SkTaskGroup to make this use case really easy to write.
It adds no overhead beyond what an interface like batch() already forced on us,
and no extra heap allocations.
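
For readers without a Skia checkout handy, here is a minimal standalone sketch of the
same idea built on std::thread. It is not how sk_parallel_for() is implemented (that goes
through SkTaskGroup); parallel_for_sketch and parallel_squares are hypothetical names made
up for illustration, and the fallback of 4 threads is an arbitrary assumption.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for sk_parallel_for(); not Skia's implementation.
// Calls fn(i) exactly once for every i in [0, n), spread across a few threads,
// and returns only after every call has finished.
template <typename Fn>
void parallel_for_sketch(int n, Fn&& fn) {
    if (n <= 0) { return; }
    unsigned hw = std::thread::hardware_concurrency();  // May be 0 if unknown.
    int threads = std::min(n, hw > 0 ? static_cast<int>(hw) : 4);

    std::atomic<int> next{0};  // Next unclaimed index, shared by all workers.
    auto worker = [&] {
        for (int i = next.fetch_add(1); i < n; i = next.fetch_add(1)) {
            fn(i);
        }
    };

    std::vector<std::thread> pool;
    for (int t = 1; t < threads; t++) {
        pool.emplace_back(worker);
    }
    worker();  // The calling thread takes its share of the work too.
    for (std::thread& t : pool) {
        t.join();
    }
}

// Usage example: the loop body is unchanged, only the loop header goes away.
std::vector<int> parallel_squares(int n) {
    std::vector<int> out(n);
    parallel_for_sketch(n, [&](int i) { out[i] = i * i; });
    return out;
}
```

Capturing by reference with [&] is safe in this sketch because the call does not return
until every iteration has run, so the captured locals outlive all the workers.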

I've replaced 3 uses of SkTaskGroup with sk_parallel_for:
  1) My unit tests for SkOnce.
  2) Cary's path fuzzer.
  3) SkMultiPictureDraw.
Performance should be the same.  Please compare left and right for readability. :)

BUG=skia:

No public API changes.
TBR=reed@google.com

Review URL: https://codereview.chromium.org/1184373003
diff --git a/tests/OnceTest.cpp b/tests/OnceTest.cpp
index 034c5d9..35c2015 100644
--- a/tests/OnceTest.cpp
+++ b/tests/OnceTest.cpp
@@ -28,42 +28,14 @@
     REPORTER_ASSERT(r, 5 == x);
 }
 
-static void add_six(int* x) {
-    *x += 6;
-}
-
-namespace {
-
-class Racer : public SkRunnable {
-public:
-    SkOnceFlag* once;
-    int* ptr;
-
-    void run() override {
-        SkOnce(once, add_six, ptr);
-    }
-};
-
-}  // namespace
-
 SK_DECLARE_STATIC_ONCE(mt_once);
 DEF_TEST(SkOnce_Multithreaded, r) {
-    const int kTasks = 16;
-
-    // Make a bunch of tasks that will race to be the first to add six to x.
-    Racer racers[kTasks];
     int x = 0;
-    for (int i = 0; i < kTasks; i++) {
-        racers[i].once = &mt_once;
-        racers[i].ptr = &x;
-    }
-
-    // Let them race.
-    SkTaskGroup tg;
-    for (int i = 0; i < kTasks; i++) {
-        tg.add(&racers[i]);
-    }
-    tg.wait();
+    // Run a bunch of tasks that race to be the first to add six to x.
+    sk_parallel_for(1021, [&](int) {
+        void(*add_six)(int*) = [](int* p) { *p += 6; };
+        SkOnce(&mt_once, add_six, &x);
+    });
 
     // Only one should have done the +=.
     REPORTER_ASSERT(r, 6 == x);