Threaded generation of software paths

Re-land of: https://skia-review.googlesource.com/36560

All information needed by the thread is captured by the prepare
callback object, the lambda captures a pointer to that, and does the
mask render. Once it's done, it signals the semaphore (also owned by the
callback). The callback defers the semaphore wait even longer (into the
ASAP upload), so the odds of waiting for the thread are REALLY low.

Also did a bunch of cleanup along the way, and put in some trace markers
so we can monitor how well this is working.

Traces of a GM that includes GPU and SW path rendering (path-reverse):

Original:
    https://screenshot.googleplex.com/f5BG3901tQg.png
Threaded, with wait in the callback (notice pre flush callback blocking):
    https://screenshot.googleplex.com/htOSZFE2s04.png
Current version, with wait deferred to ASAP upload function:
    https://screenshot.googleplex.com/GHjD0U3C34q.png
Bug: skia:
Change-Id: Idb92f385590749f41328a9aec65b2a93f4775079
Reviewed-on: https://skia-review.googlesource.com/40775
Reviewed-by: Brian Salomon <bsalomon@google.com>
Commit-Queue: Brian Osman <brianosman@google.com>
19 files changed