assemble directly into mmap'd memory

Instead of allocating into a std::vector, we do one quick first pass to
measure how much memory we need to allocate, mmap enough pages for that,
then another real writing pass.

This cuts a microsecond or so off the profile.  There's another
microsecond left to cut if we could eliminate that first measuring pass,
but I'm no longer sure it's easy to come up with a good upper limit on
the program size now that I'm thinking about the data part of the
program as well.

vpshufb is our current max instruction at 9 bytes of code, but that also
implies another 32 bytes of control data.  I'm not sure I feel very
clever allocating 41 * |instructions| bytes to be conservatively safe...
it seems like ridiculous overkill.

Ultimately I found it easier to just measure twice, cut once.

Change-Id: I16ccdafbc789711837b41b3d5a557808798eb1b4
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/223305
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
3 files changed