some armv7 hacking

We can splice these stages if we drop them down to 2 at a time.
Turns out this is significantly (2-3x) faster than the status quo.

    SkRasterPipeline_…
    …f16_compile 1x  …srgb_compile 2.06x  …f16_run 3.08x  …srgb_run 4.61x

Added a couple ways to detect (likely) the required VFPv4 support:
   - use hwcap when available (NDK ≥21, Android framework)
   - use cpu-features when not (NDK <21)

The code in SkSplicer_generated.h is ARM, not Thumb2.  SkSplicer seems
to be blx'ing into it, so that's great, and we bx lr out.  There's no
point in attempting to use Thumb2 in vector heavy code... it'll all be
4 byte anyway.

Follow ups:
   - vpush {d8-d9} before the loop, vpop {d8-d9} afterwards,
     skip these instructions when splicing;
   - (probably) drop jumping stages down to 2-at-a-time also.

Change-Id: If151394ec10e8cbd6a05e2d81808488d743bfe15
Reviewed-on: https://skia-review.googlesource.com/6940
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
6 files changed