attempt 3: add experimental bilerp_clamp_8888 stage

It looks like we can specialize hot image shaders into their
own single stages for a good speedup on both x86 and ARM.

I've started here with bilerp_clamp_8888, and will
follow up with bgra and 565 variants, lowp versions of those,
and probably the same for nearest-neighbor sampling.

All pixels are identical in GMs.

This time, rewrite the loop over sample points to be a little
friendlier to 32-bit x86 code generation.  The previous version
produced an object-file indirection that build_stages.py can't handle.
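
For reference, a scalar sketch of what a fused bilerp + clamp + 8888 load
computes, written as a loop over the 2x2 sample points.  This is an
illustration only, not the stage code in this change: the real stage is
vectorized, and every name below (Color, unpack_8888,
bilerp_clamp_8888_sketch) is hypothetical.  It assumes RGBA byte order with
R in the low byte and sample points at (x +/- 0.5, y +/- 0.5).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical types and names, for illustration only.
struct Color { float r, g, b, a; };

// Unpack one 8888 pixel (assumed RGBA, R in the low byte) to floats in [0, 1].
static Color unpack_8888(uint32_t px) {
    return { ( px        & 0xff) / 255.0f,
             ((px >>  8) & 0xff) / 255.0f,
             ((px >> 16) & 0xff) / 255.0f,
             ( px >> 24        ) / 255.0f };
}

// Clamp a sample coordinate to a valid pixel index in [0, limit - 1].
static int clamp_index(float v, int limit) {
    return (int)std::min(std::max(std::floor(v), 0.0f), (float)(limit - 1));
}

// Bilinearly sample a width x height image (stride counted in pixels) at (x, y),
// clamping each of the four sample points to the image edges.
static Color bilerp_clamp_8888_sketch(const uint32_t* pixels,
                                      int width, int height, int stride,
                                      float x, float y) {
    assert(width > 0 && height > 0);

    // Fractional offsets relative to pixel centers.
    const float fx = (x + 0.5f) - std::floor(x + 0.5f);
    const float fy = (y + 0.5f) - std::floor(y + 0.5f);

    Color c = {0, 0, 0, 0};
    // One flat loop over the four sample points.
    for (int k = 0; k < 4; k++) {
        const bool right = (k & 1) != 0;
        const bool below = (k & 2) != 0;

        const float sx = x + (right ? 0.5f : -0.5f);
        const float sy = y + (below ? 0.5f : -0.5f);
        const float w  = (right ? fx : 1.0f - fx) * (below ? fy : 1.0f - fy);

        const int ix = clamp_index(sx, width);
        const int iy = clamp_index(sy, height);

        const Color p = unpack_8888(pixels[iy * stride + ix]);
        c.r += w * p.r;
        c.g += w * p.g;
        c.b += w * p.b;
        c.a += w * p.a;
    }
    return c;
}
```

Doing the coordinate clamp, the 8888 load, and the weighted sum in one
function is the point of the specialization: nothing has to be handed from
one stage to the next between those steps.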

CQ_INCLUDE_TRYBOTS=skia.primary:Test-Android-Clang-NexusPlayer-CPU-Moorefield-x86-Release-All-Android,Test-Android-Clang-NexusPlayer-GPU-PowerVR-x86-Release-All-Android

Change-Id: I150b6af4a5b89e009dc04ca69e1857892e173deb
Reviewed-on: https://skia-review.googlesource.com/89180
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
7 files changed