restore _DXDY image shader on ARM

This is mostly the patch we've been looking at, rebased,
with some of my comments from the review folded in.

The perf speedup is qualitatively the same as I saw on the other patch.
On that same Snapdragon 835, with draw_bitmap_aa_rotate runs about 30%
faster (543.39 vs 712.71us) and draw_bitmap_noaa_rotate about 15% faster
(481.93 vs.  572.13us).

The main thing I have omitted is the NEON specialization of matrix
procs.  It looks like both nofilter_affine() and filter_affine() are
autovectorized well, and we seem to perform fine enough without manual
specialization here.  I'm even tempted to remove [no]filter_scale_neon()
as a follow up.

Image diffs look mostly fine.  This unexpectedly fixes rotated lighting
shaders in GMs.  Clearly that lighting shader must get a lot of use...

Change-Id: I67ee0b3ab92d6e56584ece05feb6e66d6fb7c660
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/249860
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Mike Reed <reed@google.com>
7 files changed