Port SkMatrix opts to SkOpts.

No code changes; the existing code is just moved around.

This enables vectorized code on ARMv7.
There should be no effect on ARMv8 or x86, where this code was already vectorized.

nanobench --match mappoints results on Nexus 5 (ARMv7), before -> after:

_affine: 132 -> 95
_scale: 118 -> 47
_trans: 60 -> 37
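For reference, a scalar sketch of what the three benched cases compute per point, assuming the suffixes correspond to translate-only, scale+translate, and full affine matrices. The Pt type and the map_* names are hypothetical illustrations, not Skia's SkPoint/SkMatrix API.

```cpp
// Hypothetical point type for illustration; not Skia's SkPoint.
struct Pt { float x, y; };

// Translate only: x' = x + tx, y' = y + ty.
void map_trans(Pt* dst, const Pt* src, int n, float tx, float ty) {
    for (int i = 0; i < n; ++i) {
        dst[i] = { src[i].x + tx, src[i].y + ty };
    }
}

// Scale + translate: x' = sx*x + tx, y' = sy*y + ty.
void map_scale(Pt* dst, const Pt* src, int n,
               float sx, float sy, float tx, float ty) {
    for (int i = 0; i < n; ++i) {
        dst[i] = { src[i].x * sx + tx, src[i].y * sy + ty };
    }
}

// Full affine: x' = sx*x + kx*y + tx, y' = ky*x + sy*y + ty.
// Each output coordinate depends on BOTH input coordinates (the skew terms).
void map_affine(Pt* dst, const Pt* src, int n,
                float sx, float kx, float tx,
                float ky, float sy, float ty) {
    for (int i = 0; i < n; ++i) {
        float x = src[i].x, y = src[i].y;
        dst[i] = { sx * x + kx * y + tx, ky * x + sy * y + ty };
    }
}
```

The affine case is the only one where x' needs y (and y' needs x), which is where a lane shuffle comes in once points are processed several lanes at a time.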

A teaser:
Next we should look at the ABCD->BADC shuffle we've noted that _affine needs. A quick hack showed that doing it optimally gives another ~35% speedup on x86. We still need to figure out how best to do it on ARM, though: that same quick hack was a 2x slowdown there. Good reason to resurrect that SkNx_shuffle() CL!

(I believe the answers are vrev64q_f32(v) on ARM and _mm_shuffle_ps(v, v, _MM_SHUFFLE(2,3,0,1)) on x86, but we should probably find out in another CL.)
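A minimal sketch of that shuffle using the two intrinsics named above, with a scalar fallback so it builds anywhere. swap_pairs and map_affine2 are hypothetical names, not the SkNx API, and this makes no claim about which form is fastest on ARM. map_affine2 shows one plausible reason _affine wants the shuffle: with two points packed as [x0 y0 x1 y1], the skew terms need the lanes as [y0 x0 y1 x1].

```cpp
#include <cstring>

#if defined(__SSE__) || defined(_M_X64)
  #include <xmmintrin.h>
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
  #include <arm_neon.h>
#endif

// Swap adjacent lanes: [a b c d] -> [b a d c].
void swap_pairs(const float in[4], float out[4]) {
#if defined(__SSE__) || defined(_M_X64)
    __m128 v = _mm_loadu_ps(in);
    // _MM_SHUFFLE(2,3,0,1) picks lanes 1,0,3,2 into result lanes 0..3.
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)));
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t v = vld1q_f32(in);
    // vrev64q_f32 reverses the 32-bit lanes within each 64-bit half.
    vst1q_f32(out, vrev64q_f32(v));
#else
    // Scalar fallback; the temporary keeps it correct if in == out.
    float t[4] = { in[1], in[0], in[3], in[2] };
    std::memcpy(out, t, sizeof t);
#endif
}

// Map two points packed as [x0 y0 x1 y1] through an affine matrix:
// x' = sx*x + kx*y + tx, y' = ky*x + sy*y + ty.
void map_affine2(const float pts[4], float out[4],
                 float sx, float kx, float tx,
                 float ky, float sy, float ty) {
    const float scale[4] = { sx, sy, sx, sy };
    const float skew[4]  = { kx, ky, kx, ky };
    const float trans[4] = { tx, ty, tx, ty };
    float swapped[4];
    swap_pairs(pts, swapped);  // [y0 x0 y1 x1]
    for (int i = 0; i < 4; ++i) {
        out[i] = pts[i] * scale[i] + swapped[i] * skew[i] + trans[i];
    }
}
```

In a real vectorized version the loop would be 4-wide multiplies and adds; it is written as a plain loop here to keep the sketch portable.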

BUG=skia:4117

Review URL: https://codereview.chromium.org/1320673014
5 files changed