better NEON 8-bit stages

Our interlaced approach works pretty well for x86, but on ARM we're a
lot better off deinterlacing in loads and reinterlacing in stores.
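
A rough sketch of the load/store idea, not the code from this change.
It assumes 8 RGBA8888 pixels per iteration. Planar8, load_8888,
store_8888 and check_round_trip are hypothetical names made up for
illustration. On NEON, vld4_u8 and vst4_u8 do the deinterlacing and
reinterlacing as part of the memory access; the scalar branch is only
there so the sketch builds on other targets.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__ARM_NEON)
    #include <arm_neon.h>
#endif

// Hypothetical planar view of 8 pixels: one 8-lane array per channel.
struct Planar8 { uint8_t r[8], g[8], b[8], a[8]; };

// Load 8 interlaced RGBA pixels (32 bytes) and split them into planes.
static inline Planar8 load_8888(const uint8_t* px) {
    Planar8 p;
#if defined(__ARM_NEON)
    uint8x8x4_t v = vld4_u8(px);   // deinterlaces while loading
    vst1_u8(p.r, v.val[0]);
    vst1_u8(p.g, v.val[1]);
    vst1_u8(p.b, v.val[2]);
    vst1_u8(p.a, v.val[3]);
#else
    for (int i = 0; i < 8; i++) {  // portable fallback, same result
        p.r[i] = px[4*i + 0];
        p.g[i] = px[4*i + 1];
        p.b[i] = px[4*i + 2];
        p.a[i] = px[4*i + 3];
    }
#endif
    return p;
}

// Reinterlace the planes and write 8 RGBA pixels (32 bytes).
static inline void store_8888(uint8_t* px, const Planar8& p) {
#if defined(__ARM_NEON)
    uint8x8x4_t v;
    v.val[0] = vld1_u8(p.r);
    v.val[1] = vld1_u8(p.g);
    v.val[2] = vld1_u8(p.b);
    v.val[3] = vld1_u8(p.a);
    vst4_u8(px, v);                // reinterlaces while storing
#else
    for (int i = 0; i < 8; i++) {
        px[4*i + 0] = p.r[i];
        px[4*i + 1] = p.g[i];
        px[4*i + 2] = p.b[i];
        px[4*i + 3] = p.a[i];
    }
#endif
}

// Load followed by store should reproduce the input bytes exactly.
static inline void check_round_trip() {
    uint8_t src[32], dst[32] = {0};
    for (int i = 0; i < 32; i++) { src[i] = (uint8_t)(i * 7); }
    Planar8 p = load_8888(src);
    assert(p.g[3] == src[4*3 + 1]);
    store_8888(dst, p);
    for (int i = 0; i < 32; i++) { assert(dst[i] == src[i]); }
}
```

Everything between the load and the store then sees one channel per
vector, so no stage has to know how the pixels were laid out in memory.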

This leaves the stages mostly looking like the float stages, and cuts
out some awkward parts from the code generation.
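
To illustrate what "looking like the float stages" can mean, here is a
hedged sketch of a planar 8-bit stage. U8x8, div255, srcover_channel
and check_srcover are hypothetical names, and the rounding in div255 is
an assumption, not necessarily what the real stages use. With channels
already separated, source alpha is just another 8-lane value lined up
with the color lanes, so the same per-channel formula applies to r, g,
b and a with no per-pixel alpha broadcast.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 8-lane channel: lane i belongs to pixel i.
struct U8x8 { uint8_t v[8]; };

// Rounded divide by 255 for products of two 8-bit values (assumed rounding).
static inline uint8_t div255(unsigned x) {
    return (uint8_t)((x + 127) / 255);
}

// Source-over for one channel: s + d*(255 - sa)/255, clamped to 255.
// Called once per channel with the same sa, just like a float stage would be.
static inline U8x8 srcover_channel(const U8x8& s, const U8x8& d, const U8x8& sa) {
    U8x8 out;
    for (int i = 0; i < 8; i++) {
        unsigned t = s.v[i] + div255(d.v[i] * (255u - sa.v[i]));
        out.v[i] = (uint8_t)(t > 255 ? 255 : t);
    }
    return out;
}

// Opaque source replaces the destination; transparent source leaves it alone.
static inline void check_srcover() {
    U8x8 s, d, sa;
    for (int i = 0; i < 8; i++) { s.v[i] = 40; d.v[i] = 100; sa.v[i] = 255; }
    assert(srcover_channel(s, d, sa).v[0] == 40);
    for (int i = 0; i < 8; i++) { s.v[i] = 0; sa.v[i] = 0; }
    assert(srcover_channel(s, d, sa).v[0] == 100);
}
```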

Image diffs are all invisible.  Performance is noticeably better for
some blend modes, like Overlay.

Change-Id: Ie599e823602bfd14552de78df44a621aea66e1a2
Reviewed-on: https://skia-review.googlesource.com/40100
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Florin Malita <fmalita@chromium.org>
1 file changed