somewhat less silly tail loads and stores

No reason to keep going one at a time when we know there are generally
better ways to handle loading a power-of-two number of low lanes.

This strategy scales up too, with quick answers for 8 (one 8 byte load),
12 (one 8 byte, one 4 byte), etc.

$ ninja -C out monobench; and out/monobench SkRasterPipeline_compile 300

    Before: 46.946ns
    After:  43.341ns

(This happens to be _lowp.  Expect similar small speedups elsewhere.)

Change-Id: I08f87769ea3c9f06ad13d2b1d5326e542b9b63a8
Reviewed-on: https://skia-review.googlesource.com/20903
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
4 files changed