jumper, factor out load4() and from_half()

Splitting load_f16 apart gives slightly worse codegen on ARMv7, SSE2,
SSE4.1, and AVX than the previous fused versions, but the stage code
becomes much simpler.

I'm happy to make those trades until someone complains.

load4() will be useful on its own to implement a couple other stages.
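
For illustration only, here is a rough scalar sketch of the split.  It
is not the code in this change: the U16 and F structs, the lane count
N, and all signatures are hypothetical stand-ins for the real SIMD
vector types, and from_half() here assumes denormal halves flush to
zero and ignores inf/NaN.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical scalar stand-ins for SIMD vectors: N lanes, looped over.
constexpr int N = 4;
struct U16 { uint16_t lane[N]; };
struct F   { float    lane[N]; };

// load4(): deinterleave N pixels of four 16-bit channels
// (rgba rgba ...) into planar r, g, b, a vectors.
inline void load4(const uint16_t* px, U16* r, U16* g, U16* b, U16* a) {
    assert(px != nullptr);
    for (int i = 0; i < N; i++) {
        r->lane[i] = px[4*i + 0];
        g->lane[i] = px[4*i + 1];
        b->lane[i] = px[4*i + 2];
        a->lane[i] = px[4*i + 3];
    }
}

// from_half(): convert IEEE 754 binary16 bit patterns to float.
// Assumption: denormal halves flush to zero; inf/NaN are not handled.
inline F from_half(const U16& h) {
    F out;
    for (int i = 0; i < N; i++) {
        uint32_t sign = static_cast<uint32_t>(h.lane[i] & 0x8000) << 16;
        uint32_t em   = static_cast<uint32_t>(h.lane[i] & 0x7fff);
        // Shift exponent+mantissa into place and rebias 15 -> 127.
        uint32_t bits = em < 0x0400 ? sign
                                    : sign | ((em << 13) + ((127u - 15u) << 23));
        std::memcpy(&out.lane[i], &bits, sizeof(bits));
    }
    return out;
}

// The stage body is then just the two helpers composed.
inline void load_f16(const uint16_t* px, F* r, F* g, F* b, F* a) {
    U16 R, G, B, A;
    load4(px, &R, &G, &B, &A);
    *r = from_half(R);
    *g = from_half(G);
    *b = from_half(B);
    *a = from_half(A);
}
```

A fused version would do the deinterleave and conversion in one pass;
keeping them separate means load4() can be reused by any stage that
reads four interleaved 16-bit channels.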

Everything draws the same.  I intend to follow up with more of the
same sort of refactoring, but this change was tricky enough that I
want to take it in small steps.

Change-Id: Ib4aa86a58d000f2d7916937cd4f22dc2bd135a49
Reviewed-on: https://skia-review.googlesource.com/11186
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>