jumper, split store_f16 into to_half, store4

Much the same as the last CL, but in the other direction: split
store_f16 into to_half() and store4().  Platforms that had fused
strategies here become slightly less optimal, but the code is easier to
follow, maintain, and reuse.
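
As a rough illustration of the split (not the code in this CL), here is
a portable sketch.  The lane count N, the F and U16 array types, and all
signatures are hypothetical stand-ins.  This to_half() truncates the
mantissa, flushes values below the smallest normal half to signed zero,
and assumes finite inputs within half range.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for the logical vector types: N lanes each.
constexpr int N = 4;
using F   = std::array<float, N>;
using U16 = std::array<uint16_t, N>;

// Convert one 1-8-23 float (bias 127) to a 1-5-10 half (bias 15).
// Sketch only: truncates, flushes tiny values to zero, no inf/NaN handling.
static inline uint16_t to_half_scalar(float f) {
    uint32_t sem;
    std::memcpy(&sem, &f, sizeof(sem));
    uint32_t s  = sem & 0x80000000u,   // sign bit
             em = sem ^ s;             // exponent and mantissa
    if (em < 0x38800000u) {            // below the smallest normal half
        return static_cast<uint16_t>(s >> 16);
    }
    return static_cast<uint16_t>((s >> 16) + (em >> 13) - ((127u - 15u) << 10));
}

// Step 1: convert every lane to half precision.
static inline U16 to_half(const F& f) {
    U16 h{};
    for (int i = 0; i < N; i++) {
        h[i] = to_half_scalar(f[i]);
    }
    return h;
}

// Step 2: interleave four planar channels into rgba, rgba, ... order.
// This step knows nothing about halfs, so other 16-bit formats can reuse it.
static inline void store4(uint16_t* ptr,
                          const U16& r, const U16& g, const U16& b, const U16& a) {
    assert(ptr != nullptr);
    for (int i = 0; i < N; i++) {
        ptr[4*i + 0] = r[i];
        ptr[4*i + 1] = g[i];
        ptr[4*i + 2] = b[i];
        ptr[4*i + 3] = a[i];
    }
}

// store_f16 becomes a composition of the two smaller pieces.
static inline void store_f16(uint16_t* ptr,
                             const F& r, const F& g, const F& b, const F& a) {
    store4(ptr, to_half(r), to_half(g), to_half(b), to_half(a));
}
```

A fused platform-specific version could convert and interleave in one
pass; the split trades that away for two pieces that are each reusable.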

Also adds widen_cast() to encapsulate the fairly common pattern of
expanding one of our logical vector types (e.g. 8-byte U16) up to the
width of the physical vector type (e.g. 16-byte __m128i).  Clang
understands this operation deeply, and it is often a no-op.

I could make bit_cast() do this, but it seems clearer to have two names.
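
A minimal sketch of what a widen_cast() along these lines might look
like, assuming a memcpy-based implementation.  The U16 and Wide16 types
below are hypothetical std::array stand-ins for the logical and physical
vector types, and the sketch zeroes the upper bytes only to stay
deterministic; a real version may leave them unspecified.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-ins: an 8-byte logical vector and a 16-byte
// "physical" vector (the size of __m128i).
using U16    = std::array<uint16_t, 4>;
using Wide16 = std::array<uint16_t, 8>;

// Copy src into the low bytes of a strictly wider Dst.
template <typename Dst, typename Src>
static inline Dst widen_cast(const Src& src) {
    static_assert(sizeof(Dst) > sizeof(Src), "widen_cast() must widen");
    static_assert(std::is_trivially_copyable<Dst>::value &&
                  std::is_trivially_copyable<Src>::value, "memcpy needs trivial types");
    Dst dst{};                              // upper bytes zeroed in this sketch
    std::memcpy(&dst, &src, sizeof(Src));   // low bytes come from src
    return dst;
}

// Usage: the low lanes of the widened value match the narrow input.
static inline void widen_cast_example() {
    U16 narrow = {{1, 2, 3, 4}};
    Wide16 wide = widen_cast<Wide16>(narrow);
    assert(std::memcmp(&wide, &narrow, sizeof(narrow)) == 0);
}
```

Keeping this separate from bit_cast() lets bit_cast() insist on equal
sizes, while widen_cast() insists that the destination is larger.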

Change-Id: I7ba5bb4746acfcaa6d486379f67e07baee3820b2
Reviewed-on: https://skia-review.googlesource.com/11204
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
5 files changed