Flush to zero when loading f16 with sse2/sse4.1.

The multiply by 0x77800000 is quite slow when the input is denormalized.
We don't mind flushing those values (in the range of 1e-5) to zero.

Implement portable load_f16() / store_f16() too.

Change-Id: I125cff1c79ca71d9abe22ac7877136d86707cb56
Reviewed-on: https://skia-review.googlesource.com/8467
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
4 files changed