SkPMFloat: we can beat the naive loops when clamping

Clamping 4 at a time is now about 15% faster than 1 at a time with SSSE3.
Clamping 4 at a time is now about 20% faster with SSE2,
and this applies to non-clamping too (the SSE2 non-clamping path still just clamps).
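
For illustration only, here is a minimal sketch of why packing 4 pixels at a
time can be cheap with SSE2: the saturating pack instructions clamp as a side
effect, and four pixels fill one full 16-byte store. The names clamp_channel,
clamp_1x and clamp_4x are hypothetical, not SkPMFloat's API, and the sketch
assumes channels are floats already scaled to [0, 255] and within int32 range.
It falls back to scalar code when SSE2 is not available.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Scalar reference (hypothetical helper): clamp to [0, 255], then round to
// nearest under the current rounding mode.
static inline uint8_t clamp_channel(float v) {
    if (!(v > 0.0f)) return 0;        // negatives and NaN map to 0
    if (v > 255.0f) return 255;
    return static_cast<uint8_t>(std::lrintf(v));
}

// One pixel (4 float channels) -> 4 bytes.
static inline void clamp_1x(const float px[4], uint8_t out[4]) {
#if defined(__SSE2__)
    __m128i i32 = _mm_cvtps_epi32(_mm_loadu_ps(px));  // float -> int32
    __m128i i16 = _mm_packs_epi32(i32, i32);          // saturate to int16
    __m128i u8  = _mm_packus_epi16(i16, i16);         // saturate to [0, 255]
    int32_t packed = _mm_cvtsi128_si32(u8);           // low 4 bytes = pixel
    std::memcpy(out, &packed, sizeof(packed));
#else
    for (int i = 0; i < 4; i++) out[i] = clamp_channel(px[i]);
#endif
}

// Four pixels (16 float channels) -> 16 bytes, with a single 16-byte store.
static inline void clamp_4x(const float px[16], uint8_t out[16]) {
#if defined(__SSE2__)
    __m128i a = _mm_cvtps_epi32(_mm_loadu_ps(px + 0));
    __m128i b = _mm_cvtps_epi32(_mm_loadu_ps(px + 4));
    __m128i c = _mm_cvtps_epi32(_mm_loadu_ps(px + 8));
    __m128i d = _mm_cvtps_epi32(_mm_loadu_ps(px + 12));
    __m128i ab = _mm_packs_epi32(a, b);               // pixels 0,1 as int16
    __m128i cd = _mm_packs_epi32(c, d);               // pixels 2,3 as int16
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out),
                     _mm_packus_epi16(ab, cd));       // all 16 bytes at once
#else
    for (int i = 0; i < 16; i++) out[i] = clamp_channel(px[i]);
#endif
}
```

In this sketch the 4x path needs three pack instructions for four pixels,
where the 1x path needs two per pixel, and the clamp costs nothing extra
because the packs saturate.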

In all cases, 4 at a time is never worse than 1 at a time,
and not clamping is never slower than clamping.

Here are all the bench results, with the numbers for portable code as a fun point
of reference:

SSSE3:
maxrss  loops  min     median  mean    max     stddev  samples     config        bench
10M     2291   4.66ns  4.66ns  4.66ns  4.68ns  0%      ▆█▁▁▁▇▁▇▁▃  nonrendering  SkPMFloat_get_1x
10M     2040   5.29ns  5.3ns   5.3ns   5.32ns  0%      ▃▆▃▃▁▁▆▃▃█  nonrendering  SkPMFloat_clamp_1x
10M     7175   4.62ns  4.62ns  4.62ns  4.63ns  0%      ▁▄▃████▃▄▇  nonrendering  SkPMFloat_get_4x
10M     5801   4.89ns  4.89ns  4.89ns  4.91ns  0%      █▂▄▃▁▃▄█▁▁  nonrendering  SkPMFloat_clamp_4x

SSE2:
maxrss  loops  min     median  mean    max     stddev  samples     config        bench
10M     1601   6.02ns  6.05ns  6.04ns  6.08ns  0%      █▅▄▅▄▂▁▂▂▂  nonrendering  SkPMFloat_get_1x
10M     2918   6.05ns  6.06ns  6.05ns  6.06ns  0%      ▂▇▁▇▇▁▇█▇▂  nonrendering  SkPMFloat_clamp_1x
10M     3569   5.43ns  5.45ns  5.44ns  5.45ns  0%      ▄█▂██▇▁▁▇▇  nonrendering  SkPMFloat_get_4x
10M     4168   5.43ns  5.43ns  5.43ns  5.44ns  0%      █▄▇▁▇▄▁▁▁▁  nonrendering  SkPMFloat_clamp_4x

Portable:
maxrss  loops  min     median  mean    max     stddev  samples     config        bench
10M     500    27.8ns  28.1ns  28ns    28.2ns  0%      ▃█▆▃▇▃▆▁▇▂  nonrendering  SkPMFloat_get_1x
10M     770    40.1ns  40.2ns  40.2ns  40.3ns  0%      ▅▁▃▂▆▄█▂▅▂  nonrendering  SkPMFloat_clamp_1x
10M     1269   28.4ns  28.8ns  29.1ns  32.7ns  4%      ▂▂▂█▂▁▁▂▁▁  nonrendering  SkPMFloat_get_4x
10M     1439   40.2ns  40.4ns  40.4ns  40.5ns  0%      ▆▆▆█▁▆▅█▅▆  nonrendering  SkPMFloat_clamp_4x

SkPMFloat_neon.h is still one big TODO as far as 4-at-a-time APIs go.

BUG=skia:

Review URL: https://codereview.chromium.org/982123002
5 files changed