Sk4x4f: Simplify x86 down to SSE2.

  - This drops the minimum requirement for Sk4x4f on x86 to SSE2 by
    removing calls to _mm_shuffle_epi8().  Instead we use good old
    shifting and masking.

  - Performance is very similar to the SSSE3 version, close enough that
    I'm having trouble telling which is faster.  I think we can circle
    back later on whether we need an SSSE3 version.  When possible it's
    nice to stick to SSE2: it's the most widely available, and performs
    most uniformly across different chips.
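
For illustration only, here is a minimal sketch of the shift-and-mask idea,
assuming the job is to split four packed 8-bit RGBA pixels into four float
vectors.  The names Planes and unpack_rgba_sse2 are hypothetical and are not
taken from the Sk4x4f source; the actual change may differ in detail.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics only, no SSSE3 header needed
#include <cstdint>

// Hypothetical result type: one float vector per channel.
struct Planes { __m128 r, g, b, a; };

// Hypothetical helper: split 4 RGBA pixels (16 bytes) into planes.
// Each 32-bit lane holds one pixel as r | g<<8 | b<<16 | a<<24
// (little-endian), so a logical right shift plus a 0xFF mask isolates
// one channel per lane, which is the job _mm_shuffle_epi8() would do.
inline Planes unpack_rgba_sse2(const uint8_t px[16]) {
    const __m128i v    = _mm_loadu_si128(reinterpret_cast<const __m128i*>(px));
    const __m128i mask = _mm_set1_epi32(0xFF);

    Planes p;
    p.r = _mm_cvtepi32_ps(_mm_and_si128(v, mask));
    p.g = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(v, 8), mask));
    p.b = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(v, 16), mask));
    // The top byte needs no mask: the logical shift fills with zeros.
    p.a = _mm_cvtepi32_ps(_mm_srli_epi32(v, 24));
    return p;
}
```

Each channel costs one shift, one AND, and one convert, all of which are
plain SSE2 instructions.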

This makes Sk4x4f fast on Windows and Linux, and may help mobile x86.

BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1817353005

Review URL: https://codereview.chromium.org/1817353005
1 file changed