custom ssse3 srcover_n_srgb_bw, about 1.8x speedup

This is a little demo of the sorts of speedups we can get from working in a planar format, or even just a mini-planar format of 4 pixels at a time, as I'm doing here.

I chose this blit by running
  $ out/Release/nanobench --config srgb --match skp
and looking for the hottest sRGB-related method.
After this CL, src_1 and src_n become hotter than srcover_n.  They can probably get a similar treatment.

We transpose three times in this function:
   - dst after reading, as part of the zero-extension and conversion to float
   - src after reading, _MM_TRANSPOSE4_PS (which expands to 8 cheap instructions)
   - result before writing, the last _mm_shuffle_epi8
If we changed our buffer format to a mini-planar format like rrrr gggg bbbb aaaa, we could eliminate the src transpose and get another small speedup, to right around 2x.
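For readers unfamiliar with the layout, here is a rough scalar sketch of what the transposes do: take 4 interleaved pixels and regroup them as rrrr gggg bbbb aaaa, then undo that before storing. This is not the code in this CL. Planar4, to_planar and from_planar are hypothetical names, and RGBA byte order is assumed purely for illustration.

```cpp
#include <array>
#include <cstdint>

// Hypothetical mini-planar block: one channel plane per member,
// each plane holding that channel for 4 consecutive pixels.
struct Planar4 {
    std::array<float, 4> r, g, b, a;
};

// Interleaved -> mini-planar. Scalar analogue of the load-side transposes:
// each byte is zero-extended, converted to float, and routed to its plane.
// Assumes RGBA byte order in memory (an assumption of this sketch).
inline Planar4 to_planar(const uint8_t px[16]) {
    Planar4 p{};
    for (int i = 0; i < 4; i++) {
        p.r[i] = static_cast<float>(px[4 * i + 0]);
        p.g[i] = static_cast<float>(px[4 * i + 1]);
        p.b[i] = static_cast<float>(px[4 * i + 2]);
        p.a[i] = static_cast<float>(px[4 * i + 3]);
    }
    return p;
}

// Mini-planar -> interleaved. Scalar analogue of the final shuffle before
// the store. Rounds to nearest; assumes every value is already in [0, 255].
inline void from_planar(const Planar4& p, uint8_t px[16]) {
    for (int i = 0; i < 4; i++) {
        px[4 * i + 0] = static_cast<uint8_t>(p.r[i] + 0.5f);
        px[4 * i + 1] = static_cast<uint8_t>(p.g[i] + 0.5f);
        px[4 * i + 2] = static_cast<uint8_t>(p.b[i] + 0.5f);
        px[4 * i + 3] = static_cast<uint8_t>(p.a[i] + 0.5f);
    }
}
```

If the buffers were already stored as rrrr gggg bbbb aaaa, the to_planar step for src would reduce to plain loads, which is where the extra speedup would come from.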

This code leans pretty heavily on SSSE3, so if we want it to speed up Windows+Linux Chrome, it'll eventually need to go behind a function pointer selected at runtime.
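A minimal sketch of what "behind a function pointer" could look like, assuming a one-time CPU check at startup. Every name here (BlitFn, srcover_n_portable, srcover_n_ssse3, choose_srcover_n) is hypothetical and not Skia's actual API. The portable version is a plain premultiplied 8-bit srcover with no sRGB conversion, and the "SSSE3" version is only a stand-in that forwards to it.

```cpp
#include <cstdint>

// Hypothetical blit signature: blend n src pixels over n dst pixels.
using BlitFn = void (*)(uint32_t* dst, const uint32_t* src, int n);

// Premultiplied srcover on one 8888 pixel, with no sRGB conversion.
// Assumes alpha lives in the top byte (an assumption of this sketch).
static uint32_t srcover_one(uint32_t d, uint32_t s) {
    const uint32_t inv = 255 - (s >> 24);
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const uint32_t sc = (s >> shift) & 0xFF;
        const uint32_t dc = (d >> shift) & 0xFF;
        uint32_t c = sc + (dc * inv + 127) / 255;  // rounded dc * inv / 255
        if (c > 255) c = 255;                      // clamp rather than wrap
        out |= c << shift;
    }
    return out;
}

// Portable fallback, safe on any CPU.
void srcover_n_portable(uint32_t* dst, const uint32_t* src, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = srcover_one(dst[i], src[i]);
    }
}

// Stand-in for the SSSE3 path. A real one would hold the intrinsics and live
// in a file built with SSSE3 enabled; here it just forwards to the fallback.
void srcover_n_ssse3(uint32_t* dst, const uint32_t* src, int n) {
    srcover_n_portable(dst, src, n);
}

// Runtime CPU check. Uses the GCC/Clang builtin on x86 and reports "no"
// everywhere else, so other platforms take the portable path.
static bool cpu_has_ssse3() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("ssse3") != 0;
#else
    return false;
#endif
}

// Pick the implementation once; callers keep the pointer and call through it.
BlitFn choose_srcover_n() {
    return cpu_has_ssse3() ? srcover_n_ssse3 : srcover_n_portable;
}
```

Callers would do something like `static const BlitFn blit = choose_srcover_n();` once and then call `blit(dst, src, n)`, so the CPU check stays out of the hot loop.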

This also appears to fix what looks like overflow in a few GMs, most noticeably in hairmodes.  This is something we'd better look into...

BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1813263002

Review URL: https://codereview.chromium.org/1813263002
1 file changed