same 16->8 bit packing trick for SSE2/SSE4.1
It's funny how now that I'm on a machine that doesn't support AVX2,
it's suddenly important for me that pack() is optimized for SSE!
This is basically the same trick as this morning's AVX2 change, without
any weird AVX2 pack ordering issues. This replaces something like
    movdqa     2300(%rip), %xmm0
    pshufb     %xmm0, %xmm3
    pshufb     %xmm0, %xmm2
    punpcklqdq %xmm3, %xmm2
(This is SSE4.1; the SSE2 version is worse.)
with
    psrlw    $8, %xmm3
    psrlw    $8, %xmm2
    packuswb %xmm3, %xmm2
(SSE2 and SSE4.1 both.)
It's always nice to not need to load a shuffle mask out of memory.
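
For reference, the same trick in SSE intrinsics looks roughly like this
(a sketch only; pack_hi_bytes is a made-up name, not the function in
the actual source):

    #include <immintrin.h>

    // Narrow sixteen 16-bit lanes (spread across two registers) down to
    // sixteen 8-bit lanes, keeping the high byte of each lane.
    // Plain SSE2; no shuffle mask load needed.
    static inline __m128i pack_hi_bytes(__m128i lo, __m128i hi) {
        lo = _mm_srli_epi16(lo, 8);        // psrlw $8: high byte -> low byte
        hi = _mm_srli_epi16(hi, 8);        // psrlw $8
        return _mm_packus_epi16(lo, hi);   // packuswb: narrow 16 words to 16 bytes
    }

After the shifts every lane holds a value <= 0xff, so packuswb's
unsigned saturation never kicks in and it acts as a pure narrowing.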
Change-Id: I56fb30b31fcedc0ee84a4a71c483a597c8dc1622
Reviewed-on: https://skia-review.googlesource.com/30583
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
3 files changed