i965/tiled_memcpy: Provide SSE2 for RGBA8 <-> BGRA8 swizzle.

The existing code uses SSSE3, and because it isn't compiled in a
separate file compiled with that, it is usually not used (that, of
course, could be fixed...), whereas SSE2 is always present with 64-bit
builds.  This should be pretty much as fast as the pshufb version,
albeit those code paths aren't really used on chips without llc in any
case.

v2: fix andnot argument order, add comments
v3: use pshuflw/hw instead of shifts (suggested by Matt Turner), cut comments
v4: [mattst88] Rebase

Reviewed-by: Matt Turner <mattst88@gmail.com>
1 file changed