expand unit tests, fix extract

The mask-only special case for extract is wrong...
it never looked at its input!
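
For reference, a minimal sketch of what extract appears to compute, judging
only from dump lines like `r4 = extract r4 8 r0` with r0 = splat FF00FF:
shift right, then mask. The function name and signature below are
hypothetical, not the real SkVM API, and this does not reproduce the
original bug. It only shows that a mask-only shortcut (shift of zero) still
has to read its input.

```cpp
#include <cstdint>

// Hypothetical stand-in for the extract op: (x >> bits) & mask.
// Assumes 0 <= bits < 32.
static inline uint32_t extract_sketch(uint32_t x, int bits, uint32_t mask) {
    if (bits == 0) {
        // Mask-only case: no shift needed, but the result still depends on x.
        return x & mask;
    }
    return (x >> bits) & mask;
}
```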

This not only makes things correct-er, but oddly it also
makes them faster by breaking inter-loop data dependencies.

Disable tests for _I32... they're actually still broken
because of a much more systemic flaw in how I've evaluated
programs.  The _F32 and _I32_SWAR JIT code and all interpreted
code are just getting lucky.  o_O

While here, update the I32_SWAR code to use the same math as I32,
(x*y+x)/256 for unorm8 mul.  This just helps keep me sane.
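
A rough sketch of that math, for anyone reading along. The helper names are
made up for illustration and are not the SkVM builder calls. The SWAR variant
assumes two 8-bit channels laid out under the FF00FF mask and a single scalar
multiplier, which is how the expected dump below reads (mul, add the
multiplicand, extract by 8 under FF00FF).

```cpp
#include <cstdint>

// Approximate unorm8 multiply: (x*y + x) / 256, divide done as a shift.
// Assumes x and y are both in [0, 255].
static inline uint32_t mul_unorm8(uint32_t x, uint32_t y) {
    return (x * y + x) >> 8;
}

// SWAR version: xx holds two channels as 0x00BB00AA, y is one scalar in
// [0, 255].  Each lane's product x*(y+1) is at most 255*256, which fits in
// 16 bits, so the low lane never carries into the high lane.
static inline uint32_t mul_unorm8_swar(uint32_t xx, uint32_t y) {
    const uint32_t mask = 0x00FF00FF;
    return ((xx * y + xx) >> 8) & mask;
}
```

With this form, multiplying by 255 is the identity in both argument
positions, and each SWAR lane matches the scalar result.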

Change-Id: I1acc09adb84c426fca4b2be5ca8c2d46d9678dd8
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/220577
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
diff --git a/resources/SkVMTest.expected b/resources/SkVMTest.expected
index 08a63cc..1d60641 100644
--- a/resources/SkVMTest.expected
+++ b/resources/SkVMTest.expected
@@ -590,7 +590,7 @@
 store32 arg(1) r9
 
 I32 (SWAR) 8888 over 8888
-7 registers, 20 instructions:
+8 registers, 20 instructions:
 r0 = splat FF00FF (2.3418409e-38)
 r1 = splat FF (3.5733111e-43)
 loop:
@@ -602,14 +602,14 @@
 r4 = extract r4 8 r0
 r6 = shr r2 16
 r6 = sub_i32 r1 r6
-r5 = mul_i32 r5 r6
-r5 = add_i32 r5 r0
-r5 = extract r5 8 r0
-r5 = add_i32 r3 r5
+r7 = mul_i32 r5 r6
+r7 = add_i32 r7 r5
+r7 = extract r7 8 r0
+r7 = add_i32 r3 r7
 r6 = mul_i32 r4 r6
-r6 = add_i32 r6 r0
+r6 = add_i32 r6 r4
 r6 = extract r6 8 r0
 r6 = add_i32 r2 r6
-r6 = pack r5 r6 8
+r6 = pack r7 r6 8
 store32 arg(1) r6