restore Op::round

While I think trunc(mad(x, scale, 0.5)) is fine for doing our float
to fixed point conversions, round(mul(x, scale)) was kind of better
all around:

   - better rounding than +0.5 and trunc
   - faster when mad() is not an fma
   - often now no need to use the constant 0.5f or have it in a register
   - allows the mul() in to_unorm to use mul_f32_imm

Those last two points are key... this actually frees up 2 registers in
the x86 JIT when using to_unorm().

So I think maybe we can resurrect round and still guarantee our desired
intra-machine stability by committing to using instructions that follow
the current rounding mode, which is what [v]cvtps2dq inextricably uses.

Left some notes on the ARM impl... we're rounding to nearest even there,
which is probably the current mode anyway, but to be more correct we
need a slightly longer impl that rounds float->float then "truncates".
Unsure whether it matters in practice.  Same deal in the unit test that
I added back, now testing negative and 0.5 cases too. The expectations
assume the current mode is nearest even.

I had the idea to resurrect this when I was looking at adding _imm Ops
for fma_f32.  I noticed that the y and z arguments to an fma_f32 were by
far most likely to be constants, and when they are, they're by far likely
to both be constants, e.g. 255.0f & 0.5f from to_unorm(8,...).

llvm disassembly for SkVM_round unit test looks good:

~ $ llc -mcpu=haswell /tmp/skvm-jit-1231521224.bc -o -
	.section	__TEXT,__text,regular,pure_instructions
	.macosx_version_min 10, 15
	.globl	"_skvm-jit-1231521224"  ## -- Begin function skvm-jit-1231521224
	.p2align	4, 0x90
"_skvm-jit-1231521224":                 ## @skvm-jit-1231521224
	.cfi_startproc
	cmpl	$8, %edi
	jl	LBB0_3
	.p2align	4, 0x90
LBB0_2:                                 ## %loopK
                                        ## =>This Inner Loop Header: Depth=1
	vcvtps2dq	(%rsi), %ymm0
	vmovupd	%ymm0, (%rdx)
	addl	$-8, %edi
	addq	$32, %rsi
	addq	$32, %rdx
	cmpl	$8, %edi
	jge	LBB0_2
LBB0_3:                                 ## %hoist1
	xorl	%eax, %eax
	testl	%edi, %edi
	jle	LBB0_6
	.p2align	4, 0x90
LBB0_5:                                 ## %loop1
                                        ## =>This Inner Loop Header: Depth=1
	vcvtss2si	(%rsi,%rax), %ecx
	movl	%ecx, (%rdx,%rax)
	decl	%edi
	addq	$4, %rax
	testl	%edi, %edi
	jg	LBB0_5
LBB0_6:                                 ## %leave
	vzeroupper
	retq
	.cfi_endproc
                                        ## -- End function

Change-Id: Ib59eb3fd8a6805397850d93226c6c6d37cc3ab84
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/276738
Auto-Submit: Mike Klein <mtklein@google.com>
Commit-Queue: Herb Derby <herb@google.com>
Reviewed-by: Herb Derby <herb@google.com>
6 files changed