SkRasterPipeline: fuse clamp_01 into stores.

This is a less generally applicable trick than I had hoped.  The need to thread a context through to each stage really means you can only include one context-dependent stage in each fused batch.

We can still manually fuse these, of course, as you can see in SkRasterPipelineBench.  It's just that we can't really write a generic compile-time template to do it except for context-free stages.  And since we can't write a generic version, and I have only this one specific use case right now, I've kept it quite specific to that use case.
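
To illustrate why context-free stages are the easy case, here is a minimal sketch of a generic compile-time fuser.  It is not SkRasterPipeline's actual API: Pixel, clamp_0, clamp_1 and fused are hypothetical names, and the stages work on one scalar pixel rather than SIMD registers.  With no context pointer to route, a parameter pack of stage functions is all it takes.

```cpp
#include <algorithm>

// Hypothetical pixel type; the real pipeline keeps r,g,b,a in SIMD registers.
struct Pixel { float r, g, b, a; };

// Two hypothetical context-free stages: neither needs any per-stage data.
inline void clamp_0(Pixel& p) {
    p.r = std::max(p.r, 0.0f);
    p.g = std::max(p.g, 0.0f);
    p.b = std::max(p.b, 0.0f);
    p.a = std::max(p.a, 0.0f);
}
inline void clamp_1(Pixel& p) {
    p.r = std::min(p.r, 1.0f);
    p.g = std::min(p.g, 1.0f);
    p.b = std::min(p.b, 1.0f);
    p.a = std::min(p.a, 1.0f);
}

// Runs every stage in order as a single function.  Because each stage has
// the same context-free signature, the pack expands with nothing to thread.
template <void (*... Stages)(Pixel&)>
inline void fused(Pixel& p) {
    // C++11 pack-expansion idiom; the leading 0 keeps an empty pack legal.
    int expand[] = { 0, (Stages(p), 0)... };
    (void)expand;
}
```

A call such as fused<clamp_0, clamp_1>(p) then behaves like one clamp_01 stage.  A stage that takes a context does not fit this signature, which is the limitation described above.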

This does work pretty well for this use case, though.  Here's the fused clamp-then-store-565:
+0x00	pushq               %rbp
+0x01	movq                %rsp, %rbp
+0x04	movq                8(%rdi), %rax
+0x08	xorps               %xmm4, %xmm4
+0x0b	maxps               %xmm4, %xmm3
+0x0e	maxps               %xmm4, %xmm0
+0x11	maxps               %xmm4, %xmm1
+0x14	maxps               %xmm4, %xmm2
+0x17	minps               4262818(%rip), %xmm3
+0x1e	minps               %xmm3, %xmm0
+0x21	minps               %xmm3, %xmm1
+0x24	minps               %xmm3, %xmm2
+0x27	movaps              4965378(%rip), %xmm3
+0x2e	mulps               %xmm3, %xmm0
+0x31	cvtps2dq            %xmm0, %xmm0
+0x35	pslld               $11, %xmm0
+0x3a	mulps               4965375(%rip), %xmm1
+0x41	cvtps2dq            %xmm1, %xmm1
+0x45	pslld               $5, %xmm1
+0x4a	mulps               %xmm3, %xmm2
+0x4d	cvtps2dq            %xmm2, %xmm2
+0x51	orpd                %xmm0, %xmm2
+0x55	orpd                %xmm1, %xmm2
+0x59	pshufb              4474510(%rip), %xmm2
+0x62	movq                %xmm2, (%rax,%rsi,2)
+0x67	popq                %rbp
+0x68	retq
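
For readers who don't want to decode the listing, here is a scalar, one-pixel sketch of what it appears to compute.  This is my reading of the assembly, not the actual Skia source: clamp_then_pack_565 is a hypothetical name, and the constants 1.0f, 31.0f and 63.0f are assumptions, since the listing only shows them as RIP-relative loads.  The final pshufb and movq, which pack four pixels and store them, are left out.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Scalar sketch of the fused clamp-then-store-565 stage for a single pixel.
// Assumed constants: 1.0f for the alpha clamp, 31/63/31 for the 5-6-5 scales.
inline uint16_t clamp_then_pack_565(float r, float g, float b, float a) {
    // maxps against a zeroed register: clamp all four channels below at 0.
    a = std::max(a, 0.0f);
    r = std::max(r, 0.0f);
    g = std::max(g, 0.0f);
    b = std::max(b, 0.0f);

    // minps: alpha against a constant (assumed 1.0f), then each color
    // against the already-clamped alpha, as the listing reads.
    a = std::min(a, 1.0f);
    r = std::min(r, a);
    g = std::min(g, a);
    b = std::min(b, a);

    // mulps + cvtps2dq: scale to the channel range and round to nearest
    // (lrintf follows the current rounding mode, as cvtps2dq does).
    uint32_t R = static_cast<uint32_t>(std::lrintf(r * 31.0f));
    uint32_t G = static_cast<uint32_t>(std::lrintf(g * 63.0f));
    uint32_t B = static_cast<uint32_t>(std::lrintf(b * 31.0f));

    // pslld $11 / pslld $5 / or: assemble the 16-bit 565 value.
    return static_cast<uint16_t>((R << 11) | (G << 5) | B);
}
```

Under those assumptions, an opaque out-of-range white such as (2, 2, 2, 2) packs to 0xFFFF, and any color with alpha 0 packs to 0.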

BUG=skia:

GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2745

Change-Id: Ia7d66aecc6cbff154158d2600d7874feed1a76f6
Reviewed-on: https://skia-review.googlesource.com/2745
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
1 file changed