trim another instruction off SkRasterPipeline overhead

The overhead of a stage today is 3 x86 instructions, typically looking something like this:
   - movq (%rdi), %rax  // Load the next stage function pointer.
   - addq $0x10, %rdi   // Step our progress ahead 16 bytes to that next stage.
   - jmpq *%rax         // Transfer control to that stage.
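
Roughly, the old layout and dispatch look like this minimal C++ sketch (hypothetical, simplified types; not the actual Skia signatures):

    struct Pair { void (*fn)(Pair*); void* ctx; };

    static void a_stage(Pair* program) {
        // ... the stage's real work goes here ...
        program->fn(program + 1);   // movq (%rdi),%rax; addq $0x10,%rdi; jmpq *%rax
    }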

But if we make sure the pointer's in esi/rsi, we can use lodsd/lodsq to do those first two steps in one instruction:
   - lodsq (%rsi), %rax    (≈ movq (%rsi), %rax; addq $0x8, %rsi).
   - jmpq *%rax

This CL rearranges things so we can take advantage of this and generally trim an instruction off the overhead.  Instead of a vector of {Fn, ctx} pairs, we flatten the program down to a single interleaved vector of void*, simply omitting any null context pointers.  We pass the pointer to the program as the second argument to Fn, putting it in rsi.  Together these two changes make getting the next Fn to call, or the current context, the same cheap lodsq instruction, encapsulated as load_and_increment().
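
Under GCC/Clang on x86-64, load_and_increment() can be a one-line asm statement; here's a sketch, assuming that toolchain (the exact signature in the Skia source may differ):

    static inline void* load_and_increment(void**& program) {
    #if defined(__GNUC__) && defined(__x86_64__)
        void* rax;
        // "+S" pins program in %rsi; "=a" puts the loaded value in %rax.
        asm("lodsq" : "=a"(rax), "+S"(program));
        return rax;
    #else
        return *program++;   // Portable fallback, e.g. for MSVC.
    #endif
    }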

Here's how the simple "modulate" blend stage changes:

    vmulps  %ymm4, %ymm0, %ymm0
    vmulps  %ymm5, %ymm1, %ymm1
    vmulps  %ymm6, %ymm2, %ymm2
    vmulps  %ymm7, %ymm3, %ymm3
    movq    (%rdi), %rax
    addq    $0x10, %rdi
    jmpq    *%rax

    ~~~~~~~~>

    vmulps  %ymm4, %ymm0, %ymm0
    vmulps  %ymm5, %ymm1, %ymm1
    vmulps  %ymm6, %ymm2, %ymm2
    vmulps  %ymm7, %ymm3, %ymm3
    lodsq   (%rsi), %rax
    jmpq    *%rax

This does make getting the current context a one-time, destructive operation.  Stages now call ctx(), a thunk that returns a void*, instead of referring to a ctx void* directly.  No stage so far has ever referred to ctx twice, and it all appears to inline, so this seems harmless.  "matrix_2x3" is a good example of what stages that use context pointers end up looking like:

    lodsq   (%rsi), %rax
    vbroadcastss    (%rax), %ymm9
    vbroadcastss    0x8(%rax), %ymm10
    vbroadcastss    0x10(%rax), %ymm8
    vfmadd231ps     %ymm10, %ymm1, %ymm8
    vfmadd231ps     %ymm9, %ymm0, %ymm8
    vbroadcastss    0x4(%rax), %ymm10
    vbroadcastss    0xc(%rax), %ymm11
    vbroadcastss    0x14(%rax), %ymm9
    vfmadd231ps     %ymm11, %ymm1, %ymm9
    vfmadd231ps     %ymm10, %ymm0, %ymm9
    lodsq   (%rsi), %rax
    vmovaps %ymm8, %ymm0
    vmovaps %ymm9, %ymm1
    jmpq    *%rax
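
In source, the new stage shape might read something like this simplified scalar sketch (hypothetical names, reusing load_and_increment() from above; the real stages are vectorized and generated by macros):

    using StageFn = void(*)(void** program, float r[8], float g[8]);

    static void matrix_2x3_sketch(void** program, float r[8], float g[8]) {
        auto m = (const float*)load_and_increment(program);   // first lodsq: our ctx
        for (int i = 0; i < 8; i++) {
            float R = m[0]*r[i] + m[2]*g[i] + m[4],
                  G = m[1]*r[i] + m[3]*g[i] + m[5];
            r[i] = R;
            g[i] = G;
        }
        auto next = (StageFn)load_and_increment(program);     // second lodsq: next Fn
        next(program, r, g);                                  // jmpq *%rax
    }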

We can't do this with MSVC: there's no intrinsic for it that I can find, 64-bit MSVC disallows inline assembly, and rsi isn't used to pass arguments under the Windows calling convention anyway.  ARM doesn't need it; post-indexed loads already fold the load and increment into one instruction, so its dispatch is naturally two instructions.  We could do this for 32-bit x86 too, but I'd rather focus on x86-64.
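
For comparison, here's the sort of two-instruction dispatch AArch64 gets for free with a post-indexed load (registers chosen for illustration):

    ldr  x9, [x1], #8    // Load the next stage pointer and bump program by 8.
    br   x9              // Jump to it.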

It's unclear to me whether this makes anything faster, but it doesn't appear to make anything slower, and I think it makes both the code and the disassembly simpler.

CQ_INCLUDE_TRYBOTS=skia.primary:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD

Change-Id: Ia7b543a6718c75a33095371924003c5402b3445a
Reviewed-on: https://skia-review.googlesource.com/6271
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
1 file changed