trim another instruction off SkRasterPipeline overhead

The overhead of a stage today is 3 x86 instructions, typically looking something like this:
   - movq (%rdi), %rax  // Load the next stage function pointer.
   - addq $0x10, %rdi   // Step our progress ahead 16 bytes to that next stage.
   - jmpq *%rax         // Transfer control to that stage.
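
Roughly, the old layout and dispatch look like this minimal C++ sketch (hypothetical, simplified types; not the actual Skia signatures):

    struct Pair { void (*fn)(Pair*); void* ctx; };

    static void a_stage(Pair* program) {
        // ... the stage's real work goes here ...
        program->fn(program + 1);   // movq (%rdi),%rax; addq $0x10,%rdi; jmpq *%rax
    }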

But if we make sure the pointer's in esi/rsi, we can use lodsd/lodsq to do those first two steps in one instruction:
   - lodsq (%rsi), %rax    (≈ movq (%rsi), %rax; addq $0x8, %rsi).
   - jmpq *%rax

This CL rearranges things so we can take advantage of this and generally trim an instruction off the overhead.  Instead of a vector of {Fn, ctx} pairs, we flatten the program down to a single interleaved vector of void*, simply omitting any null context pointers.  We pass the pointer to the program as the second argument to Fn, putting it in rsi.  Together these two changes make getting the next Fn to call, or the current context, the same cheap lodsq instruction, encapsulated as load_and_increment().
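
Under GCC/Clang on x86-64, load_and_increment() can be a one-line asm statement; here's a sketch, assuming that toolchain (the exact signature in the Skia source may differ):

    static inline void* load_and_increment(void**& program) {
    #if defined(__GNUC__) && defined(__x86_64__)
        void* rax;
        // "+S" pins program in %rsi; "=a" puts the loaded value in %rax.
        asm("lodsq" : "=a"(rax), "+S"(program));
        return rax;
    #else
        return *program++;   // Portable fallback, e.g. for MSVC.
    #endif
    }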

Here's how the simple "modulate" blend stage changes:

    vmulps  %ymm4, %ymm0, %ymm0
    vmulps  %ymm5, %ymm1, %ymm1
    vmulps  %ymm6, %ymm2, %ymm2
    vmulps  %ymm7, %ymm3, %ymm3
    movq    (%rdi), %rax
    addq    $0x10, %rdi
    jmpq    *%rax

    ~~~~~~~~>

    vmulps  %ymm4, %ymm0, %ymm0
    vmulps  %ymm5, %ymm1, %ymm1
    vmulps  %ymm6, %ymm2, %ymm2
    vmulps  %ymm7, %ymm3, %ymm3
    lodsq   (%rsi), %rax
    jmpq    *%rax

This does make getting the current context a one-time, destructive operation.  Stages now call ctx(), a thunk that returns a void*, instead of referring to a ctx void* directly.  No stage so far has ever referred to ctx twice, and it all appears to inline, so this seems harmless.  "matrix_2x3" is a good example of what stages that use context pointers end up looking like:

    lodsq   (%rsi), %rax
    vbroadcastss    (%rax), %ymm9
    vbroadcastss    0x8(%rax), %ymm10
    vbroadcastss    0x10(%rax), %ymm8
    vfmadd231ps     %ymm10, %ymm1, %ymm8
    vfmadd231ps     %ymm9, %ymm0, %ymm8
    vbroadcastss    0x4(%rax), %ymm10
    vbroadcastss    0xc(%rax), %ymm11
    vbroadcastss    0x14(%rax), %ymm9
    vfmadd231ps     %ymm11, %ymm1, %ymm9
    vfmadd231ps     %ymm10, %ymm0, %ymm9
    lodsq   (%rsi), %rax
    vmovaps %ymm8, %ymm0
    vmovaps %ymm9, %ymm1
    jmpq    *%rax
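
In source, the new stage shape might read something like this simplified scalar sketch (hypothetical names, reusing load_and_increment() from above; the real stages are vectorized and generated by macros):

    using StageFn = void(*)(void** program, float r[8], float g[8]);

    static void matrix_2x3_sketch(void** program, float r[8], float g[8]) {
        auto m = (const float*)load_and_increment(program);   // first lodsq: our ctx
        for (int i = 0; i < 8; i++) {
            float R = m[0]*r[i] + m[2]*g[i] + m[4],
                  G = m[1]*r[i] + m[3]*g[i] + m[5];
            r[i] = R;
            g[i] = G;
        }
        auto next = (StageFn)load_and_increment(program);     // second lodsq: next Fn
        next(program, r, g);                                  // jmpq *%rax
    }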

We can't do this with MSVC: there's no intrinsic for it that I can find, 64-bit MSVC disallows inline assembly, and rsi isn't used to pass arguments under the Windows calling convention anyway.  ARM doesn't need it; post-indexed loads already fold the load and increment into one instruction, so its dispatch is naturally two instructions.  We could do this for 32-bit x86 too, but I'd rather focus on x86-64.
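
For comparison, here's the sort of two-instruction dispatch AArch64 gets for free with a post-indexed load (registers chosen for illustration):

    ldr  x9, [x1], #8    // Load the next stage pointer and bump program by 8.
    br   x9              // Jump to it.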

It's unclear to me whether this makes anything faster, but it doesn't appear to make anything slower, and I think it makes both the code and the disassembly simpler.

CQ_INCLUDE_TRYBOTS=skia.primary:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD

Change-Id: Ia7b543a6718c75a33095371924003c5402b3445a
Reviewed-on: https://skia-review.googlesource.com/6271
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
1 file changed