reorder to minimize register pressure

Rewrite program instructions so that each value becomes available as
late as possible, just before it's used by another instruction.  This
reorders blocks of instructions to reduce them number of temporary
registers in flight.

Take this example of the sort of program that we naturally write,
noting the registers needed as we progress down the right:

    src = load32 ...          (1)
    sr = extract src ...      (2)
    sg = extract src ...      (3)
    sb = extract src ...      (4)
    sa = extract src ...      (4, src dies)

    dst = load32 ...          (5)
    dr = extract dst ...      (6)
    dg = extract dst ...      (7)
    db = extract dst ...      (8)
    da = extract dst ...      (8, dst dies)

    r = add sr dr             (7, sr and dr die)
    g = add sg dg             (6, sg and dg die)
    b = add sb db             (5, sb and db die)
    a = add sa da             (4, sa and da die)

    rg   = pack r g ...       (3, r and g die)
    ba   = pack b a ...       (2, b and a die)
    rgba = pack rg ba ...     (1, rg and ba die)
    store32 rgba ...          (0, rgba dies)

That original ordering of the code needs 8 registers (perhaps with a
temporary 9th, but we'll ignore that here).  This CL will rewrite the
program to something more like this by recursively issuing inputs only
once needed:

    src = load32 ...       (1)
    sr  = extract src ...  (2)
    dst = load32 ...       (3)
    dr  = extract dst ...  (4)
     r  = add sr dr        (3, sr and dr die)

    sg  = extract src ...  (4)
    dg  = extract dst ...  (5)
     g  = add sg dg        (4, sg and dg die)

    rg  = pack r g         (3, r and g die)

    sb  = extract src ...  (4)
    db  = extract dst ...  (5)
     b  = add sb db        (4, sb and db die)

    sa  = extract src ...  (4, src dies)
    da  = extract dst ...  (4, dst dies)
     a  = add sa da        (3, sa and da die)

    ba  = pack b a         (2, b and a die)

    rgba = pack rg ba ...  (1, rg and ba die)
    store32 rgba  ...      (0)

That trims 3 registers off the example, just by reordering!
I've added the real version of this example to SkVMTest.cpp.
(Its 6th register comes from holding the 0xff byte mask used
by extract, in case you're curious).

I'll admit it's not exactly easy to work out how this reordering works
without a pen and paper or trial and error.  I've tried to make the
implementation preserve the original program's order as much as makes
sense (i.e. when order is an otherwise arbitrary choice) to keep it
somewhat sane to follow.

This reordering naturally skips dead code, so pour one out for ☠️ .
We lose our cute dead code emoji marker, but on the other hand all code
downstream of Builder::done() can assume every instruction is live.

Change-Id: Iceffcd10fd7465eae51a39ef8eec7a7189766ba2
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/249999
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
3 files changed