rearrange SkJumper registers on 32-bit x86

There are not many registers on 32-bit x86, and we use most of them to
pass Stage function arguments.  This means few are available as
temporaries, and we're forced to hit the stack all the time.  xmm
registers are the most egregious example: we use all 8 registers to pass
data, leaving none free as temporaries.

This CL cuts things down pretty dramatically, from passing 5 general
purpose and 8 xmm registers to 2 general purpose and 4 xmm registers.
One of the two general purpose registers is a pointer to space on the
stack where we store all those other values.

Every stage function needs to use the program pointer, so that stays in
a general purpose register.  Almost every stage uses the r,g,b,a
vectors, so they stay in xmm registers.  The rest (destination x,y, the
tail mask, a pointer to tricky constants, and the dr,dg,db,da vectors)
now live on the stack.
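
As a rough illustration of the new argument split (not the actual
SkJumper source), the sketch below passes a pointer to a stack-resident
struct, the program pointer, and the four color vectors to every stage.
The names F4, Params, Colors, load_and_inc, next, and run_demo are
assumptions made up for this example, and F4 is a plain struct standing
in for an xmm register.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for one 4-lane float vector (one xmm register).
struct F4 { float v[4]; };
static F4 splat(float f) { return F4{{f, f, f, f}}; }

// Rarely used values, gathered in one struct that lives on the stack.
// Field names are illustrative only.
struct Params {
    size_t x, y, tail;       // destination x,y and the tail mask
    const void* constants;   // pointer to tricky constants
    F4 dr, dg, db, da;       // destination color vectors
};

struct Colors { F4 r, g, b, a; };

// Every stage gets: pointer to stack params, program pointer, r,g,b,a.
using Stage = void (*)(Params*, void** program, F4 r, F4 g, F4 b, F4 a);

// Read the next pointer-sized value from the program and advance.
static void* load_and_inc(void**& program) { return *program++; }

// Fetch the next stage from the program and call it.
static void next(Params* p, void** program, F4 r, F4 g, F4 b, F4 a) {
    Stage fn = reinterpret_cast<Stage>(load_and_inc(program));
    fn(p, program, r, g, b, a);
}

// A stage that only needs r,g,b,a: it never touches Params.
static void swap_rb(Params* p, void** program, F4 r, F4 g, F4 b, F4 a) {
    next(p, program, b, g, r, a);
}

// A stage that needs dr,dg,db,da: it reaches them through Params.
static void move_src_dst(Params* p, void** program, F4 r, F4 g, F4 b, F4 a) {
    p->dr = r; p->dg = g; p->db = b; p->da = a;
    next(p, program, r, g, b, a);
}

// Final stage: its context pointer follows it in the program.
static void store(Params*, void** program, F4 r, F4 g, F4 b, F4 a) {
    Colors* out = static_cast<Colors*>(load_and_inc(program));
    *out = Colors{r, g, b, a};
}

struct Demo { Colors out; Params params; };

static Demo run_demo() {
    Demo d{};  // zero-initialized; d.params plays the role of the stack space
    void* program[] = {
        reinterpret_cast<void*>(&swap_rb),
        reinterpret_cast<void*>(&move_src_dst),
        reinterpret_cast<void*>(&store),
        &d.out,
    };
    next(&d.params, program, splat(1.0f), splat(2.0f), splat(3.0f), splat(4.0f));
    return d;
}
```

Calling run_demo() runs swap_rb, move_src_dst, and store in order.
Only move_src_dst dereferences Params, which mirrors the idea that the
less commonly used values can sit in memory while the hot ones stay in
registers.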

The generated code is about 20K smaller and runs about 20% faster.

    $ out/monobench SkRasterPipeline_srgb 200
    Before: 358.784ns
    After:  282.563ns

Change-Id: Icc117af95c1a81c41109984b32e0841022f0d1a6
Reviewed-on: https://skia-review.googlesource.com/27620
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
3 files changed