De-templatize Sk4pxXfermode code a bit.

This deduplicates a few pieces of code:
  - we end up with one copy of each xfer32() driver loop instead of one per xfermode;
  - we end up with two* copies of each xfermode implementation instead of ten**.

* For a given Mode: Mode() itself and xfer_aa<Mode>().
** From unrolling: twice at a stride of 8, once at 4, once at 2, and once at 1, then all again for when we have AA.

This decreases the size of SkXfermode.o from 1.5M to 620K on x86-64 and from 1.3M to 680K on ARMv7+NEON.

If we wanted to, we could eliminate the xfer_aa<Mode>() copy by tagging each Mode() function as __attribute__((noinline)) or its equivalent.  This would result in another ~100K space savings.

Performance is affected in proportion to the original xfermode speed:
fast modes like Plus take the largest proportional hit, and slow modes
like HardLight or SoftLight see essentially no hit at all.

This adds SK_VECTORCALL to help keep this code fast on ARMv7 and Windows.  I've looked at the ARMv7 generated code... it looks good, even pretty.

For compatibility with SK_VECTORCALL, we now pass the vector-sized arguments by value instead of by reference.  Some refactoring now allows us to declare each mode as just a static function instead of a struct, which simplifies things.

TBR=reed@google.com
No public API changes.

BUG=skia:

Review URL: https://codereview.chromium.org/1242743004
2 files changed