aacc + bbdd

SkMatrix::mapPts() using aacc/bbdd was always worse than using badc():
  - On Intel, it was faster than exisiting swizzle, but badc() is 10% faster still (one pshufd instead of two).
  - On ARM, existing swizzle < badc() < aacc()+bbdd(), even though aacc() then bbdd() is really a single vtrn instruction.

I will revert SkMatrix.cpp before submitting.  Just thought you might like to look.

Will think more and try to gear up Instruments on ARM.

BUG=skia:

Review URL: https://codereview.chromium.org/1012573003
5 files changed