Some performance tweaks for DAA

1. Always inline (Clang previously ignored the inline hint and was
   25% slower)
2. Use SIMD everywhere except x86 gcc:
   non-SIMD is faster only on my desktop with gcc;
   with Clang on my desktop, SIMD is 50% faster than non-SIMD.
3. Allocate 4x memory instead of 2x when running out of space:
   on old Android devices with Linux kernel 3.10 (e.g., Nexus 6P, 5X),
   the alloc/memcpy triggers a major bottleneck in the kernel (30% of
   the running time). This bottleneck goes away in Linux kernel 3.18
   (e.g., Pixel), where the kernel no longer does wasteful work
   during alloc/memcpy, which is why DAA is much faster on Pixel than
   on Nexus 6P.
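
A minimal sketch of tweaks 1 and 3, for illustration only. The names
DEMO_ALWAYS_INLINE, DemoBuffer, demo_push, demo_free and
demo_reallocs_for are hypothetical and are not the identifiers used in
this change; the initial capacity of 4 is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical force-inline macro (assumption: not the macro used by
// the actual change). A plain `inline` is only a hint to the compiler.
#if defined(__GNUC__) || defined(__clang__)
#  define DEMO_ALWAYS_INLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#  define DEMO_ALWAYS_INLINE __forceinline
#else
#  define DEMO_ALWAYS_INLINE inline
#endif

// Hypothetical growable buffer of ints.
struct DemoBuffer {
    int*   data     = nullptr;
    size_t count    = 0;
    size_t capacity = 0;
    int    reallocs = 0;  // number of alloc/memcpy rounds so far
};

// Appends one value; when the buffer is full, grows capacity by 4x
// (instead of 2x) so that fewer alloc/memcpy rounds are needed.
static DEMO_ALWAYS_INLINE void demo_push(DemoBuffer* buf, int value) {
    if (buf->count == buf->capacity) {
        size_t newCap = buf->capacity ? buf->capacity * 4 : 4;
        int* grown = static_cast<int*>(std::malloc(newCap * sizeof(int)));
        assert(grown != nullptr);
        if (buf->count) {
            std::memcpy(grown, buf->data, buf->count * sizeof(int));
        }
        std::free(buf->data);
        buf->data = grown;
        buf->capacity = newCap;
        buf->reallocs++;
    }
    buf->data[buf->count++] = value;
}

// Releases the buffer's storage and resets its bookkeeping.
static void demo_free(DemoBuffer* buf) {
    std::free(buf->data);
    buf->data = nullptr;
    buf->count = buf->capacity = 0;
}

// Returns how many alloc/memcpy rounds are needed to push n values.
static int demo_reallocs_for(size_t n) {
    DemoBuffer buf;
    for (size_t i = 0; i < n; ++i) {
        demo_push(&buf, static_cast<int>(i));
    }
    int rounds = buf.reallocs;
    demo_free(&buf);
    return rounds;
}
```

In this sketch, pushing 1000 values takes 5 alloc/memcpy rounds
(capacities 4, 16, 64, 256, 1024); doubling from the same starting
capacity would take 9. The trade-off is up to 4x over-allocation.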

I should probably adopt SkRasterPipeline for device-specific
optimizations.

Bug: skia:
Change-Id: I0408aa7671a5f1b39aad3bec25f8fc994ff5a1bb
Reviewed-on: https://skia-review.googlesource.com/30820
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Yuqian Li <liyuqian@google.com>
2 files changed