cut SkMaskBlurFilter code size by about half

Replace the templated loaders with function pointers,
as we already do for the BlurX functions, and favor A8.
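
As a rough illustration of the template-to-function-pointer trade described
above (not the actual SkMaskBlurFilter code), the sketch below uses hypothetical
names throughout: MaskFormat, LoadFn, load_a8, load_argb32_alpha, sum_row and
pick_loader are all made up for the example, and the 4-byte layout with alpha in
the last byte is an assumption. The templated routine is stamped out once per
loader it is used with, while the function-pointer routine is compiled once and
pays an indirect call per pixel.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mask formats, for illustration only.
enum class MaskFormat { kA8, kARGB32 };

// A loader reads the coverage of pixel i from a source row.
using LoadFn = uint8_t (*)(const uint8_t* src, size_t i);

// A8: one byte per pixel, so the load is a plain array read.
static uint8_t load_a8(const uint8_t* src, size_t i) { return src[i]; }

// Assumed layout: 4 bytes per pixel with alpha in the last byte.
static uint8_t load_argb32_alpha(const uint8_t* src, size_t i) {
    return src[4 * i + 3];
}

// Template style: the compiler emits a separate copy of this loop for
// every distinct Loader type, which grows code size.
template <typename Loader>
uint32_t sum_row_templated(const uint8_t* src, size_t n, Loader load) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) { sum += load(src, i); }
    return sum;
}

// Function-pointer style: one copy of the loop, with the loader chosen
// at run time. Whether the call gets inlined is left to the compiler.
uint32_t sum_row(const uint8_t* src, size_t n, LoadFn load) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) { sum += load(src, i); }
    return sum;
}

// Pick the loader once per mask rather than once per template instantiation.
LoadFn pick_loader(MaskFormat fmt) {
    switch (fmt) {
        case MaskFormat::kA8:     return load_a8;
        case MaskFormat::kARGB32: return load_argb32_alpha;
    }
    return load_a8;
}
```

A caller would select the loader once, e.g. sum_row(row, width,
pick_loader(MaskFormat::kA8)), and reuse it for every row of the mask.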

Also let the compiler and build config decide all the inlining.

A stripped, optimized ARM64 build goes from about 56K to about 26K.

Speed on our mask blur benches (bench/BlurBench.cpp) on a Galaxy S9 is mixed:
some slowdowns, some speedups.  Performance seems to fall between the
all-template version at head and the all-function-pointer version in patch set 2.

Change-Id: Ia27d92c08ca68e0b44c89e8a77d7b6e7297239c4
Reviewed-on: https://skia-review.googlesource.com/137889
Reviewed-by: Ben Wagner <bungeman@google.com>
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
1 file changed