Add AVX2 version of ConvolveVertically

ConvolveVertically time is reduced by about 60% on a Haswell CPU.
Nanobench results:
                             before     after
bitmap_scale_filter_64_256    611us     302us
bitmap_scale_filter_80_90     101us    64.9us
bitmap_scale_filter_30_90    82.3us    51.4us
bitmap_scale_filter_10_90    73.6us    42.4us
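
For reference, a minimal scalar sketch of the kind of per-column work that vertical convolution does, which is the loop an AVX2 version would process many bytes at a time. This is not the code in this change: the name ConvolveVerticallyScalar, the parameter layout, and the 14-bit fixed-point weight precision (kShiftBits) are assumptions for illustration, and special handling of the alpha channel is left out.

```cpp
#include <cstdint>

// Assumed fixed-point precision of the filter weights (hypothetical value
// chosen for this sketch): a weight of 1.0 is stored as 1 << kShiftBits.
constexpr int kShiftBits = 14;

// Drop the fractional bits of the accumulator and clamp the result to [0, 255].
static inline uint8_t ClampToByte(int32_t accum) {
    int32_t v = accum >> kShiftBits;
    if (v < 0) return 0;
    if (v > 255) return 255;
    return static_cast<uint8_t>(v);
}

// Hypothetical scalar reference: each output byte is the weighted sum of the
// bytes at the same column in every source row. The iterations over x are
// independent of each other, which is what makes the loop easy to vectorize.
void ConvolveVerticallyScalar(const int16_t* weights, int filterLength,
                              const uint8_t* const* sourceRows,
                              int byteWidth, uint8_t* outRow) {
    for (int x = 0; x < byteWidth; ++x) {
        int32_t accum = 0;
        for (int y = 0; y < filterLength; ++y) {
            accum += static_cast<int32_t>(weights[y]) * sourceRows[y][x];
        }
        outRow[x] = ClampToByte(accum);
    }
}
```

With two source rows and both weights set to 1 << 13 (0.5 each under the assumed precision), every output byte is the average of the two input bytes in that column.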

BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2526733002
CQ_INCLUDE_TRYBOTS=skia.primary:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD

Review-Url: https://codereview.chromium.org/2526733002
3 files changed