Implementing Color32 functions for Neon platforms.

Besides the raw processing improvement provided by Neon, the code uses memory
prefetches (pld), which seem to improve performance greatly when dealing with
very large counts.
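
As a rough illustration of the prefetch idea only, here is a portable scalar
sketch, not the Neon code from this change. The names color32_prefetch_sketch,
scale_pixel and kAhead are hypothetical, the prefetch distance is an arbitrary
assumption, and the blend formula (premultiplied color plus source scaled by
256 - (alpha + 1)) is assumed rather than taken from the patch.
__builtin_prefetch is the GCC/Clang builtin, which typically lowers to pld on
ARM.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: multiply each 8-bit channel of a packed 32-bit pixel
// by scale/256, two channels at a time. Requires scale <= 256.
static inline uint32_t scale_pixel(uint32_t c, unsigned scale) {
    const uint32_t mask = 0x00FF00FF;
    uint32_t rb = ((c & mask) * scale) >> 8;
    uint32_t ag = ((c >> 8) & mask) * scale;
    return (rb & mask) | (ag & ~mask);
}

// Hypothetical name. Blends a premultiplied color over count source pixels,
// issuing a prefetch hint a fixed distance ahead of the current read position.
void color32_prefetch_sketch(uint32_t* dst, const uint32_t* src,
                             size_t count, uint32_t color) {
    const unsigned alpha = color >> 24;
    const unsigned scale = 256 - (alpha + 1);  // assumed blend factor
    const size_t kAhead = 32;                  // assumed distance, in pixels
    for (size_t i = 0; i < count; ++i) {
        // Hint once per 8 pixels; skip hints that would point past the end.
        if ((i % 8) == 0 && i + kAhead < count) {
#if defined(__GNUC__) || defined(__clang__)
            __builtin_prefetch(src + i + kAhead);
#endif
        }
        dst[i] = color + scale_pixel(src[i], scale);
    }
}
```

The hint does not change the result; it only asks the memory system to start
loading data the loop will read shortly, which is why it matters mostly for
long runs.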

This was tested using bench where color32 accounts for the majority of the
workload:
bench -match rects_1 -config 8888 -repeat 500 -forceBlend 1
(forceBlend is there so that the Color32 code does not go through the
special case for alpha == 0xFF, which would turn color32 into an
sk_memset32.)
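
To make the special case concrete, here is a minimal sketch of the kind of
dispatch described above. It is not the Skia source: Color32Path and
color32_dispatch_sketch are hypothetical names, std::fill_n stands in for
sk_memset32, and the per-channel blend formula is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical: reports which path handled the call.
enum class Color32Path { kFill, kBlend };

// Hypothetical name. An opaque color needs no blending, so the call reduces
// to a 32-bit fill; any other alpha goes through the per-pixel blend loop.
Color32Path color32_dispatch_sketch(uint32_t* dst, const uint32_t* src,
                                    size_t count, uint32_t color) {
    const unsigned alpha = color >> 24;
    if (alpha == 0xFF) {
        std::fill_n(dst, count, color);  // stand-in for sk_memset32
        return Color32Path::kFill;
    }
    const unsigned scale = 256 - (alpha + 1);  // assumed blend factor
    for (size_t i = 0; i < count; ++i) {
        uint32_t out = 0;
        // Blend each 8-bit channel: color + (src * scale) / 256.
        for (int shift = 0; shift < 32; shift += 8) {
            const uint32_t s = (src[i] >> shift) & 0xFF;
            const uint32_t c = (color >> shift) & 0xFF;
            out |= ((c + ((s * scale) >> 8)) & 0xFF) << shift;
        }
        dst[i] = out;
    }
    return Color32Path::kBlend;
}
```

With an opaque color the benchmark would mostly measure the fill, which is why
the run above forces a non-opaque blend.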

Numbers averaged over 3 runs:
bench name      | Before | Neon, no pld | Neon with pld | full boost
rrects_1        | 153.9  | 128.3        | 92            | 1.66x
rects_1_stroke_4| 32.8   | 31.4         | 28.45         | 1.15x
rects_1         | 125.35 | 97.2         | 63.59         | 1.97x

Credits: various googletv team members.

Committed on behalf of evannier.
Review URL: http://codereview.appspot.com/5569077/

git-svn-id: http://skia.googlecode.com/svn/trunk@4779 2bbb7eff-a529-9590-31e7-b0007b416f81
2 files changed