Optimise YuvToRGB using 16-bit arithmetic.

Reimplement the YuvToRGB intrinsic using 16-bit SIMD arithmetic to
increase throughput.  Implementations are provided for AArch32 and
AArch64 NEON.
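
For illustration only, a scalar C sketch of how a YUV to RGB conversion
can be kept within 16-bit intermediates.  The BT.601 video-range
coefficients, the 5 fractional bits and the names clamp_scaled_u8 and
yuv_to_rgb_16bit are assumptions made for this sketch; the NEON code in
this change may use different scaling and rounding.

```c
#include <stdint.h>

/* Hypothetical helper: take a value scaled by 32, clamp it to 0..255.
 * Negative inputs are caught before the shift, so no signed right shift
 * of a negative number ever happens. */
static uint8_t clamp_scaled_u8(int16_t scaled)
{
    if (scaled < 0)
        return 0;
    scaled = (int16_t)(scaled >> 5);
    return (uint8_t)(scaled > 255 ? 255 : scaled);
}

/* Scalar model of one pixel.  Coefficients are round(c * 32) for the
 * assumed BT.601 video-range matrix (1.164, 1.596, 0.391, 0.813, 2.018).
 * With 5 fractional bits every intermediate stays within roughly
 * -8900..17200, so it fits in int16_t without saturation. */
static void yuv_to_rgb_16bit(uint8_t y, uint8_t u, uint8_t v, uint8_t rgb[3])
{
    int16_t c = (int16_t)(y - 16);
    int16_t d = (int16_t)(u - 128);
    int16_t e = (int16_t)(v - 128);

    int16_t luma = (int16_t)(37 * c + 16);   /* +16 rounds the later >> 5 */
    int16_t r = (int16_t)(luma + 51 * e);
    int16_t g = (int16_t)(luma - 13 * d - 26 * e);
    int16_t b = (int16_t)(luma + 65 * d);

    rgb[0] = clamp_scaled_u8(r);
    rgb[1] = clamp_scaled_u8(g);
    rgb[2] = clamp_scaled_u8(b);
}
```

Keeping the intermediates in 16 bits is what lets a 128-bit NEON
register carry eight lanes per operation instead of four 32-bit lanes;
the cost is reduced coefficient precision, as the small scale factor in
this sketch shows.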

Change-Id: Idd43e383f5147c33b0b546fa822c970de432c19d
5 files changed