Optimisations to 3DLUT assembly.

Process more pixels at once to try to keep the register file fuller and
more tightly packed, allowing more concurrency.  Implemented in both
AArch32 and AArch64 assembly.

Change-Id: I683078ff02155cc14bacce35bce3d3fe06857095
5 files changed