Optimisations to blur intrinsic.

Try to keep all data in registers wherever possible, and use only a minimal
circular buffer on the stack when necessary.  Implemented for AArch32 and
AArch64 NEON.
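
For illustration only, a scalar C sketch of the circular-buffer idea.  This
is not the NEON code in this change: the function name, the box (rather than
Gaussian) kernel, the edge clamping, and the MAX_RADIUS and RING_SIZE values
are all assumptions.  It blurs one row in place, keeping the running sum in a
local variable and saving only the recently overwritten pixels in a small
ring on the stack.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed limits for this sketch; RING_SIZE must be a power of two
 * strictly greater than MAX_RADIUS. */
enum { MAX_RADIUS = 25, RING_SIZE = 32 };

/* Hypothetical helper: box-blur one row of 8-bit pixels in place with
 * clamped edges.  Returns 0 on success, -1 on bad arguments. */
static int box_blur_row_inplace(uint8_t *row, size_t n, unsigned radius)
{
    uint8_t ring[RING_SIZE];      /* originals of recently overwritten pixels */
    unsigned window = 2 * radius + 1;
    unsigned sum = 0;             /* running window sum, stays in a register */

    if (row == NULL || n == 0 || radius > MAX_RADIUS)
        return -1;

    /* Prime the sum for pixel 0, clamping indices to the row. */
    for (long k = -(long)radius; k <= (long)radius; k++) {
        size_t idx = k < 0 ? 0 : ((size_t)k >= n ? n - 1 : (size_t)k);
        sum += row[idx];
    }

    for (size_t i = 0; i < n; i++) {
        /* Save the original value before it is overwritten. */
        ring[i & (RING_SIZE - 1)] = row[i];
        row[i] = (uint8_t)((sum + window / 2) / window);

        if (i + 1 == n)
            break;                /* no further window to prepare */

        /* Slide the window: the entering pixel is still untouched in the
         * row; the leaving pixel has been overwritten, so read the ring. */
        size_t in_idx = i + 1 + radius;
        if (in_idx >= n)
            in_idx = n - 1;
        size_t out_idx = i >= radius ? i - radius : 0;

        sum += row[in_idx];
        sum -= ring[out_idx & (RING_SIZE - 1)];
    }
    return 0;
}
```

The ring only needs to cover the radius + 1 most recent pixels, because
pixels ahead of the current position have not been overwritten yet.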

Change-Id: If3dd4932a94ee1cadde46e298b8f6bf14b6c2bdc
5 files changed