Simplify code of convolve3x3

Instead of first doing all the multiplications and then adding the
results in a tree manner, we now repeatedly perform a load/multiply/add
pattern. Both with and without tuning for A15, this yields a 5%
performance increase on N10.

This commit also exposes more instructions that can be transformed into
fused multiply-adds.
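
For illustration, the restructuring described above can be sketched in
plain scalar C. This is a minimal sketch and not the actual convolve3x3
code: the function names, the flat px[9]/k[9] layout, and the use of
float are assumptions made for the example.

```c
/* Hypothetical "before" shape: compute all nine products first, then
 * add them up pairwise in a tree. Every product stays live until the
 * additions start. */
static float convolve3x3_tree(const float px[9], const float k[9])
{
    float p0 = px[0] * k[0];
    float p1 = px[1] * k[1];
    float p2 = px[2] * k[2];
    float p3 = px[3] * k[3];
    float p4 = px[4] * k[4];
    float p5 = px[5] * k[5];
    float p6 = px[6] * k[6];
    float p7 = px[7] * k[7];
    float p8 = px[8] * k[8];

    float s0 = p0 + p1;
    float s1 = p2 + p3;
    float s2 = p4 + p5;
    float s3 = p6 + p7;

    return ((s0 + s1) + (s2 + s3)) + p8;
}

/* Hypothetical "after" shape: one running accumulator, updated by a
 * repeated load/multiply/add step. Each step has the form
 * acc + a * b, which is the shape a compiler may contract into a
 * fused multiply-add where the target and flags allow it. */
static float convolve3x3_accumulate(const float px[9], const float k[9])
{
    float acc = 0.0f;
    for (int i = 0; i < 9; i++) {
        acc += px[i] * k[i];
    }
    return acc;
}
```

Both forms compute the same sum of nine products, but floating-point
addition is not associative, so the results may differ in the last bits
for general inputs.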

Change-Id: I1215d75da236e6b2d6b6aa48b3ab35606cdba7b8
1 file changed