Implement sum_sqr_shift() using two passes with no branch inside the loops

This is slightly slower on x86 and about the same speed on ARMv7, but
should be faster on DSPs.
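
For readers without the diff at hand, the two-pass idea can be sketched
roughly as follows. This is a simplified illustration, not the committed
code: the names sum_sqr_shift_sketch() and clz32() are hypothetical, the
input is assumed to be 16-bit samples with len > 0, and the amount of
headroom kept in the result (two bits) is an assumption. The first pass
accumulates with the largest shift that could ever be needed, the result
is used to pick the final shift, and the second pass accumulates with
that shift. Neither loop body contains a branch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: count leading zeros of a 32-bit value (32 for 0).
 * A real build would likely use a compiler intrinsic instead. */
static int clz32(uint32_t v)
{
    int n = 0;
    if (v == 0) return 32;
    while ((v & 0x80000000u) == 0) { v <<= 1; n++; }
    return n;
}

/* Sketch: returns energy ~= sum(x[i]^2) >> shift, with shift chosen so
 * that energy fits in a signed 32-bit integer with headroom. */
static void sum_sqr_shift_sketch(int32_t *energy, int *shift,
                                 const int16_t *x, int len)
{
    int i, shft;
    uint32_t nrg;

    assert(len > 0);

    /* Pass 1: use the worst-case shift, floor(log2(len)), so the sum
     * cannot overflow. Start from len to cover truncation losses. */
    shft = 31 - clz32((uint32_t)len);
    nrg = (uint32_t)len;
    for (i = 0; i < len; i++) {
        uint32_t sq = (uint32_t)((int32_t)x[i] * x[i]);
        nrg += sq >> shft;
    }

    /* Pick the final shift from the estimate (assumed: two bits of
     * headroom). This single clamp sits outside both loops. */
    shft = shft + 3 - clz32(nrg);
    if (shft < 0) shft = 0;

    /* Pass 2: accumulate again with the final shift. */
    nrg = 0;
    for (i = 0; i < len; i++) {
        uint32_t sq = (uint32_t)((int32_t)x[i] * x[i]);
        nrg += sq >> shft;
    }

    *energy = (int32_t)nrg;
    *shift = shft;
}
```

The trade-off is that every sample is read and squared twice, which
costs a little on CPUs with good branch prediction, while targets where
a branch inside a tight loop is expensive avoid that cost entirely.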
1 file changed