improve handling at end of buffer

A prior change reduced the number of iterations over the input buffer
to keep the NEON operations from overrunning the end of the locally
allocated buffer. That avoided the overrun but produced incorrect
results. This change instead extends the locally allocated buffers
enough that the original iteration count no longer overruns.
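
As an illustration only (this is not the code touched by this change),
the padding approach can be sketched in portable C. The names LANES,
padded_len, scale_block and scale_samples are hypothetical, and the
8-element vector width is an assumption. scale_block stands in for a
NEON kernel that always reads and writes a full vector:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LANES 8 /* assumed vector width in elements (hypothetical) */

/* Round n up to the next multiple of LANES. */
static size_t padded_len(size_t n) {
    return (n + LANES - 1) / LANES * LANES;
}

/* Stand-in for a NEON kernel: always touches a full LANES-wide group. */
static void scale_block(const int16_t *in, int16_t *out) {
    for (int i = 0; i < LANES; i++) {
        out[i] = (int16_t)(in[i] * 2);
    }
}

/* Doubles n samples. Returns 0 on success, -1 on allocation failure. */
int scale_samples(const int16_t *src, int16_t *dst, size_t n) {
    size_t cap = padded_len(n);
    if (cap == 0) {
        return 0;
    }

    /* Local buffers are padded to a whole number of vectors and zeroed,
     * so the last full-width access stays inside the allocation. */
    int16_t *in = calloc(cap, sizeof *in);
    int16_t *out = calloc(cap, sizeof *out);
    if (in == NULL || out == NULL) {
        free(in);
        free(out);
        return -1;
    }
    memcpy(in, src, n * sizeof *in);

    /* Original iteration count: every group is processed, including the
     * partial one at the end. */
    for (size_t i = 0; i < n; i += LANES) {
        scale_block(in + i, out + i);
    }

    /* Only the n valid results are copied back to the caller. */
    memcpy(dst, out, n * sizeof *dst);
    free(in);
    free(out);
    return 0;
}
```

With n = 10 the local buffers hold 16 elements, so the second group is
processed in full instead of being skipped, and no access goes past the
end of the allocation.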

Some pre-existing bit-exact issues remain.

Bug: 136616344
Test: CTS + bit-exact cross-checks.
1 file changed