string: Optimize SIMD memcpy

Further optimize SIMD memcpy. Small cases now include copies up
to 32 bytes. 64-128 byte copies are split into two cases to improve
performance of 64-96 byte copies. Comments have been rewritten.
The random memcpy benchmark runs ~10% faster.
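
For readers unfamiliar with the size-class approach, here is a rough
portable C sketch of the dispatch described above. It is not the code in
this change: the name sketch_memcpy, the block16 helper type, and the
exact boundaries (0-32, 33-64, 65-96, 97-128) are assumptions made for
illustration. Each case loads everything first and then stores, using
blocks anchored at both the start and the end of the buffer so that no
per-byte tail loop is needed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte block; fixed-size memcpy calls like these are
   usually lowered by compilers to a single vector load or store. */
typedef struct { unsigned char b[16]; } block16;

static inline block16 load16(const unsigned char *p)
{
    block16 v;
    memcpy(&v, p, 16);
    return v;
}

static inline void store16(unsigned char *p, block16 v)
{
    memcpy(p, &v, 16);
}

/* Illustrative only: boundaries are assumed, not taken from the patch. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n <= 32) {                       /* small cases: 0..32 bytes */
        if (n >= 16) {                   /* two possibly overlapping 16-byte blocks */
            block16 a = load16(s), b = load16(s + n - 16);
            store16(d, a);
            store16(d + n - 16, b);
        } else if (n >= 8) {             /* two possibly overlapping 8-byte words */
            uint64_t a, b;
            memcpy(&a, s, 8);
            memcpy(&b, s + n - 8, 8);
            memcpy(d, &a, 8);
            memcpy(d + n - 8, &b, 8);
        } else if (n >= 4) {             /* two possibly overlapping 4-byte words */
            uint32_t a, b;
            memcpy(&a, s, 4);
            memcpy(&b, s + n - 4, 4);
            memcpy(d, &a, 4);
            memcpy(d + n - 4, &b, 4);
        } else if (n > 0) {              /* 1..3 bytes: first, middle, last */
            unsigned char a = s[0], b = s[n >> 1], c = s[n - 1];
            d[0] = a;
            d[n >> 1] = b;
            d[n - 1] = c;
        }
        return dst;
    }

    if (n <= 64) {                       /* 33..64: 32 from start, 32 from end */
        block16 a = load16(s), b = load16(s + 16);
        block16 y = load16(s + n - 32), z = load16(s + n - 16);
        store16(d, a);
        store16(d + 16, b);
        store16(d + n - 32, y);
        store16(d + n - 16, z);
        return dst;
    }

    if (n <= 128) {                      /* 65..128: 64 bytes from the start ... */
        block16 a = load16(s), b = load16(s + 16);
        block16 c = load16(s + 32), e = load16(s + 48);
        block16 y = load16(s + n - 32), z = load16(s + n - 16);
        if (n > 96) {                    /* ... 97..128 needs 64 from the end */
            block16 w = load16(s + n - 64), x = load16(s + n - 48);
            store16(d + n - 64, w);
            store16(d + n - 48, x);
        }                                /* 65..96 needs only 32 from the end */
        store16(d, a);
        store16(d + 16, b);
        store16(d + 32, c);
        store16(d + 48, e);
        store16(d + n - 32, y);
        store16(d + n - 16, z);
        return dst;
    }

    return memcpy(dst, src, n);          /* large copies: out of scope here */
}
```

In this sketch the 65-96 byte path issues six block loads and stores
instead of eight, which is one plausible reason a separate case could
help that size range; the actual gain depends on the target and should
be measured.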
1 file changed