tile: optimize and clean up string functions
This change cleans up the string code in a number of ways:
- For memcpy(), fix bug in prefetch and increase distance to 3 lines;
optimize for unaligned data; do all loads before wh64 to make memcpy
safe for forward-overlapping calls; etc. Performance is improved.
- Use new copy_byte() function on tilegx to spread a single byte value
out into a full word using the shufflebytes instruction.
- Clean up header include ordering to be more canonical, and remove
spurious #undefs of function names.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
diff --git a/arch/tile/lib/strchr_64.c b/arch/tile/lib/strchr_64.c
index f39f9dc..fe6e31c 100644
--- a/arch/tile/lib/strchr_64.c
+++ b/arch/tile/lib/strchr_64.c
@@ -26,7 +26,7 @@
const uint64_t *p = (const uint64_t *)(s_int & -8);
/* Create eight copies of the byte for which we are looking. */
- const uint64_t goal = 0x0101010101010101ULL * (uint8_t) c;
+ const uint64_t goal = copy_byte(c);
/* Read the first aligned word, but force bytes before the string to
* match neither zero nor goal (we make sure the high bit of each