Optimize __memset_chk, __memcpy_chk. DO NOT MERGE.

This change creates assembler versions of __memcpy_chk/__memset_chk
that is implemented in the memcpy/memset assembler code. This change
avoids an extra call to memcpy/memset, instead allowing a simple fall
through to occur from the chk code into the body of the real
implementation.

Testing:

- Ran the libc_test on __memcpy_chk/__memset_chk on all nexus devices.
- Wrote a small test executable that has three calls to __memcpy_chk and
  three calls to __memset_chk. First call dest_len is length + 1. Second
  call dest_len is length. Third call dest_len is length - 1.
  Verified that the first two calls pass, and the third fails. Examined
  the logcat output on all nexus devices to verify that the fortify
  error message was sent properly.
- I benchmarked the new __memcpy_chk and __memset_chk on all systems. For
  __memcpy_chk and large copies, the savings is relatively small (about 1%).
  For small copies, the savings is large on cortex-a15/krait devices
  (between 5% to 30%).
  For cortex-a9 and small copies, the speed up is present, but relatively
  small (about 3% to 5%).
  For __memset_chk and large copies, the savings is also small (about 1%).
  However, all processors show larger speed-ups on small copies (about 30% to
  100%).

Bug: 9293744

Merge from internal master.

(cherry-picked from 7c860db0747f6276a6e43984d43f8fa5181ea936)

Change-Id: I916ad305e4001269460ca6ebd38aaa0be8ac7f52
diff --git a/libc/arch-arm/cortex-a15/bionic/memcpy.S b/libc/arch-arm/cortex-a15/bionic/memcpy.S
index d297064..2394024 100644
--- a/libc/arch-arm/cortex-a15/bionic/memcpy.S
+++ b/libc/arch-arm/cortex-a15/bionic/memcpy.S
@@ -59,6 +59,7 @@
 
 #include <machine/cpu-features.h>
 #include <machine/asm.h>
+#include "libc_events.h"
 
         .text
         .syntax unified
@@ -66,6 +67,13 @@
 
 #define CACHE_LINE_SIZE 64
 
+ENTRY(__memcpy_chk)
+        cmp     r2, r3
+        bgt     fortify_check_failed
+
+        // Fall through to memcpy...
+END(__memcpy_chk)
+
 ENTRY(memcpy)
         // Assumes that n >= 0, and dst, src are valid pointers.
         // For any sizes less than 832 use the neon code that doesn't
@@ -321,4 +329,21 @@
 
         // Src is guaranteed to be at least word aligned by this point.
         b       word_aligned
+
+
+        // Only reached when the __memcpy_chk check fails.
+fortify_check_failed:
+        ldr     r0, error_message
+        ldr     r1, error_code
+1:
+        add     r0, pc
+        bl      __fortify_chk_fail
+error_code:
+        .word   BIONIC_EVENT_MEMCPY_BUFFER_OVERFLOW
+error_message:
+        .word   error_string-(1b+8)
 END(memcpy)
+
+        .data
+error_string:
+        .string "memcpy buffer overflow"