ART: Improve VisitStringGetCharsNoCheck intrinsic for compressed strings, using SIMD

The previous implementation of VisitStringGetCharsNoCheck
copied one character at a time for compressed strings (which
use 8 bits per char).

Instead, use SIMD instructions to copy 8 chars at once
where possible.
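
A minimal sketch of the idea in C++, not the code generator
change itself. It assumes the target is ARM64 with NEON (the
Pixel 3 is an ARM64 device) and falls back to a scalar loop
elsewhere. The helper name CopyCompressedChars is hypothetical;
the real intrinsic emits the equivalent instructions directly.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper, not ART code: widens `count` 8-bit chars from `src`
// into 16-bit chars at `dst`, zero-extending each one.
void CopyCompressedChars(const uint8_t* src, uint16_t* dst, size_t count) {
  size_t i = 0;
  // Main loop: handle 8 chars per iteration while at least 8 remain.
  for (; i + 8 <= count; i += 8) {
#if defined(__ARM_NEON)
    uint8x8_t narrow = vld1_u8(src + i);  // load 8 x 8-bit chars
    uint16x8_t wide = vmovl_u8(narrow);   // zero-extend to 8 x 16-bit chars
    vst1q_u16(dst + i, wide);             // store 8 x 16-bit chars
#else
    // Portable fallback with the same chunking, for non-NEON hosts.
    for (size_t j = 0; j < 8; ++j) {
      dst[i + j] = src[i + j];
    }
#endif
  }
  // Tail: copy the remaining 0..7 chars one at a time.
  for (; i < count; ++i) {
    dst[i] = src[i];
  }
}
```

Strings shorter than 8 chars skip the main loop and take only
the scalar tail, so they pay for the extra length check without
benefiting from the wide copy.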

On a Pixel 3 phone:

Microbenchmarks for getCharsNoCheck on varying string
lengths show a speedup of up to 80% (big cores) and
70% (little cores) on long strings, and around 30% (big)
and 20% (little) on strings of only 8 characters.

For strings of fewer than 8 characters the overhead is ~3%;
it is amortized as soon as a string reaches 8 characters.

Dhrystone shows a consistent speedup of around 6% (big)
and 4% (little).

The getCharsNoCheck intrinsic is used by the StringBuilder
append() method, which in turn is used by the String
concatenation operator ('+').

Image size change:
  Before:
    boot-core-libart.oat:  549040
    boot.oat:             3789080
    boot-framework.oat:  13356576
  After:
    boot-core-libart.oat:  549024 (-16B)
    boot.oat:             3789144 (+64B)
    boot-framework.oat:  13356576 (+ 0B)

Test: test_art_target.sh, test_art_host.sh
Test: 536-checker-intrinsic-optimization

Change-Id: I865e3df6d4725e151ae195a86e02e090dae8dd29
2 files changed