ARM64: FP16.floor() intrinsic for ARMv8

This CL implements an intrinsic for the FP16.floor() method using ARMv8.2
FP16 instructions. The intrinsic calls a GenerateFP16Round function
template, which will also be used to implement other intrinsics such as
ceil and rint.

This intrinsic implementation achieves bit-level compatibility with the
original Java implementation android.util.Half.floor().

Time in milliseconds to execute the benchmark below on a Pixel 3:
- Java implementation android.util.Half.floor():
    - big cluster only: 18623
    - little cluster only: 60424
- arm64 intrinsic implementation:
    - big cluster only: 14213 (~24% faster)
    - little cluster only: 54398 (~10% faster)

Analysis of the benchmark with simpleperf showed that only about 60-65% of
the time is spent in libcore.util.FP16.floor, so the improvement to floor()
itself from the intrinsic is likely larger than the numbers stated above.
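
As a rough illustration of that adjustment, the sketch below scales the
measured savings by the share of time attributed to floor(). It assumes the
60-65% share applies to the Java baseline on both clusters and that the rest
of the loop costs the same in both runs; neither assumption is confirmed by
the measurements above. The class and method names are hypothetical.

```java
final class FloorSpeedupEstimate {
    // Percentage reduction of the time spent in floor() alone, assuming
    // every saved millisecond comes out of floor() (an assumption).
    static double floorOnlyImprovement(double baselineMs, double intrinsicMs,
                                       double floorShare) {
        double savedMs = baselineMs - intrinsicMs;
        double floorBaselineMs = baselineMs * floorShare;
        return 100.0 * savedMs / floorBaselineMs;
    }

    public static void main(String[] args) {
        // Shares of time in floor() taken from the simpleperf range above.
        double[] shares = {0.60, 0.65};
        for (double share : shares) {
            System.out.printf("share %.2f: big %.1f%%, little %.1f%%%n",
                    share,
                    floorOnlyImprovement(18623, 14213, share),
                    floorOnlyImprovement(60424, 54398, share));
        }
    }
}
```

Under these assumptions, floor() itself would be roughly 36-39% faster on
the big cluster and roughly 15-17% faster on the little cluster.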

Another reason the performance improvement from the intrinsic is lower than
expected is that the Java implementation needs only a few instructions for
values between -1 and 1 (abs < 0x3c00), so it should perform almost as well
as the intrinsic in this case. In the benchmark function below, about 46.9%
of the values tested are between -1 and 1.

public static short benchmarkFloor() {
    short ret = 0;
    long before = System.currentTimeMillis();
    for (int i = 0; i < 50000; i++) {
        // Visit every 16-bit pattern except Short.MAX_VALUE (0x7fff).
        for (short h = Short.MIN_VALUE; h < Short.MAX_VALUE; h++) {
            // Accumulate the results so that they are used and printed.
            ret += FP16.floor(h);
        }
    }
    long after = System.currentTimeMillis();
    System.out.println("Time of FP16.floor (ms): " + (after - before));
    System.out.println(ret);
    return ret;
}
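
To illustrate why inputs with abs < 0x3c00 are cheap in Java, here is a
minimal sketch of a bit-level half-precision floor, together with a count of
how many inputs visited by the benchmark loop take that short path. This is
an illustrative sketch, not the library source: the class and method names
are hypothetical, and NaN inputs are simply returned unchanged, which may
differ from the real implementation.

```java
final class HalfFloorSketch {
    // Bit-level floor of an IEEE 754 binary16 value stored in a short.
    static short floor(short h) {
        int bits = h & 0xffff;
        int abs = bits & 0x7fff;
        int result = bits;
        if (abs < 0x3c00) {
            // |x| < 1: keep the sign bit, giving +0 or -0 ...
            result &= 0x8000;
            // ... and turn any negative non-zero input into -1.0 (0xbc00).
            if (bits > 0x8000) {
                result |= 0x3c00;
            }
        } else if (abs < 0x6400) {
            // 1 <= |x| < 1024: some mantissa bits are fractional.
            int shift = 25 - (abs >> 10);
            int mask = (1 << shift) - 1;
            // For negative inputs, bump the magnitude so truncation rounds
            // toward negative infinity.
            result += mask & -(bits >> 15);
            // Clear the fractional bits.
            result &= ~mask;
        }
        // abs >= 0x6400: integral, infinite or NaN; returned as-is here.
        return (short) result;
    }

    // Number of inputs in the benchmark's inner loop that satisfy
    // abs < 0x3c00, using the same loop bounds as the benchmark.
    static int countFastPath() {
        int count = 0;
        for (short h = Short.MIN_VALUE; h < Short.MAX_VALUE; h++) {
            if ((h & 0x7fff) < 0x3c00) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.printf("floor(1.5)  -> 0x%04x%n", floor((short) 0x3e00) & 0xffff);
        System.out.printf("floor(-0.5) -> 0x%04x%n", floor((short) 0xb800) & 0xffff);
        System.out.println("fast-path inputs: " + countFastPath() + " of 65535");
    }
}
```

The count comes to 30720 of the 65535 patterns visited, i.e. just under 47%
of the benchmark's inputs.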

Test: 580-fp16
Test: art/test/testrunner/run_build_test_target.py -j80 art-test-javac

Change-Id: Iad1dd032d456af54932f13c5cf27228f8652a0b5
11 files changed