AMDGPU: Match load d16 hi instructions

Also starts selecting global loads for constant addresses
in some cases. Some still end up selecting to mubuf, which
requires investigation.

We still get sub-optimal regalloc and extra waitcnts inserted
because the liveness of the separate 16-bit register halves is
not really tracked.

llvm-svn: 313716
diff --git a/llvm/test/CodeGen/AMDGPU/fabs.f16.ll b/llvm/test/CodeGen/AMDGPU/fabs.f16.ll
index 9da2479..4429cfa 100644
--- a/llvm/test/CodeGen/AMDGPU/fabs.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/fabs.f16.ll
@@ -7,7 +7,7 @@
 ; unless isFabsFree returns true
 
 ; GCN-LABEL: {{^}}s_fabs_free_f16:
-; GCN: flat_load_ushort [[VAL:v[0-9]+]],
+; GCN: {{flat|global}}_load_ushort [[VAL:v[0-9]+]],
 ; GCN: v_and_b32_e32 [[RESULT:v[0-9]+]], 0x7fff, [[VAL]]
 ; GCN: {{flat|global}}_store_short v{{\[[0-9]+:[0-9]+\]}}, [[RESULT]]
 
@@ -75,8 +75,8 @@
 }
 
 ; GCN-LABEL: {{^}}fabs_fold_f16:
-; GCN: flat_load_ushort [[IN0:v[0-9]+]]
-; GCN: flat_load_ushort [[IN1:v[0-9]+]]
+; GCN: {{flat|global}}_load_ushort [[IN0:v[0-9]+]]
+; GCN: {{flat|global}}_load_ushort [[IN1:v[0-9]+]]
 
 ; CI-DAG: v_cvt_f32_f16_e32 [[CVT0:v[0-9]+]], [[IN0]]
 ; CI-DAG: v_cvt_f32_f16_e64 [[ABS_CVT1:v[0-9]+]], |[[IN1]]|