add ERMS (enhanced rep mov/sto) SkOpts slice

Intel's got two CPUID bits indicating the speed of rep mov/sto
(memcpy/memset),

    - ERMS, Enhanced Rep Mov/Sto, older, large copies are fast?
    - FSRM, Fast Short Rep Mov, newer, small copies are fast?

ERMS has been around a long time on Intel, but is relatively recent on
Ryzen, and FSRM is new across the board.  The startup cost for
ERMS-but-not-FSRM copies really is noticeable, so we cut over to the
previous SSE/AVX routines when N is small.

I've left the memset benchmarks as I found them most useful when
tuning the small/large cutoff in this CL.

Change-Id: I3ac4e3f34796aba0ea86aabbe9dda7526919456a
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/332580
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
6 files changed