Add Matrix colorfilter pipeline stages.

This breaks the color filter down into four logical steps:
  - go to unpremul
  - apply the 4x5 matrix
  - clamp to [0,1]
  - go to premul
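
As a rough scalar illustration of those four steps on a single premul pixel (not the pipeline code itself): the RGBA struct and filterPixel name are hypothetical, and the matrix layout is an assumption read off the indexing in the diff below, where m[i + 4*j] scales input channel j into output channel i and m[i + 16] is the translate.

```cpp
#include <algorithm>

struct RGBA { float r, g, b, a; };  // hypothetical premul pixel type

// Hypothetical scalar reference for the four logical steps.
// Assumed layout: m[i + 4*j] scales input channel j into output channel i,
// and m[i + 16] is the translate for output channel i.
inline RGBA filterPixel(RGBA p, const float m[20]) {
    // 1. go to unpremul (assumption: alpha == 0 maps to transparent black)
    float inv = p.a == 0.0f ? 0.0f : 1.0f / p.a;
    float in[4] = { p.r * inv, p.g * inv, p.b * inv, p.a };

    // 2. apply the 4x5 matrix
    float out[4];
    for (int i = 0; i < 4; i++) {
        out[i] = in[0]*m[i+0] + in[1]*m[i+4] + in[2]*m[i+8] + in[3]*m[i+12] + m[i+16];
    }

    // 3. clamp to [0,1]
    for (float& c : out) { c = std::min(std::max(c, 0.0f), 1.0f); }

    // 4. go to premul
    return { out[0]*out[3], out[1]*out[3], out[2]*out[3], out[3] };
}
```

With an identity matrix this returns the input pixel unchanged, which is a convenient sanity check.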

Because we already have handy premul clamp stages, we swap the order of clamp and premul.  This is lossless.

While adding our stages to the pipeline, we analyze the matrix to see if we can skip any steps:
  - we can skip unpremul if the shader is opaque (alphas are all 1 ~~~> we're already unpremul);
  - we can skip the premul back if the color filter always produces opaque output (here we check that the inputs are opaque and that the filter keeps them that way, but we could also check for an explicit 0 0 0 0 1 alpha row);
  - we can skip the clamp_0 if the matrix can never produce a value less than 0;
  - we can skip the clamp_1 if the matrix can never produce a value greater than 1.
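
The two clamp checks can be sketched as a standalone range analysis. This mirrors the loop in the diff below but is not the Skia code: ClampNeeds and analyzeClamps are hypothetical names, and it assumes every input channel lies in [0,1] and the same transposed layout (m[i + 4*j] is a scale, m[i + 16] is the translate).

```cpp
#include <array>

struct ClampNeeds { bool clamp0; bool clamp1; };  // hypothetical result type

// Hypothetical helper: bound each output channel assuming inputs in [0,1].
// Negative coefficients can only pull the result down, positive ones only up,
// so the extremes are the translate plus the sum of each sign's coefficients.
inline ClampNeeds analyzeClamps(const std::array<float, 20>& m) {
    ClampNeeds needs{false, false};
    for (int i = 0; i < 4; i++) {
        float lo = m[i + 16],
              hi = m[i + 16];
        for (int j = 0; j < 4; j++) {
            float c = m[i + 4*j];
            (c < 0 ? lo : hi) += c;
        }
        needs.clamp0 = needs.clamp0 || lo < 0;
        needs.clamp1 = needs.clamp1 || hi > 1;
    }
    return needs;
}
```

For an identity matrix every channel is bounded by [0,1], so neither clamp is needed; a scale of 2 on any channel sets clamp1, and a negative translate sets clamp0.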

The only thing that should seem missing is per-pixel alpha checks.  We don't do those here, but instead make up for it by operating on 4-8 pixels at a time.

We don't split the 4x5 matrix into a 4x4 matrix and a 1x4 translate.  We could, but when we have FMA (new x86, all ARMv8) we might as well fold the translate into the FMAs for free.
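
To illustrate folding the translate into the multiply-adds, here is a scalar sketch using std::fma.  The real stage works on vectors of pixels; applyRow is a hypothetical name, and the layout (m[i + 4*j] scales, m[i + 16] translate) is the same assumption as in the diff below.

```cpp
#include <cmath>

// Hypothetical scalar sketch of one output channel of the 4x5 matrix.
// The translate m[i + 16] seeds the innermost FMA as its addend, so it costs
// no extra add compared with a plain 4x4 multiply.
inline float applyRow(const float* m, int i, float r, float g, float b, float a) {
    float acc = std::fma(a, m[i + 12], m[i + 16]);  // a*ma + translate
    acc = std::fma(b, m[i + 8], acc);               // + b*mb
    acc = std::fma(g, m[i + 4], acc);               // + g*mg
    return std::fma(r, m[i + 0], acc);              // + r*mr
}
```

Four FMAs per output channel cover all five columns of the matrix; a separate 4x4 multiply followed by a translate would need an additional add per channel.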

This makes gm/fadefilter.cpp draw differently in sRGB and F16 modes, bringing them in line with the GPU sRGB and GPU f16 configs.  It's unclear to me what was wrong with the old CPU implementation.

GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=4346
CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Change-Id: I14082ded8fb8d63354167d9e6b3f8058f840253e
Reviewed-on: https://skia-review.googlesource.com/4346
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
diff --git a/src/core/SkColorMatrixFilterRowMajor255.cpp b/src/core/SkColorMatrixFilterRowMajor255.cpp
index 0fcd36a..b27d40d 100644
--- a/src/core/SkColorMatrixFilterRowMajor255.cpp
+++ b/src/core/SkColorMatrixFilterRowMajor255.cpp
@@ -9,6 +9,7 @@
 #include "SkColorPriv.h"
 #include "SkNx.h"
 #include "SkPM4fPriv.h"
+#include "SkRasterPipeline.h"
 #include "SkReadBuffer.h"
 #include "SkRefCnt.h"
 #include "SkString.h"
@@ -230,6 +231,30 @@
 //  End duplication
 //////
 
+bool SkColorMatrixFilterRowMajor255::onAppendStages(SkRasterPipeline* p,
+                                                    bool shaderIsOpaque) const {
+    bool willStayOpaque = shaderIsOpaque && (fFlags & kAlphaUnchanged_Flag);
+    bool needsClamp0 = false,
+         needsClamp1 = false;
+    for (int i = 0; i < 4; i++) {
+        SkScalar min = fTranspose[i+16],
+                 max = fTranspose[i+16];
+        (fTranspose[i+ 0] < 0 ? min : max) += fTranspose[i+ 0];
+        (fTranspose[i+ 4] < 0 ? min : max) += fTranspose[i+ 4];
+        (fTranspose[i+ 8] < 0 ? min : max) += fTranspose[i+ 8];
+        (fTranspose[i+12] < 0 ? min : max) += fTranspose[i+12];
+        needsClamp0 = needsClamp0 || min < 0;
+        needsClamp1 = needsClamp1 || max > 1;
+    }
+
+    if (!shaderIsOpaque) { p->append(SkRasterPipeline::unpremul); }
+    if (           true) { p->append(SkRasterPipeline::matrix_4x5, fTranspose); }
+    if (!willStayOpaque) { p->append(SkRasterPipeline::premul); }
+    if (    needsClamp0) { p->append(SkRasterPipeline::clamp_0); }
+    if (    needsClamp1) { p->append(SkRasterPipeline::clamp_a); }
+    return true;
+}
+
 sk_sp<SkColorFilter>
 SkColorMatrixFilterRowMajor255::makeComposed(sk_sp<SkColorFilter> innerFilter) const {
     SkScalar innerMatrix[20];