Sk4px blit mask.

Local SKP nanobench ratios for SSE range from 1.05x to 0.87x, weighted much more heavily toward <1.0x ratios (speedups).
I profiled the top five regressions (1.05x-1.01x) and they look like noise.  I will follow up once broader bot results are in.

NEON looks similar to SSE but less extreme, ranging from 1.02x to 0.95x, again mostly speedups in the 0.99x-0.97x range.

The old code trifurcated into black, opaque-but-not-black, and general versions as a function of the constant src color.  I did not see a significant difference between general and opaque-but-not-black, and I don't think a black version would be faster using SIMD.  So we have here just one version of the code, the general version.
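
For reference, here is a scalar sketch of what the general version computes per pixel, result = s*aa + d*(1 - sa*aa).  It is only an illustration: PMColor, mul_div255 and blend_general are hypothetical names, and mul_div255 uses exact rounding rather than whatever Sk4px::approxMulDiv255 does, so results may differ from the SIMD code in the low bit.

```cpp
#include <cstdint>

// Hypothetical helper (not Skia's): round(x * y / 255) for x, y in [0, 255].
static inline uint8_t mul_div255(unsigned x, unsigned y) {
    unsigned p = x * y + 128;
    return static_cast<uint8_t>((p + (p >> 8)) >> 8);
}

// Hypothetical premultiplied color: each of r, g, b is <= a.
struct PMColor { uint8_t a, r, g, b; };

// General blend of a constant premultiplied src over dst with coverage aa:
//   result = s*aa + d*(1 - sa*aa)
static inline PMColor blend_general(PMColor s, PMColor d, uint8_t aa) {
    const uint8_t la  = mul_div255(s.a, aa);          // alpha of s*aa
    const unsigned inv = 255u - la;                   // 1 - sa*aa
    // Each sum fits in 8 bits: s*aa <= la per channel (premultiplied src),
    // and d*inv <= inv, so the total is <= 255.
    return PMColor{
        static_cast<uint8_t>(la                   + mul_div255(d.a, inv)),
        static_cast<uint8_t>(mul_div255(s.r, aa)  + mul_div255(d.r, inv)),
        static_cast<uint8_t>(mul_div255(s.g, aa)  + mul_div255(d.g, inv)),
        static_cast<uint8_t>(mul_div255(s.b, aa)  + mul_div255(d.b, inv)),
    };
}
```

In this sketch the black and opaque-but-not-black cases are just particular values of s fed through the same formula, which is why a single code path can cover all three.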

Somewhat fantastically, I see no pixel diffs on GMs or SKPs.

I will be following up with more CLs for the other procs called by SkBlitMask.
BUG=skia:

Review URL: https://codereview.chromium.org/1278253003
diff --git a/src/opts/SkBlitMask_opts.h b/src/opts/SkBlitMask_opts.h
new file mode 100644
index 0000000..9129560
--- /dev/null
+++ b/src/opts/SkBlitMask_opts.h
@@ -0,0 +1,37 @@
+/*
+ * Copyright 2015 Google Inc.
+ *
+ * Use of this source code is governed by a BSD-style license that can be
+ * found in the LICENSE file.
+ */
+
+#ifndef SkBlitMask_opts_DEFINED
+#define SkBlitMask_opts_DEFINED
+
+#include "Sk4px.h"
+
+namespace SK_OPTS_NS {
+
+static void blit_mask_d32_a8(SkPMColor* dst, size_t dstRB,
+                             const SkAlpha* mask, size_t maskRB,
+                             SkColor color, int w, int h) {
+    auto s = Sk4px::DupPMColor(SkPreMultiplyColor(color));
+
+    auto fn = [&](const Sk4px& d, const Sk4px& aa) {
+        // dst = (s + d(1-sa))aa + d(1-aa)    (srcover, then lerp with d by coverage aa)
+        //     = s*aa + d(1-sa*aa)
+        auto left  = s.approxMulDiv255(aa),
+             right = d.approxMulDiv255(left.alphas().inv());
+        return left + right;  // This does not overflow (exhaustively checked).
+    };
+
+    while (h --> 0) {
+        Sk4px::MapDstAlpha(w, dst, mask, fn);
+        dst  +=  dstRB / sizeof(*dst);
+        mask += maskRB / sizeof(*mask);
+    }
+}
+
+}  // SK_OPTS_NS
+
+#endif//SkBlitMask_opts_DEFINED