div255(x) as ((x+128)*257)>>16 with SSE

_mm_mulhi_epu16 makes the (...*257)>>16 part simple.

This seems to speed up every transfermode that uses div255(), in the 7-25% range.

It even appears to obviate the need for approxMulDiv255() on SSE. I'm not sure about NEON yet, so I'll keep approxMulDiv255() for now.

Should be no pixels change:
https://gold.skia.org/search2?issue=1452903004&unt=true&query=source_type%3Dgm&master=false

BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Review URL: https://codereview.chromium.org/1452903004
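
For readers who want to check the arithmetic, here is a minimal scalar sketch (not Skia code) comparing the two forms mentioned in this change against a rounded divide by 255. The function names are hypothetical, and the input range [0, 255*255] is an assumption based on div255() being applied to the product of two 8-bit values. Since x+128 still fits in 16 bits over that range, the high 16 bits of (x+128)*257 are what a 16-bit multiply-high such as _mm_mulhi_epu16 would return per lane.

```cpp
#include <cassert>
#include <cstdint>

// Reference: x/255 rounded to nearest. 255 is odd, so there are no ties.
inline uint32_t div255_reference(uint32_t x) { return (x + 127) / 255; }

// Form from the commit message: the high 16 bits of (x+128)*257.
inline uint32_t div255_mulhi(uint32_t x) { return ((x + 128) * 257) >> 16; }

// Form from the comment in the portable div255():
// ((x+128) + ((x+128)>>8)) >> 8.
inline uint32_t div255_shift_add(uint32_t x) {
    uint32_t v = x + 128;
    return (v + (v >> 8)) >> 8;
}

// Counts inputs in [0, 255*255] where either form disagrees with the reference.
inline int count_mismatches() {
    int bad = 0;
    for (uint32_t x = 0; x <= 255u * 255u; x++) {
        uint32_t want = div255_reference(x);
        if (div255_mulhi(x) != want || div255_shift_add(x) != want) {
            bad++;
        }
    }
    return bad;
}
```

The two forms agree because (x+128)*257 equals (x+128)*256 + (x+128), so shifting right by 16 is the same as adding (x+128)>>8 to x+128 and shifting right by 8; the bits dropped by the inner shift cannot carry into the result.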

commit: cbf4fba43933302a846872e4c5ce8f1adb8b325e [log] [tgz]
author: mtklein <mtklein@chromium.org> Tue Nov 17 14:19:52 2015 -0800
committer: Commit bot <commit-bot@chromium.org> Tue Nov 17 14:19:52 2015 -0800
tree: 96dad6cc0a2241544a0cf52cccdc7a0fbe89f9b1
parent: 56847a65648af4d06da9c26c55242949a1bf31ab [diff] [blame]
diff --git a/src/opts/Sk4px_none.h b/src/opts/Sk4px_none.h
index 540edb8..efbd780 100644
--- a/src/opts/Sk4px_none.h
+++ b/src/opts/Sk4px_none.h
@@ -62,6 +62,12 @@
                  r.kth<12>(), r.kth<13>(), r.kth<14>(), r.kth<15>());
 }
 
+inline Sk4px Sk4px::Wide::div255() const {
+    // Calculated as ((x+128) + ((x+128)>>8)) >> 8.
+    auto v = *this + Sk16h(128);
+    return v.addNarrowHi(v>>8);
+}
+
 inline Sk4px Sk4px::alphas() const {
     static_assert(SK_A32_SHIFT == 24, "This method assumes little-endian.");
     return Sk16b(this->kth< 3>(), this->kth< 3>(), this->kth< 3>(), this->kth< 3>(),