New analytic AA scan converter using deltas (I call it DAA for now)

DAA is:

1. Much simpler than AAA.
   SkScan_AAAPath.cpp is about 1700 lines.
   SkScan_DAAPath.cpp is about 300 lines.
   The whole DAA CL is only about 800 lines.

2. Much faster than AAA for complicated paths.
   The speedup also applies to the GL backend (including ccpr)!
   Here's the frame time of 'SampleApp --slide Chart' on a MacBook Pro:
     AAA-raster: 33ms
     DAA-raster: 21ms
     AAA-gl:     30ms
     DAA-gl:     20ms
     AAA-ccpr:   18ms
     DAA-ccpr:   12ms
   My Linux desktop doesn't have SSE3, so the speedup there is smaller
   (~25% for Chart). I believe that DAA is fast enough to enable for
   any path. (AAA is not enabled by default for complicated paths
   because it is slow; hence our older supersampling scan converter is
   used for stroking on Chart in the AAA-xxx configs.)

3. SkCoverageDelta is suitable for the threaded backend with
   out-of-order concurrent scan conversion, as commented in the source
   code (see the sketch after this list). Maybe we can also just send
   the deltas to the GPU.

4. As with most analytic path renderers, the quality is at the
   ground-truth level unless there are intersections within a pixel.
   The intersections look good to my eyes, although theoretically they
   could be arbitrarily far from the ground truth (see my AAA slides).

5. For simple paths such as circles, triangles, rrects, etc., DAA is
   slower than AAA. But DAA is faster than our older supersampling
   scan converter in most cases. As such simple paths usually aren't
   the bottleneck of a picture (skp or svg), I strongly recommend
   using DAA.

6. DAA also heavily favors blitMask, so it may work quite well with
   SkRasterPipeline and SkRasterPipelineBlitter.

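To make item 3 above concrete, here is a minimal C++ sketch of why
coverage deltas tolerate out-of-order generation. The RowDeltas struct
and its methods are hypothetical, illustration-only names (not the
SkCoverageDelta API): deltas are just summed per pixel, and addition is
commutative, so edges (or threads) may emit them in any order; a single
in-order prefix-sum pass per row then recovers the coverage.

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  // Hypothetical stand-in for one row of coverage deltas.
  struct RowDeltas {
      std::vector<float> delta;  // delta[x] = coverage change entering column x
      explicit RowDeltas(int width) : delta(width, 0.0f) {}

      // Deltas can be added in any order (addition is commutative and associative).
      void addDelta(int x, float d) { delta[x] += d; }

      // One in-order pass (like DAA's single blitter run) converts the
      // deltas into per-pixel coverage via a running sum.
      std::vector<float> toCoverage() const {
          std::vector<float> coverage(delta.size(), 0.0f);
          float running = 0.0f;
          for (std::size_t x = 0; x < delta.size(); ++x) {
              running += delta[x];
              coverage[x] = std::min(std::max(running, 0.0f), 1.0f);
          }
          return coverage;
      }
  };

  // Usage: a span fully covering columns [2, 5) of an 8-pixel row.
  //   RowDeltas row(8);
  //   row.addDelta(5, -1.0f);   // right edge first: order doesn't matter
  //   row.addDelta(2, +1.0f);   // left edge
  //   row.toCoverage();         // -> 0 0 1 1 1 0 0 0
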
Finally, please check https://skia-review.googlesource.com/c/22420/
which accelerates DAA by specializing blitCoverageDeltas for
SkARGB32_Blitter and SkARGB32_Black_Blitter. It brings a small (<5%)
speedup, but I couldn't figure out how to reduce the duplicated code,
so I don't intend to land it.

Bug: skia:
Change-Id: I3b7ed6a727447922e645b1acb737a506e7c09a4c
Reviewed-on: https://skia-review.googlesource.com/19666
Reviewed-by: Mike Reed <reed@google.com>
Reviewed-by: Cary Clark <caryclark@google.com>
Commit-Queue: Yuqian Li <liyuqian@google.com>
diff --git a/src/core/SkScan_DAAPath.cpp b/src/core/SkScan_DAAPath.cpp
new file mode 100644
index 0000000..295d621
--- /dev/null
+++ b/src/core/SkScan_DAAPath.cpp
@@ -0,0 +1,352 @@
+/*
+ * Copyright 2016 The Android Open Source Project
+ *
+ * Use of this source code is governed by a BSD-style license that can be
+ * found in the LICENSE file.
+ */
+
+#include "SkAnalyticEdge.h"
+#include "SkAntiRun.h"
+#include "SkAutoMalloc.h"
+#include "SkBlitter.h"
+#include "SkCoverageDelta.h"
+#include "SkEdge.h"
+#include "SkEdgeBuilder.h"
+#include "SkGeometry.h"
+#include "SkMask.h"
+#include "SkPath.h"
+#include "SkQuadClipper.h"
+#include "SkRasterClip.h"
+#include "SkRegion.h"
+#include "SkScan.h"
+#include "SkScanPriv.h"
+#include "SkTSort.h"
+#include "SkTemplates.h"
+#include "SkUtils.h"
+
+///////////////////////////////////////////////////////////////////////////////
+
+/*
+
+DAA stands for Delta-based Anti-Aliasing.
+
+This is an improved version of our analytic AA algorithm (in SkScan_AAAPath.cpp).
+It first scan converts paths into coverage deltas (this step can happen out of order,
+and we don't seem to need to worry about intersections, clamping, etc.),
+and then uses a single blitter run to convert all those deltas into the final alphas.
+
+Before we go to the final blitter run, we'll use SkFixed for all delta values so we
+don't ever have to worry about under/overflow.
+
+*/
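+
+// A tiny worked example (illustration only; not used by the code below): suppose a
+// convex shape fully covers columns 3..6 of some row. Its left edge adds a delta of
+// +1 at x = 3 and its right edge adds a delta of -1 at x = 7, in either order.
+// A single left-to-right running sum over the deltas then recovers the coverage:
+//
+//   x:        0  1  2  3  4  5  6  7
+//   delta:    0  0  0 +1  0  0  0 -1
+//   coverage: 0  0  0  1  1  1  1  0
+//
+// Partially covered pixels simply receive fractional (SkFixed) deltas from the
+// helpers below.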
+
+///////////////////////////////////////////////////////////////////////////////
+
+// The following helper functions are the same as those from SkScan_AAAPath.cpp
+// except that we use SkFixed everywhere instead of SkAlpha.
+
+static inline SkFixed SkFixedMul_lowprec(SkFixed a, SkFixed b) {
+    return (a >> 8) * (b >> 8);
+}
+
+// Return the alpha (area) of a trapezoid whose height is 1 and whose parallel sides
+// have lengths l1 and l2; e.g. l1 = 0.25 and l2 = 0.75 (in SkFixed) yield alpha = 0.5.
+static inline SkFixed trapezoidToAlpha(SkFixed l1, SkFixed l2) {
+    SkASSERT(l1 >= 0 && l2 >= 0);
+    return (l1 + l2) >> 1;
+}
+
+// The alpha (area) of a right triangle with legs a and a*b, i.e. a*a*b / 2
+static inline SkFixed partialTriangleToAlpha(SkFixed a, SkFixed b) {
+    SkASSERT(a <= SK_Fixed1);
+    // Higher-precision alternatives; all of these are ~(a * a * b) >> 33 in SkFixed:
+    //   SkFixedMul(SkFixedMul(a, a), b) >> 1
+    //   ((((a >> 8) * (a >> 8)) >> 8) * (b >> 8)) >> 1
+    return (a >> 11) * (a >> 11) * (b >> 11);
+}
+
+static inline SkFixed getPartialAlpha(SkFixed alpha, SkFixed partialMultiplier) {
+    // DAA should not be so sensitive to the precision (compared to AAA).
+    return SkFixedMul_lowprec(alpha, partialMultiplier);
+}
+
+///////////////////////////////////////////////////////////////////////////////
+
+template<bool isPartial, class Deltas>
+static inline void add_coverage_delta_segment(int y, SkFixed rowHeight, const SkAnalyticEdge* edge,
+        SkFixed nextX, Deltas* deltas) { // rowHeight=fullAlpha
+    SkASSERT(rowHeight <= SK_Fixed1 && rowHeight >= 0);
+
+    // Let's see if multiplying sign is faster than multiplying edge->fWinding.
+    // (Compiler should be able to optimize multiplication with 1/-1?)
+    int sign = edge->fWinding == 1 ? 1 : -1;
+
+    SkFixed l = SkTMin(edge->fX, nextX);
+    SkFixed r = edge->fX + nextX - l;
+    int     L = SkFixedFloorToInt(l);
+    int     R = SkFixedCeilToInt(r);
+    int     len = R - L;
+
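+    // L = floor(l) and R = ceil(r), so the segment covers the pixel columns [L, R):
+    //   len == 0: l == r and sits exactly on a pixel boundary, so a single delta suffices;
+    //   len == 1: the segment stays within one pixel column;
+    //   len == 2: the segment spans two adjacent pixel columns;
+    //   len >  2: general case, handled after the switch below.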
+    switch (len) {
+        case 0: {
+            deltas->addDelta(L, y, rowHeight * sign);
+            return;
+        }
+        case 1: {
+            SkFixed fixedR  = SkIntToFixed(R);
+            SkFixed alpha   = trapezoidToAlpha(fixedR - l, fixedR - r);
+            if (isPartial) {
+                alpha = getPartialAlpha(alpha, rowHeight);
+            }
+            deltas->addDelta(L,     y,  alpha * sign);
+            deltas->addDelta(L + 1, y,  (rowHeight - alpha) * sign);
+            return;
+        }
+        case 2: {
+            SkFixed middle  = SkIntToFixed(L + 1);
+            SkFixed x1      = middle - l;
+            SkFixed x2      = r - middle;
+            SkFixed alpha1  = partialTriangleToAlpha(x1, edge->fDY);
+            SkFixed alpha2  = rowHeight - partialTriangleToAlpha(x2, edge->fDY);
+            deltas->addDelta(L,     y,  alpha1 * sign);
+            deltas->addDelta(L + 1, y,  (alpha2 - alpha1) * sign);
+            deltas->addDelta(L + 2, y,  (rowHeight - alpha2) * sign);
+            return;
+        }
+    }
+
+    // When len > 2, computations are similar to computeAlphaAboveLine in SkScan_AAAPath.cpp
+    SkFixed dY      = edge->fDY;
+    SkFixed fixedL  = SkIntToFixed(L);
+    SkFixed fixedR  = SkIntToFixed(R);
+    SkFixed first   = SK_Fixed1 + fixedL - l; // horizontal edge of the left-most triangle
+    SkFixed last    = r - (fixedR - SK_Fixed1); // horizontal edge of the right-most triangle
+    SkFixed firstH  = SkFixedMul_lowprec(first, dY); // vertical edge of the left-most triangle
+
+    SkFixed alpha0  = SkFixedMul_lowprec(first, firstH) >> 1;   // triangle alpha
+    SkFixed alpha1  = firstH + (dY >> 1);                       // rectangle plus triangle
+    deltas->addDelta(L, y, alpha0 * sign);
+    deltas->addDelta(L + 1, y, (alpha1 - alpha0) * sign);
+    for(int i = 2; i < len - 1; ++i) {
+        deltas->addDelta(L + i, y, dY * sign); // the increment is always a rect of height = dY
+    }
+
+    SkFixed alphaR2     = alpha1 + dY * (len - 3);                      // the alpha at R - 2
+    SkFixed lastAlpha   = rowHeight - partialTriangleToAlpha(last, dY); // the alpha at R - 1
+    deltas->addDelta(R - 1, y, (lastAlpha - alphaR2) * sign);
+    deltas->addDelta(R,     y, (rowHeight - lastAlpha) * sign);
+}
+
+class XLessThan {
+public:
+    bool operator()(const SkBezier* a, const SkBezier* b) {
+        return a->fP0.fX + a->fP1.fX < b->fP0.fX + b->fP1.fX;
+    }
+};
+
+class YLessThan {
+public:
+    bool operator()(const SkBezier* a, const SkBezier* b) {
+        return a->fP0.fY + a->fP1.fY < b->fP0.fY + b->fP1.fY;
+    }
+};
+
+template<class Deltas> static SK_ALWAYS_INLINE
+void gen_alpha_deltas(const SkPath& path, const SkRegion& clipRgn, Deltas& result,
+        SkBlitter* blitter, bool skipRect, bool pathContainedInClip) {
+    // 1. Build edges
+    SkEdgeBuilder builder;
+    SkIRect ir               = path.getBounds().roundOut();
+    const SkIRect& clipRect  = clipRgn.getBounds();
+    int  count               = builder.build_edges(path, &clipRect, 0, pathContainedInClip,
+                                                   SkEdgeBuilder::kBezier);
+    if (count == 0) {
+        return;
+    }
+    SkBezier** list = builder.bezierList();
+
+    // 2. Try to find the rect part because blitAntiRect is so much faster than blitCoverageDeltas
+    int rectTop = ir.fBottom;   // the rect is initialized to be empty as top = bot
+    int rectBot = ir.fBottom;
+    if (skipRect) {             // only find the rect if skipRect == true
+        YLessThan lessThan;     // sort edges in YX order
+        SkTQSort(list, list + count - 1, lessThan);
+        for(int i = 0; i < count - 1; ++i) {
+            SkBezier* lb = list[i];
+            SkBezier* rb = list[i + 1];
+
+            bool lDX0 = lb->fP0.fX == lb->fP1.fX;
+            bool rDX0 = rb->fP0.fX == rb->fP1.fX;
+            if (!lDX0 || !rDX0) { // make sure that the edges are vertical
+                continue;
+            }
+
+            SkAnalyticEdge l, r;
+            l.setLine(lb->fP0, lb->fP1);
+            r.setLine(rb->fP0, rb->fP1);
+
+            SkFixed xorUpperY = l.fUpperY ^ r.fUpperY;
+            SkFixed xorLowerY = l.fLowerY ^ r.fLowerY;
+            if ((xorUpperY | xorLowerY) == 0) { // equal upperY and lowerY
+                rectTop = SkFixedCeilToInt(l.fUpperY);
+                rectBot = SkFixedFloorToInt(l.fLowerY);
+                if (rectBot > rectTop) { // if bot == top, the rect is too short for blitAntiRect
+                    int L = SkFixedCeilToInt(l.fUpperX);
+                    int R = SkFixedFloorToInt(r.fUpperX);
+                    if (L <= R) {
+                        SkAlpha la = (SkIntToFixed(L) - l.fUpperX) >> 8;
+                        SkAlpha ra = (r.fUpperX - SkIntToFixed(R)) >> 8;
+                        result.setAntiRect(L - 1, rectTop, R - L, rectBot - rectTop, la, ra);
+                    } else { // too thin to use blitAntiRect; reset the rect region to be empty
+                        rectTop = rectBot = ir.fBottom;
+                    }
+                }
+                break;
+            }
+
+        }
+    }
+
+    // 3. Sort edges by x so the deltas we generate need less sorting later (based on x).
+    //    This only helps SkCoverageDeltaList. We also don't want to sort more than
+    //    SORT_THRESHOLD edges, where the log(count) factor of the quick sort may become a
+    //    bottleneck; with that many edges, we're unlikely to end up with sorted deltas anyway.
+    constexpr int SORT_THRESHOLD = 4096;
+    if (std::is_same<Deltas, SkCoverageDeltaList>::value && count < SORT_THRESHOLD) {
+        XLessThan lessThan;
+        SkTQSort(list, list + count - 1, lessThan);
+    }
+
+    // Future TODO: parallelize and SIMD the following code.
+    // 4. Iterate through edges and generate deltas.
+    for(int index = 0; index < count; ++index) {
+        SkAnalyticCubicEdge storage;
+        SkASSERT(sizeof(SkAnalyticQuadraticEdge) >= sizeof(SkAnalyticEdge));
+        SkASSERT(sizeof(SkAnalyticCubicEdge) >= sizeof(SkAnalyticQuadraticEdge));
+
+        SkBezier* bezier        = list[index];
+        SkAnalyticEdge* currE   = &storage;
+        bool edgeSet            = false;
+
+        switch (bezier->fCount) {
+            case 2: {
+                edgeSet = currE->setLine(bezier->fP0, bezier->fP1);
+                break;
+            }
+            case 3: {
+                SkQuad* quad = static_cast<SkQuad*>(bezier);
+                SkPoint pts[3] = {quad->fP0, quad->fP1, quad->fP2};
+                edgeSet = static_cast<SkAnalyticQuadraticEdge*>(currE)->setQuadratic(pts);
+                break;
+            }
+            case 4: {
+                SkCubic* cubic = static_cast<SkCubic*>(bezier);
+                SkPoint pts[4] = {cubic->fP0, cubic->fP1, cubic->fP2, cubic->fP3};
+                edgeSet = static_cast<SkAnalyticCubicEdge*>(currE)->setCubic(pts);
+                break;
+            }
+        }
+
+        if (!edgeSet) {
+            continue;
+        }
+
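+        // A SkAnalyticEdge for a quad/cubic is consumed one line segment at a time:
+        // update() at the bottom of the do/while advances to the next segment and
+        // returns false when the curve is exhausted (and always for straight lines),
+        // which ends the loop.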
+        do {
+            currE->fX =  currE->fUpperX;
+
+            SkFixed upperFloor  = SkFixedFloorToFixed(currE->fUpperY);
+            SkFixed lowerCeil   = SkFixedCeilToFixed(currE->fLowerY);
+            int     iy          = SkFixedFloorToInt(upperFloor);
+
+            if (lowerCeil <= upperFloor + SK_Fixed1) { // only one row is affected by the currE
+                SkFixed rowHeight = currE->fLowerY - currE->fUpperY;
+                SkFixed nextX = currE->fX + SkFixedMul(currE->fDX, rowHeight);
+                add_coverage_delta_segment<true>(iy, rowHeight, currE, nextX, &result);
+                continue;
+            }
+
+            // check first row
+            SkFixed rowHeight = upperFloor + SK_Fixed1 - currE->fUpperY;
+            SkFixed nextX;
+            if (rowHeight != SK_Fixed1) {   // it's a partial row
+                nextX = currE->fX + SkFixedMul(currE->fDX, rowHeight);
+                add_coverage_delta_segment<true>(iy, rowHeight, currE, nextX, &result);
+            } else {                        // it's a full row so we can leave it to the while loop
+                iy--;                       // compensate for the iy++ in the while loop
+                nextX = currE->fX;
+            }
+
+            while (true) { // process the full rows in the middle
+                iy++;
+                SkFixed y = SkIntToFixed(iy);
+                currE->fX = nextX;
+                nextX += currE->fDX;
+
+                if (y + SK_Fixed1 > currE->fLowerY) {
+                    break; // no full rows left, break
+                }
+
+                // Check whether we're in the rect part that will be covered by blitAntiRect
+                if (iy >= rectTop && iy < rectBot) {
+                    SkASSERT(currE->fDX == 0);  // If yes, we must be on an edge with fDX = 0.
+                    iy = rectBot - 1;           // Skip the rect part by advancing iy to the bottom.
+                    continue;
+                }
+
+                // Add current edge's coverage deltas on this full row
+                add_coverage_delta_segment<false>(iy, SK_Fixed1, currE, nextX, &result);
+            }
+
+            // last partial row
+            if (SkIntToFixed(iy) < currE->fLowerY) {
+                rowHeight = currE->fLowerY - SkIntToFixed(iy);
+                nextX = currE->fX + SkFixedMul(currE->fDX, rowHeight);
+                add_coverage_delta_segment<true>(iy, rowHeight, currE, nextX, &result);
+            }
+        } while (currE->update(currE->fLowerY));
+    }
+}
+
+void SkScan::DAAFillPath(const SkPath& path, const SkRegion& origClip, SkBlitter* blitter,
+                         bool forceRLE) {
+
+    FillPathFunc fillPathFunc = [](const SkPath& path, SkBlitter* blitter, bool isInverse,
+            const SkIRect& ir, const SkRegion* clipRgn, const SkIRect* clipRect, bool forceRLE){
+        bool isEvenOdd  = path.getFillType() & 1;
+        bool isConvex   = path.isConvex();
+        bool skipRect   = isConvex && !isInverse;
+
+        const SkIRect& clipBounds = clipRgn->getBounds();
+        SkIRect clippedIR = ir;
+        clippedIR.intersect(clipBounds);
+
+        SkRect rect;
+        if (!isInverse && path.isRect(&rect) && clippedIR.height() >= 3 && clippedIR.width() >= 3) {
+            // The overhead of even constructing SkCoverageDeltaList/Mask is too big. So just blit.
+            bool nonEmpty = rect.intersect(SkRect::MakeFromIRect(clipBounds));
+            SkASSERT(nonEmpty); // do_fill_path should have already handled the empty case
+            if (nonEmpty) {
+                blitter->blitFatAntiRect(rect);
+            }
+            return;
+        }
+
+        // Only the blitter->blitXXX calls need to be done in order in the threaded backend;
+        // everything before them can be done out of order.
+        if (!forceRLE && !isInverse && SkCoverageDeltaMask::Suitable(clippedIR)) {
+            SkCoverageDeltaMask deltaMask(clippedIR);
+            gen_alpha_deltas(path, *clipRgn, deltaMask, blitter, skipRect, clipRect == nullptr);
+            deltaMask.convertCoverageToAlpha(isEvenOdd, isInverse, isConvex);
+            blitter->blitMask(deltaMask.prepareSkMask(), clippedIR);
+        } else {
+            SkCoverageDeltaAllocator alloc;
+            SkCoverageDeltaList deltaList(&alloc, clippedIR.fTop, clippedIR.fBottom, forceRLE);
+            gen_alpha_deltas(path, *clipRgn, deltaList, blitter, skipRect, clipRect == nullptr);
+            blitter->blitCoverageDeltas(&deltaList, clipBounds, isEvenOdd, isInverse, isConvex);
+        }
+    };
+
+    // For the threaded backend with out-of-order init-once (and therefore out-of-order
+    // do_fill_path), we probably have to take care of blitRegion, sk_blit_above and
+    // sk_blit_below in do_fill_path to maintain the draw order. If we do that, be careful
+    // that blitRect may throw an exception if the rect is empty.
+    do_fill_path(path, origClip, blitter, forceRLE, 2, std::move(fillPathFunc));
+}