Real fix for stage perf regression.

When we made start_pipeline() return void, the call into the tail!=0 run
of the pipeline became eligble to be a tail-call, and Clang made that
choice.  This had the side effect of not going through vzeroupper on
those tails.

We now mark start_pipeline() as inelligible for tail calls when
targeting AVX+.  All paths go through the vzeroupper at the end.

BUG=chromium:729237

Change-Id: I2099931284214f24c67b38979b3ad4b4d10e8bba
Reviewed-on: https://skia-review.googlesource.com/18591
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
5 files changed