Move looping logic into start_pipeline().

This should be a big win on Windows, but I haven't timed it there yet.
On my Mac, it's a solid 2% speedup.
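
A minimal sketch of the idea, not the actual Skia code: the stage signature, kStride, and the return-the-tail convention below are all assumptions made for illustration. The point is only that the per-stride loop lives inside start_pipeline() rather than in its caller, so any per-call setup or teardown there happens once per run instead of once per stride.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stage: handles kStride pixels starting at x.
// Here it just records x so the call pattern is observable.
using Stage = void (*)(std::size_t x, std::vector<std::size_t>& seen);

constexpr std::size_t kStride = 4;  // assumed pixels per iteration

inline void record_stage(std::size_t x, std::vector<std::size_t>& seen) {
    seen.push_back(x);
}

// Before (illustrative): one stride per call, so the caller has to loop
// and pays any entry/exit cost of this function on every stride.
inline void start_pipeline_once(std::size_t x, Stage stage,
                                std::vector<std::size_t>& seen) {
    stage(x, seen);
}

// After (illustrative): the loop is inside start_pipeline().
// Runs every full stride in [x, limit) and returns the first x it did
// not process, leaving any partial tail to the caller.
inline std::size_t start_pipeline(std::size_t x, std::size_t limit,
                                  Stage stage,
                                  std::vector<std::size_t>& seen) {
    while (x + kStride <= limit) {
        stage(x, seen);
        x += kStride;
    }
    return x;
}
```

With this shape a caller makes a single call such as start_pipeline(0, n, record_stage, seen) and then handles whatever tail remains from the returned x.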

PS1 was insufficiently ambitious, but here it is for posterity:
    No need to vzeroupper twice on Windows.

    On Windows start_pipeline() will vzeroupper,
    so no need to do it in just_return().

Change-Id: I099320b95da85900a60ce96fdb7a216a36db1858
Reviewed-on: https://skia-review.googlesource.com/8821
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Herb Derby <herb@google.com>
4 files changed