Use PyThreadState_GET() in performance-critical code

_PyThreadState_UncheckedGet() does not seem to be inlined as expected,
even when compiling with gcc -O3.
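
A minimal sketch of the difference, using hypothetical stand-in names
(demo_tstate, demo_current, demo_unchecked_get, DEMO_TSTATE_GET) rather than
CPython's real definitions. It assumes the accessor function is an ordinary
out-of-line function, which the compiler may decline to inline, whereas a
macro always expands at the call site:

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not CPython's actual code. */
typedef struct demo_tstate { int id; } demo_tstate;

/* Pointer to the current thread state (assumed here to be a plain global). */
static demo_tstate *demo_current = NULL;

/* Function-style accessor. If this lived in another translation unit,
   the compiler could not inline it without link-time optimization. */
demo_tstate *demo_unchecked_get(void)
{
    return demo_current;
}

/* Macro accessor: expands to a direct read, so no call can remain. */
#define DEMO_TSTATE_GET() (demo_current)

/* A hot path that reads the thread state through the macro. */
int demo_hot_path(void)
{
    demo_tstate *ts = DEMO_TSTATE_GET();
    return ts != NULL ? ts->id : -1;
}
```

Both accessors return the same pointer; the only intended difference in this
sketch is whether a function call can remain in the generated code.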