Major speedup for creating instances of new-style classes.  Turns out
there was some trampolining going on with the tp_new descriptor: the
inherited PyType_GenericNew was overwritten with the much slower
slot_tp_new, which would end up calling tp_new_wrapper, which would
eventually call PyType_GenericNew anyway.  Add a special case for this
to update_one_slot().
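
A rough way to see which classes this affects, sketched in Python and
assuming a current CPython 3 (where every class is new-style).  The
names Plain, Custom and new_kind are made up for illustration.  A
class with no __new__ of its own finds the built-in wrapper object
through the MRO, and that is the case the special-casing targets; a
class that defines __new__ in Python still has to be dispatched
through a lookup of __new__.

```python
class Plain:
    """Defines no __new__; inherits the built-in one from object."""


class Custom:
    """Defines __new__ in Python, so creation must dispatch through it."""

    calls = 0  # counts how often our Python-level __new__ runs

    def __new__(cls, *args, **kwargs):
        Custom.calls += 1
        return super().__new__(cls)


def new_kind(cls):
    """Return the type name of whatever cls.__new__ resolves to."""
    return type(cls.__new__).__name__


plain_kind = new_kind(Plain)    # the inherited built-in wrapper
custom_kind = new_kind(Custom)  # a plain Python function

# Create one instance of each; only Custom runs Python-level code here.
plain_obj = Plain()
custom_obj = Custom()
```

This only shows what the __new__ lookup finds, not which C function
ends up in tp_new; the latter is not visible from Python.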

XXX Hope there isn't a loophole in this.  I'll buy the first person to
point out a bug in the reasoning a beer.

Backport candidate (but I won't do it).
diff --git a/Objects/typeobject.c b/Objects/typeobject.c
index f46734b..020cbf2 100644
--- a/Objects/typeobject.c
+++ b/Objects/typeobject.c
@@ -4081,6 +4081,28 @@
 					use_generic = 1;
 			}
 		}
+		else if (descr->ob_type == &PyCFunction_Type &&
+			 PyCFunction_GET_FUNCTION(descr) ==
+			 (PyCFunction)tp_new_wrapper &&
+			 strcmp(p->name, "__new__") == 0)
+		{
+			/* The __new__ wrapper is not a wrapper descriptor,
+			   so must be special-cased differently.
+			   If we don't do this, creating an instance will
+			   always use slot_tp_new which will look up
+			   __new__ in the MRO which will call tp_new_wrapper
+			   which will look through the base classes looking
+			   for a static base and call its tp_new (usually
+			   PyType_GenericNew), after performing various
+			   sanity checks and constructing a new argument
+			   list.  Cut all that nonsense short -- this speeds
+			   up instance creation tremendously. */
+			specific = type->tp_new;
+			/* XXX I'm not 100% sure that there isn't a hole
+			   in this reasoning that requires additional
+			   sanity checks.  I'll buy the first person to
+			   point out a bug in this reasoning a beer. */
+		}
 		else {
 			use_generic = 1;
 			generic = p->function;