SF patch 936813: fast modular exponentiation

This checkin is adapted from part 1 (of 3) of Trevor Perrin's patch set.

x_mul()
  - sped up a little by optimizing the C code
  - sped up a lot (~2X) when it's doing a square; note that long_pow()
    squares often
k_mul()
  - more cache-friendly now if it's doing a square
KARATSUBA_CUTOFF
  - boosted; gradeschool mult is quicker now, and it may have been too low
    for many platforms anyway
KARATSUBA_SQUARE_CUTOFF
  - new
  - since x_mul is a lot faster at squaring now, the point at which
    Karatsuba pays for squaring is much higher than for general mult
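
Illustration of why squaring needs about half the digit products.  This
is a minimal Python sketch, not the C code in this patch: the names
grade_mul and grade_square, the helper functions, and the 15-bit digit
size are assumptions made for the example.  Each cross product
a[i]*a[j] with i != j appears twice in a square, so it is computed once
and doubled.

```python
BASE = 1 << 15  # assumed digit size, chosen only for illustration


def to_digits(n):
    """Split a non-negative int into base-BASE digits, least significant first."""
    digits = []
    while n:
        n, d = divmod(n, BASE)
        digits.append(d)
    return digits or [0]


def from_digits(digits):
    """Inverse of to_digits."""
    n = 0
    for d in reversed(digits):
        n = n * BASE + d
    return n


def grade_mul(a, b):
    """General gradeschool multiply; returns (digits, digit products used)."""
    z = [0] * (len(a) + len(b))
    products = 0
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            carry += z[i + j] + ai * bj
            z[i + j] = carry % BASE
            carry //= BASE
            products += 1
        z[i + len(b)] += carry  # this slot has not been written yet
    return z, products


def grade_square(a):
    """Gradeschool square; each cross product is computed once and doubled."""
    n = len(a)
    z = [0] * (2 * n)
    products = 0
    for i, ai in enumerate(a):
        # Diagonal term a[i]*a[i] occurs exactly once.
        carry = z[2 * i] + ai * ai
        z[2 * i] = carry % BASE
        carry //= BASE
        products += 1
        # Cross terms a[i]*a[j] for j > i occur twice in the full square.
        twice = 2 * ai
        for j in range(i + 1, n):
            carry += z[i + j] + twice * a[j]
            z[i + j] = carry % BASE
            carry //= BASE
            products += 1
        # The doubled terms can leave a carry wider than one digit.
        k = i + n
        while carry:
            carry += z[k]
            z[k] = carry % BASE
            carry //= BASE
            k += 1
    return z, products
```

For an n-digit input, grade_mul(a, a) performs n*n digit products while
grade_square(a) performs n*(n+1)/2, which is consistent with the roughly
2X speedup noted above.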
diff --git a/Misc/ACKS b/Misc/ACKS
index 6eb0f64..dfdf005 100644
--- a/Misc/ACKS
+++ b/Misc/ACKS
@@ -442,6 +442,7 @@
 Eduardo Pérez
 Fernando Pérez
 Mark Perrego
+Trevor Perrin
 Tim Peters
 Chris Petrilli
 Bjorn Pettersen
diff --git a/Misc/NEWS b/Misc/NEWS
index 4656fa2..431b343 100644
--- a/Misc/NEWS
+++ b/Misc/NEWS
@@ -12,6 +12,16 @@
 Core and builtins
 -----------------
 
+- Some speedups for long arithmetic, thanks to Trevor Perrin.  Gradeschool
+  multiplication was sped up a little by optimizing the C code.  Gradeschool
+  squaring was sped up by about a factor of 2, by exploiting the fact that
+  about half the digit products in a square are duplicates.  Because
+  exponentiation squares often, this also speeds up long power.  For
+  example, the time to compute 17**1000000 dropped from about 14 seconds
+  to 9 on my box due to these changes alone.  The cutoff for Karatsuba
+  multiplication was raised, since gradeschool multiplication got quicker,
+  and the old cutoff was aggressively small regardless.
+
 - OverflowWarning is no longer generated.  PEP 237 scheduled this to
   occur in Python 2.3, but since OverflowWarning was disabled by default,
   nobody realized it was still being generated.  On the chance that user