Making division in Python faster

July 8, 2008

A very useful feature of Python is that it comes with built-in arbitrary-precision integers. The implementation is not the fastest in the world, but it is adequate for many purposes.

Python is clever enough to use the Karatsuba algorithm for multiplication of large integers, which gives an O(n^1.6) asymptotic time complexity for n-digit multiplication. This is a huge improvement over the O(n²) schoolbook algorithm when multiplying anything larger than a few hundred digits.

Unfortunately, division in Python is not so well equipped: Python uses a brute force O(n²) algorithm for quotients of any size. Several algorithms in mpmath perform several multiplications followed by a single large division; at high precision, the final division can take as much time as all the preceding multiplications together. Division is needed much less often than multiplication, but it is needed nonetheless.

Newton’s method to the rescue. If f(x) can be evaluated to n-digit accuracy in time O(q(n)), then O(q(n)) is also the complexity for evaluating f⁻¹(x) using Newton’s method (under reasonable assumptions). Thus Newton’s method allows one to divide as fast as multiplying, compute logarithms as fast as exponentials, etc. The overhead hidden in the O()-expression is typically a small factor, of the order of 2-5.

To divide with the same asymptotic speed as multiplying, we rewrite p/q as p · (1/q), and use Newton’s method to calculate r = 1/q. Newton’s method leads to the iteration r_n+1 = 2r_n – qr_n², which converges quadratically (i.e. each iteration doubles the accuracy). It is now important to exploit the self-correcting nature of Newton’s method by performing each step with an arithmetic precision equal to the accuracy. This way only a single step has to be performed at full precision. If this optimization is not used, the time complexity is just O((log n) q(n)), not O(q(n)).

Here is my implementation of Newton division in Python:


from mpmath.lib import giant_steps, lshift, rshift
from math import log

START_PREC = 15

def size(x):
    if isinstance(x, (int, long)):
        return int(log(x,2))
    # GMPY support
    return x.numdigits(2)

def newdiv(p, q):
    szp = size(p)
    szq = size(q)
    szr = szp - szq
    if min(szp, szq, szr) < 2*START_PREC:
        return p//q
    r = (1 << (2*START_PREC)) // (q >> (szq - START_PREC))
    last_prec = START_PREC
    for prec in giant_steps(START_PREC, szr):
        a = lshift(r, prec-last_prec+1)
        b = rshift(r**2 * rshift(q, szq-prec), 2*last_prec)
        r = a - b
        last_prec = prec
    return ((p >> szq) * r) >> szr

The core algorithm is just a few lines, although those lines took me a few moments to write (getting the shifts right gives me a headache). I imported the functions lshift, rshift (which are like the << and >> operators but allow negative shifts) and giant_steps (which counts up from START_PREC to … szr/4, szr/2, szr) from mpmath just for convenience; the implementation above requires essentially nothing but Python builtins.

How does it fare? A little benchmarking code:


from time import clock

def timing(INT):
    fmt = "%10s %12f %12f %12f %8s"
    print "%10s %12s %12s %12s %8s" % ("size", "old time", "new time",
        "faster", "error")
    print "-"*78
    for i in range(4,30):
        n = 2**i
        Q = INT(10)**n
        P = Q**2
        t1 = clock()
        R1 = P // Q
        t2 = clock()
        R2 = newdiv(P,Q)
        t3 = clock()
        size, old_time, new_time = n, t2-t1, t3-t2
        faster, error = old_time/new_time, R2-R1
        print fmt % (size, old_time, new_time, faster, error)

I choose to benchmark the performance of dividing a 2n-digit integer by an n-digit integer, for a result (and required precision) of n digits. The improvement is likely to be smaller if the denominator and quotient have unequal size, but the balanced case is the most important to optimize.

Now the performance compared to Python’s builtin division operator:


timing(int)

      size     old time     new time       faster    error
--------------------------------------------------------------
        16     0.000005     0.000101     0.050000       -1
        32     0.000005     0.000044     0.107595       -1
        64     0.000006     0.000052     0.123656        0
       128     0.000013     0.000064     0.197368        0
       256     0.000033     0.000088     0.377778       -1
       512     0.000115     0.000173     0.663430        0
      1024     0.000413     0.000399     1.035689        1
      2048     0.001629     0.001099     1.481576        0
      4096     0.006453     0.004292     1.503352       -1
      8192     0.025596     0.009966     2.568285        0
     16384     0.101964     0.030327     3.362182       -1
     32768     0.408266     0.091811     4.446792       -1
     65536     1.633296     0.278531     5.863967       -1
    131072     6.535185     0.834847     7.828001        0
    262144    26.108710     2.517134    10.372397        0
    524288   104.445635     7.576984    13.784593       -1
   1048576   418.701976    22.760790    18.395758        0

The results are excellent. The Newton division breaks even with the builtin division already at 1000 digits (despite the interpreter overhead), and is 10x faster at 260,000 digits.

The implementation in newdiv is not exact; as you can see, the results differ from the (presumably correct!) values computed by Python by a few units in the last place. This is not a big concern, as I intend to use this mainly for numerical algorithms, and it is always possible to simply add a few guard digits. For number theory and other applications that require exact division, I think it is sufficient to multiply a few of the least significant bits in q and r to see if they agree with p, and correct as necessary.

The point of this exercise was to try to give division of Python long ints the advantage of the builtin Karatsuba multiplication. Just for fun, I also tried it with gmpy integers:


from gmpy import mpz
timing(mpz)

      size     old time     new time       faster    error
--------------------------------------------------------------
        16     0.000011     0.000068     0.168033       -3
        32     0.000008     0.000045     0.185185       -2
        64     0.000006     0.000047     0.137725       -1
       128     0.000005     0.000053     0.089005       -2
       256     0.000007     0.000070     0.099602        0
       512     0.000015     0.000083     0.184564       -1
      1024     0.000044     0.000156     0.282143        0
      2048     0.000134     0.000279     0.481000       -1
      4096     0.000404     0.000657     0.614960       -2
      8192     0.001184     0.001719     0.688882       -1
     16384     0.003481     0.004585     0.759322       -2
     32768     0.009980     0.012043     0.828671       -1
     65536     0.027762     0.028783     0.964516       -1
    131072     0.072987     0.067773     1.076926       -1
    262144     0.186714     0.151604     1.231588       -1
    524288     0.462057     0.342160     1.350411       -1
   1048576     1.119103     0.788582     1.419134       -2
   2097152     2.726458     1.838570     1.482923       -1
   4194304     6.607204     4.069614     1.623546       -1
   8388608    15.627027     8.980883     1.740032       -1
  16777216    36.581787    19.167144     1.908567       -4
  33554432    83.568330    40.131393     2.082368       -1
  67108864   190.596889    81.412867     2.341115       -1

Interestingly, the Newton division breaks even at 130,000 digits and is twice as fast at 30 million digits. So it is clear that either gmpy does not use GMP optimally, or GMP does not use the asymptotically fastest possible division algorithm. I think the latter is the case; this page contains development code for the next version of GMP, including improved division code, and notes that “the new code has complexity O(M(n)) for all operations, while the old mpz code is O(M(n)log(n))”. This seems consistent with my timings.

A small notice: I will be away for about a week starting tomorrow. I will bring my laptop to continue working on my GSoC project during the time, but I won’t have internet access.

Edit: Fixed a misplaced parenthesis in newdiv that gave the wrong criterion for falling back to p//q. Benchmarks unaffected.

fredrikj.net | Blog index | RSS feed | Follow me on Mastodon | Become a sponsor