<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Fredrik Johansson&#039;s blog</title>
	<atom:link href="http://fredrikj.net/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://fredrikj.net/blog</link>
	<description></description>
	<lastBuildDate>Wed, 16 May 2012 02:01:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Experiments with low-level ball arithmetic</title>
		<link>http://fredrikj.net/blog/2012/05/experiments-with-low-level-ball-arithmetic/</link>
		<comments>http://fredrikj.net/blog/2012/05/experiments-with-low-level-ball-arithmetic/#comments</comments>
		<pubDate>Wed, 16 May 2012 01:43:09 +0000</pubDate>
		<dc:creator>Fredrik</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://fredrikj.net/blog/?p=648</guid>
		<description><![CDATA[In the implementation of ball arithmetic I wrote about previously, arithmetic is done naively using FLINT (fmpz) integers. This is easy to implement but adds quite a lot of overhead at low precision (up to a precision of several hundred &#8230; <a href="http://fredrikj.net/blog/2012/05/experiments-with-low-level-ball-arithmetic/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>In the implementation of <a href="http://fredrikj.net/blog/2012/04/high-precision-ball-arithmetic/">ball arithmetic</a> I wrote about previously, arithmetic is done naively using FLINT (fmpz) integers. This is easy to implement but adds quite a lot of overhead at low precision (up to a precision of several hundred digits).</p>
<p>I&#8217;ve started writing some experimental code for a low-level implementation (based on mpn functions). The goal is that a single ball addition or multiplication should be as cheap as or cheaper than a single MPFR operation, ideally close to mpz fixed-point arithmetic with no error bounding at all.</p>
<p>There are many different ways to do the low-level implementation, giving various tradeoffs between simplicity, speed, and accuracy. So far I&#8217;ve tried two approaches.</p>
<p><a href="https://github.com/fredrik-johansson/arb/blob/master/experimental/llarb1.c">The first</a> is to do straight up floating-point arithmetic with a full-limb radix (i.e. the digits are base 2<sup>32</sup> or 2<sup>64</sup>, and the exponent measures the number of digits), like the GMP mpf type, but with a single limb (aligned with the bottom of the mantissa) for the radius. When the radius overflows one limb, the precision of the midpoint is reduced by one limb. This has the advantage of requiring few adjustments, and adding numbers can be done with a single mpn_add without temporary space or extra shifts. One disadvantage is that since the top limb can be nearly empty, about half an extra limb is needed on average to obtain the same effective precision as with base 2 arithmetic.</p>
<p><a href="https://github.com/fredrik-johansson/arb/blob/master/experimental/llarb2.c">The second</a> is to do base-2 floating-point arithmetic, i.e. with a base-2 exponent, but with the mantissa still having a precision measured in a whole number of limbs, and with a single-limb radius aligned with the last limb as in the first representation. This requires some more adjustment code to make sure that the top bit of the top limb always is set. Advantages of this representation include the fact that the full representable precision is used, and that the error propagated in multiplication can be bounded very easily if one accepts always losing 1-2 bits of precision (this is too much in general, but can be useful for dot products).</p>
<p>Both representations have some minor disadvantages: whenever the radius overflows a limb, the error is amplified by a large factor (however, this is a rare event on average), the results are machine-dependent (this is not a problem if one uses the ball arithmetic as a lower-level routine for a computation intended to produce an exact or precisely defined result in the end, using adaptive precision or Ziv&#8217;s strategy for correct rounding), and the lack of precise semantics makes it somewhat difficult to write strong test code.</p>
<p>Other possible approaches include using base-2 arithmetic with a precision measured in bits, perhaps with the radius taking up less than a whole limb (making radius calculations even cheaper). This is likely to require more code, with more (potentially expensive) adjustment steps, but might make the arithmetic more well-behaved with fewer strange worst-case situations.</p>
<p>In any case, I&#8217;ve compared how long it takes to multiply two n-limb numbers using mpn, mpz, FLINT fmpz, and MPFR functions, as well as the working fmpz-based ball arithmetic code (arb), and the two experimental implementations mentioned above (arb1 and arb2). To be precise, the mpz and fmpz timings indicate the combined time to do a multiplication and a shift, as needed for a fixed-point multiplication.</p>
<table>
<tr>
<th>limbs</th>
<th>mpn</th>
<th>mpz</th>
<th>fmpz</th>
<th>mpfr</th>
<th>arb</th>
<th>arb1</th>
<th>arb2</th>
</tr>
<tr>
<td>1</td>
<td>10</td>
<td>30</td>
<td>57</td>
<td>48</td>
<td>139</td>
<td>26</td>
<td>32</td>
</tr>
<tr>
<td>2</td>
<td>18</td>
<td>44</td>
<td>51</td>
<td>58</td>
<td>230</td>
<td>57</td>
<td>51</td>
</tr>
<tr>
<td>3</td>
<td>30</td>
<td>54</td>
<td>68</td>
<td>89</td>
<td>259</td>
<td>70</td>
<td>64</td>
</tr>
<tr>
<td>4</td>
<td>44</td>
<td>68</td>
<td>83</td>
<td>100</td>
<td>279</td>
<td>88</td>
<td>82</td>
</tr>
<tr>
<td>5</td>
<td>62</td>
<td>88</td>
<td>100</td>
<td>129</td>
<td>300</td>
<td>110</td>
<td>100</td>
</tr>
<tr>
<td>6</td>
<td>83</td>
<td>110</td>
<td>129</td>
<td>150</td>
<td>340</td>
<td>130</td>
<td>130</td>
</tr>
<tr>
<td>7</td>
<td>110</td>
<td>129</td>
<td>150</td>
<td>180</td>
<td>380</td>
<td>160</td>
<td>150</td>
</tr>
<tr>
<td>8</td>
<td>139</td>
<td>170</td>
<td>180</td>
<td>200</td>
<td>400</td>
<td>180</td>
<td>190</td>
</tr>
<tr>
<td>9</td>
<td>170</td>
<td>210</td>
<td>220</td>
<td>240</td>
<td>440</td>
<td>230</td>
<td>220</td>
</tr>
<tr>
<td>10</td>
<td>210</td>
<td>259</td>
<td>270</td>
<td>300</td>
<td>510</td>
<td>280</td>
<td>270</td>
</tr>
<tr>
<td>11</td>
<td>259</td>
<td>310</td>
<td>320</td>
<td>490</td>
<td>550</td>
<td>320</td>
<td>310</td>
</tr>
<tr>
<td>12</td>
<td>300</td>
<td>340</td>
<td>360</td>
<td>530</td>
<td>610</td>
<td>360</td>
<td>360</td>
</tr>
<tr>
<td>13</td>
<td>350</td>
<td>400</td>
<td>420</td>
<td>519</td>
<td>660</td>
<td>410</td>
<td>410</td>
</tr>
<tr>
<td>14</td>
<td>410</td>
<td>450</td>
<td>470</td>
<td>640</td>
<td>700</td>
<td>480</td>
<td>470</td>
</tr>
<tr>
<td>15</td>
<td>470</td>
<td>519</td>
<td>530</td>
<td>640</td>
<td>769</td>
<td>540</td>
<td>530</td>
</tr>
</table>
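<p>As an aside on the mpz and fmpz columns: a fixed-point multiplication is a full integer product followed by a right shift that restores the scaling. Here is a minimal Python sketch of that idea, with built-in integers standing in for mpz. The names <tt>FRAC_BITS</tt>, <tt>to_fixed</tt> and <tt>fixed_mul</tt> are made up for illustration and do not come from the benchmark code.</p>

```python
FRAC_BITS = 64  # assumed number of fractional bits (one 64-bit limb)

def to_fixed(num, den):
    """Return floor((num / den) * 2**FRAC_BITS) as a plain integer."""
    return (num << FRAC_BITS) // den

def fixed_mul(a, b):
    """Fixed-point product: one full integer multiplication, then one shift."""
    return (a * b) >> FRAC_BITS

a = to_fixed(3, 2)   # represents 1.5
b = to_fixed(5, 4)   # represents 1.25
c = fixed_mul(a, b)  # represents 1.875
```

<p>The shift discards the low half of the double-length product, which is why the table counts it as part of the cost.</p>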
<p>We see that arb1 and arb2 are nearly equal, and both are faster than MPFR and nearly as fast as mpz, accomplishing the goal. There are a few special cases left (e.g. zero radius) that would need handling in a proper implementation, which should add a few nanoseconds to each, but it&#8217;s also possible that the code for either could be streamlined a bit.</p>
<p>I also implemented addition for the arb1 type:</p>
<table>
<tr>
<th>limbs</th>
<th>mpn</th>
<th>mpz</th>
<th>fmpz</th>
<th>mpfr</th>
<th>arb</th>
<th>arb1</th>
</tr>
<tr>
<td>1</td>
<td>7</td>
<td>17</td>
<td>30</td>
<td>83</td>
<td>87</td>
<td>31</td>
</tr>
<tr>
<td>2</td>
<td>6</td>
<td>17</td>
<td>29</td>
<td>88</td>
<td>95</td>
<td>39</td>
</tr>
<tr>
<td>3</td>
<td>7</td>
<td>17</td>
<td>29</td>
<td>92</td>
<td>97</td>
<td>45</td>
</tr>
<tr>
<td>4</td>
<td>8</td>
<td>20</td>
<td>32</td>
<td>79</td>
<td>100</td>
<td>48</td>
</tr>
<tr>
<td>5</td>
<td>11</td>
<td>20</td>
<td>32</td>
<td>82</td>
<td>100</td>
<td>51</td>
</tr>
<tr>
<td>6</td>
<td>11</td>
<td>23</td>
<td>36</td>
<td>83</td>
<td>100</td>
<td>58</td>
</tr>
<tr>
<td>7</td>
<td>12</td>
<td>23</td>
<td>34</td>
<td>100</td>
<td>100</td>
<td>59</td>
</tr>
<tr>
<td>8</td>
<td>16</td>
<td>26</td>
<td>40</td>
<td>98</td>
<td>110</td>
<td>61</td>
</tr>
<tr>
<td>9</td>
<td>17</td>
<td>31</td>
<td>41</td>
<td>110</td>
<td>120</td>
<td>62</td>
</tr>
<tr>
<td>10</td>
<td>14</td>
<td>32</td>
<td>44</td>
<td>110</td>
<td>110</td>
<td>67</td>
</tr>
<tr>
<td>11</td>
<td>20</td>
<td>27</td>
<td>45</td>
<td>120</td>
<td>110</td>
<td>72</td>
</tr>
<tr>
<td>12</td>
<td>15</td>
<td>27</td>
<td>39</td>
<td>110</td>
<td>110</td>
<td>71</td>
</tr>
<tr>
<td>13</td>
<td>18</td>
<td>28</td>
<td>40</td>
<td>110</td>
<td>120</td>
<td>73</td>
</tr>
<tr>
<td>14</td>
<td>18</td>
<td>31</td>
<td>43</td>
<td>110</td>
<td>110</td>
<td>78</td>
</tr>
<tr>
<td>15</td>
<td>19</td>
<td>28</td>
<td>41</td>
<td>120</td>
<td>120</td>
<td>83</td>
</tr>
</table>
<p>Again, speed is much better for arb1 than for MPFR and arb. The timings for arb1 should actually be a bit better, closer to the mpz numbers. Right now I&#8217;m actually not sure why the difference is so large. The numbers get better if you only time it with a fixed set of inputs.</p>
<p>So far, the code is too incomplete to draw any conclusions about which approach is better. Code complexity is probably the biggest issue. In both versions, multiplication takes 150 lines or so of very dense code, and this code does not even handle all cases yet. It would be great to get rid of more branches, but that alone is difficult. Addition is not any easier (the arb1 implementation, which is supposed to be the one where addition is simple, has 150 lines of code and about 20 branches).</p>
]]></content:encoded>
			<wfw:commentRss>http://fredrikj.net/blog/2012/05/experiments-with-low-level-ball-arithmetic/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Logarithms as well</title>
		<link>http://fredrikj.net/blog/2012/05/logarithms-as-well/</link>
		<comments>http://fredrikj.net/blog/2012/05/logarithms-as-well/#comments</comments>
		<pubDate>Wed, 02 May 2012 11:27:55 +0000</pubDate>
		<dc:creator>Fredrik</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://fredrikj.net/blog/?p=616</guid>
		<description><![CDATA[In the last post, I discussed computing the exponential function. With that being implemented, it&#8217;s not much work to do logarithms as well. If we can evaluate the exponential function, then we can compute $y = \log x$ by using &#8230; <a href="http://fredrikj.net/blog/2012/05/logarithms-as-well/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>In the <a href="http://fredrikj.net/blog/2012/04/revisiting-transcendental-functions/">last post</a>, I discussed computing the exponential function. With that being implemented, it&#8217;s not much work to do logarithms as well. If we can evaluate the exponential function, then we can compute $y = \log x$ by using <a href="http://en.wikipedia.org/wiki/Newton%27s_method">Newton&#8217;s method</a> to find a root of the equation $\exp(y) - x = 0$, i.e. using the iteration $y_{n+1} = y_n - 1 + x / \exp(y_n)$.</p>
<p>Newton iteration is extremely efficient. Starting from a sufficiently accurate initial value, the number of correct digits roughly doubles with each iteration; moreover, the algorithm is &#8220;self-correcting&#8221;, so we only need to use full precision for the last iteration, 1/2 the precision for the second last iteration, 1/4 the precision for the iteration before that, etc. The total cost turns out to be only a fraction more than the time to evaluate a single exponential to the target precision.</p>
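<p>The self-correcting property can be sketched with Python&#8217;s <tt>decimal</tt> module, which provides an arbitrary-precision <tt>exp</tt>. This is only an illustration under assumptions of my choosing: the function name <tt>log_newton</tt>, the guard digits and the precision schedule are hypothetical, and none of it mirrors the C code.</p>

```python
import math
from decimal import Decimal, localcontext

def log_newton(x, digits):
    """Approximate log(x), x > 0, by Newton iteration on exp(y) - x = 0."""
    # Build the precision schedule backwards from the target, halving each
    # time. A double-precision start is good to about 15 digits, so the
    # first Newton step can already aim at roughly 30 digits.
    schedule = [digits + 5]
    while schedule[-1] > 30:
        schedule.append(schedule[-1] // 2 + 2)
    schedule.reverse()

    x = Decimal(x)
    y = Decimal(math.log(float(x)))        # initial value from the system log
    with localcontext() as ctx:
        for prec in schedule:
            ctx.prec = prec                # only the last step is full precision
            y = y - 1 + x / y.exp()        # y_{n+1} = y_n - 1 + x / exp(y_n)
        ctx.prec = digits
        return +y                          # round to the requested precision

approx = log_newton(2, 50)
print(approx)
```

<p>For 50 digits the schedule is just two steps (29 and 55 digits), so the expensive full-precision exponential is evaluated once.</p>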
<p>In fact, it seems even better to use <a href="http://en.wikipedia.org/wiki/Halley%27s_method">Halley&#8217;s method</a>, which has the update step $y_{n+1} = y_n + 2 - 4t/(x + t)$ where $t = \exp(y_n)$. The amount of work per step is basically the same (one exponential and one division), but now the number of digits <i>triples</i> with each iteration! (We could avoid the division in the ordinary Newton iteration using $1/\exp(y_n) = \exp(-y_n)$, but it is more convenient to limit the basecase exponential to positive arguments at least for now.)</p>
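<p>To see the difference in convergence order, the two update steps can be run side by side from the same crude starting value. This is a small experiment using Python&#8217;s <tt>decimal</tt> module at a fixed 120 digits; the helper names and the starting value 0.69 are my own choices for illustration.</p>

```python
from decimal import Decimal, getcontext

getcontext().prec = 120

def newton_step(y, x):
    """Newton update for log(x): y - 1 + x / exp(y)."""
    return y - 1 + x / y.exp()

def halley_step(y, x):
    """Halley update for log(x): y + 2 - 4t / (x + t) with t = exp(y)."""
    t = y.exp()
    return y + 2 - 4 * t / (x + t)

x = Decimal(2)
exact = x.ln()
y_n = y_h = Decimal("0.69")   # deliberately crude start, error about 3e-3
for i in range(3):
    y_n, y_h = newton_step(y_n, x), halley_step(y_h, x)
    # adjusted() is the decimal exponent of the leading digit of the error
    print(i + 1, abs(y_n - exact).adjusted(), abs(y_h - exact).adjusted())
```

<p>The printed exponents roughly double per step in the Newton column and roughly triple in the Halley column.</p>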
<p>Newton&#8217;s method and Halley&#8217;s method are just the first two instances of <a href="http://en.wikipedia.org/wiki/Householder%27s_method">Householder&#8217;s method</a>, of order $d = 1$ and $d = 2$ respectively. Using higher-order Householder iterations, we can obtain arbitrarily rapid convergence, but for $d > 2$ the formulas start to become unwieldy. For $d = 3$, we have</p>
<p>$$y_{n+1} = y_n + \frac{6 x \left(x+2 t\right)}{x^2+4 x t+t^2}-3.$$</p>
<p>I have now <a href="https://github.com/fredrik-johansson/arb/blob/master/mpr/log_basecase.c">implemented a first version</a> of a basecase logarithm function, which evaluates $\log(1+x)$ on the standard interval $[0,1)$. It starts from a double-precision approximation and refines it to the target precision (modulo some rounding error) using Halley iteration. Here is how it performs (timings are in nanoseconds):</p>
<table>
<tr>
<th>prec</th>
<th>system</th>
<th>dd</th>
<th>qd</th>
<th>mpfr</th>
<th>mpmath</th>
<th>new code</th>
</tr>
<tr>
<td>53</td>
<td>67</td>
<td></td>
<td></td>
<td>4900</td>
<td>22000</td>
<td>390</td>
</tr>
<tr>
<td>64</td>
<td></td>
<td></td>
<td></td>
<td>4800</td>
<td>22000</td>
<td>390</td>
</tr>
<tr>
<td>106</td>
<td></td>
<td>900</td>
<td></td>
<td>6700</td>
<td>24000</td>
<td>590</td>
</tr>
<tr>
<td>128</td>
<td></td>
<td></td>
<td></td>
<td>7800</td>
<td>26000</td>
<td>600</td>
</tr>
<tr>
<td>192</td>
<td></td>
<td></td>
<td></td>
<td>9200</td>
<td>29000</td>
<td>1100</td>
</tr>
<tr>
<td>212</td>
<td></td>
<td></td>
<td>20000</td>
<td>10000</td>
<td>30000</td>
<td>1500</td>
</tr>
<tr>
<td>256</td>
<td></td>
<td></td>
<td></td>
<td>13000</td>
<td>32000</td>
<td>1500</td>
</tr>
<tr>
<td>320</td>
<td></td>
<td></td>
<td></td>
<td>14000</td>
<td>35000</td>
<td>2100</td>
</tr>
<tr>
<td>384</td>
<td></td>
<td></td>
<td></td>
<td>16000</td>
<td>39000</td>
<td>2500</td>
</tr>
</table>
<p>Comparing the numbers to the table from the last post, we find that log runs at around half the speed of exp. This ratio should improve at higher precision, but even at low precision, it is not a bad result, and the code is still about an order of magnitude faster than MPFR. Some savings are possible: spending around 70 nanoseconds on the initial value by calling the system logarithm is a bit excessive (a faster, sloppier double-precision logarithm could be used instead), and the division could probably be done slightly faster. But apart from that, the code is quite lean, and the only way to improve it is to improve the exponential function.</p>
<p>I should mention that the current code typically gives 1-2 bits less than full accuracy. To guarantee full accuracy, it is necessary to add a few guard bits. It shouldn&#8217;t actually be all that difficult to compute a rigorous error bound for the Halley iteration to make this automatic, at least in principle (translating it to correct and efficient code is probably a bit hairy), but this will have to wait since it requires error bounding for exp first.</p>
<p>This is actually probably the best way to compute the logarithm, up to a precision of several thousand digits. We could of course use a table lookup and a Taylor series correction, just as for exp, avoiding the overhead of Newton/Halley iteration. But a table lookup requires at least one division to transform the argument, and moreover nested table lookups aren&#8217;t possible, so we would need a much larger table for the same speed. If we use the convergence acceleration formula $\log(1+x) = 2^n \log((1+x)^{1/2^n})$, the complexity is roughly the same as for exp (with $\exp(x) = (\exp(x/2^n))^{2^n}$), but square roots are so expensive that this only becomes competitive at very high precision. Either way, it would be difficult to get rid of the 2x overhead compared to exp.</p>
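<p>For reference, the square-root formula $\log(1+x) = 2^n \log((1+x)^{1/2^n})$ can be tried out in a few lines with Python&#8217;s <tt>decimal</tt> module. This is a sketch under assumptions: the function name <tt>log1p_sqrt</tt> is hypothetical, and I use the series $\log u = 2(z + z^3/3 + z^5/5 + \ldots)$ with $z = (u-1)/(u+1)$ for the reduced argument, which is my choice rather than anything taken from the post.</p>

```python
from decimal import Decimal, getcontext

getcontext().prec = 40

def log1p_sqrt(x, n, terms):
    """log(1+x) via n square roots followed by a short series."""
    u = 1 + Decimal(x)
    for _ in range(n):
        u = u.sqrt()                       # u = (1+x)**(1/2**n), close to 1
    z = (u - 1) / (u + 1)
    # log(u) = 2 * (z + z**3/3 + z**5/5 + ...), truncated after `terms` terms
    s = sum(z ** (2 * k + 1) / (2 * k + 1) for k in range(terms))
    return 2 ** n * 2 * s                  # undo the n square roots

approx = log1p_sqrt("0.5", 8, 6)
print(approx)
```

<p>With eight square roots, six series terms are already enough for well over 30 digits here, which shows the tradeoff: each extra square root shortens the series but costs a full-precision root.</p>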
]]></content:encoded>
			<wfw:commentRss>http://fredrikj.net/blog/2012/05/logarithms-as-well/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Revisiting transcendental functions</title>
		<link>http://fredrikj.net/blog/2012/04/revisiting-transcendental-functions/</link>
		<comments>http://fredrikj.net/blog/2012/04/revisiting-transcendental-functions/#comments</comments>
		<pubDate>Mon, 30 Apr 2012 12:30:55 +0000</pubDate>
		<dc:creator>Fredrik</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://fredrikj.net/blog/?p=569</guid>
		<description><![CDATA[The overwhelming majority of the time when I use arbitrary-precision arithmetic, I only need precision slightly higher than hardware precision; typically 30-40 digits, occasionally perhaps 100 digits, and only very rarely 1000 digits or more. About two years ago, I &#8230; <a href="http://fredrikj.net/blog/2012/04/revisiting-transcendental-functions/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The overwhelming majority of the time when I use arbitrary-precision arithmetic, I only need precision slightly higher than hardware precision; typically 30-40 digits, occasionally perhaps 100 digits, and only very rarely 1000 digits or more.  About two years ago, I started doing some experiments to see how quickly the most common transcendental functions (exp, log, cos, sin, atan, gamma) can be computed in this range (the old test code, now obsolete, is available <a href="http://code.google.com/p/fastfunlib/">here</a>).</p>
<p>There are three main ideas. The first is simply to do arithmetic very efficiently, only using the <a href="http://gmplib.org">GMP</a> functions that are written in assembly (<tt>mpn_add_n</tt>, <tt>mpn_addmul_1</tt>, <tt>mpn_mul_basecase</tt>, etc.), avoiding divisions like the plague, eliminating temporary memory allocations and copies, and so on.</p>
<p>The second is to use complexity reduction techniques aggressively. For elementary (or hypergeometric) functions, one can get a complexity of $d^{2.333}$ instead of the naive $d^3$ at $d$-digit precision by combining argument reduction with baby-step/giant-step polynomial evaluation. With the proper low-level implementation, this should be effective essentially from $d = 1$.</p>
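<p>The baby-step/giant-step part can be illustrated in isolation. The sketch below evaluates a degree-$N$ polynomial with about $2\sqrt{N}$ full multiplications instead of $N$; it uses exact <tt>Fraction</tt> arithmetic so the result can be checked against the naive sum. The function name <tt>eval_bsgs</tt> is hypothetical, and the sketch ignores the argument reduction and low-level details that the real complexity bound depends on.</p>

```python
from fractions import Fraction
from math import factorial, isqrt

def eval_bsgs(coeffs, x):
    """Evaluate sum(coeffs[k] * x**k) by baby steps and giant steps."""
    n = len(coeffs)
    m = max(1, isqrt(n))
    # Baby steps: the powers x**0, x**1, ..., x**m (m full multiplications).
    powers = [Fraction(1)]
    for _ in range(m):
        powers.append(powers[-1] * x)
    # Giant steps: Horner's rule in x**m over blocks of m coefficients.
    # Inside a block only coefficient-times-power products are needed,
    # which are cheap when the coefficients are small.
    result = Fraction(0)
    for start in reversed(range(0, n, m)):
        block = sum(c * p for c, p in zip(coeffs[start:start + m], powers))
        result = result * powers[m] + block
    return result

coeffs = [Fraction(1, factorial(k)) for k in range(20)]   # Taylor series of exp
x = Fraction(1, 8)
value = eval_bsgs(coeffs, x)
```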
<p>The third is to take advantage of precomputation to speed up repeated evaluations. RAM is cheap, and in the range up to 100 digits or so, the space needed to speed up the most commonly used functions by large factors is measured in kilobytes. Of course, use of lookup tables increases cache pressure, but a cache miss is much cheaper than a whole sequence of operations on multi-limb integers, so this tradeoff is generally worth it (some benchmarking I did a while back also indicated that prefetch instructions worked wonderfully for this purpose, at least on my CPU).</p>
<p>I have now written a new basecase implementation for exp(), substantially improved compared to my previous experiments; the <a href="https://github.com/fredrik-johansson/arb/tree/master/mpr">code can be found here</a>. Using a modest 224 KB lookup table (with a small modification, this could be trimmed to about 160 KB) for 16-bit argument reduction allows going up to a precision of about 384 bits, or 115 decimal digits, on a 64-bit system. In other words, this is plenty for most practical purposes. The lookup table itself takes just a few microseconds to generate dynamically.</p>
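<p>The idea behind table-based argument reduction is $\exp(x) = \exp(k/2^r) \exp(t)$, where $k/2^r$ holds the top $r$ bits of $x$ and the remainder $t$ is so small that a short Taylor series suffices. Below is a toy version using Python&#8217;s <tt>decimal</tt> module. It is a sketch under assumptions: a single 8-bit table rather than the 16-bit reduction described above, decimal rather than limb arithmetic, and the names <tt>R</tt>, <tt>TABLE</tt> and <tt>exp_table</tt> are invented for the example.</p>

```python
from decimal import Decimal, getcontext

getcontext().prec = 40
R = 8                                      # table index bits (toy size)
TABLE = [(Decimal(k) / 2 ** R).exp() for k in range(2 ** R)]

def exp_table(x, terms=12):
    """exp(x) for 0 <= x < 1: one table lookup plus a short Taylor series."""
    k = int(x * 2 ** R)                    # top R bits of x
    t = x - Decimal(k) / 2 ** R            # remainder, 0 <= t < 2**-R
    term = total = Decimal(1)
    for i in range(1, terms):
        term = term * t / i                # t**i / i!
        total += term
    return TABLE[k] * total

x = Decimal("0.6180339887498948482")
print(exp_table(x))
```

<p>Each extra table bit removes one bit from $t$, so a larger table buys a shorter series at the price of more memory.</p>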
<p>Here is how it compares to some other libraries: the system double-precision exp, the double-double (dd) and quad-double (qd) types from D. H. Bailey&#8217;s <a href="http://crd-legacy.lbl.gov/~dhbailey/mpdist/">QD</a> library (measured using the supplied <tt>tests/qd_timer</tt> program), <a href="http://mpfr.org">MPFR</a>, and <a href="http://mpmath.org">mpmath</a> (using the Python implementation with <a href="http://code.google.com/p/gmpy/">GMPY</a> types, not the code in Sage which wraps MPFR). All timings are in nanoseconds.</p>
<table>
<tr>
<th>prec</th>
<th>system</th>
<th>dd</th>
<th>qd</th>
<th>mpfr</th>
<th>mpmath</th>
<th>new code</th>
</tr>
<tr>
<td>53</td>
<td>49</td>
<td></td>
<td></td>
<td>4300</td>
<td>20000</td>
<td>180</td>
</tr>
<tr>
<td>64</td>
<td></td>
<td></td>
<td></td>
<td>5500</td>
<td>20000</td>
<td>190</td>
</tr>
<tr>
<td>106</td>
<td></td>
<td>790</td>
<td></td>
<td>7200</td>
<td>24000</td>
<td>330</td>
</tr>
<tr>
<td>128</td>
<td></td>
<td></td>
<td></td>
<td>7800</td>
<td>26000</td>
<td>340</td>
</tr>
<tr>
<td>192</td>
<td></td>
<td></td>
<td></td>
<td>10000</td>
<td>29000</td>
<td>520</td>
</tr>
<tr>
<td>212</td>
<td></td>
<td></td>
<td>6500</td>
<td>10000</td>
<td>31000</td>
<td>690</td>
</tr>
<tr>
<td>256</td>
<td></td>
<td></td>
<td></td>
<td>11000</td>
<td>31400</td>
<td>780</td>
</tr>
<tr>
<td>320</td>
<td></td>
<td></td>
<td></td>
<td>12000</td>
<td>34000</td>
<td>1100</td>
</tr>
<tr>
<td>384</td>
<td></td>
<td></td>
<td></td>
<td>14000</td>
<td>38000</td>
<td>1500</td>
</tr>
</table>
<p>A couple of things need to be pointed out.</p>
<p>Right now, this code is experimental. I have not tested it thoroughly, and it does not come with an error bound (however, the error is never larger than a few ulp, and a bound can be computed quite easily, for which I intend to include code later on).</p>
<p>All other libraries take an arbitrary $x$, not necessarily restricted to $[0,\ln 2)$. The initial argument reduction is a division with remainder, where a low-precision value of $1/\ln 2$ can be precomputed, meaning that only a single <tt>mpn_submul_1</tt> plus some adjustments should be necessary.</p>
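<p>In double precision, that initial reduction looks roughly as follows. This is a sketch only: the name <tt>reduce_arg</tt> is hypothetical, and a multi-limb version would replace the floating-point operations with the <tt>mpn</tt> calls mentioned above.</p>

```python
import math

LN2 = math.log(2.0)
INV_LN2 = 1.0 / LN2                       # precomputed once

def reduce_arg(x):
    """Split x as n*ln(2) + r with 0 <= r < ln(2), so exp(x) = 2**n * exp(r)."""
    n = math.floor(x * INV_LN2)           # quotient from the precomputed inverse
    r = x - n * LN2                       # remainder
    # The approximate inverse can leave the quotient off by one: adjust.
    if r < 0.0:
        n -= 1
        r += LN2
    elif r >= LN2:
        n += 1
        r -= LN2
    return n, r

n, r = reduce_arg(10.0)
print(n, r, math.ldexp(math.exp(r), n))   # last value reconstructs exp(10)
```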
<p>The system exponential function, MPFR and mpmath also effectively use higher internal precision to compute the exponential with 0.5 or 1 ulp error, so to be fair a few guard bits should be added.</p>
<p>Even after adding all remaining corrections, the code can actually probably be made faster. At 1 or 2 limb precision, replacing all calls to GMP functions with inline assembly should improve performance; the current code also contains a redundant division, and removing it should save 60 nanoseconds or so.</p>
<p>Among the elementary functions, exp is the easiest to implement; cos, sin, atan and log can be computed using similar principles, although there is some more overhead, so they will perhaps be a factor two slower. The next logical step is to add a second version of exp, for precisions between 384 bits and a few thousand digits. It will necessarily be a bit slower than the basecase version, but should still be a decent factor faster than the MPFR function in this range.</p>
]]></content:encoded>
			<wfw:commentRss>http://fredrikj.net/blog/2012/04/revisiting-transcendental-functions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Algorithm selection for zeta(n)</title>
		<link>http://fredrikj.net/blog/2012/04/algorithm-selection-for-zetan/</link>
		<comments>http://fredrikj.net/blog/2012/04/algorithm-selection-for-zetan/#comments</comments>
		<pubDate>Thu, 19 Apr 2012 13:24:22 +0000</pubDate>
		<dc:creator>Fredrik</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://fredrikj.net/blog/?p=534</guid>
		<description><![CDATA[In my last post, I mentioned using binary splitting to compute $\zeta(n)$ to extremely high precision for small $n$. I have now added a function to Arb for evaluating $\zeta(n)$ that selects between several different algorithms depending on both $n$ &#8230; <a href="http://fredrikj.net/blog/2012/04/algorithm-selection-for-zetan/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://fredrikj.net/blog/2012/04/high-precision-ball-arithmetic/">last post</a>, I mentioned using binary splitting to compute $\zeta(n)$ to extremely high precision for small $n$. I have now added a function to <a href="https://github.com/fredrik-johansson/arb">Arb</a> for evaluating $\zeta(n)$ that selects between several different algorithms depending on both $n$ and the precision.</p>
<p>Here are timings compared to <a href="http://mpfr.org">MPFR</a> for even $n$:</p>
<div align="center"><img class="size-full wp-image-543" title="zeta_even" src="http://fredrikj.net/blog/wp-content/uploads/2012/04/zeta_even.png" alt="Timings for zeta(n), even n" width="560" height="350" /></div>
<p>Here the Arb function either uses the formula $\zeta(2n) = (-1)^{n+1} B_{2n} (2\pi)^{2n} / (2(2n)!)$ (computing the Bernoulli number $B_{2n}$ exactly using <a href="http://flintlib.org">FLINT</a>), or the <a href="http://en.wikipedia.org/wiki/Euler_product">Euler product</a> $(\zeta(n))^{-1} = \prod_{p} (1-p^{-n})$ for large $n$. You can see the cutoff quite clearly in the plot. At any size, this strategy is much faster than the generic algorithm used by MPFR.</p>
<p>In fact, since FLINT uses the Euler product to compute Bernoulli numbers, the Euler product effectively always ends up being used. That is, one basically uses the Euler product to compute $\zeta(n)$ to full precision, or to compute $B_n$ exactly, whichever requires less precision.</p>
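<p>Both formulas are easy to try at double precision. The following sketch computes exact Bernoulli numbers with the textbook recurrence (not the Euler product approach FLINT uses) and compares the closed form against a truncated Euler product; all function names here are made up for the example.</p>

```python
from fractions import Fraction
from math import comb, factorial, pi

def bernoulli(m):
    """Exact B_0..B_m from the standard recurrence (convention B_1 = -1/2)."""
    B = [Fraction(1)]
    for k in range(1, m + 1):
        B.append(-sum(comb(k + 1, j) * B[j] for j in range(k)) / (k + 1))
    return B

def zeta_even(n):
    """zeta(2n) = (-1)**(n+1) * B_2n * (2 pi)**(2n) / (2 * (2n)!)."""
    b = bernoulli(2 * n)[2 * n]
    return float((-1) ** (n + 1) * b) * (2 * pi) ** (2 * n) / (2 * factorial(2 * n))

def zeta_euler(s, primes=(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)):
    """1/zeta(s) as a truncated Euler product; only accurate for large s."""
    inv = 1.0
    for p in primes:
        inv *= 1.0 - p ** (-s)
    return 1.0 / inv

print(zeta_even(1), pi ** 2 / 6)
print(zeta_even(10), zeta_euler(20))
```

<p>For $n = 20$ a dozen primes already exhaust double precision, which is the reason the Euler product wins for large $n$.</p>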
<p>Timings for odd $n$ look a bit different:</p>
<div align="center"><img class="size-full wp-image-543" title="zeta_odd" src="http://fredrikj.net/blog/wp-content/uploads/2012/04/zeta_odd.png" alt="Timings for zeta(n), odd n" width="560" height="350" /></div>
<p>There is a visible speedup at very high precision for small $n$ due to the use of binary splitting (timings can be improved further by writing better code for the special cases $n = 5, 7$), and a speedup for asymptotically large $n$ where Arb again uses the Euler product. In between, the code simply falls back to the MPFR function, so the timings are identical.</p>
<p>The pressing question is whether one can do better than MPFR here. Euler-Maclaurin summation might be a bit faster if optimized carefully, but I can&#8217;t really think of anything that would make a huge difference.</p>
<p>Another way to compute $\zeta(n)$ to high precision when $n$ is not too large is to use <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&#038;jrnid=ppi&#038;paperid=425&#038;option_lang=eng">the algorithm of E. Karatsuba</a>. However, my tests so far indicate that it is much slower than other methods in practice.</p>
<p>These timings are all for computing $\zeta(n)$ as an isolated value. Much better performance is possible when computing $\zeta(2), \zeta(3), \ldots, \zeta(n)$ simultaneously. More on that at a later time&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://fredrikj.net/blog/2012/04/algorithm-selection-for-zetan/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>High-precision ball arithmetic</title>
		<link>http://fredrikj.net/blog/2012/04/high-precision-ball-arithmetic/</link>
		<comments>http://fredrikj.net/blog/2012/04/high-precision-ball-arithmetic/#comments</comments>
		<pubDate>Thu, 05 Apr 2012 15:39:20 +0000</pubDate>
		<dc:creator>Fredrik</dc:creator>
				<category><![CDATA[flint]]></category>
		<category><![CDATA[math]]></category>

		<guid isPermaLink="false">http://fredrikj.net/blog/?p=478</guid>
		<description><![CDATA[About a week ago, I started working on a C library implementing arbitrary-precision real balls, creatively titled Arb. It&#8217;s mostly based on FLINT types, using MPFR for testing and for fallback code. Ball arithmetic is intended to allow efficient, rigorous &#8230; <a href="http://fredrikj.net/blog/2012/04/high-precision-ball-arithmetic/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>About a week ago, I started working on a C library implementing arbitrary-precision real balls, creatively titled <a href="https://github.com/fredrik-johansson/arb">Arb</a>. It&#8217;s mostly based on <a href="http://www.flintlib.org/">FLINT</a> types, using <a href="http://mpfr.org">MPFR</a> for testing and for fallback code.</p>
<p>Ball arithmetic is intended to allow efficient, rigorous high-precision numerical evaluation. A real ball is implemented as an interval $[x-r, x+r] \times 2^e$ where $x, e, r$ are integers. This representation supports error propagation while avoiding the ~2x speed and memory overhead of the more conventional endpoint interval representation $[a \times 2^e, b \times 2^f]$.</p>
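<p>To make the representation concrete, here is a toy model of a ball $[x-r, x+r] \times 2^e$ and its multiplication, written in Python with built-in integers. It is a sketch of the general technique, not of the <tt>arb_t</tt> implementation: the class <tt>Ball</tt>, its methods and the rounding policy are assumptions made for illustration.</p>

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass
class Ball:
    """The set [x - r, x + r] * 2**e with integer x, r, e (r nonnegative)."""
    x: int
    r: int
    e: int

    def mul(self, other, prec=64):
        """Product ball, with the midpoint cut back to about prec bits."""
        x = self.x * other.x
        # |(x1 + d1)*(x2 + d2) - x1*x2| is at most |x1|*r2 + |x2|*r1 + r1*r2
        r = abs(self.x) * other.r + abs(other.x) * self.r + self.r * other.r
        e = self.e + other.e
        shift = max(0, abs(x).bit_length() - prec)
        if shift:
            x >>= shift                   # floor: loses less than one new unit
            r = -((-r) >> shift) + 1      # round the radius up, add that unit
            e += shift
        return Ball(x, r, e)

    def contains(self, q):
        """True if the rational number q lies inside the ball."""
        scale = Fraction(2) ** self.e
        return (self.x - self.r) * scale <= q <= (self.x + self.r) * scale

a = Ball(3 * 2 ** 70 + 12345, 5, -70)     # roughly 3, tiny radius
b = Ball(-(7 * 2 ** 70 + 999), 3, -70)    # roughly -7, tiny radius
c = a.mul(b)
print(c)
```

<p>Only the midpoint needs a full-precision multiplication; the radius is a handful of cheap operations, which is where the saving over endpoint intervals comes from.</p>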
<p>There are two drawbacks: balls are somewhat less flexible, and they have slightly higher overhead at low precision. At the moment multiplying two <tt>arb_t</tt> numbers is about 3x slower than multiplying two <tt>mpfr_t</tt> numbers at 100-400 bits (which would make it somewhat less efficient than <a href="http://perso.ens-lyon.fr/nathalie.revol/software.html">MPFI</a> arithmetic). Above 1000 bits, the difference is only a few percent.</p>
<p>The current code is completely naive, and with a bit of low-level hacking it should be possible to get similar performance to a single <tt>mpfr_t</tt> multiplication at 200-1000 bits. Up to perhaps 256 bits, it would be attractive to have faster, fixed-precision interval types as a complement. (There is actually some prototype code for this hidden somewhere in a branch of my FLINT repository on GitHub&#8230;)</p>
<p>I will soon add complex numbers, polynomial balls, and matrix balls (based on exact FLINT polynomials and matrices). In fact, the main reason that I created an <tt>arb_t</tt> scalar type at all is to simplify interfacing with and writing test code for polynomial balls. But the type can clearly be useful on its own. When the code stabilizes enough, I plan to merge all this back into FLINT.</p>
<p>Part of the design of the <tt>arb_t</tt> type is that it can hold small integers efficiently, only truncating them to approximate numbers when they grow very large. This is useful for performing <a href="http://en.wikipedia.org/wiki/Binary_splitting">binary splitting</a> on infinite series, reducing memory usage (and possibly improving performance) compared to using exact integers throughout.</p>
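<p>For readers who have not seen binary splitting before, here is the exact-integer version for the series $e = \sum_k 1/k!$ in Python. It is a minimal sketch with names of my choosing (<tt>bsplit</tt>, <tt>e_digits</tt>); it keeps every intermediate integer exact, which is the baseline that a ball type would improve on by truncating large intermediate values.</p>

```python
def bsplit(a, b):
    """Return (P, Q) with P/Q = sum of a!/k! for k = a+1, ..., b."""
    if b - a == 1:
        return 1, b
    m = (a + b) // 2
    p1, q1 = bsplit(a, m)
    p2, q2 = bsplit(m, b)
    # Merge the two halves: the right half is scaled by a!/m! = 1/q1.
    return p1 * q2 + p2, q1 * q2

def e_digits(digits):
    """floor(e * 10**digits), computed with exact integers throughout."""
    # Pick n with n! >= 10**(digits + 2), so the omitted tail is negligible.
    n, f = 1, 1
    while f < 10 ** (digits + 2):
        n += 1
        f *= n
    p, q = bsplit(0, n)
    return (10 ** digits * (p + q)) // q   # e is about 1 + P/Q

print(e_digits(30))
```

<p>The recursion multiplies numbers of similar size at every level, so fast multiplication is used effectively; the cost is that $P$ and $Q$ grow far beyond the final precision.</p>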
<p>The <tt>arb_t</tt> type is still very incomplete (and some functions don&#8217;t quite work as they should in all cases), but the module is functional enough that I&#8217;ve been able to implement binary splitting of a couple of constants. It&#8217;s interesting to see how much of an improvement (if any) ball arithmetic gives over using plain integers, so I&#8217;m including some benchmarks below. These functions are not the goal of the Arb library; I plan to do more useful things with it later, but they&#8217;re fun to look at as a start.</p>
<p>First out are some timings for &pi; using the Chudnovsky algorithm, compared to the MPFR pi function, Mathematica 7, and the slightly modified version of the <a href="http://gmplib.org/pi-with-gmp.html">gmp-chudnovsky</a> program included in FLINT. All times are measured in seconds.</p>
<table>
<tr>
<td>Digits</td>
<td>MMA7</td>
<td>MPFR</td>
<td>gmp-chudnovsky</td>
<td>Arb</td>
</tr>
<tr>
<td>10<sup>5</sup></td>
<td>0.14</td>
<td>0.17</td>
<td>0.05</td>
<td>0.05</td>
</tr>
<tr>
<td>10<sup>6</sup></td>
<td>2.5</td>
<td>3.81</td>
<td>0.92</td>
<td>1.16</td>
</tr>
<tr>
<td>10<sup>7</sup></td>
<td>42.53</td>
<td>67.5</td>
<td>15.9</td>
<td>22.6</td>
</tr>
<tr>
<td>10<sup>8</sup></td>
<td></td>
<td>1380</td>
<td>252</td>
<td>339</td>
</tr>
</table>
<p>Using ball arithmetic turns out to be pretty much exactly as fast as using plain integers (timings not included), so at least the <tt>arb_t</tt> type is not adding much overhead compared to integers. The gmp-chudnovsky code is around 40% faster because it works with partially factored integers, requiring a lot of extra code (it would be interesting to turn this into a general integer type and find other uses for it) and because it uses heuristic numerical code for the final square root and division. MPFR is slower because it uses the AGM algorithm.  Mathematica is probably slower because it relies on an old GMP version.</p>
<p>Timings for Euler&#8217;s constant (&gamma;) using the Brent-McMillan binary splitting algorithm, compared to Mathematica, MPFR and the integer (fmpz) binary splitting code in the FLINT arith module:</p>
<table>
<tr>
<td>Digits</td>
<td>MMA7</td>
<td>MPFR</td>
<td>fmpz</td>
<td>Arb</td>
</tr>
<tr>
<td>10<sup>4</sup></td>
<td>0.18</td>
<td>0.20</td>
<td>0.12</td>
<td>0.07</td>
</tr>
<tr>
<td>10<sup>5</sup></td>
<td>6.42</td>
<td>5.00</td>
<td>3.14</td>
<td>2.36</td>
</tr>
<tr>
<td>10<sup>6</sup></td>
<td>138</td>
<td>174</td>
<td>66.5</td>
<td>51.7</td>
</tr>
</table>
<p>Ball arithmetic gives a small speedup. Memory usage is reduced significantly compared to the fmpz code. I was also able to compute 10<sup>7</sup> digits, but I lost the output of how long it took (roughly half an hour, I think). MPFR is slower because it uses a different algorithm.</p>
<p>The speedup for the binary splitting stage is actually larger than indicated, because 1/3 of the total time is spent computing log(<i>n</i>) with MPFR, where <i>n</i> is a parameter. One can save time by choosing <i>n</i> to be a power of two (log(2) is fast), but this makes the binary splitting slower when <i>n</i> is much smaller than the next power of two. I don&#8217;t know a general way to compute log(<i>n</i>) faster. Perhaps one could round <i>n</i> up to $m 2^e$ where $m \le 7$, and try to compute log(2), log(3), log(5), log(7) quickly. It&#8217;s worth pointing out that the binary splitting for a power-of-two parameter can be optimized a bit by using bit shifting; this is not implemented yet.</p>
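<p>The rounding idea can be sketched in a few lines of Python. This is only an illustration of the suggestion above, not code from Arb; the helper name <tt>round_up_smooth</tt> is hypothetical.</p>

```python
def round_up_smooth(n):
    """Return (value, m, e) with value = m * 2**e >= n as small as possible, 1 <= m <= 7."""
    best = None
    for m in range(1, 8):
        e = 0
        while m << e < n:      # grow the power of two until m * 2**e reaches n
            e += 1
        candidate = (m << e, m, e)
        if best is None or candidate < best:
            best = candidate
    return best
```

<p>For example, <tt>round_up_smooth(100)</tt> gives 112 = 7 &middot; 2<sup>4</sup>, so log(112) = log(7) + 4 log(2), whereas the next power of two would be 128.</p>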
<p>Riemann zeta constants $\zeta(n)$ for integer <i>n</i>, compared to Mathematica, MPFR and the fmpz binary splitting code in the FLINT arith module:</p>
<table>
<tr>
<td>n</td>
<td>Digits</td>
<td>MMA7</td>
<td>MPFR</td>
<td>fmpz</td>
<td>Arb</td>
</tr>
<tr>
<td>13</td>
<td>10<sup>4</sup></td>
<td>13.7</td>
<td>0.24</td>
<td>0.58</td>
<td>0.19</td>
</tr>
<tr>
<td>13</td>
<td>10<sup>5</sup></td>
<td></td>
<td>27.3</td>
<td>14.0</td>
<td>6.54</td>
</tr>
<tr>
<td>13</td>
<td>10<sup>6</sup></td>
<td></td>
<td>3639</td>
<td>287</td>
<td>148</td>
</tr>
<tr>
<td>43</td>
<td>10<sup>4</sup></td>
<td>13.7</td>
<td>0.36</td>
<td>2.29</td>
<td>0.50</td>
</tr>
<tr>
<td>43</td>
<td>10<sup>5</sup></td>
<td></td>
<td>43.2</td>
<td>51.6</td>
<td>18.2</td>
</tr>
<tr>
<td>43</td>
<td>10<sup>6</sup></td>
<td></td>
<td></td>
<td>1028</td>
<td>417</td>
</tr>
</table>
<p>Here ball arithmetic gives a nice speedup which increases with larger <i>n</i>. Mathematica and MPFR both use algorithms with complexity somewhere around $\tilde O(b^2)$ where $b$ is the number of digits, whereas the algorithm I implemented has complexity $\tilde O(nb)$. Due to its overhead, it only becomes the faster choice when <i>n</i> is small and <i>b</i> is at least in the tens of thousands.</p>
<p>Finally, using a separate hypergeometric series for $\zeta(3)$, this particular constant can be computed even faster. Here ball arithmetic gives a very small speed improvement over using fmpzs:</p>
<table>
<tr>
<td>Digits</td>
<td>MMA7</td>
<td>fmpz</td>
<td>Arb</td>
</tr>
<tr>
<td>10<sup>5</sup></td>
<td>1.68</td>
<td>0.29</td>
<td>0.24</td>
</tr>
<tr>
<td>10<sup>6</sup></td>
<td>38.9</td>
<td>6.52</td>
<td>5.30</td>
</tr>
<tr>
<td>10<sup>7</sup></td>
<td>772</td>
<td>127</td>
<td>107</td>
</tr>
</table>
<p>(I&#8217;m omitting MPFR because it doesn&#8217;t have a separate algorithm implemented for this constant.)</p>
<p>Later on, I will add faster code for $\zeta(n)$ where <i>n</i> is large or the precision is small, and code for combined evaluation of $\zeta(2), \zeta(3), \ldots, \zeta(n)$ (used for series expansions of various functions).</p>
]]></content:encoded>
			<wfw:commentRss>http://fredrikj.net/blog/2012/04/high-precision-ball-arithmetic/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Partitions into the quintillions</title>
		<link>http://fredrikj.net/blog/2012/03/partitions-into-the-quintillions/</link>
		<comments>http://fredrikj.net/blog/2012/03/partitions-into-the-quintillions/#comments</comments>
		<pubDate>Sat, 31 Mar 2012 14:35:00 +0000</pubDate>
		<dc:creator>Fredrik</dc:creator>
				<category><![CDATA[flint]]></category>
		<category><![CDATA[partitions]]></category>
		<category><![CDATA[sage]]></category>

		<guid isPermaLink="false">http://fredrikj.net/blog/?p=65</guid>
		<description><![CDATA[One of my biggest undertakings last year was to implement the partition function $p(n)$ in FLINT. With this code, I was able to set a record by computing the number of partitions of $10^{19}$, or 10,000,000,000,000,000,000 (ten quintillion). The number &#8230; <a href="http://fredrikj.net/blog/2012/03/partitions-into-the-quintillions/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>One of my biggest undertakings last year was to implement the <a href="http://en.wikipedia.org/wiki/Partition_(number_theory)">partition function</a> $p(n)$ in <a href="http://flintlib.org/">FLINT</a>. With this code, I was able to set a record by computing the number of partitions of $10^{19}$, or 10,000,000,000,000,000,000 (ten quintillion).</p>
<p>The number $p(10^{19})$ turns out to have 3,522,804,578 digits, starting and ending with</p>
<pre>
56469284039962075996762611156427010823552403269435
56912115036797282138529825391237281687215156415242
...
          &lt;3522804378 digits omitted>
                                               ...
18143637106903146536175827132211646720469971909621
95008234537397439410884324996338596264493674631046
</pre>
<p>More simply, $p(10^{19}) \approx 5.65 \times 10^{3,522,804,577}$ (this kind of approximation is easy to compute; it is the least significant digits that are challenging &#8212; or the more significant digits, for the $p$-adically inclined). This number answers the following practically important combinatorial problem: in how many ways can one choose sticks whose lengths are multiples of one meter, such that they add up to $10^{19}$ meters &#8212; approximately the thickness of the Milky Way &#8212; when placed end to end?</p>
<p>Okay, computing $p(10^n)$ is just a frivolous benchmark problem. Perhaps more interestingly, at least for number theorists, I generated over 22 billion new Ramanujan-type congruences (identities of the form $p(Ak+B) \equiv 0 \bmod m$ for all $k$) using the <a href="http://www.springerlink.com/content/m8716053n104wp81/">algorithm of Rhiannon Weaver</a>. This greatly extends the computation of 76,065 congruences done by Weaver in 2001. An example of a new identity is</p>
<p>$$p(28995244292486005245947069 k + 28995221336976431135321047) \equiv 0 \; \operatorname{mod} \; 29$$</p>
<p>for all values of $k$.</p>
<p>Finding the 22 billion new congruences required evaluating the partition function of approximately 470,000 distinct values up to $n \approx 10^{13}$. This took approximately 150 CPU days distributed over 40-48 cores on a computer at the University of Warwick (access provided courtesy of Bill Hart).</p>
<p>The computation of $p(10^{19})$ took just less than 100 hours of CPU time and roughly 150 GiB of RAM, running on a single core. With 2-3 months of CPU time and 2 TB of swap space, it should be possible to compute the number of partitions of one sextillion ($10^{21}$) with the FLINT code, if anyone is up for the task (unfortunately, parallelization for a single $n$ is not supported, and would be difficult to implement to save more than a factor of two).</p>
<p>There are actually three implementations of the partition function in FLINT: one for computing $p(0), p(1), \ldots, p(n-1)$, another for computing the same set of values modulo a word-size integer, and finally a function for computing the isolated value $p(n)$. The first two are straightforward applications of FLINT&#8217;s fast power series arithmetic, and there is perhaps not that much to say about them (with several gigabytes of RAM, you can comfortably compute perhaps $10^6$ exact values or $10^9$ values modulo a small integer).</p>
<p>The computation of isolated values uses the Hardy-Ramanujan-Rademacher formula</p>
<p>$$p(n)=\frac{1}{\pi \sqrt{2}} \sum_{k=1}^\infty \sqrt{k}\, A_k(n)\,<br />
\frac{d}{dn} \left(<br />
\frac {1} {\sqrt{n-\frac{1}{24}}}<br />
\sinh \left[ \frac{\pi}{k}<br />
\sqrt{\frac{2}{3}\left(n-\frac{1}{24}\right)}\right]<br />
\right)$$</p>
<p>where</p>
<p>$$A_k(n) = \sum_{0 \le m &lt; k, (m,k) = 1} \,e^{ \pi i \left[ s(m,\, k) \;-\; \frac{1}{k} 2 nm \right]}$$</p>
<p>and $s(m,k)$ denotes a Dedekind sum. This is a numerical infinite series that has to be evaluated approximately. Truncating it appropriately and using a sufficiently high numerical precision gives an approximation that is guaranteed to round to the correct integer.</p>
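<p>To make the formula concrete, here is a naive Python sketch that evaluates the truncated series in double precision and rounds the result. It is only meant for small $n$ (up to a few hundred); it evaluates $A_k(n)$ directly from the definition and is not how the FLINT code works. The default number of terms is an assumption chosen generously for small inputs, and all function names are illustrative.</p>

```python
from fractions import Fraction
from math import cos, cosh, floor, gcd, isqrt, pi, sinh, sqrt

def dedekind_sum(m, k):
    """Dedekind sum s(m, k), computed exactly from the sawtooth definition."""
    def sawtooth(x):
        return Fraction(0) if x.denominator == 1 else x - floor(x) - Fraction(1, 2)
    return sum(sawtooth(Fraction(i, k)) * sawtooth(Fraction(m * i, k))
               for i in range(1, k))

def A(k, n):
    """Exponential sum A_k(n); it is real, so only the cosines are added."""
    total = 0.0
    for m in range(k):
        if gcd(m, k) == 1:
            # Reduce the angle modulo 2 (in units of pi) before going to floats.
            angle = (dedekind_sum(m, k) - Fraction(2 * n * m, k)) % 2
            total += cos(pi * float(angle))
    return total

def partitions(n, terms=None):
    """p(n) for small n >= 1 from the truncated Hardy-Ramanujan-Rademacher series."""
    if terms is None:
        terms = isqrt(n) + 30   # assumed cutoff, ample for small n
    x = n - 1 / 24
    total = 0.0
    for k in range(1, terms + 1):
        c = pi / k * sqrt(2 / 3)
        u = c * sqrt(x)
        # d/dn of sinh(c*sqrt(x))/sqrt(x), written out explicitly.
        deriv = c * cosh(u) / (2 * x) - sinh(u) / (2 * x ** 1.5)
        total += sqrt(k) * A(k, n) * deriv
    return round(total / (pi * sqrt(2)))
```

<p>With these definitions, <tt>partitions(100)</tt> returns 190569292. The quadratic cost of each $A_k(n)$ in this sketch is exactly the kind of overhead that a serious implementation has to avoid.</p>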
<p>The FLINT implementation has complexity $O(n^{1/2+\varepsilon})$, which is quasi-optimal since $p(n)$ has about $n^{1/2}$ digits. Implementing the Hardy-Ramanujan-Rademacher formula with this complexity is rather complicated. Ostensibly, there are $n^{3/2}$ terms in the nested sums, and on top of that you need to work with numbers up to $O(n^{1/2})$ digits.</p>
<p>It was only after a lot of digging in the literature that I stumbled across a paper of A. L. Whiteman that gives identities for factoring the exponential sums $A_k(n)$ into short cosine products, reducing the number of terms. This uses quite a bit of modular arithmetic, and implementing it efficiently relies on the fast routines in FLINT for computing gcd, modular square roots, Legendre symbols, factorizations, etc. of word-size integers.</p>
<p>The numerical computations are done using a combination of <a href="http://mpfr.org">MPFR</a> arithmetic, some custom routines for high-precision transcendental functions, and machine precision arithmetic for small terms, with carefully managed precision across the whole computation.</p>
<p>Until now, the largest reported computations of $p(n)$ have involved $n$ around $10^9$. Some software certainly allowed going higher than $10^9$, but this was incredibly slow. My code makes it easy to go much higher on commodity hardware, the primary limitation being available memory. For example, computing $p(10^{16})$ with FLINT takes less than two hours on my laptop, using all the available 3 GiB of RAM (plus some swap space).</p>
<p>Compared to the previously best software (Mathematica and Sage), the FLINT code runs around 500 times faster for large $n$. For example $p(2^{32}-1)$ takes half a second in FLINT and around 200 seconds in both Sage and Mathematica.</p>
<p><a href="http://fredrikj.net/blog/wp-content/uploads/2012/01/party.png"><img class=" wp-image-88 aligncenter" title="Partitions performance" src="http://fredrikj.net/blog/wp-content/uploads/2012/01/party.png" alt="" width="505" height="343" /></a></p>
<p>The image is a loglog plot of the time needed to compute $p(n)$ using Mathematica 7 (green circles), Sage 4.7 (red triangles), and FLINT (blue squares). The thin dotted line indicates the slope of the trivial lower complexity bound $\Omega(n^{1/2})$ just for writing out the result. A quick extrapolation suggests that $p(10^{19})$ would take about a decade to compute with Mathematica and a million years with Sage (after patching Sage to allow $n$ larger than 32 bits).</p>
<p>More details about the FLINT partition function implementation (and the computational results) will be given in a forthcoming paper.</p>
]]></content:encoded>
			<wfw:commentRss>http://fredrikj.net/blog/2012/03/partitions-into-the-quintillions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Factorials mod n and Wilson&#8217;s theorem</title>
		<link>http://fredrikj.net/blog/2012/03/factorials-mod-n-and-wilsons-theorem/</link>
		<comments>http://fredrikj.net/blog/2012/03/factorials-mod-n-and-wilsons-theorem/#comments</comments>
		<pubDate>Mon, 19 Mar 2012 11:21:03 +0000</pubDate>
		<dc:creator>Fredrik</dc:creator>
				<category><![CDATA[flint]]></category>
		<category><![CDATA[math]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://fredrikj.net/blog/?p=330</guid>
		<description><![CDATA[Wilson&#8217;s theorem states that an integer greater than 1 is a prime if and only if $(n-1)! \equiv -1 \bmod n$. This immediately gives a simple algorithm to test primality of an integer: just multiply out $1 \times 2 \times &#8230; <a href="http://fredrikj.net/blog/2012/03/factorials-mod-n-and-wilsons-theorem/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Wilson%27s_theorem">Wilson&#8217;s theorem</a> states that an integer $n$ greater than 1 is a prime if and only if $(n-1)! \equiv -1 \bmod n$. This immediately gives a simple algorithm to test primality of an integer: just multiply out $1 \times 2 \times \cdots \times (n-1)$, reducing each intermediate product modulo $n$, and check that the final result equals $n - 1$.</p>
<p>The running time is obviously $O(n)$ (assuming that $n$ fits in a single machine word so that arithmetic has constant cost). While elegant, this algorithm is useless in practice because even the most basic non-stupid primality test, <a href="http://en.wikipedia.org/wiki/Trial_division">trial division</a>, only costs $O(n^{1/2})$ in the worst case.</p>
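<p>The naive test just described is only a few lines in Python (an illustrative sketch; the function name is made up):</p>

```python
def is_prime_wilson(n):
    """Primality test via Wilson's theorem, using O(n) modular multiplications."""
    if n < 2:
        return False
    product = 1
    for i in range(2, n):
        product = product * i % n   # reduce after every multiplication
    return product == n - 1
```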
<p>Perhaps surprisingly, the complexity of primality testing using Wilson&#8217;s theorem can be brought down to about $O(n^{1/2})$ as well. The idea is to use fast polynomial arithmetic to compute the factorial faster than by the naive method. Assuming for simplicity that $n - 1$ is a perfect square, let $m = (n-1)^{1/2}$. We form the polynomial $P(x) = (x+1)(x+2) \cdots (x+m)$ and evaluate $P(x)$ simultaneously at the points $x = 0, m, \ldots, m(m-1)$. Then $P(0) P(m) \cdots P(m(m-1)) = (n-1)!$. If $n - 1$ is not a perfect square, we choose $m = \lfloor (n-1)^{1/2} \rfloor$ and fill in the missing factors by repeated naive multiplication.</p>
<p>All the required operations can be done in roughly $O(n^{1/2})$ time using FFT-based polynomial multiplication and balanced subproduct trees. The complexity is actually slightly higher, about $O(n^{1/2} \log^3 n)$, and the constant overhead is huge, so the method is still considerably slower than trial division. But it is interesting to see how it performs in practice.</p>
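<p>The decomposition itself can be checked with a short Python sketch. Note the assumption: this version uses schoolbook polynomial multiplication and Horner evaluation at each point, so it still costs about $O(n)$ operations. It only demonstrates the identity $P(0) P(m) \cdots P(m(m-1)) = (m^2)!$, not the speedup, which needs FFT multiplication and subproduct trees. The function name is hypothetical.</p>

```python
from math import isqrt

def factorial_mod_blocks(n):
    """(n-1)! mod n, computed as a product of values of P(x) = (x+1)...(x+m)."""
    top = n - 1
    m = isqrt(top)
    # Coefficients of P(x) modulo n, lowest degree first.
    coeffs = [1]
    for j in range(1, m + 1):
        expanded = [0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):      # multiply by (x + j)
            expanded[i] = (expanded[i] + c * j) % n
            expanded[i + 1] = (expanded[i + 1] + c) % n
        coeffs = expanded
    result = 1 % n
    for t in range(m):                      # evaluate at x = 0, m, ..., m(m-1)
        x = t * m
        value = 0
        for c in reversed(coeffs):          # Horner's rule
            value = (value * x + c) % n
        result = result * value % n
    for f in range(m * m + 1, top + 1):     # missing factors if n-1 is not a square
        result = result * f % n
    return result
```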
<p>Since I recently <a href="https://github.com/fredrik-johansson/flint2/commit/0e204e477a56e08a1ae1bb409191800009a27e98">implemented</a> fast multipoint evaluation in <a href="https://github.com/fredrik-johansson/flint2">FLINT</a>, the fast factorial algorithm became easy to implement as well. In my repository, it is now enabled by default for computing factorials modulo an integer (<tt>n_factorial_mod2_preinv</tt>) when the input is large enough; the code is <a href="https://github.com/fredrik-johansson/flint2/blob/trunk/ulong_extras/factorial_fast_mod2_preinv.c">here</a>. Here is how it compares to the naive algorithm for computing $(n-1)!$ modulo $n$:</p>
<table>
<tr>
<th><i>n</i></th>
<th>Naive factorial</th>
<th>Fast factorial</th>
</tr>
<tr>
<td>10</td>
<td>12 ns</td>
<td>1.2 &mu;s</td>
</tr>
<tr>
<td>10<sup>2</sup></td>
<td>0.46 &mu;s</td>
<td>4.4 &mu;s</td>
</tr>
<tr>
<td>10<sup>3</sup></td>
<td>7 &mu;s</td>
<td>22 &mu;s </td>
</tr>
<tr>
<td>10<sup>4</sup></td>
<td>78 &mu;s</td>
<td>100 &mu;s </td>
</tr>
<tr>
<td>10<sup>5</sup></td>
<td>0.89 ms</td>
<td>0.52 ms</td>
</tr>
<tr>
<td>10<sup>6</sup></td>
<td>9.8 ms</td>
<td>3.1 ms</td>
</tr>
<tr>
<td>10<sup>7</sup></td>
<td>110 ms</td>
<td>18 ms</td>
</tr>
<tr>
<td>10<sup>8</sup></td>
<td>1.2 s</td>
<td>0.12 s</td>
</tr>
<tr>
<td>10<sup>9</sup></td>
<td>12 s</td>
<td>0.71 s</td>
</tr>
<tr>
<td>10<sup>10</sup></td>
<td>151 s</td>
<td>3.5 s</td>
</tr>
<tr>
<td>10<sup>11</sup></td>
<td>1709 s</td>
<td>15 s</td>
</tr>
<tr>
<td>10<sup>12</sup></td>
<td>5 h (est.)</td>
<td>70 s</td>
</tr>
<tr>
<td>10<sup>13</sup></td>
<td>50 h (est.) </td>
<td>307 s</td>
</tr>
<tr>
<td>10<sup>14</sup></td>
<td>500 h (est.) </td>
<td>1282 s</td>
</tr>
</table>
<p>The numbers agree reasonably well with theory (the naive algorithm does not quite take a constant multiple of $n$ time because the arithmetic is done slightly faster when several factors can be accumulated in a single word, and the ratio per order of magnitude for the fast algorithm is a bit larger than 4 and not $10^{1/2} \approx 3.16$, but that is reasonable considering the extra log factors in the complexity). Around 10<sup>15</sup>, my laptop runs out of memory. To compute larger factorials, one could choose $m$ smaller than the square root of the input, performing several multipoint evaluations. Choosing a fixed large $m$ (say 10<sup>7</sup>) gives an algorithm with $O(n)$ complexity, but a much smaller constant than the naive factorial algorithm.</p>
<p>Although we have achieved a factor 1000 speedup over the naive factorial algorithm and made Wilson&#8217;s theorem a <i>feasible</i> primality test for numbers as large as 15 digits without requiring special hardware or patience, it remains completely useless for practical purposes. Trial division is around 10,000 times faster (dividing by all integers up to $(10^{14})^{1/2}$ at 10 ns per iteration takes 0.1 seconds), and sophisticated algorithms are much faster still (the FLINT function <tt>n_is_prime</tt> tests primality of a number this size in about 7 &mu;s, and <tt>n_factor</tt> factors it in about 500 &mu;s).</p>
]]></content:encoded>
			<wfw:commentRss>http://fredrikj.net/blog/2012/03/factorials-mod-n-and-wilsons-theorem/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Blog moved</title>
		<link>http://fredrikj.net/blog/2012/01/blog-moved/</link>
		<comments>http://fredrikj.net/blog/2012/01/blog-moved/#comments</comments>
		<pubDate>Thu, 26 Jan 2012 12:57:18 +0000</pubDate>
		<dc:creator>Fredrik</dc:creator>
				<category><![CDATA[blog]]></category>

		<guid isPermaLink="false">http://fredrikj.net/blog/?p=199</guid>
		<description><![CDATA[I&#8217;m moving my blog from Blogger (http://fredrik-j.blogspot.com/) to a WordPress installation on my own domain (http://fredrikj.net/blog). This way I won&#8217;t be at Google&#8217;s mercy in the future, I can tinker with things more easily, and I can use cool plugins like TeX &#8230; <a href="http://fredrikj.net/blog/2012/01/blog-moved/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m moving my blog from Blogger (<a href="http://fredrik-j.blogspot.com/">http://fredrik-j.blogspot.com/</a>) to a WordPress installation on my own domain (<a href="http://fredrikj.net/blog">http://fredrikj.net/blog</a>). This way I won&#8217;t be at Google&#8217;s mercy in the future, I can tinker with things more easily, and I can use cool plugins like TeX conversion $$24 \sum_{k=1}^{\infty} \frac{1}{k^5} = \int_0^{\infty} \frac{x^4}{e^x-1} dx.$$</p>
<p>I&#8217;ve imported the entries from the old blog. Unfortunately, the plugin I used messed up all &lt;pre&gt; text, so there are now a million broken code examples. I have fixed up the most recent half of the archive manually, but the oldest posts are still broken.</p>
<p>Not that I&#8217;ve been blogging that much recently, but you never know&#8230; it goes in waves.</p>
]]></content:encoded>
			<wfw:commentRss>http://fredrikj.net/blog/2012/01/blog-moved/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Some FLINT 2.2 highlights</title>
		<link>http://fredrikj.net/blog/2011/06/some-flint-2-2-highlights/</link>
		<comments>http://fredrikj.net/blog/2011/06/some-flint-2-2-highlights/#comments</comments>
		<pubDate>Wed, 08 Jun 2011 08:58:00 +0000</pubDate>
		<dc:creator>Fredrik</dc:creator>
				<category><![CDATA[flint]]></category>
		<category><![CDATA[mpmath]]></category>
		<category><![CDATA[sage]]></category>
		<category><![CDATA[sympy]]></category>

		<guid isPermaLink="false">http://fredrikj.net/blog/?p=64</guid>
		<description><![CDATA[Version 2.2 of FLINT (Fast Library for Number Theory) was released last weekend. Some updated benchmarks are available. In this blog post, I&#8217;m going to talk a bit about features I contributed in this version. With apologies to Sebastian Pancratz &#8230; <a href="http://fredrikj.net/blog/2011/06/some-flint-2-2-highlights/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Version 2.2 of <a href="http://flintlib.org/">FLINT</a> (Fast Library for Number Theory) was <a href="https://groups.google.com/group/flint-devel/browse_thread/thread/442a98d8bc31470a?hl=en">released</a> last weekend. Some updated <a href="http://sage.math.washington.edu/home/fredrik/flint/timings.html">benchmarks</a> are available.</p>
<p>In this blog post, I&#8217;m going to talk a bit about features I contributed in this version. With apologies to <a href="https://github.com/SPancratz/flint2/">Sebastian Pancratz</a> who wrote a whole lot of code as well &#8212; in particular, a new module for computing with p-adic numbers, and a module for rational functions! Bill Hart also implemented a faster polynomial GCD, which is a quite important update since GCD is crucial for most polynomial business. Anyhow&#8230;</p>
<h2>Polynomial interpolation</h2>
<p>I&#8217;ve added various functions for polynomial interpolation to the <a href="https://github.com/fredrik-johansson/flint2/tree/trunk/fmpz_poly">fmpz_poly</a> module. In general, these can be used to speed up various computations involving integer or rational polynomials by mapping a given problem to (Z/nZ)[x], Z or even Z/nZ, taking advantage of fast arithmetic in those rings, and then via interpolation recovering a result in Z[x] or Q[x].</p>
<p>Firstly, there are some new Chinese Remainder Theorem functions for integer polynomials, allowing you to reconstruct an integer polynomial from a bunch of polynomials with coefficients modulo different primes. Straightforward code (the actual work is done by functions in the fmpz module), but useful to have. The CRT functions are used by the new modular GCD code.</p>
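<p>As a rough illustration of what coefficient-wise CRT reconstruction does, here is a small Python sketch (not the FLINT API; the names <tt>crt_pair</tt> and <tt>crt_poly</tt> are made up, and <tt>pow(x, -1, m)</tt> needs Python 3.8 or later). Each polynomial is a list of coefficients reduced modulo one prime, and the result is lifted to the symmetric range so that negative coefficients are recovered.</p>

```python
def crt_pair(r1, m1, r2, m2):
    """Combine x = r1 (mod m1) and x = r2 (mod m2) for coprime m1, m2."""
    t = (r2 - r1) * pow(m1, -1, m2) % m2
    return r1 + m1 * t, m1 * m2

def crt_poly(residues, primes):
    """Rebuild integer coefficients from their images modulo several primes."""
    coeffs = list(residues[0])
    modulus = primes[0]
    for poly, p in zip(residues[1:], primes[1:]):
        coeffs = [crt_pair(c, modulus, r, p)[0] for c, r in zip(coeffs, poly)]
        modulus *= p
    # Map from [0, modulus) to the symmetric range around zero.
    return [c - modulus if c > modulus // 2 else c for c in coeffs]
```

<p>The reconstruction is only correct when the product of the primes exceeds twice the largest coefficient in absolute value.</p>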
<p>There are also functions for evaluating an integer polynomial on a set of points, and forming the interpolating polynomial (generally with rational coefficients) given a set of points and the values at those points.</p>
<p>Finally, user-friendly functions for evaluation and interpolation at a power of two (Kronecker segmentation) have been added. The code for this is actually a very old part of FLINT, and possibly some of the most complicated code in the library (packing bits efficiently is surprisingly hard). The new functions just wrap this functionality, but take care of memory management and various special cases, so you can now just safely do something like:</p>
<pre>fmpz_t z;
fmpz_init(z);

/* max_bits returns a negative value if any coefficient is negative */
long bits = FLINT_ABS(fmpz_poly_max_bits(poly)) + 1; /* +1 for signs */
fmpz_poly_bit_pack(z, poly, bits);
fmpz_poly_bit_unpack(poly, z, bits);  /* recover poly */
fmpz_clear(z);</pre>
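<p>The idea behind bit packing can be imitated with Python integers. This is a conceptual sketch only, with hypothetical helper names: it evaluates the polynomial at $2^{bits}$ and reads the signed coefficients back. Multiplying two packed integers and unpacking the result gives the product polynomial, provided <tt>bits</tt> is large enough for the coefficients of the product.</p>

```python
def bit_pack(coeffs, bits):
    """Evaluate the polynomial (lowest degree first) at x = 2**bits."""
    return sum(c << (bits * i) for i, c in enumerate(coeffs))

def bit_unpack(z, bits, length):
    """Recover `length` signed coefficients, each in [-2**(bits-1), 2**(bits-1))."""
    coeffs = []
    half = 1 << (bits - 1)
    mask = (1 << bits) - 1
    for _ in range(length):
        c = z & mask
        if c >= half:          # interpret the low bits as a signed value
            c -= 1 << bits
        coeffs.append(c)
        z = (z - c) >> bits    # exact: z - c is divisible by 2**bits
    return coeffs
```

<p>For instance, packing $1 - 2x + 3x^2$ and $4 - x^2$ with 8 bits per coefficient, multiplying the two integers and unpacking five coefficients yields $4 - 8x + 11x^2 + 2x^3 - 3x^4$.</p>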
<p>Apart from Kronecker segmentation, these functions are not currently asymptotically fast. Fast multi-modulus CRT for coefficient reconstruction is probably not all that important in most circumstances, because it&#8217;s more common to use evaluation-interpolation techniques for polynomials of large degree and small coefficients than the other way around. Nonetheless, polynomials with large coefficients do arise as well. For example, the vector Bernoulli number code in FLINT relies on fast CRT, and currently uses custom code for this.</p>
<p>Polynomial interpolation uses <a href="http://en.wikipedia.org/wiki/Lagrange_polynomial">Lagrange interpolation</a> with barycentric weights, with a few tricks to avoid fractions. This is all implemented using an O(n^2) algorithm, but the actual time complexity is higher due to the fact that the coefficients when working over integers usually will be large, around n! in magnitude.</p>
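<p>For reference, a quadratic-time Lagrange interpolation can be sketched in Python with exact fractions as follows. This is an illustration of the general method, not a transcription of the FLINT code (which avoids fractions); the function name is made up.</p>

```python
from fractions import Fraction

def interpolate(xs, ys):
    """Coefficients (lowest degree first) of the polynomial through (xs[i], ys[i])."""
    n = len(xs)
    # Master polynomial M(x) = prod (x - x_i).
    master = [Fraction(1)]
    for xi in xs:
        master = [Fraction(0)] + master          # multiply by x ...
        for i in range(len(master) - 1):
            master[i] -= xi * master[i + 1]      # ... and subtract x_i * (old polynomial)
    result = [Fraction(0)] * n
    for xj, yj in zip(xs, ys):
        # Synthetic division: quotient = M(x) / (x - x_j).
        quotient = [Fraction(0)] * n
        carry = Fraction(0)
        for i in range(n, 0, -1):
            carry = master[i] + xj * carry
            quotient[i - 1] = carry
        # quotient(x_j) = prod_{i != j} (x_j - x_i), the inverse barycentric weight.
        denom = sum(c * xj ** i for i, c in enumerate(quotient))
        scale = Fraction(yj) / denom
        for i in range(n):
            result[i] += scale * quotient[i]
    return result
```

<p>Integer inputs can produce rational output: interpolating the values 0 and 1 at the points 0 and 2 gives the polynomial $x/2$.</p>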
<p>Here are some timing examples, evaluating and recovering a length-n polynomial with +/-1 coefficients basically as follows:</p>
<pre>x = _fmpz_vec_init(n);
y = _fmpz_vec_init(n);
fmpz_poly_init(P);
fmpz_poly_init(Q);
for (i = 0; i &lt; n; i++)
    fmpz_set_si(x + i, -n/2 + i);  /* entries of x are fmpz values */
fmpz_poly_randtest(P, state, n, 1);
fmpz_poly_evaluate_fmpz_vec(y, P, x, n);
fmpz_poly_interpolate_fmpz_vec(Q, x, y, n);</pre>
<p>The bits column below measures the largest value in y, which grows quite large despite the input polynomial having small coefficients:</p>
<pre>n=8  eval=762 ns  interp=13 us  bits=8  ok=1
n=16  eval=3662 ns  interp=61 us  bits=42  ok=1
n=32  eval=29 us  interp=673 us  bits=-113  ok=1
n=64  eval=136 us  interp=4951 us  bits=-316  ok=1
n=128  eval=625 us  interp=45 ms  bits=-762  ok=1
n=256  eval=2500 us  interp=792 ms  bits=-1779  ok=1
n=512  eval=12 ms  interp=10 s  bits=-4089  ok=1</pre>
<p>As you can see, the interpolation speed is not too bad for small n, but eventually grows out of control. How to do better?</p>
<p>Naive Lagrange interpolation is not optimal: it is possible to do n-point evaluation and interpolation in essentially O(n log<sup>2</sup> n) operations. Such algorithms do not necessarily lead to an improvement over the integers (you still have to deal with coefficient explosion), but they should win over finite fields. So the right solution will perhaps be to add polynomial evaluation/interpolation functions based on modular arithmetic.</p>
<h2>Rational numbers and matrices</h2>
<p>A new module <a href="https://github.com/fredrik-johansson/flint2/tree/trunk/fmpq">fmpq</a> is provided for computing with arbitrary-precision rational numbers. For the user, the fmpq_t type essentially behaves identically to the MPIR mpq_t type. However, an fmpq_t only takes up two words of memory when the numerator and denominator are small (less than 2<sup>62</sup>), whereas an mpq_t always requires six words plus additional heap-allocated space for the actual number data.</p>
<p>The fmpq functions are a bit faster than mpq functions in many cases when the numerator and/or denominator is small. But the main improvement should come for vectors, matrices or polynomials of rational numbers, due to the significantly reduced memory usage and memory management overhead (especially when many entries are zero or integer-valued).</p>
<p>Some higher-level functionality is also provided in the fmpq module, e.g. for rational reconstruction. The functions for computing special rational numbers (like Bernoulli numbers) have also been switched over to the fmpq type. Another supported feature is enumeration of the rationals (using the <a href="http://en.wikipedia.org/wiki/Calkin%E2%80%93Wilf_tree">Calkin-Wilf</a> sequence or by height). Generating the 100 million &#8220;first&#8221; positive rational numbers takes 9.6 seconds done in order of height, or 2.6 seconds in Calkin-Wilf order.</p>
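<p>The Calkin-Wilf order is easy to reproduce in Python using the well-known recurrence $q \mapsto 1/(2\lfloor q \rfloor - q + 1)$. This sketch is independent of the FLINT implementation, and the generator name is illustrative.</p>

```python
from fractions import Fraction
from itertools import islice
from math import floor

def calkin_wilf():
    """Yield every positive rational exactly once, in Calkin-Wilf order."""
    q = Fraction(1)
    while True:
        yield q
        q = 1 / (2 * floor(q) - q + 1)

first_eight = list(islice(calkin_wilf(), 8))
```

<p>Here <tt>first_eight</tt> holds 1, 1/2, 2, 1/3, 3/2, 2/3, 3, 1/4.</p>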
<p>FLINT actually does not use fmpq&#8217;s to represent polynomials over Q (fmpq_poly), and probably never will. The fmpq_poly module represents a polynomial over Q as an integer polynomial with a single common denominator, which is usually faster. The reason for adding the fmpq_t type is that it enabled developing the new <a href="https://github.com/fredrik-johansson/flint2/tree/trunk/fmpq_mat">fmpq_mat</a> module, which implements dense matrices of rational numbers. For matrices, a common-denominator representation would be less convenient and in many cases completely impractical.</p>
<p>The new FLINT fmpq_mat module is very fast, or at least very non-slow. It is easy to find examples where it does a simple computation a thousand times faster than the rational matrices in Sage.</p>
<p>There&#8217;s not actually much code in the fmpq_mat module itself; it does almost all &#8220;level 3&#8221; linear algebra (computations requiring matrix multiplication or Gaussian elimination) by clearing denominators and computing over the integers. This approach is in fact stolen shamelessly from Sage, but the functions in Sage are highly unoptimized in many cases. The code in Sage still wins for many sufficiently large problems as it has asymptotically fast algorithms for many things we do not (like computing null spaces). See the <a href="http://sage.math.washington.edu/home/fredrik/flint/timings.html">benchmarks page</a> for more details.</p>
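<p>The denominator-clearing strategy looks roughly like this for matrix multiplication. This is a Python sketch with hypothetical helper names; it uses a single common denominator per matrix, which is one possible choice and not necessarily the one FLINT makes.</p>

```python
from fractions import Fraction
from math import gcd

def clear_denominators(M):
    """Return (integer matrix, d) such that M = (integer matrix) / d."""
    d = 1
    for row in M:
        for q in row:
            d = d * q.denominator // gcd(d, q.denominator)   # running lcm
    return [[int(q * d) for q in row] for row in M], d

def matmul_via_integers(A, B):
    """Multiply rational matrices by multiplying integer matrices."""
    A_int, dA = clear_denominators(A)
    B_int, dB = clear_denominators(B)
    inner = len(B_int)
    product = [[sum(A_int[i][k] * B_int[k][j] for k in range(inner))
                for j in range(len(B_int[0]))] for i in range(len(A_int))]
    # Put the denominators back once, at the very end.
    return [[Fraction(v, dA * dB) for v in row] for row in product]
```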
<p>I should not forget to mention that I&#8217;ve implemented Dixon&#8217;s p-adic algorithm for solving Ax = b for nonsingular square A. (I wish I had a good link for Dixon&#8217;s algorithm here, but sadly it doesn&#8217;t appear to be described conveniently anywhere on the web. The original paper is &#8220;<a href="http://www.springerlink.com/content/g711u0m541351u71/">Exact solution of linear equations using P-adic expansions</a>&#8221;, if you have the means to get through the Springer paywall.)</p>
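<p>In lieu of a link, here is a compact Python sketch of the idea behind Dixon&#8217;s method: invert A modulo a prime p once, lift the solution one p-adic digit at a time, and finish with rational reconstruction. The prime, the fixed number of lifting steps and all function names are assumptions made for this toy version (a real implementation derives the number of steps from bounds on the solution), and it requires Python 3.8 or later.</p>

```python
from fractions import Fraction
from math import isqrt

def inverse_mod(A, p):
    """Inverse of the square matrix A modulo the prime p, by Gauss-Jordan elimination."""
    n = len(A)
    M = [[A[i][j] % p for j in range(n)] + [int(i == j) for j in range(n)]
         for i in range(n)]
    for col in range(n):
        # Raises StopIteration if A is singular modulo p.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        inv = pow(M[col][col], -1, p)
        M[col] = [v * inv % p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(a - f * b) % p for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def rational_reconstruct(a, m):
    """Find the fraction r/t with r = a*t (mod m) and |r|, |t| at most sqrt(m/2)."""
    bound = isqrt(m // 2)
    r0, r1 = m, a % m
    t0, t1 = 0, 1
    while r1 > bound:            # truncated extended Euclidean algorithm
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, t0 - q * t1
    return Fraction(r1, t1)

def dixon_solve(A, b, p=65537, steps=8):
    """Solve A x = b over Q for nonsingular integer A (toy version, fixed step count)."""
    n = len(A)
    C = inverse_mod(A, p)
    x = [0] * n
    residual = list(b)
    power = 1
    for _ in range(steps):
        digit = [sum(C[i][j] * residual[j] for j in range(n)) % p for i in range(n)]
        x = [xi + power * di for xi, di in zip(x, digit)]
        Ad = [sum(A[i][j] * digit[j] for j in range(n)) for i in range(n)]
        residual = [(r - a) // p for r, a in zip(residual, Ad)]   # exact division
        power *= p
    return [rational_reconstruct(xi, power) for xi in x]
```

<p>The expensive matrix inversion happens only once and only modulo p; each lifting step is just two matrix-vector products.</p>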
<p>This is now used for solving over both Z and Q. The solver in FLINT is competitive with Sage (which uses <a href="http://www.cs.uwaterloo.ca/~astorjoh/iml.html">IML</a>+<a href="http://math-atlas.sourceforge.net/">ATLAS</a>) up to systems of dimension somewhere between perhaps 100 and 1000 (depending greatly on the size of the entries in the inputs and in the solution!). There&#8217;s much to do here &#8212; we should eventually have BLAS support in FLINT, which will speed up core matrix arithmetic, but there&#8217;s room for a lot of algorithmic tuning as well.</p>
<p>There are some other minor new matrix features as well&#8230; they can be found in the changelog.</p>
<h2>Polynomial matrices</h2>
<p>A new module (<a href="https://github.com/fredrik-johansson/flint2/tree/trunk/fmpz_poly_mat">fmpz_poly_mat</a>) is provided for dense matrices over Z[x], i.e. matrices whose entries are polynomials with integer coefficients. The available functionality includes matrix multiplication, row reduction, and determinants. Matrix multiplication is particularly fast, as it uses the Kronecker segmentation interpolation/evaluation technique described above. (A similar algorithm is provided for determinants, but it&#8217;s not really optimal at this point.)</p>
<p>The benchmarks page has some detailed timings, so I won&#8217;t repeat them here &#8212; but generally speaking, the FLINT implementation is an order of magnitude faster than Sage or Magma for matrices of manageable size.</p>
<p>There&#8217;s much more to be done for polynomial matrices. Row reduction is implemented quite efficiently, but it&#8217;s too slow as an algorithm for many tasks such as computing null spaces of very large matrices. A future goal is to implement asymptotically fast algorithms (see the <a href="http://portal.acm.org/citation.cfm?id=1940489">paper on x-adic lifting</a> by Burçin Eröcal and Arne Storjohann for example).</p>
]]></content:encoded>
			<wfw:commentRss>http://fredrikj.net/blog/2011/06/some-flint-2-2-highlights/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>100 mpmath one-liners for pi</title>
		<link>http://fredrikj.net/blog/2011/03/100-mpmath-one-liners-for-pi/</link>
		<comments>http://fredrikj.net/blog/2011/03/100-mpmath-one-liners-for-pi/#comments</comments>
		<pubDate>Mon, 14 Mar 2011 15:37:00 +0000</pubDate>
		<dc:creator>Fredrik</dc:creator>
				<category><![CDATA[mpmath]]></category>
		<category><![CDATA[sage]]></category>
		<category><![CDATA[sympy]]></category>

		<guid isPermaLink="false">http://fredrikj.net/blog/?p=63</guid>
		<description><![CDATA[Since it&#8217;s pi day today, I thought I&#8217;d share a list of mpmath one-liners for computing the value of pi to high precision using various representations in terms of special functions, infinite series, integrals, etc. Most of them can already &#8230; <a href="http://fredrikj.net/blog/2011/03/100-mpmath-one-liners-for-pi/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Since it&#8217;s pi day today, I thought I&#8217;d share a list of mpmath one-liners for computing the value of pi to high precision using various representations in terms of special functions, infinite series, integrals, etc. Most of them can already be found as doctest examples in some form in the mpmath documentation.</p>
<p>A few of the formulas explicitly involve pi. Using those to calculate pi is rather <em>circular</em> (!), though a few of them could still be used for computing pi using numerical root-finding. In any case, most of the formulas are circular even when pi doesn&#8217;t appear explicitly since mpmath is likely using its value internally. In any <em>further</em> case, the majority of the formulas are not efficient for computing pi to very high precision (at least as written). Still, ~50 digits is no problem. Enjoy!</p>
<pre style="font-size: small;">from mpmath import *
mp.dps = 50; mp.pretty = True

+pi
180*degree
4*atan(1)
16*acot(5)-4*acot(239)
48*acot(49)+128*acot(57)-20*acot(239)+48*acot(110443)
chop(2*j*log((1-j)/(1+j)))
chop(-2j*asinh(1j))
chop(ci(-inf)/1j)
gamma(0.5)**2
beta(0.5,0.5)
(2/diff(erf, 0))**2
findroot(sin, 3)
findroot(cos, 1)*2
chop(-2j*lambertw(-pi/2))
besseljzero(0.5,1)
3*sqrt(3)/2/hyp2f1((-1,3),(1,3),1,1)
8/(hyp2f1(0.5,0.5,1,0.5)*gamma(0.75)/gamma(1.25))**2
4*(hyp1f2(1,1.5,1,1) / struvel(-0.5, 2))**2
1/meijerg([[],[]], [[0],[0.5]], 0)**2
(meijerg([[],[2]], [[1,1.5],[]], 1, 0.5) / erfc(1))**2
(1-e) / meijerg([[1],[0.5]], [[1],[0.5,0]], 1)
sqrt(psi(1,0.25)-8*catalan)
elliprc(1,2)*4
elliprg(0,1,1)*4
2*agm(1,0.5)*ellipk(0.75)
(gamma(0.75)*jtheta(3,0,exp(-pi)))**4
cbrt(gamma(0.25)**4*agm(1,sqrt(2))**2/8)
sqrt(6*zeta(2))
sqrt(6*(zeta(2,3)+5./4))
sqrt(zeta(2,(3,4))+8*catalan)
exp(-2*zeta(0,1,1))/2
sqrt(12*altzeta(2))
4*dirichlet(1,[0,1,0,-1])
2*catalan/dirichlet(-1,[0,1,0,-1],1)
exp(-dirichlet(0,[0,1,0,-1],1))*gamma(0.25)**2/(2*sqrt(2))
sqrt(7*zeta(3)/(4*diff(lerchphi, (-1,-2,1), (0,1,0))))
sqrt(-12*polylog(2,-1))
sqrt(6*log(2)**2+12*polylog(2,0.5))
chop(root(-81j*(polylog(3,root(1,3,1))+4*zeta(3)/9)/2,3))
2*clsin(1,1)+1
(3+sqrt(3)*sqrt(1+8*clcos(2,1)))/2
root(2,6)*sqrt(e)/(glaisher**6*barnesg(0.5)**4)
nsum(lambda k: 4*(-1)**(k+1)/(2*k-1), [1,inf])
nsum(lambda k: (3**k-1)/4**k*zeta(k+1), [1,inf])
nsum(lambda k: 8/(2*k-1)**2, [1,inf])**0.5
nsum(lambda k: 2*fac(k)/fac2(2*k+1), [0,inf])
nsum(lambda k: fac(k)**2/fac(2*k+1), [0,inf])*3*sqrt(3)/2
nsum(lambda k: fac(k)**2/(phi**(2*k+1)*fac(2*k+1)), [0,inf])*(5*sqrt(phi+2))/2
nsum(lambda k: (4/(8*k+1)-2/(8*k+4)-1/(8*k+5)-1/(8*k+6))/16**k, [0,inf])
2/nsum(lambda k: (-1)**k*(4*k+1)*(fac2(2*k-1)/fac2(2*k))**3, [0,inf])
nsum(lambda k: 72/(k*expm1(k*pi))-96/(k*expm1(2*pi*k))+24/(k*expm1(4*pi*k)), [1,inf])
1/nsum(lambda k: binomial(2*k,k)**3*(42*k+5)/2**(12*k+4), [0,inf])
4/nsum(lambda k: (-1)**k*(1123+21460*k)*fac2(2*k-1)*fac2(4*k-1)/(882**(2*k+1)*32**k*fac(k)**3), [0,inf])
9801/sqrt(8)/nsum(lambda k: fac(4*k)*(1103+26390*k)/(fac(k)**4*396**(4*k)), [0,inf])
426880*sqrt(10005)/nsum(lambda k: (-1)**k*fac(6*k)*(13591409+545140134*k)/(fac(k)**3*fac(3*k)*(640320**3)**k), [0,inf])
4/nsum(lambda k: (6*k+1)*rf(0.5,k)**3/(4**k*fac(k)**3), [0,inf])
(ln(8)+sqrt(48*nsum(lambda m,n: (-1)**(m+n)/(m**2+n**2), [1,inf],[1,inf]) + 9*log(2)**2))/2
-nsum(lambda x,y: (-1)**(x+y)/(x**2+y**2), [-inf,inf], [-inf,inf], ignore=True)/ln2
2*nsum(lambda k: sin(k)/k, [1,inf])+1
quad(lambda x: 2/(x**2+1), [0,inf])
quad(lambda x: exp(-x**2), [-inf,inf])**2
2*quad(lambda x: sqrt(1-x**2), [-1,1])
chop(quad(lambda z: 1/(2j*z), [1,j,-1,-j,1]))
3*(4*log(2+sqrt(3))-quad(lambda x,y: 1/sqrt(1+x**2+y**2), [-1,1],[-1,1]))/2
sqrt(8*quad(lambda x,y: 1/(1-(x*y)**2), [0,1],[0,1]))
sqrt(6*quad(lambda x,y: 1/(1-x*y), [0,1],[0,1]))
sqrt(6*quad(lambda x: x/expm1(x), [0,inf]))
quad(lambda x: (16*x-16)/(x**4-2*x**3+4*x-4), [0,1])
quad(lambda x: sqrt(x-x**2), [0,0.25])*24+3*sqrt(3)/4
mpf(22)/7 - quad(lambda x: x**4*(1-x)**4/(1+x**2), [0,1])
mpf(355)/113 - quad(lambda x: x**8*(1-x)**8*(25+816*x**2)/(1+x**2), [0,1])/3164
2*quadosc(lambda x: sin(x)/x, [0,inf], omega=1)
40*quadosc(lambda x: sin(x)**6/x**6, [0,inf], omega=1)/11
e*quadosc(lambda x: cos(x)/(1+x**2), [-inf,inf], omega=1)
8*quadosc(lambda x: cos(x**2), [0,inf], zeros=lambda n: sqrt(n))**2
2*quadosc(lambda x: sin(exp(x)), [1,inf], zeros=ln)+2*si(e)
exp(2*quad(loggamma, [0,1]))/2
2*nprod(lambda k: sec(pi/2**k), [2,inf])
s=lambda k: sqrt(0.5+s(k-1)/2) if k else 0; 2/nprod(s, [1,inf])
s=lambda k: sqrt(2+s(k-1)) if k else 0; limit(lambda k: sqrt(2-s(k))*2**(k+1), inf)
2*nprod(lambda k: (2*k)**2/((2*k-1)*(2*k+1)), [1,inf])
2*nprod(lambda k: (4*k**2)/(4*k**2-1), [1, inf])
sqrt(6*ln(nprod(lambda k: exp(1/k**2), [1,inf])))
nprod(lambda k: (k**2-1)/(k**2+1), [2,inf])/csch(pi)
nprod(lambda k: (k**2-1)/(k**2+1), [2,inf])*sinh(pi)
nprod(lambda k: (k**4-1)/(k**4+1), [2, inf])*(cosh(sqrt(2)*pi)-cos(sqrt(2)*pi))/sinh(pi)
sinh(pi)/nprod(lambda k: (1-1/k**4), [2, inf])/4
sinh(pi)/nprod(lambda k: (1+1/k**2), [2, inf])/2
(exp(1+euler/2)/nprod(lambda n: (1+1/n)**n * exp(1/(2*n)-1), [1, inf]))**2/2
3*sqrt(2)*cosh(pi*sqrt(3)/2)**2*csch(pi*sqrt(2))/nprod(lambda k: (1+1/k+1/k**2)**2/(1+2/k+3/k**2), [1, inf])
2/e*nprod(lambda k: (1+2/k)**((-1)**(k+1)*k), [1,inf])
limit(lambda k: 16**k/(k*binomial(2*k,k)**2), inf)
limit(lambda x: 4*x*hyp1f2(0.5,1.5,1.5,-x**2), inf)
1/log(limit(lambda n: nprod(lambda k: pi/(2*atan(k)), [n,2*n]), inf),4)
limit(lambda k: 2**(4*k+1)*fac(k)**4/(2*k+1)/fac(2*k)**2, inf)
limit(lambda k: fac(k) / (sqrt(k)*(k/e)**k), inf)**2/2
limit(lambda k: (-(-1)**k*bernoulli(2*k)*2**(2*k-1)/fac(2*k))**(-1/(2*k)), inf)
limit(lambda k: besseljzero(1,k)/k, inf)
1/limit(lambda x: airyai(x)*2*x**0.25*exp(2*x**1.5/3), inf, exp=True)**2
1/limit(lambda x: airybi(x)*x**0.25*exp(-2*x**1.5/3), inf, exp=True)**2</pre>
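<p>Since most of the one-liners above are circular in the sense described at the top, here is a small non-circular sketch for comparison. It evaluates the Machin formula from the list (16*acot(5)-4*acot(239)) using only Python integers as fixed-point numbers. The function names <code>acot_fixed</code> and <code>machin_pi</code> and the choice of 10 guard digits are my own assumptions for this illustration; this is not meant to describe how mpmath computes pi internally.</p>

```python
def acot_fixed(x, unity):
    """Approximate acot(x) * unity for an integer x, using the Taylor series
    acot(x) = 1/x - 1/(3 x^3) + 1/(5 x^5) - ... in integer arithmetic."""
    total = term = unity // x
    x2 = x * x
    n = 3
    sign = -1
    while term:
        term //= x2                  # unity / x^n, truncated
        total += sign * (term // n)
        sign = -sign
        n += 2
    return total


def machin_pi(digits):
    """Return pi * 10**digits, truncated, as a Python integer."""
    guard = 10                       # extra digits to absorb truncation errors
    unity = 10 ** (digits + guard)
    pi_fixed = 16 * acot_fixed(5, unity) - 4 * acot_fixed(239, unity)
    return pi_fixed // 10 ** guard


print(machin_pi(50))  # 314159265358979323846264338327950288419716939937510
```

<p>The result is an integer, so the decimal point has to be inserted after the leading 3 by hand. The series for acot(5) gains only about 1.4 digits per term, so like most formulas in the list this is fine for ~50 digits but not a good way to reach very high precision.</p>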
]]></content:encoded>
			<wfw:commentRss>http://fredrikj.net/blog/2011/03/100-mpmath-one-liners-for-pi/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

