Using the 'for $in -> $x { ... }' style was going quite slow, but the helpful people on #perl6 got be to try .get in a loop, e.g. 'while (my $x = $in.get) { ... }' which turns out to be much faster. Not only does it use almost no memory, it's 50% faster than the latest @d = "file".IO.lines.

BTW, this was meant to share my experience with using .get for reading a large file. Big thanks to Liz and others for speeding up Str.lines!

]]>132.1 Perl trial division mod 6

291.7 Perl trial division

9.8 Math::Prime::Util

2.5 Math::Prime::Util with precalc

6.7 Math::Prime::XS

On this machine, Math::Prime::XS's simple trial division loop is faster than the non-cached routine I use in MPU until 3e7. Part of this is that MPU uses UV internally while MPXS uses "unsigned long". On this machine UV is "unsigned long long" (64-bit) and unsigned long is only 32-bit. That means MPXS is 32-bit, so doesn't work past 2^32 and probably explains the speed difference as well.

]]>These aren't huge numbers, but from the Math::Prime::Util documentation:

is_prime from 10^100 to 10^100 + 0.2M 2.2s Math::Prime::Util (BPSW + 1 random M-R) 2.7s Math::Pari w/2.3.5 (BPSW) 13.0s Math::Primality (BPSW) 35.2s Math::Pari (10 random M-R) 38.6s Math::Prime::Util w/o GMP (BPSW) 70.7s Math::Prime::Util (n-1 or ECPP proof) 102.9s Math::Pari w/2.3.5 (APR-CL proof)

Math::Prime::Util with the GMP backend will support hundreds of thousands of digits, and is probably the fastest code for large numbers other than OpenPFGW's Fermat test, and is substantially faster than any of the other Perl modules. See this stackexchange challenge, or Nicely's list of first occurrence prime gaps where I used this module.

Caveat being that without Math::Prime::Util::GMP installed, it uses Math::BigInt (with GMP or Pari backend), which is super slow. My todo list has some sort of replacement to get a bigint solution that is both (1) portable assuming XS, and (2) reasonably fast. Also, there are some nice optimizations for x86_64 as well as 64-bit in general. It is still fast on non-x86 machines, but it will miss some of the better optimizations (asm mulmod, montgomery math).

Math::Pari, Math::GMP, Math::GMPz, and Math::Primality will support bigints pretty well. For the two GMP methods you'll have to decide how many tests to use. Math::Pari really needs to be updated to use a newer Pari by default -- the current version will do 10 M-R tests and is quite a bit slower than when built with Pari 2.3.5.

Math::Prime::XS does not support bigints. For 64-bit primes it is about 3-4 million times slower than MPU on my machine (but should be fast for most composites).

Math::Prime::FastSieve is going to eat a lot of memory and time making the sieve once we're past 10^8 or so. The answers are fast once done, but it's not the best solution. It took me 2 minutes to sieve to 10^10, and beyond that will take GB of memory.

Trial division is exponential time so even with C+GMP is not going to be practical past 25-30 digits (and is hideously slow at those sizes). The Perl code is just going to get worse.

Time for primality proofs is another discussion -- I'm writing some web pages on that since I realized I keep writing the same thing on forums.

For the largest known primes, we'd want to use a Lucas-Lehmer test since they are Mersenne primes. I have not added any special form tests (nor have the other modules), but the LL test is pretty straightforward. They would still take a long time. The largest currently known prime has 17,425,170 digits. Using code specifically made for this, it took 6 days on a 32-core server and 3.6 days on a GPU.

For a general form numbers, last year some people ran tests on a couple Wagstaff PRPs with ~4 million digits. OpenPFGW took 4-70 hours to show they were Fermat PRPs, and 5 days for the Lucas test. A fast Frobenius test implemented with GMP took slightly over one month.

]]>

Memory | Time | Solution |
---|---|---|

2096k | 72.6s | Perl trial division mod 6 |

2100k | 124.8s | Perl trial division |

3652k | 36.2s | Math::GMP |

3940k | 14.8s | Math::GMPz |

4040k | 1.9s | Math::Prime::Util (no GMP backend) |

4388k | 1.9s | Math::Prime::Util (with GMP backend) |

4568k | 1.4s | Math::Prime::Util (GMP + precalc) |

4888k | 4.4s | Math::Prime::XS |

5316k | 245.1s | Math::Primality |

5492k | 29.8s | Math::Pari |

6260k | 1.5s | Math::Prime::FastSieve |

~16MB | >1 year | regex |

Times are with perl 5.20.0 on an Ubuntu Core2 E7500 machine. I used `/usr/bin/time perl -E '`*...*`'` to get the time and memory use. With this system just starting up Perl on the command line takes about 2MB.

The first two entries are simple Perl trial division routines:

`# mod-6 wheel`

sub isprime { my($n) = @_; return ($n >=2) if $n < 4; return if ($n%2 == 0) || ($n%3 == 0); my $sn=int(sqrt($n)); for (my $i = 5; $i <= $sn; $i += 6) { return unless $n % $i && $n % ($i+2); } 1; }

$s += isprime($_) for 1..1e7; say $s;

```
# Standard method, from RosettaCode
sub isprime { my($n) = @_; $n % $_ or return for 2 .. sqrt $n; $n > 1; }
$s += isprime($_) for 1..1e7; say $s;
```

These have essentially no memory use, but are pretty slow especially as the input increases. However, very low memory and gets the job done for small inputs.

The modules Math::GMP and Math::GMPz have calls to GMP's mpz_probab_prime_p function. They are the lowest memory of the module solutions by a small margin, but not the fastest. They shouldn't slow down much with larger inputs.

Math::GMPz has a bit clunkier interface but is faster and exports the entire GMP integer API (which makes it use a little more memory to load). The speed will differ based on the number of tests: only 3 are required for this size, but we must have at minimum 11 for 64-bit inputs and probably more to avoid false results. Math::GMP has much more object overhead though I believe this is being worked on.

Like some other modules, the result of the primality test is either 2 (definitely prime), 1 (probably prime), or 0 (definitely composite). Using the double negation is an easy and fast way to make the result either 1 or 0.

`use Math::GMP; $i=Math::GMP->new(0); for (1..1e7) { $i++; $s += !!$i->probab_prime(15) } ; say $s;`

`use Math::GMPz; $i=Math::GMPz->new(0); for (1..1e7) { $i++; $s += !!Math::GMPz::Rmpz_probab_prime_p($i,15) }; say $s`

Math::Prime::Util is my number theory module. After working on cutting down memory use it's reasonably small even with many exportable functions. It is the fastest solution as well.

By default it will load the GMP back-end if available. This uses a little more memory, but can be turned off either by not having it installed or setting the environment variable `MPU_NO_GMP` to a non-zero value.

`use Math::Prime::Util "is_prime"; $s += !!is_prime($_) for 1..1e7; say $s;`

To match the behavior of Math::Prime::FastSieve, we can precalculate the primes, making `is_prime` a simple bit-set lookup.

`use Math::Prime::Util qw/is_prime prime_precalc/; prime_precalc(1e7); $s += !!is_prime($_) for 1..1e7; say $s;`

Math::Prime::XS does mod-6 trial division in C. It's fast for small inputs, but as expected from an exponential-time algorithm, will slow down a lot with large inputs. It uses more memory than I'd expect.

`use Math::Prime::XS "is_prime"; $s += is_prime($_) for 1..1e7; say $s;`

Math::Primality implements the BPSW algorithm in Perl using Math::GMPz. It really is better suited for bigint inputs, being both slow and memory intensive for this simple task.

`use Math::Primality "is_prime"; $s += !!is_prime($_) for 1..1e7; say $s;`

Math::Pari is the Perl interface to the old Pari 2.1.7 library. It has lots of functionality, but does take up almost 1.5MB more startup memory. While PARI/GP is reasonably efficient (about 4 seconds), the module returns the boolean result as a Math::Pari object which sucks up lots of time. The double negation is a little faster than directly summing the result.

`use Math::Pari qw/isprime/; $s += !!isprime($_) for 1..1e7; say $s;`

Math::Prime::FastSieve takes a little different approach. It's written using Inline::CPP and sieves a range into a bit vector, after which operations (such as `isprime`) can be efficiently performed. That does limit its range, but the time shows it's quite fast at this operation.

`use Math::Prime::FastSieve; my $sieve = Math::Prime::FastSieve::Sieve->new(1e7); $s += $sieve->isprime($_) for 1..1e7; say $s;`

Lastly we come to Abigail's regex. Very popular for code golfing and showing awesome regex hacks, it occasionally is seen as a practical recommendation from people who have clearly never used it for non-toy-size inputs. It's very cool. It's also not practical for larger inputs. For this task it took 6.8 seconds and 2684k for the first 10k, but 2607 seconds and 7144k for isprime for the first 100k integers. It takes over 1 minute to verify 999,983 is prime, and for 9,999,991 I killed it after 40 minutes. Hence my estimate of over a year to finish the sum-to-10M example.

`sub isprime { ('1' x shift) !~ /^1?$|^(11+?)\1+$/ } $s += isprime($_) for 1..1e7; say $s`

Conclusions

If you're using Moose or a long-running process, the memory use for any of the reasonable solutions here probably doesn't matter -- use what is easy and fast. For command-line programs or processes that are spun up just for a single task, the memory use can matter. My module was hitting 9MB before I finally had enough and reduced it substantially (a big chunk of that was by having functions go straight to XS and load up the thousands of lines of PP only if required). Even for such a simple task as this we can see sizes ranging from 2MB to 6MB, with over 2MB difference even between modules.

Another subject that is important, especially for making utility scripts, is startup time. This task did not measure that, but it can also be a bottleneck especially if comparing vs. standalone C programs that have essentially no startup cost.

]]>Math::Pari probably needs a co-maintainer with lots of time. So far I don't think anyone qualified has wanted to step up. It's a lot of work. On the plus side, the RT situation isn't quite so bad -- it would look a lot better with some pruning of duplicates and closing of fixed issues. There are a lot of issues that look like they're fixed but just haven't been closed.

This leads to the digression of how it would be nice to wean the remainder of the Perl crypto modules off of Math::Pari, but they're often in the same boat. The authors are around but don't have time to work on the modules and aren't ready to give them up. There's also the issue of the alternatives: Math::BigInt is core and portable, but super slow for this work without one of the backends, and the backends also have long-standing critical bugs. Math::GMP or Math::GMPz would be the obvious and best choices, but then we're requiring platforms to have GMP installed. I'm still trying to find time to get an alternative out, but it won't be ready in time for CPAN day.

]]>In many cases what I found was that often just using string storage was faster than Bit::Vector, merely because Perl optimizes the heck out of things like substr. Once the vector grows large (e.g. for Unary codes) then Bit::Vector is better. Using 32-bit vectors with bit twiddling in Perl was pretty close to Bit::Vector's speed for my operations. Of course it will differ based on your operations.

Using an XS back end for the bit manipulation results in ~70x speedup for this application (and another 2x speedup if I go straight to XS and skip the Perl Moo class entirely, but then you give up some extensibility).

]]>One thing I ended up adding to my test suite for one package was an "examples.t" file, which contains tests for all the examples in the synopsis and throughout the documentation (including the small examples per method and the EXAMPLES section). This compares the expected output, which Test::Synopsis doesn't have. I wanted to make sure I caught any regressions in anything I used in the documentation because that would be particularly embarrassing. The downside is keeping things in sync.

]]>* Longest chain of monthly reference to module in other sources (e.g. stackoverflow, non-Perl blogs, RosettaCode, etc.). Modules that show how Perl can be usefully extended outside of the echo chamber. Hard to tally and prevent gaming though.

* Longest chain of an author contributing to unique non-owned modules (that is, each week/month they have to submit an RT, issue, patch, pull request, etc. to a module that they don't own, each module only allowed once in a chain).

My taste runs to trying to increase the quality of what we have, vs. putting out yet another variation on some module we already have. Measuring this is certainly harder than tracking new CPAN submissions though.

]]>I am going to do a LibTomMath-based module that's self contained (see CryptX, for instance). Interface similar to Math::GMP probably with some additions (e.g. modular inverse, exponentiation, etc. similar to Math::BigInt). It certainly won't be as fast as GMP for the big stuff, but it will be fine for most uses and I will be able to include it as a dependency and still work on all platforms.

For Algorithm::AM, I'm not sure if any of this would help or not. LTM will be tucked away inside the Math::LTM module where you couldn't get to it (let me know if there is a way, but given we don't install the C headers as part of the module, for instance, I don't see how it would work). If the concern is performance and readability, you could use some generic functions such as a subset of the GMP API (e.g. mpz_init, mpz_set_ui, mpz_add, mpz_mul, mpz_get_d, mpz_get_str, mpz_clear), then have Makefile.PL / the compiler make a decision which to use (1) gcc's __int128_t type, (2) GMP (no functions needed, just include GMP.h), or (3) by-hand. Good for performance, not so good for testing. GMP is installed on quite a few systems these days, but definitely not all.

]]>** invmod(a,n)** computes the modular inverse. Similar to Pari's

** vecsum(...)** returns the sum of a list of integers. What about List::Util::sum! I was using that, until I started wondering why my results were sometimes incorrect. Try:

my $n = int(2**53); die unless $n+13 == int(sum($n,13));

The problem is that List::Util turns everything into an NV, which means it lops off the lower 11 or so bits of 64-bit integers. Oops. min and max have the same problem, so the next release of MPU will probably add vecmin and vecmax functions.

** binomial(n,k)** computes a binomial coefficient. I discussed this in a previous post. It uses the full Kronenburg definition for negative arguments, and as much computation in C as possible for speed.

** forpart { ... } n[,{...}]** loops over integer partitions. Pari 2.6 added this, and I thought I would as well. In its simplest form, it gives the additive partitions like the Integer::Partition module, just much faster (albeit not using an iterator). It also allows restrictions to be given, such as

forpart { say "@_" } 10,{n=>5}to only show the partitions with exactly 5 elements. We can use amin and amax to restrict by element size; nmin and nmax to restrict by number of elements. Of course any restriction can be done inside the block, but using the passed-in restriction means the block doesn't get called at all -- important when the number of unrestricted partitions is in the tens or hundreds of millions.

Performance

Wojchiech Izykowski has been working on fast Miller-Rabin testing for some time, including, for a few years now, hosting the Best known SPRP bases collection. He's also been working on Fast Miller-Rabin primality test code. I sent him a 1994 paper by Arazi a while back, which he managed to turn into a performance improvement to his 2^64 modular inverse, and some more back and forth led to even faster code. That combined with a few other changes to the Montgomery math resulted in a close to 50% speedup in the primality tests, which were already blazingly fast. On my 4770k it's now less than a microsecond at any 64-bit size, and 2-5x faster than Pari/GP. The caveat being that the fastest Montgomery path is only used on x86_64. Regardless of platform, the results for any 64-bit number are deterministic -- there are no false results because we use the BPSW test.

I also made a small speedup for small integer factoring, which helps speed up a few functions that call it, e.g. euler_phi, divisor_sum, moebius, etc. Useful for shaving off a little time from naive Project Euler solutions perhaps. I had a few tasks that did a lot of calling of these functions for small-ish values, and while they're already pretty fast, every little bit helps.

What's next?

For minor updates, I already mentioned vecmin and vecmax. I have some improvements to legendre_phi that should be done. I'm thinking is_power may get an optional third argument like Pari that gets set to the root.

Math::Prime::Util::GMP has implementations of valuation, invmod, is_pseudoprime, and binomial now, to help speed those up. I'll probably add vecsum as well. I have a speedup coming for primality testing of numbers in the 40 to 5000 digit range, which is important for the applications I'm running.

Lastly, I'm really hoping to get time to make an XS bigint module, which should give a big boost to the cases where we need bigint processing but don't have GMP. Math::BigInt is a lot slower than I'd like.

]]>]]> Since the base of Math::Prime::Util is in C, I first needed a solution in C. MJD has an old blog post and followup from 2007 related to overflow. This mostly works, but we need a way to detect overflow. RosettaCode has an idea, though overall the code is worse than MJD's.

Experience has shown that once I have to leave the C code, performance takes a huge hit. Hence it would be best to handle everything possible here. I did a little experiment, calculating the 5151 cases of binomials with n in 0..100 and k in 0..n. All but 1355 of them have a 64-bit result, so this is (without going too crazy) the best we can get.

- RosettaCode's example bails on 1990 cases, including (38,13).
- MJD's base code bails on 1617 cases, including (63,29).
- Adding a gcd bails on 1389 cases, including (76,21).
- Adding a second gcd bails on 1355 cases, all having a result > 2^64.

The single gcd is exactly what MJD suggests in his followup blog and gets most of the cases. However, for `r = r * n/d`, if we first reduce `n/d` then `r/d`, that handles the other cases. The cost is an extra gcd in C (only when it looks like we might overflow), to save us a call to either a GMP binomial turned into a Math::BigInt (not too bad), or to Perl's Math::BigInt bnok (a *lot* slower).

static UV gcd_ui(UV x, UV y) { if (y < x) { UV t = x; x = y; y = t; } while (y > 0) { UV t = y; y = x % y; x = t; /* y1 <- x0 % y0 ; x1 <- y0 */ } return x; } UV binomial(UV n, UV k) { UV d, g, r = 1; if (k >= n) return (k == n); if (k > n/2) k = n-k; for (d = 1; d <= k; d++) { if (r >= UV_MAX/n) { /* Possible overflow */ UV nr, dr; /* reduced numerator / denominator */ g = gcd_ui(n, d); nr = n/g; dr = d/g; g = gcd_ui(r, dr); r = r/g; dr = dr/g; if (r >= UV_MAX/nr) return 0; /* Unavoidable overflow */ r *= nr; r /= dr; n--; } else { r *= n--; r /= d; } } }

Negative Arguments

Once the C code overflows (or the XS layer decides it can't understand the arguments), I go to GMP if available, and Math::BigInt's *bnok* if not. When I started testing negative arguments, things got interesting. First let's look at the well defined case: `n < 0, k >= 0`. Knuth 1.2.6.G is quite standard: `binomial(-n,k) = (-1)^k * binomial(n+k-1,k)`. This is handled correctly by Mathematica, Pari, and GMP, but not by Math::BigInt. An RT has been filed (the worst part is that it gives different answers depending on which back end is used).

Things get murky when looking at negative k. Knuth 1.2.6.B indicates that for a positive n, the binomial is 0 when `k < 0` or `k > n`. But what about negative n? Pari says it is 0 in this case as well. GMP's API doesn't allow negative k at all. Math::BigInt also says it is always zero. But Mathematica references Kronenburg 2011 and defines it as `(-1)^(n-k) * binomial(-k-1,n-k)` when `n < 0` and `k <= n`.

I've decided to follow the full Kronenburg definition for negative arguments. This means doing some additional work in the XS and GMP wrappers, as well as wrapping up Math::BigInt's function.

For GMP, this is relatively easy to handle, given char* strn and strk:

mpz_init_set_str(n, strn, 10);

mpz_init_set_str(k, strk, 10);

if (mpz_sgn(k) < 0) { /* Handle negative k */

if (mpz_sgn(n) >= 0 || mpz_cmp(k,n) > 0) mpz_set_ui(n, 0);

else mpz_sub(k, n, k);

}

mpz_bin_ui(n, n, mpz_get_ui(k));

/* ... return result n as appropriate ... */

mpz_clear(k); mpz_clear(n);

For XS and native-int Perl it's slightly trickier, because we have to overflow in the case where the unsigned binomial succeeded and we need to negate the result but the high bit is set -- hence it can't be represented as a signed integer. Wrapping Math::BigInt isn't too different than GMP. I've put a patch in the RT to do the modifications inside Math::BigInt, but it remains to be reviewed and further tested.

]]>