Perl performance on Apple M1

I recently got an Apple M1 Mac Mini, half out of curiosity, half because it was exactly what I needed: I keep a low-end Mac just to try out things like new Xcode betas etc, like a "canary" machine. My old 2012 Mac Mini stopped getting official Apple updates, so it could no longer do what I needed, and the 8GB RAM, 256GB SSD M1 mini at $699 is easily the cheapest Mac you can buy.
Overall, unlike the typical Mac Minis of old, which seemed to be on the slow side, it felt quite fast from the start, so I thought I'd run some benchmarks on it for fun to see how Apple's ARM M1 fares against some x86 competition. And, as I mostly use Perl in my day job, I thought some Perl-related benchmarks would be of interest to me.

For those not aware, the M1 is an ARM-based CPU (well, it includes a GPU, so it's really an SoC), with 8 cores in total (4x performance @ 3.2GHz/12MB L2, 4x efficiency @ 2GHz/4MB L2), built at 5nm and consuming up to 15W. It is basically the "laptop" class version of the CPUs Apple has been building for iPhones/iPads. Apart from native ARM code, it can run x86 code through Rosetta 2, but I still can't use it for work: our dev environment currently relies on VirtualBox, which needs actual x86/VT-x silicon. I ran benchmarks against my work laptop, a Mid 2015 15" MacBook Pro with a 2.5GHz i7 Crystalwell. Even though it was Apple's top of the line at the time, it's a bit old now. I keep it for the non-butterfly keyboard and the full complement of ports, and until recently the newer Macs weren't much faster anyway. Although an older i7 will make it easier for the M1 to compete, I still find the comparison quite interesting, especially since the Mac Mini has always been the "slow/cheap" Mac - and it's now even cheaper. Plus, I'll throw in some tests with different hardware just for comparison.

The Benchmarks

I will definitely not claim the benchmarks I ran are truly representative of real world performance, especially as I am of the opinion that you should benchmark your own code - what you personally would run. But I also added some typical things most Perl users might do and some things that came up when looking for Perl "benchmarks", so that anybody can try them and get an idea of the relative performance of their own machine.

  • Building perl 5.32.1
It's nice that Apple finally updated the macOS system Perl. Big Sur now comes with v5.28.2 (threaded) by default, after being stuck at v5.18 for many years. However, I rarely rely on system Perl, so the first thing to do, which is sort of a benchmark in itself, would be to get perlbrew and run:
perlbrew install perl-5.32.1
  • Moose 2.2015
The Moose test suite is, like the object system itself, a relatively slow affair. I'll time the default cpan installation which builds and runs the test-suite single-threaded:

cpan Moose
Then, I can try the test suite after preloading Moose with yath at 1, 4, 6 threads. In the last case, the M1 will have to use its efficiency cores, while the i7 will use HT:
yath -PMoose
yath -PMoose -j4
yath -PMoose -j6
  • prime.pl
I slightly modified the primes.pl script from here to:

use strict;
use warnings;

use Time::HiRes 'time';
my $time = time();

my $n = $ARGV[0] || 100000000;
my @s = ();
for (my $i = 3; $i < $n + 1; $i += 2) {
    push(@s, $i);
}
my $mroot = $n**0.5;
my $half  = scalar @s;
my $i     = 0;
my $m     = 3;
while ($m <= $mroot) {
    if ($s[$i]) {
        for (my $j = int(($m * $m - 3) / 2); $j < $half; $j += $m) {
            $s[$j] = 0;
        }
    }
    $i++;
    $m = 2 * $i + 3;
}

my @res = (2, grep($_, @s));
warn "Found ".scalar(@res)." primes in ".(time()-$time)." sec.\n";

I ran it as it is, and also as 4 parallel processes with the argument 20000000 (to avoid hitting the 8GB M1 RAM limits).
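The batch file used for the 4-process run is not shown, so the following is only a minimal Perl sketch of how such a run might be launched, not the actual script. It forks a number of workers, each of which runs the same odd-only sieve as prime.pl and reports its count and wall-clock time back through a pipe. The names count_primes and run_parallel are hypothetical, and the default n is kept small so that the sketch runs quickly; pass 20000000 as the first argument to match the run described above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use POSIX ();
use Time::HiRes 'time';

# Same odd-only sieve as prime.pl; returns the number of primes <= $n (n >= 2).
sub count_primes {
    my ($n) = @_;
    my @s;
    for (my $i = 3; $i < $n + 1; $i += 2) { push @s, $i }
    my $half  = scalar @s;
    my $mroot = $n**0.5;
    my ($i, $m) = (0, 3);
    while ($m <= $mroot) {
        if ($s[$i]) {
            for (my $j = int(($m * $m - 3) / 2); $j < $half; $j += $m) {
                $s[$j] = 0;
            }
        }
        $i++;
        $m = 2 * $i + 3;
    }
    my $odd_primes = grep { $_ } @s;
    return $odd_primes + 1;    # +1 for the prime 2
}

# Fork $workers children, each sieving up to $n; collect results via pipes.
sub run_parallel {
    my ($workers, $n) = @_;
    my @children;
    for (1 .. $workers) {
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {    # child: do the work, report, exit
            close $reader;
            my $t0    = time();
            my $count = count_primes($n);
            printf {$writer} "%d %.3f\n", $count, time() - $t0;
            close $writer;
            POSIX::_exit(0);
        }
        close $writer;      # parent keeps only the read end
        push @children, [$pid, $reader];
    }
    my @results;
    for my $child (@children) {
        my ($pid, $reader) = @$child;
        my $line = <$reader>;
        close $reader;
        waitpid($pid, 0);
        die "worker $pid produced no result\n" unless defined $line;
        my ($count, $secs) = split ' ', $line;
        push @results, { pid => $pid, primes => $count, seconds => $secs };
    }
    return @results;
}

my $n       = $ARGV[0] || 2_000_000;
my $workers = $ARGV[1] || 4;
my $start   = time();
for my $r (run_parallel($workers, $n)) {
    printf "Found %d primes in %.3f sec (pid %d)\n",
        $r->{primes}, $r->{seconds}, $r->{pid};
}
printf "Total wall-clock time: %.3f sec\n", time() - $start;
```

Forking separate processes (rather than using ithreads) mirrors what launching several background copies of the script from a shell does: each worker gets its own copy of the array, which is why memory use grows with the worker count.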

  • phoronix-test-suite-10.2.2
The only test suite that advertises perl tests, although it turns out it has just 2 small subtests for perl (interpreter, pod2html). I ran it with the command:
phoronix-test-suite run pts/perl-benchmark
  • BioPerl
I downloaded some bacteria from genbank and benchmarked loading the sequences to count codons or monomers.

use strict; 
use Bio::SeqIO; 
use Bio::Tools::SeqStats; 
use Benchmark qw(:all);

my $in = Bio::SeqIO->new(-file => "gbbct10.seq", -format => "genbank");

timethis(1, sub {
    my $seq = $in->next_seq;
    my $seq_stats = Bio::Tools::SeqStats->new($seq); 
    my $codon_ref = $seq_stats->count_codons(); 
});

timethis(1, sub {
    my $builder = $in->sequence_builder();
    $builder->want_none();
    $builder->add_wanted_slot('display_id','seq');
    for (1..10000) {
        my $seq = $in->next_seq;
        my $seq_stats = Bio::Tools::SeqStats->new($seq); 
        my $weight = $seq_stats->get_mol_wt(); 
        my $monomer_ref = $seq_stats->count_monomers();
    }
});
  • Precession
Let's precess 1 million random celestial coordinates between random epochs using my Astro::Coord::Precession.

use Astro::Coord::Precession 'precess';

my $precessed = precess([rand(24), rand(180)-90], rand(200)+1900, rand(200)+1900)
    for (1..1000000);
  • Text processing
I threw in a script (called DSOgenerate) that reads various astronomical catalogues and compiles the database for my Polar Scope Align iOS app, and another that parses webpages to get articles (its slowest component is HTML::FormatText) that was used for a university project I worked on.
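Since the two text-processing scripts are not included here, anyone wanting to time their own workload in a comparable way needs some small harness. Below is a minimal sketch using only core modules. The workload (word_count_workload) is a purely hypothetical stand-in and not one of the scripts mentioned above; time_runs is likewise an illustrative helper name.

```perl
use strict;
use warnings;
use Time::HiRes 'time';

# Run $code $runs times and return the wall-clock seconds of each run.
sub time_runs {
    my ($code, $runs) = @_;
    my @elapsed;
    for (1 .. $runs) {
        my $t0 = time();
        $code->();
        push @elapsed, time() - $t0;
    }
    return @elapsed;
}

# Hypothetical stand-in workload: build lines of text and count word frequencies.
sub word_count_workload {
    my ($lines) = @_;
    my @words = qw(alpha beta gamma delta epsilon);
    my %freq;
    for my $i (1 .. $lines) {
        my $line = join ' ', map { $words[($i + $_) % @words] } 0 .. 9;
        $freq{lc $_}++ for split /\s+/, $line;
    }
    return \%freq;
}

my @t = sort { $a <=> $b } time_runs(sub { word_count_workload(50_000) }, 5);
printf "best %.3f s, median %.3f s over %d runs\n",
    $t[0], $t[ int(@t / 2) ], scalar @t;
```

Reporting the best and median of several runs, rather than a single timing, makes it easier to notice when one run is an outlier.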

The Good (aka: The Results)

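| Benchmark | Units | i7 | M1 | M1 Diff |
|---|---|---|---|---|
| Build perl 5.32.1 | min | 21.52 | 13.78 | 56.1% |
| cpanm Moose | sec | 116.61 | 44.33 | 163.0% |
| yath -PMoose | sec | 47.63 | 18.43 | 158.4% |
| yath -PMoose -j4 | sec | 18.05 | 5.88 | 207.0% |
| yath -PMoose -j6 | sec | 16.94 | 5.77 | 193.6% |
| prime.pl | sec | 23.22 | 15.56 | 49.2% |
| prime.pl 4x | sec | 4.90 | 3.06 | 60.0% |
| Phoronix pod2html | msec | 213.00 | 91.17 | 133.6% |
| Phoronix Interpreter | msec | 4.57 | 1.25 | 266.8% |
| BioPerl codons | sec | 149.89 | 127.58 | 17.5% |
| BioPerl monomers | sec | 16.64 | 7.82 | 112.8% |
| Precession | sec | 6.85 | 3.27 | 109.5% |
| DSOgenerate | sec | 13.12 | 5.71 | 129.8% |
| HTML::FormatText | sec | 8.52 | 4.70 | 81.3% |
| Average | | | | 124.2% |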



Or, if we want a nice comparison graph where the i7 is "1x" speed and plot the M1 in relation to it:

[Chart: speed of the M1 relative to the i7 (i7 = 1x) for each benchmark]
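For anyone wanting to add their own machine to the comparison: the "M1 Diff" column is not defined explicitly, but the numbers are consistent with the reference time divided by the M1 time, minus one, expressed as a percentage. The sketch below works under that assumption, using a few rows copied from the results. The helper names speedup and percent_faster are illustrative only.

```perl
use strict;
use warnings;

# A few rows copied from the results: [benchmark, i7 time, M1 time].
my @results = (
    ['Build perl 5.32.1', 21.52,  13.78],
    ['cpanm Moose',       116.61, 44.33],
    ['yath -PMoose -j4',  18.05,  5.88],
    ['BioPerl codons',    149.89, 127.58],
);

# How many times faster the new machine is (reference machine = 1x).
sub speedup {
    my ($ref_time, $new_time) = @_;
    return $ref_time / $new_time;
}

# The same ratio expressed as "percent faster" (assumed meaning of "M1 Diff").
sub percent_faster {
    my ($ref_time, $new_time) = @_;
    return (speedup($ref_time, $new_time) - 1) * 100;
}

for my $row (@results) {
    my ($name, $i7, $m1) = @$row;
    printf "%-20s %.2fx  (%+.1f%%)\n",
        $name, speedup($i7, $m1), percent_faster($i7, $m1);
}
```

Under this reading, an average difference of 124.2% corresponds to the M1 running at roughly 2.24x the speed of the i7.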

I found the results quite remarkable. I mean, the main reason I went through all this is that, while I was setting it up, I could see the M1 going through the installation of cpan modules at a ridiculous pace compared to my i7; it was very obviously faster.
It turns out it is over 2x faster as a crude "average" of the above tests. You can see from the two multithreaded tests that it gains even more of an advantage when using all its (performance) cores compared to the i7.
There is at least one test (the codons) where the M1 does not really "shine", so, as I said, benchmarking your own specific workload is important - the M1 does seem very fast at many common Perl tasks, but not *all*.
Could my old i7 be just too slow? To make sure, I had colleagues with the 16" Mac that has the fastest CPU available, the 8-core 2.4GHz i9, run a couple of the single-core benchmarks for comparison, one that did really well on the M1 and one that did below average:

| Benchmark | Units | i9 | M1 | M1 Diff |
|---|---|---|---|---|
| cpanm Moose | sec | 86.14 | 44.33 | 94.3% |
| prime.pl | sec | 19.35 | 15.56 | 24.4% |

So while the i9 is generally 20-30% faster than the i7, it's still nowhere near the M1, which is more than twice as fast as the i7 on average. Note that the i9 has 8 full speed cores, so things might get tighter for workloads using more than 4 cores at a time.

The Bad

Simply put, not everything works yet. Sometimes the fix is something simple, like the patch I submitted to Sys::Info::Driver::OSX to handle the different reporting of the asymmetric processor cores. But there are other CPAN modules that I have been unable to install, or that show test failures that are not easy to explain.

At least Perl developers will have a native-running environment, even though there are some glitches that should get sorted out in time. I am saying this because while some things run fast even under Rosetta, I have encountered cases where non-native software runs slowly. For example, an Android project I tried takes almost twice the time to compile on the M1. Android Studio is not yet native and it shows, so I would not recommend the platform to Android devs. It is the opposite for iOS devs of course: the M1 is the ideal platform for obvious reasons.

Additionally, the comparison shows the M1 can be much faster than the i7/i9; however, that comparison is important only if you are limited to the Apple world. If you don't need a Mac specifically and will just run Linux (not to mention Windows), then you are not limited to what Apple has to offer. I am referring to AMD of course: for most workloads, a Zen 2 based CPU is quite a bit faster than Intel per thread, and on top of that will offer more cores at a similar price. I don't have a Zen 2 CPU to try out right now, but I do have a ThinkPad X395 which has a Zen-1 based 2.1GHz Ryzen 5 3500U. While it's not an old CPU, the newer Zen-2 based 4000-series and 5000-series CPUs seem to be almost twice as fast per core in various benchmarks, which would probably make those faster than an M1, given that the "slow" 3500U is already a bit faster (around 15% on average it seems) than the i7:

| Benchmark | Units | 3500U | M1 | M1 Diff |
|---|---|---|---|---|
| cpanm Moose | sec | 101.22 | 44.33 | 128.3% |
| prime.pl | sec | 20.01 | 15.56 | 28.6% |

That M1 advantage over the 3500U is probably not enough to hold off Zen 2 CPUs, which also come with many more full-power cores than an M1.
Then again, the M1 is Apple's first "laptop/desktop" silicon and they were possibly targeting efficiency more than raw performance, as the latter was an easy win vs Intel, so I would keep an eye on what their next CPU will bring.

The Ugly?

The #1 criticism of the M1 Macs is not related to their CPU, but the fact that the SSD is soldered on. This means that when the SSD dies (will take several years, but depending on the usage SSDs will eventually fail), you can't just replace it (unless there's complete disassembly, desoldering etc). This "obsolescence by design" might not be that bad given the price of a Mac Mini compared to what Apple users are used to paying, but it is made worse by the fact that an M1 Mac has a signed system volume on the SSD, which is required for the Mac to boot even when booting from an external device. So when the SSD goes, you might not be able to boot your Mac at all - permanently. As I said, not a criticism of the M1 CPU directly, but of the devices that feature it.

Lastly, while my benchmarks were reproducible in general, there was one benchmark - the prime.pl script - that gave me some trouble, exposing a strange and disconcerting issue. If I run the benchmark for n=20000000 multiple times, I get consistent results. If I launch 4 instances as parallel background processes with a batch file, I also get consistent results. It goes a bit like this:

test % perl prime.pl
Found 1270607 prime numbers in 2.7765851020813 sec.
test % perl prime.pl
Found 1270607 prime numbers in 2.78401112556458 sec.
test % perl prime.pl
Found 1270607 prime numbers in 2.77585196495056 sec.
test % sh batch.prime.sh
test % Found 1270607 prime numbers in 3.00496196746826 sec.
Found 1270607 prime numbers in 3.01989006996155 sec.
Found 1270607 prime numbers in 3.02487397193909 sec.
Found 1270607 prime numbers in 3.02904796600342 sec.
test % sh batch.prime.sh
test % Found 1270607 prime numbers in 3.01903891563416 sec.
Found 1270607 prime numbers in 3.02826595306396 sec.
Found 1270607 prime numbers in 3.02855086326599 sec.
Found 1270607 prime numbers in 3.03278708457947 sec.
If I try again in a couple of hours or so, I will still see the same thing. But if I try after a sufficiently long time (I am not clear on what "sufficiently" is: it seems like several hours, but definitely by the next day), without using the Mac Mini in the interim and just leaving it powered on, I start seeing this:

test % perl prime.pl
Found 1270607 prime numbers in 4.18084216117859 sec.
test % perl prime.pl
Found 1270607 prime numbers in 3.94040703773499 sec.
test % perl prime.pl
Found 1270607 prime numbers in 5.06315612792969 sec.
test % perl prime.pl
Found 1270607 prime numbers in 5.53617215156555 sec.
test % perl prime.pl
Found 1270607 prime numbers in 5.34210705757141 sec.
test % sh batch.prime.sh
test % Found 1270607 prime numbers in 3.04679107666016 sec.
Found 1270607 prime numbers in 3.07015514373779 sec.
Found 1270607 prime numbers in 3.07026290893555 sec.
Found 1270607 prime numbers in 3.07335591316223 sec.
test % sudo nice -20 perl prime.pl
Found 1270607 prime numbers in 5.50178408622742 sec.
test % perl prime.pl
Found 1270607 prime numbers in 5.03745698928833 sec.
test % sh batch.prime.sh
test % Found 1270607 prime numbers in 3.04621696472168 sec.
Found 1270607 prime numbers in 3.0637059211731 sec.
Found 1270607 prime numbers in 3.07231998443604 sec.
Found 1270607 prime numbers in 3.07551097869873 sec.
Running a single process is suddenly unpredictably slow. The i7 takes 4.3s for this benchmark on a single thread, so the M1 can be much slower. However, once I batch run 4 parallel processes, I get the same great performance I was seeing before. It is reproducible: it's not a matter of if, but of when I will eventually get into this problematic state which, it seems, I can only solve via a reboot. After a reboot everything is fine once more. I tried to see what's going on in several ways. To check whether something like the efficiency cluster was taking over, or the scheduler was switching cores etc, I tried monitoring with
powermetrics -s cpu_power
The result is not interesting enough to post, because both when the system is in the "good" and the "bad" state, only the performance cores are used (but not just one of them as I expected - a random mix, different each time, but similarly random for both "states"). It's the same story using the CPU history window:
[Screenshot: CPU history window during single-process runs of prime.pl]

The graph above shows a couple of single-process runs of prime.pl while in a "good" state. Each run causes all 4 performance cores (numbers 8 on CPU monitor) to be used at random proportions, and it's the same for "bad" status runs - just the bars are twice as wide, as the calculation takes longer.

To add another clue that makes things weirder rather than explaining the issue, I checked whether my compiled perl was at fault, so I ran the script with the built-in perl, which I assume Apple made sure to compile correctly. System 5.28 is a bit slower in "good" status runs:

test % perl prime.pl
Found 1270607 prime numbers in 2.77031397819519 sec.
test % /usr/bin/perl prime.pl
Found 1270607 prime numbers in 2.95687794685364 sec.
test % perl prime.pl
Found 1270607 prime numbers in 2.77954316139221 sec.
test % /usr/bin/perl prime.pl
Found 1270607 prime numbers in 2.95602297782898 sec.
test % perl prime.pl
Found 1270607 prime numbers in 2.77461099624634 sec.
test % /usr/bin/perl prime.pl
Found 1270607 prime numbers in 2.94599509239197 sec.
But on "bad" status it is faster than my compiled perl (quite consistently, I've done this a few times) - although still much slower than after a reboot:

test % perl prime.pl
Found 1270607 prime numbers in 5.44245409965515 sec.
test % /usr/bin/perl prime.pl
Found 1270607 prime numbers in 4.92102980613708 sec.
test % perl prime.pl
Found 1270607 prime numbers in 5.34624910354614 sec.
test % /usr/bin/perl prime.pl
Found 1270607 prime numbers in 3.51168012619019 sec.
test % perl prime.pl
Found 1270607 prime numbers in 5.66441202163696 sec.
test % /usr/bin/perl prime.pl
Found 1270607 prime numbers in 3.62216806411743 sec.
test % perl prime.pl
Found 1270607 prime numbers in 5.46292304992676 sec.
I did eventually find a good clue: I can trigger this weird behaviour if I force the Mini to sleep and then wake it up - it wakes up in the bad state. However, as there are no battery settings (it's not a laptop), I can't find any "go to sleep" timer in the settings - and, as I said, just leaving it for an hour or two (the screen does go to sleep, there's a setting for that) does not get it into the weird state. In any case, it probably has something to do with the CPU sleep states, something that Apple has missed. Since I couldn't reproduce it with the other workloads, I would assume it's not gonna be easy to track down. It reminds me a bit of the problems I had waking an older MacBook (the white ones) from sleep while connected to multiple monitors - they never actually fixed that, so it had put me off MacBooks for a few years. I seem to hit Apple sleep state bugs.
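One way to make the "good" versus "bad" distinction less dependent on eyeballing console output is to compare the median of a few fresh single-process timings against the median of a baseline taken right after a reboot. The following is only a sketch of that idea: the timings are copied from the prime.pl runs shown earlier, while the 1.25x threshold and the helper names median and slowdown are arbitrary, hypothetical choices.

```perl
use strict;
use warnings;

# Median of a list of numbers.
sub median {
    my @sorted = sort { $a <=> $b } @_;
    my $mid = int(@sorted / 2);
    return @sorted % 2
        ? $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

# Ratio of the current median run time to the baseline median run time.
sub slowdown {
    my ($baseline, $current) = @_;
    return median(@$current) / median(@$baseline);
}

# Single-process timings (seconds) copied from the console output in the text.
my @good = (2.7765851020813, 2.78401112556458, 2.77585196495056);
my @bad  = (4.18084216117859, 3.94040703773499, 5.06315612792969,
            5.53617215156555, 5.34210705757141);

my $threshold = 1.25;    # hypothetical cut-off, not a measured value

my $ratio = slowdown(\@good, \@bad);
printf "slowdown %.2fx: %s\n", $ratio,
    $ratio > $threshold ? 'looks like the "bad" state' : 'looks normal';
```

The median is used rather than the mean because the "bad" timings vary a lot from run to run, and a single unusually fast or slow run should not decide the result.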

Overall

I'd say, despite some caveats, the M1 is showing some impressive potential, especially for people who use macOS and would not get much choice other than Intel's not-that-impressive-lately offerings. If I were looking for my main work machine, I'd probably wait a bit longer for some teething troubles to be solved (unless I wanted to help solve potential Perl-specific issues) - and perhaps wait for the rumoured release later this year of a faster chip ("M1X" or whatever).

5 Comments

Could you wrap these up into a GitHub (etc) repo that others could use to duplicate the benchmarks?

sounds like the "bad state" is preferring the efficiency cores instead of the performance cores.

What I mean is to wrap up what you can in a script so it's quick and easy for someone to replicate, spitting out a result that can be compared to other results.


About Dimitrios Kechagias

Computer scientist, physicist, amateur astronomer.