Reflecting on the DFW.pm winter hackathon competition
I know it has been a while, but I’m finally getting around to posting my reflections on the Dallas-Ft. Worth Perl Mongers Winter Hackathon. Better late than never, right? I will talk about my solution, but I also want to comment on the hackathon itself, which I thought was an excellent bit of fun.
First of all I want to thank all the DFW.pm members and the sponsor The Perl Shop for the excellent event. I have been trying to pin down why I enjoyed it so much, and I think it came down to a few things.
First, it forced me to take on a problem I hadn’t tried before — one I probably wouldn’t have found interesting outside of a competition — and gave it a fun motivation. By doing so I learned a lot.
Second, I got shell access to a remote computer. Yes, this seems trivial, but it kinda feels like borrowing someone else’s car for the weekend, ya know? The organizers did this because it was impractical for everyone to download a huge volume of test data, but I think being handed the keys was more fun than the practical reason we needed them.
Third (though these points are in no particular order), I got to interact with a new group of Perlers. I regularly meet with Chicago.pm members and I spend a fair amount of time on #mojo and #pdl, but suddenly I was talking with a whole new group of people. I even got to participate in the interactive testing and the wrap-up meeting via Google Hangouts, which worked relatively well for multiple participants sharing video and screens.
Once again, congrats to all of you organizers and participants, it was a great time.
Now about my code, which is available here: https://github.com/jberger/DeDuperizer/blob/master/deduperizer.pl
I will admit that though I had meant to create a full CPAN-able dist, I ran short of time and only wrote a script. Nope, not one test :-( Still, real life happens.
The competition was judged in several categories. I was relatively sure that I wouldn’t win for speed, because I didn’t have the experience that some of my “opponents” had, nor the time to experiment with different algorithms. In fact I actually hit on some of the secret sauce, but didn’t organize it well (my Monte Carlo method cannot be combined with the hashing; it’s either/or).
Since I saw that limitation, I decided to shoot for lowest memory use. I did win that, and at the time I chalked it up to my use of File::Map for file access. Someone mentioned that the OS was likely buffering the reads even though I was using mapping, and looking back at the margin I won by, I would have expected a bigger gap if memory-mapped file access had been the boon I hoped for. That left me with only one explanation: compile-time optimizations.
As you can see in the script, I parse the command-line options in a BEGIN block and build constants out of the results. Then later, when I use those constants, the compiler can optimize away any “unreachable” code, that is, any code masked by a false constant. This meant that my script did not have to build and load any code that wasn’t needed for a given run.
A simple example is this code here:
perl -MO=Deparse -e 'use constant TEST => 1; print TEST ? "True" : "False"'
-MO=Deparse tells the compiler not to run the code, but to print out the code as it sees it AFTER compilation. The result reflects the optimized state of the program:

use constant ('TEST', 1);
print 'True';
-e syntax OK
The last line is just the diagnostic check. As you can see, the compiler optimized away the conditional and thus the program was (minutely) smaller and would not need to check that execution branch in later operation.
I suspect that these optimizations were the real reason I came in with the smallest memory footprint. Still, I liked the idea of File::Map, so check it out if you haven’t already; it’s cool stuff.
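If you do want to try it, the basic usage is pleasantly small. A sketch (assuming File::Map is installed from CPAN, and with a placeholder filename): map_file ties a scalar to the file’s contents, so you read it like an ordinary string while the OS pages data in on demand instead of slurping the whole file.

```shell
# Map a file read-only and access it through a plain scalar.
# "some_big_file" is a placeholder; substitute any real path.
perl -MFile::Map=map_file -E '
  my $file = shift;
  map_file my $map, $file, "<";   # read-only memory mapping
  say length $map;                # behaves like a normal string
  say substr $map, 0, 10;         # only the touched pages are faulted in
' some_big_file
```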
Cheers, and I’m looking forward to the next one!