irc.perl.org #perl-help posed a good question tonight. Why does this only find some of the matches?
my $sequence = "ggg atg aaa tgt tcc cgg taa atg aat gcc cgg gaa ata tag cct gac ctg a";
$sequence =~ tr/ //d;
print "Input sequence is: $sequence \n";
while ($sequence =~ /(atg(...)*?(taa|tag|tga))/g) {print "$1 \n";}
Because, by default, regex /g begins each subsequent search after the end of the last match, so overlapping hits are not found. As this blog post explains, a zero-width positive lookahead assertion is the key to finding all of them. This works great:
while ($sequence =~ /(?=(atg.*?(taa|tag|tga)))/g) {
print "$1\n";
}
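For what it's worth, another way to get the overlapping matches, without the lookahead, is to rewind pos() by hand after each hit. This is just a sketch of the same idea, not what the linked post does, but it should print the same set of matches:

while ($sequence =~ /(atg.*?(taa|tag|tga))/g) {
    print "$1\n";
    pos($sequence) = $-[1] + 1;   # restart the search just past the start of this match
}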
I'm partial to bioinformatics homework after 4 years of hacking on the stuff. :)
My goal a long time ago was to index about 90 to 95% of BackPAN, thinking that if I didn't get some ancient distributions that would be just fine and no one would miss them. There are about 140,000 distributions to index, and I'm figuring out why I can't get the last 4,200. That means I'm already indexing about 97% of them, which is past my original goal.
Recently I got a nice small project to work on: a web interface for a database with a simple search mechanism (Ajax for the frontend, with redirects to the actual result pages).
I received the database in Excel form. No worries, we have the excellent Spreadsheet::ParseExcel so I'm not scared of spreadsheets. Bring it on!
And yes, the client did "bring it on". He brought it on with 260 columns, no less. Each contained a "1" or "0" to mark a match. "You just go over the columns here, look for '1', then continue over to the product name, search for it in this sheet over here, find the number to the right and return that to the client - simple!"
Yes, two-hundred and sixty columns. Alright, so I'll just normalize it. "You don't need to normalizical nothing [double negative!], it's good the way it is" - "No, trust me, I need to normalize it" - "Alright, knock yourself out".
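For the curious, the column-walking itself is straightforward with Spreadsheet::ParseExcel. Here's a rough sketch of it; the file name, sheet index and column layout are made up, not the client's actual spreadsheet:

use strict;
use warnings;
use Spreadsheet::ParseExcel;

my $parser   = Spreadsheet::ParseExcel->new;
my $workbook = $parser->parse('products.xls') or die $parser->error;
my $sheet    = $workbook->worksheet(0);

my ($row_min, $row_max) = $sheet->row_range;
my ($col_min, $col_max) = $sheet->col_range;

my @matches;
for my $row ($row_min + 1 .. $row_max) {      # skip the header row
    for my $col (1 .. $col_max) {             # the ~260 flag columns
        my $cell = $sheet->get_cell($row, $col);
        next unless $cell and $cell->value eq '1';
        push @matches, [ $row, $col ];        # normalized: one (row, column) pair per hit
    }
}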
I'm currently working on extracting data from a system with an XML-based command UI, so I am fairly often dumping serialised Perl objects (using Data::Dump) whilst debugging.
To make the piles of debug output easier for me to parse I pushed the files through Perl::Tidy.
You would not believe how long it takes, or how much memory is required, to run 110MB of Perl data structure dumps through perltidy!
Actually I don't know how long or how much memory it took either - I killed it after half an hour and 3GB.
I mean, who knew! :-)
Next Tuesday in Bernal Heights from 7:00pm until whenever.
For those that haven't been chez Paul we have a basement, bar, projector, wifi, yard, BBQ, etc so we can eat, drink & give presentations. There's space for at least a dozen seated inside, and more outside (for those that can withstand the Day Star).
We'll be hacking on whatever, or just shooting the breeze about Perl.
Summary:
What: SF.pm Hackathon
When: Tuesday 28th September 2010, 19:00 'til Paul kicks us out.
Where: Paul's place, Bernal Heights, SF, 94110 (address emailed to "yes" RSVPs on the day of).
What to bring: computer, snacks & drinks.
A while ago I needed a Perl data serializer with some specific requirements (support for circular references and Regexp objects out of the box, and consistent/canonical output, since the output will be hashed). Here's my rundown of the currently available data serialization Perl modules. A few notes: the labels fast/slow are relative to each other and are not the result of extensive benchmarking.
Data::Dumper. The grand-daddy of Perl serialization modules. Produces Perl code with an adjustable indentation level (the default is lots of indentation, so output is verbose). Slow. Available in core since the early days of Perl 5 (5.005, to be exact). To deserialize, we need to eval() the output, which might not be good for security. Usually the first choice for many Perl programmers when it comes to serialization, and arguably the most popular module for that purpose.
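A minimal round-trip sketch of the canonical-output use case mentioned above: sorted keys so the same structure always dumps to the same string, then eval() to thaw.

use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # canonical key order, so the output can be hashed reliably
$Data::Dumper::Indent   = 1;   # less sprawling than the default indentation

my $data   = { name => 'foo', list => [ 1, 2, 3 ] };
my $frozen = Dumper($data);

my $VAR1;                      # Dumper output assigns to $VAR1
my $thawed = eval $frozen;     # eval()ing code - only do this with trusted input
die "thaw failed: $@" if $@;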
In a previous post I wrote about the lack of a Perl module to build standalone C libraries. I suggested the creation of a new module, and I wrote it. My first working code is available on github. I am happy to accept patches as long as the main objective of the module remains intact.
At the moment I have tested it on Mac OS X (Leopard) and Windows (with Strawberry Perl), in both cases with Perl 5.12.x. So the Build.PL might be missing a minimum Perl version requirement, in case something doesn't work on earlier Perl versions.
Also, documentation is still missing. Refer to test 01-simple.t for directions on how to use it.
Previously, I wrote about modeling the result of repeated benchmarks. It turns out that this isn't easy. Different effects are important when you benchmark run times of different magnitudes. The previous example ran for about 0.05 seconds. That's an eternity for computers. Can a simple model cover the result of such a benchmark as well as that of a run time on the order of microseconds? Is it possible to come up with a simple model for any case at all?

The typical physicist's way of testing a model for data is to write a simulation. It's quite likely a model has some truth if we can generate fake data sets from the model that look like the original, real data. For reference, here is the real data that I want to reproduce (more or less):
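Just to make the idea concrete (this is a strawman, not the model from the post): a toy generator for fake timings might draw most runs from a Gaussian around 0.05 s and bolt on an occasional slow outlier, and then we would compare its histogram to the real one.

use strict;
use warnings;

use constant PI => 3.14159265358979;

sub gaussian {                                # Box-Muller transform
    my ($mean, $sigma) = @_;
    my $u = rand() || 1e-12;
    return $mean + $sigma * sqrt(-2 * log $u) * cos(2 * PI * rand());
}

my @fake;
for (1 .. 1000) {
    my $t = gaussian(0.05, 0.001);            # the main timing peak, in seconds
    $t += rand(0.01) if rand() < 0.05;        # ~5% of runs are randomly slower
    push @fake, $t;
}
print "$_\n" for @fake;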
One of the most common responses to simple, textbook-quality questions on many Perl community outlets is "We are not here to do your homework". It's usually thrown out in a swift, demeaning manner, as if to say "How DARE you ask us to answer your assignment for you?!", and at times it is accompanied by a general comment on the asker's intelligence, seriousness, effort, capabilities, values, ethics and sexual capabilities. It is also, always, the most incorrect response possible.
Just read this blog post. Comments are disabled, so I thought I'd add a blog post.
There are endless ways we can sneer at PHP's deficiencies, but since 5.3 PHP has supported anonymous subroutines, via the function (args) { ... } syntax.
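The PHP example itself isn't reproduced here, but side by side the two look almost the same; a rough Perl rendering, with the PHP 5.3 form in the comment:

# PHP 5.3:  $add = function ($x, $y) { return $x + $y; };
my $add = sub {
    my ($x, $y) = @_;
    return $x + $y;
};
print $add->(2, 3), "\n";   # prints 5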
On 28 February 2006, the Golden Age of Right Parsing ended. The End of the Age was like a rewind of the Beginning. The Golden Age of Right Parsing began when a right parser replaced the hand-written recursive descent parser in the C compiler. It ended, almost three decades later, when the world's most visible C compiler replaced its right parser with a hand-written recursive descent implementation.
I know, I know, it's all free software and open source - you shouldn't expect anyone to do anything. However, we do want our projects to succeed, and we do put them out there in the hope that they will be useful to someone.
So, you can be a jerk and provide a bare .pm file with no documentation (because "it simply works"), no toolchain configuration (you "don't need any") and no tests (hey, if it works, it works, right?).
However, CPAN ain't like that. CPAN is (mostly) well-structured distributions that adhere to community standards, which include: build toolchain files, metadata to help CPAN indexers and installation tools, tests, documentation (in a standard format - POD), sometimes an examples folder with sample scripts using your code, perhaps a GPG signature file so the content can be verified, and - *gasp* - a change log indicating what each version added (and perhaps when it was released, too).
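The toolchain part sounds scarier than it is. A bare-bones Makefile.PL (ExtUtils::MakeMaker is one option among several, Module::Build and Dist::Zilla being others; Foo::Bar here is just a placeholder) is only a few lines:

use ExtUtils::MakeMaker;

WriteMakefile(
    NAME          => 'Foo::Bar',
    VERSION_FROM  => 'lib/Foo/Bar.pm',
    ABSTRACT_FROM => 'lib/Foo/Bar.pm',
    AUTHOR        => 'A. U. Thor <author@example.com>',
    PREREQ_PM     => { 'Test::More' => 0 },
);

Add lib/, t/, a Changes file and a MANIFEST, and you already cover most of the list above.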
While preparing some major technical stuff that's not yet released, here are some more philosophical items. Last time I compared Larry with Al Yankovic, a funny piece with some mostly well-known insights. So let's go even deeper. What's a pearl, really?

Like many Perl scripts, pearls start with a pain. A mussel gets a stone or something else painful into its shell, has to deal with it, and something shiny is born out of it. Perl, likewise, is more focused on solving practical problems than on demonstrating paradigms. But let's go even deeper.

Where does pain come from? From injustice, ignorance and one's own ego, of course. And it shows real greatness to stay humble, not cast yourself as a victim, and do something productive with that situation.

Ultimate darkness, what some would call evil, is called Binah in the rabbinic tradition. And it is associated with a pearl (all the other sephirot with clear, see-through gems), because in the end every darkness or shortcoming is turned into a great gift, but only by those who hold to their greatness. And the Perl community has several of them.

The path from darkness to strength is called Gimel (in the version I prefer, at least), which translates to Camel. How appropriate. I start to wonder what Larry knew when he chose that logo. :)
In the previous article, I wrote about the pitfalls of benchmarking and dumbbench, a tool that is meant to make simple benchmarks more robust.
The most significant improvement over a plain "time ./cmd" is that it comes with a moderately well-motivated model of the distribution of run times you get from invoking ./cmd many times. In data analysis, it is very important to know the underlying statistical distribution of your measured quantity. I assume most people remember from high school that you can calculate the mean and the standard deviation ("error") of your data and use those two numbers as an estimate of the true value and a statement of the uncertainty. That is a reasonable thing to do if the measured quantity has a normal (or Gaussian) distribution. Fortunately for everybody, normal distributions are very common, because if you add up enough statistics, chances are that the result will be almost Gaussian (Central Limit Theorem).
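To spell out the high-school estimate being referred to, assuming nothing more than a plain list of timings:

sub mean_and_stddev {
    my @x    = @_;
    my $n    = @x;
    my $mean = 0;
    $mean += $_ / $n for @x;
    my $var = 0;
    $var += ($_ - $mean) ** 2 / ($n - 1) for @x;   # sample variance
    return ($mean, sqrt $var);
}

my ($mu, $sigma) = mean_and_stddev(0.051, 0.049, 0.052, 0.050, 0.058);
printf "%.4f +/- %.4f s\n", $mu, $sigma;

As the text says, that summary is only really meaningful when the underlying distribution is roughly Gaussian.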
CPAN Ratings are a great idea. Unfortunately it seems that in some cases they are either unused (in places where they should be) or misused (in places where they shouldn't be used).
While some distributions (which the Perl community strongly recommends) don't enjoy having any ratings at all, I've noticed some people posting their personal gripes as ratings.
Theoretically you should be able to do that, true, but these ratings don't act the way you might assume they would. If you have a dist and I rate your first version as one star (for lack of tests, or because I don't find it useful for me), you're stuck with a single star (out of five) until someone changes that. I've seen ratings that have existed since the first development version of a distribution (usually saying "this sucks, how do I use it?!"), leading me to believe that they cannot be deleted, except perhaps by the original poster, who might be MIA.
This weekend I gave a presentation at OpenCert 2010 about the Perl testing ecosystem, based on an article I wrote with ambs++ and jjoao++. The article can be found on the conference site here.
I like the Perl community. We have an open, tolerant and helpful community. Looking around at other languages, I think we’re in much better shape than most people seem to realize.
I loathe the Python community’s snobbish and hostile attitude towards newcomers and outsiders. Maybe I’m biased by having to counter FUD in people who have been exposed to the Python community, but I can’t call it anything else. I wouldn’t be the first to suggest that’s the main reason they never took over, even when they were otherwise in much better shape than we were. They threw bricks through their own windows, and I am somewhat doubtful they’ll get a second chance.