Gitpan Languages

As you may know, Perl is the second most popular language on github. Well, that's what the page says and that page is wrong for a variety of reasons, but first I'm going to talk about an unexpected problem at work.

Other groups often follow our change event log to see if there are any changes their systems need to know about. The default is to present users with 10 changes at a time, but you can ask for one at a time, twenty at a time, one thousand at a time and so on. One of our customers decided that they wanted our entire change history, so they asked for a million changes. Our system could fetch the changes, but imagine trying to construct an XML document with a million complicated change events. The request timed out. So our customer tried again. Repeatedly. (Actually, I believe it was an automated system which retried on timeouts.) By the time we had several of these requests piling up, our system had been dragged to its knees.

Our system wasn't malfunctioning, per se, but I don't think it really occurred to anyone that someone would try to do this. That's why programmers don't write acceptance tests. We tend to test against expectations. I remember one of our acceptance testers pointing out a bug in my code with an extremely complicated set of steps to replicate and my first thought was "why the hell would someone do that?", and then I remembered the million change events.

In short, our code often holds up really well when you do what the developers expect. When you don't, all bets are off. The reason I mention this, particularly in light of Github, is that they've built an incredible tool, but it has some rough edges. When you push it to its limits, they start becoming obvious.

First, you really have to thank Schwern for creating gitpan (the actual repositories are with the gitpan user). It's a fantastic resource, but there are some oddities in its 21,766 repositories. I was looking at Smalltalk projects and happened to notice that some of the projects were owned by gitpan! (7 of them, if you must know). For example, github thinks that Tk is a mixture of C, Perl, Objective-C, Shell, C#, C++ and the aforementioned Smalltalk. That's because github uses file extensions to guess the language.
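To see how guessing by extension can go wrong, here is a minimal sketch. The extension table and the function names (guess_language, tally_languages) are made up for illustration; they are not github's real mapping or code. Under this table, any file ending in .m counts as Objective-C and any file ending in .st counts as Smalltalk, no matter what is actually inside it.

```perl
use strict;
use warnings;

# Hypothetical extension-to-language table; NOT github's real mapping.
my %LANGUAGE_FOR = (
    pl => 'Perl', pm => 'Perl', t => 'Perl',
    c  => 'C',    h  => 'C',
    m  => 'Objective-C',
    cs => 'C#',
    st => 'Smalltalk',
    sh => 'Shell',
);

# Guess a language from the file extension alone; undef if unknown.
sub guess_language {
    my ($filename) = @_;
    my ($ext) = $filename =~ /\.([^.\/]+)\z/ or return;
    return $LANGUAGE_FOR{ lc $ext };
}

# Count files per guessed language for a list of file names.
sub tally_languages {
    my (@files) = @_;
    my %count;
    for my $file (@files) {
        my $language = guess_language($file) // next;
        $count{$language}++;
    }
    return \%count;
}
```

Calling tally_languages('Tk.pm', 'glue.c', 'glue.m', 'README') on that invented file list would report one file each of Perl, C and Objective-C, which is how a Perl distribution with a stray .m file ends up looking partly like Objective-C.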

Coming back to the languages graph, we can suspect that those numbers are off. In fact, I'm told that until recently, Perl trailed behind Python. While Schwern wasn't trying to game the system, the net effect is that he has (and has done so in a way few others could hope to duplicate). However, I started noticing that quite a few of gitpan's nearly 22,000 distributions appeared to show up under other languages. No matter your feelings regarding SOAP::Lite, I found it hard to believe that 1% of SOAP::Lite is Visual Basic. (OK, I don't find it hard to believe any more. Seems it's true!)

There was no way I could trawl through all of the entries, but my curiosity was piqued. Fortunately, github has an API. So, for example, if you want to see all of my repositories, you can issue the following command:

curl http://github.com/api/v2/yaml/repos/show/ovid

Now you want to take a guess as to what happens when you s/ovid/gitpan/ and try to pull its 21,766 repositories? I've already opened a trouble ticket. My request for Atom feed help was my second attempt to get this data. Here's my third attempt:

    use strict;
    use warnings;
    use HTML::SimpleLinkExtor;
    use WWW::Mechanize;

    my $page = 'http://github.com/gitpan/repositories?page=';
    my $mech = WWW::Mechanize->new;

    # Walk every page of the repository listing.
    for my $page_num ( 1 .. 726 ) {
        $mech->get("$page$page_num");
        print_links( $mech->content );
        sleep 1;    # be nice to them
    }

    # Print the name of every gitpan repository linked from the page.
    sub print_links {
        my $html      = shift;
        my $extractor = HTML::SimpleLinkExtor->new;
        $extractor->parse($html);
        for my $link ( $extractor->links ) {
            if ( $link =~ m{^/gitpan/([^/?]+)$} ) {
                print $1, $/;
            }
        }
    }

I felt dirty, but it worked. Getting the languages took a bit more work and I won't reproduce the code here, but it took 6 or 7 hours to run. It's basically a wrapper around this:

curl http://github.com/api/v2/yaml/repos/show/gitpan/SOAP-Lite/languages
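A wrapper around that call might look roughly like the sketch below. This is not the code I ran. The function names (languages_url, fetch_languages) and the delay parameter are invented for illustration, and it only collects the raw YAML without parsing it. It uses HTTP::Tiny, which ships with recent Perls.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Base of the API call shown above.
my $API_BASE = 'http://github.com/api/v2/yaml/repos/show';

# Build the languages URL for one repository.
sub languages_url {
    my ( $user, $repo ) = @_;
    return "$API_BASE/$user/$repo/languages";
}

# Fetch the raw YAML for each repository, pausing between requests.
# Returns a hashref of repository name => response body.
sub fetch_languages {
    my ( $user, $delay, @repos ) = @_;
    my $http = HTTP::Tiny->new( timeout => 30 );
    my %yaml_for;
    for my $repo (@repos) {
        my $response = $http->get( languages_url( $user, $repo ) );
        if ( $response->{success} ) {
            $yaml_for{$repo} = $response->{content};
        }
        else {
            warn "Failed to fetch $repo: $response->{status} $response->{reason}\n";
        }
        sleep $delay;    # assumed politeness delay, in seconds
    }
    return \%yaml_for;
}
```

You would call it as fetch_languages('gitpan', 1, @repo_names), with the names produced by the scraper. With a one-second pause, 21,766 requests need about six hours of sleeping alone, which is in the same range as the run time I saw.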

So, unless I've gotten my SQL wrong, here's what github thinks about the gitpan repositories:

  • Code in 34 languages from ActionScript to VisualBasic.
  • There are 30 distributions it thinks are in one language but have no Perl at all (this is true for the S5 slideshow someone uploaded, but not for Win32odbc).
  • 1516 distributions are written in more than one language.
  • 507 distributions simply have no Perl.
  • 441 distributions have no code at all! (For example, see Acme-MOM-Yours)
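For the curious, tallies like the ones above can be computed from a map of repository name to detected languages. The sketch below is an illustration rather than the SQL I used. The sample data is invented, and I am assuming here that "no Perl" means a repository that has some code but none of it Perl.

```perl
use strict;
use warnings;

# Summarize a map of repository name => { language => bytes }.
sub summarize {
    my ($languages_for) = @_;
    my ( %seen, %summary );
    $summary{$_} = 0 for qw(multi_language no_perl no_code);
    for my $repo ( keys %$languages_for ) {
        my @languages = keys %{ $languages_for->{$repo} };
        $seen{$_} = 1 for @languages;
        $summary{no_code}++        if @languages == 0;
        $summary{multi_language}++ if @languages > 1;

        # Assumption: "no Perl" only counts repositories that have some code.
        $summary{no_perl}++
            if @languages && !exists $languages_for->{$repo}{Perl};
    }
    $summary{distinct_languages} = keys %seen;
    return \%summary;
}

# Invented sample data, not real gitpan results.
my %sample = (
    'Some-Dist'   => { Perl => 1200 },
    'Mixed-Dist'  => { Perl => 900, C => 300 },
    'Slides-Only' => { JavaScript => 500 },
    'Empty-Dist'  => {},
);

my $summary = summarize( \%sample );
printf "%-18s %d\n", $_, $summary->{$_} for sort keys %$summary;
```

On the sample data this reports three distinct languages and one repository in each of the other three categories.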

I think there should be a better way to signal to github what language your repository is in.

What? You don't believe these numbers? See for yourself. That links to an SQLite database with three tables (repository, language, repo_2_language).

5 Comments

I once asked Github support what to do if the reported language was wrong. They said, "It won't be. If it is, tell us and we'll fix it. But it won't be wrong."

I have a vague memory of reading about a tool that github uses to do analysis of source code files. They open sourced it so other people could improve it. Sadly I can't remember the name or find details of it now :(

Today on GitHub, the "Most Watched Today" Perl projects include markdown-js (A Markdown parser for javascript) and Objective-C-Metaclass (A high-level metaclass system for Objective-C). Funny...


About Ovid

Have Perl; Will Travel. Freelance Perl/Testing/Agile consultant. Photo by http://www.circle23.com/. Warning: that site is not safe for work. The photographer is a good friend of mine, though, and it's appropriate to credit his work.