Perl QA Hackathon 2011 - Day 1 "Am I Online?"
Despite being on the other side of the world, I hope to remain at least somewhat as productive as if I were at the actual event itself (even if that is not entirely achievable).
I've made it something of a tradition for myself that I use time spent in airports to work on modules and algorithms that make it easier to write programs that deal elegantly with being offline, or being in nasty network environments.
This year I've been revisiting one of my biggest successes in this area, LWP::Online.
I originally wrote LWP::Online in response to the rise of "captured" (my term) wireless networks in airports. Captured wireless networks allow you to connect without a password, and appear to have working internet, but all requests result in a redirect to some login, advertising or paywall page.
Captured wireless networks cause massive confusion to many parts of the CPAN toolchain, as a request for something like 02packages.details.txt.gz will APPEAR to work just fine from the perspective of the HTTP client, but the result is a file that is effectively corrupt (which usually crashes the CPAN client).
My original solution was to take a sample of about half a dozen major websites (Google, Microsoft, CNN and the like) and issue a request for their front page one at a time.
I then look for various strings on each page: "About Google" on google.com, "Yahoo!" on yahoo.com, and so on. As long as I can find at least two pages which have the expected string, then I can assume these websites "work" and thus that I probably have access to "The Interweb" in general.
The need for only two out of our five or six samples allows for degradation of a couple of sites due to hacking or special circumstances, and for one or two companies to go out of business without the need for an immediate upgrade of the online detection module.
This first generation does have flaws though. A major company could easily change their site in such a way as to break my detector, and each site must be checked one at a time, making the overall detection slow if any single website becomes slow.
For the second generation implementation, I've moved to a completely different algorithm.
The new detector is based on an assumption that major companies with an internet presence will follow best practice and run their website on the "www" variant of their main domain, and that they will issue a 301 or 302 redirect from the plain address to the www address.
A captured wireless network, on the other hand, will redirect pretty much every request to some random small company website, or to a raw IP address.
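The per-domain test therefore reduces to inspecting a single response: a genuine network gives a 301/302 whose Location header points at the www variant of the same domain, while a captive portal points somewhere else entirely. A Python sketch of that decision (the function name and structure are mine, not LWP::Online's):

```python
from urllib.parse import urlparse

def looks_genuine(domain, status, location):
    """Does this response look like the expected bare->www redirect?

    domain   -- the bare domain we requested, e.g. "google.com"
    status   -- the HTTP status code of the (unfollowed) response
    location -- the value of the Location header, if any
    """
    if status not in (301, 302):
        return False
    host = urlparse(location).hostname or ""
    # A captive portal redirects to its own host or a raw IP address
    # instead of the www variant of the domain we actually asked for.
    return host == "www." + domain
```

Note that the redirect is never followed; only the status code and Location header are examined.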
This method is much faster than the original, as it only requires the very first HTTP transaction, does not need to follow any redirects, and only needs to deal with a very small HTTP response that doesn't need to be dynamically generated on the server.
To make the new implementation even faster, all of the requests are issued in parallel.
This means the time it takes to determine online-ness is now equal to the time to receive the second-fastest tiny highly-cached redirect response from a collection of major companies.
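That parallel, return-on-second-success behaviour might look something like this in Python (a sketch under my own naming, with the probe function injectable so the logic can be exercised without a network):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def online_parallel(domains, probe, required=2):
    """Issue all probes at once; report online as soon as `required`
    of them confirm a genuine bare->www redirect.

    probe(domain) -> True if the bare domain 301/302s to its www variant.
    """
    hits = 0
    with ThreadPoolExecutor(max_workers=max(1, len(domains))) as pool:
        futures = [pool.submit(probe, d) for d in domains]
        for fut in as_completed(futures):
            if fut.result():
                hits += 1
                if hits >= required:
                    # The second-fastest confirmation decides the result.
                    return True
    return False
```

Since the probes race each other, one slow or broken site no longer drags out the whole check the way it did in the sequential first generation.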
The new methodology should be fairly easy to reimplement on top of pretty much any HTTP client.