Perl QA Hackathon 2011 - Day 1 "Am I Online?"
Despite being on the other side of the world, I hope to remain at least somewhat as productive as if I were at the actual event itself (even if that's not entirely achievable).
I've made it something of a tradition to use time spent in airports to work on modules and algorithms that make it easier to write programs that deal elegantly with being offline, or with being in nasty network environments.
This year I've been revisiting one of my biggest successes in this area, LWP::Online.
I originally wrote LWP::Online in response to the rise of "captured" (my term) wireless networks in airports. Captured wireless networks allow you to connect without a password, and appear to have working internet, but all requests result in a redirect to some login, advertising or paywall page.
Captured wireless networks cause massive confusion to many parts of the CPAN toolchain, as a request for something like 02packages.details.txt.gz will APPEAR to work just fine from the perspective of the HTTP client, but the result is a file that is effectively corrupt (and which usually crashes the CPAN client).
My original solution was to take a sample of about half a dozen major websites (Google, Microsoft, CNN and the like) and issue a request for their front page one at a time.
I then look for various strings on each page: "About Google" on google.com, "Yahoo!" on yahoo.com, and so on. As long as I can find at least two pages containing the expected string, I can assume those websites "work", and thus that I probably have access to "The Interweb" in general.
Requiring only two of the five or six samples allows for a couple of sites to be degraded by hacking or special circumstances, and for one or two companies to go out of business, without the need for an immediate upgrade of the online detection module.
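In rough terms, the first-generation check boils down to something like this (a simplified sketch; the site list and match strings here are illustrative, not the actual LWP::Online configuration):

    use strict;
    use warnings;
    use LWP::Simple ();

    # A sample of major sites, and a string each front page is expected to contain
    my %SAMPLE = (
        'http://www.google.com/'    => 'About Google',
        'http://www.yahoo.com/'     => 'Yahoo!',
        'http://www.microsoft.com/' => 'Microsoft',
        'http://www.cnn.com/'       => 'CNN',
    );

    my $matches = 0;
    foreach my $url ( sort keys %SAMPLE ) {
        my $html = LWP::Simple::get($url);
        next unless defined $html;
        $matches++ if index( $html, $SAMPLE{$url} ) >= 0;
        last if $matches >= 2;   # Two confirmed sites is enough to call it "online"
    }

    print $matches >= 2 ? "online\n" : "offline or captured\n";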
This first generation does have flaws, though. A major company could easily change its site in a way that breaks my detector, and each site must be checked one at a time, which makes the overall detection slow if any single website becomes slow.
For the second generation implementation, I've moved to a completely different algorithm.
The new detector is based on the assumption that major companies with an internet presence will follow best practice and run their website on the "www" variant of their main domain, and that they will issue a 301 or 302 redirect from the plain address to the www address.
For example, a request to http://google.com/ will return a 301 redirect to http://www.google.com/.
A captured wireless network, on the other hand, will redirect pretty much every request to some random small company website, or to a raw IP address.
This method is much faster than the original, as it only requires the very first HTTP transaction, does not need to follow any redirects, and only has to deal with a very small HTTP response that doesn't even need to be dynamically generated on the server.
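The core test for a single site can be sketched roughly as follows (using HTTP::Tiny with redirects disabled purely as an example client; the domain handling is illustrative, not the actual module code):

    use strict;
    use warnings;
    use HTTP::Tiny;

    # True if the bare domain answers with a 301/302 redirect to its own
    # "www" variant, which is what a well-run major site should do.
    sub redirects_to_www {
        my $domain = shift;
        my $http   = HTTP::Tiny->new( max_redirect => 0, timeout => 10 );
        my $res    = $http->get("http://$domain/");

        return 0 unless $res->{status} == 301 or $res->{status} == 302;
        my $location = $res->{headers}{location} or return 0;
        return $location =~ m{^https?://www\.\Q$domain\E(?:[:/]|$)} ? 1 : 0;
    }

    print redirects_to_www('google.com') ? "looks online\n" : "suspicious\n";

A captured network will typically fail this test, because its redirect points at the portal's own host or a raw IP address rather than at the www variant of the domain you asked for.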
To make the new implementation even faster, all of the requests are issued in parallel.
This means the time it takes to determine online-ness is now equal to the time it takes to receive the second-fastest of these tiny, highly-cached redirect responses from a collection of major companies.
For the proof of concept of this new method, I've implemented it using POE::Declare as POE::Declare::HTTP::Online.
The new methodology should be fairly easy to reimplement on top of pretty much any HTTP client.
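As one sketch of what such a reimplementation might look like on a non-POE client (here AnyEvent::HTTP, with an illustrative domain list; POE::Declare::HTTP::Online remains the actual proof of concept), the parallel version is still only a handful of lines:

    use strict;
    use warnings;
    use AnyEvent;
    use AnyEvent::HTTP;

    # Domains we expect to 301/302 from the bare name to the "www" variant
    my @DOMAINS = qw( google.com yahoo.com microsoft.com cnn.com amazon.com );

    my $good = 0;
    my $cv   = AE::cv;

    foreach my $domain (@DOMAINS) {
        $cv->begin;
        http_get "http://$domain/",
            recurse => 0,    # don't follow the redirect, we only want to see it
            timeout => 10,
            sub {
                my ( $body, $headers ) = @_;
                my $location = $headers->{location} || '';
                $good++ if $headers->{Status} =~ /^30[12]$/
                       and $location =~ m{^https?://www\.\Q$domain\E(?:[:/]|$)};
                $cv->end;
            };
    }

    $cv->recv;   # wait for the parallel requests to finish
    print $good >= 2 ? "online\n" : "offline or captured\n";

For simplicity this sketch waits for every response; the real trick is to declare success as soon as the second good redirect arrives, which is what makes the overall timing equal to the second-fastest response.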
Why not check all the sites and compare them to each other? If they all return the exact same response, then you are probably on a captured network. That way you don't have to worry if the sites you are checking have been changed. You only have to assume that the captured network always redirects to the same page.
I hope you skated to the airport; parking and public transport to Kingsford Smith are both extortionate.
Just for information and interest...
Apple's iOS devices do a similar thing when connecting to wireless networks, to work out whether they've hit a Captive Portal.
In short, they GET a predefined document from apple.com in the background. This blog post describes it quite well: http://erratasec.blogspot.com/2010/09/apples-secret-wispr-request.html
neomorphic: The main reason not to check them for similarity is that the similarity may not be perfect. The hijack page may have time-variant information (airline departures, say) or contain the text of the link you meant to go to. These subtle differences make matching "the same" rather more difficult than a simple "eq".
Oliver: It appears that Microsoft has a secret one as well. I should probably add support for both of those to the module, since they will be intentionally robust (as opposed to the current list, which is accidentally robust) :)