Web scraping continued

I recently gave a talk at AmsterdamX.pm about web scraping. I showed a few scraping examples (most of them are in my GitHub repo), among them a few relating to the January assignments page Neil has put up.

The first simply checks whether any of the assigned distributions are mine: I check whether I'm the last person who released it or whether the repo is under my username. It's not very accurate, because someone else might have made the latest release and some projects live under a GitHub organization, but it's a good start.
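The check described above could be sketched roughly like this. It is not the script from the talk: the record fields (`name`, `releaser`, `repo`) and the IDs `MYPAUSEID` and `mygithub` are made up for illustration, and the scraping step that would produce the records is left out.

```perl
use strict;
use warnings;
use 5.010;

# A distribution counts as "mine" if I made the latest release
# or if its repository sits under my GitHub username.
sub is_mine {
    my ( $dist, $pause_id, $github_user ) = @_;

    return 1 if lc( $dist->{releaser} // '' ) eq lc $pause_id;
    return 1
        if ( $dist->{repo} // '' )
        =~ m{^https?://github\.com/\Q$github_user\E(?:/|$)}i;

    return 0;
}

# Hypothetical records, shaped the way a scraper might hand them over.
my @dists = (
    {   name     => 'Some-Dist',
        releaser => 'MYPAUSEID',
        repo     => 'https://github.com/some-org/some-dist',
    },
    {   name     => 'Other-Dist',
        releaser => 'SOMEONE',
        repo     => 'https://github.com/mygithub/other-dist',
    },
    {   name     => 'Third-Dist',
        releaser => 'SOMEONE',
        repo     => 'https://github.com/someone/third-dist',
    },
);

say $_->{name} for grep { is_mine( $_, 'MYPAUSEID', 'mygithub' ) } @dists;
```

The first record matches on the releaser even though the repo is under an organization, and the second matches on the repo even though someone else released it. A distribution that someone else released from an organization repo would still slip through, which is the inaccuracy mentioned above.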

Then I thought I could check for people I know personally, like my colleagues. I did this in the second script: using Acme::CPANAuthors, I apply the same logic as before, but to a larger group of authors.

By then the output was difficult to read. The third script put everything in a hash so I could display it better.
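Here is a rough sketch of how the second and third steps might fit together, again not the actual scripts. The record fields and the sample IDs are invented, and `known_ids` assumes the matching Acme::CPANAuthors category module (for example `Japanese`, as in that module's synopsis) is installed.

```perl
use strict;
use warnings;
use 5.010;

# PAUSE IDs of a group of authors, taken from an Acme::CPANAuthors category.
# Loaded lazily so the rest of the sketch runs without the module.
sub known_ids {
    my ($category) = @_;
    require Acme::CPANAuthors;
    return map { uc $_ } Acme::CPANAuthors->new($category)->id;
}

# Collect distribution names in a hash keyed by releaser,
# keeping only the authors we care about.
sub group_by_author {
    my ( $dists, @ids ) = @_;
    my %known = map { uc($_) => 1 } @ids;

    my %by_author;
    for my $dist (@$dists) {
        my $id = uc( $dist->{releaser} // '' );
        push @{ $by_author{$id} }, $dist->{name} if $known{$id};
    }

    return \%by_author;
}

# One readable line per author instead of one line per match.
sub report {
    my ($by_author) = @_;
    return map { sprintf '%s: %s', $_, join ', ', sort @{ $by_author->{$_} } }
        sort keys %$by_author;
}

# Hypothetical input; real IDs would come from known_ids('Japanese') or similar.
my @dists = (
    { name => 'Some-Dist',  releaser => 'ALICE' },
    { name => 'Other-Dist', releaser => 'BOB' },
    { name => 'Third-Dist', releaser => 'ALICE' },
);

say for report( group_by_author( \@dists, 'ALICE', 'CAROL' ) );
```

Keying the hash by author is what makes the output easier to read: everything belonging to one person ends up on a single line.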

During my talk I also mentioned that I wanted to merge Web::Query and AnyEvent, and the following script achieves just that.

Since you can provide wq() with content to select from, and not just URLs for it to fetch, I can use AnyEvent::HTTP to do the fetching and then feed the response into wq(). Fun!
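A minimal sketch of that combination, assuming Web::Query, AnyEvent and AnyEvent::HTTP are installed. The helper names `select_text` and `fetch_all` and the selector are mine, purely for illustration, and this is not the script from the talk.

```perl
use strict;
use warnings;
use 5.010;
use AnyEvent;
use AnyEvent::HTTP;
use Web::Query;

# wq() is handed HTML content here, not a URL, so it does no fetching itself.
sub select_text {
    my ( $html, $selector ) = @_;
    my @found;
    wq($html)->find($selector)->each(
        sub {
            my ( $i, $elem ) = @_;
            push @found, $elem->text;
        }
    );
    return @found;
}

# Fetch all URLs concurrently with AnyEvent::HTTP, then run the selector
# on each response body as it arrives.
sub fetch_all {
    my ( $selector, @urls ) = @_;
    my %results;

    my $cv = AnyEvent->condvar;
    $cv->begin;    # guard so recv returns even with no URLs
    for my $url (@urls) {
        $cv->begin;
        http_get $url, sub {
            my ( $body, $headers ) = @_;
            if ( $headers->{Status} =~ /^2/ ) {
                # $body is the raw response; decoding is skipped in this sketch
                $results{$url} = [ select_text( $body, $selector ) ];
            }
            else {
                warn "$url failed: $headers->{Status} $headers->{Reason}\n";
            }
            $cv->end;
        };
    }
    $cv->end;
    $cv->recv;    # block until every request has called end

    return \%results;
}

# Usage: perl sketch.pl 'li a' http://example.com/ http://example.org/
if (@ARGV) {
    my ( $selector, @urls ) = @ARGV;
    my $results = fetch_all( $selector, @urls );
    for my $url ( sort keys %$results ) {
        say "$url: $_" for @{ $results->{$url} };
    }
}
```

The begin/end pairs on the condvar are what let all the requests run at once while still waiting for the last one before printing.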

I have another scraper up my sleeve and I'll write a separate post about that.

About Sawyer X

Gots to do the bloggingz