Web scraping continued
I recently gave a talk at AmsterdamX.pm about web scraping. I showed a few scraping examples (most of them available in my GitHub repo), and among them a few relating to the January assignments page Neil has put up.
The first was to simply check whether any of them are mine: I check whether I was the last person to release the distribution or whether the repo is under my username. It's not very accurate, since someone else might have made the latest release and some projects live under a GitHub organization, but it's a good start.
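Here's a minimal sketch of that first check. The assignments URL, the CSS selector, and the PAUSE/GitHub identifiers are placeholders (the real page markup will differ); it uses Web::Query to pull the distribution names and MetaCPAN::Client to look up who made the latest release and where the repository lives.

    use strict;
    use warnings;
    use Web::Query;
    use MetaCPAN::Client;

    my $mcpan    = MetaCPAN::Client->new;
    my $pause_id = 'MYPAUSEID';   # placeholder PAUSE ID
    my $gh_user  = 'my-gh-user';  # placeholder GitHub username

    # Hypothetical URL and selector -- adjust to the real assignments page
    wq('http://example.com/january-assignments.html')
        ->find('td.distribution')
        ->each( sub {
            my ( $i, $elem ) = @_;
            my $dist      = $elem->text;
            my $release   = $mcpan->release($dist);
            my $resources = $release->resources || {};
            my $repo      = $resources->{'repository'}{'url'} // '';

            # Mine if I made the last release or the repo is under my username
            print "$dist looks like mine\n"
                if $release->author eq $pause_id
                or $repo =~ m{github\.com/\Q$gh_user\E/}i;
        } );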
Then I thought I could check for people I know personally, like my colleagues, which is what the second script does. Using Acme::CPANAuthors, I apply the same logic as before, but to a larger audience.
By then the output was getting difficult to read, so the third script puts everything in a hash in order to display it more clearly.
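A rough sketch of those two ideas combined, continuing from the snippet above: the 'Israeli' Acme::CPANAuthors group and the source of @dists are assumptions, and the results are collected into a hash keyed by author so the report can be printed in one readable pass.

    use strict;
    use warnings;
    use Acme::CPANAuthors;
    use MetaCPAN::Client;

    # The 'Israeli' group is an assumption -- pick whichever
    # Acme::CPANAuthors list covers the people you know
    my %known = map +( $_ => 1 ),
        Acme::CPANAuthors->new('Israeli')->id;

    my $mcpan = MetaCPAN::Client->new;

    # @dists would come from scraping the assignments page,
    # as in the previous snippet; taken from @ARGV here
    my @dists = @ARGV;

    my %by_author;
    foreach my $dist (@dists) {
        my $author = $mcpan->release($dist)->author;
        push @{ $by_author{$author} }, $dist
            if $known{$author};
    }

    # Collecting into a hash first makes the output easier to read
    foreach my $author ( sort keys %by_author ) {
        print "$author: @{ $by_author{$author} }\n";
    }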
During my talk I also mentioned that I had wanted to combine Web::Query and AnyEvent, and the following script achieves just that.
Since you can provide wq() with content on which it will select, and not just URLs for it to fetch, I can use AnyEvent::HTTP to fetch the pages and then feed the resulting content into wq(). Fun!
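Here's a minimal sketch of that combination, with placeholder URLs and a placeholder selector: AnyEvent::HTTP fetches the pages concurrently, and each response body is handed straight to wq(), which accepts raw HTML just as happily as a URL.

    use strict;
    use warnings;
    use AnyEvent;
    use AnyEvent::HTTP;
    use Web::Query;

    # Hypothetical list of pages to scrape concurrently
    my @urls = (
        'http://example.com/page1.html',
        'http://example.com/page2.html',
    );

    my $cv = AnyEvent->condvar;

    foreach my $url (@urls) {
        $cv->begin;

        # Fetch asynchronously, then hand the raw HTML to wq()
        http_get $url, sub {
            my ( $body, $headers ) = @_;

            # Hypothetical selector -- adjust to the real pages
            wq($body)->find('h1')->each( sub {
                my ( $i, $elem ) = @_;
                print "$url: ", $elem->text, "\n";
            } );

            $cv->end;
        };
    }

    $cv->recv;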
I have another scraper up my sleeve and I'll write a separate post about that.