Web Services Part 1: YouTube playlists
It started off with a simple problem: a friend recommended some songs from various artists on YouTube. I go and listen to the songs, like them, and start looking for more from the same artists. Browsing around, I find some interesting playlists but notice that some of them have the same songs, and I think, "I'll just download the playlists and remove duplicates with a quick script".
A little search-fu later and I am disappointed. I find broken links, closed-source paid applications (yeah, like I am going to pay for a closed-source application to access a public API), and a stack of how-to blog posts consisting of nothing more than 'Download this program for Windows and click here, here, and here.' I hit up a few friends on IRC and they all strongly recommended youtube-dl, a Python script that has command-line options for getting playlists. To be clear about what I mean by strongly recommended: I was told how they had used it for years with no problems, how it was the go-to script for downloading all sorts of things from YouTube, and so on. Given that praise, my expectations are now set pretty high: I expect it to work for the problem I have.
youtube-dl the disappointment
I start by going to github.com/rg3/youtube-dl to pull down the latest version and I notice there are 193 Issues and 8 Pull Requests currently open. I clone the repository, set it up, and play with it. I build up the following command to fetch just the playlist from YouTube.
./youtube-dl 'https://www.youtube.com/playlist?list=FLNj0f88cIMwioXOKJE89Lpw' --simulate --get-url > music_vids.txt
Now youtube-dl defaults to trying to download the videos when I only want the URLs, hence the --simulate flag. The above combination did not work, nor did the other combinations I tried, including adding --ignore-errors. Instead I ran into the following problems:
- youtube-dl is slow. It took 2 minutes 17 seconds to get 36 video URLs from a list of over a hundred before failing. I expect this API call to take only a few seconds.
- youtube-dl fails when a video in the playlist has been taken down due to DMCA violations. The application just exits with the error message and does not process the rest of the list.
- youtube-dl fails when a video in the playlist has been deleted by the author. Fails and exits.
- youtube-dl fails when a video in the playlist has been blocked based on your geo-location. Fails and exits.
- The URLs youtube-dl returns are for the CDN locations, not the canonical URLs. CDN URLs are not guaranteed to be valid in the long term. At the very minimum this should be noted in the documentation.
I checked the 8 pull requests for fixes and found nothing; same for the issue queue. By now I have spent about twenty minutes on research and trying things out. At this point I think it will probably be faster for me to write a program to get the playlists than to try to fix this application or keep looking for something else.
Build it yourself
The first step is to see what modules are on CPAN. I am going to go into a little more depth about how I research this, to show some of the potential problems with our Perl ecosystem, the key one being signal-to-noise ratio. Finding quality in the mass of everything is not a problem unique to Perl, but the Perl community has come up with some interesting solutions. I open up search.cpan.org in one tab and metacpan.org in another. I do this because, amongst other things, the search capabilities of the two are different. MetaCPAN's search is great if the keyword you are looking for happens to be part of the module name.
youtube-playlists more disappointment
On MetaCPAN, when you put "youtube" in the search field, the first auto-complete is youtube-playlists, a script that is part of the WWW::YouTube::Download module. It was last updated May 5, 2013, has zero bug reports, two five star reviews and zero failed test reports. The synopsis shows example usage that looks exactly like what I want. I install WWW::YouTube::Download with no problems and run 'youtube-playlists FLNj0f88cIMwioXOKJE89Lpw', and it returns only 25 results on a playlist with more than one hundred entries. Looking at the POD, there are no CLI options to change the result count, and looking at the code I see it is only pulling down the first page of results as XML, with no mechanism for handling pagination. I consider hacking away at this program, but there must be a YouTube module that handles most of the work instead of just pulling in raw XML feeds and doing it all yourself. I am also surprised this program did not show up in the general search engines, only in the CPAN ones.
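The gap described above, a single page of 25 results with no follow-up requests, is the usual pagination problem. Here is a minimal Python sketch of the common fix; it is not taken from youtube-playlists or from any CPAN module. `fetch_page` is a hypothetical stand-in for one feed request, and the 1-based start index and page size of 25 are assumptions chosen to mirror the behavior observed here.

```python
from typing import Callable, List

PAGE_SIZE = 25  # assumed page size, matching the 25 results seen above


def fetch_all(fetch_page: Callable[[int, int], List[str]],
              page_size: int = PAGE_SIZE) -> List[str]:
    """Call fetch_page(start_index, max_results) until a short page comes back."""
    entries: List[str] = []
    start = 1  # assumed 1-based start index
    while True:
        page = fetch_page(start, page_size)
        entries.extend(page)
        if len(page) < page_size:  # short or empty page: nothing left to fetch
            break
        start += page_size
    return entries


def make_fake_feed(total: int) -> Callable[[int, int], List[str]]:
    """Hypothetical in-memory feed standing in for real HTTP requests."""
    items = [f"video-{n}" for n in range(1, total + 1)]

    def fetch_page(start_index: int, max_results: int) -> List[str]:
        return items[start_index - 1:start_index - 1 + max_results]

    return fetch_page


if __name__ == "__main__":
    feed = make_fake_feed(112)
    print(len(feed(1, PAGE_SIZE)))  # what a single request returns
    print(len(fetch_all(feed)))     # what the paging loop returns
```

Stopping on a short page costs one extra request when the playlist length is an exact multiple of the page size, but it avoids depending on a total-count field that a given feed may or may not provide.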
WWW::YouTube has not been updated since July 28, 2008 and has one review which is one star. The positive points are there are zero bugs in RT and zero failed test reports. The single review raises multiple issues that I agree with based on reading the POD for the different modules WWW::YouTube provides. Therefore WWW::YouTube is not a good fit.
Last updated Jan. 20, 2009, it has 4 open bug reports and zero failed test reports. Glancing at the POD I notice the following message: "This module support only Legacy API, does not support YouTube Data API based on Google data protocol." To me, in this instance, Legacy reads as deprecated, so why would I build something on that?
WebService::GData::YouTube was the next viable option. It was last updated Nov 13, 2011, and has 3 bugs, 1 five star review, six failed test reports and 515 passed reports. A quick search for 'playlist' in the POD reveals "get_user_playlist_by_id() - Retrieve the videos in a playlist by passing the playlist id." Exactly what I am looking for, and it installs with no errors. Here is my short program.
I call get_user_playlist_by_id() in my while loop so it keeps fetching blocks of 25 videos until the playlist is done. Videos that have been blocked or taken down can be skipped because their 'duration' field does not exist. The URLs returned as part of $video have query parameters that are not needed, so I use URI::URL to pull out the host and path. The program outputs the playlist position number followed by the canonical YouTube URL.
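The filtering and URL clean-up described above can be sketched as follows. This is a Python illustration, not the Perl program itself. The entry dictionaries, their `position`, `duration` and `url` keys, and the example URLs are all hypothetical stand-ins for what a playlist feed might return. `urlsplit` plays the role that URI::URL plays in the text, and keeping the scheme alongside the host and path is an assumption.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def playable(entries: Iterable[Dict[str, Any]]) -> Iterator[Tuple[int, str]]:
    """Yield (position, cleaned URL) for entries that carry a duration."""
    for entry in entries:
        if "duration" not in entry:  # blocked or removed video: skip it
            continue
        yield int(entry["position"]), strip_query(str(entry["url"]))


if __name__ == "__main__":
    # Hypothetical entries; real feed data will be shaped differently.
    sample = [
        {"position": 1, "duration": 215,
         "url": "http://www.youtube.com/v/AAAA?version=3&f=playlists"},
        {"position": 2,  # no duration key: treated as unavailable
         "url": "http://www.youtube.com/v/BBBB?version=3&f=playlists"},
        {"position": 3, "duration": 187,
         "url": "http://www.youtube.com/v/CCCC?version=3&f=playlists"},
    ]
    for position, url in playable(sample):
        print(position, url)
```

Skipped entries keep their original position numbers, so a gap in the output shows where an unavailable video sat in the playlist.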
I also took a quick look at other modules to see what else there was. WWW::YouTube::Info and WWW::YouTube::Info::Simple do not have any playlist functionality.
The takeaway
I would like to think I selected the best module available, but there could still be something better on CPAN or out in the wild. It is always a balance between time spent doing research and building it yourself if you can easily conceptualize the problem. I think far too often people get sucked into the research phase, which can be quite fun. I use the bug count, last release date, pass/fail test reports, CPAN Rating, and the list of previous versions to determine if a module is mature enough for a deeper look to solve the problem I have.
There is no single search engine that encapsulates everything, so just using one generalized option like Google is not enough. Searching CPAN directly is always worthwhile. Next stop would have been GitHub, then Bitbucket, and SourceForge.
A few months ago I researched the same problem. If I remember correctly, I ended up using the excellent XML::Feed, which on its own and with minimal fuss is quite capable of handling the YouTube API.
The only problem I encountered is that the API only provides access to the last 1000 entries in a playlist. If you want to completely capture bigger playlists (which exist), you need to use good old screen-scraping tactics.