Downloading Criminal podcast episodes

In my quest to download more podcasts, I decided to tackle another podcast I started listening to, Criminal.

This posed a new set of problems, and I'm going to go over the code for solving them, since I'm actually somewhat proud of it.

It started fairly simple. I whipped out Web::Query and started walking the path:

    my $hdr   = wq($post)->find('header.entry-header > h1 > a');
    my $title = $hdr->text;

However, when I printed the title, Perl informed me that some of the title strings have Unicode characters in them. That's not necessarily bad; Perl just explained that I need to tell the terminal about this. What sparked my interest is which characters.

When I printed them out, I noticed that titles with words like Can't were not using a regular ' character, but a full-fledged Unicode right single quotation mark. That's odd. These aren't quotation marks. But alright, I can easily fix this in Perl. Change the second line to:

    my $title = $hdr->text =~ s/\N{RIGHT SINGLE QUOTATION MARK}/'/rg;
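
Both halves of this step (telling the terminal about Unicode, and swapping the curly quote for a plain apostrophe) can be tried in isolation. This is a minimal sketch, assuming Perl 5.16 or later; the title string is made up for illustration and does not come from the site:

```perl
use strict;
use warnings;
use v5.16;    # \N{...} character names load automatically from 5.16 on

# Declare that STDOUT takes UTF-8, so printing wide characters does not warn.
binmode STDOUT, ':encoding(UTF-8)';

# Made-up title containing U+2019 where an apostrophe belongs.
my $raw = "I CAN\N{RIGHT SINGLE QUOTATION MARK}T REMEMBER";

# /r returns the changed copy and leaves $raw untouched;
# /g replaces every occurrence, not just the first.
my $title = $raw =~ s/\N{RIGHT SINGLE QUOTATION MARK}/'/gr;

print "$raw\n";      # still has the curly quote
print "$title\n";    # plain ASCII apostrophe
```

Because of /r, the original string is never modified, which is what lets the substitution sit on the right-hand side of an assignment.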

Somewhere along the way I realized the live show has a node but no actual episode. No problem, skipping that:

    # skip live show
    $title eq 'CRIMINAL LIVE SHOW' and return;

Now I hit the next challenge: creating the filenames.

The filenames should just be the episode name with a .mp3 suffix, right? Well, yes, except I wanted the episode number. It's provided in words instead of numbers. For example, EPISODE THIRTEEN: THE BIG SLEEP (12.19.2014). Whoa!

Alright, Perl has great language support under the Lingua namespace. A quick search and I found Lingua::EN::Words2Nums. So the process should be: yank the episode number out of the title, pass it through the module to get the number in digits, then reassemble it with .mp3. I wonder if I could write a single line of Perl to do all that...

    my $filename =
        $title =~ s/^EPISODE ([^:]+)\:(.+)$/words2nums($1) . " -$2.mp3"/re;

Not bad!
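
For comparison, here is the same idea unrolled into steps, as a sketch rather than a replacement for the one-liner. The episode_filename helper is hypothetical (it is not part of the script), and it assumes Lingua::EN::Words2Nums is installed from CPAN. The only behavioural difference is that a title which doesn't match the pattern yields undef instead of passing through unchanged:

```perl
use strict;
use warnings;
use Lingua::EN::Words2Nums;    # exports words2nums()

# Hypothetical helper: returns a filename, or undef if the title does not fit.
sub episode_filename {
    my ($title) = @_;

    # Capture the spelled-out number and the rest of the title.
    my ($words, $rest) = $title =~ /^EPISODE ([^:]+):\s*(.+)$/
        or return;

    # Lowercase defensively before converting; undef means "not a number".
    my $number = words2nums(lc $words);
    return unless defined $number;

    return "$number - $rest.mp3";
}

print episode_filename('EPISODE THIRTEEN: THE BIG SLEEP (12.19.2014)'), "\n";
```

The one-liner is more fun; the long form is easier to extend if the site ever changes its title format.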

When using Web::Query, I also get a count when iterating over elements. I wanted to print that, but it starts at zero. I could use sprintf, but I can also use the magical @{[ ]} construct, which interpolates an arbitrary expression into a string:

    print "[@{[$count+1]}]: Fetching episode: $title... ";
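
To see that the two approaches really are interchangeable, here is a small self-contained sketch. The values of $count and $title are stand-ins for what the loop would provide:

```perl
use strict;
use warnings;

# Stand-in values for the loop counter and the parsed title.
my $count = 0;
my $title = 'EPISODE THIRTEEN: THE BIG SLEEP (12.19.2014)';

# @{[ ... ]} builds an anonymous array from the expression and immediately
# dereferences it, so the result is interpolated into the string.
my $interpolated = "[@{[ $count + 1 ]}]: Fetching episode: $title... ";

# The sprintf equivalent keeps the arithmetic outside the string.
my $formatted = sprintf '[%d]: Fetching episode: %s... ', $count + 1, $title;

print $interpolated eq $formatted ? "same\n" : "different\n";    # prints "same"
```

Which one reads better is a matter of taste; the interpolated form keeps everything in one string, while sprintf scales better once there are several computed values.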

The next challenge was twofold:

  • The episodes are in a SoundCloud widget
  • Not all episodes have a download button

What I need to do is first find the link to the episode widget:

    my $episode_link = wq($post)
                           ->find('div.entry-content')
                           ->find('iframe')->attr('src');

The reason I'm calling find twice is that these elements aren't directly nested. That was pretty simple. The iframe is where the widget is located.

We can also get the ID of the episode:

    my ($episode_id) = $episode_link =~ /tracks\/([0-9]+)/;

Except it's sometimes encoded and sometimes it isn't. Oh well, simple enough to handle:

    my ($episode_id) = $episode_link =~ /tracks(?:\/|%2F)([0-9]+)/;

Ah! Now we have it either way.
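
A quick way to check the pattern against both forms is to run it over sample links. This is a sketch with made-up URLs and a made-up track ID, and track_id is a hypothetical helper, not part of the script:

```perl
use strict;
use warnings;

# Made-up iframe URLs: one with a literal slash, one percent-encoded.
my @links = (
    'https://w.soundcloud.com/player/?url=https://api.soundcloud.com/tracks/123456789',
    'https://w.soundcloud.com/player/?url=https%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F123456789',
);

# Hypothetical helper: returns the track ID, or undef if none is found.
sub track_id {
    my ($link) = @_;

    # m{...} avoids having to escape the slash inside the alternation.
    my ($id) = $link =~ m{tracks(?:/|%2F)([0-9]+)};
    return $id;
}

my @ids = map { track_id($_) } @links;
print "$_\n" for @ids;
```

One caveat: the hex digits in percent-encoding are case-insensitive, so a link using %2f would slip past this pattern unless it is written as %2[Ff] or matched with /i.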

And now the really juicy bit:

After researching this, I realized that even the episodes whose widgets do not have a download link can be downloaded if you use the right path, but only if you have a client ID. Where do you get one? By registering with SoundCloud as an app. So wait, how does it work on the website?

Well... the SoundCloud widget iframe downloads a piece of HTML with a script tag pointing to the guts of the widget's JavaScript. That guts JS file (minified and obfuscated) contains the client ID for the web client. Then it fetches all the resources it needs, building the links with that client ID.

Let's start by opening the iframe and fetching the JavaScript link within that HTML:

    my $widget_js_link = wq($episode_link)
                             ->find('script')->first->attr('src');

This is the link to the widget guts JS file. Let's download it:

    my $js < io("https://w.soundcloud.com$widget_js_link");

So far so good. It's minified and obfuscated, but with a simple regular expression we can dig into it and grab the client ID:

    my ($client_id) = $js =~ /production\:"([0-9a-f]+)"/;

Oh yes!

All we need now is to create the proper link, fetch the mp3 file, and save it with the right filename:

    io(
        "https://api.soundcloud.com/tracks/$episode_id/download"
      . "?client_id=$client_id"
    ) > io($filename);
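
Both $episode_id and $client_id come from regex captures, so either one is undefined if its match failed, and the URL would then be built with an empty value. A minimal sketch of a guard, using a hypothetical download_url helper and made-up IDs (the endpoint itself is taken from the post as-is, and the sketch makes no network request):

```perl
use strict;
use warnings;

# Hypothetical helper: builds the download URL, refusing to proceed if either
# regex capture failed and left its variable undefined.
sub download_url {
    my ($episode_id, $client_id) = @_;

    defined $episode_id && $episode_id =~ /\A[0-9]+\z/
        or die "Missing or malformed episode ID\n";
    defined $client_id && $client_id =~ /\A[0-9a-f]+\z/
        or die "Missing or malformed client ID\n";

    return "https://api.soundcloud.com/tracks/$episode_id/download"
         . "?client_id=$client_id";
}

# Made-up values for illustration only.
print download_url('123456789', '0123456789abcdef'), "\n";
```

Dying early gives a clear message at the point of failure, instead of a confusing HTTP error or an empty .mp3 file later on.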

Here is the entire code:

(The highlighter got a bit crazy so most of it is in green, sorry...)

This code, along with the code from my previous entry, can be found in a small repo I started in order to collect these scripts.

UPDATE 2015-01-06: As Miyagawa-san points out, there is an RSS feed that provides direct links. I should have made it clear that I intended to have a technical challenge and learning experience rather than a simple way to download them. The podcast is not meant to be hard to download, and it has comfortable RSS feeds to easily download all episodes.

About Sawyer X

Gots to do the bloggingz