Wallflower improvements and issues
(Read on, I'm asking for help towards the end!)
The latest release (version 1.008) finally added support for the Last-Modified response header. Combined with support for the If-Modified-Since request header (added in version 1.002), this makes it possible to regenerate a static web site faster, by regenerating only the files whose source actually changed.
It goes like this (a small sketch of a cooperating application follows the list):
- during the first run, the application sends a Last-Modified header for pages where it knows the date of the latest change to the source (think of a blog post with a publication/update date)
- wallflower saves the static file and sets the filesystem date to the value of the Last-Modified header
- during subsequent runs, if the target directory from the previous run still exists, wallflower will add an If-Modified-Since header set to the modification time of each file that is already there
- if the application figures out the content has not changed, it can send a 304 Not Modified response, which wallflower interprets as "keep the existing file and move on"
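Here's a minimal sketch of what a cooperating application could look like. The date and the expensive_render() helper are made up for the example; the point is the Last-Modified / If-Modified-Since / 304 exchange described above.

    use HTTP::Date qw( time2str str2time );

    # hypothetical: the date of the latest change to the page's source
    my $last_changed = str2time('Mon, 02 Jan 2023 12:00:00 GMT');

    my $app = sub {
        my ($env) = @_;

        # on subsequent runs, wallflower sends If-Modified-Since;
        # if the source hasn't changed since then, skip the work
        if ( my $since = str2time( $env->{HTTP_IF_MODIFIED_SINCE} // '' ) ) {
            return [ 304, [], [] ] if $last_changed <= $since;
        }

        # otherwise do the (costly) rendering, and date the response
        my $body = expensive_render( $env->{PATH_INFO} );    # made up
        return [
            200,
            [   'Content-Type'  => 'text/html',
                'Last-Modified' => time2str($last_changed),
            ],
            [$body],
        ];
    };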
This only makes subsequent generations of the site faster if producing a page is a somewhat costly operation for the application, and if the application can decide the content is still fresh before doing that work.
There's actually a Plack middleware that supports sending a 304 Not Modified response when the conditions apply: Plack::Middleware::ConditionalGET. However, adding it to an application processed with wallflower won't make the generation faster, because it lets the application generate the entire response before deciding to send a 304 instead. In the case of wallflower, which does not use the network at all, this gains nothing. The only case where I can imagine actual savings (in memory, and in the time saved by not copying the content to the filesystem) is when the application serves large static files and returns a filehandle to the content.
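For reference, this is how that middleware is normally enabled, given a PSGI $app (standard Plack::Builder usage, nothing wallflower-specific):

    use Plack::Builder;

    # wrap the app; the middleware turns matching 200 responses into
    # 304s only after the app has produced the full response
    my $wrapped = builder {
        enable 'ConditionalGET';
        $app;
    };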
Wallflower visits the application starting from / (by default), and then follows all the links found in text/html and text/css responses. Since it behaves as a crawler, any content not reachable from the starting point will be missed.
This is easily fixed by extending the list of starting points (either on the command line or via a file passed as an argument), but it can become cumbersome.
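For example, something like this (the option spellings are from memory of the wallflower documentation, and the extra paths are made up):

    wallflower --application app.psgi --destination _site /hidden/page.html /print/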
I was thinking about ways of extending this: what if the application generated a list of links that are expected to exist, but are not necessarily linked from anywhere in the HTML or CSS? That page could be added to the initial crawling list for wallflower, but we wouldn't want to save the page itself as part of the generated site (since nothing links to it anyway).
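As a sketch, such a page could be a tiny PSGI endpoint whose only purpose is to lead the crawler to otherwise unreachable content (the path and URLs here are invented for the example):

    # hypothetical endpoint, e.g. mounted at /links.html: an otherwise
    # unlinked page that only exists to point the crawler at orphans
    my $links_app = sub {
        my @urls = qw( /orphan.html /print/brochure.html );    # made up
        return [
            200,
            [ 'Content-Type' => 'text/html' ],
            [ map qq{<a href="$_">$_</a>\n}, @urls ],
        ];
    };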
The idea of reading but not archiving reminded me of a famous artifact from Usenet times: the X-No-Archive header.
Here's my first question / call for help: should wallflower read the X-No-Archive: yes header in an HTTP response as "you can follow the links, but don't save this file as part of your crawling"?
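If wallflower adopted that interpretation, the link-list page sketched above would only need to mark its own response, along these lines (proposed semantics, not something wallflower supports today):

    # the same endpoint as before, now asking not to be archived
    my $links_app = sub {
        my @urls = qw( /orphan.html /print/brochure.html );    # made up
        return [
            200,
            [   'Content-Type' => 'text/html',
                'X-No-Archive' => 'yes',    # proposed: follow links, don't save
            ],
            [ map qq{<a href="$_">$_</a>\n}, @urls ],
        ];
    };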
Other possible improvements
There may be other such standard files that are not usually linked from the web site, but that browsers or other clients look for (robots.txt comes to mind) and that wallflower should at least poke for. Note: favicon.ico is not one of them, since it's linked from the <head> section of most HTML pages.
Can anyone think of other such well-known URLs?