Wallflower improvements and issues
(Read on, I'm asking for help towards the end!)
The latest release (version 1.008) finally added support for the Last-Modified response header. Combined with support for the If-Modified-Since request header (added in version 1.002), this makes it possible to regenerate a static web site faster, by regenerating only the files whose source actually changed.
It goes like this (a small sketch of a cooperating application follows the list):
- during the first run, the application sends a Last-Modified header for pages where it knows the date of the latest change to the source (think of a blog post with a publication/update date)
- wallflower saves the static file and sets the filesystem date to the value of the Last-Modified header
- during subsequent runs, if the target directory from the previous run still exists, wallflower will add an If-Modified-Since header set to the modification time of each file that is already there
- if the application figures out the content has not changed, it can send a 304 Not Modified response, which wallflower interprets as "keep the existing file and move on"
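Here's a minimal sketch of what a cooperating application could look like. The date and the expensive_render() helper are made up for the example; the point is the Last-Modified / If-Modified-Since / 304 exchange described above.

    use HTTP::Date qw( time2str str2time );

    # hypothetical: the date of the latest change to the page's source
    my $last_changed = str2time('Mon, 02 Jan 2023 12:00:00 GMT');

    my $app = sub {
        my ($env) = @_;

        # on subsequent runs, wallflower sends If-Modified-Since;
        # if the source hasn't changed since then, skip the work
        if ( my $since = str2time( $env->{HTTP_IF_MODIFIED_SINCE} // '' ) ) {
            return [ 304, [], [] ] if $last_changed <= $since;
        }

        # otherwise do the (costly) rendering, and date the response
        my $body = expensive_render( $env->{PATH_INFO} );    # made up
        return [
            200,
            [   'Content-Type'  => 'text/html',
                'Last-Modified' => time2str($last_changed),
            ],
            [$body],
        ];
    };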
This only makes subsequent generations of the site faster if producing a page is a somewhat costly operation for the application, and if the application can decide the content is still fresh before doing that work.
There's actually a Plack middleware that supports sending a 304 Not Modified response when the conditions apply: Plack::Middleware::ConditionalGET. However, adding it to an application processed with wallflower won't make the generation faster, because it lets the application generate the entire response before deciding to send a 304 instead. In the case of wallflower, which does not use the network at all, this gains nothing. The only case where I can imagine actual savings (in memory, and in the time saved by not copying the content to the filesystem) is when the application serves large static files and returns a filehandle to the content.
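For reference, this is how that middleware is normally enabled, given a PSGI $app (standard Plack::Builder usage, nothing wallflower-specific):

    use Plack::Builder;

    # wrap the app; the middleware turns matching 200 responses into
    # 304s only after the app has produced the full response
    my $wrapped = builder {
        enable 'ConditionalGET';
        $app;
    };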
Wallflower visits the application starting from / (by default), and then follows all the links found in text/html and text/css responses. Since it behaves as a crawler, any content not reachable from the starting point will be missed.
This is easily fixed by extending the list of starting points (either on the command line or via a file passed as an argument), but it can become cumbersome.
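For example, something like this (the option spellings are from memory of the wallflower documentation, and the extra paths are made up):

    wallflower --application app.psgi --destination _site /hidden/page.html /print/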
I was thinking about ways of extending this: what if the application generated a list of links that are expected to exist, but are not necessarily linked from anywhere in the HTML or CSS? That page could be added to the initial crawling list for wallflower, but we wouldn't want to save the page itself as part of the generated site (since nothing links to it anyway).
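As a sketch, such a page could be a tiny PSGI endpoint whose only purpose is to lead the crawler to otherwise unreachable content (the path and URLs here are invented for the example):

    # hypothetical endpoint, e.g. mounted at /links.html: an otherwise
    # unlinked page that only exists to point the crawler at orphans
    my $links_app = sub {
        my @urls = qw( /orphan.html /print/brochure.html );    # made up
        return [
            200,
            [ 'Content-Type' => 'text/html' ],
            [ map qq{<a href="$_">$_</a>\n}, @urls ],
        ];
    };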
The idea of reading but not archiving reminded me of a famous artifact from Usenet times: the X-No-Archive header.
Here's my first question / call for help: should wallflower read the X-No-Archive: yes header in an HTTP response as "you can follow the links, but don't save this file as part of your crawling"?
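If wallflower adopted that interpretation, the link-list page sketched above would only need to mark its own response, along these lines (proposed semantics, not something wallflower supports today):

    # the same endpoint as before, now asking not to be archived
    my $links_app = sub {
        my @urls = qw( /orphan.html /print/brochure.html );    # made up
        return [
            200,
            [   'Content-Type' => 'text/html',
                'X-No-Archive' => 'yes',    # proposed: follow links, don't save
            ],
            [ map qq{<a href="$_">$_</a>\n}, @urls ],
        ];
    };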
Other possible improvements
There may be other such standard files that are not usually linked from the web site, but that browsers or other clients look for (robots.txt comes to mind) and that wallflower should at least poke for. Note: favicon.ico is not one of them, since it's linked from the <head> section of most HTML pages.
Can anyone think of other such well-known URLs?