Hey! I know, I know: long time, no blog. I would love to blame the pandemic, but the truth is, I just haven’t been inspired with any sufficiently Perl-y topics lately. Until recently, when I ran into this.
Now, once upon a time, I wrote a post about a small buglet I had encountered. The post presented the problem, then asked you if you saw the problem, then explained what was going on. So let’s do that again. First, the code:
sub generate_temp_file
{
    state $TMPFILES = [];
    END { unlink @$TMPFILES }

    my $tmpfile = `script-which-generates-tempfile`;
    chomp $tmpfile;
    push @$TMPFILES, $tmpfile;
    return $tmpfile;
}
As before, the actual code does a bit more, but I promise I’ve omitted nothing that’s relevant to the bug. Do you see it? If not, click to find out more.
You might be able to spot the bug just from reading the code, but I can also offer you a big hint by telling you what the actual error was. First off, I should note that there was no error in my testing. Then I committed the code and pushed it out to the repo, where all my fellow developers promptly downloaded it and started using it ... and there was still no error. Then, one coworker (our sysadmin, as it happened) ran the script which used this code in a particular way, and reported this error:

Can't use an undefined value as an ARRAY reference at ...

and the line number that contained unlink @$TMPFILES.
To see what’s going on here, it’s worth taking a brief detour into what a state variable is. If you’re familiar with the C-based family of languages (e.g. C++, and I’m pretty sure Java as well), it’s what they would refer to as a static variable. I don’t know if I agree that state was the best name for it, but it sure beats the hell out of static.

So what’s a state variable? Well, it’s a bit like a global variable ... and also not. To explain that apparent contradiction, and also to address why state variables are awesome while everyone knows that globals are bad, we need to pick apart that stereotype. Are all global variables bad? Well ... depends on what you mean by “global.” See, when we talk about a “global variable,” we’re talking about a variable with global scope. And scope actually consists of two distinct parts: visibility, and lifetime. Most of the variables that we refer to as “globals” are those which have global visibility and global lifetime. And those are definitely bad. But they’re bad because the global visibility part is bad. That’s what causes all the trouble. But the global lifetime part ... nothing wrong with that at all. And a state variable is one which has a global lifetime, but only block visibility.1 And that’s not bad at all. It’s quite useful, in fact.
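To make that distinction concrete, here’s a minimal sketch (the function and its names are hypothetical, purely for illustration):

use v5.10;      # state needs perl 5.10 or later

sub next_id
{
    state $counter = 0;     # initialized only once, on the first call ...
    return ++$counter;      # ... but the value survives between calls (global lifetime)
}

say next_id() for 1 .. 3;   # prints 1, then 2, then 3
# but $counter itself is invisible out here (block visibility)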
Now, if my tempfile were being generated by Perl, I would of course use File::Temp (or something which in turn used it, such as Path::Tiny), and that would handle the cleanup for me. But, since the file is being generated by some script, I need to arrange that cleanup myself. How do I do that? Simple: by using an END block, which will always get called when my program exits.2 Admittedly, my simplistic use of it assumed that unlink is fine with receiving no arguments (e.g. in the case where the function hasn’t actually been called yet, and thus @$TMPFILES is empty). By the way, if you’re wondering why I’m using state $TMPFILES = [] and @$TMPFILES instead of state @TMPFILES and @TMPFILES, it’s because one of state’s quirks is that it will only work with scalar variables.
But, as it turns out, unlink is fine with getting an empty list (I tested it). So that wasn’t the problem. Still, the last two sentences of the previous paragraph, when combined, contain the answer to the mystery. If you haven’t spotted it by now, you may want to take a moment to reread them carefully and see if you can spot it before proceeding further.
Still stumped?
Very well, then. Read on.
$TMPFILES still exists: it’s a (sorta kinda) global, so the END block can access it perfectly fine even if the function is never executed. And that’s important, because END blocks are processed at compile time. In fact, that very sticking point is why I’m using the state variable in the first place. That is, why can’t I just do it this way?
sub generate_temp_file
{
    my $tmpfile = `script-which-generates-tempfile`;
    chomp $tmpfile;
    END { unlink $tmpfile }
    return $tmpfile;
}
It’s because END happens at compile-time, when $tmpfile hasn’t been set yet. Not to mention what happens if generate_temp_file is called multiple times: the END block only gets added to the chain of END blocks for the program once, so it only gets called once. If there’s any possibility of this function being called multiple times, I need to store my tempfiles in an array. None of that would be an issue if END happened at run-time, of course. But that ain’t the way it works.3
So I’ve set it up to handle all that, by using a (semi-)global state var, which will always exist, and can contain multiple things, and can get processed once by the END block. Except that I had to use $TMPFILES = [] instead of @TMPFILES, like I really wanted, and that’s where it all fell apart. See, the variable certainly exists whether the function is executed or not ... but it only gets assigned the first time the function is called. So, before the function is called for the first time, $TMPFILES is not [] ... it’s undef. And that’s what triggered the error message, of course. If state would let me declare an array instead of an arrayref, I wouldn’t have had the problem, but, once again: that ain’t the way it works.
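If you want to see the whole failure mode boiled down to its essence, here’s a self-contained toy version (hypothetical names, obviously, but the shape is exactly the same):

use v5.12;      # strict + say + state

sub remember
{
    state $list = [];               # stays undef until remember() is first called
    END { say scalar @$list; }      # closes over $list; runs at program exit
    push @$list, @_;
}

# With this call in place, the program prints "1" at exit. Comment it out, and
# the END block dies with "Can't use an undefined value as an ARRAY reference".
remember('something');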
So, in the end, once I finally realized the problem, the fix was trivial:
- END { unlink @$TMPFILES }
+ END { unlink @$TMPFILES if $TMPFILES }
And now it works whether the function is called (which it always was in my testing, and always was in most of my coworkers’ usages), or whether it’s never called at all (which was the case when my sysadmin ran it). And it taught me a valuable lesson about the interaction of seemingly unrelated implementation details of the language. And now I’ve shared it with you.
Hopefully it’s been helpful.
1 Assuming you declare it inside a block. I suppose a state var outside any block would have file visibility, or package visibility, or somesuch. But that’s a more esoteric usage that we don’t really need to get into.
2 Well, not always ... in fact, the (unflattering) comparison between Perl’s END and bash’s trap ... EXIT was one of the points I made in my post on Perl vs shell scripts (see the “Commands on Exit” section).
3 Although, it occurs to me that, if I were using Perl 5.36+, I could probably work around this by using defer instead of END. I think.
Yep. It's one of the things that attracted me to Perl in the first place. :-)
Of course, nowadays it means something else, and I’ve had to redirect my ossified mental patterns into new channels, so that, now when I see “TIL,” I can have my brain recognize it as “Today I Learned.” Which is a handy phrase: it encapsulates feelings of discovery, serendipity, and epiphany all into one. And TIL1 that the way I’ve always tried to write code has a name, a history, and a venerable progenitor. The name is “literate programming,” and the progenitor is Donald Knuth, who described the practice like so:
The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.
Now, at the time Knuth was writing this, the heavy hitter in the computer language scene was C, and the C incarnation of his literate-programming system was called CWEB.4 One of CWEB’s demonstration programs is a literate rewrite of the Unix utility wc; here’s a small taste of it:
The present chunk, which does the counting, was actually one of the simplest to write. We look at each character and change state if it begins or ends a word.
<<Scan file>>=
while (1) {
    <<Fill buffer if it is empty; break at end of file>>
    c = *ptr++;
    if (c > ' ' && c < 0177) {
        /* visible ASCII codes */
        if (!in_word) {
            word_count++;
            in_word = 1;
        }
        continue;
    }
    if (c == '\n') line_count++;
    else if (c != ' ' && c != '\t') continue;
    in_word = 0;
    /* c is newline, space, or tab */
}
@
It’s also interesting to me that Knuth’s system seems like it has “comments” built right into the code. That first paragraph there isn’t actually contributing to the code at all. It’s just there to give the reader of the code some idea of what’s going on. Which brings us, inevitably, to the subject of comments, one of the great religious wars of programming, right up there with vi vs emacs, or spaces vs tabs. But there’s a crucial difference here between comments and many of those other subjects of religious wars: if you used emacs to write your code, I may disagree as violently as I like with your choice of editor, but, in the end, it doesn’t really impact me at all ... not even if I’m the person who has to maintain your code. Tabs vs spaces is a little more impactful, but if I can’t switch the code from one to the other with a single command-line call, I really shouldn’t be calling myself a coder, now, should I? I may not like where you put your curly braces (to take another area of contention), but I can still read the code just fine. On the other hand, if you abhor or adore comments, that will very much impact my experience reading your code.5
Now, I don’t actually want to get into whether or not you should use comments, because it’s really a separate issue. But I bring it up for a reason. One might think that people who write articles such as the ones I mentioned above are actually advocating for less literate programming by telling us to use fewer comments. But that’s not their message at all. These authors (primarily) object to comments which explain what the code is doing. Those sorts of comments are unnecessary, they argue: your code explains what your code is doing, and, if it doesn’t, your code is not written clearly enough. Instead of glomming on extra words in the form of comments, you should rewrite your code. Jeff Atwood6 puts it thus:
... if you feel your code is too complex to understand without comments, your code is probably just bad. Rewrite it until it doesn’t need comments any more. If, at the end of that effort, you still feel comments are necessary, then by all means, add comments ... carefully.
I focus on the Atwood article above the others for a couple of reasons. Firstly, I think Atwood is an articulate and trustworthy source, not to be dismissed lightly (or at all, for that matter). But also because this article contains a bit of dubious wisdom:7
Perhaps that’s the dirty little secret of code comments: to write good comments you have to be a good writer. Comments aren’t code meant for the compiler, they’re words meant to communicate ideas to other human beings. While I do (mostly) love my fellow programmers, I can’t say that effective communication with other human beings is exactly our strong suit. I’ve seen three-paragraph emails from developers on my teams that practically melted my brain. These are the people we’re trusting to write clear, understandable comments in our code? I think maybe some of us might be better off sticking to our strengths— that is, writing for the compiler, in as clear a way as we possibly can, and reaching for the comments only as a method of last resort.
There are two implied premises here that I strongly reject:

1. that we write code primarily for the compiler, and writing for other humans is a secondary concern (at best); and
2. that, since we’re not good at communicating with other humans, we shouldn’t even try to get better at it.

Now, I’m pretty sure that first one isn’t too controversial. It’s been over 10 years, after all, since Robert Martin told us that:
Indeed, the ratio of time spent reading vs. writing is well over 10:1. We are constantly reading old code as part of the effort to write new code.
Because this ratio is so high, we want the reading of code to be easy, even if it makes the writing harder. Of course there’s no way to write code without reading it, so making it easy to read actually makes it easier to write.
You really will go farther in this business if you believe that you’re writing code primarily for other humans, and the fact that the compiler can understand it too is just a side benefit. That may sound radical to some folks: after all, if the compiler can’t understand it, the program doesn’t work, right? But that’s the wrong way to look at it. If you can’t make the compiler understand you, you don’t last long as a programmer at all. The flipside, though, is that you can make the compiler understand you, but no other humans can. Folks like that can last a depressingly long time in our field, but everyone hates working on their code, no one wants to recommend them to work at the same place they work, everyone complains about them behind their backs (or, occasionally, given our tendency towards social bluntness, to their faces) ... that’s not how you’d like to be known by your fellow programmers, I’m sure. Getting the compiler to understand your code is literally the bare minimum you can do as a coder. Getting your fellow coders to understand your code is the real goal.
So let’s take it as read that we all reject premise #1. What about premise #2? Surely that’s much more controversial.
But, to me, that’s where literate programming comes in. I want my code to read like an interesting essay. I try to construct my code carefully, using the same principles I use when writing: I break my code into blocks just like I break my writing into paragraphs; each paragraph needs a topic sentence, which is usually (but not always!) the first sentence; I define my terms before I use them, but then I expect my reader to know what they mean, or else they’ll have to look them up in the dictionary (which in this case is a library module). And I expect a lot of meaning to be delivered via context. If I name a method fetchFrobnabulator, then you should be able to assume that that method will go fetch a frobnabulator and return it: you don’t need to go find and read the code of the method to know that. Now, you might still have to find and read it for other reasons, of course: perhaps it isn’t delivering a frobnabulator when you think it should, or it’s delivering the wrong one, so you need to dig into it and see where it’s going wrong. But, barring those special circumstances, you shouldn’t need to break the flow of your reading by jumping around to a whole separate piece of code; you can just assume the thing does what it says on the tin until you have reason to suspect otherwise.
And finally we get to the place where this is a Perl blog post and not just a general programming blog post. Because, you see, Perl is the most literate programming language I know of. I don’t need those big blocks of text like Knuth had to write for his weaving and tangling and all that. I can just write code, and it reads pretty much like English. Just to take a random chunk of code I wrote recently, here’s a bit from a script of mine that understands vim sessionfiles better than vim does:8
my $vim_script = tempfile();
sh vim => -c => "mksession! $vim_script", -c => 'q', '>/dev/null', '2>&1';
my @baseline_mappings = uniq map { parse_mappings } apply { chomp } $vim_script->slurp;
if ($OPT{D})
{
    say foreach map { "#-->PRE: $_" } @baseline_mappings;
}
Note a few things here. You technically have no idea what sh does, or what parse_mappings does, but if you know what sh is in Linux, and you know anything about vim sessionfiles and what’s in them, you can very easily guess. You technically don’t know what $OPT{D} is either, but it takes very little imagination to work out that it refers to a -D option, which indicates that the user wants debugging (and, even if you lacked that much imagination, there was a block which laid out all the options up above, so hopefully you filed that away for future reference at the time). You may not know off the bat that tempfile() comes from Path::Tiny,9 but then you don’t even really need to know that to understand that it’s going to create a tempfile. It’s all very clear.
This is just a stupid, simple example, but it illustrates a number of principles that I try to use for all my code:10

- Pick names that are either plain English or common coding parlance (slurp, uniq, and tempfile might not be obvious to a random native English speaker, but most coders are going to get those immediately).
- Lean on well-named CPAN modules: uniq and apply, from List::Util and List::MoreUtils respectively,11 and tempfile, from Path::Tiny, all make perfect sense, and they’re all named well and fit seamlessly into the “narrative.”
- Don’t be afraid to build your own vocabulary either: sh is a shortcut for making a bash call,12 and %OPT is the variable created by my opts function that I wrote to process command-line options.13 Just make sure you do the same two things you look for in other folks’ libraries: name them well, and make them easy to read in context.
- And, when the code can’t carry the “why” all on its own, say it in an actual comment.14 In the real script, for instance, the block above is preceded by this:

# See what mappings are defined _prior_ to loading the session, and save them for comparison with
# the mappings we see _afterwards_.
Doing this takes effort. Writing well is a skill like any other. But this idea that if you can make a compiler do what you want you don’t have to care about writing well is, I think, a fallacy. Writing well is about how to communicate with your fellow humans, and that’s worth learning for lots of reasons, many of which have nothing whatsoever to do with coding. But, if you also want to communicate with your fellow humans through code, it’s probably worth spending a little time learning how to do it well.
So I want my code to read like well-crafted prose, and Perl helps me do that. In today’s world, where most people believe that Perl is dying, I’m under just as much pressure as anyone to switch languages. And I’m often accused of being resistant to change, or even afraid of it. But that’s not true: I’m perfectly happy with change ... as long as it’s change for the better. If I ever find a language that helps me write my own little version of literate programming better than Perl does, I will switch in a heartbeat.
But, so far, I’m still looking.
1 Okay, to be completely honest, it was a few days ago. But it took me a while to write this post.
2 The other, of course, being Edsger Dijkstra.
3 Though of course Lisp in general had been around forever. It just hadn’t caught on yet. I suppose some would say it never did ...
4 Possibly the most important piece of software (and one of the few that still remain) resulting from C/WEB is TeX.
5 Unless, of course, I’m one of those obdurate coders who just utterly refuses to read any comments, on general principle. That creates entirely different issues though.
6 In the third of the articles I provided links to in the previous paragraph.
7 Just because I think we should listen to what he has to say doesn’t mean I agree with him all the time, you know. I dove into this a little in an older post on my Other Blog.
8 Don’t even get me started on why vim session files suck. That’s probably a whole ‘nother blog post.
9 Okay, technically speaking I’m using my own Path::Class::Tiny here, but tempfile() is just a pass-through from Path::Tiny, so ... same diff.
10 Not claiming I’m always successful, of course. But I try.
11 Or List::AllUtils, if you want to get ’em all in one go.
12 It is in fact a fairly thin wrapper around my own PerlX::bash.
13 And, just to stave off comments, yes, I know about Getopt::Std, and Getopt::Long, and Getopt::Declare, and Getopt::Compact, and Getopt::Euclid, and ... I’ve tried a bunch of ’em, and read about even more, and I still wrote my own. But, to be fair, it’s mostly just a wrapper around Getopt::Std.
14 Still mostly trying to avoid the comment controversy, but hopefully we can agree that comments shouldn't explain how the code works.
In a comment on one of my posts about the unit testing for Date::Easy, Tom offered this advice:

We always coach developers to use hard coded test data to the extent practical. When writing tests you have to unlearn a lot of DRY principles. We tolerate a lot more repetition, and factor it out sparingly.
and then I said that this really required a longer discussion than blog post comments would allow. This, then, is that longer discussion.
So, first of all, let me agree whole-heartedly with Tom’s final statement there: “We tolerate a lot more repetition, and factor it out sparingly.” Absolutely good advice. The only problem is that, as I noted previously on my Other Blog, we humans have a tendency to want to simplify things in general, and rules and “best practices” in particular. I don’t think that Tom was saying “always hardcode everything in your unit tests!” ... but I think some people will read it that way nonetheless. I would like to make the argument that this is a more nuanced topic, and hopefully present some reasons why you want to consider this question very carefully.1
For an example of how hardcoding the right answers for all your unit tests can be taken too far, imagine a system where you need to test several scripts that produce output. One simple way would be to run the scripts and capture the output, then examine that output very carefully, once. Once you’ve verified by hand that the output is good, you save said output to a file. Now all your unit test has to do is run the script, generate new output, diff it against the Last Known Good output, and throw a fit if there are any differences. Easy peasy.

Now, if you make a change to a script that affects its output, you will generate a diff, but then you can verify (again, manually) only the change to make sure it’s correct, then resave the new output as Last Known Good. Still pretty basic, right? All the correct “answers” are hardcoded, in files, and you’re checking against them all. As your system grows, you’ll get more and more scripts, and therefore more and more LKG files to diff against, but no biggie. As your team grows and changes, though, it can be harder to know what the problem is if an unexpected diff pops up during your testing ... you didn’t think your change could have possibly affected that test, and yet there it is. You’ll have to track down the right person to ask about this ... assuming they still work here.
As this “system” grows to dozens or even hundreds of LKG files, the chances that any given person knows what any given diff actually means (and whether it’s an actual problem or just a symptom of the lack of foresight someone had when setting up the test in the first place) begin to drop dramatically, and the chances that any given person will just “fix” a failing test by simply replacing the LKG file with the new output start going up. Pretty soon you’ve invented “unit testing theater,” a system whereby everyone feels great about the number of passing tests and nothing is really being tested in a way that will ever catch any actual errors. I would say this is an example of the failure of hardcoded tests.
Now, before you attempt the classic rebuttal of “yes, but your example is so extreme that that would never happen in the real world,” let me assure you that any coworkers of mine from $lastjob who are reading this have now buried their faces in their hands, because they know exactly what I’m talking about, and it was all too real. By the time I got there, very few people had any real understanding of what all that hardcoded data was supposed to represent and whether it still made any sense and/or still represented the actual goals of the business. I was brand new, so I certainly didn’t have any idea. When I asked what I should be doing when a diff unexpectedly caused a test to fail, it was explained to me how simple the fix was: just save the new output over the old LKG file, and the test passes again. Eventually, I ripped that whole system out.
What did I replace it with? Simple: a system whereby output was generated by a different algorithm. The great thing about algorithms is that, like mathematical proofs (and like Perl), There’s More Than One Way To Do It. If you’re lucky, there’s another way to calculate that answer that is just dumber and/or slower than the way the real code is doing it, and that’s the way your unit tests should be calculating it. Why is that better than hardcoding the right answers (in this particular instance)? Well, for one thing, nobody can “fix” a failing test by mindlessly resaving the expected output any more: to make the test pass, you have to actually understand what both computations are doing.
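To make that concrete, here’s a toy sketch of the technique (the function under test is hypothetical, not from any real codebase): the code computes a sum with a clever closed-form formula, so the test computes the expected answer the dumb, slow, obviously-correct way.

use strict;
use warnings;
use Test::More tests => 1;

# the "real" code: fast and clever
sub sum_to { my ($n) = @_; return $n * ($n + 1) / 2 }

# the test: same answer, arrived at by a different (dumber) algorithm
my $expected = 0;
$expected += $_ for 1 .. 100;
is sum_to(100), $expected, "sum_to agrees with naive summation";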
So is this the perfect solution in all cases? No, of course not. Frederick Brooks told us there was no silver bullet for software engineering problems over three decades ago, and that’s why it’s so dangerous for us to give in to our natural human instincts to simplify. It’s awesome to be able to tell ourselves (and our new employees) “never do this!” and “always do that!” As soon as we have to start adding a bunch of caveats and exceptions, we’ve made the whole thing more complex, and that just makes it much harder to get it right. But as Einstein once sort-of-said: things should be made as simple as possible, but no simpler.2 As with many things in computer science, you just can’t make this one simpler.
To be clear: should you always use hardcoded data for unit tests? No. Should you never use hardcoded data for unit tests? No. You’ve got to look at a number of aspects of the problem, including what your long-term effects are going to be (or your best guess at them anyhow), and try to come up with the best solution you can. And sometimes you’ll still be wrong. But you do your best.
To try to bring this discussion full circle to where it started, the blog post that started it all was discussing the unit testing for Date::Easy. What Tom specifically asked me (later in the same comment I quote above) was:
Are you not able to explicitly set environment variable or other flags so that your tests are always executed in a known environment?
And the answer is, often, “yes.” But I don’t want to do that. Take timezone as a simple example. It’s not too hard to force a timezone, at least on Unixoid systems: just set $ENV{TZ}. If I wanted to, I could force every single one of the unit tests for Date::Easy to run in the U.S. Pacific timezone, which is where I live. That way, all my tests would pass nearly all the time, because, if they didn’t pass there, I wouldn’t have released the version in the first place. But if I did that I’d be missing out on a golden opportunity: there are CPAN Testers machines all around the world, in all different timezones, with all different locale settings, in all different languages. I want to know if my unit tests pass with all those parameters. I already knew it passed in Los Angeles, in English, with the standard C locale. I want to find out where it fails ... so I can fix it.
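(For reference, pinning the zone really is that easy, which is exactly why it’s tempting. A sketch, with the caveat that some platforms need a tzset call before they’ll notice a mid-process change to TZ:)

use POSIX qw< tzset >;

$ENV{TZ} = 'America/Los_Angeles';   # force every date calculation into US Pacific ...
tzset();                            # ... and make sure the C library notices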
I hope this has been a useful discussion of why you should sometimes avoid hardcoding your unit test data, even though it deliberately hesitates to pick a side for all situations. Definitely don’t sweat a little repetition in your unit tests, and “factor it out sparingly” (as Tom suggests)—just make sure you’ve thought through the long-term consequences first.
1 Note: This discussion is not limited to Perl, of course. But it’s far more of a technical topic than my Other Blog is prepared to handle, and the audience here will appreciate it far more.
2 Actually, what Einstein said was “It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.” But, you know, things get paraphrased.
What you're suggesting can be good advice—I certainly agree that repeating yourself in unit tests is often preferable to being too clever in them, for instance—but I don't believe it is always good advice. Unfortunately, I think a proper response is beyond a comment here; perhaps I'll compose a larger blog post on this very topic in order to discuss the pros and cons.
> Are you not able to explicitly set environment variable or other flags so that your tests are always executed in a known environment?
There are probably a few things along those lines I could do, but I think that would be a big mistake here. For instance, if I explicitly mucked with the locales or faked out the timezones, I would only be proving that my module works in my own personal environment ... which, you know, I already knew. :-/ By allowing a certain amount of uncertainty, I've triggered lots of errors in the smokers of CPAN Testers, and that's helped me find several legitimate bugs. Mostly bugs in my unit tests, granted, but occasionally some actual bugs in the code. ;->
It looks like it's there already! I'm looking forward to converting over to using these new functions: timegm_posix and timelocal_posix. Should make my job much easier. :-)
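(For anyone following along at home, the appeal of the _posix variants is that they take the same year-minus-1900 that localtime and gmtime hand back, so the round trip involves no century guesswork at all. A sketch, assuming a new-enough Time::Local:)

use Time::Local qw< timelocal_posix >;

# localtime's year is already year-minus-1900, which is exactly what
# timelocal_posix expects -- no rolling-century heuristics anywhere:
my $local_epoch = timelocal_posix( (localtime 0)[0..5] );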
Well, sure. That's why it's a dumb example. :-) But imagine if multiplication worked differently on different computers ... due to the effects of timezones and DST, that's what I'm up against trying to test datetime stuff.
(Additionally, hardcoding test results has other problems, although I'm not in the camp of saying never do it.)
> ... you've gotten a fragile test suite that accidentally tests additional modules.
Well, I would say "fragile" is a bit of a value judgement—certainly I would consider a unit test suite for a datetime module much more fragile if it weren't possible for it to produce different answers in different timezones or at different times of year due to DST, because that would indicate it wasn't sufficiently complete.
OTOH, testing additional modules is a fair criticism. There's nothing accidental about it though: the additional modules I'm testing are all used by Date::Easy, so, if they don't work, I have a problem. I'd rather know about that problem than blithely go on thinking all was well.
In case you missed my talk on Date::Easy from a couple years back, I’ll sum it up for you: dates are hard, y’all.
On January 1st of 2019, a bunch of unit tests for Date::Easy started failing. It was immediately reported, of course (by Slaven, as ever), and I set out to figure out what had gone wrong.
Let me give you an oversimplified example. If I want to verify that this function:

sub foo { my ($x, $y) = @_; $x * $y }

is producing the right answer, I would not write the test thus:

is foo(2, 3), 2 * 3, "multiplication works";

because then the test is just repeating the code’s own computation. Rather, I’d get the expected answer via a different (dumber) route:

my $product = 0; $product += 2 for 1..3;
is foo(2, 3), $product, "multiplication works";
So, as I looked at this code last year, I realized that, while attempting to prove that Date::Easy::Datetime->new(@six_args) produced the proper number of epoch seconds, I was doing something fairly dumb: I was turning the six arguments into a string and then handing them off to str2time, roughly like so:

my $secs = str2time(join(' ', join('/', @args[0,1,2]), join(':', @args[3,4,5])));

Now, the straightforward thing would have been to hand the six arguments directly to timegm (or timelocal, as appropriate) ... but that’s what the code was doing. So I didn’t want to just do that.
Unfortunately, I tripped over a bug in Date::Parse. This bug has been reported numerous times, although perhaps most concisely in RT/105031.1 The problem had to do with timegm (from Time::Local) handling (or some might say mishandling) the year. See, when gmtime returns a year as part of its list, it returns year-minus-1900. But, when timegm accepts a year, it wonders whether you’re passing a year you got from gmtime, or a year you got from a human. When you pass it, say, “12,” you probably mean 1912, which is absolutely what it means if you got it from gmtime, because 2012 would be 112. But, then again, you might have gotten it from a human, who in that case is way more likely to have meant 2012 than 1912. Of course, if it’s, say, “59” we’re talking about, then whether it is more likely to represent 1959 or 2059 probably depends on when the human who gave it to you gave it to you. Around the time Unix was invented, 1959 was a scant 10 – 12 years ago and 2059 was a science-fiction future full of flying cars. Nowadays, 1959 is a time that my children can’t conceive of (one of them asked me once if the world was in black-and-white back then), and, even if someone does manage to invent a flying car that doesn’t explode in the next 30 or 40 years, it seems pretty unlikely that anyone will be able to afford to ride in it. The solution for the long-ago(-ish) writers2 of Time::Local was the idea of a “rolling century”: if a two-digit year is 50 or more years in the past, it’s more likely to refer to the future. Sounded sensible at the time, I’m sure.
So we begin to see what went wrong. One of my sets of arguments to try out was (1969, 12, 31, 23, 59, 59), because it’s the second before the 0-second of (Unix) time, so it makes a nice boundary case. And, when 2019 rolled around, 1969 suddenly became 50 years ago, and timegm decided that “69” meant 2069, so, boom! I hit the Date::Parse bug. Bummer.
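You can watch the rolling century in action with a throwaway script (this just demonstrates the documented Time::Local behavior; by definition, the results shift as the current year marches on):

use v5.10;
use Time::Local qw< timegm >;

# args are (sec, min, hour, mday, month, year)
say scalar gmtime timegm(0, 0, 0, 1, 0, 12);   # "12" is within the last 50 years, so: 2012
say scalar gmtime timegm(0, 0, 0, 1, 0, 69);   # but, as of 2019, "69" now means: 2069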
But, wait a tick3 ... didn’t I say that the Date::Easy code was using timegm too? Why didn’t it have the bug? Well, I was oversimplifying a bit: while my original code did in fact use timegm (or, again, timelocal, as appropriate), I had already run into that ... well, let’s not call it a bug in Time::Local, but rather an unfortunate design choice ... for my own code, and had determined the appropriate fix: just use timegm_modern instead. What’s that, you say? Well, the current maintainer of Time::Local (the ever-excellent Dave Rolsky) also knows that the whole “rolling century” business can be a pain in the butt, so he offers an alternative: timegm_modern, which does not make assumptions based on what you pass in: the year you give it is the year it uses, no second-guessing.
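A quick demonstration of the difference (assuming a Time::Local recent enough to have the _modern functions):

use v5.10;
use Time::Local qw< timegm_modern >;

# timegm_modern never second-guesses the year you hand it:
say scalar gmtime timegm_modern(59, 59, 23, 31, 11, 1969);   # Dec 31 1969, as requested
say scalar gmtime timegm_modern(0, 0, 0, 1, 0, 69);          # literally year 69 AD (given a 64-bit perl)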
Now, the reason I say this is not a bug in Time::Local is that it’s doing exactly what it documents. Contrariwise, the reason I agree with those who say this is a bug in Date::Parse is because:
[caemlyn:~] perl -MDate::Parse -E 'say scalar localtime str2time("1/1/1969")'
Tue Jan 1 00:00:00 2069
In the end, by the way, I solved my 2019 unit test problem by ... switching my unit tests to use timegm_modern/timelocal_modern. Yes, I’m now testing that A = A, and I’m not happy about that, but at least I’m not blowing up CPAN Testers any more.
Of course, now it’s 2020. Time for a whole new set of bugs. As ever, Slaven promptly reported it, and I went looking. I was immediately suspicious of the very similar nature to last year’s bug set, but this one was in an entirely different unit test. Still:
# got: '2070-01-01'
# expected: '1970-01-01'
my $local_epoch = timelocal gmtime 0; # for whatever timezone we happen to be in
And there’s the culprit: timelocal instead of timelocal_modern ... that can’t be good.
Except ... it’s not quite the same problem. In particular, switching to timelocal_modern doesn’t actually fix the problem. If you want a pretty detailed-yet-concise description of the issue, Grinnz has you covered, but the short version is, this has nothing to do with being wonky about what two-digit dates mean (or might mean). This is about being able to reverse what gmtime/localtime spit out. For a more in-depth discussion of that aspect, Aristotle did a recent post, where he posits that timegm_modern was the wrong solution. I can’t quite go that far, since it’s really saved my bacon on quite a few other issues, but it certainly is the wrong solution here.

It’s perhaps worthwhile to figure out just WTF I’m doing in the unit test. See, my first thought was, I want to make sure 0 is considered a valid time (as a number of epoch seconds) instead of being considered an error (since it’s false, at least in Perl’s eyes). So I figured, I’ll just toss zeroes at various pieces of Date::Easy in various guises and make sure I always get back 1970, which is what Unix considers the beginning of time. One of those guises is, of course, the actual number 0. But here I have a problem: 0 is certainly 1/1/1970 ... in England. In California, where I live, it happens to be 12/31/1969. So I “solved” that by running 0 through gmtime and then reversing it through timelocal to get what Jan 1 1970 0:0:0 would be in the local timezone.4 Now, one could argue that it kinda defeats the whole purpose, as I’m no longer actually sending zero to the constructor. But it’s sort of indirectly zero, if you see what I mean, so I let it slide at the time. Maybe I should just take out the whole thing at this point, now that it’s turned into more trouble than it’s worth. But I thought to myself, self,5 perhaps it’s worth figuring out how to fix this.
One way to do it would be to follow Aristotle’s chain of reasoning, and provide a (third!) interface in Time::Local.6 But I decided to take the opposite tack and created myself a gmtime_sane/localtime_sane, which just add the 1900 back to the year before returning. Right now they only exist in my unit tests’ library module. But perhaps they deserve some wider attention; I leave it up to you to ponder, dear reader.
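To give you something to ponder with, here’s roughly what mine look like (a reconstructed sketch of the “just add the 1900 back” idea; the real ones live in the unit tests’ library module):

use Time::Local qw< timelocal_modern >;

# just like the builtins, but the year comes back as an actual four-digit year
sub gmtime_sane    { my @t = gmtime($_[0]);    $t[5] += 1900; return @t }
sub localtime_sane { my @t = localtime($_[0]); $t[5] += 1900; return @t }

# now the unit test's round trip works safely with the _modern functions:
my $local_epoch = timelocal_modern( (gmtime_sane(0))[0..5] );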
I would like to stress, though, that, in both cases, the unit test failures did not indicate any problems with the modules. Just part of the challenges of making sure Date::Easy stays well-tested as time marches on.
(Interesting side note: the venerable Time::Date module is also suffering from the “Y2020 bug”; you can see a discussion of that in its bug reports.)
Currently available on CPAN as 0.09_01 and hopefully graduating to 0.10 soon, CPAN Testers willing. I hope you folks are getting as much use of it as I am.
The full code for Date::Easy so far is here, and there’s also a new developer’s version. Of special note:
- the gmtime_sane/localtime_sane implementation

As always, your feedback (here, in GitHub issues, or in CPAN RT tickets) is greatly appreciated.
1 Other reports I found included RT/84075, RT/124509, RT/53413, and RT/128158.
2 Or writer ... I actually think the original code was written by Tom Christiansen, FWIW.
3 Pun very much intended.
4 For a refresher on why epoch seconds for dates are always interpreted as local time, even though they’re actually stored as UTC, refer back to part #4 of the series.
5 ‘Cause that’s what I call myself.
6 Well, technically fourth, as there’s also timegm_nocheck/timelocal_nocheck.
First, let’s set the situation. (These are certainly not the only conditions under which you could use this pattern, but it’ll probably be easier to grasp it with a concrete example.) Let’s say you have a script which you’re going to use to launch your other Perl scripts. This script (in my example, it’s a bash script, but it could also be a super-minimal Perl script) has exactly one job: set up the Perl environment for all your personal or company modules. That may even include pointing your $PATH at a different Perl (e.g. one installed via perlbrew or plenv) so you’re not reliant on the system Perl. Here’s a typical example of what such a (bash) script might look like:
export PERL_LOCAL_LIB_ROOT=/my/perl/dir
export PATH=$PERL_LOCAL_LIB_ROOT/bin:$PATH
export PERL5LIB=$PERL_LOCAL_LIB_ROOT/lib/perl5:/my/company/modules/lib
export PERL_MM_OPT=
export PERL_MB_OPT=
exec perl "$@"
This is just an example; your environment variables could be totally different, or you might have more, or fewer, or what-have-you. The point is, you use this “launcher” script to get everything exactly the way you want it, environment-wise, then you fire up the right Perl, pointing at the right set of libraries, and you’re all set.
Now, let’s pretend that, somewhere in the actual Perl script that was launched by the example script above, you need to launch a subcommand. There are various subcommands that might need to be launched, but maybe some of those commands are other Perl scripts. And not your Perl scripts; Perl scripts that depend on the system Perl and the modules installed on it. Suddenly your carefully crafted environment, which was perfect for your personal or company scripts, is all wrong.
Perhaps your function to run commands looks something like this:
sub run_subcommand
{
    my ($type, $command) = @_;
    say STDERR "running command: $command" if DEBUG;
    $command = expand_variables($command);
    if ($type eq 'capture')
    {
        return `$command`;
    }
    elsif ($type eq 'tty')
    {
        return system($command) == 0;
    }
    else
    {
        die('bad type');
    }
}
This is just a rough idea of some extra things you might be doing before or during running the command, but it’s handy, because it means that it’s likely that all subcommands are being routed through this one routine, instead of having random system calls sprinkled all over the place. And that’s good, because what we really need to do to fix this problem is to save the original values of all those environment vars we twiddled, then temporarily put them back when running commands via the shell. And now we have a central place to do it.
Fixing the launcher script side is trivial:
# save existing values
export PERL_LOCAL_LIB_ROOT_ORIGINAL=$PERL_LOCAL_LIB_ROOT
export PATH_ORIGINAL=$PATH
export PERL5LIB_ORIGINAL=$PERL5LIB
export PERL_MM_OPT_ORIGINAL=$PERL_MM_OPT
export PERL_MB_OPT_ORIGINAL=$PERL_MB_OPT
# set to our specific environment
export PERL_LOCAL_LIB_ROOT=/my/perl/dir
export PATH=$PERL_LOCAL_LIB_ROOT/bin:$PATH
export PERL5LIB=$PERL_LOCAL_LIB_ROOT/lib/perl5:/my/company/modules/lib
export PERL_MM_OPT=
export PERL_MB_OPT=
# run using our environment
exec perl "$@"
Fixing the subcommand-runner side is a bit trickier. We want to put the env vars back the way we found them before running the shell command, but we want them to go back to our “real” values afterwards. We could do that manually, but we wouldn’t be able to return inside those if conditions any more, so it makes the code messier, and that’s a bummer. Too bad Perl doesn’t give us some magical way to temporarily change the value of something and then automatically put it back when you leave a certain scope.
But, wait: Perl does give us that. It’s called local, and of course as dutiful, modern Perl programmers, we’ve all learned that we should never use it. You see, local works on global variables, and global variables are bad. Everyone knows that. So never use local.
Except ...
Except that your program’s environment is already global—%ENV is a global hash, as it pretty much has to be. So, since we’re using a global variable anyway, no matter what, may as well use local to make our lives easier ... right?
In fact, local is so smart that we don’t even have to localize the entire %ENV hash. Perl is perfectly happy to localize individual hash key/value pairs for us. So we can actually do something like this:
sub run_subcommand
{
    my ($type, $command) = @_;
    say STDERR "running command: $command" if DEBUG;
    local $ENV{PERL_LOCAL_LIB_ROOT} = $ENV{PERL_LOCAL_LIB_ROOT_ORIGINAL};
    local $ENV{PATH} = $ENV{PATH_ORIGINAL};
    local $ENV{PERL5LIB} = $ENV{PERL5LIB_ORIGINAL};
    local $ENV{PERL_MM_OPT} = $ENV{PERL_MM_OPT_ORIGINAL};
    local $ENV{PERL_MB_OPT} = $ENV{PERL_MB_OPT_ORIGINAL};
    $command = expand_variables($command);
    if ($type eq 'capture')
    {
        return `$command`;
    }
    elsif ($type eq 'tty')
    {
        return system($command) == 0;
    }
    else
    {
        die('bad type');
    }
}
Except ...
It bugs me. It’s 5 lines of remarkably similar code. If I have to add another environment variable to restore, I’ll likely copy-paste one of those existing lines, and then change the env var name ... in two places. I better remember to change it in both places, or else I’ll get some really hard-to-find bugs. Nope, I just don’t like it. I know; I’ll just do it in a loop:
sub run_subcommand
{
    my ($type, $command) = @_;
    say STDERR "running command: $command" if DEBUG;
    my @env_vars = qw< PERL_LOCAL_LIB_ROOT PATH PERL5LIB PERL_MM_OPT PERL_MB_OPT >;
    foreach (@env_vars)
    {
        local $ENV{$_} = $ENV{$_ . '_ORIGINAL'};
    }
    $command = expand_variables($command);
    if ($type eq 'capture')
    {
        return `$command`;
    }
    elsif ($type eq 'tty')
    {
        return system($command) == 0;
    }
    else
    {
        die('bad type');
    }
}
Except that this doesn’t work either: the foreach loop introduces a new scope, so the original values of my env vars are restored at the end of each iteration, before they’re ever even needed. That’s pretty useless. Looks like I’m stuck with doing it the long way.
Except ... no. I refuse to accept that. I hate doing things the long way.
How about this?
sub run_subcommand
{
    my ($type, $command) = @_;
    say STDERR "running command: $command" if DEBUG;
    my @env_vars = qw< PERL_LOCAL_LIB_ROOT PATH PERL5LIB PERL_MM_OPT PERL_MB_OPT >;
    local $ENV{$_} = $ENV{$_ . '_ORIGINAL'} foreach @env_vars;
    $command = expand_variables($command);
    if ($type eq 'capture')
    {
        return `$command`;
    }
    elsif ($type eq 'tty')
    {
        return system($command) == 0;
    }
    else
    {
        die('bad type');
    }
}
Nope: the foreach modifier still counts as a loop, and the localized values are still restored before the next statement runs. But then I was struck with an inspiration: hash slices.
sub run_subcommand
{
    my ($type, $command) = @_;
    say STDERR "running command: $command" if DEBUG;
    my @env_vars = qw< PERL_LOCAL_LIB_ROOT PATH PERL5LIB PERL_MM_OPT PERL_MB_OPT >;
    local @ENV{@env_vars} = @ENV{map { $_ . '_ORIGINAL' } @env_vars};
    $command = expand_variables($command);
    if ($type eq 'capture')
    {
        return `$command`;
    }
    elsif ($type eq 'tty')
    {
        return system($command) == 0;
    }
    else
    {
        die('bad type');
    }
}
One last twist I’ll give you. What if your script (the one that runs the sub-commands, that is) might not be run by the launcher? That is, what if there’s a chance that all those _ORIGINAL env vars might not exist? The code as is would then be setting $PATH and all those to nothingness, which is obviously not a great idea. So we need to conditionally set the vars. Of course, an if introduces a new scope just like a foreach does, so we have to make sure the local isn’t inside the if. Our first naive attempt might be something like so:
sub run_subcommand
{
    my ($type, $command) = @_;
    say STDERR "running command: $command" if DEBUG;
    my @env_vars = qw< PERL_LOCAL_LIB_ROOT PATH PERL5LIB PERL_MM_OPT PERL_MB_OPT >;
    local @ENV{@env_vars};
    if (exists $ENV{PATH_ORIGINAL}) # assume that, if one is set, they're all set
    {
        $ENV{$_} = $ENV{$_ . '_ORIGINAL'} foreach @env_vars;
    }
    $command = expand_variables($command);
    if ($type eq 'capture')
    {
        return `$command`;
    }
    elsif ($type eq 'tty')
    {
        return system($command) == 0;
    }
    else
    {
        die('bad type');
    }
}
Except that doesn’t quite work either. If the _ORIGINAL versions are set, everything works fine (but of course we had that much before). In the case where they’re not set, all our Perl env vars end up undefined, which is often even worse than having them pointing at the wrong directories. In short, this:

local $SOMEVAR;

is exactly the same as this:

local $SOMEVAR = undef;
Happily, this particular problem has a very simple solution: set your local variables to what they already were.
sub run_subcommand
{
    my ($type, $command) = @_;
    say STDERR "running command: $command" if DEBUG;
    my @env_vars = qw< PERL_LOCAL_LIB_ROOT PATH PERL5LIB PERL_MM_OPT PERL_MB_OPT >;
    local @ENV{@env_vars} = @ENV{@env_vars};
    if (exists $ENV{PATH_ORIGINAL}) # assume that, if one is set, they're all set
    {
        $ENV{$_} = $ENV{$_ . '_ORIGINAL'} foreach @env_vars;
    }
    $command = expand_variables($command);
    if ($type eq 'capture')
    {
        return `$command`;
    }
    elsif ($type eq 'tty')
    {
        return system($command) == 0;
    }
    else
    {
        die('bad type');
    }
}
So that’s a little more about how to use local, and why you might want to use it even though you agree that global variables are bad. Note that this technique isn’t helpful when localizing a batch of global scalars as opposed to certain keys of a global hash, but then again, if you had a batch of global scalars, you wouldn’t be trying to set them in a loop in the first place. Also, you wouldn’t have a batch of global scalars, because globals are bad ... right?
Still, you’re stuck with global vars sometimes, and you may as well make the best of it. Hopefully this helps. A little.
After missing a year last year, I came back to attend YAPC this year. (Yes, yes: “The Perl Conference,” now. But it’ll probably always be “YAPC” to me.) And I actually spoke again (second time), this time on dates and my Date::Easy module. If you missed it and are interested in watching it, the video is up.
This year was in Salt Lake City again, and, while I normally don’t like repeating cities (mainly because I like visiting new places instead), I do have to say the Little America Hotel is every bit the excellent venue that I remembered. Plus it’s just barely close enough to where I live that I can drive there and take the whole family, and do a sort of “conferenscation.” Which is what we did.
My highlights from this year:
- a dive into keyword plugins (including KEYWORD_PLUGIN_EXPR, for which there are very few examples)

Things I learned:

- study does the same thing in Perl 6 that it does in Perl 5.

I also would like to thank the folks who work very tirelessly (and closer to thanklessly than they should) to organize and run this event. Lena Hand in particular was always working her butt off every time I saw her, but all the folks did a magnificent job. Whatever problems I noted were fixed very quickly, and certain things went more smoothly than one would have thought possible. For instance, a lot of the videos of talks went up less than 24 hours after being presented! That’s pretty amazing turnaround. And we had our own network, so bandwidth was never an issue (which it nearly always has been in the past). I can only imagine how much work it takes to pull off this event, and it amazes me that you see many of the same folks as organizers year after year. Kudos to them all.
Overall, I was very pleased to see some old friends and learn some new bits of Perl. I hope I have the opportunity to go again next year.
Just tried to look at my failures, but I hit two problems:
* The author page says I have two failing modules, but the module page only lists one with my name on it.
* The link to the failure log gives me a 404.
Thx for all your hard work!
Well, sure: if you don't need lazy attributes to solve your problem, then you don't need this pattern. But, also, if you don't need objects, you don't need this pattern. ;-> (And we could continue in that vein forever, basically—if you don't need software, if you don't need a computer, if you don't need technology ...)
> Your "neat trick" leads to a very untidy design. You just created a single class which implements two very different, completely unrelated things:
>
> * some unspecified action for which foo and bar are needed, which is the sole purpose of Thing instances
> * initialization of your instances from the configuration
Well, a couple of things here:

- Attributes that are both derived from the same source aren't exactly "completely unrelated"; it's not so different from capturing ($1, $2, $3) from a single regexp match.
- Even if _read_config were a separate function, or a method in an entirely different class, that wouldn't really invalidate this example.

> The instances of your Thing class now cannot be created the usual and predictable way:
Now that is a very good point, and it's probably the number one counterindication against using this pattern. But, on the other hand, sometimes not being able to pass in the attribute values is a plus. It just depends on your use case.
> ... (how you pass the configuration file?) ...
Probably in one of the many other attributes hidden by the "more stuff here" tag. :-)
> Also you prevented the config file for being read twice (once for foo, once for bar) for one instance, but it is still read once for every instance of Thing created.
Also a very good point. But, as I say, moving the reading of the config file out of this class doesn't really change things much. I still have a single data structure that I need to get from somewhere, whether that's a reader method, a singleton class, a service from something like Bread::Board, or whatever. And I only care about that structure because I need it to build the few attributes I really do care about. Maybe I could assume that however I'm getting that structure is a cheap retrieval for all but the first request, as in a typical singleton pattern, but that is an assumption I'd be making, and it means that if things change on the other end and retrieval becomes more expensive then my performance degrades without a single line of code being changed on my side. So that's a tradeoff I need to consider. Might be so unlikely as to be not worth worrying about ... or might not be.
> Though I admire Moose there are features which usage is almost certain indicator of a bad design, complicated and dependent attribute initializers are one of them.
I have to disagree with you there. There are very few things in programming which are universally bad. (I can't think of any off the top of my head, but I'm sure that as soon as I said they don't exist some smart person will toss one out in the comments. ;-> ) Just about every Moose feature is no exception: there are times when they are useful, and avoiding them is going to cost you unnecessarily, and there are times when they are dangerous and you're just buying trouble if you use them. But it's not some features which fall into the one camp and some which fall into the other. It's all features which fall into both camps ... just in different scenarios. You really have to look at every situation individually and analyze your options.
Of course, you can't analyze your options unless you know what all your options are in the first place. That's why it's good to know these little "tricks," even if they're not always a good idea. Because sometimes they are. And sometimes you're going to run across them, even if you wouldn't have done it that way yourself. So you need to have them in your brain—for pattern recognition, if nothing else.
> You could more carefully separate it (and avoid tying yourself to Moose) by just having a factory method:
Well, sometimes avoiding tying yourself to Moose is desireable. But, if you're already hitched up to Moose pretty tightly, rewriting features which Moose gives you for free can just give you even more code to worry about having to maintain.
For instance, a new_from_config method as you suggest would be fantastic for a class full of attributes that all had to (or at least could) come out of a config file. But my example is two lazy attributes which are presumably just two of many. If the laziness is important, I don't want to write that myself when Moose has already done all the work for me. My example is much better for a class in which the majority of its attributes are passed in via the typical new, but a few of them should be drawn from a config file iff the client doesn't supply them.
> This still however ties the implementation details of your configuration infrastructure into your object.
Yes, but inside the object. As long as I'm not exposing those implementation details to my client, I'm not as concerned. If my class needs to change, then I'm going to be changing the code of that class anyway. As long none of the client's code has to change, I consider that properly encapsulated.