A Date with CPAN, Part 6: Time Won't Give Me Time

By Buddy Burden on January 24, 2016 7:32 AM

[This is a post in my latest long-ass series. You may want to begin at the beginning. I do not promise that the next post in the series will be next week. Just that I will eventually finish it, someday. Unless I get hit by a bus.

IMPORTANT NOTE! When I provide you links to code on GitHub, I’m giving you links to particular commits. This allows me to show you the code as it was at the time the blog post was written and ensures that the code references will make sense in the context of this post. Just be aware that the latest version of the code may be very different.]

Last time I added Time::ParseDate support to our date class, which made it fairly usable, if still incomplete. This time I decided to concentrate on getting a first cut at our datetime class.

In many ways, the datetime class is simpler than the date class, because it doesn’t need to do anything fancy like truncate to midnight or try to ignore times and timezones when parsing. Of course, datetimes do have to consider timezones, but I decided to defer that thorny issue until next time.

The datetime class parallels the date class in many ways. Where Date::Easy::Date->new accepts 0, 1, or 3 args (meaning either the current day, a number of epoch seconds, or year/month/day), Date::Easy::Datetime->new will accept 0, 1, or 6 args: use the current time, a number of epoch seconds, or year/month/day/hours/minutes/seconds. Any other number of arguments is an error.

sub new
{
    my $class = shift;
 
    my $t;
    if (@_ == 0)
    {
        $t = time;
    }
    elsif (@_ == 6)
    {
        my ($y, $m, $d, $H, $M, $S) = @_;
        --$m;                                       # timelocal/timegm will expect month as 0..11
        $t = timelocal($S, $M, $H, $d, $m, $y);
    }
    elsif (@_ == 1)
    {
        $t = shift;
    }
    else
    {
        die("Illegal number of arguments to datetime()");
    }
 
    return scalar $class->_mktime($t, 1);
}

Pretty basic. Notice that I’m sending Time::Piece’s _mktime a second argument of 1, indicating local time instead of UTC. This is the part we’ll make variable next time, but local time is good enough for now.
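To make the six-argument branch concrete, here is a minimal sketch using only core modules. It does not touch Date::Easy at all, and `epoch_from_parts` is a hypothetical helper name invented for this illustration. It shows the month being shifted to 0..11 before the `timelocal` call, and `localtime` handing the same fields back.

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# Hypothetical helper mirroring the 6-arg branch of new():
# callers pass a 1-based month, timelocal wants 0..11.
sub epoch_from_parts {
    my ($y, $m, $d, $H, $M, $S) = @_;
    return timelocal($S, $M, $H, $d, $m - 1, $y);
}

my $epoch = epoch_from_parts(2016, 1, 24, 7, 32, 0);

# Round-trip through localtime: the month comes back 0-based,
# and the year comes back as years since 1900.
my ($S, $M, $H, $d, $m, $y) = localtime $epoch;
printf "%04d-%02d-%02d %02d:%02d:%02d\n", $y + 1900, $m + 1, $d, $H, $M, $S;
```

Because both directions use local time, the printed wall-clock fields should be the ones passed in, whatever timezone the script runs in.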

Also note the use of scalar. Something I missed until this time around is that Time::Piece->_mktime will return a list of datepart values when called in list context, which is something that’s not useful in this application. Of course I had to go back and do this in the date class as well.
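The context issue can be reproduced without Time::Piece. In the sketch below, `make_time` is a hypothetical stand-in that behaves the way the paragraph above describes `_mktime` (a list of parts in list context, an object in scalar context); `new_leaky` and `new_fixed` are likewise invented names.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a context-sensitive constructor:
# a list of date parts in list context, an object in scalar context.
sub make_time {
    my ($class, $epoch) = @_;
    my @parts = gmtime $epoch;
    return wantarray ? @parts : bless [@parts, $epoch], $class;
}

# A sub's return expression inherits the caller's context ...
sub new_leaky { my ($class, $t) = @_; return $class->make_time($t) }

# ... unless scalar() pins it down.
sub new_fixed { my ($class, $t) = @_; return scalar $class->make_time($t) }

my @leaky = main->new_leaky(0);   # list assignment: the raw parts leak out
my @fixed = main->new_fixed(0);   # still a list assignment, but one object comes back

printf "leaky: %d elements, fixed: %d element (%s)\n",
    scalar @leaky, scalar @fixed, ref $fixed[0];
```

The leak only shows up when the constructor is called in list context, for example inside an argument list, which is probably why it is easy to miss.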

Where ::Date had today, ::Datetime will have now:

sub now () { Date::Easy::Datetime->new }

Again, super-trivial. The prototype allows you to do cool things like now + 30 to mean “thirty seconds from now.” And, unlike with ::Date, that works right out of the box, because Time::Piece already defines addition (and subtraction) the way that we want it. Nifty.
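A small sketch of the prototype effect, using plain Time::Piece rather than Date::Easy (so `now` here is a local stand-in, not the module's function). Without the empty prototype, Perl would parse `now + 30` as `now(+30)`.

```perl
use strict;
use warnings;
use Time::Piece;

# The empty prototype tells the parser that now takes no arguments,
# so `now + 30` is (now()) + 30 rather than now(+30).
sub now () { Time::Piece->new }

my $later = now + 30;      # Time::Piece overloads +, adding seconds
my $gap   = $later - now;  # subtracting two objects yields a Time::Seconds

print "later is a ", ref $later, "\n";
printf "gap is about %d seconds\n", $gap->seconds;
```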

Parallel to ::Date’s date function, for parsing human-readable strings, we’ll need a datetime version:

sub datetime
{
    my $datetime = shift;
    if ( $datetime =~ /^-?\d+$/ )
    {
        return Date::Easy::Datetime->new($datetime);
    }
    else
    {
        return Date::Easy::Datetime->new( _str2time($datetime) );
    }
    die("reached unreachable code");
}

As you can see, it works pretty much the same, only a bit simpler: where I wanted date to handle things like compact datestrings, I didn’t have that need here. So right now it just handles an integer, which it interprets as a number of epoch seconds, and everything else it tosses off to Date::Parse’s str2time function, like so:

sub _str2time
{
    require Date::Parse;
    return &Date::Parse::str2time;
}

Much easier than the version in ::Date. Also note that I haven’t yet added the fallback to Time::ParseDate, but that should be trivial to do for next time.
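Two details of the code above are easy to miss: the `/^-?\d+$/` test that routes integers to the epoch path, and the `&Date::Parse::str2time;` call, which has no parentheses and therefore passes the current `@_` straight through. The sketch below illustrates both with a hypothetical `fake_str2time` so that it runs without any CPAN modules; `_parse` and `dispatch` are also invented names.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Date::Parse::str2time, so the sketch has no CPAN dependency.
sub fake_str2time { my ($str) = @_; return "parsed<$str>" }

# Same shape as _str2time: `&name` with no parentheses reuses the caller's @_ as-is.
sub _parse { return &fake_str2time }

# Same dispatch as datetime(): an optional minus sign plus digits means epoch seconds,
# anything else is handed to the parser.
sub dispatch {
    my $arg = shift;
    return $arg =~ /^-?\d+$/ ? "epoch<$arg>" : _parse($arg);
}

print dispatch($_), "\n" for '1453620720', '-86400', 'Jan 24 2016 07:32';
```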

One of the reasons this round took me longer than I’d anticipated¹ was a bit of trickiness I hadn’t considered. Remember those unit tests I stole from Date::Parse? They consist of a string and a corresponding number of epoch seconds. For dates, I had to finagle that string quite a bit, and the number of epoch seconds wasn’t useful at all. For datetimes, though, I can just blast the string straight through and I should end up with the epoch seconds value, right? Well, if I had started with UTC, then yes. (In hindsight, I should have done just that and added local time afterwards, just as I should have started with datetimes and done dates after that. But for some reason I like doing things the hard way.) But, since I’m doing the local time version, the epoch value from Date::Parse is not what I actually expect to get.

Well, how does the Date::Parse unit test do it then? First of all, it has a hard-coded number (5, in this case) of strings which don’t have timezones in them and therefore should be parsed as relative to the current timezone. For the first 5 strings it sees, it adjusts the epoch value by an amount based on running localtime and gmtime against the same value and then calculating a delta. Time to steal some more code, I suppose. So I copied over the code to do the adjustment (with a few small tweaks), but I didn’t like hardcoding a number of tests which required that adjustment. So I decided to reuse my method of identifying which strings have timezones and which don’t: a regex to pick out all the timezone patterns which Date::Parse knows about. And I already have such a regex from last time. So all I needed to do is share it between the date unit tests and the datetime unit tests. Which I did.
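The localtime/gmtime delta described above can be sketched as follows. This is not the code copied from the Date::Parse test suite, just one common way to compute the same kind of adjustment with core modules; `local_offset` is a hypothetical name and the epoch value is an arbitrary example.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Offset of the local zone from UTC, in seconds, at a given instant:
# read the local wall-clock fields, then pretend they were UTC.
sub local_offset {
    my ($epoch) = @_;
    my @lt = localtime $epoch;
    $lt[5] += 1900;                       # give timegm an unambiguous 4-digit year
    return timegm(@lt[0 .. 5]) - $epoch;
}

# Hypothetical test datum: the epoch value of a zone-less string *if read as UTC*.
my $utc_epoch = 1453620720;               # 2016-01-24 07:32:00 UTC

# The same wall-clock time read as local time is shifted by the offset.
my $expected_local = $utc_epoch - local_offset($utc_epoch);

print "offset: ", local_offset($utc_epoch), " s, expected: $expected_local\n";
```

The sketch assumes the zone's offset is the same at both instants, which can fail close to a daylight-saving transition.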

I also spent way too long fiddling with my regex, until I eventually realized I’d hit some sort of weird behavior in Perl ... I won’t say it’s a bug, but I sure don’t understand it. See, the vast majority of timezone specifiers that Date::Parse knows about are at the end of the string, so naturally I used a $ anchor. But, every once in a while, you get one just before the date, like so:

Jul 22 10:00:00 UTC 2002

My first attempt for this was just to use an optional look-ahead, something like this:

[A-Z]{3} (?= \h+ \d{4} )?

(Remember this is part of a much larger pattern which uses /x.) Which worked perfectly when I was just matching. But when I tried to share my regex with the date class, which actually has to remove the timezone code, I ran into a problem. At first I thought it was something relating to an optional look-ahead. Then I thought it had to do with the whitespace somehow. Eventually, though, I narrowed it down to this:

[absalom:~] perl -le 'print "A B C" =~ s/B (?=C)//r'
A C
[absalom:~] perl -le 'print "A B C" =~ s/B (?=C)$//r'
A B C

Doesn’t matter whether the whitespace is inside the look-ahead or not, and it doesn’t make a difference if you replace $ with \Z. If anyone wants to take a crack at explaining to me why this might actually be correct, I welcome the input. It sure looks wonky to me.

The full code for Date::Easy so far is here. Of special note:

export code; as usual, Date::Easy exports everything, while Date::Easy::Datetime exports only what you ask for
unit test which tries to verify that Date::Easy::Datetime->new, now, and time all return the same thing, even though we face the unpleasant reality that the system clock might roll over to a new second in between the assignments
how I fixed the problem of calling _mktime in list context in Date::Easy::Date
the stolen code from Date::Parse to do the unit test adjustment for local times
the shared timezone-identifying regex code for Date::Parse
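For the clock-rollover problem mentioned in the second item of the list above, one possible approach is to retry whenever the second changes while the readings are being taken. The linked unit test may well do it differently; this is only a sketch under that assumption, and `read_in_same_second` is a hypothetical helper.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: call every reader in turn and accept the readings only if
# the clock did not tick while they were being taken; otherwise try again.
sub read_in_same_second {
    my ($tries, @readers) = @_;
    for (1 .. $tries) {
        my $before   = time;
        my @readings = map { $_->() } @readers;
        return @readings if time == $before;
    }
    return;    # gave up: could not get a clean set of readings
}

my @epochs = read_in_same_second(
    5,
    sub { Time::Piece->new->epoch },    # stand-in for Date::Easy::Datetime->new
    sub { time },
);
print "readings: @epochs\n";
```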

Next time, I’ll add in the Time::ParseDate fallback, figure out how to handle the UTC version of datetimes, and hopefully slap some POD in here. At that point, Date::Easy won’t be done, but it will be sufficiently useful to put up on CPAN for all you folks to start beating up. I’m looking forward to it!

__________

1 Other than the interposition of Christmas and New Year’s, of course.

Tagged as:

Date::Easy

14 Comments

E. Choroba | January 24, 2016 7:57 PM

To answer your question: in "A B C", you can remove "B " that's followed by C, but you can't remove "B " at the end of the string, that's followed by C. The look-ahead is zero-length, so the $ must match after the space, not after the C. If you want to just look at it, include it in the parentheses:

$ perl -le 'print "A B C" =~ s/B (?=C$)//r'

A C
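The three substitutions discussed so far can be compared side by side in one small script. The variable names are made up for the illustration; the patterns are the ones from the post and from this comment.

```perl
use strict;
use warnings;
use 5.014;    # for the /r (non-destructive) substitution flag and say

my $str = 'A B C';

# Look-ahead only: "B " is removed because a C follows it.
my $plain = $str =~ s/B (?=C)//r;

# Look-ahead, then $: both assertions are tested at the same position (just after
# "B "), where a C follows but the string does not end, so nothing matches.
my $anchor_outside = $str =~ s/B (?=C)$//r;

# $ inside the look-ahead: "a C follows, and after that C the string ends".
my $anchor_inside = $str =~ s/B (?=C$)//r;

say "[$_]" for $plain, $anchor_outside, $anchor_inside;
```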

Buddy Burden replied to comment from E. Choroba | January 24, 2016 10:31 PM

Thanks for the explanation!

> The look-ahead is zero-length, so the $ must match after the space, not after the C.

That still seems weird to me ... if two consecutive zero-length tokens need to match, they must both match at the same point? Consecutive tokens should match consecutively, it seems to me, regardless of their length.

As far as your workaround goes:

> If you want to just look at it, include it in the parentheses:

That doesn't really help in this case. Recall that the actual subpattern is more an analog of:

/B(?= C)?$/

meaning either a "B" at the end of the string, or a "B" followed by a space and a "C" at the end of the string. And this is a subpattern, one of many, all of which have to be at the end of the string. So an analog of the entire pattern might be something along these lines:

/(X|Y|B(?= C)?)$/

which makes it even more difficult to apply the workaround. I suppose I could change it to something more like:

/(X|Y|B)$|B(?= C$)/

but at that point it's confusing enough (not to mention the repetition of "B," which in reality is a more complicated subpattern) that my chosen workaround seems clearer. That is, I just gave up on the look-ahead altogether and went with:

my $re = qr/(X|Y|B( C)?)$/

and then I just had to change:

s/$re//

to:

s/$re/$2/

(well, and I had to turn off the "uninitialized" warnings, since $2 is often undefined).
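Here is a runnable sketch of that workaround on the analog pattern. It swaps in one variation: instead of turning off the "uninitialized" warnings, it uses `/e` with the defined-or operator so an unset `$2` becomes an empty string. `strip_token` is a hypothetical name.

```perl
use strict;
use warnings;

# The analog pattern from above: X, Y, or B optionally followed by " C",
# all at the end of the string.
my $re = qr/(X|Y|B( C)?)$/;

# Strip the matched token but put back whatever group 2 captured.
# `// ''` under /e avoids the "uninitialized" warning when group 2 took no part.
sub strip_token {
    my ($str) = @_;
    (my $copy = $str) =~ s{$re}{ $2 // '' }e;
    return $copy;
}

print "[", strip_token($_), "]\n" for 'A B C', 'A B', 'A X', 'A Q';
```

Note that for 'A B C' the space that preceded the B stays in place, so the result has two spaces between the A and the C.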

But I'll take your word for it that it's WAD. I'm sure the reasoning makes sense from an implementation standpoint, somehow, but I can't think of a case where this would actually be desired behavior from a usage perspective.

Aristotle replied to comment from Buddy Burden | January 25, 2016 9:47 PM

> if two consecutive zero-length tokens need to match, they must both match at the same point?

Yes.

> Consecutive tokens should match consecutively, it seems to me, regardless of their length.

That wouldn’t make any sense. Then (?: and (?= would mean the exact same thing. If you need (?: then use (?: instead of using (?= and complaining that it doesn’t do what you need. :-)

> not to mention the repetition of "B," which in reality is a more complicated subpattern

No problem.

/(?:X|Y)$|B(?= C$|$)/

Or if the $ was actually a more complicated subpattern too, you could DRY this out further:

/(?:X|Y)$|B(?=(?: C)?$)/
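A quick check of the second pattern against the analog strings from this thread (the test strings themselves are just examples chosen for illustration):

```perl
use strict;
use warnings;
use 5.014;    # for the /r substitution flag and say

my $re = qr/(?:X|Y)$|B(?=(?: C)?$)/;

# X or Y at the end is removed; B is removed when it is last,
# or when only " C" stands between it and the end of the string.
for my $str ('A X', 'A B', 'A B C', 'A B D') {
    say "[$str] -> [", $str =~ s/$re//r, "]";
}
```

Only the B is removed in the 'A B C' case; the " C" stays because the look-ahead never consumes it.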

> I can't think of a case where this would actually be desired behavior from a usage perspective.

Easy. Consider when foo(?=.*bar)(?=.*quux) will match.

Ether | January 25, 2016 10:50 PM

> if two consecutive zero-length tokens need to match, they must both match at the same point? Consecutive tokens should match consecutively, it seems to me, regardless of their length

If they are zero-length, then "at the same point" is the same as "consecutive". If they weren't zero-length, then they wouldn't be at the same point.

Buddy Burden replied to comment from Aristotle | January 26, 2016 2:34 AM

> > Consecutive tokens should match consecutively, it seems to me, regardless of their length.

> That wouldn’t make any sense.

It doesn't make any sense that consecutive atoms should match consecutively? :-D I dunno man ... that's what makes sense to me ...

> Then (?: and (?= would mean the exact same thing.

Wait ... what? I don't follow that at all. A(?:B) means an "A" followed by a "B." A(?=B) means an "A" followed by a "B," except don't include the "B" in the match string. I'm not sure how the consecutive-ness of anything would impact that.

> If you need (?: then use (?: instead of using (?= and complaining that it doesn’t do what you need. :-)

But I don't need (?: ... that wouldn't do what I need. In point of fact, (?= does do what I need ... until I add the $ in there too. Only then am I sad panda. :-(

Buddy Burden replied to comment from Ether | January 26, 2016 2:40 AM

> If they are zero-length, then "at the same point" is the same as "consecutive".

But that's only true from the perspective of the match string. That is, when we say a look-ahead subpattern is "zero-length," what we really mean is it corresponds to zero characters in the match string. But (assuming it is non-optional) it does not correspond to zero characters in the source string. So when I say:

/A(?=B)/

that can only match two characters in the source string. Or, to look at it another way, a single-character string could never match that pattern. So the look-ahead pattern is not really "zero-length" from that perspective.

Buddy Burden | January 26, 2016 2:59 AM

Oh, hey! I didn't respond to the second half of your comment. :-) Sorry about that.

> > not to mention the repetition of "B," which in reality is a more complicated subpattern
>
> No problem.
>
> /(?:X|Y)$|B(?= C$|$)/
>
> Or if the $ was actually a more complicated subpattern too, you could DRY this out further:
>
> /(?:X|Y)$|B(?=(?: C)?$)/

Admittedly, that is clever. I had not considered sticking the $ into the look-ahead. So that is certainly a possible solution that I may well look at changing over to. It still seems a little tricky for the reader to understand, but overall it probably beats out having to jam the no warnings 'uninitialized' in there.

> > I can't think of a case where this would actually be desired behavior from a usage perspective.
>
> Easy. Consider when foo(?=.*bar)(?=.*quux) will match.

Well, in that very interesting case I'd say there's a difference between what it will match and what I might want it to match. :-) I think that, if we hadn't just had this extended conversation on the topic, I would have been quite surprised to find that it matches "foo quux bar". That's remarkably counter-intuitive, to me, and I can't say that the ability to match look-ahead tokens in any order seems worth losing the principle that consecutive things should match consecutively. For instance, if I told you that this:

"ACB" =~ /A(.*B)(.*C)/

were true, you would be surprised, wouldn't you? All I'm saying is I'm likewise surprised that this:

"ACB" =~ /A(?=.*B)(?=.*C)/

is true. Everyone's explanation of why it is true makes sense, but that doesn't change my level of surprise. ;->

Aristotle replied to comment from Buddy Burden | January 26, 2016 7:13 AM

> A(?=B) means an "A" followed by a "B," except don't include the "B" in the match string. I'm not sure how the consecutive-ness of anything would impact that.

Then if you could write 'ABC' =~ /(A(?=B)C)/ and have it match because consecutive, which portion of the string would $1 be supposed to contain?
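For reference, this is what Perl actually does with that pattern, next to a variant that consumes the B explicitly (the variant is added here purely for contrast and is not from the thread):

```perl
use strict;
use warnings;

# The look-ahead checks for a B but leaves the position right after the A,
# so the literal C is then compared against that same B and fails.
my $skipping = 'ABC' =~ /(A(?=B)C)/ ? 1 : 0;

# Asserting the B and then consuming it does match, and $1 is contiguous.
my ($captured) = 'ABC' =~ /(A(?=B)BC)/;

print "skipping form matched: $skipping; consuming form captured: $captured\n";
```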

Aristotle replied to comment from Buddy Burden | January 26, 2016 7:16 AM

That surprise is ignorance leaving your mind. :-) Please do yourself a favour and read Mastering Regular Expressions. You seem to have a mistaken mental model of how regexps work; that book will beat you into shape.

Buddy Burden replied to comment from Aristotle | January 27, 2016 12:56 AM

> Then if you could write 'ABC' =~ /(A(?=B)C)/ and have it match because consecutive, which portion of the string would $1 be supposed to contain?

Why, "AC" of course. If I tell you to match "ABC" but leave the "B" out, what else could you possibly get?

I'm assuming that doesn't make any sense to you, but it seems perfectly obvious to me. :-)

Buddy Burden replied to comment from Aristotle | January 27, 2016 2:48 AM

> That surprise is ignorance leaving your mind. :-)

Hmmm ... I would say instead that it's the sound of my brain bouncing off a wall because the feature that sounded like it did exactly what I wanted did nothing of the sort. ;->

> You seem to have a mistaken mental model of how regexps work; that book will beat you into shape.

Well, I certainly have a mistaken mental model of how look-ahead works. But of course that's only one small piece of regular expressions. One which just hasn't come up all that often in the course of my programming career.

I mean, look: this is a conversation that devs and users have been having since the beginning of time. :-D (Or at least since the beginning of software time.) I know I've been on the other side of it often enough that I get where you're coming from. I'm just pointing out that understanding how and even why something works has little to nothing to do with the concept of how it ought to work. For instance, I understand perfectly well both how and why @_ works inside a Perl sub, but I'm always going to think it's stupid. :-)

Of course, sometimes people just have different perspectives on what's "intuitive." Take for example "sigil invariance" in Perl 6. When I first learned Perl 5, the way sigils worked (meaning that you say @array but $array[0]) was just logical, sensible, and comfortable. I don't like it because I got used to it: I like it because it always made perfect sense to me. When I look at code in Perl 6 like @array[0] it feels freaky and just ... wrong, somehow, on a gut level that's not easy to explain. Obviously, however, some people (including Larry, I guess) felt that it was backwards in the first place and so they changed it to make more "sense" and feel "righter" to some people. That's cool. If I end up using Perl 6, I'm sure I'll get used to it ... but that won't mean it'll make sense to me.

So I suspect we're having a similar disconnect here. The way look-ahead works seems to make perfect sense to you, but it doesn't jibe with my mental model, which is based on what I want it to do, which is based on achieving what I need to get accomplished. So it seems silly and baroque to me, because I can't imagine ever using it in the way that it actually does work, and meanwhile the way I actually need it to work right now is not even an option. :-)

But it's all good. I get it and I see how to work around it, thanks to your continued efforts to enlighten me. Thx for the time and effort! :-)

Aristotle replied to comment from Buddy Burden | January 27, 2016 2:04 PM

> Why, "AC" of course. If I tell you to match "ABC" but leave the "B" out, what else could you possibly get?

And what would @- and @+ contain?

Aristotle replied to comment from Buddy Burden | January 27, 2016 2:09 PM

Another try:

> > Then (?: and (?= would mean the exact same thing.
> Wait ... what? I don't follow that at all. A(?:B) means an "A" followed by a "B." A(?=B) means an "A" followed by a "B," except don't include the "B" in the match string. I'm not sure how the consecutive-ness of anything would impact that.

What would be the difference between (?=A(?=B)) and (?=A(?:B))?

Aristotle replied to comment from Buddy Burden | January 27, 2016 2:42 PM

> Meanwhile the way I actually need it to work right now is not even an option. :-) […] I see how to work around it, thanks to your continued efforts to enlighten me.

You’re aware that sentence 2 there directly contradicts sentence 1, yes? (Not that I agree with the characterisation as a “workaround”.)

In fact, the situation is strictly opposite of what you claim: the way it does work allows you to achieve everything you need, and the way you think it ought to work would disallow many other things that are possible with the current model.

Which is just why it’s defined this way around, and not like you think it should be.

Which is why I completely disagree that this is simply a matter of “this way is obvious to you and that way would be obvious to me”, or at least the implication in how you say it, that just because people spontaneously generate different mental models, all of them must be equally valid. Not every discipline is product design.

In particular, the way it does work allows you to say things like

/^(?!foobar)\w+\s+/

i.e. “match a sequence of whitespace-separated words except if the first word starts with foobar”, which you would be unable to express otherwise.** And (?= has a similar role, except in a “but only if” capacity.
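A short sketch of that pattern in action, with a few made-up input strings, shows both the rejection and the fact that the look-ahead itself consumes nothing:

```perl
use strict;
use warnings;

my $re = qr/^(?!foobar)\w+\s+/;

# The negative look-ahead is checked at the start of the string and consumes
# nothing, so \w+ still begins matching at the very first character.
for my $str ('foo bar ', 'foobar baz ', 'foobarbaz qux ', 'bar foobar ') {
    if ($str =~ $re) { print "'$str' matches, consuming '$&'\n" }
    else             { print "'$str' does not match\n" }
}
```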

If you fail to imagine how that could be useful to at least somebody else (and almost certainly even yourself), I’m afraid that’s not a failure of empathy on my part.

(Also, just technically, I’d be curious about how negative look-behind fits into this consecutive matching mental model.)

** For me it would “merely” be excessively painful. It’s not impossible, but you’d have to fix your mental model to be able to figure it out, and even so you wouldn’t want to have to resort to that way of doing it.

About Buddy Burden

14 years in California, 25 years in Perl, 34 years in computers, 55 years in bare feet.
