Fighting a 30-year-old software bug
Er, an earlier version of this said the software bug was 20 years old. It's 30! Thanks, Sidhekin
Today, in reading some data, I processed 36,916 potential dates. Two of those 36,916 failed to validate. I wasn't terribly concerned as these dates come from client-supplied data and this data is often, um, surprising. However, when I inspected the raw data, it turns out that those dates were January 1st, 2011 and January 1st, 2007. I had a bug in software I wrote about a month ago, but it turns out that this bug is actually 30 years old.
For anyone who doesn't really understand the software ecosystem, this may sound mystifying, but it makes sense. Because of a decision taken a long time ago to make another company money, my $client lost money in paying me to fix a bug that one company accidentally introduced and another company deliberately introduced. But to explain it I need to talk about a third company that introduced a feature that eventually became a bug, and a few other historical tidbits that nonetheless contributed to the obscure bug I fixed today.
In the good ol' days, Apple computers would sometimes spontaneously reset their date to January 1st, 1904. The reason for this is fairly simple. Back then, Apple computers used battery-powered "system clocks" to keep track of the date and time. What happened when the battery ran out? Apple computers tracked their dates as the number of seconds since the epoch. In this sense, an epoch is merely a reference date from which we start counting and for Macintosh computers, that epoch was January 1st, 1904 and when the system clock battery died, that was your new date. But why did that really happen?
Back then, Apple used 32 bits (ones and zeros) to store the number of seconds from their start date. One bit can hold one of two values, 0 or 1. Two bits can hold one of four values, 00, 01, 10, 11. Three bits can hold one of eight values, 000, 001, 010, 011, 100, 101, 110, 111, and so on. How much can 32 bits hold? 32 bits can hold one of 232, or 4,294,967,296, values. For Apple dates that was approximately 136 years, which is why older Macs couldn't handle dates after 2040 and if your system clock battery died, your date would reset to 0 seconds after the epoch and you'd have to keep manually resetting the date every time you turned on your computer (or until you bought a new battery for your system clock).
However, the Apple solution of storing dates as the number of seconds after the epoch means we couldn't handle dates before the epoch and that had far-reaching implications, as we'll see. This was a feature, not a bug, that Apple introduced. It meant, amongst other things, that the Macintosh operating system was generally immune to the Y2K bug (though many Mac apps weren't because they would introduce their own date system to work around the Mac limitations).
Moving along, we have Lotus 1-2-3, IBM's "killer app" that helped to launch the PC revolution, though it was VisiCalc on the Apple that really launched the personal computer. It's fair to say that if 1-2-3 hadn't come along, PCs would likely have not taken off the way they had and computer technology would have turned out considerably differently. However, Lotus 1-2-3 incorrectly reported 1900 as a leap year. When Microsoft released Multiplan, their first spreadsheet program, it didn't have much market penetration. So when they conceived of Excel, they decided to not only copy 1-2-3's row/column naming scheme, they made it bug-for-bug compatible, including deliberately treating 1900 as a leap year, a problem that remains to this day. So for 1-2-3, this was a bug, but for Excel, it was a feature to guarantee that everyone who used 1-2-3 could import their spreadsheets into Excel with no differences in the data, even if the data were wrong.
Eventually Microsoft decided to release a version of Excel for Apple's Macintosh computers, but they had a problem. As mentioned, Macintosh didn't recognize dates prior to January 1st, 1904. However, Excel used January 1st, 1900 as its epoch. So Excel was modified to recognize what the epoch was and internally stored dates relative to these respective epochs. This Microsoft support article explains the problem fairly clearly. And that leads to my bug.
My current $client receive spreadsheets from many customers. Those spreadsheets may have been produced on Windows, but they may have been produced on a Mac. As a result, the "epoch" date for the spreadsheets might be January 1st, 1900 or January 1st, 1904. How do you know which one? Well, the Excel file format exposes this information, but the parser I am using doesn't and it expects you to know whether you have a 1900 or 1904-based spreadsheet. I suppose I should have spent a lot of time trying to figure out how to read the binary format of Excel and sent a patch to the maintainer of the parser, but I have many other things to do for $client and so I quickly wrote a heuristic to determine whether or not a given spreadsheet was 1900 or 1904. It was pretty simple.
In Excel, you may have a date of July 5, 1998, but it might be formatted as "07-05-98" (the useless US system), "Jul 5, 98", "July 5, 1998", "5-Jul-98" or any of a number of other useless formats (ironically, the one format my version of Excel doesn't offer is the standard ISO 8601 format). Internally, however, the unformatted value is either "35981", for the 1900 date system, or "34519", for the 1904 system (these numbers represent the number of days after the epoch). So what I do is use a fairly robust date parser to extract the year from the formatted date, and then an Excel date parser to extract the year from the unformatted value. If they're four years apart, I know I'm using the 1904 date system.
So why don't I simply use the formatted date? Because July 5, 1998 might be formatted as "July, 98", losing me the day of the month. We get our spreadsheets from so many companies and they create them in so many different ways that they expect us (meaning me, in this case) to figure it out. After all, Excel gets it right, I should, too!
That's when 39082 kicked me in the tail. Remember how Lotus 1-2-3 considered 1900 a leap year and how that was faithfully copied to Excel? Because it adds an extra day to 1900, many date calculation functions relying on this can easily be off by a day. That means that 39082 might be January 1st, 2011 (on Macs), or it might be December 31st, 2006 (on Windows). If my "year parser" extracts 2011 from the formatted value, well, that's great. But since the Excel parser doesn't know whether it's a 1900 or 1904 date system, it defaults to the common 1900 date system, returns 2006 as the year, my software sees that the years are five years apart assumes an error, logs it, and returns the unformatted value.
To work around this, I now have the following (pseudo-code):
difference = formatted_year - parsed_year if ( 0 == difference ) assume 1900 date system if ( 4 == difference ) assume 1904 date system if ( 5 == difference and parsed month is December and parsed day is 31 ) assume 1904 date system # yeah, I had 1900 originally. Thanks for spotting that
And now all 36,916 dates parse correctly.
Note: for extra fun, if you have a Mac and you're running Excel, try entering a date before 1904 and then formatting it into a different date format. You can enter the date, but you can't format it because Excel will accept it, but will treat it as a text field. Meanwhile, for Microsoft Excel, all days of the week prior to March 1st, 1900 are wrong, due to a bug in software released in January of 1983.
Update: It was pointed out that Spreadsheet::ParseExcel does parse Excel's 1904 flag. Unfortunately, I'm using
Spreadsheet::ParseExcel::Stream, which does not. Even on beefy boxes, we've run out of memory with the using the standard parser so we've had to fall back on the streaming parser. So far every attempt I've made to work around this has hit yet another bug.
Update 3: And according to an anecdote from Joel Spolsky, the Lotus 1-2-3 "bug" may have been a deliberate attempt to simplify the Lotus software. I had previously read hints that Lotus did this deliberately, but since I couldn't say for sure, I left it out.