A Desired SQL Affordance

When talking about an affordances, it turns out that there are different definitions, but I like to think of an affordance as "an element of design which suggests the correct course of action", which seems to be pretty much what Donald Norman meant when he adopted the term. It's an elegant concept which, if applied properly, makes it easy to do the right thing and harder to do the wrong thing. However, sometimes you want to do the wrong thing, so an affordance which makes this impossible isn't the most useful in my book. I just want doing the wrong thing to be jarring enough that it would give anyone pause.

And this brings me to the following issue I often see in SQL:

SELECT count(*) 
  FROM employee 
 WHERE salary < 20000;

If salary is allowed to have a NULL value, then employees with a NULL salary won't be included in this count. If you're familiar with SQL, you will instantly know why. However, even those familiar with SQL sometimes fall into a trap here because they think "hey, why didn't I count the unpaid volunteers?"

There's a very simple way of removing that trap, but I seriously doubt anyone will ever implement it.

First of all, what is a NULL value? CJ Date argues that every attribute in a database must have a name and a type. Since a NULL by definition cannot have a type, he argues that it doesn't belong in a relational database. However, despite well-known problems with NULL values, they remain because sometimes you really do need them (primarily to improve performance by limiting the number of joins). So if we're going to have NULL values but we don't know the type, what does this imply?

Consider the following:

my $temperature = $degrees_fahrenheit + $degrees_celsius;

That's obviously non-sensical, even though it will likely work. This is because even though the $degrees_X variables are likely floats, their types are actually different, even if the underlying language doesn't support this. If those were actually objects, you could presumably overload the '+' operator to understand how to add these two types (and figure out which you're going to return), but in the absence of defining operators between these types, you can't know how to manipulate them. That's why, in SQL, "salary < 20000" will never select a NULL salary. Since you don't know what the type of NULL is, you can't define operators to work on it. That's also why we have "salary IS NULL" instead of "salary = NULL", a source of constant confusion to those new to databases.

So what's the affordance I mentioned earlier? If we replaced the word NULL with UNKNOWN, I think we would quickly understand why the above SQL is not equivalent to this:

SELECT count(*) 
  FROM employee
 WHERE salary < 10000
    OR salary IS UNKNOWN;

Once people got used to seeing UNKNOWN instead of NULL, it would be obvious why salary < 10000 could never match a UNKNOWN value and I think a lot of confusion would be avoided.

Yet I can imagine hearing someone say "but wait, what if that's NULL because they're a volunteer and really don't have a salary?"

This is where the beauty of the affordance comes in. A NULL value in the salary field is probably going to be a violation of the 4th normal form, and if you put a "reason" field in the same table (as some people do to describe NULLs), then you're violating 3rd normal form as well! Thus, we have an affordance which helps you to normalize your database but it's still possible to violate this affordance, if needed, which I think is a desirable property.

The affordance will encourage us to make it right. We can make it fast later.

4 Comments

Of course I’d ask, why not assign the volunteers salary values of zero? After all, their salaries are not unknown at all. A “known to be nothing” is not the same thing as an “unknown”.

I wouldn't assign volunteers a salary of zero. If someone's not an employee, they don't belong in an employee table. An "employee" implies all sorts of other things, so to add a non-employee to it, you have to add all sorts of other fields to make the real emloyees, the fake ones, and so one. Then you knock-on the code to provide lots of business rules to think about fake employees. This employee classification issue is a particular problem I've help to make tractable (not exactly "solve") for one of my customers. It's the lousy schemas that make it so hard to do sensibly.

I get Ovid's main point, and I'm not disputing that, but there's a lot that people can do to give themselves affordances where none exist (and Ovid does mention the normal forms that can do that). People often commit themselves to a poor schema with unnormalized data though, so they are stuck and have no one to blame but SQL.

CJ Date's books are quite the education, as well as quite depressing on how far from ideal practice we actually are.

Find a balance between an appropriate level of normalization and the amount of code you want to maintain, or simply strive for better understanding in the use of NULL. Sure, you can move the salary column into a salary table and link to it through an employee_salary table using strict constraints. Result: no NULL possible for salary and you're approaching higher normal forms; but now you have 2 extra tables in your database, 2 extra classes in your ORM, a busier schema, more complex SQL to think about, etc.

brian:

Oh, the database schema at $work has a person table referenced from others such as account. I’m well aware of those concerns.

My suggestion was limited to the case where the question of salary is all that distinguishes employees from volunteers, as in Ovid’s examples.

Of course if salary = 0 becomes a flag on which to key all sorts of other behaviour differences, the schema is badly designed, just as it would be if salary IS NULL constituted that flag.

Leave a comment

About Ovid

user-pic Have Perl; Will Travel. Freelance Perl/Testing/Agile consultant. Photo by http://www.circle23.com/. Warning: that site is not safe for work. The photographer is a good friend of mine, though, and it's appropriate to credit his work.