Quickly Check for Database Design Flaws
Don't you just hate it? You've finished reading, again, that blog entry about database design and you're feeling that you can design something reasonable, and then you see this table:
EmployeeID| | SalesPerson| | SalesOffice| | OfficeNumber| | Customer1| | Customer2| | Customer3 |
---|---|---|---|---|---|---|
1003 | Mary Smith | Chicago | 312-555-1212 | Ford | GM | |
1004 | John Hunt | New York | 212-555-1212 | Dell | HP | Apple |
1005 | Martin Hap | Chicago | 312-555-1212 | Boeing |
You can easily see that Customer1
, Customer2
, and Customer3
are wrong, but what about the rest? Try as you might, you can't quite put all of the rules together that easily to figure out what's wrong with the above table.
There's a shortcut, though, and it makes it very easy to start understanding database design.
If you've read my blog post, How to fake a database design, or seen my related talk, you know that I focus heavily on nouns.
Take a look at the above table and pull out the nouns. You see that we have employees, offices, and customers. If you assume that each noun probably requires a separate table, you have a great start at normalization without necessarily knowing what the final database form will look like.
In the above, we know we might have an employees
table, an offices
table, and a customers
table, all just by separating the nouns. At this point, you ask "what if an employee is also a customer?" That's where expertise comes in because it's often tricky knowing how to separate out issues like this.
For a counter-example of separating nouns, consider a recipes table:
recipes
-------
recipe_id
name
directions
directions
is a noun, so why isn't that in a separate table? Because the name
column is merely a shorthand for directions
. You could argue that if two recipes had the same directions, they're the same recipe. Thus, the noun directions
is wholly dependent on the key.
Here's a different, but common example:
customers
---------
customer_id
name
address
You see that so often that it almost looks natural, but what if you have more than one customer at an address? What if a customer has more than one address? Should your database have addresses without customers? Because customer and address are both nouns and the address is clearly not wholly dependent on the customer, it belongs in a separate table.
So there you have it: if you put aside the rules of normalization for a moment and just focus on the nouns, a lot of the low-hanging fruit of database design jumps out at you.
For added fun, read A Simple Guide to Five Normal Forms in Relational Database Theory (pdf).
I agree that natural language offers important clues on how to design the data model. But it seems to me your advice is too simple, and not literally correct.
You seem to argue that whenever there is noun we should consider a separate table. But there are nouns everywhere, way more than you mention.
For instance, in the recipe example, "customer_id" and "name" are also nouns. Yet we know they should not have a separate table.
To extract meaningful clues from the nouns we have to look further at what kind of noun it is.
For example, if it is a compound noun, then that often indicates a propery of some entity ("customer_id" is the "id" of the "customer").
The noun "directions" also provides an important clue: because it is a plural, there can be multiple things called "direction" associated with a recipe. So contrary to how you treat it, I do not consider it a shorthand for the name of the recipe: I would give the recipe a child table to hold its list of directions. This is apt since a direction has properties of its own: apart from the sequence within the other directions and the text describing what one has to do, it can also have an indication of duration. And if we'd want to normalize even further a direction could itself have a list of (references to) the utensils that are required to complete the task conveyed by the direction.
If the noun is based on a verb then that can indicate we're dealing with a many to many relationship, and constructing a sentence with such a verb can provide clues about which kinds of things are related to one another. "Order" is a simple example: If we have an order, then someone ("customer") is ordering something ("products") from someone else ("store"). While the verb does not contain any information as to exactly which things are related and how exactly they are related, the mere occurrence of a verbal known can serve as an indication that further investigation is required to determine whether it constitutes a relationship between several tables.
Roland, thanks for your feedback. In this case I would suggest that you're overthinking the problem. The "directions" field is a perfect example of the sort of unstructured data that is difficult to decompose in a regular manner and thus is hard to model directly (which has significant implications for 1st normal form in that makes us consider what atomic values are). Because it had a one-to-one relationship with the "recipes" table and is not optional, it's wholly dependent on the key. A "TEXT" field on the "recipes" may very well be the perfect thing for it ... or it may not, depending on the business rules.
As pointed out in the above, this is a "quick check". For many beginning database students, merely realizing that nouns can be a guidepost (but not the endpoint!) for normalization makes a great starting point. This, in turn, makes the rest of the normalization questions much easier.
(Note: the atomic comment above plays havoc with many normalization schemes. Is "222 Jump Street" an acceptable value? Maybe for someone's street address for a mailing list, but if you run a mapping service, maybe street number and street name should be different?)
Ovid, my comment about directions was just to illustrate that a plural noun (as opposed to a singular) often indicates a one to many relationship. And rather than treating it as an exception right away, I showed how that would play out in this case.
You state that I am overthinking the problem and that directions is a "perfect example of unstructured data". I would say that it is a perfect example of how the requirements of your database determine whether it is to be treated as structured or unstructured.
If your main requirement is fast, read-only query access to recipes, then we might consider not normalizing at all and storing recipes as documents, and only extracting the most basic items ("name") and store those in a structured matter. More fancy requirements, like alllowing "variation" recipes (say, cake and a gluten free variation of the same cake), allow queries on the ingredient level ("filter for gluten-free recipes"), the ability to view recipes using localized measures for weight, volume, temperature, then imposing more structure will be the way to go. Even a modest requirement to have recipes report a ingredient list that is consistent with the directions already requires structuring the directions.
For this reason it seems impossible to say that some kind of data is a perfect example of unstructured data, for the simple reason that "unstructured" is not a property of the data but of our requirements.
You mention that this is just a quick check for beginners. But what do you tell such a beginner when they point out to you that according to the dictionary, "name" is also a noun? It seems to me all of the fields you mentioned are nouns, so how to distinguish which one become tables and which ones become columns? Distinguishing between plural/singular and compound nouns and verbal nouns is already a step in the right direction.