How to fake a database design

Many database students rejoiced when a paper was published showing a shortcut to fifth normal form (pdf). From the abstract:

A key is simple if it consists of a single attribute. It is shown that if a relation schema is in third normal form and every key is simple, then it is in projection-join normal form (sometimes called fifth normal form), the ultimate normal form with respect to projections and joins.

What the hell is fifth normal form and why do we want it? Well, it deals with cases where we can avoid redundancy when information can be reconstructed from smaller bits of information and ... and ... and ...

OK, so that's not helping. In fact, the vast majority of explanations on the Web aren't helping, so I'll explain how to fake database normalization. I'll even avoid big words.

For those of you who know nothing about databases, this will help you tremendously. For those of you who know databases, try not to throw your coffee mug at the screen. You might break something. What I present here is so incredibly limited that it won't help for larger systems, but for smaller systems, having a partially normalized database (yeah, yeah, I know, it's like partially alive) is far better than what we actually see in the real world.

This brief introduction to normalization starts with the following:

It's common for people new to relational databases to approach the design of the tables like they would a flat, Excel-like spreadsheet: to design the tables as if they were isolated, with all the data needed for a particular functionality crammed into a single table.

I see this a lot. If you've been a programmer for any length of time and you interact with databases, you need to move on from this. I won't always go into the why of normalizing a database, but I'll give you some tips.

Years ago at one company I went on a leave and when I came back, I discovered that the company had hired a DBA at a considerable salary. I was overjoyed, until I found out that he had sweet-talked his way into the job and had no understanding of databases (he was fired after several months of agony). For example, we were creating a proposal for a company that sold pears and they had a "pear recipe" database on their Web site. Customers could log in and search for recipes by ingredient. Our "DBA" said we could import their database directly because it was already normalized. The recipes table had columns ingredient1, ingredient2, through ingredient8.

If you're an experienced programmer, even if you don't know how to design a database you're probably thinking "what if you have more than 8 ingredients?" (As it turns out, ingredient8 had CSV values!).

At its simplest, our database has at least three entities (but we'll call 'em "nouns"): customers, recipes and ingredients. Congratulations! Now you know that you need at least three tables, one for each noun. You can figure out for yourself what data is in those tables.

But how do we know which ingredients go with which recipe? Because one recipe can have one or more ingredients and one ingredient can be on one or more recipes, we have a "many to many" relationship. Every "many to many" relationship gets its own table, often called a lookup table. It might look like this:

.--------------------.
| recipe_ingredients |
|--------------------|
| recipe_id          |
| ingredient_id      |
'--------------------'

Each of the ID fields should contain a foreign key constraint against the table that ID should be in. A foreign key constraint says "this value must exist in the corresponding column in the table it refers to". That prevents you from inserting an ingredient ID which doesn't exist in the ingredients table. It also prevents you from deleting, say, a recipe or ingredient that's in the recipe_ingredients table.

But how much of each ingredient do we want? Well, maybe a recipe calls for 2 cups of sugar. '2' is the quantity, 'cups' is the measure and 'sugar' is the ingredient. We haven't modeled measures, so now we'll need a fourth table called measures containing values like "cup", "teaspoon", "gallon", "gram" and so on. And now our lookup table might look like this:

.--------------------.
| recipe_ingredients |
|--------------------|
| recipe_id          |
| ingredient_id      |
| measure_id         |
| quantity           |
'--------------------'

We put the measure_id and ingredient in this lookup table because they're specific to this ingredient for this recipe.

So let's say that a customer can only view three free recipes before they have to buy recipes. Each purchase is an order (ooh, look another noun). So now we have an orders table, too. If we assume that one customer can have zero, one, or many orders, but each order can have only one customer, than we have a "one-to-many" relationship from customer to orders. In that case, each order should have the id of the customer it belongs to:

.--------------------.
| orders             |
|--------------------|
| order_id           |
| customer_id        |
'--------------------'

Again, the customer_id should have a foreign key constraint against the customers table. You will have other information in that table, such as the order date, but we're keeping this simple.

But how do we know which recipes a customer ordered? If you think about it, one order might have many recipes and one recipe might be on many orders, so we have a "many to many" relationship again (don't forget the foreign key constraints):

.--------------------.
| order_recipes      |
|--------------------|
| order_id           |
| recipe_id          |
'--------------------'

(Note: Some people like to name lookup tables with a _to_ in the name: orders_to_recipes, recipes_to_ingredients, and so on, just to make it clear by looking at a schema which tables are lookup tables.)

So our final database might have:

  • customers
  • orders
  • order_recipes
  • recipes
  • ingredients
  • recipe_ingredients
  • measures

In other words, a simple database that allows customers to order recipes might have seven tables (it will probably have more). In the "pears" database I mentioned above, we had three. It was a nightmare. The customer did not accept our bid due to our "overcomplicated" database.

Conclusion

The above is woefully inadequate, but it gives you a vast head-start over many developers who don't even understand the simple rules above. In short:

  • Every "noun" gets its own table.
  • "many to many" relationship get their own tables
  • "one to many" relationship require the table that "owns" another table have its id in that other table
  • Any time a table has an id that refers to a row in another table, use a foreign key constraint to make sure that id really exists

There is much, much more we could say on this topic, but others have said it far better than I. What I describe above is far from perfect, but it lets you skip a lot of jargon (many people who design perfectly adequate databases can't describe the first three normal forms). As you gain experience in database design, you'll want to read more about normalization, different forms of constraints, custom data types (well, not you MySQL fans), indexes, and far more than what I have listed here. However, if you can follow the above simple rules, you'll have a database design which is definitely better than most I have seen at the hands of inexperienced programmers. And as your system grows, you'll have a (vaguely) reasonable database to fix as you encounter limitations in your design.

30 Comments

Back when I was taught this stuff at college, the lecturer had a nice phrase that summarised the requirements for third normal form.

A table is in third normal form if all of its columns rely on the key, the whole key and nothing but the key.

Interesting that you've published this. Over the weekend I was considering digging out my "Advanced Database for Beginners" course (that I only ran once at YAPC in Birmingham) and perhaps turning it into an eBook.

I really liked reading this. Aside from the content, i found the most important part was that you had an example that wasn't entirely and abstractly fabricated, but made a lot of sense as a real business example and was easy to grok.

Once you grasp how many-to-many relationships work, it's like a lightswitch flips and you suddenly see how bad most people are at designing databases.

Thanks for this, I'd be really interested to understand some of the shortfalls of this approach (ie why it's only vaguely reasonable) as this is the approach I take nearly all of the time.

Very nice write up. Thank you

Can you explain to me why you couldnt just have a varchar field in recipes where you write "2 cups sugar" unless you wanted to create metadata and do aggregates on ingredient usage or something along those lines I dont see the point of giving ingredients their own table, especially since this would slow down queries

So I learned all that very well in uni, then spent the next several years of my life regretting having designed a few systems that way, and eventually tearing apart databases that were in that structure, deliberately making redundant flat tables to get rid of the many three-way joins routinely needed. The improvement was remarkable.

In short; yes its great in theory, but once your datasets become large, good luck on 3way+ joins. When you access that data often, and aren't using a database engine that can correctly cache as part of a view, or need it cross-database-platform, I challenge you to find a solution which can rival the speed of a got awful flat-table. In short, redundant tables solved my problems, the speed had previously gotten slower and slower... we'd done a lot of optimising.. and thrown some hardware.. helped a bit.. then when we did this, it's basically instant, and dramatically easier to maintain most queries too.

Of course, you never find this out with play-data sets that are small.

Shouldn't the entities be singular? "customer" not "customers"?

Yes the table names should be singular.

Well afaik theres always a trade-off in reality. For instance I would not have bothered to make an extra table just for the measurements. Maybe just maybe a separate attribute for the amount and the measure so that you can calculate for different servings. Maybe make the measure an enum or something. It depends what is really needed in the end. Making an overcomplicated schema makes it awful to use and error prone with no real benefit.

"I challenge you to find a solution which can rival the speed of a got awful flat-table."

Indexes and materialized views. You need a DBA who knows what he/she is doing. Your denormalization "fix" is a result of lack of knowledge. I have built 20+TB data warehouses that support fast queries.

@john: "Can you explain to me why you couldnt just have a varchar field in recipes where you write '2 cups sugar'..."

How would you find a recipe that contains "cumin"?

Probably by using LIKE '%cumin%' - whcih is slow, hard to speed up by classic indices, and requires a text search engine on top of your SQL search.

That's hoping there are no ingredient names have other meanings that can occur. As well as allowing for recipes containing "cups".

But it goes on. Some people write "cumin", some "cummin". Joining these with separate ingredient names is simple. A text search&replace is slow, and replaces things that are not ingredients.

And on...

How do you list distinct ingredients for the user to pick from?

How do you make sure searches for "lime" don't find "lime zest"?

How do you find ingredients that make sense to sell alongside the pears? (That would be E.g. the most common ingedients that are used in recipes which are commonly ordered.)

And, for the elephant in the room: How do you filter recipes that can be made or almost made with the ingredients selected by the user?

If you are using text search for that, I don't know why you are using SQL. Keep recipes in text files, and search those, and scale using a text indexing engine.

"especially since this would slow down queries"

compared to what?

@largeDatasets

I'd guess if 80% of DBA's and developeres followed the practices given, most databases in the world would be better, most developer jobs would be easier, and custoemrs would not complain about the speed.

Selective, evidence-based denormalization to gain performance is fine. Saying "denormalization is nice in theory" is a telltale sign of a crappy DBA.

This way you can search by ingredient and have meaningful results. If it was simply a varchar field, you would be doing widlcard string searches, and if that old lady kept spelling it "suger", you would not return a result. Sure, you could work around this somewhat, but why when the alternative is better?

Anonymous: Datawarehouse schemas are inherently denormalised.

Which brings me to the point, you often will use normalised structure for data entry and denormalised for reporting.

How do you do that with one database?..

Depending on the application, you would ideally have two databases, one for live data entry, and another one for reporting. The reporting database is so denormalised that you often end up with a central fact table with just keys. You then feed this through an analytics package, and presto, your customers can run blazing fast prepared and adhoc queries.

If it's a small database that doesn't need the complexity of a proper reporting system, then you still should use a normalised structure because you mustn't have many rows anyway, and performance won't be a problem. Use views for convenience in such a case.

What I've always found to be a good rule to follow when designing databases (in addition to what has already been stated) is to ask the customer: "Do you need just one or many of this thing?"

If you potentially need many, then you need to break it out into a separate table and connect it with either a one-to-many or many-to-many lookup table depending on your use-case.

When I ran across this it reminded me of when you read a review in the New Yorker; it's either a boring review because the work is good, or the work is bad and the review is funny as hell.

this was very boring, so congrats

The plural/singular table names issue comes up with ORMs like DBIC - since by default it determines your class names and you may have a preference depending on how you are using the objects.

Also pluralisation is inconsistent, which may be a pain if you are trying to generate table names dynamically in code (user => users, company => companies, sheep => sheep, etc) especially with related tables, where the singular name is the prefix to other table names ("userroles", "companyorders", etc), not the plural. Not impossible to solve, but just extra work.

The best thing is to be consistent - pick either singular or plural, and stick to it.

As Anonymous, above, mentions, it's important to understand that normalized data structures are important for OLTP (on-line transaction processing) systems, with frequent reads and writes.

Why? Well, imagine we have our recipe database, and we realize, shoot, someone misspelled the word "cumin". Well, in a denormalized schema where everything is flattened into a single table, we have to change thousands (millions? There are a lot of recipes out there...) rows. And updates require locks, and locks impact performance. Now imagine doing this on a large, live system... bad bad mojo.

Now, take our normalized schema. With this structure, because we've eliminated data duplication and isolated entities into separate tables, this update is a simple change to a single row in a small table. This means our transaction needs to only lock a very small set of data, and can perform the update very quickly.

So normalized schemas make it possible to minimize the scope of changes (thus minimizing lock contention), which means fantastic performance for databases with heavy transaction volumes.

Now, in contrast, a data warehouse, or OLAP (on-line analytics processing) system, has very different access patterns. These systems involve very heavy reads and large, extremely intermittent writes. They also involve very complex queries with various types of filtering, aggregation, and so forth. It turns out, to facilitate these kinds of activities, denormalization can be a huge win, since you can slam lots of relevant data into single tables and then index them up the wazoo while minimizing the number of joins required (the operative term, here, is a "star schema").

Of course, most people aren't building data warehouses, so normalization is the rule.

Brett

I admit I'm not a database expert, but the first thought that came into my mind when I heard about flat non-normalized tables for indexing purposes was, why not create yet another table containing the indices of the required fields?

Why wouldn't that be easier to maintain and to extend in the long run?

What if you wanted to do a search by ingredient? What if you want to capitalize on the website and link to a store to sells that ingredient? What if you get sponsored by a brand that sells one of the ingredients? What if you want to keep the naming of ingredients and the look of the page even by formatting the ingredients and quantities in different columns, colors, fonts, etc?

I'm guessing because it's an example.

Re: Naming Like most things there is a place in the middle. For Nouns and Key Entities I use singular -> Person it's clear, and how do I deal with plurals? Persons? People? When I resolve many-many with a table I use a compound plural -> PersonSkills When I have something that is ONLY distingusher (sp?) (generally something that will fill a dropdown in the GUI I use TypeOf -> TypeOfPerson.

Anything works fine in isolation, but a lack of clarity and consistancy will confuse you at some point even when you wrote it.

If you're not the only user pick a standard, tweak it to your needs and document the tweaks - best of all worlds (maybe not but a good start?)

Most importantly this allows for future needs. I did a system for a catering business and by allowing for an ingredient table xref to receipies with measures we were easily able to create a weekly order list for suppliers based on bookings, which had the number of guests, with a table for average consumption of selected menu items per guest, with a xref to min order quantites. This little report feature saved the two owners about 9 hrs or work a week to do a weekly order list.

"A table is in third normal form if all of its columns rely on the key, the whole key and nothing but the key"

Actually, Codd specified this as part of the definition of 3NF, so your lecturer was just restating the definition.

Suppose that recipes may be entered by the customer. How will your db deal with ingredients that it doesn't have in the ingredients table? What happens when a customer misspells an ingredient? It seems to me that worrying about normalizing the data into measures ("2 12 oz cans," anyone?) and ingredients (melted butter vs stick of butter?) is the very least of your problems, then, unless you plan on having staff on hand that cleans up and edits all submitted recipes. (Or you'd need an input form that makes recipe entry a laborious process of looking for the ingredients and measures in drop-down lists, with an "add ingredient" button and the like.)

If anything, the db in your example had too many tables. A recipe needs to be a single text block, with formatting allowed. Maybe even allow a document file. The input has to be sanitized (to prevent cross site scripting and the like), and analyzed (to look for keywords). Then you add the keywords to an index for quick searching, perhaps including proximity metrics (e.g. apple near strawberries) if you want a fancy feature like that.

"Can you explain to me why you couldnt just have a varchar field in recipes where you write "2 cups sugar"

You could, if you are into bad design or if you intend to store just text only for later printing process. That is not very helpful, though. Most likely a business will want to know how much sugar they need in a monthly or daily basis, so yes you have to keep the aggregation in mind which doesn't work with "2 cups of sugar".

Another thing is that you can't create an index that will contain the word "sugar" to easily filter out the recipes that contain it. You will have to scan the entire table on the ingredient/quantity column every time you need to do that, which is totally counter productive.

"Can you explain to me why you couldnt just have a varchar field in recipes where you write "2 cups sugar""

You said it correctly -- aggregation.

Because if you do this correctly, you open up opportunities that would be painful if you didn't. A great example would be using the recipies in a Bill-of-Materials sort of way, which would allow the aggregation of multiple recipies into a single ingredients list. Imagine a customer asking, "I'd like these 5 recipies - what are the ingredients I need to buy?" and having it aggregate the sugar values, like 2 cups + 1 cup + 1 TBSP + 2 TBSP = 4 cups sugar.

I worked on a system that did this, decades ago, for a company that made diapers.

About Ovid

user-pic Freelance Perl/Testing/Agile consultant and trainer. See http://www.allaroundtheworld.fr/ for our services. If you have a problem with Perl, we will solve it for you. And don't forget to buy my book! http://www.amazon.com/Beginning-Perl-Curtis-Poe/dp/1118013840/