Plumbing the Depths of Normalization

by Sue Petersen

It's mid-morning at Petersen Plumbing and after the initial rush of calls for service and dispatching I finally have a moment for a well-deserved cup of hot chocolate. Then, the next batch of calls starts. These are from the service guys and they have all sorts of different questions. The first call is from a plumber at a house on the south edge of town and he wants to know where the cleanout is for this house he's working on. Then the boss calls and wants to know how many BrandX water heaters we've sold this year and have any of them been installed on the western side of town? Someone else wants to know if we did a, "similar job on the north side of this street last week?" Then, "What did we charge this landlord the last time we replaced one of his disposals?" You can see why I cherish that brief moment with a cup of chocolate!

In some ways, my life was a lot simpler 10 years ago. The data my business created and used was stored in ledger books, on 5" by 7" file cards, and in a whole lot of file cabinets. The typewriters and desktop calculators were the only complicated machinery in the office. However, I did get awfully tired of the smell of white-out and filing has never been one of my favorite activities...

Those manual systems worked. We got the basic accounting done and we could tell who owed us money and when they paid us. But there were many questions that we just didn't think to ask (and they would probably have been unanswerable if we had.) There is absolutely no doubt in my mind that computers are complicated, but I also have no doubt that my office staff would have a heart attack if I tried to make them go back to the old, manual ways of doing things!

We got from there to here in one piece, but the journey included a lot of confusion, and way too many brick walls and dead-end paths for my peace of mind. However, along the way I did learn a fair amount about data analysis and how and why I should structure that data in the most logical way. I learned to ensure that the data was organized in a semantically correct and meaningful way and that the data could be stored in the smallest possible space. I learned to structure the database in a way that provides the greatest possible reliability and future flexibility.

It's tempting, especially with all the great GUI generators and database systems available off the shelf these days, to start cranking out systems without first sitting down and thinking about the data and what it means. (There should be a warning on these packages - Engage Brain Before Opening!) But if we are going to gain the greatest possible benefit from computer technology, we must first learn to understand our data.

Business computing has been around for many decades, so it's not surprising that other people have struggled with the same problems of data management and accuracy that you and I are facing today. In fact, an entire theory of database structure has developed, called Normalization Theory, with theorems, philosophies and rules of thumb that have been thoroughly tested and refined in the real world. A firm understanding of normalization will give you a leg up in your tasks. Fortunately, despite the arcana of normalization, it can be expressed in a few simple rules that suit the needs of most of us. I'll use my plumbing shop as an example.

One fact, one place: eliminate redundant data. (2NF)

An area I've been concentrating on lately is our invoicing system. An invoice contains all the information required for a single job. (see diagram 1) Every invoice has one and only one customer. A customer may have several invoices listed in my computer system. We can, of course, store the customer's name and address along with each invoice, but that wastes a lot of space. Storing the customer data more than once can also create problems when we want to update the customer's information. If the customer moves, or we discover that we've been spelling their name incorrectly, we'll have to update each and every one of their invoices and, if we miss one, the data is suddenly ambiguous. When information conflicts, which should we believe? Also, in that format, we have no way to store information about a customer before we've done a job for them and typed up the invoice. That's annoying. I often want to enter data about a new or potential customer well before that first invoice is created.

Storing the customer information in each INVOICE creates redundant data, so the INVOICE file is not in what the database gurus call 2nd Normal Form (2NF). The solution is to pull the customer information out into a separate table, CUSTOMER, and to include a field in INVOICE that allows the computer to look up the CUSTOMER information whenever it's needed. (see diagram 2) Once we do this, a customer's information is stored in only one place. We've drastically reduced the space needed on the hard drive, and we won't have problems with conflicting information (since we only have one copy). We can also store data about a customer before we've actually done any work for them. Another advantage is that we can now delete all the invoices for a customer without deleting the knowledge that this particular customer exists. We add one additional table and so many problems are solved. It's sort of magical!

Eliminate repeating groups. (1NF)

My invoice has space for several phone numbers. A naive data modeler (such as I was 10 years ago) will attempt to replicate this structure in the database. Originally I included 3 phone numbers as attributes in my CUSTOMER table.: HomePh, WorkPh, and MessagePh. This worked OK in the majority of cases, although it did waste quite a bit of space storing empty fields for those customers that had no work phone or message phone. I even had some customers who didn't have a phone at all! And, of course, there was always that troubling 20% of cases where a customer would have more than one home phone number, or had a fax or mobile number, or just plain had more than 3 phone numbers! This wasn't working very well.

My phone number attributes were a repeating group, so the CUSTOMER table was not in 1st Normal Form (1NF). The repeating group wasted space and made it difficult to search for a particular phone number. If somebody handed me a phone number and wanted to know if it already existed in my database, I had to either guess what category it was listed under (home, work, or message) or I had to triple my work and search the HomePh, WorkPh and MessagePh fields. As I learned more about databases and normalization theory, I realized that the solution was to pull the phone number attributes out of CUSTOMER, to create a separate PHONE table. (see diagram 3) Now a CUSTOMER can have as many phone numbers as they have phones, and no space is wasted if the CUSTOMER doesn't have a phone.

Put logically separated data into separate tables. (4NF)

Once we pull CUSTOMER out of INVOICE, we are left with the data in diagram 3. Take another look at INVOICE. There are several gobs of data that aren't particularly related to each other. Who did the job and how long they spent on it really doesn't have a whole lot to do with what parts they used or how many payments the CUSTOMER took to finish paying for the job. INVOICE isn't in 4th Normal Form (4NF).

An INVOICE will always have at least one plumber listed. However, it's also possible for an INVOICE to have many plumbers listed because, for a complicated job, several people may be needed to complete it. So, how many identical fields should I include for plumber information? 1? 5? 10? I'm either going to be wasting a lot of space or I'm going to be artificially limiting the number of plumbers that are allowed to work on a job - probably both. And I can just see my boss's face when I tell him that only 5 people are allowed to work on one job! Obviously, as soon as I decide to include multiple fields to list any extra plumbers, INVOICE is no longer in 1st Normal Form (1NF).

As we already know from the PHONE number problem described above, the solution is to pull the plumber information out into a separate table, named LABOR. Likewise, part and payment data can be pulled out into separate tables - PART and PAYMENT. (see diagram 4) Each of these tables is now much more cohesive than the original INVOICE table was, and INVOICE no longer needs the repeating groups to hold information about plumbers, parts used, or payments made.

Eliminate data not dependent on the key: (3NF)

As I was preparing the example for this article, I got this far and then paused. I really liked this structure. However, as I looked at diagram 4, I started to get more and more uneasy about PART. Obviously, the PartDescription should be the same for every instance of a particular PartCode, which means we'll always know what the PartDescription is just by looking at the PartCode. As a result, we're wasting space by storing a redundant field (PartDescription) and, if we're really unlucky, we can end up with ambiguous information. If something happens and PartCode and PartDescription disagree, which one should we believe? The PART table is not in 3rd Normal Form (3NF) because the PartDescription field is dependent solely on the PartCode field and not on PART's entire primary key.

The solution, as you've probably guessed by now, is to pull the PartCode and PartDescription out into a separate table. (see diagram 5) Now we only need to store PartDescription once, instead of every time we sell that part. And now SoldPart is in 3NF and so is the rest of the system.

Theory again:

Normalization is the process of analyzing and structuring the data in the problem domain so that it is as stable, flexible, and consistent as possible. Some basic rules of thumb you should remember are:

A) Each table must provide a totally unambiguous way to identify an instance of itself, we call this identifier the Primary Key (PK). It may consist of one field, or several, but it must be unique within that table. This PK can be included as a Foreign Key (FK) in any table that needs to reference an instance of this table.

B) One fact, one place. - A fact should be stored in only one place, primary and foreign keys can be used to provide access to this fact whenever necessary.

C) The key, the whole key, and nothing but the key! - Every field in a table should depend on the primary key, the entire primary key (if it consists of multiple fields) and nothing but the primary key.

D) Business rules RULE! - The semantics of the data is more important than, and is reflected in, the structure. It is entirely possible for a data structure to be perfectly correct in one situation and incorrect in another situation, if the businesses differ in how they understand and use their data.

The Normal Forms

There are 5 Normal Forms (and a few other, more obscure, forms that some authorities feel are useful.) Each normal form removes another type of ambiguity and redundancy. (I've talked about the 1st through 4th Normal Forms in this article.) For all practical purposes, it is usually sufficient to reduce a system to 3rd or 4th NF, and indeed, most systems that are in 3rd NF are also in 5th NF. The Normal Forms are:

1st NF - Remove repeating groups
2nd NF - Remove redundant data
3rd NF - Remove attributes not dependent on the key
4th NF - Remove independent, multi-valued attributes
5th NF - Remove semantically related multiple relationships
(See the references for a further discussion of 5th NF.)

Conclusion:

Normalization is often discussed in formal, step-by-step terms, but don't be intimidated! In real life it usually occurs fairly intuitively, and holistically. I have never (outside of books) been required to take a system step by step from 1NF to 2NF to 3NF. I use formal normalization theory to verify that a proposed structure is in 4NF, but I usually develop the structure itself through many other, less formal, processes such as Entity Relationship Diagrams, Rapid Application Design, Object Oriented Analysis and Design, etc. Regardless of how I reach the final structure, I always take time to make sure it is in 4NF before I start cutting code. It's the best way I know of to protect my data and to make sure I always have accurate answers when those phones and radios start humming! Now, where'd I put that hot chocolate?

IDEF1X notation sidebar:

Each box is a table.
The fields listed above the line constitute the primary key for the table.
Each line is a relationship between tables, the dot goes on the 'many' side of the relationship. For instance:

One CUSTOMER may have none to many INVOICES.
One INVOICE must have one and only one CUSTOMER.