
10 Most Critical Database Mistakes Part 3

Disregarding Normalization

Critical Database Design Mistake #3

Database normalization is the process used to organize a database into tables and columns. Normalization means that each table represents one and only one thing, and that each column in that table is a property of that one thing. There are valid reasons for this design:

  • Reduce duplicate data by storing it in one place;
  • Prevent diverging values due to partial updates;
  • Provide a more object-oriented schema;
  • Simplify queries.

Normalization is the foundation of Relational Database Management Systems (RDBMS). Normalization means understanding and applying basic data design principles. Third Normal Form (3NF) rules were devised to ensure optimal design of relational databases. There are rules for First NF, Second NF, Third NF and beyond, and each rule is cumulative: a table in 2NF also satisfies the 1NF rule. For this reason, 3NF means not only that the 3NF rule was applied; the term is often used interchangeably with normalization itself.

So why do many solutions not use 3NF? For starters, many databases are built by developers. Do they not know 3NF, or can they not understand the admittedly obtuse language used to describe it? What follows is a simplified explanation, along with common problems I have seen and some solutions.

First Normal Form (1NF)

A table is in First Normal Form if every column in every row contains only values from a single domain (subject). Moreover, each column in each row contains only a single atomic value from that domain. So what are the typical mistakes made?

Repeating Data Elements

One short-sighted practice is providing two or three telephone number columns on each row of a table, in the belief that this will be all that is ever needed. Now if the users want to search or select a value, the query has to look in multiple columns or else it will miss data. A classic example of repeating data elements is Phone Number, as shown in the figure above.

Another example that I have not seen for a while is storing monthly amounts as named columns, e.g. Jan_Amount, Feb_Amount, Mar_Amount. You have to use UNION to select and combine the data, which degrades SQL performance. This is also a poor design because you cannot select data by a date range or aggregate data for a user-selected date range.
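The month-per-column shape is easy to sketch with Python's sqlite3 module (the table and column names here are hypothetical, not from any real system): reading the wide table back needs one UNION branch per month column, while the normalized shape answers a date-range question with a plain WHERE clause.

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# 1NF violation: one named column per month (hypothetical names)
cur.execute("CREATE TABLE budget_wide (dept TEXT, jan_amount REAL, feb_amount REAL, mar_amount REAL)")
cur.execute("INSERT INTO budget_wide VALUES ('Sales', 100, 110, 120)")

# Unpivoting requires a UNION branch for every month column
rows = cur.execute("""
    SELECT dept, '2024-01' AS month, jan_amount AS amount FROM budget_wide
    UNION ALL SELECT dept, '2024-02', feb_amount FROM budget_wide
    UNION ALL SELECT dept, '2024-03', mar_amount FROM budget_wide
""").fetchall()

# Normalized: one row per month, so a date range is a simple predicate
cur.execute("CREATE TABLE budget (dept TEXT, month TEXT, amount REAL, PRIMARY KEY (dept, month))")
cur.executemany("INSERT INTO budget VALUES (?, ?, ?)", rows)
total = cur.execute(
    "SELECT SUM(amount) FROM budget WHERE month BETWEEN '2024-01' AND '2024-02'"
).fetchone()[0]
print(total)  # 210.0
```

Adding a fourth month to the wide table means a schema change and a new UNION branch; in the normalized table it is just another row.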

Multiple Values in a Single Column

To avoid the repeating-data-element problem, some designers provide only one column in the person or company Contact table. Never underestimate humans! They will store multiple phone numbers in one column in one of two formats:

  • “613-555-1212; 613-555-1213; 613-555-1214” — less likely as this usually won’t fit into the column length;
  • “613-555-1212, -1213, -1214” — more likely due to shorter length

Now try searching for all instances of phone number 555-1213. Imagine building an Enterprise Data Warehouse (EDW) to provide a directory or cross-reference by phone number. Each partial number has to be expanded with the common area code and subscriber number prefix and stored in a separate row. That costs your company time and money.
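A quick sqlite3 sketch (hypothetical table and column names) shows why the packed column defeats searching: the abbreviated numbers never match a full-number predicate, while one atomic value per row matches exactly.

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# Multiple values packed into one column, abbreviated to fit the length
cur.execute("CREATE TABLE contacts (name TEXT, phone TEXT)")
cur.execute("INSERT INTO contacts VALUES ('Acme', '613-555-1212, -1213, -1214')")
packed_hits = cur.execute(
    "SELECT COUNT(*) FROM contacts WHERE phone LIKE '%613-555-1213%'"
).fetchone()[0]
print(packed_hits)  # 0 -- the second number is stored only as '-1213'

# One atomic value per row: the same search now works
cur.execute("CREATE TABLE contact_phones (name TEXT, phone TEXT)")
cur.executemany("INSERT INTO contact_phones VALUES ('Acme', ?)",
                [('613-555-1212',), ('613-555-1213',), ('613-555-1214',)])
normalized_hits = cur.execute(
    "SELECT COUNT(*) FROM contact_phones WHERE phone = '613-555-1213'"
).fetchone()[0]
print(normalized_hits)  # 1
```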

Note: In the examples below and in my models I do not use the Party concept to represent persons and organizations. I use the Entities concept because this hub table can store any object of interest such as Persons, Companies, Government Organizations, Political Parties, Sites, (Job or Information) Requests, Permits, or anything else that you want to relate to each other.

Junction Table Solution

One typical solution I have seen is to join Entities in a many-to-many relationship with Phones, where phone numbers are stored under a surrogate auto-number id, unlike in the figure above. But there is nothing to stop the same phone number being added thousands of times. I have seen companies add a row multiple times with a null phone number!

Do you really need to ensure that phone numbers are never duplicated and are always reused? Is a table of unique phone numbers across the enterprise really necessary? Not usually. I believe that is overkill.

I have also seen many database designs that put a junction table between every two tables being linked, using a many-to-many solution for what is really a one-to-many relationship. This means roughly a 33% increase in the number of tables, as well as extra joins in SQL, which decreases performance.

Dependent Table Solution

Preventing Duplicate Phone Numbers

What you want is to ensure that each phone number is unique within each Entity. So what is the best solution? Make the Phones table dependent on Entities: the Phones table inherits the primary key (Entity Id) of the Entities table. Note that the Entity does not necessarily own that phone; people do not usually own their work phone, they merely use it. If you are tracking history, there could be many phone rows.

So you ask, how does using a dependent relationship avoid duplicates? By itself it cannot. But if we replace the surrogate auto-number key with the cleansed full phone number, you can never have a duplicate phone number. This solution requires no application logic, nor can it be bypassed by an expert user, developer, or application. It works because the Phone Number column is short. It does not work as well with Email Address, as I allow for the maximum length (the standard is 320 characters). It does not work at all for addresses, which consist of multiple component fields; a composite Address column would perform poorly as an index.
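Here is a minimal sqlite3 sketch of the dependent-table design (table and column names are assumed for illustration): the Phones table inherits the entity id, and the composite primary key of (entity id, cleansed number) makes a duplicate physically impossible, with no application logic at all.

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

cur.execute("CREATE TABLE entities (entity_id INTEGER PRIMARY KEY, name TEXT)")
# Dependent table: the cleansed full number replaces a surrogate auto-number key
cur.execute("""
    CREATE TABLE phones (
        entity_id    INTEGER NOT NULL REFERENCES entities,
        phone_number TEXT    NOT NULL,
        PRIMARY KEY (entity_id, phone_number)
    )
""")
cur.execute("INSERT INTO entities VALUES (1, 'Acme')")
cur.execute("INSERT INTO phones VALUES (1, '613-555-1212')")
try:
    cur.execute("INSERT INTO phones VALUES (1, '613-555-1212')")  # same number, same entity
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)  # False -- the key itself rejects the duplicate
```

Note that the same number can still appear under a different entity, which is exactly the uniqueness-within-entity rule described above; the NOT NULL constraint also rules out the null-phone rows mentioned earlier.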

Entities 3NF Using Identifying Relationships

Now you can see why I call the Entities table a hub or directory of all objects of interest, with satellite tables for Phones, Addresses, Emails, et al.

Embedded Data

A column must contain only one thing (subject). If you think this is obvious, you will be surprised how many times a column contains embedded information in the real world. For instance, there is an international livestock traceability system to track serious food safety issues like Mad Cow Disease. Every animal is given an ear tag with an id like “PAAB4574”.

The point here is that the id is not a simple sequential number. It consists of three embedded fields: state, species type, and a four-digit number. There is nothing wrong with that: the problem is in the database, not the ear tag number. The id is stored as one column, which means the State and Species Type validity cannot be controlled by a reference table. For some reason, many organizations think that only one field can be the primary key. Wrong! Defining a three-element composite primary key allows the State and Species to be controlled using reference code tables and referential integrity. This primary key is not only efficient but also allows query selection by State and/or Species, and it requires no change to the ear tags.
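A sketch of the composite-key design in sqlite3 follows. The split of “PAAB4574” into 'PA' + 'AB' + '4574' is my assumption for illustration, as are the table and column names; the point is the three-part key validated by reference tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces REFERENCES with this pragma
cur = conn.cursor()

cur.execute("CREATE TABLE states  (state_code   TEXT PRIMARY KEY)")
cur.execute("CREATE TABLE species (species_code TEXT PRIMARY KEY)")
cur.execute("""
    CREATE TABLE ear_tags (
        state_code   TEXT NOT NULL REFERENCES states,
        species_code TEXT NOT NULL REFERENCES species,
        tag_number   TEXT NOT NULL,
        PRIMARY KEY (state_code, species_code, tag_number)
    )
""")
cur.execute("INSERT INTO states VALUES ('PA')")
cur.execute("INSERT INTO species VALUES ('AB')")
cur.execute("INSERT INTO ear_tags VALUES ('PA', 'AB', '4574')")  # printed tag stays 'PAAB4574'
try:
    cur.execute("INSERT INTO ear_tags VALUES ('XX', 'AB', '9999')")  # unknown state code
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True -- referential integrity catches the bad state

# And the composite key supports selection by component
pa_tags = cur.execute("SELECT COUNT(*) FROM ear_tags WHERE state_code = 'PA'").fetchone()[0]
```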

Over-written Columns

The original purpose of a column must not be overwritten partway through its life for some other usage; create a new column! We see this a lot in Excel spreadsheets, as users do not understand that a column should represent one subject. But why is this being done in databases? It is common even in SAP ERP, although SAP provides many user-configurable columns that could be used for a new purpose.

Business users seem reluctant to ask for a new column. As long as you are not changing the primary key, a new column can be appended to a table with minimal effort or impact. Perhaps some people think it is a big problem and/or are worried about crashing the system.

Second Normal Form (2NF)

A table is in Second Normal Form if it satisfies First Normal Form and every non-key attribute is fully dependent on the whole (possibly composite) primary key. Imagine an Address table above that records addresses for an entity but also records an email address. What if the address is deleted or expires? You have just lost the email address for the Entity. If you really want to capture only one email address, that field has to be placed on the Entities table. But the better approach is to create an email address table that contains email addresses no matter which entity they concern. This allows one query against one table to find a specific email address.
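A minimal sqlite3 sketch of that scenario (hypothetical names throughout): with the email riding on the address row it disappears along with the address, while the decomposed design keeps it independent.

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# 2NF problem: the email rides along on the address row
cur.execute("CREATE TABLE addresses (entity_id INTEGER, street TEXT, email TEXT)")
cur.execute("INSERT INTO addresses VALUES (1, '10 Main St', 'info@acme.example')")
cur.execute("DELETE FROM addresses WHERE entity_id = 1")   # the address expires...
remaining = cur.execute("SELECT COUNT(*) FROM addresses").fetchone()[0]
print(remaining)  # 0 -- ...and the entity's email went with it

# Decomposed: emails live in their own table, independent of addresses
cur.execute("""
    CREATE TABLE emails (
        entity_id     INTEGER NOT NULL,
        email_address TEXT    NOT NULL,
        PRIMARY KEY (entity_id, email_address)
    )
""")
cur.execute("INSERT INTO emails VALUES (1, 'info@acme.example')")
saved_emails = cur.execute("SELECT COUNT(*) FROM emails").fetchone()[0]
print(saved_emails)  # 1
```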

This is where decomposition is used to partition the data into logical sets of entities. Is this common? Well, I recently saw a legacy system that created a table for every data entry form and report. That is not architecture; it is a symptom of the Just Build It approach by developers with little data design experience.

Multiple Columns Could Be the Primary Key

A table can have one and only one primary key, but that key may comprise multiple columns. I cannot count how many times I have seen a natural character key ignored and an additional unique integer key added as the primary key. Technically, some may argue this does not contravene 2NF, as that would apply to all surrogate-key situations. But to me the Country Name is dependent on the ISO Country Code; that is the ISO standard.

There is some belief that integer keys are "faster". They are not. I worked with a group of DBA/ETL developers who believed integers were faster, and I had to design a test with hundreds of thousands of iterations to prove it to them. There was only a nanosecond difference in performance. Key performance is related more to key length than to data type. It is true that fixed-length character performs better than varying character, so a short character field beats a 200+ varying character description field.
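That claim is easy to check for yourself. Here is a rough micro-benchmark sketch using sqlite3 (table names, sizes, and iteration counts are arbitrary choices of mine) comparing primary-key lookups on an integer key versus a short fixed-width character code.

```python
import sqlite3
import timeit

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE by_int  (id   INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE by_code (code TEXT    PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO by_int  VALUES (?, ?)", ((i, f"row {i}") for i in range(100_000)))
cur.executemany("INSERT INTO by_code VALUES (?, ?)", ((f"K{i:06d}", f"row {i}") for i in range(100_000)))

# Time repeated point lookups against each key type
n = 20_000
t_int = timeit.timeit(
    lambda: cur.execute("SELECT name FROM by_int WHERE id = 54321").fetchone(), number=n)
t_code = timeit.timeit(
    lambda: cur.execute("SELECT name FROM by_code WHERE code = 'K054321'").fetchone(), number=n)
print(f"integer key: {t_int:.3f}s  short char key: {t_code:.3f}s")
```

On a typical run the two timings are within a few percent of each other, which is the point: a short character key costs essentially nothing.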

I have to disagree with Ralph Kimball for encouraging this behaviour in his data mart designs. Yes, if you have operational data tables derived from multiple data sources, it makes sense to avoid using any source primary key as the DW key. I always use surrogate keys for operational data tables.

However, when it comes to reference code tables, one should use the mnemonic or natural key, based ideally on international standards. I cannot tell you how many times I have seen companies add an eight-byte integer PK to a table that already has a one- to four-character code. Moreover, the ISO, ITU, or IATA code IS the standard value that must be exchanged with clients; therefore it must be the primary key.

For reference tables that originate from disparate source systems within your organization, pick the mnemonic codes that most closely represent the desired future standard. What happens if the legacy systems change their code values? Nothing! No one says you have to follow their lead. Just be sure you handle the changed values in your ETL.

Third Normal Form (3NF)

A table is in Third Normal Form if no non-key column is transitively dependent on the primary key. Bill Kent condensed this obtuse statement to: every non-key attribute “must provide a fact about the key, the whole key, and nothing but the key.”

One thing I have seen is a table that not only has a code value but also includes a description of that value. For instance, the Purchase Orders table above includes the ISO Currency Code as well as the Currency Name. But Currency Name is not dependent on PO Number; it is dependent on ISO Currency Code. This is where decomposition should be used to create a separate Currency Codes table storing the Currency Code, Currency Name, and other properties of the currency.
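The decomposition above can be sketched in sqlite3 (table and column names are illustrative): the Currency Name is stored once, keyed by the ISO code, and the Purchase Orders table carries only the code.

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# Transitive dependency removed: currency_name lives once, keyed by the ISO code
cur.execute("CREATE TABLE currency_codes (currency_code TEXT PRIMARY KEY, currency_name TEXT)")
cur.execute("""
    CREATE TABLE purchase_orders (
        po_number     TEXT PRIMARY KEY,
        currency_code TEXT NOT NULL REFERENCES currency_codes
    )
""")
cur.execute("INSERT INTO currency_codes VALUES ('CAD', 'Canadian Dollar')")
cur.execute("INSERT INTO purchase_orders VALUES ('PO-1001', 'CAD')")

# The description is recovered with a join, never duplicated per order
name = cur.execute("""
    SELECT c.currency_name
    FROM purchase_orders p
    JOIN currency_codes c ON c.currency_code = p.currency_code
    WHERE p.po_number = 'PO-1001'
""").fetchone()[0]
print(name)  # Canadian Dollar
```

If the currency's display name ever changes, it changes in exactly one row instead of on every purchase order.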

Related Posts in this Series

Top 10 Database Design Issues Part 1 — Assigning Atrocious Element Names

Top 10 Database Design Issues Part 2 — Missing Data Documentation



This post first appeared on Terra Encounters.
