It might sound like a no brainer, but a major goal of a data analyst is to spend more time analyzing data and less time retrieving and cleaning it. Embracing data normalization is a great way to ensure you spend your time analyzing data, not looking for it. There are three main objectives when normalizing a database:
- Limit the amount of duplicate data stored inside our database.
- Construct a database such that, when modifications are made to it, we only need to make the modification in one place.
- Construct the database with proper relationships among different data items.
This is why data normalization will save huge headaches and keep our joins from producing inaccurate and duplicate information.
Duplicate information is easy to overlook so ensuring the database is clean is a must for any company. Before we go any further, it’s important to understand structure query language (SQL). A SQL join would combine data from two sets of data (e.g. two tables) using values common to each. Take a look at the example below to better understand how duplicates can be avoided.
Let’s assume we have a table in our database that contains customers and the various customer pertinent information. We have a city, customer name, state, and zip. In the example below, we have two instances of customer1.
This customer happens to have two different locations; one in Happyville, Iowa and another in Happiness, Iowa. If we already know the customer’s name, why must we store it twice? The good news is we don’t have to. We will fix this problem by adding another table that contains the customer’s name and another that contains location information.
Now there is only one instance of our customer and there is only one place where we need to modify the customer’s name if we so choose. Our next step is to relate this customer to its locations. We do this is by giving our customer table a primary key and our locations table a foreign key that can relate to the customer table.
The SQL join in our query will look like this:
SELECT customers.customer_name, locations.city, locations.state, locations.zip
ON customers.primary_key = locations.customer_foreign_key
This will return our customer and their location information. What we just created was a one-to-many relationship. In this case, one customer has many locations.
Let’s look at what can happen if we choose to have two instances of a customer in our customer table. Suppose we have another table that holds all a customer’s products, and our customer table with two instances of a customer.
Let’s look at what our join produces when we try to join these two tables.
Hold on a minute. I thought customer1 had one product1 and one product2? This appears to have returned inaccurate data because of the way we are storing our customers. We’ll see what happens when we use our other customer table.
This looks much better. We have our customer’s name and the correct products for this customer.
By applying data normalization to the structure, we could quickly and accurately query the database for this customer’s products. This way your analysts can spend more time analyzing data and less time searching for it.