There are a number of means by which a simple relational database can be built. Some of those processes are particularly complex, but ideal for formal situations where the designer requires both a rigourous and efficient database designed to handle large numbers of records well. However, sometimes designers can use simpler processes which produce a good database that will handle their needs, but without necessarily focusing on the complexities of relational algebra and the more esoteric forms of database normalization. Thus this lesson is focused on creating quick, simple and effective databases which meet the needs of the end users, but without getting too caught up in database theory.
Note that this lesson presumes that the reader understands the material covered in Introduction to Relational Databases.
Where to start
Curiously, when people look at where to start with database design, the answer is to look at the end of the process, not the beginning. There is a tendency to focus on what data can be collected, but focusing on this can lead to some problems:
- It is easy for designers to miss data that they did not expect to need, but that suddenly becomes important at the end of the process.
- In some jurisdictions, there are legal restrictions on what data can be collected. In general, all personal data collected in a database should be required, rather than just collected "because you can".
- From a user's perspective, database systems which ask for a minimal amount of data are generally easier and quicker to use than those which ask for a lot.
- The less data you collect, the less room that there is for errors.
Thus the first thing a good database designer does is look to the reports: what does the database have to output at the end? Because if the purpose of a database is to provide access to information, the driving consideration should be to ask what information the end user needs to access, so that the designer can make sure that it is allowed for.
An organization, let's call them U Walk Tours, runs short walking tours around a city. They have five current tours on their books:
- a Ghost Tour, run at night taking tourists and locals around the various haunted establishments in the area
- Historic Buildings, which tours and talks about the architecture of the region
- Chocolate Hunt, in which people are taken to the various chocolate sellers and cafes in the local area
- Gourmet Market Tour, where interested parties are taken through the local markets and shown where to buy the best food
- Art in the City, showing the various public art sites that are on display, and visiting some of the more noteworthy galleries
In the head office, the people running the show need to be able to find out who has been assigned to each tour, when the next tours are on, how many people have booked for the tour, how many people can book for the tour (especially with "Art in the City", as there is a limit to how many people can be taken through the galleries), if they have paid, and an emergency contact in case something goes wrong. For each guide they need to know the guide's address, home and mobile phone numbers, and an emergency contact. They would also like to have a postal address for each client who takes a tour so they can send special offers in the mail, but because they are polite, they only send offers to those who have opted in, so they need to know if each client has opted to receive them. For the same reason they would like an email address for each client – they have found that offers in the mail are better received, but they would like the option of reaching the client with email in the future.
Before each tour, the tour guide needs to be given a list of everyone who has booked for that tour. The list needs to include the person's name, age, and mobile phone number. The tour guide will not collect payments, so the list will only include people who have paid for the tour. (It is a rough city, and carrying around lots of money may not be wise).
From this, it can be assumed that for each tour the database will need to keep track of:
- Tour name
- Tour date
- Tour time
- Maximum people who can go on the tour
- Number of people booked for the tour
- Name of the tour guide
- The tour guide's contact details
- Home phone
- Mobile phone
- Emergency contact details for the tour guide
- Contact phone number
- Name of each person booked for the tour
- Address of each person booked for the tour
- Mobile phone number for each person who has booked
- Emergency contact details for each person who has booked
- Contact phone number
- Age of each person
- Email for each person
- Whether or not the person has opted to receive special offers
- Whether or not the person has paid
It is a lot of data when written down like that, but such is the way of databases.
Not all data needs to be stored. Some data can be derived from other data. This is generally a good thing – every time you store data you increase the risk of error, so calculating it on demand tends to be safer (on the grounds that if your method of calculating it is sound, it should always work).
There are a couple of fields in the example above which might be better changed to be derived rather than stored. In particular, there is the number of people booked for a tour. There are two ways you could get that data:
- Update it every time someone is added or removed from the tour
- Count the number of bookings
The first of these works, but what if there is ever a mistake? For example, what would happen if 6 was stored in the field, but 8 people had booked? If that was the case it might be that more people would show up for the tour than were supposed to, causing problems. Or worse yet, what if it said 0 when 7 people had booked? The guide wouldn't bother showing up, and the company would have seven angry people demanding their money back.
The second option might seem annoying, because it would be bothersome to have to count all the bookings for each trip, but there is a trick: the computer (or, to be more exact, the database management system) can do the counting for you. Computers are very good at counting. Thus, generally speaking, it would be better to go with the second option over the first.
The second field that may be better derived is the client's age. At the moment that is stored, and there is no other field that could provide it. However, age is something that is difficult to keep accurate. For example, what if someone booked six months in advance, and had a birthday in between? Or what if six years later they came back for another tour, but their age was still shown as it was six years ago? The solution is not to store their age, but instead to store their date of birth, and use that to derive their age on demand.
First normal form
That sounds cool, but the focus here is on practical application. So, using the data fields listed above, is there anything that is repeating? Do any fields represent more than one thing? The answer is yes: there are a number of fields relating to the people who have booked for the tour, and as there is (hopefully) more than one person, these represent a "repeating group". From a practical perspective, it makes sense to pull these out: the database can't represent multiple people in a single field, so either the database needs to have "person 1", "person 2" and so on, or the clients need their own table.
As a simple example of not meeting 1NF, a simple version might be:
|Tour Name||Tour Date||Client 1 Name||Client 1 Email||Client 2 Name||Client 2 Email||Client 3 Name||Client 3 Email|
|Ghost Tour||25 June 2011||Sam||sam@internet||John||Alex||alex@internet|
Given that the database also needs to include address details, phone numbers, if they have paid, and a variety of extra fields for each person, that will rapidly become a very ugly and unmanageable database. The alternative is just as bad:
|Tour Name||Tour Date||Names||Emails|
|Ghost Tour||25 June 2011||Sam; Alex; John||sam@internet; alex@internet|
How do we know which email address goes with which person? Again, this fails 1NF, and would be almost unmanageable.
Thus the data should be broken up into at least two tables, with repeating fields pulled out of the main one:
Each record in Clients will represent one, and only one, person, with the Tour only having fields that contain a single value. This is not enough, of course – relations need to be added, primary keys identified, and the database further refined, but it is a good start.
Identifying primary keys
There are formal approaches to identifying candidate and primary keys for relational databases. If interested, it is worth reading the candidate key article on Wikipedia, as it gives an overview of the process. However, it is also viable to use simpler rules of thumb when working at this scale.
A primary key has to be a unique value which can be used to identify an individual instance of a record. Thus in the above model, a primary key is needed for the Tour and Client tables, so that each unique record can be referred to. For example, if just "name" was used to distinguish between different clients in the Client table, there would be problems if two people with the same name were to register for tours (and this is a genuine possibility, so it is something to be concerned about). Names just aren't unique enough.
So the first question, using the Client table as an example, is whether or not any single field is sufficient to distinguish between different clients. The answer is no. More than one person can be born on the same day, or have the same name, or live at the same address. Mobile numbers might generally be unique to a person, but not everyone has one, and email is in the same boat. So in the end, there is no single field which is sufficient.
Thus the next question is if there is a combination of fields that could be sufficient when combined. For example, name and date of birth? Once again, the answer is (mostly) no. The odds of two people with the same name, both born on the same day, registering for tours in with the same company, are admittedly incredibly small. But it might happen.
Thus the easiest solution there is to create a primary key. Sometimes it is the best way forward, and, after all it is what most institutions do for clients. Thus a new field needs to be added to Clients: ID. To make things easier, the ID is typically numeric and sequential. Thus the first client is number 1, the second client is number 2, the third client is number 3, and so on. If a client is ever removed, though, it doesn't matter - at worst you have a missing number in the sequence, and as the number only matters at the database level, the fact that it isn't a perfect sequence does no harm.
The Tours table, on the other hand, is a tad different. There is still no one field that would do (there can be more than one tour run with the same name, more than one tour run on the same day, and more than one tour run at the same time). However, we can safely assume that one person will not run more than one tour at the same time and date. If they could that person wouldn't be a tour guide so much as an ongoing physics experiment. So for now, the proposal is to have a combined key consisting of four fields: name of tour guide, tour name, tour date and tour time. The combination of four fields creates a unique key which clearly distinguishes any tour from all others.
Don't enter data twice
There is a bit of a theme to database design – do everything you can to limit the risk of errors. Errors are bad. Unfortunately, there are some errors which are difficult to prevent, and the worst of those are data entry errors. There is not a lot to be done to prevent the end user making a mistake when typing in data. But it is possible to reduce the risk.
One method of reducing the risk is to lessen the number of times that they enter data. If we can't stop people from making mistakes, maybe we can reduce the opportunities that they have to do so. Thus looking at the above tables, the question is: what fields will have to be entered often, but with the same values each time? In the clients table the main issue is the address, but the risk is small. If a number of people from the same address all register (and this is likely to happen, especially if you consider the odds that a family will all sign up for a tour), that address will have to be entered more than once. This is a risk, but, on the plus side, it is not a huge risk, so for now it will be ignored. Mostly because the Tours table is a much bigger problem.
Using the existing tables, every time a tour is run (noting that the same tour might even be run three or four times a day), it needs to be entered into the database. And to do so, the person entering the data must enter:
- The name of the tour
- The date and time
- The tour guide's name
- The tour guide's contact details
- The tour guide's emergency contact details
Focusing on the biggest problem, there is a big risk that something will go wrong entering all of those tour guide details. But what if you didn't have to do that every time? What if, instead of entering those details, the user simply referred to the details in another table? That would make things wonderfully simple in comparison.
Thus the first step is to pull out the tour guide's details and keep them separate.
(There is now an ID for the guide because there needs to be a primary key).
This helps in two different ways. The first, as mentioned, is that the end user only needs to enter the data once. The second is that memory usage is much reduced. For example, imagine that all that data took, say, 1kb of memory. (It won't, but this is just an example). If a given a tour guide runs 1000 tours, and the old model was used, that guide's data would have been entered 1000 times, using up 1000kb of memory. But now, with the new model, the data only has to be entered once, with only the ID needing to be entered 1000 times. The ID still takes up some memory, but nothing like what the full tour guide record uses.
The next option for data that might be better treated in a separate table is the tour name. As mentioned at the start, five different tours are run. At the moment, every time an individual tour is created, the name of the tour has to be entered. But what if someone makes an mistake? While it probably won't do too much damage, it will make reports difficult. For example, it might report that there were 300 people on the "Ghost tour", and 12 people on the "Gost tour".
Along with the tour name, there are another three fields (or combinations of fields) that could be similarly treated:
Addresses, emergency details.
Now that the tables exist, we need to show how the relationships between the tables work.
One of the difficulties faced in database design is that there are a lot of different standards for representing relationships. The diagram at the right shows six different methods for representing the same one-to-many relationship, and it is not an exhaustive list. In each case, though the same relationship is being represented: a given person is born at one, and only one location, but a given location may have had zero, one or more people born at it. To make things simple, we'll be using the "crows foot" notation, but any for those would be a viable option, although if interested, there is a good discussion of the different models at Entity-relationship model on Wikipedia.
Things to try
- Although, as will be demonstrated, some knowledge of normalization is valuable.
- There are exceptions to this rule. For example, a researcher typically does not know what data is going to be needed at the end – often they want to collect as much as possible, in the hope that they can detect patterns which will be of value in determining their findings. In such cases the focus is on collecting everything that can reasonably be collected (noting the existence of ethical concerns whenever data is collected, and that there is always a limit to how much can be gathered), rather than guessing at how it will be employed.
- One exception is when you have to count it often. Counting is slower than reading, so if the computer needs that value all the time, it might be more efficient to store it as a value. However, that is relatively unlikely in the sorts of databases being discussed here, and if you are building a database big enough to warrant that sort of speed optimisation, you probably need to do a more advanced course anyway.
- Date, C. J. "What First Normal Form Really Means" in Date on Database: Writings 2000–2006 (Springer-Verlag, 2006), pp. 127–128.
- Note that the two derived fields noted above have now been removed or changed, per the discussion.
- This only holds if the tour guide takes more than one tour: if each guide only ever runs one, and only one, tour, then more more will be used on the IDs than is saved. However, this is incredibly unlikely in this case.