Relational Databases/Design

Introduction

There are a number of means by which a simple relational database can be built. Some of those processes are particularly complex, but ideal for formal situations where the designer requires both a rigourous and efficient database designed to handle large numbers of records well. However, sometimes designers can use simpler processes which produce a good database that will handle their needs, but without necessarily focusing on the complexities of relational algebra and the more esoteric forms of database normalization.^[1] Thus this lesson is focused on creating quick, simple and effective databases which meet the needs of the end users, but without getting too caught up in database theory.

Note that this lesson presumes that the reader understands the material covered in Introduction to Relational Databases.

Where to start

Curiously, when people look at where to start with database design, the answer is to look at the end of the process, not the beginning. There is a tendency to focus on what data can be collected, but focusing on this can lead to some problems:

It is easy for designers to miss data that they did not expect to need, but that suddenly becomes important at the end of the process.
In some jurisdictions, there are legal restrictions on what data can be collected. In general, all personal data collected in a database should be required, rather than just collected "because you can".
From a user's perspective, database systems which ask for a minimal amount of data are generally easier and quicker to use than those which ask for a lot.
The less data you collect, the less room that there is for errors.

Thus the first thing a good database designer does is look to the reports: what does the database have to output at the end? Because if the purpose of a database is to provide access to information, the driving consideration should be to ask what information the end user needs to access, so that the designer can make sure that it is allowed for.^[2]

Initial fields

An organization, let's call them U Walk Tours, runs short walking tours around a city. They have five current tours on their books:

a Ghost Tour, run at night taking tourists and locals around the various haunted establishments in the area
Historic Buildings, which tours and talks about the architecture of the region
Chocolate Hunt, in which people are taken to the various chocolate sellers and cafes in the local area
Gourmet Market Tour, where interested parties are taken through the local markets and shown where to buy the best food
Art in the City, showing the various public art sites that are on display, and visiting some of the more noteworthy galleries

In the head office, the people running the show need to be able to find out who has been assigned to each tour, when the next tours are on, how many people have booked for the tour, how many people can book for the tour (especially with "Art in the City", as there is a limit to how many people can be taken through the galleries), if they have paid, and an emergency contact in case something goes wrong. For each guide they need to know the guide's address, home and mobile phone numbers, and an emergency contact. They would also like to have a postal address for each client who takes a tour so they can send special offers in the mail, but because they are polite, they only send offers to those who have opted in, so they need to know if each client has opted to receive them. For the same reason they would like an email address for each client – they have found that offers in the mail are better received, but they would like the option of reaching the client with email in the future.

Before each tour, the tour guide needs to be given a list of everyone who has booked for that tour. The list needs to include the person's name, age, and mobile phone number. The tour guide will not collect payments, so the list will only include people who have paid for the tour. (It is a rough city, and carrying around lots of money may not be wise).

From this, it can be assumed that for each tour the database will need to keep track of:

Tour name
Tour date
Tour time
Maximum people who can go on the tour
Number of people booked for the tour
Name of the tour guide
The tour guide's contact details
- Street
- Suburb
- City
- Country
- Postcode
- Home phone
- Mobile phone
Emergency contact details for the tour guide
- Name
- Relationship
- Contact phone number
Name of each person booked for the tour
Address of each person booked for the tour
- Street
- Suburb
- City
- Country
- Postcode
Mobile phone number for each person who has booked
Emergency contact details for each person who has booked
- Name
- Relationship
- Contact phone number
Age of each person
Email for each person
Whether or not the person has opted to receive special offers
Whether or not the person has paid

It is a lot of data when written down like that, but such is the way of databases.

Derived fields

Not all data needs to be stored. Some data can be derived from other data. This is generally a good thing – every time you store data you increase the risk of error, so calculating it on demand tends to be safer (on the grounds that if your method of calculating it is sound, it should always work).

There are a couple of fields in the example above which might be better changed to be derived rather than stored. In particular, there is the number of people booked for a tour. There are two ways you could get that data:

Update it every time someone is added or removed from the tour
Count the number of bookings

The first of these works, but what if there is ever a mistake? For example, what would happen if 6 was stored in the field, but 8 people had booked? If that was the case it might be that more people would show up for the tour than were supposed to, causing problems. Or worse yet, what if it said 0 when 7 people had booked? The guide wouldn't bother showing up, and the company would have seven angry people demanding their money back.

The second option might seem annoying, because it would be bothersome to have to count all the bookings for each trip, but there is a trick: the computer (or, to be more exact, the database management system) can do the counting for you. Computers are very good at counting. Thus, generally speaking, it would be better to go with the second option over the first.^[3]

The second field that may be better derived is the client's age. At the moment that is stored, and there is no other field that could provide it. However, age is something that is difficult to keep accurate. For example, what if someone booked six months in advance, and had a birthday in between? Or what if six years later they came back for another tour, but their age was still shown as it was six years ago? The solution is not to store their age, but instead to store their date of birth, and use that to derive their age on demand.

First normal form

A database is said to be in First Normal Form (1NF) if it meets two basic criteria: it is an adequate representation of a relation, and it has no repeating groups.^[4]

That sounds cool, but the focus here is on practical application. So, using the data fields listed above, is there anything that is repeating? Do any fields represent more than one thing? The answer is yes: there are a number of fields relating to the people who have booked for the tour, and as there is (hopefully) more than one person, these represent a "repeating group". From a practical perspective, it makes sense to pull these out: the database can't represent multiple people in a single field, so either the database needs to have "person 1", "person 2" and so on, or the clients need their own table.

As a simple example of not meeting 1NF, a simple version might be:

Tour Name	Tour Date	Client 1 Name	Client 1 Email	Client 2 Name	Client 2 Email	Client 3 Name	Client 3 Email
Ghost Tour	25 June 2011	Sam	sam@internet	John		Alex	alex@internet

Given that the database also needs to include address details, phone numbers, if they have paid, and a variety of extra fields for each person, that will rapidly become a very ugly and unmanageable database. The alternative is just as bad:

Tour Name	Tour Date	Names	Emails
Ghost Tour	25 June 2011	Sam; Alex; John	sam@internet; alex@internet

How do we know which email address goes with which person? Again, this fails 1NF, and would be almost unmanageable.

Thus the data should be broken up into at least two tables, with repeating fields pulled out of the main one:^[5]

Tour	Client
Tour name Tour date Tour time Name of the tour guide The tour guide's contact details Street Suburb City Country Postcode Home phone Mobile phone Emergency contact details for the tour guide Name Relationship Contact phone number	Name Address Street Suburb City Country Postcode Mobile phone number Emergency contact details Name Relationship Contact phone number Date of birth Email Opted to receive special offers? Has paid?

Each record in Clients will represent one, and only one, person, with the Tour only having fields that contain a single value. This is not enough, of course – relations need to be added, primary keys identified, and the database further refined, but it is a good start.

Identifying primary keys

There are formal approaches to identifying candidate and primary keys for relational databases. If interested, it is worth reading the candidate key article on Wikipedia, as it gives an overview of the process. However, it is also viable to use simpler rules of thumb when working at this scale.

A primary key has to be a unique value which can be used to identify an individual instance of a record. Thus in the above model, a primary key is needed for the Tour and Client tables, so that each unique record can be referred to. For example, if just "name" was used to distinguish between different clients in the Client table, there would be problems if two people with the same name were to register for tours (and this is a genuine possibility, so it is something to be concerned about). Names just aren't unique enough.

So the first question, using the Client table as an example, is whether or not any single field is sufficient to distinguish between different clients. The answer is no. More than one person can be born on the same day, or have the same name, or live at the same address. Mobile numbers might generally be unique to a person, but not everyone has one, and email is in the same boat. So in the end, there is no single field which is sufficient.

Thus the next question is if there is a combination of fields that could be sufficient when combined. For example, name and date of birth? Once again, the answer is (mostly) no. The odds of two people with the same name, both born on the same day, registering for tours in with the same company, are admittedly incredibly small. But it might happen.

Thus the easiest solution there is to create a primary key. Sometimes it is the best way forward, and, after all it is what most institutions do for clients. Thus a new field needs to be added to Clients: ID. To make things easier, the ID is typically numeric and sequential. Thus the first client is number 1, the second client is number 2, the third client is number 3, and so on. If a client is ever removed, though, it doesn't matter - at worst you have a missing number in the sequence, and as the number only matters at the database level, the fact that it isn't a perfect sequence does no harm.

The Tours table, on the other hand, is a tad different. There is still no one field that would do (there can be more than one tour run with the same name, more than one tour run on the same day, and more than one tour run at the same time). However, we can safely assume that one person will not run more than one tour at the same time and date. If they could that person wouldn't be a tour guide so much as an ongoing physics experiment. So for now, the proposal is to have a combined key consisting of four fields: name of tour guide, tour name, tour date and tour time. The combination of four fields creates a unique key which clearly distinguishes any tour from all others.

Don't enter data twice

This next step involves a combination of second normal form and third normal form, but is couched in different terms.

There is a bit of a theme to database design – do everything you can to limit the risk of errors. Errors are bad. Unfortunately, there are some errors which are difficult to prevent, and the worst of those are data entry errors. There is not a lot to be done to prevent the end user making a mistake when typing in data. But it is possible to reduce the risk.

One method of reducing the risk is to lessen the number of times that they enter data. If we can't stop people from making mistakes, maybe we can reduce the opportunities that they have to do so. Thus looking at the above tables, the question is: what fields will have to be entered often, but with the same values each time? In the clients table the main issue is the address, but the risk is small. If a number of people from the same address all register (and this is likely to happen, especially if you consider the odds that a family will all sign up for a tour), that address will have to be entered more than once. This is a risk, but, on the plus side, it is not a huge risk, so for now it will be ignored. Mostly because the Tours table is a much bigger problem.

Using the existing tables, every time a tour is run (noting that the same tour might even be run three or four times a day), it needs to be entered into the database. And to do so, the person entering the data must enter:

The name of the tour
The date and time
The tour guide's name
The tour guide's contact details
The tour guide's emergency contact details

Focusing on the biggest problem, there is a big risk that something will go wrong entering all of those tour guide details. But what if you didn't have to do that every time? What if, instead of entering those details, the user simply referred to the details in another table? That would make things wonderfully simple in comparison.

Thus the first step is to pull out the tour guide's details and keep them separate.

Tours	Guides	Clients
Tour name Tour date Tour time	ID Name Address Street Suburb City Country Postcode Home phone Mobile phone Emergency contact details Name Relationship Contact phone number	ID Name Address Street Suburb City Country Postcode Mobile phone number Emergency contact details Name Relationship Contact phone number Date of birth Email Opted to receive special offers? Has paid?

(There is now an ID for the guide because there needs to be a primary key).

This helps in two different ways. The first, as mentioned, is that the end user only needs to enter the data once. The second is that memory usage is much reduced. For example, imagine that all that data took, say, 1kb of memory. (It won't, but this is just an example). If a given a tour guide runs 1000 tours, and the old model was used, that guide's data would have been entered 1000 times, using up 1000kb of memory. But now, with the new model, the data only has to be entered once, with only the ID needing to be entered 1000 times. The ID still takes up some memory, but nothing like what the full tour guide record uses.^[6]

The next option for data that might be better treated in a separate table is the tour name. As mentioned at the start, five different tours are run. At the moment, every time an individual tour is created, the name of the tour has to be entered. But what if someone makes an mistake? While it probably won't do too much damage, it will make reports difficult. For example, it might report that there were 300 people on the "Ghost tour", and 12 people on the "Gost tour".

Along with the tour name, there are another three fields (or combinations of fields) that could be similarly treated:

Country
Suburb/postcode
Relationship

Addresses, emergency details.

Relations

Now that the tables exist, we need to show how the relationships between the tables work.

Representing relationships

One of the difficulties faced in database design is that there are a lot of different standards for representing relationships. The diagram at the right shows six different methods for representing the same one-to-many relationship, and it is not an exhaustive list. In each case, though the same relationship is being represented: a given person is born at one, and only one location, but a given location may have had zero, one or more people born at it. To make things simple, we'll be using the "crows foot" notation, but any for those would be a viable option, although if interested, there is a good discussion of the different models at Entity-relationship model on Wikipedia.

Things to try

References

↑ Although, as will be demonstrated, some knowledge of normalization is valuable.
↑ There are exceptions to this rule. For example, a researcher typically does not know what data is going to be needed at the end – often they want to collect as much as possible, in the hope that they can detect patterns which will be of value in determining their findings. In such cases the focus is on collecting everything that can reasonably be collected (noting the existence of ethical concerns whenever data is collected, and that there is always a limit to how much can be gathered), rather than guessing at how it will be employed.
↑ One exception is when you have to count it often. Counting is slower than reading, so if the computer needs that value all the time, it might be more efficient to store it as a value. However, that is relatively unlikely in the sorts of databases being discussed here, and if you are building a database big enough to warrant that sort of speed optimisation, you probably need to do a more advanced course anyway.
↑ Date, C. J. "What First Normal Form Really Means" in Date on Database: Writings 2000–2006 (Springer-Verlag, 2006), pp. 127–128.
↑ Note that the two derived fields noted above have now been removed or changed, per the discussion.
↑ This only holds if the tour guide takes more than one tour: if each guide only ever runs one, and only one, tour, then more more will be used on the IDs than is saved. However, this is incredibly unlikely in this case.

← Introduction

Relational Databases

[1] Although, as will be demonstrated, some knowledge of normalization is valuable.

[2] There are exceptions to this rule. For example, a researcher typically does not know what data is going to be needed at the end – often they want to collect as much as possible, in the hope that they can detect patterns which will be of value in determining their findings. In such cases the focus is on collecting everything that can reasonably be collected (noting the existence of ethical concerns whenever data is collected, and that there is always a limit to how much can be gathered), rather than guessing at how it will be employed.

[3] One exception is when you have to count it often. Counting is slower than reading, so if the computer needs that value all the time, it might be more efficient to store it as a value. However, that is relatively unlikely in the sorts of databases being discussed here, and if you are building a database big enough to warrant that sort of speed optimisation, you probably need to do a more advanced course anyway.

[4] Date, C. J. "What First Normal Form Really Means" in Date on Database: Writings 2000–2006 (Springer-Verlag, 2006), pp. 127–128.

[5] Note that the two derived fields noted above have now been removed or changed, per the discussion.

[6] This only holds if the tour guide takes more than one tour: if each guide only ever runs one, and only one, tour, then more more will be used on the IDs than is saved. However, this is incredibly unlikely in this case.

[1]

[2]

[3]

[4]

[5]

[6]