Relational Databases/Introduction

Introduction

Relational databases are a common and powerful approach to storing information. They make efficient use of computer storage by connecting sets of data together. In this course we are going to engage in a quick introduction to relational databases, starting from the basics of how they work and finishing at a point where the reader should be able to build their own simple database in an SQL-based relational database management system (RDMS), such as MySQL, SQL Server or PostgreSQL.

Some background: how data is stored in a flat file database

I would like to propose a simple database: a small database for a personal library, where I hope to keep track of who has borrowed my books. To do so, I want to keep track of:

Borrower's name
Borrower's email
Borrower's phone number
When the book was borrowed
Title
Author
Year published

Of course, a more complex database would store more data, but for a personal library that might be enough. This will let me know what the book is, (title, author and year of publication), who has it, when they took it, and how to contact them in order to ask for it back.

One method of tracking of these records is a spreadsheet. In a spreadsheet, each column could represent one of the fields above, while each row would cover one of the records. For example, I could use:

Loans
Name	Email	Phone	Date	Title	Author	Published
Sam	sam@internet	1234-5678	1 January 1971	Alice's Adventures in Wonderland	Lewis Carroll	1865
Alex	alex@internet	2345-6789	1 March 1971	The War of the Worlds	H. G. Wells	1898
Kim	kim@internet	3456-7890	1 May 1971	A Study in Scarlet	Sir Arthur Conan Doyle	1887

This is what is commonly referred to as a flat file database: a database in which all of the data is kept in a single file. It has the advantage that it is relatively easy to use. If I wished to know what book Alex had borrowed, I could quickly scan through the names column until I found Alex, then read across to find the title. Similarly, a computer could check the first column of each row until it found the desired name, and then quickly output the title of the book. There is a bit of a speed hit if there are a lot of records, as it could take a while to scan a file consisting of tens of thousands of entries, but I would get there in the end. And if needed, I could always optimise the data: for example, I could order the records by the name of the borrower or the title of the book, which would make searching even faster if I was only looking for one of those two fields.^[1] Thus flat file databases are useful where either speed (given certain constraints) or simplicity is a concern, especially when the data set can be well defined and is small, or where searches are limited to just one field in an ordered set.

However, the limitations of this approach quickly become apparent. For example, what if Alex borrows two books at once? There are two options:

Create a second record, repeating all of the information about Alex
Add more columns to allow for more books

The first option creates:

Loans (Version 2)
Name	Email	Phone	Date	Title	Author	Published
Sam	sam@internet	1234-5678	1 January 1971	Alice's Adventures in Wonderland	Lewis Carroll	1865
Alex	alex@internet	2345-6789	1 March 1971	The War of the Worlds	H. G. Wells	1898
Kim	kim@internet	3456-7890	1 May 1971	A Study in Scarlet	Sir Arthur Conan Doyle	1887
Alex	alex@internet	2345-6789	1 March 1971	The Time Machine	H. G. Wells	1895

Unfortunately, this introduces a number of concerns, although for now the focus on but two. First, by repeating data, the possibility of errors has increased. Every time the same information has to be entered, there is a chance that something will go wrong. And this problem becomes worse when Alex's details need to be updated, as those details must be updated in every one of the rows related to Alex. Databases need to be reliable, and anything which can increase the possibility of errors is a bad thing. Second, it wastes memory. The system already has Alex's details on record – now it has the same data twice, using up twice the memory.

Alternatively, more columns be added per record:

Loans (Version 3)
Name	Email	Phone	Date 1	Title 1	Author 1	Published 1	Date 2	Title 2	Author 2	Published 2
Sam	sam@internet	1234-5678	1 January 1971	Alice's Adventures in Wonderland	Lewis Carroll	1865
Alex	alex@internet	2345-6789	1 March 1971	The War of the Worlds	H. G. Wells	1898	1 March 1971	The Time Machine	H. G. Wells	1895
Kim	kim@internet	3456-7890	1 May 1971	A Study in Scarlet	Sir Arthur Conan Doyle	1887

This makes the risk of errors smaller, but it is still wasting memory: all of those empty fields take up memory in the database, and most of them are unused. (Many databases will reserve a certain amount of memory per field, irrespective of what is actually in it – variable width fields do exist, but many fields are not variable, and even those that are have to commit some memory to the task). Plus, what happens if Alex now decides to borrow a third book? I could add a third set of columns, but then the memory usage is even worse - and, of course, that places a limit of three books per user. What if someone wants to borrow four?

Thus it is time for an alternative approach.

An alternative approach

If we stop thinking about computers, and instead think about how this would be handled this in real life, it seems that the solution is simple: create two sets of records. One as a list of "borrowers", containing their names, email addresses and phone numbers, and a second containing a list of books with the names of those who borrowed them. Implementing this using the above data, we get:

Borrowers
Name	Email	Phone
Sam	sam@internet	1234-5678
Alex	alex@internet	2345-6789
Kim	kim@internet	3456-7890

Books
Title	Author	Published	Name	Date
Alice's Adventures in Wonderland	Lewis Carroll	1865	Sam	1 January 1971
The War of the Worlds	H. G. Wells	1898	Alex	1 March 1971
The Time Machine	H. G. Wells	1895	Alex	1 March 1971
A Study in Scarlet	Sir Arthur Conan Doyle	1887	Kim	1 May 1971

This is much more effective that the flat file database: it has the potential to use less memory (although there is a need to include the borrower's name in two places); there is less information being repeated; and there is no limit as to how many books someone can borrow. There is a cost, of course: using it has become harder. For example, to find out who to ring to chase up a copy of The Time Machine, I would need to look in the list of books, find it, read the name of the borrower, then scan the list of borrowers until I find their name and look up their phone number. Not impossible, but certainly more difficult than with the flat file. (Doing this, in the language of relational databases, is called "joining" the two tables, and results in a speed hit).

There another problem, though. What would happen if there were two people called Alex? In such a situation, the system would collapse: there would be no means of telling if the Alex referred to in the books table is the first Alex in the borrowers table or a different Alex. The name is not sufficiently unique to identify all of the potential people who could borrow books. What is needed is something unique - a code to represent each borrower that would be different for each one. This, in computing terms, is called a unique identifier, and in databases a unique identifier can be a primary key (PK): a unique code that is the primary means of identifying a record in the database.^[2] A common example is the registration number for a car, or an ID given to each student at a school.

The simplest solution to creating a unique identifier is to just number each record, using a numeric sequence starting at 1. Doing so will provide the unique identifier that the database needs, but it also has a neat side-effect: in computing numbers are generally much smaller than names, and thus the result is a reduction in how much memory is required to refer to the different records.

Borrowers (with IDs)
ID	Name	Email	Phone
1	Sam	sam@internet	1234-5678
2	Alex	alex@internet	2345-6789
3	Kim	kim@internet	3456-7890
4	Alex	otheralex@internet	4567-8901

Books
Title	Author	Published	Borrower ID	Date
Alice's Adventures in Wonderland	Lewis Carroll	1865	1	1 January 1971
The War of the Worlds	H. G. Wells	1898	2	1 March 1971
The Time Machine	H. G. Wells	1895	2	1 March 1971
A Study in Scarlet	Sir Arthur Conan Doyle	1887	3	1 May 1971

This does make it a bit harder for people to read, perhaps, but computers won't suffer from the same problem. They can handle the numbers as easily as they can handle words.

Expanding the model

This is good, but it is still not perfect. One problem is that there is no way of knowing who borrowed the book last – the system doesn't retain a borrowing history. It has the date the book was borrowed on (which will presumably be blank if it has been returned), but as there is only one date box, we need to overwrite that date each time it is borrowed.

On solution to this problem could be to modify the "Books" list. For example, we could add a "returned" column so we know when it came back, and just add a new record when it is borrowed again.

Books
Title	Author	Published	Borrower ID	Date	Returned
Alice's Adventures in Wonderland	Lewis Carroll	1865	1	1 January 1971
The War of the Worlds	H. G. Wells	1898	2	1 March 1971	12 March 1971
The Time Machine	H. G. Wells	1895	2	1 March 1971
A Study in Scarlet	Sir Arthur Conan Doyle	1887	3	1 May 1971
The War of the Worlds	H. G. Wells	1898	3	14 March 1971

However, now there is the same problems as before: data is being repeated, which is both risky (because of the possiblity of errors) and wastes memory.

So, how about an alternative? The solution last time was to create a new list. How about separating the dates, borrower IDs and dates returned from the books list, and placing it in one of its own?

Books
ID	Title	Author	Published
1	Alice's Adventures in Wonderland	Lewis Carroll	1865
2	The War of the Worlds	H. G. Wells	1898
3	The Time Machine	H. G. Wells	1895
4	A Study in Scarlet	Sir Arthur Conan Doyle	1887

Loans
Book ID	Borrower ID	Date	Returned
1	1	1 January 1971
2	2	1 March 1971	12 March 1971
3	2	1 March 1971
4	3	1 May 1971
2	3	14 March 1971

It is true that this does not make it any easier for people to work with – now they have to scan through three lists to work out who has borrowed a book, and they need to check to see if it has been returned as they go – but this has reduced the incidence of repeated data, improving the reliability and potential size of the system. Reliability is one of the primary priorities for database design.

Bringing it together

This final design is a relational database. Indeed, the design could be readily implemented in a relational database such as MySQL, Oracle, Microsoft Access or SQL Server with no significant modifications at all. The "relations" part of the "relational database" name comes from the IDs: the Books list is related to the Loans list by the Book ID, and the Loans list is related to the Borrowers by the Borrower ID. There is still room for some refinement, but if you can understand how these lists (or "tables" as they should be called in a database) relate to each other then you've passed the biggest hurdle.

Tasks to try

After the above discussion, we now have a basic database consisting of three tables and a number of records:

Borrowers
ID	Name	Email	Phone
1	Sam	sam@internet	1234-5678
2	Alex	alex@internet	2345-6789
3	Kim	kim@internet	3456-7890
4	Alex	otheralex@internet	4567-8901

Books
ID	Title	Author	Published
1	Alice's Adventures in Wonderland	Lewis Carroll	1865
2	The War of the Worlds	H. G. Wells	1898
3	The Time Machine	H. G. Wells	1895
4	A Study in Scarlet	Sir Arthur Conan Doyle	1887

Loans
Book ID	Borrower ID	Date	Returned
1	1	1 January 1971
2	2	1 March 1971	12 March 1971
3	2	1 March 1971
4	3	1 May 1971
2	3	14 March 1971

Using these tables, there are a number of things you can try.

Question 1

When was "The Time Machine" borrowed?

Solution
Looking in the Books table, the ID of "The Time Machine" is "3". Looking up 3 in the Loans table under "Book ID" reveals one record, which confirms that the book was borrowed on 1 March 1971.

Question 2

How many books has Alex with the email address otheralex@internet borrowed?

Solution

There are two people with the same name, Alex, so relying on the name wouldn't be enough. However, using the email address, the Borrowers table reveals that the Alex with that email has the ID of 4. Looking at the Borrower ID column in the Loans tables reveals that there are no loans for a person with the ID of 4. Therefore, Alex with the email address otheralex@internet has not borrowed any books.

Question 3

Who has borrowed the copy of H. G. Wells' "The War of the Worlds".

Solution

Looking in the Books table, the ID of "The War of the Worlds" is "2". Looking up 2 in the Loans table under "Book ID" reveals two records. However, only the second record shows that the book hasn't been returned, and that is by a borrower with the ID of "3". Checking the Borrowers table reveals that the borrower with that ID is Kim.

Therefore Kim currently has a copy of the book. Given that Kim borrowed it in 1971, it is probably overdue.

Question 4

How many books Kim has currently borrowed.

Solution
Kim's ID in the Borrowers table is "3". There are two records in the Loans table for someone with that ID, neither of which has been returned. Therefore Kim has borrowed two books. (Note that there was no need to look into the Books table).

Question 5

What would you need to change in the database in order for Sam to return her book, and for Kim to borrow it after her.

Solution

You would need to change the record in Loans representing Sam's loan (Borrower ID 1, Book ID 1) to include a return date, and then add a new record to the Loans table, with the book ID (1), Kim's ID (3), and the date the book was borrowed.

Technically you could do this without accessing the Books table, but you would need to find out both Kim and Sam's IDs from the Borrowers table.

Question 6

Add a new borrower (Jim, jim@internet, ph 9876-5432) and a new book (Dracula).

Solution

Add to Borrowers:

Borrowers
ID	Name	Email	Phone
5	Jim	jim@internet	9876-5432

Technically you could use any ID not currently being used, but it is easier just to go with a sequential number, as that way it won't repeat. To the Books table, you need to add:

Books
ID	Title	Author	Published
5	Dracula	Bram Stoker	1897

Question 7

Make a new table ("Authors") to contain a list of authors, and relate it back the "Books" table. (This one is more challenging, but would be a good step towards understanding relational databases).

Solution

You will need to make two changes. First, create a new Authors table consisting of Name and ID:

Authors
ID	Name
1	Lewis Carroll
2	H. G. Wells
3	Sir Arthur Conan Doyle

Then you need to adjust the Books table to use an Author ID rather than an author's name:

Books
ID	Title	Author ID	Published
1	Alice's Adventures in Wonderland	1	1865
2	The War of the Worlds	2	1898
3	The Time Machine	2	1895
4	A Study in Scarlet	3	1887

This will,once again, make it more difficult to find data, as you need to join more tables together. But it means that you only need to spell an author's name correctly once, and it uses less memory (so long as enough authors have written more than one book in order to make it worthwhile).

Next steps

Once this makes sense, as I hope that it will, it is time to move on to looking at how to design a relational database.

References

↑ Search optimisation works well if records can be placed into order first. If I wanted to know if the letter "C" was in the sequence "G,H,I,R,T,V,A,W,Z", as an example, I would need to read each of the values to check. But if they were in alphabetical order, "A,G,H,I,R,T,W,Z", I would know it wasn't there the instant I reached "G", as I would have already found it if it was present.
↑ Sometimes it is not possible to find an existing unique value, or creating one would just waste more memory. Instead, you can use what is referred to as a "composite key", where a combination of two or more values provides a unique identifier. But don't worry about that now, as it is something that won't be relevant until later.

Relational Databases

Design →

[1] Search optimisation works well if records can be placed into order first. If I wanted to know if the letter "C" was in the sequence "G,H,I,R,T,V,A,W,Z", as an example, I would need to read each of the values to check. But if they were in alphabetical order, "A,G,H,I,R,T,W,Z", I would know it wasn't there the instant I reached "G", as I would have already found it if it was present.

[2] Sometimes it is not possible to find an existing unique value, or creating one would just waste more memory. Instead, you can use what is referred to as a "composite key", where a combination of two or more values provides a unique identifier. But don't worry about that now, as it is something that won't be relevant until later.

[1]

[2]