Topic:Data structures

From Wikiversity
Jump to: navigation, search

Data Structures[edit]

Data structures, as is apparent from the name, is a "good" way of holding data. Before attempting to define good, lets see why such a thing is really needed. Every application or program has its set of requirements. Let's take the classic example of a student database at a university. Say, this database needs to hold information like student particulars, staff particulars and course particulars. The data elements student, staff, and course are not trivial elements, they are made of other elements. For instance, the student element is made up of the name element, the age element, and so on. It would make a lot of sense(and a lot of things easy), to hold all the relevant information together. Same is the case with staff element. The structures which help hold such pieces of related information together is called a data structure.

The course element, as one might observe, is more complex. It is made up of the staff(the instructor-in-charge and other instructors), and student(registered student element). The whole thing can be implemented in two ways. One, make a complex element called course, which holds the student and staff elements. This would mean redundancy, as the student and the staff elements already exist, with or without the course element, as they are separate entities. Very obviously, this is a bad solution. Instead, we could have "links" to the already existing student and staff elements. Again, very obviously, this is a more efficient solution (assuming "links" are properly defined). In this case, we have optimized the element or the Data structure for efficiency in terms of space.

Data structures and efficiency[edit]

The course element we have designed now seems efficient. But there is an upside to this. Consider a scenario, where the system is presenting a course details to a user. When it needs to present the list of students or staff, it needs to take the link and follow it. Having the elements directly in there would have saved us this overhead. That is, in an attempt to save on space, we are loosing on running time. The reader needs to understand that this loosing time is only relevant in this particular use case of having to display course details.

As is apparent, efficiency of data structures is two fold: Space-wise and Time-wise. The level of compromises one has to make on either depends on the user requirements.

Basic data structures[edit]

Linked-List: A linked-list is one of the most basic data structures available. Each element (called a listNode) contains the data being stored, and a pointer to the next, and previous objects in the list (called "next" and "prev" respectively.) Then, we have 2 pointers that point to the first, and last listNodes. In this way, we can find any element in the list, simply by starting either at the beginning or the end, and using the "next" or "prev" pointers to trace through the entire list.

This data structure is very good at inserting elements into the list, since insertion only needs to create the new object, and modify a couple of pointers. However, finding a specific entry in the list takes longer, because following pointers is relatively slow.

Linked-List Time Complexity:
Insertion Given a pointer O(1)
Given an index O(n)
Deletion Given a pointer O(1)
Given an index O(n)
Search O(n)
Index Lookup O(n)

Vector: A vector is an array that resizes itself when it doesn't have enough room for its contents. When a vector runs out of space, it will allocate a new chunk of memory that is double the size, and copy all of its current entries over into the new array. This means that the vector's internal structure should have at least 3 things: a pointer to the current array, an integer for the array's size, and another integer for the number of objects being stored. Whenever the number of elements exceeds the size of the array, the array is reallocated, and the old entries are copied over.

This data structure has nearly the same time complexity as an array, since they are driven by basically the same concepts. Adding or removing items from the end is very fast. However, inserting or deleting an item from anywhere else requires all elements above the one being inserted to be shifted, resulting in heavy slowdowns. Searching is fairly quick with vectors. Index lookups are completed in constant time, and searches in a sorted vector can be completed in O(log n). If the vector is unsorted, then it would take O(n) time to find an item. However, a vector is typically quicker than linked-lists for these linear searches, since working in an array is far more efficient for the processor, and results in lots of cache hits.

Note, however that vectors are not suitable for time-critical scenarios, since they face seemingly random slowdowns, when the array has to be reallocated.

Vector Time Complexity:
Insertion At the end O(1)
Elsewhere O(n)
Deletion At the end O(1)
Elsewhere O(n)
Search Unsorted O(n)
Sorted O(log n)
Index Lookup O(1)

Hash Table: A hash table is an array that uses keys for quick look-ups. Using the example at the top of the page, say that we were wanting to store "Student" objects, which contained information on currently enrolled students. In order to use a hash table, we need to select a key. (Preferably a numeric one.) In this example, we would use the Student ID as the key is guaranteed to be different for every student, and it is a number.

This key is put through some sort of calculation so we can link this key to a particular entry inside the array. However, since there are many possible different values for a key, and relatively few spots inside the hash table, there will be collisions when more objects are added to the hash table. There are many different algorithms used in hash tables. Some of them are:

Separate Chaining: Every entry in the hash table's array is actually a linked-list. We perform look-ups by taking the modulus of the ID and the length of the array, to get an index. Then, if two entries collide, (i.e. they try to use the same index) the second entry is simply added into the linked list. This algorithm is very simple, and good for small databases. However, if many objects are added to the table, lookups will stop being O(1), and will begin looking more like O(log n) searches, or worse.

Non-Linked List Methods:

These algorithms use vectors with wrap-around, that will resize themselves once they get too full. Most implementations will say a hash table is "too full" when the table is 50% filled.

Linear Probing: Every entry in the array contains a "status" field, which can either be EMPTY, FULL, or DELETED.

  • When inserting, we take the modulus of the ID and the array length, and if it collides, then we simply increment by 1, until we get to a spot with an EMPTY or a DELETED flag.
  • When looking-up, we take the modulus of the ID and the array length, and if it collides, we increment the index until we either find the value, we get to an EMPTY slot, or until we have traversed the entire array.
  • When deleting, we perform a lookup, and if we find the value, we set the "status" to DELETED.

Quadratic Probing: This method is very similar to the Linear probing example. The main difference, is that instead of incrementing by 1, we increment by perfect squares. (i.e. 1, 4, 9, ...) This is done, because it can be mathematically proven that if we increment by perfect squares, keep the table under 50% full, and make the table size a prime number, then we will always be able to find an EMPTY or a DELETED spot to insert into. (i.e. We will never enter an infinite loop.)

Note: When doing lookups in quadratic probing, we want to test the original spot, then test the (original spot)+1, then the (original spot)+4, etc.

In all situations, the length of the array is a prime number. If the hash table needs more room, the hash table will find its new size by doubling its current size, and then increasing this number until it is prime. This way, it will avoid situations where various IDs will collide often, because they are multiples of the table-size.

If we need to "re-hash," we resize the old table using the method described in the previous paragraph. We then go item-by-item through the old table, and insert them into the new one.

Hash Table Time Complexity:
Insertion O(1)
Deletion O(1)
Search O(n)
Index Lookup O(1)

Generic data structures[edit]

The concept of data structure is very requirement-specific. Almost every application which has a good deal of information needs to define its own ways of storing and retrieving data. But, some patterns of requirements have been noted more often than others, and data structures for these have been studied in general. We look at two of them here.

Queues: This is one of the most commonly used data structure. In many cases, one needs a data structure, which hold a set of items, and present the item which had arrived the first. For instance, a lift which queues all the reqest for stops(thought lifts do more complex stuff). This is represented by what is called a Queue data structure.

Dictionary: This is also a very commonly used structure. Imagine having to search for a word in a list of thousands or millions of words. Going through the entire list would be very complex. Given that every two different words have atleast one alphabet difference, we can fairly confidently say that we can come up with a function, which when given the word, can locate a small possible range where the word will exist(i.e. the word is added in that range, using the function). A well devised function can reduce the effort required by hunderds of times. Such a data structure is called a Dictionary, and a very common implementation is called the Hash table.

Tree: It isn't a kind of linear data structure.

-- 19:44, 28 May 2007 (UTC): Vicky