Data structures/Transwiki Data Structures
Introduction
[edit | edit source]Computer scientists have created many ways to store data. For example, they have designed several ways to store a list. A computer might be able to look at the data in one kind of list very quickly. Another type of list might trade off retrieval speed for quick searches for a specific piece of data. These data structures, with their various tradeoffs, are the subject of this course.
It is important not to confuse data structures with object classes. Data structures are designed by computer scientists to tell the computer exactly how to physically organize and work with the data. Classes are combinations of data structures with algorithms working with these structures, designed by programmers to solve actual problems.
Pseudocode that is close to the Pascal programming language appears here. There are some examples of the code in other languages also.
Arrays
[edit | edit source]An array, in general, is a constant-sized sequential set of blocks, each block containing a value of the elected datatype. The use of arrays is integral in most high-level languages today. Versions of dynamic arrays exist, but this approach is usually costly, because arrays expanded into sizes larger than is available at the specific location where the array is stored, would require the copying of the entire array into an area large enough to hold the expanded array. For example, consider an array of 5 1-byte-long blocks, which sits at address 1000, and assume there is a variable X, 1 byte long, sitting at address 1007. Should you wish to extend your array to hold 10 blocks, you would need to copy the first 5 blocks to another location in order to accommodate the rest of the array. Another disadvantage of the sequential storage method is that there has to be a free sequential block large enough to hold the array. If you have an array of 500,000,000 blocks, each 1 byte long, you need to have roughly 500 megabytes of sequential space to be free; Sometimes this will require a defragmentation of the memory, which takes a long time.
Advantages of arrays include:
- Random access in O(1) (See Big O notation (Wikipedia))
- Ease of use: Integrated into most modern languages
Disadvantages include:
- Constant size
- Constant data-type
- Large free sequential block to accommodate large arrays
Linked lists
[edit | edit source]One way to make a list of data in a certain order is to treat it like a chain. Each link in the chain tells you where its piece of data is and where the next link in the chain is.
Singly-linked lists
[edit | edit source]Let's make a linked list version of [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] :
class Node var int data var Node* next class List var Node* head := None function insert(data) head := new Node(data, head.next) function remove(data) var Node* current := head var Node* previous := None var boolean found := false while (current != None) and (found == False) if data == current.data found := True else previous := current current := current.next if found == True previous := current.next
This is called a singly-linked list, because each node has only a single link to the next node. You can only go along the list in one direction.
Special features:
- You are required to keep a reference to the head node.
- It is simple to think about and implement.
- It is hard to make errors.
- It is easy to figure out when you've gone through all the elements.
Efficiency features:
- It needs little memory (two references per item).
- It will quickly get to the items near the head of the list and the next item in the list.
- It will slowly get to the items near the tail of the list and the previous item in the list.
- It is slow at searching for a specific item.
A common optimization is to also keep a reference to the tail node. This allows new nodes to be quickly inserted at the end of the list.
Doubly-linked lists
[edit | edit source]Let's do the same thing with a small change:
class ListNode attr_accessor :prev_node, :item, :next_node # Add the next item in the list def insert(new_item) @next_node = ListNode.new @next_node.prev_node = self @next_node.item = new_item end def inspect "#{@item.inspect}, #{@next_node.inspect}" end end node = ListNode.new node.item = 0 (1..10).each do |item_to_add| node.insert item_to_add node = node.next_node end # Move from the tail of the list to the head of the list. node = node.prev_node until node.prev_node.nil? puts "The list: #{node.inspect}." puts "The third item: #{node.next_node.next_node.item}."
This is called a doubly-linked list, because each node has links to both the previous and next nodes. You can go along the list in either direction.
Special features:
- You are no longer required to keep a reference to the head node.
- It is simple to think about and implement
- Errors can easily be made when the next node doesn't believe that this is its previous node or the previous node disagrees that this is its next node.
- It is easy to figure out when you've gone through all the elements.
Efficiency features:
- It needs little memory (three references per item).
- It will quickly get to the previous or next item in the list.
- It will slowly get to items near the head or tail of the list.
- It is slow at searching for a specific item.
Some common optimizations are to keep references to the head and tail nodes. This allows quick moves to either end of the list and quick insertions of new nodes at the end of the list.
Stacks
[edit | edit source]Stacks, as well as queues, are variations on the linked list. Stacks are also known as Last In First Out (LIFO) data structures, and this is indicative of their behavior. Simply put, when you put an item into a stack, generally called a 'push' operation, you are putting that element at the 'top' of the stack. When you remove an item from the stack, commonly known as a 'pop' operation, you're removing an element from the 'top' of the stack.
One common metaphor for a stack data structure is a vertical stack of dishes. The only feasible way to add a dish to the pile is to place it on top, and likewise, the only reasonable way to remove a dish is to remove it from the top.
An Example
[edit | edit source]Stack of numbers: (bottom) 2, 4, 9, 1, 3, 8 (top)
When you remove the first item, you'll get an 8. Making the stack as follows:
2, 4, 9, 1, 3
Then when you "pop" the stack again, you'll get a 3, leaving the stack in this state:
2, 4, 9, 1
If you were then to add a 7 to the stack, 'pushing' it on, the stack would become:
2, 4, 9, 1, 7
and so on.
Real world use
[edit | edit source]Stacks are generally implemented such that pushing and popping operations can be performed very quickly, in constant time.
Queues
[edit | edit source]Trees
[edit | edit source]Binary Trees
[edit | edit source]Are quickly searched, but may become unbalanced as nodes are added and deleted.
B+ Trees
[edit | edit source]Are similar to Binary trees, but overcome the short-comings with the addition of self-balancing insert and delete functions.
B* Trees
[edit | edit source]Red Black Trees
[edit | edit source]Oct-trees
[edit | edit source]A number of other Tree data structures exist, in a variety of flavors.
Heaps
[edit | edit source]A heap is a data structure which implements the priority queue abstract data type. The most important operation a priority queue defines is the retrieval of an element with minimum (or maximum) value. This versatility of the heap stems from the fact that which of the maximum or minimum is at the root depends on the comparison function we provide the heap. The heap invariant is as follows: all children must have a greater (or lesser) value than their parents. This implies that the smallest (or largest) value must be located at the root. The heap facilitates this operation by moving the node of least (or most) value to the root of the heap after each insertion operation. This transformation is known as re-heapification.
Since a heap is equivalent to a balanced full binary tree, we can economize and represent it as a flat array. This has the advantage of allowing us to compute the child or parent of an arbitrary node given its index with a simple arithmetic expression, rendering explicit parent or child pointers unnecessary.
The re-heapify operation is bounded by the height of the tree, and movement through the tree is only in one direction. The worst case is when swaps have to be performed all the way from leaf to root, hence it is O(log n).
A heap can be used to implement heapsort, an optimal sorting algorithm, simply by inserting keys into the heap, then removing them one by one. Due to the heap invariant, as they are extracted they will be in sorted order.
class Heap attr_reader :heap def to_s @heap.to_s end def initialize @heap = Array.new @size = 0 end def <<(item) @heap << item @size += 1 reheapify_up end def pop ret = @heap[0] @heap[0] = @heap.pop reheapify_down() @size -= 1 return ret end def size @size end def empty? @size == 0 end private def reheapify_up ptr = @heap.length - 1 while ptr > 0 if (@heap[parent_of(ptr)] <=> @heap[ptr]) > 0 swap(parent_of(ptr), ptr) end ptr = parent_of(ptr) end end def reheapify_down ptr = 0 while (2*ptr+1) < @heap.size kid = 2*ptr + 1 if kid < @heap.size-1 and @heap[kid] > @heap[kid+1] kid += 1 end if @heap[ptr] < @heap[kid] then break else swap(kid, ptr) ptr = kid end end end def parent_of(ind) (ind-1) / 2 end def swap(a, b) @heap[a], @heap[b] = @heap[b], @heap[a] end end
Hash tables
[edit | edit source]In computer science, a hash table, or a hash map, is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number). It works by transforming the key using a hash function into a hash, a number that the hash table uses to locate the desired value.