Data structures/Transwiki Data Structures



Introduction

Computer scientists have created many ways to store data. For example, they have designed several kinds of list. One kind lets the computer reach any element very quickly; another trades that retrieval speed for faster searching for a specific piece of data. These data structures, and their various tradeoffs, are the subject of this course.

It is important not to confuse data structures with object classes. Data structures are designed by computer scientists to tell the computer exactly how to organize and work with data physically. Classes combine data structures with the algorithms that operate on them, and are designed by programmers to solve actual problems.

The pseudocode used here is loosely based on the Pascal programming language; some examples are also given in other languages.


Arrays

An array, in general, is a fixed-size sequential set of blocks, each block holding a value of a chosen datatype. Arrays are integral to most high-level languages today. Dynamic (growable) arrays exist, but growing an array is usually costly: if the array must expand beyond the space available at its current location, the entire array has to be copied into an area large enough to hold the expanded version. For example, consider an array of five 1-byte blocks sitting at address 1000, and a 1-byte variable X sitting at address 1007. To extend the array to hold 10 blocks, you would need to copy the existing five blocks to another location with room for the rest of the array. Another disadvantage of sequential storage is that a free sequential block large enough to hold the whole array must exist. An array of 500,000,000 1-byte blocks needs roughly 500 megabytes of contiguous free space; sometimes obtaining it requires defragmenting memory, which takes a long time.
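
Ruby arrays grow automatically, but the copying cost just described can be sketched with a fixed-capacity array that must allocate a larger block and copy every element whenever it runs out of room (the class and method names here are only illustrative):

class FixedArray
  attr_reader :capacity

  def initialize(capacity)
    @capacity = capacity
    @store = Array.new(capacity)   # stands in for one contiguous block of memory
    @length = 0
  end

  def append(value)
    grow(@capacity * 2) if @length == @capacity
    @store[@length] = value
    @length += 1
  end

  def [](index)
    @store[index]
  end

  private

  # Growing means allocating a new, larger block and copying every element,
  # which is why repeatedly enlarging an array is expensive.
  def grow(new_capacity)
    new_store = Array.new(new_capacity)
    @length.times { |i| new_store[i] = @store[i] }
    @store = new_store
    @capacity = new_capacity
  end
end

a = FixedArray.new(5)
(1..10).each { |n| a.append(n) }   # forces one grow-and-copy at the sixth item
puts a[7]                          # prints 8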

Advantages of arrays include:

  • Constant-time access to any element by its index
  • Compact storage, with little overhead per element and good memory locality

Disadvantages include:

  • Constant size
  • Constant data-type
  • A large free sequential block is needed to accommodate large arrays

Linked lists

One way to make a list of data in a certain order is to treat it like a chain. Each link in the chain tells you where its piece of data is and where the next link in the chain is.

Singly-linked lists

Here is pseudocode for a singly-linked list that could hold [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:

class Node
    var int data
    var Node* next
class List
    var Node* head := None
    // Insert a new node at the head of the list.
    function insert(data)
        head := new Node(data, head)    // the new node's next link is the old head
    // Remove the first node whose data matches.
    function remove(data)
        var Node* current := head
        var Node* previous := None
        var boolean found := false
        while (current != None) and (not found)
            if data == current.data
                found := true
            else
                previous := current
                current := current.next
        if found
            if previous == None
                head := current.next            // the head node was removed
            else
                previous.next := current.next   // unlink the found node

This is called a singly-linked list, because each node has only a single link to the next node. You can only go along the list in one direction.

Special features:

  • You are required to keep a reference to the head node.
  • It is simple to think about and implement.
  • It is hard to make errors.
  • It is easy to figure out when you've gone through all the elements.

Efficiency features:

  • It needs little memory (two references per item).
  • It will quickly get to the items near the head of the list and the next item in the list.
  • It will slowly get to the items near the tail of the list and the previous item in the list.
  • It is slow at searching for a specific item.

A common optimization is to also keep a reference to the tail node. This allows new nodes to be quickly inserted at the end of the list.
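
For example, here is a minimal Ruby sketch (the names are illustrative) of a singly-linked list that also tracks its tail, so appending takes constant time:

class TailNode
  attr_accessor :data, :next_node
  def initialize(data)
    @data = data
    @next_node = nil
  end
end

class TailedList
  def initialize
    @head = nil
    @tail = nil
  end

  # Append in constant time by linking the new node after the current tail.
  def append(data)
    node = TailNode.new(data)
    if @tail.nil?
      @head = node            # first element: head and tail are the same node
    else
      @tail.next_node = node
    end
    @tail = node
  end

  def to_a
    items = []
    node = @head
    while node
      items << node.data
      node = node.next_node
    end
    items
  end
end

list = TailedList.new
(0..10).each { |n| list.append(n) }
puts list.to_a.inspect   # prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]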

Doubly-linked lists

Let's do the same thing with a small change:

class ListNode
  attr_accessor :prev_node, :item, :next_node
  
  # Add the next item in the list
  def insert(new_item)
    @next_node = ListNode.new
    @next_node.prev_node = self
    @next_node.item = new_item
  end
  
  def inspect
    "#{@item.inspect}, #{@next_node.inspect}"
  end
end

node = ListNode.new
node.item = 0
(1..10).each do |item_to_add|
  node.insert item_to_add
  node = node.next_node
end

# Move from the tail of the list to the head of the list.
node = node.prev_node until node.prev_node.nil?

puts "The list: #{node.inspect}."
puts "The third item: #{node.next_node.next_node.item}."

This is called a doubly-linked list, because each node has links to both the previous and next nodes. You can go along the list in either direction.

Special features:

  • You are no longer required to keep a reference to the head node.
  • It is simple to think about and implement.
  • Errors can easily be made if the two links fall out of sync, for example when a node's next node does not point back to it as its previous node, or its previous node does not point to it as its next node.
  • It is easy to figure out when you've gone through all the elements.

Efficiency features:

  • It needs little memory (three references per item).
  • It will quickly get to the previous or next item in the list.
  • It will slowly get to items near the head or tail of the list.
  • It is slow at searching for a specific item.

Some common optimizations are to keep references to the head and tail nodes. This allows quick moves to either end of the list and quick insertions of new nodes at the end of the list.

Stacks

Stacks, as well as queues, are variations on the linked list. Stacks are also known as Last In First Out (LIFO) data structures, and this is indicative of their behavior. Simply put, when you put an item into a stack, generally called a 'push' operation, you are putting that element at the 'top' of the stack. When you remove an item from the stack, commonly known as a 'pop' operation, you're removing an element from the 'top' of the stack.

One common metaphor for a stack data structure is a vertical stack of dishes. The only feasible way to add a dish to the pile is to place it on top, and likewise, the only reasonable way to remove a dish is to remove it from the top.

An Example

 Stack of numbers: (bottom) 2, 4, 9, 1, 3, 8 (top)

When you remove ("pop") the first item, you'll get an 8, leaving the stack as follows:

 2, 4, 9, 1, 3

Then when you "pop" the stack again, you'll get a 3, leaving the stack in this state:

 2, 4, 9, 1

If you were then to add a 7 to the stack, 'pushing' it on, the stack would become:

 2, 4, 9, 1, 7

and so on.

Real world use

Stacks are generally implemented such that pushing and popping operations can be performed very quickly, in constant time.
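
As a sketch, here is a minimal array-backed stack in Ruby that reproduces the example above; it relies on the fact that pushing to and popping from the end of a Ruby Array are (amortized) constant-time operations:

class Stack
  def initialize
    @items = []
  end

  # 'push' places an item on top of the stack.
  def push(item)
    @items.push(item)
    self
  end

  # 'pop' removes and returns the item on top of the stack.
  def pop
    @items.pop
  end

  def to_s
    "(bottom) #{@items.join(', ')} (top)"
  end
end

stack = Stack.new
[2, 4, 9, 1, 3, 8].each { |n| stack.push(n) }
stack.pop          # returns 8
stack.pop          # returns 3
stack.push(7)
puts stack         # prints (bottom) 2, 4, 9, 1, 7 (top)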

Queues

Trees

Binary Trees

Binary trees can be searched quickly, but may become unbalanced as nodes are added and deleted.
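
As an illustrative sketch, here is a minimal binary search tree in Ruby: each node keeps smaller keys in its left subtree and larger keys in its right subtree, so a lookup can discard half of the remaining tree at each step. Inserting keys in already-sorted order, however, degenerates the tree into a linked list.

class TreeNode
  attr_accessor :key, :left, :right
  def initialize(key)
    @key = key
  end
end

# Insert a key, walking left for smaller keys and right for larger ones.
def bst_insert(root, key)
  return TreeNode.new(key) if root.nil?
  if key < root.key
    root.left = bst_insert(root.left, key)
  elsif key > root.key
    root.right = bst_insert(root.right, key)
  end
  root
end

# Search by the same rule; at most one comparison per level of the tree.
def bst_include?(root, key)
  return false if root.nil?
  return true if key == root.key
  key < root.key ? bst_include?(root.left, key) : bst_include?(root.right, key)
end

root = nil
[5, 3, 8, 1, 4].each { |k| root = bst_insert(root, k) }
puts bst_include?(root, 4)   # prints true
puts bst_include?(root, 7)   # prints false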

B+ Trees

B+ trees serve a similar role to binary trees, but overcome this shortcoming with self-balancing insert and delete operations.

B* Trees

Red Black Trees

Oct-trees

A number of other tree data structures exist, in a variety of flavors.

Heaps

A heap is a data structure that implements the priority queue abstract data type. The most important operation a priority queue provides is retrieval of the element with the minimum (or maximum) value. This versatility comes from the fact that whether the minimum or the maximum sits at the root depends on the comparison function we give the heap. The heap invariant is as follows: every child must have a greater (or lesser) value than its parent, which implies that the smallest (or largest) value is located at the root. The heap maintains this invariant by moving values toward the root or toward the leaves after each insertion or removal; this transformation is known as re-heapification.

Since a heap is a complete binary tree, we can economize and represent it as a flat array. This lets us compute the parent or children of any node from its index with simple arithmetic: for the node at index i (counting from 0), the parent is at (i - 1) / 2 using integer division and the children are at 2i + 1 and 2i + 2. Explicit parent and child pointers are therefore unnecessary.

The re-heapify operation is bounded by the height of the tree, and it moves through the tree in only one direction. The worst case is when swaps have to be performed all the way between a leaf and the root, hence it is O(log n).

A heap can be used to implement heapsort, an O(n log n) comparison sort, simply by inserting all the keys into the heap and then removing them one by one. Due to the heap invariant, the keys come out in sorted order; a short sketch follows the class below.

 class Heap
   attr_reader :heap
     
   def to_s
     @heap.to_s
   end
   
   def initialize
     @heap = Array.new
     @size = 0
   end
   
   def <<(item)
     @heap << item
     @size += 1
     reheapify_up
   end
   
   def pop
     # Remove and return the smallest item (the root of the heap).
     return nil if empty?
     ret = @heap[0]
     last = @heap.pop            # take the last leaf off the array
     @size -= 1
     unless @heap.empty?
       @heap[0] = last           # move it to the root...
       reheapify_down            # ...and sift it back down into place
     end
     ret
   end
   
   def size
     @size
   end
   
   def empty?
     @size == 0
   end
   
 private
   def reheapify_up
     # Walk the newly appended item up the tree while it is smaller than its parent.
     ptr = @heap.length - 1
     while ptr > 0 && (@heap[parent_of(ptr)] <=> @heap[ptr]) > 0
       swap(parent_of(ptr), ptr)
       ptr = parent_of(ptr)
     end
   end
   
   def reheapify_down
     ptr = 0
 
     while (2*ptr+1) < @heap.size
       kid = 2*ptr + 1
 
       if kid < @heap.size-1 and @heap[kid] > @heap[kid+1]
         kid += 1
       end
       
       if @heap[ptr] < @heap[kid] then
         break
       else
         swap(kid, ptr)
         ptr = kid
       end
     end
   end
   
   def parent_of(ind)
     (ind-1) / 2
   end
   
   def swap(a, b)
     @heap[a], @heap[b] = @heap[b], @heap[a]
   end
 end
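
As a sketch of the heapsort idea described above, the Heap class can be driven like this (the heapsort method here is a usage example, not part of the class itself):

# Push every key into the heap, then pop them back out; the min-heap
# invariant means the keys come out in ascending order.
def heapsort(keys)
  heap = Heap.new
  keys.each { |k| heap << k }
  sorted = []
  sorted << heap.pop until heap.empty?
  sorted
end

puts heapsort([9, 1, 8, 2, 7, 3]).inspect   # prints [1, 2, 3, 7, 8, 9]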

Hash tables

In computer science, a hash table, or a hash map, is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number). It works by transforming the key using a hash function into a hash, a number that the hash table uses to locate the desired value.
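
As an illustrative sketch (the names are made up for this example), here is a minimal hash table in Ruby that uses separate chaining, so two keys that land in the same bucket can still be stored and found:

class HashTable
  def initialize(num_buckets = 16)
    @buckets = Array.new(num_buckets) { [] }
  end

  # The hash function turns a key into a bucket index.
  def bucket_for(key)
    @buckets[key.hash % @buckets.length]
  end

  def []=(key, value)
    bucket = bucket_for(key)
    pair = bucket.find { |k, _| k == key }
    if pair
      pair[1] = value          # overwrite an existing key
    else
      bucket << [key, value]   # otherwise chain a new pair onto the bucket
    end
  end

  def [](key)
    pair = bucket_for(key).find { |k, _| k == key }
    pair && pair[1]
  end
end

phone_book = HashTable.new
phone_book["Alice"] = "555-0100"
phone_book["Bob"]   = "555-0199"
puts phone_book["Alice"]   # prints 555-0100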