Data Structures and Algorithms/Sorting Data
The wish for order is pretty old and there are as many reasons as there are things that could be sorted. But the main reason for sorting is the fact that finding can be much faster when one can rely on an order. Sorting is the main condition for effective searching. This is true for stores as well as dictionaries.
In this chapter sorting is restricted to the sorting of data records. Since the times are long gone when tapes were used as data storage, we restrict to the sorting of data where we have random access. It does not matter whether the records are stored in the fast RAM or on disc; the functionality of the algorithms stays the same.
Additionally we require that the data records have all the same length and structure and that they are stored as a list; hence this chapter is called „Sorting of Arrays“. The structure of the data records is of no importance for us; we only have a look at the key, which is that part of the data record, by which the records are sorted.
At the times when storage was very expensive, the memory needs of an algorithm were of importance and algorithms that could work "in place" were favored (the array with the data was given to the routine doing the sorting and the sorted data was returned in the same array). Nowadays it is much more interesting to have the possibility to select the behavior depending on the wish to later use the data again in their original order.
When checking the usability of an algorithm the following criteria are of importance:
- is the algorithm able to handle the sorting key
- is there sufficient memory available
- is the algorithm faster than alternative ones
- is it possible to implement the algorithm in the wanted programming language
In this chapter we will have a look at some selected sorting algorithms in order to explain the basic strategies; after studying this chapter one should be able to select a suitable algorithm for a given problem and to understand the principles and characteristics of algorithms not explained here.
Simplified one could say that sorting is based on these elementary operations:
- selecting and inserting
- spreading and collecting
Sorting in applications
In the mayority of all applications sorting is an important operation and is responsable for a big amount of processing time when there are more than a few hundred data records. There are estimates that 20% to 30% of the runtime of professional programs is used for sorting. Sorting we find in very different fields.
- words in a dictionary (alphabetical order)
- statement of account (ordered by date)
- student records (name, social security number, subject, ...)
- addresses (zip-code, name of town/city, street, ....)
Definition of the Problem
When a child sorts its toys all blue things go into one box, all yellow ones into another and all red ones into a third box and all others into box number four. We see that the criteria for sorting must not be numerical, they must only be distinguishable.
Therefore we have to introduce another restriction: when comparing the elements of a list, it must be possible to tell whether element A is smaller, of same value, or bigger than element B. When sorting colored toys we have to number the colors -making a numbered list- or we could use the wavelength of the colors.
When comparing numerical values, we normally work with build in functions; the compiler will automaticly know how to treat
if(A > B)
depending on the definition of the variables. If somebody writes a program in assembler for an old 8-bit processor, he will have to check the definition of how the bits of a 32-bit float have to be interpreted so that a meaningful sorting routine can be programed.
For alphabetical order we need to have a list with two columns. In the first column we have the number of the row and in the other column we have the interpretation. One of the most commonly used list of this type is the ASCII-table. What the computer does is not really an alphabetical sort, the sort is by position numbers.
All the algorithms discussed in this chapter work with numerical representations where a smaller-, bigger-, same-relation is defined.
The time necessary for swapping data records can be reduced drastically when it is possible to maintain a list with references to the records and just to swap these references.
After the sorting the array with the data records is still untouched; but in the array with the references we have the numbers of the data records in such a way that when rearranging the data records following these numbers, the data records will be sorted. The time necessary for this rearrangement increases in linear fashion with the number of data records.
Working with indirect sort is indicated when the costs for swapping are high (the length of the data record is much longer than the 4 bytes for an index) and there is no problem to provide the additional memory (since it is just 4 bytes per record this problem should occur seldom).
In applications it happens frequently that data records have to be sorted by more than one criteria. For example we have a list with addresses and we want to sort it by second name and within the second name by first name.
A sorting algorithm is called stable when such a list is sorted first by first name and then by second name, and every block with identical second name is still sorted by first name. A stable sorting algorithm does not destroy the results of previous sorts.
With algorithms that are not stable there is much more effort involved for getting the same result. First the list has to be sorted by second name and then all the blocks with identical second name have to be detected and sorted by first name.
For sorting an address list by zip-code, within zip-code by street name, within street name by second name, within second name by first name, the program code will become a bit complicated when using an unstable algorithm. With a stable algorithm the four sorts are done in reverse order and everything is done.
Sorting algorithms are playing an important role in computer science and during the years a lot of algorithms were developed. A classical way to make up groups of algorithms is to distinguish whether they are comparison-based or not. Since most algorithms of the last type were very limited concerning their use this meant "usable" or "exotic".
We will show here that it is a much better idea to distinguish dividing and non-dividing algorithms in order to explain the different speeds. With dividing is meant making subgroups from groups. In this chapter we only give a very brief description of the algorithms; a detailed description can be found further down.
First we have a look at an algorithm of the non-dividing type.
|SelectionSort compares the keys of the data records in order to decide whether they have to be swapped or not. The algorithm starts with the first record and checks all the other records whether their key value is smaller; if so, the records are swapped. If no smaller key can be found, the algorithm continues with the second record, then with the third record and so on. In a rough estimate one can say that every key is compared with all the others. As a result we get a square characteristic. The time for a sort can be estimated by
T = B * n2
where B is a constant reflecting the implementation.
If there are only two data records, the algorithms of this type are reduced to the question whether the two records have to be swapped or not. In this special case these algorithms are faster than all the others.
Now we have a look at some algorithms which follow the motto "divide and conquer".
The algorithm QuickSort is frequently used. When it starts the totality of the data records is taken as a single domain. For this domain a so called pivot-element is selected in such a way that approximately half of the keys have a smaller value than the pivot. Then the data records are separated into a lower and a higher sub-domain. The method is then repeated recursively on the sub-domains. Domains with one or no record are defined as sorted. This repeated splitting could be visualized as a binary tree and the number of levels tells how many times the domains could be divided into two sub-domains. Therefor the time needed can be estimated by
T = C * n * lg2(n)
where C is a constant reflecting the implementation.
|The classic under the sorting algorithms is BucketSort. It works with m buckets and a -however defined- function for evaluation into which bucket a record shall go. If, after sorting all records, we have records with different values in a bucket, then the content of this bucket has to be sorted again internally. This can, of course, be done by BucketSort but with a different function for bucket evaluation (the key values are different!). If these functions are generated automatically and in a way that the distribution into the buckets is even into again m buckets, the time for sorting can be estimated by
T = D * n * lgm(N)
where D is a constant reflecting the implementation; N is the maximum number of different keys possible.
|The absolute classic (invented 1887 by Herman Hollerith) is RadixSort, which was used for sorting numbered punched cards. A sorting machine with 10 buckets was adjusted to read the last digit of the numbers. Depending on the value of this digit the cards were distributed to the buckets. Then, starting with bucket number zero, the cards were recollected. Then the machine was adjusted to the second last digit, the cards were distributed and recollected. This was repeated until the stack was sorted by the first digit. The time for sorting can be estimated by
T = E * n * lg10(N)
where E is a constant reflecting the implementation; N is again the maximum number of different keys possible.
Often the equation
T ≈ n * d
is found where d stays for the number of digits. Since the logarithm tells how many times N can be divided by 10 repeatedly, both equations tell the same, but the first one makes it possible to compare the characteristic with other algorithms.
|ExtraDix is an extension of RadixSort. It simulates sorting machines with 256 buckets. The number 256 is somehow "natural" because one byte is the size of the smallest key in use (8-bit-numbers or characters). The time for sorting can therefor be estimated by
T = F * n * lg256(N)
where F is a constant reflecting the implementation; N is again the maximum number of different keys possible. The expression lg256(N) gives the number of bytes of the key and for a 32-bit integer this is 4.
Automatic Division vs Fixed Division
It is conspicuous that for QuickSort we work with the logarithm of n and for the other algorithms with N; we have a closer look at the why.
There is an unlimited number of different real numbers and it is possible to proof that there is even an unlimited number of real numbers in any given interval. QuickSort and algorithms alike work with the assumption, that it is impossible to know beforehand, how many times the domains have to be subdivided. Therefor the number of division levels is evaluated during the sort.
The other algorithms work with the assumption, that on a computer the number of different values is limited and countable. Even when working with floats of double precision the maximum number of different values is known beforehand and this number is 264. When the dividing factor stays constant -same number of buckets or domains on any level- it is possible to tell the number of division levels that make sense; from there on only records with identical key values will be found in a bucket.
How to compare algorithms
If we have 4 data records, QuickSort has finished its job after two recursion levels (on any recursion level all the records have to be handled once). If working with RadixSort and numbers from 00.000 up to 99.999, all the data records have to be handled five times. We have a real advantage for QuickSort.
This changes instantly when we have to sort 200.000 data records. RadixSort still has to handle all the records only five times but QuickSort now generates 18 recursion levels (218 = 262 144) and handles all the records on any level.
When looking for the fastest algorithm for a task we can use the equations in order to estimate the relation between them. If one wants to know from which number of records onwards ExtraDix will be faster than QuickSort when sorting 64-bit integers, one gets
|F * n * lg256(N)||<||C * n * lg2(n)|
|F * n * 8||<||C * n * lg2(n)|
|F * 8||<||C * lg2(n)|
|n||>||2 F * 8 / C|
Under the assumption that C and F have the same value, F / C becomes 1 and the result is that for more than 256 data records ExtraDix will be faster than QuickSort.
The algorithm runs repeatedly through an array comparing element i with element i+1. If these records are not sorted they are swapped. The number of comparisons and swaps can be estimated with as well for the worst-case as for the average-case
Link to a detailed description of BubbleSort
|SelectionSort runs in a double loop. The first element is compared with all the other ones. If a smaller one is found, the records are swapped. Then the second element is compared with all the other ones and so on.
The number of comparisons and swaps can be estimated for the Worst-Case and the Average-Case with
Link to a detailed description of SelectionSort
|InsertionSort is a "natural" sorting algorithm and it is used to insert a new record into an already sorted list. The first step is to find the position where the new record should be; then the rest of the list is moved one position up and in the last step the record is inserted.
The number of comparisons and record moves can be estimated for the Worst-Case and the Average-Case with
Link to a detailed description of InsertionSort
The algorithms of this category follow the motto "divide and conquer".
Why the motto "divide and conquer" brings a benefit when sorting you can learn here.
As already shown in the runtime-analysis these algorithms can be subdivided into those which calculate the number of recursion levels dynamically and those which work with a fixed number of repetitions.
The algorithms of this type do the division into sub-domains based on the number of data records. Most of these algorithms do not work stable.
|MergeSort is a recursive algorithm which splits an input list into two sublists. When a level is reached where there is only one element in the list, it returns. Since returned sublists are sorted, a single sorted list can be build by merging the two sublists. Starting with the smallest elements of both lists the smaller one is moved to the result and deleted from the sublist. If one of the sublists is empty, the rest of the other sublist is moved to the result without compare operations. The algorithm is stable but not "in place".
With evenly distributed key values the timing follows the function for average case and worst case sorts. By grouping sorted sequences into sublists (called runs) which are not divided any more, the best case can be .
Link to a detailed description of MergeSort
|From a domain a so called pivot-element is determined. This pivot divides the domain into two sub-domains; one domain with the keys being smaller or equal to the pivot and the other sub-domain with keys being bigger or equal to the pivot. The sub-domains are processed recursively. Domains with one or zero records are seen as sorted.
With evenly distributed key values the timing follows the function for best case and average case sorts.
When using the algorithm in its basic form (as pivot-element always the key of the first or last element of the domain is selected) the timing depends strongly on the distribution of the key values. If the key values are already sorted the worst case is characterized by .
If the pivot element is selected by pure chance it is very improbable to make disadvantageous selections all the time. With some more effort (histograms) an optimal pivot can be selected.
Link to a detailed description of QuickSort
|In BucketSort a function valuates the key of a record and gives the number of the bucket into which it has to go. The number of buckets is determined when this function is generated. After a sorting phase it is normal to find records with different key values in the same buckets. Then the content of these buckets have to be sorted internally. This can again be done by help of BucketSort, but the valuating functions have to be adapted to other key values.
If the number m of buckets is identical on any level of recursion, the time necessary for sorting follows the function .
Link to a detailed description of BucketSort
|In this implementation BucketSort is recursive and sorts by integer keys. The distribution function is evaluated automatically for any bucket by searching for the smalles and biggest key value and by dividing this range into as many subranges as buckets are requested.The sorting is stable and indirect. The number of buckets to be used on any recursion level is adjusted by a parameter.
If this number is 2, a QuickSort with optimized pivot detection is realized. And here comes a novelty: After finding the smallest and the biggest key value it is checked whether they are identical; if so, all the key values in this bucket are identical as well and there is no need for another recursion level.
We get the lower border for computing time if all key values are identical; there we get . If there are no identical key values at all, we first get , but then there is a limitation. With the sorting key being a 4 byte integer, there will be in maximum 32 recursion levels, no matter how many data records we got, because on the 32nd level all key values in a bucket have to be identical!
The proof that is the lower border for comparison based sorting algorithms is only a faulty thesis!
Before starting the sort it is known how many different key values are possible. If sorting by an 4 byte unsigned integer the values can be between 0 and a bit more than 4 billions. When using RadixSort it is clear that the numbers in decimal notation can have up to ten digits so we need ten sorting sequences. When using ExtraDix four are enough.
RadixSort is not comparison-based and it is stable. A punched card sorting machine with ten buckets is simulated. First the machine is adjusted to the last digit and the records are distributed depending on the last digit of the key. Then the records are recollected, the machine is adjusted to the second last digit and the records are distributed depending on that digit. This cycle of adjusting, distributing and recollecting is repeated until the records are sorted by the very first digit.
The time necessary for this follows the function . N is the maximum number of different key values. The order of the records before they are sorted has no effect whatsoever on the time necessary for a sort.
Link to a detailed description of RadixSort
|ExtraDix is an extension of RadixSort, hence the name Extended raDix. This algorithm simulates punched card sorting machines with 256 buckets. For any of the basic data types a variation is implemented. ExtraDix works stable and indirectly.
The time necessary for a sort follows the function . N is again the maximum number of different key values. The order of the records before the sort has no influence whatsoever on the time necessary, therefor there is no best- nor worst-case.
The memory needs are calculated by 4kByte + 24 * n which is pretty moderate.
The complete C source code of the algorithm can be downloaded from SourceForge under the keyword ExtraDix.
Link to a detailed description of ExtraDix