RadixSort
Subject classification: this is an information technology resource. |
RadixSort was developed in 1887 by Herman Hollerith and was used for sorting numbered punched cards. The algorithm is [w:numerical stability|stable] and fast.
How RadixSort works[edit]
BucketSort in its natural variant is the predecessor of RadixSort. If there were for example bills to be sorted by date, one started by building stacks with bills from the same year. Lets assume there were bills from ten years so we have ten stacks with bills now. Then each of these stacks was subdivided into twelve stacks representing the month (we have 120 stack now on the table). Now each of them was subdivided into 30/31 stacks representing the days of the months. We end up with about 3.600 stacks that have to be recollected in correct order (nobody will ever do it in this way; it shall only show the complexity).
The brilliant idea of Hollerith was to reverse the sorting order. In order to show what his idea was, we sort 20 three digit numbers, presented here.
387 310 529 174 011 227 020 901 522 771 993 348 974 213 226 004 918 098 107 109
In the first run the numbers are sorted by the last digit; since there are 10 different digits we use ten buckets. The result of this sort we have here:
B-0 B-1 B-2 B-3 B-4 B-5 B-6 B-7 B-8 B-9 -------------------------------------------------------------------------- 310 011 522 993 174 226 387 348 529 020 901 213 974 227 918 109 771 004 107 098
Now we recollect the numbers from the buckets without changing their order.
310 020 011 901 771 522 993 213 174 974 004 226 387 227 107 348 918 098 529 109
Sorting by the middle digit gives this result:
B-0 B-1 B-2 B-3 B-4 B-5 B-6 B-7 B-8 B-9 -------------------------------------------------------------------------- 901 310 020 348 771 387 993 004 011 522 174 098 107 213 226 974 109 918 227 529
It is obvious that the content of the buckets is still sorted ascending by the last digit. We recollect again the content of the buckets without changing their order.
901 004 107 109 310 011 213 918 020 522 226 227 529 348 771 174 974 387 993 098
The third and last run gives this distribution to the buckets:
B-0 B-1 B-2 B-3 B-4 B-5 B-6 B-7 B-8 B-9 -------------------------------------------------------------------------- 004 107 213 310 522 771 901 011 109 226 348 529 918 020 174 227 387 974 098 993
After recollecting the numbers the sort is finished.
004 011 020 098 107 109 174 213 226 227 310 348 387 522 529 771 901 918 974 993
The result of Hollerith idea was, that one only needs as many buckets as there are different symbols. Instead of one thousand buckets for a naive BucketSort a set of ten buckets is used three times in order to get the same result.
But what exactly is the trick for this enormous reduction? We have a look at the stack of numbers after the first recollection. We see in the stack a layer with numbers ending on 0, followed by a layer ending on 1 and so on. After the second recollection we have a layer where the second last digit is 0 and this layer could have 10 sub-layers with the last digit being 0, 1 etc. The trick of Hollerith was to substitute the sub-buckets by sub-layers which are automatically in perfect order all the time. And there is no effort necessary for administration of sub-layers. If a sub-layer exists, fine, if not, fine as well; it is not necessary for the algorithm to care about it.
And we see directly the reason why RadixSort is stable. When a layer is divided into sub-layers, the order of the data records is not changed. Maybe they do not have the same neighbors any more, but the position numbers (from the original stack) are still ascending.
We go back to the example with the bills. We want them to be sorted by year, within the years sorted by month and within the months by day. Since RadixSort is stable it becomes very easy. We first sort (in two runs) by days, then we sort (in two runs) by month and at the end (in two or four runs) by year. Sorting done!
Timing[edit]
It is obvious that any digit of any key is only processed once. Hence the number of key handlings is
- A = n * d
where n is the number of data records and d is the maximum number of digits in the key. Since the processing of any digit of any key takes the same time RadixSort has a linear characteristic.
In order to be able to compare RadixSort with other algorithms it is better to describe the characteristic by
- T = B * n * lg_{256}(N)
where B is a constant reflecting the implementation and the term lg_{256}(N) gives the number of repetitions where N is the maximum number of different key-values.