ExtraDix
Subject classification: this is an information technology resource. |
The sorting algorithm ExtraDix is an extension of the well-known RadixSort (hence its name Extended raDix). It is not based on comparing key values, it is stable and sorts indirectly by first calculating the correct sorting order and only then moving the data records. The algorithm is very fast, there is no best case nor worst case as regards timing, the time necessary for sorting increases directly and in linear fashion with the number of records and the algorithm can be used for all basic data types. A pseudo-code and a C-version were put under GPL and can be downloaded from SourceForge.
How it works[edit]
RadixSort was originally used to sort punched cards. When sorting punched cards RadixSort uses 10 different symbols, the numbers from 0 to 9. The punched card sorting machine was first adjusted to sort by the least significant digit and would distribute the punched cards into the 10 appropriate buckets numbered 0 to 9. Into each bucket went all the cards having that digit at this particular position in the key - in their original order. After finishing the first pass, the complete stack was collected again starting with bucket number zero and keeping the cards in the order in which they were put into the bucket. Then the sorting machine was adjusted to sort on the second last digit and the sorting was repeated. This adjusting, sorting and collecting was repeated until all the punched cards were in correct sorting order.
ExtraDix works with the 256 different symbols that can be represented by a single byte.
ExtraDix produces a 'logical' sort simply because the interpretation of such a byte is done using a list (such as the extended ASCII-code when dealing with strings) and the sorting of this list follows directly the binary numbering. For instance, when sorting strings, if the actual byte consists of the digit zero, then the record goes into bucket number 48; if it is an 'A', the record goes into bucket 65. The algorithm does not take into account what might be meant by the contents, it is simply a number between zero and 255. The same applies to the bytes of long integers or float numbers.
ExtraDix eliminates the difference between data and address since the value of the actual byte in the key is automatically also the address of the destination bucket.
Normally the sorting key will consist of more than one byte. When processing just one byte of the keys of all records this is called a sorting sequence and processing all the bytes of all the keys is called a sorting process. When sorting by unsigned integer, which is normally composed of four bytes, there will be four sorting sequence for the whole sorting process.
The simulated sorting machine[edit]
For the realisation of the algorithm a sorting machine with 256 buckets is simulated. Since the data records have to be collected exactly in the order they were distributed to the buckets, FIFOs are used.
If, for example, 1.000 records were to be sorted, it would be wasteful to format 256 FIFOs with 1.000 elements in each; since one can not know beforehand how many records will go into any particular bucket, one would have to allow for the worst posibility for every one of them. This problem is eliminated by using a central administration for all FIFOs and they share the space for storing 1.000 elements. When a sorting process is started, memory for storing all the necessary elements is requested from the operating system.
Originally in Radixsort, for any byte of the key the complete data record / punched card was moved. With random access to the input data this is not necessary any more. It is enough to store the position of the record in the FIFO. In which FIFO the record number has to go is simply determined by the value of the byte in the key at the actual sorting position.
Internally the FIFOs are organized as linked lists and the administration of these lists is done by help of an array in which pointers to the first element and to the last element of the individual FIFOs are stored. The content of the FIFO elements are the record numbers of the input data and a pointer to the next element within the same FIFO. In this way a very simple and fast handling of the FIFOs is achieved.
With Radixsort after each sorting sequence all the data records are collected in order to form a new stack that will feed into the next sorting sequence. Since for a simulated sorting machine only a relatively small amount of memory is needed, ExtraDix works with two sets of FIFOs. When collecting the references to the data records from one set of FIFOs, this information is directly sorted into the other set of FIFOs. These sets of FIFOs, representing the buckets, are used alternately.
Sorting unsigned integers[edit]
On Personal Computers unsigned integers are represented by four bytes; the first byte is the least significant and the fourth byte has the highest significance. ExtraDix starts with the least significant byte, i.e. the first one.
When starting a sorting process all the references to the data records are stored in the first FIFO of the second set, making the data in their original order available for the first sorting sequence.
In the first sorting sequence the first byte of every key is processed. During such a sequence all the FIFOs of the other set are emptied, starting with FIFO number zero up to FIFO number 255. When reading from a FIFO we get the number of the data record in the input data. Using this number, a pointer to the key in the data record is computed and the content of the byte at the sorting position gives the number of the FIFO where the reference to the data record shall go, which is identical with the index in the FIFO-administration.
After three more sorting sequences the correct sort is known. Now the references to the data records have to be collected from the FIFOs by reading the record numbers from them; moving the data records in this order to the target array gives the desired result. With a little bit more effort (two integer arrays to remember the new position when a data record had to be swapped because its space was needed to store a result) the records can be sorted at the original location as well.
Every byte of the sorting keys is processed exactly once. The time needed for processing each of these bytes is exactly the same for any other record. So the time needed for a sorting process increases in a linear fashion with the number of records to be sorted.
Sorting unsigned integer with one, two or eight bytes is done in a similar fashion.
Sorting signed integers[edit]
When sorting signed integers the sorting process has to be modified slightly. For the bytes without a sign bit everything stays as before. When reaching the highest byte a little trick is used. We add 128 to this byte and ignore the carry. The result is that the most negative number is pushed up to zero and the positive numbers start at 128; the sorting order is preserved and thats all we need.
Timewise there is no difference between sorting signed and unsigned integers.
Sorting byte sequences[edit]
The convention with byte sequences and strings is that the first byte is the most significant one. ExtraDix starts sorting with the least significant byte which is the last one. Al sequences have to have the same length.
Again every byte of the keys is only handled once and the time needed increases in a linear fashion equal to the number of records to be sorted.
Sorting strings[edit]
In C/C++ a string is terminated with a binary zero; behind this byte the rest of the byte sequence could be filled with random data. If strings are sorted with the function for sorting sequences, it could happen that the sorting appears unstable; this is due to the fact that the records were sorted by random data.
When storing strings like names in a table, the space for storing them is normally much bigger than the average length of the strings. When using the the function for sorting by sequences for sorting strings, a lot of time is used for handling random data in the trailing bytes of the string. In order to speed up the sorting of strings an additional variant was implemented which can be used for strings up to 255 characters.
All the data records are first sorted by length. For this an additional register is used. During the sort, before reading from the other register, it is checked whether there are strings which now have to be sorted by their last character. If so, the references to these records are sorted to the other register, before the actual register is emptied. By doing it in this way it is simulated that these records had been sorted by zero-bytes during all the previous sequences; the stability is preserved.
Lets say the space for containing the strings is 24 bytes long and the average string is only 6 bytes long. In this case 75% of the sorting operations were senseless when using the method for byte sequences. In the variant for strings there is one sorting sequence for sorting by length and the other activities sum up as if all the strings had the average length. If the space for storing strings is much bigger than the average string length, the difference can be considerable.
Sorting floating-point and double numbers[edit]
In IEEE P754 a float with 32 bits is defined as follows
MMMMMMMM MMMMMMMM MMMMMMME EEEEEEES +--------+--------+--------+--------+ 0 8 16 24 31
where M stands for mantissa, E for exponent and S for the sign-bit. And a double with 64 bits has this definition:
MMMMMMMM MMMMMMMM MMMMMMMM MMMMMMMM MMMMMMMM MMMMMMMM MMMMEEEE EEEEEEES +--------+--------+--------+--------+--------+--------+--------+--------+ 0 8 16 24 32 40 48 56 63
When sorting integers it was only necessary to consider the sign-bit when handling the most significant byte. Since the numerical representation of floats and doubles differs, the sign bit has to be taken into account with every byte of a key.
On the other hand it is not necessary to distinguish between mantissa and exponent. This is because the mantissa is normalized; the decimal equivalent is to move the point to the left -and increasing the exponent- until one has something like 0.xxxx * 10^{yy}. If we make a bitwise BucketSort starting with the least significant bit, the actual bit will always have a higher significance than the bit before, no matter whether it is a bit of the mantissa or of the exponent. The only exception is again the sign-bit. Normally a binary 1 is bigger that a binary 0, but with the sign-bit it is different. Therefore the byte with the highest significance has to be handled twice. Read on for an explanation as to why.
Sorting floats[edit]
First we have four sorting sequences. In each of these sequences the sign of a number is evaluated; if the sign-bit is 1, the numbering of the buckets is inverted. When handling byte number four the sort is done including the sign-bit but its influence can be ignored. Afterwards we have a sort by the sign-bit. For sorting by 32-bit floats we need 5 sorting sequences.
Sorting doubles[edit]
First we have eight sorting sequences. In each of these sequences the sign of the number is evaluated; if the sign-bit is 1, then the numbering of the buckets is inverted. When handling byte number eight the sort is done including the sign-bit but its influence can be ignored. Afterwards we have a sort by the sign-bit. For sorting by 64-bit doubles we need 9 sorting sequences.
Memory required[edit]
For the sorting process two FIFO sets are needed. Each FIFO administration has exactly 256 elements and each element has two integers, namely the pointers to the first and to the last entry. Normally integers have 4 bytes so that each FIFO administration needs 2k Bytes and for two FIFO administration 4k Bytes are necessary, no matter how many data records have to be sorted.
Every element of a FIFO consists of the original position of the data record and the link to the next element of the same FIFO. Any of these elements has 8 bytes; since two sets of FIFOs are used the necessary memory comes to S = n * 16 bytes where n is the number of data records to be sorted.
If the result of a sort has to be stored at the same location as the data input, then two integer arrays of length n are needed to remember where a record was stored when its space was needed for the result, and space for intermediate storage of one data record.
In total the memory needs of ExtraDix are aproximately S = 4k + n * 24, which is moderate.
Time needed to sort data with ExtraDix[edit]
Before every sort sequence the FIFO-administration has to be reset by storing 512 negative values. In addition, at the end of each sort sequence, all the 256 FIFOs have to be checked. When there are only few records to be sorted these procedures take up a relatively large portion of the total time needed for the sort.
However, when large numbers of records are to be sorted, this "overhead cost" becomes negligable, because the actual time of this overhead component is constant regardless of the number of records to be sorted.
Normally each byte in any sorting key is only processed once; if the byte contains a sign or part of a mantissa or an exponent, it has to be processed once for any of these aspects. Additionally a few address calculations are necessary and some values have to be changed for linking the elements within the FIFOs.
For any byte of any key only two conditions have to be checked plus some trivial computations made, and thus only a few processor cycles are necessary per key byte, which in turn means that the processing of the keys is very fast. The processing time is the same for every key byte so that the time for sorting increases solely in line with the number of records to be sorted.
The time necessary to process one byte of the key does not change with the content of the individual records. This means that there is no best case and no worst case. All sorts at all times take exactly the same time for the same type of data.
Comparing ExtraDix with Quicksort[edit]
In order to compare different algorithms for sorting data, various texts were copied until a length of one million words was reached. A program was then written which produced a set of one million data records. Each record consisted of the word itself, its length, the position of the word in the text, and long, long long, float and double numbers were generated by bit shiftings thus generating what appeared to be random numbers. The data records had a length of 54 bytes and the word itself was stored in a string of 25 bytes. The computer used was a normal PC running at 1.8 GHz; the table with the test data was loaded into its working memory before starting the sort.
To make a useful comparison, the Quicksort algorithm was used in an implementation in which the pivot element was always the last element in a scope. Sometimes this decision produces very long response times; in the following diagrams this effect was eliminated by interpolation so that the result obtained using Quicksort looks somewhat better than would normally be obtained.
In the next two diagrams the linear increase of sorting time with the number of records of ExtraDix can be seen clearly. The increase of sorting time with Quicksort follows the function f(n) = n * ln(n) so that the difference in time between ExtraDix and Quicksort becomes bigger and bigger as more and more records are sorted.
From the above it can be seen that using ExtraDix does make sense. However, this leaves open the question at what point it is more efficient to use ExtraDix. The next two diagrams show what happened when 2.000 data records were sorted; the time was measured in microseconds. From a few hundred data records onwards the administrative overhead for the FIFO routine was negligable. This shows that ExtraDix under normal conditions acts as a good general purpose sorting algorithm.
Calling the ExtraDix function[edit]
The subroutine ExtraDix is re-entrant; when it is made part of a shared library, the code can be called by any desired number of routines at the same time without mutual interaction.
If ExtraDix becomes part of a library, its source code is compiled before the structure of the data records to be sorted is known. Because of this, the function call had to be designed in such a way that it would be suitable for any type of sort. The function call has only one parameter i.e. the pointer to a structure of the type ED_SortInfo. In this structure all the information needed by ExtraDix to perform a sort is stored. The structure is built as follows:
struct ED_SortInfo { int SortOrder; // ascending or descending int ColType; // the type of column int ColStart; // where the column to be sorted starts int ColWidth; // the width of the column int NumLines; // how many lines there are in the table int LineSize; // the width of the lines void* Source; // the pointer to the data input void* Dest; // where to store the result };
The data records to be sorted are stored in the working memory at the position given by Source. Data records normally have their own structures but the types of these records can not be known by ExtraDix. As a neutral solution a pointer to void is used in order to point to the start of data. The data records have the length of LineSize and in total NumLines shall be sorted. Adapting Source and NumLines makes it possible to sort parts of a list.
Since ExtraDix does not care about the content of the data record, it is sufficient to know where in the record the sorting key is and which variant of the algorithm has to be used. The variant is selected by help of ColType. Possible variants are strings, unsigned integer, signed integer, floating point with single and double precision. Since there are different interpretations depending on the processor, the used compiler and the programming language, it is necessary to tell ExtraDix where in the record the key starts and how many bytes it has. ExtraDix has no problem at all with 7 byte signed integers but floats must have 4 or 8 bytes.
Through SortOrder, which can have the values ED_ASCEND or ED_DESCEND, is selected whether the smallest key shall to be at the top of the resulting list or at the bottom.
The last parameter of the structure is a pointer to the position in the working memory, where the result is to be stored. If this pointer has the value NULL the records are sorted within the original data space. If Dest is different from NULL, this is the address where the result will go. It is the responsibility of the user to make sure that there are no conflicts when storing the result.
The information about where in the data record the key is found is a little bit complicated. This is due to the fact that different processors do not permit the storage of any data type at any memory location. For example integer values with 4 bytes have to start at addresses that can be divided by 4 without a remainder; if the address does not meet this condition, the process will be terminated.
In order to prevent this from happening, the compiler adds additional bytes to the structure and each compiler does this differently. This insertion of filling bytes is called padding. To make sure that programs using ExtraDix will run without changes on different platforms, the parameters of the structure should be computed by the compiler. Examples for how to do this can be found in the test suite.
Tip: When transferring data between platforms you have to make sure that the same padding is used. Mostly it should be sufficient to compare the length of the structures.
Dual license[edit]
The way ExtraDix works is explained both in pseudo-code and in a translation to a C-program. Both were published under GPL and are available as a download from SourceForge.
The author of a novel does not lose his intellectual property when somebody copies his work and renames the main characters or when the work is translated to a foreign language. In the same way the intellectual property on ExtraDix is seen: Any translation of the pseudo-code to a natural language or a programming language (including the compilates) has to be put under GPL, if copies are distributed.
When programs are linked with ExtraDix by embedding the source code, linking the program with static / dynamic libraries containing ExtraDix or using server processes or operating systems exporting the functionality of ExtraDix, this program must only be used by the programmer himself or within his organisation. If copies of the program are distributed to third parties, it has to be under the GPL or there must be an individual license with the author of ExtraDix. Any other use is not legal.