Optimizing Sort Algorithms for the PS3 Part 3 (Merge Sort)
11 July 2011
Sorting is a simple but important concept for implementing games. For example, sorting transparent objects before rendering or sorting objects by state attribute (e.g. texture) to improve batching. Because of the way most algorithms work (comparing and swapping pairs of items) sorting often takes precious time. With multi-core architectures like the PS3 it makes sense to parallelize this operation to maximize the use of these cores.
In the first two parts we have implemented the quicksort and merge algorithms, optimized them and and compared their performance to the STL versions. In this part we will do the same with merge sort.
Quick Links == Part 1 (Quicksort)
Part 5 (Parallel Sort on 2 SPUs)
The entire series is also available as a single PDF document.
Merge Sort == Like quicksort, this sort also takes O(n * log n) time on average but, unlike quicksort, requires a temporary array whose size is equal to the array to be sorted. On the positive side and also unlike quicksort, this sort is stable. This means that items that are equal will be kept in the same order. This can be useful when sorting values of user-defined types such as structs. In addition, its performance doesn't degrade to O(n^2) with already sorted arrays liked quicksort does.
The divide-and-conquer approach of this algorithm makes it easy to implement as a recursive function. It goes like this: the array is divided into two equal parts and the function recursively calls itself on each half of the array. When both halves of the array are sorted, they are merged together to the temporary array. After the contents of this array is copied back to the input array the latter is sorted (see code below).
template<typename T> void mergeSort(T *data, T *temp, size_t start, size_t end) { size_t count = end - start + 1; if(count < 2) return; // divide the array to sort into two parts and sort them size_t mid = (start + end) / 2; mergeSort(data, temp, start, mid); mergeSort(data, temp, mid + 1, end); // merge two parts (data) into one (temp) merge(temp + start, data, start, mid, data, mid + 1, end); // copy one part from temp to data __builtin_memcpy(data + start, temp + start, (end - start + 1) * sizeof(T)); }
This naive recursive implementation takes 29.5 ms to sort 65k floats and 19.5 ms for sorting 65K integers. As a comparison std::stable_sort takes 31.8 ms and 20 ms, respectively. The same "small array" optimisation can be used. When sorting arrays of size 8 and less with insertion sort, we get 21.2 ms for floats and 11.4 ms for integers.
The iterative merge algorithm is a bit different from the recursive one. It consists of several passes with an increasing block size m. For the first pass (m = 16), each block of m items is sorted using the insertion sort. Then pairs of adjacent blocks are merged together, i.e. blocks 0 and 1 will be merge into block 01, then blocks 2 and 3 into block 23, then 3 and 4 into block 34 and so on. With each pass the value of m doubles. This means that in the second pass block 01 is merged with block 23, block 45 with block 67 and so on:
template<typename T> void mergeSortIterative(T *data, T *temp, size_t start, size_t end) { size_t count = end - start + 1; size_t m = 16; // this algorithm requires that the source array can be divided into // two chunks, one being as least m in size if(count < (m + 1)) { insertionSort(data, start, end); return; } // sort chunks of m = 16 items size_t i = start; for(; i <= (end - m + 1); i += m) sort16(data, i); if(i < end) insertionSort(data, i, end); Merge<T> merge; T *src = temp, *dst = data, *swap; // with each merge pass the chunks double in size for(; m < count; m *= 2) { // alternate between merging data to temp and temp to data // this is to avoid copying temp to data every time swap = src; src = dst; dst = swap; // merge 2 * N chunks of m items into N chunks of 2 * m items mergeIterative(src, dst, start, end, m); // now all chunks of size 2 * m are sorted } // make sure that the sorted data ends up in the 'data' array if(dst == temp) __builtin_memcpy(data + start, temp + start, count * sizeof(T)); } template<typename T> void mergeIterative(T *data, T *temp, size_t start, size_t end, size_t m) { // merge 2 * N chunks of m items into N chunks of 2 * m items size_t i = start; for(; i <= (end - m); i += (2 * m)) { size_t endChunk = min(2 * m - 1, end - i); T *src = (T *)(data + i); T *dst = (T *)(temp + i); mergeImpl(dst, src, 0, m - 1, src, m, endChunk); } // copy trailing items that make the array not divisible by 2 * m for(; i <= end; i++) temp[i] = data[i]; } template<typename T> void sort16(T *data, size_t start) { insertionSort(data, start, start + 15); }
With this approach it takes 22.1 ms to sort the float array and 10.7 ms for the integer array which is no improvement over the recursive version. With the iterative approach, unlike with the recursive approach, we can control the size of array chunks that will be sorted using the simpler sort. We also control the alignment of arrays to be merged. This will be useful for vectorizing parts of our implementation and mitigates the additional complexity.
Vectorizing == Our implementation so far processes items one at a time and has a lot of branching. As we have seen in the previous part, vectorizing our code using AltiVec SIMD instructions can improve on both counts. Consider the function sortVec4 below which sorts a vector containing four items without any branching:
// T is either vec_float4 (4 floats) or vec_int4 (4 ints) template<typename T> inline void sortVec4(T *a) { // va = [x, y, x, w] T va = *a; T v1 = vec_perm(va, va, PERM(VA, VA, VC, VC)); T v2 = vec_perm(va, va, PERM(VB, VB, VD, VD)); // v3.x = [min(v1.x, v2.x), min(v1.y, v2.y), ..., ...] T v3 = vec_min(v1, v2); T v4 = vec_max(v1, v2); // X = v3.x, Y = v3.y and so on with Z, W. Similarly A = v4.x, B = v4.y ... // v1 = [v3.x, v4.x, v3.y, v3.w] v1 = vec_perm(v3, v4, PERM(VX, VA, VY, VW)); v2 = vec_perm(v3, v4, PERM(VZ, VC, VB, VD)); v3 = vec_min(v1, v2); v4 = vec_max(v1, v2); v1 = vec_perm(v3, v4, PERM(VX, VA, VY, VB)); v2 = vec_perm(v3, v4, PERM(VX, VY, VA, VB)); v3 = vec_min(v1, v2); v4 = vec_max(v1, v2); *a = vec_perm(v3, v4, PERM(VX, VY, VC, VD)); }
Using this function and the mergeVec4 function seen in the previous part we can create a sortVec8 function. sortVec8 sorts two vectors using sortVec4 and merges them using mergeVec4. Similarly with sortVec8 and mergeVec8 (adapted from [1]) we can create a sortVec16 function. This function can be used to specialize the sort16 function for float and integer items; the plain insertion sort will be used for sorting other types of arrays.
Results ==
64K floats
64K integers
std::stable_sort
31.8 ms
20 ms
Recursive merge sort
29.5 ms
19.5 ms
Recursive merge sort, small array (N <= 8)
21.2 ms
11.4 ms
Iterative merge sort, small array (N = 16)
22.1 ms
10.7 ms
Same as above, but sort16 is defined as sortVec16
16.6 ms
8.7 ms
Same as above, but merge is vectorized (mergeVec4)
10.8 ms
9.6 ms
Specializing sort16 gives us some improvements over using insertion sort for small arrays: we get 16.6 ms for float arrays and 8.7 ms for integer arrays. Finally, if we use the vectorized merge implementation described in the previous part, we get 10.8 ms for float arrays and 9.6 ms for integer arrays. This is a big speed-up for sorting floats but a regression for sorting integers.
This is not surprising since we observed the same kind of slowdown when using the vectorized merge on integer arrays. With template specialization we can easily choose to use the vectorized variant only for float arrays. Also, as we will see this won't be an issue when offloading to SPUs.
In the next part in the series we will see how to run some of the code we have presented in this series on a SPU using Offload.
References == [1] http://seven-degrees-of-freedom.blogspot.com/2010/07/question-of-sorts.html