4bb735f7966eedf414aa8e24533967cffd5bbcdb
15 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c1d8e8da71 |
Optimize vector quantization step
This change improves the compression speed for both DXT and ETC encodings.
Explanation:
The vector quantization algorithm takes floating point vectors as input and performs vector preprocessing right before the quantization. At the same time, selector training vectors are generated directly from integer selector values, packed into a single uint64. It is therefore more efficient to preprocess the selector training vectors (which includes sorting and deduplication) while they are still in packed form. An additional performance boost is achieved by using multiple threads for sorting the training vectors.
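The packed preprocessing described above can be illustrated with a minimal single-threaded sketch. The function name `dedup_packed_selectors` and the (value, weight) output layout are assumptions made for illustration, not the actual Crunch code; the multithreaded sort mentioned in the commit is not shown.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not from Crunch): collapses packed selector vectors
// into (packed value, occurrence count) pairs. Each std::uint64_t is assumed
// to hold the selectors of one block in packed form.
std::vector<std::pair<std::uint64_t, std::uint32_t>> dedup_packed_selectors(
    std::vector<std::uint64_t> packed) {
  // Sorting integers is much cheaper than sorting unpacked float vectors.
  std::sort(packed.begin(), packed.end());

  std::vector<std::pair<std::uint64_t, std::uint32_t>> unique;
  for (std::uint64_t value : packed) {
    if (!unique.empty() && unique.back().first == value)
      ++unique.back().second;  // duplicate: only increase the weight
    else
      unique.emplace_back(value, 1u);
  }
  return unique;
}
```

Only the deduplicated values would then need to be unpacked into floating point vectors, with the occurrence count serving as the vector weight.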
DXT Testing:
The modified algorithm has been tested on the Kodak test set using a 64-bit build with default settings (running on Windows 10, i7-4790, 3.6GHz). All the decompressed test images are identical to the images compressed and decompressed using the original version of Crunch (revision
|
||
|
|
21eb70bc10 |
Optimize tile computation step
This change improves the compression speed for both DXT and ETC encodings.
Explanation:
In the tile computation step, pixels within the tiling area are palettized using a general purpose tree clusterization algorithm. At the same time, clusterization of the tile pixels is always performed with the following restrictions: the maximum number of palettized pixels is 64, the maximum number of clusters is 2. The performance can therefore be improved by solving the palettizing task with a specialized version of the tree clusterizer, which does not maintain the tree structure and uses constant memory.
DXT Testing:
The modified algorithm has been tested on the Kodak test set using a 64-bit build with default settings (running on Windows 10, i7-4790, 3.6GHz). All the decompressed test images are identical to the images compressed and decompressed using the original version of Crunch (revision
|
||
|
|
dbbef6a21f |
Perform multithreaded node preprocessing for faster vector quantization
This change significantly improves the compression speed for both DXT and ETC encodings.
Explanation:
On each iteration of the vector quantization algorithm, the leaf with the highest variance is selected for splitting. If the leaf gets split, two new leaves are created (while leaves that cannot be split are ignored in future iterations). There does not seem to be any simple way to compute or reliably predict the variances of the future leaves in advance, which means that there is no simple way to efficiently perform split operations in parallel.
There is, however, a useful observation. Even though the order of the split operations depends on the previous iterations, the split operations performed in different subtrees are completely independent. So instead of solving the main quantization task directly, we can first solve an alternative quantization task, which has a lot in common with the main task but can be efficiently parallelized. The intermediate computation results of the alternative solution can then be reused when solving the main task. Specifically, the idea is to efficiently compute an alternative split tree, which is more or less balanced and has approximately the same number of nodes as the main tree. The overlapping part of the main and alternative trees can then be reused while solving the main quantization task.
In order to achieve this, the initial root is first split normally until the number of splittable leaves reaches the number of available threads. Then each leaf is split in a separate thread, while the maximum number of split iterations for each subtree is defined as the maximum number of split iterations for the whole main tree divided by the number of used threads. This way the total number of nodes in the alternative tree will be approximately the same as the number of nodes in the main tree.
Note that in general, the alternative tree does not match the main tree, so some nodes of the alternative tree will never be reused. In practice, however, the proportion of such unnecessarily precomputed nodes is small. And considering that the nodes of the alternative tree are precomputed in parallel using multiple threads, in most cases the overall performance is significantly improved.
DXT Testing:
The modified algorithm has been tested on the Kodak test set using a 64-bit build with default settings (running on Windows 10, i7-4790, 3.6GHz). All the decompressed test images are identical to the images compressed and decompressed using the original version of Crunch (revision
|
||
|
|
1028520280 |
Use multiple threads for node split in vector quantization
This change improves the compression speed for both DXT and ETC encodings.
Explanation:
During the node split iteration, identical computations are performed for all the vectors of the split node. The overall performance can be improved by performing independent computations in separate threads. To avoid possible threading overhead, on each iteration the number of threads is selected so that each thread processes at least 512 vectors.
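The thread count rule above can be sketched as follows. The name `choose_split_thread_count` and the exact rounding are assumptions for illustration; only the 512-vector minimum comes from the commit message.

```cpp
#include <algorithm>
#include <cstddef>

// Minimum amount of work per thread, as stated in the commit message.
constexpr std::size_t kMinVectorsPerThread = 512;

// Hypothetical helper (not from Crunch): picks the number of threads for one
// node split so that every thread gets at least kMinVectorsPerThread vectors.
// Small nodes fall back to a single thread.
std::size_t choose_split_thread_count(std::size_t vector_count,
                                      std::size_t max_threads) {
  const std::size_t by_work = vector_count / kMinVectorsPerThread;
  return std::max<std::size_t>(1, std::min(by_work, max_threads));
}
```

With 8 available threads, a node of 1024 vectors would be split using 2 threads under this rule, while a node of 100 vectors would be processed on a single thread.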
DXT Testing:
The modified algorithm has been tested on the Kodak test set using a 64-bit build with default settings (running on Windows 10, i7-4790, 3.6GHz). All the decompressed test images are identical to the images compressed and decompressed using the original version of Crunch (revision
|
||
|
|
fbe3f6ca10 |
Optimize vector quantization algorithm
This change improves the compression speed for both DXT and ETC encodings.
Explanation:
On each iteration of the vector quantization algorithm, the leaf with the highest variance is selected for splitting. At the same time, each split operation adds at most 2 new leaves. Considering this, the search for the leaf with the highest variance can be performed more efficiently if all the leaves are stored in a priority queue (in order to guarantee that texture decompression gives a result identical to the original version of Crunch, the node comparison operation also takes the node index into account).
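A minimal sketch of such a leaf queue is shown below. The type names are hypothetical, and the tie-breaking direction (the lower node index wins when variances are equal) is an assumption; the commit message only states that the node index takes part in the comparison.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical queue entry (not from Crunch).
struct LeafEntry {
  float variance;
  std::uint32_t node_index;
};

// Orders leaves by variance; ties are broken by node index so that the pop
// order is deterministic. ASSUMPTION: the lower index is preferred on ties.
struct LeafLess {
  bool operator()(const LeafEntry& a, const LeafEntry& b) const {
    if (a.variance != b.variance) return a.variance < b.variance;
    return a.node_index > b.node_index;
  }
};

using LeafQueue =
    std::priority_queue<LeafEntry, std::vector<LeafEntry>, LeafLess>;

// Returns the node indices in the order in which they would be split.
std::vector<std::uint32_t> pop_order(LeafQueue queue) {
  std::vector<std::uint32_t> order;
  while (!queue.empty()) {
    order.push_back(queue.top().node_index);
    queue.pop();
  }
  return order;
}
```

Each split then costs one pop and at most two pushes, which is logarithmic in the number of leaves, instead of a linear scan over all leaves.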
DXT Testing:
The modified algorithm has been tested on the Kodak test set using a 64-bit build with default settings (running on Windows 10, i7-4790, 3.6GHz). All the decompressed test images are identical to the images compressed and decompressed using the original version of Crunch (revision
|
||
|
|
11a89d25ed |
Optimize vector quantization algorithm
This change improves the compression speed for both DXT and ETC encodings.
Explanation:
When a node is split during the quantization step, all of its vectors are divided between the child nodes, and new memory is allocated to store each new set of vectors. At the same time, the set of vectors of the parent node is no longer accessed after the split. Considering that the sets of vectors of the child nodes do not intersect, it is possible to reuse the memory allocated for the parent set of vectors to store the child sets of vectors.
This can be achieved in the following way. All the source vectors are initially stored in an array. Let's assume that it is possible to reorder this common array of vectors so that the vectors of each node form a contiguous block within this array. Then it is sufficient to store only two indices for each node (pointing to the first and to the last node vectors in the common array of vectors) in order to describe the complete set of vectors of this node. This assumption is correct for the root node, which has initial vector indices pointing to the first and to the last elements of the complete vector array.
When a node is split, let's reorder its vectors (stored in a contiguous block within the common array of vectors) so that the vectors of the left child node are put in front, followed by the vectors of the right child node (the indices of the first and last vectors of the child nodes should be set accordingly). This way each child node will also have its vectors stored in a contiguous block within the common array of vectors, defined by two indices, and the split iteration can be repeated. Note that the memory used to store the sets of vectors for all the nodes now needs to be allocated only once.
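The in-place split can be sketched with `std::partition`. The names `NodeRange` and `split_node` are hypothetical, and the sketch uses a half-open range `[begin, end)` rather than the first/last index pair described in the commit message. Note also that `std::partition` does not preserve the relative order of the vectors, which may or may not match what the actual implementation does.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical node descriptor (not from Crunch): a half-open range
// [begin, end) into the shared array of vectors.
struct NodeRange {
  std::size_t begin;
  std::size_t end;
};

// Reorders the block owned by `node` so that the vectors of the left child
// come first, then returns the ranges of the two children. No memory is
// allocated: the children reuse the block of the parent.
template <typename Vec, typename Pred>
std::pair<NodeRange, NodeRange> split_node(std::vector<Vec>& vectors,
                                           NodeRange node, Pred goes_left) {
  auto first = vectors.begin() + static_cast<std::ptrdiff_t>(node.begin);
  auto last = vectors.begin() + static_cast<std::ptrdiff_t>(node.end);
  auto mid = std::partition(first, last, goes_left);
  const std::size_t m = static_cast<std::size_t>(mid - vectors.begin());
  return {NodeRange{node.begin, m}, NodeRange{m, node.end}};
}
```

Because each child again owns a contiguous block, the same function can be applied to either child without touching the vectors that belong to other nodes.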
DXT Testing:
The modified algorithm has been tested on the Kodak test set using a 64-bit build with default settings (running on Windows 10, i7-4790, 3.6GHz). All the decompressed test images are identical to the images compressed and decompressed using the original version of Crunch (revision
|
||
|
|
a14a313361 |
Optimize color endpoint solution evaluation
This change improves the compression speed for DXT encoding.
Explanation:
In order to evaluate an endpoint solution, it is necessary to compute the sum of the squared distances from the source pixels to their nearest block colors, defined by the evaluated endpoint solution. Such computation is quite complicated, so before it is performed, we can compute the sum of the squared distances from the source pixels to the axis-aligned bounding box enclosing all the evaluated block colors (if the source pixel appears to be inside the AABB of the evaluated solution, then the distance is considered to be 0). If the sum of the squared distances to the AABB of the current solution is already bigger than the sum of the squared distances computed for the previously found best solution, then the current solution does not need to be evaluated.
The actual trick here is that the sum of the squared distances to the AABB of the current solution can be computed in constant time using the following approach. The sums of the squared distances for each color component can be computed separately. For each color component the AABB determines 2 planes: the "lower" plane, defined by the lower boundary of the AABB, and the "upper" plane, defined by the upper boundary of the AABB. The sum for each color component is combined from two parts: the sum of the squared distances from the lower plane to all the source pixels which are below the lower plane, and the sum of the squared distances from the upper plane to all the source pixels which are above the upper plane. Considering that the endpoints of the evaluated solution are encoded as RGB565, there are 32 possible planes for the red and blue components, and 64 possible planes for the green component. For each plane it is sufficient to precompute the following two values: the sum of the squared distances from the plane to all the source pixels which are "below" this plane, and the sum of the squared distances from the plane to all the source pixels which are "above" this plane. The total sum of the squared distances from the source pixels to any evaluated AABB can then be represented as a sum of 6 precomputed values, while all the used values can be precomputed in linear time with dynamic programming.
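The per-plane precomputation for a single color component can be sketched as follows. This is an illustration under stated assumptions, not the Crunch implementation: the names are hypothetical, planes are indexed by their 8-bit value (the 32 or 64 RGB565 levels would be a subset of these after expansion to 8 bits), and the running sums are accumulated from a histogram using the identity sum((p - v)^2) = n*p^2 - 2*p*sum(v) + sum(v^2).

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical table for one 8-bit color component (not from Crunch).
struct PlaneSums {
  std::array<std::uint64_t, 256> below;  // sum of (p - v)^2 over pixels v < p
  std::array<std::uint64_t, 256> above;  // sum of (v - p)^2 over pixels v > p
};

PlaneSums precompute_plane_sums(const std::vector<std::uint8_t>& values) {
  std::array<std::uint64_t, 256> hist{};
  for (std::uint8_t v : values) ++hist[v];

  PlaneSums sums{};
  // n, s1, s2: count, sum and sum of squares of the values strictly below p.
  std::uint64_t n = 0, s1 = 0, s2 = 0;
  for (std::uint64_t p = 0; p < 256; ++p) {
    sums.below[p] = n * p * p + s2 - 2 * p * s1;
    n += hist[p];
    s1 += hist[p] * p;
    s2 += hist[p] * p * p;
  }
  // Same running sums, now for the values strictly above p.
  n = s1 = s2 = 0;
  for (std::uint64_t p = 256; p-- > 0;) {
    sums.above[p] = n * p * p + s2 - 2 * p * s1;
    n += hist[p];
    s1 += hist[p] * p;
    s2 += hist[p] * p * p;
  }
  return sums;
}

// Squared distance from all pixels to the slab [lo, hi] of one component:
// two table lookups, independent of the number of pixels.
std::uint64_t aabb_component_error(const PlaneSums& sums, std::uint8_t lo,
                                   std::uint8_t hi) {
  return sums.below[lo] + sums.above[hi];
}
```

Summing `aabb_component_error` over the three color components gives the total of 6 precomputed values, which can be compared against the error of the best solution found so far before running the full evaluation.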
Note: The AABB check seems to work faster than inserting a solution into the hash map. For this reason the AABB check is performed first.
Additional improvements: A few minor adjustments have been made in order to make sure that the texture decompression gives a result identical to the original version of Crunch also for 32-bit builds (the original Crunch library uses different floating point models for 32-bit and 64-bit builds).
DXT Testing:
The modified algorithm has been tested on the Kodak test set using a 64-bit build with default settings (running on Windows 10, i7-4790, 3.6GHz). All the decompressed test images are identical to the images compressed and decompressed using the original version of Crunch (revision
|
||
|
|
65f44319c0 |
Optimize computation of the endpoint cluster indices
This change improves the compression speed for both DXT and ETC encodings.
Explanation:
The vectors processed in the cluster indices computation step are the very same vectors that were used in the vector quantization step. This means that every processed vector already has a specific centroid associated with it. Even though the associated centroid is not necessarily the closest one to the processed vector, the distance to the associated centroid can be used as an upper bound on the distance to the closest centroid. This allows an efficient early out while computing the distances to the other centroids.
Note: The modified algorithm is intended to produce a decompression result identical to the original version of Crunch. For this reason the centroid associated with a specific training vector is not used as an initial best solution, because it could potentially change the decompression result in cases when the processed training vector is equidistant from multiple centroids (selection of the closest centroid in such cases depends on the processing order).
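The early out with an upper bound can be sketched as below. The function names and the use of `std::vector<float>` are assumptions for illustration. In line with the note above, the associated centroid only supplies a bound and is not used as the initial best solution, so ties are still resolved by processing order.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<float>;  // hypothetical vector type for this sketch

float squared_distance(const Vec& a, const Vec& b) {
  float d = 0.0f;
  for (std::size_t k = 0; k < a.size(); ++k) {
    const float diff = a[k] - b[k];
    d += diff * diff;
  }
  return d;
}

// Hypothetical helper (not from Crunch): returns the index of the closest
// centroid. `associated` is the centroid assigned to `v` during quantization;
// its distance is used only as an upper bound for abandoning candidates.
std::size_t find_closest_centroid(const Vec& v,
                                  const std::vector<Vec>& centroids,
                                  std::size_t associated) {
  const float upper = squared_distance(v, centroids[associated]);
  float best = std::numeric_limits<float>::max();
  std::size_t best_index = 0;
  for (std::size_t i = 0; i < centroids.size(); ++i) {
    const float limit = std::min(best, upper);
    float d = 0.0f;
    bool abandoned = false;
    for (std::size_t k = 0; k < v.size(); ++k) {
      const float diff = v[k] - centroids[i][k];
      d += diff * diff;
      if (d > limit) {  // partial sum already too large: cannot win
        abandoned = true;
        break;
      }
    }
    // Strict comparison: the first of several equidistant centroids wins,
    // exactly as in a plain loop without the early out.
    if (!abandoned && d < best) {
      best = d;
      best_index = i;
    }
  }
  return best_index;
}
```

Since the partial sums only grow, a candidate is abandoned only when its full distance would exceed either the current best or the bound, so the selected index matches an exhaustive search with the same comparison.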
DXT Testing:
The modified algorithm has been tested on the Kodak test set using a 64-bit build with default settings (running on Windows 10, i7-4790, 3.6GHz). All the decompressed test images are identical to the images compressed and decompressed using the original version of Crunch (revision
|
||
|
|
51f73fdfed |
Optimize vector quantization algorithm
This change improves the compression speed for both DXT and ETC encodings.
Explanation:
The main ideas used for optimization of the vector quantization algorithm:
- intermediate structures can store vector indices instead of the vector data, which minimizes the total amount of copied data when splitting a node (this is especially important for selector quantization, where processed vectors have 16 components)
- weighted vectors and weighted dot products can be cached
DXT Testing:
The modified algorithm has been tested on the Kodak test set using a 64-bit build with default settings (running on Windows 10, i7-4790, 3.6GHz). All the decompressed test images are identical to the images compressed and decompressed using the original version of Crunch (revision
|
||
|
|
3053c9dd93 |
Optimize DXT endpoints computation
This change improves the compression speed for DXT encoding.
Explanation:
The main ideas used for the DXT endpoints computation optimization:
- Instead of using a map in the tree clusterizer, the source vectors can be stored in an array and sorted before the quantization. This might increase the amount of used memory, but is much more efficient in terms of memory reallocation.
- Endpoint caching can be used throughout the color endpoint computation, and not just within the optimize_endpoints function. The only place where endpoint caching cannot be used is the final step of the try_combinatorial_encoding function, where alternate rounding is used.
- When computing endpoint codebooks, endpoint optimizer and endpoint refiner can be reused, which eliminates unnecessary memory reallocations.
DXT Testing:
The modified algorithm has been tested on the Kodak test set using a 64-bit build with default settings (running on Windows 10, i7-4790, 3.6GHz). All the decompressed test images are identical to the images compressed and decompressed using the original version of Crunch (revision
|
||
|
|
3e12aff909 |
Fix miscellaneous compiler warnings
DXT Testing:
The modified algorithm has been tested on the Kodak test set using a 64-bit build with default settings (running on Windows 10, i7-4790, 3.6GHz). All the decompressed test images are identical to the images compressed and decompressed using the original version of Crunch (revision
|
||
|
|
7c02055d05 | Reformat the source files. The source files have been reformatted using: clang-format.exe -style="{BasedOnStyle: Google, AllowAllParametersOfDeclarationOnNextLine: false, AllowShortFunctionsOnASingleLine: Inline, AllowShortIfStatementsOnASingleLine: false, AllowShortLoopsOnASingleLine: false, ColumnLimit: 0, DerivePointerAlignment: false, SortIncludes: false}" | ||
|
|
f71b49be60 | Initial checkin of v1.04 - KTX file format support, basic ETC1 compression/decompression, Linux makefile with proper gcc options, lots of high-level improvements to get crnlib into a state where I can more easily add additional formats. | ||
|
|
f63e26aee6 | v1.03 prerelease - Full Linux port of crnlib/crunch, in progress - still more testing to do, and some cmd line options (such as -timestamp) don't work under linux yet, but the core stuff (compression/decompression/transcoding) should work fine and performance under Linux is comparable to Windows. The 3 examples haven't been ported yet. | ||
|
|
9f98ea7e22 |