Welcome to part 4 of the DXT compression series. In this series I go over the techniques and results used to compress Firefall's texture data as they are uncovered and implemented. In this post I cover a lossy algorithm for reducing the data size on disk - or how I went from 2.5bpp to 1.5bpp with very little visible or measurable loss in quality.

Previous posts on this topic:
- Part 1 - Intro
- Part 2 - The Basics
- Part 3 - Transposes

In Part 2 we determined the baseline of optimized LZMA compression on DXT5 data, which is 2.28bpp on average for my test data set from the Orbital Comm Tower in Firefall. In Part 3 we went over various transposes of the data and found that they make only a small impact, bringing us to 2.25bpp.

Alright, here we go! It turns out that sliding windows, code books, deltas, etc. probably won't help much (if at all), so I won't waste my time with those (link).

What will help LZMA? Attacking the heart of the problem: reducing the entropy of the data.

What is entropy? Entropy is simply a lack of order or predictability in a system. That is, if the data is random, it's very hard to compress.

Where is the entropy? First, let's estimate the compressibility of each part of DXT so we can understand the situation better.

- The total size of the color selection bits after LZMA is 242.49mb.
- The total size of the alpha selection bits after LZMA is 324.79mb.
- The total size of the color and alpha lines combined after LZMA is 189.99mb.

In other words: 25% of the data is the color & alpha lines, and 75% of the data is the color & alpha selection bits. Clearly the problem is not the color/alpha lines, it's the selection bits, so let's focus on those.

How do we reduce the entropy of the selection bits? Most simply, by reducing the number of distinct variations of selection bits that actually occur (aka vector quantization). For instance, each set of 3 textures contains 32x32 DXT5 blocks per texture, or 3072 4x4-pixel blocks in total. Suppose that among those 3072 blocks there are 900 unique variations of the 4-byte color selection bits, and that 600 of the 900 occur only once, typically as minor variations of other, more common selection bits. The goal then is to swap out as many of the uncommon selection bits as we can for more common ones, wherever the resulting distortion is reasonable. In the end this idea got me to 1.51bpp with a mean squared error (MSE) of 0.64! (vs the DXT5 data, not the source art)

Let's first implement a naive greedy version as an example. Then we can discuss how to make it better.

// Reduce DXT5 entropy.
// Excuse the pseudo code format. I'm making it up as I go along.
// blocks[] is assumed to hold the DXT5 blocks of the 3-texture set.
const int numBlocks = 1024*3; // 32*32 DXT5 blocks * 3 textures

int numUniqueBits = 0;
unsigned uniqueBits[numBlocks];
int uniqueBitsCnt[numBlocks];
int blockToUniqueTbl[numBlocks];

// First find all unique selection bits, and set up some other tables at the
// same time for later use.
for(int i = 0; i < numBlocks; ++i)
{
    unsigned *bp = std::find(uniqueBits, uniqueBits + numUniqueBits, blocks[i].colorBits);
    if(bp != uniqueBits + numUniqueBits)
    {
        uniqueBitsCnt[bp - uniqueBits]++;
        blockToUniqueTbl[i] = bp - uniqueBits;
        continue;
    }
    blockToUniqueTbl[i] = numUniqueBits;
    uniqueBitsCnt[numUniqueBits] = 1;
    uniqueBits[numUniqueBits++] = blocks[i].colorBits;
}

// Sort the uniqueBits index table by occurrence count, most common first,
// with the help of uniqueBitsSortedByCnt.
int uniqueBitsSortedByCnt[numBlocks] = {0, 1, 2, /* ... */ numBlocks-1};
sort(uniqueBitsSortedByCnt, sortBy(uniqueBitsCnt));
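Both reduction passes below lean on two helper functions that the pseudo code leaves as stubs: a DXT5 decoder that overrides the color selection bits, and a color-channel MSE. For reference, here is a minimal sketch of what they might look like. The DXT5Block layout, the field names, and the 4x4 RGBA output convention are my assumptions for illustration, not necessarily how the Firefall implementation does it.

struct DXT5Block // hypothetical layout
{
    unsigned char alpha0, alpha1;  // alpha line endpoints
    unsigned char alphaBits[6];    // 16 x 3-bit alpha selectors
    unsigned short color0, color1; // 5:6:5 color line endpoints
    unsigned colorBits;            // 16 x 2-bit color selectors
};

// Expand a 5:6:5 color to 8-bit RGB (low bits left at zero for simplicity;
// real decoders often replicate the high bits into the low bits).
static void decode565(unsigned short c, unsigned char *rgb)
{
    rgb[0] = ((c >> 11) & 0x1F) << 3;
    rgb[1] = ((c >> 5) & 0x3F) << 2;
    rgb[2] = (c & 0x1F) << 3;
}

// Decode a DXT5 block to 4x4 RGBA, overriding its color selection bits with
// newBits. Alpha decoding is skipped since only the color MSE is compared.
void decodeDXT5(unsigned char rgba[64], const DXT5Block &b, unsigned newBits)
{
    unsigned char pal[4][3];
    decode565(b.color0, pal[0]);
    decode565(b.color1, pal[1]);
    for(int c = 0; c < 3; ++c) // DXT5 color blocks always use 4-color mode
    {
        pal[2][c] = (2*pal[0][c] + pal[1][c]) / 3;
        pal[3][c] = (pal[0][c] + 2*pal[1][c]) / 3;
    }
    for(int p = 0; p < 16; ++p)
    {
        int sel = (newBits >> (2*p)) & 3; // 2 selector bits per pixel
        rgba[p*4+0] = pal[sel][0];
        rgba[p*4+1] = pal[sel][1];
        rgba[p*4+2] = pal[sel][2];
        rgba[p*4+3] = 255;
    }
}

// Mean squared error over the RGB channels of two decoded 4x4 blocks.
float calcColorMSE(const unsigned char rgba1[64], const unsigned char rgba2[64])
{
    int sum = 0;
    for(int p = 0; p < 16; ++p)
        for(int c = 0; c < 3; ++c)
        {
            int d = int(rgba1[p*4+c]) - int(rgba2[p*4+c]);
            sum += d*d;
        }
    return sum / 48.0f;
}

With those in hand, the naive greedy reduction pass: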
// Do the entropy reduction. IE, replace uncommon bits with common ones when
// the MSE is below a threshold.
auto decodeDXT5 = [](unsigned char *rgba, auto &dxt5, unsigned newBits) { ... };
auto calcMSE = [](unsigned char *rgba1, unsigned char *rgba2) { ... };

float mseLimit = 8.0f; // mean squared error limit

for(int i = 0; i < numBlocks; ++i)
{
    unsigned char block1[64]; // 4x4 pixels * 4 bytes (RGBA)
    decodeDXT5(block1, blocks[i], blocks[i].colorBits);
    for(int j = 0; j < numUniqueBits; ++j)
    {
        int k = uniqueBitsSortedByCnt[j];
        unsigned char block2[64];
        decodeDXT5(block2, blocks[i], uniqueBits[k]);
        float mse = calcMSE(block1, block2);
        if(mse < mseLimit)
        {
            blocks[i].colorBits = uniqueBits[k];
            break;
        }
    }
}

This is the greedy approach: it accepts the first set of color selection bits that meets its error criterion. It works, but it suffers from quality issues - it is too greedy, sacrificing too much quality for the sake of bitrate. What we really want is to select the best match from the top N candidates, and only fall back to the greedy approach if none is found there. This produces much higher quality results for a relatively small increase in bitrate.

// Do the entropy reduction. IE, replace uncommon bits with common ones when
// the MSE is below a threshold.
// Note: kept simple for explanatory purposes. Can be optimized further.

// Helper functions.
// decodeDXT5 decodes a DXT5 block into RGBA, but overrides the color selection
// bits of the DXT5 block with newBits.
auto decodeDXT5 = [](unsigned char *rgba, auto &dxt5, unsigned newBits) { ... };
// calcColorMSE calculates the mean squared error of the color channels.
auto calcColorMSE = [](unsigned char rgba1[64], unsigned char rgba2[64]) { ... };

// Do a normal greedy search after looking for a best case in the first N entries.
const int greedyAfter = 16;
// How much MSE is acceptable?
const float mseLimit = 8.0f;

// For each DXT5 block...
for(int i = 0; i < numBlocks; ++i)
{
    unsigned char block1[64];
    decodeDXT5(block1, blocks[i], blocks[i].colorBits);

    // Find the best match among the first N most common selector bits. If it
    // is under the MSE limit, accept it. Otherwise continue one entry at a
    // time, greedily taking the first one that is under the limit.
    unsigned bestBits = blocks[i].colorBits;
    float bestMSE = mseLimit;
    int j = 0;
    int jj = std::min(greedyAfter, numUniqueBits);
    for( ; bestMSE >= mseLimit && j < numUniqueBits; ++jj)
    {
        for( ; j < jj; ++j)
        {
            int k = uniqueBitsSortedByCnt[j];
            unsigned char block2[64];
            decodeDXT5(block2, blocks[i], uniqueBits[k]);
            float mse = calcColorMSE(block1, block2);
            if(mse < bestMSE)
            {
                bestMSE = mse;
                bestBits = uniqueBits[k];
            }
        }
    }
    blocks[i].colorBits = bestBits;
}

This achieves 1.43bpp and an MSE of 0.84! Not bad for such a short piece of code. If you look through the worst-case textures, however, you will find a few with blocking artifacts, especially visible on gradients.

I improved this (and other oddities) by binary searching greedyAfter to hit a target MSE - a form of rate-distortion optimization. I also targeted a different MSE for the color and alpha groups of those textures independently. The blocky artifacts are greatly reduced, but not entirely eliminated. Pushing the MSE target lower would further improve quality, at the expense of a higher bitrate. For even better results you could target a different MSE for each of the 3 textures independently; that would improve the accuracy of the MSE at a cost of encoding time. Example pseudo code for the binary search:

// What MSE are we aiming for?
float mseTarget = 0.5f;

// Quality level to be passed in to the algorithm above.
int greedyAfter = 0;

// 9 iterations for 9 bits of search space.
for(int bit = 0; bit < 9; ++bit)
{
    // This defines the binary search: when the search space is a power of 2,
    // it boils down to setting a bit, testing, then setting a lower bit,
    // testing, and so on.
    int test = greedyAfter | (0x100 >> bit);

    // Run the algorithm from above (test + 1 because zero is erroneous).
    reduceEntropy(test + 1);

    // Compute the mean squared error of the color channels.
    float mse = computeColorMSE();

    // If the MSE is greater than the target, keep the bit set and search in
    // higher numbers; otherwise search in lower numbers.
    if(mse > mseTarget)
        greedyAfter = test;
}
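To make the independent color/alpha targeting concrete, here is a hedged sketch of how that binary search might be driven twice, once per kind of selection bits, using the targets quoted in the Results below (0.5 for color, 1.0 for alpha). The BitType enum and the reduceEntropy/computeMSE signatures are hypothetical glue of my own, not the post's actual interface.

// Hypothetical wrapper: run the same rate-distortion search independently
// for the color and the alpha selection bits.
enum BitType { ColorBits, AlphaBits };

// Assumed to exist, implemented as in the post: run the entropy reduction on
// one kind of selection bits with the given greedyAfter, and measure the
// resulting MSE of that kind vs the unmodified DXT5 data.
void reduceEntropy(BitType type, int greedyAfter);
float computeMSE(BitType type);

// Binary search greedyAfter (9 bits of search space) for one bit type.
int searchGreedyAfter(BitType type, float mseTarget)
{
    int greedyAfter = 0;
    for(int bit = 0; bit < 9; ++bit)
    {
        int test = greedyAfter | (0x100 >> bit);
        reduceEntropy(type, test + 1);
        if(computeMSE(type) > mseTarget)
            greedyAfter = test; // still too much distortion: keep the bit set
    }
    return greedyAfter;
}

void optimizeTextureSet()
{
    int colorGreedyAfter = searchGreedyAfter(ColorBits, 0.5f); // color target
    int alphaGreedyAfter = searchGreedyAfter(AlphaBits, 1.0f); // alpha target
    // ... re-run the reduction with the chosen parameters and emit the blocks.
}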
Binary search is probably not the fastest way to converge, but it is very simple.

Results

The method described above attains an MSE of 0.64 at a bitrate of 1.51bpp on the Orbital Comm Tower (OCT) data set. That is a 21.19:1 compression ratio vs uncompressed data.

The implementation of the above algorithm for the alpha line/selection bits is very similar - so similar that in my implementation a single function handles both alpha and color, selected by a function argument. As for the other knobs, I target an MSE of 1.0 for the alpha bits and 0.5 for color. The per-block mseLimit in the inner loop was left at 8.0 for both alpha and color.

Comparison to Crunch

Crunch is a library for compressing DXT data, used in a number of games (link). There are two primary ways to configure crunch: one is to target a specific bitrate, the other is to set a quality parameter and try to achieve a specific bitrate on average over the whole dataset. When I LZMA compress the resulting textures, I do it like our baseline, except I use the parameters that crunch uses internally for its LZMA compression in its rate/distortion calculations. Note that this differs from how crunch computes bpp internally: crunch computes bpp for each texture individually, not for all 3 in a set. Compressing the 3 as a set turns out to be a far more efficient way to encode the data (as we learned in Part 2).

Results when specifying a target bitrate were less than stellar. With a target bitrate of 1.51bpp, crunch achieved an MSE of 6.92. That's pretty darn horrible (and it showed very visibly in the textures too). When I compressed crunch's output as a set, like the baseline, instead of individually as crunch measures it, it was much more efficient at about 1.1bpp - a 37% difference in bpp!

Specifying a quality factor (124), tuned so that the average bitrate over the dataset comes out to 1.52bpp, gives an MSE of 2.77 - 4.3 times worse than our algorithm, but overall not horrible. While that is already a more than fair comparison, I can help crunch be even more competitive by applying what we learned in Part 2 and Part 3 of this series. Applying those ideas - after the looong task of recompressing 2.6gb of texture data - brings crunch down to 1.48bpp. That gives it some room to improve MSE, but nowhere near enough to be competitive.

Encoding speed...

Using crunch while targeting a specific bpp took just over 24 hours to finish. Even using crunch to target a single quality level takes 4+ hours to complete. If the computation is spread across multiple machines I think it's doable, but crunch's slow speed in general is not something to take on lightly for your entire asset pipeline unless that pipeline is set up to be highly distributed.
In contrast, when targeting a specific quality level my algorithm takes 15 minutes to process the 2.6gb set - and it's not parallelized.

Lines of Code

Not even a fair comparison :) but I'm happy that I was able to do as well as I did in about 100 lines of code.

Overall I wish I had more time to experiment with crunch, but it was so slow as to be prohibitive. In general, I don't feel crunch is up to the quality standards that I need for Firefall.

Update: Rich Geldreich clarified that crunch was never designed to work from DXT encoded data, only from source RGB data.

Future Work

I could keep going, continuing to improve this blog post & algorithm, but I really wanted to finish the post (and possibly the series, for at least a few months). Here is my laundry list of possible improvements.
- The code above could be better commented in places. I hoped that simplicity would help explain the algorithm, but I'm not sure that always works. If anything is unclear, please comment below and I will clarify.
- Obviously the code above is not optimized. When implementing your own version you would need to spend some time on that.
- A better search algorithm than binary search should be implemented. The search space may in fact be linear, or could be made linear, which would mean finding optimal parameters could be many times faster.
- Another tweakable parameter is the mseLimit in the inner loop. I set it to 8.0, but optimizing it for bitrate & quality has not been exhaustively tested.
- Optimizing selection bits while also choosing different line end points would likely improve the MSE for a given bpp. It is, however, a very large search space, so more thought is needed on how to do it efficiently.
- Instead of only choosing selection bits from the pool of existing ones, you could fabricate new ones, which could - theoretically - produce a better overall MSE for a given bpp.
- The choice of selection bits could also pay more attention to the internals of LZMA, which could in theory improve the MSE for a given bpp.
- Mean squared error is not always the best predictor of image quality, especially at higher MSEs. MSE is also subject to biases from large flat areas in textures. A different metric might also give a quality-per-bpp boost.
- The uncommon gradient blocking artifact may be better fixed by means other than just lowering MSE targets. I attempted error diffusion and other things, but they didn't have a significant effect. A low MSE is the most reliable predictor of quality that I've found.
- When reducing entropy, comparing MSE vs the original RGBA data would likely yield better overall quality than measuring against the post-encoded DXT5 data - in effect, performing the DXT5 encoding with the entropy reduction in mind.
- There are two parts to LZMA: the LZ part and the M part. Thus far this algorithm has focused on the LZ part, as that has proven fruitful. However, you could additionally target the distribution of the literal coder. Initial experiments with this were OK, and I feel it might be a good avenue to pursue further.
- As Bryan McNett suggested in the comments to my last post, it may be possible to adjust some of the parameters per mip level and still achieve a visually pleasing result. I tried this early on with decent success, but later removed it due to some (very small) mip-mapping transition artifacts when viewing the textures in game.

In closing, I've presented an algorithm with good quality and low MSE that is over 4 times better than the state of the art (crunch). I think it's pretty good, and it can be made better still with some of these suggestions. Overall, the artists are happy with the quality thus far. Firefall will use this technique in the coming patches, and people will see a pretty dramatic improvement in streaming performance. Win-win.
14 Comments
8/16/2013 03:54:01 pm
Nice series, and great results!
Jon Olick
8/17/2013 04:41:12 am
Thanks! That's a great suggestion!
8/20/2013 09:42:23 pm
Really interesting work, and I'm glad that others are experimenting in this space. I wish more devs and companies would take compression seriously. Bandwidth and customer time are definitely not free or something to be taken for granted. I'm stuck on WiMax or 4G most of the week and I hate it when devs shove monthly multi-GB updates or game downloads on Steam.
Jon Olick
8/20/2013 10:42:34 pm
> crunch optimizes for, computes and displays MSE/RMSE/etc. in the traditional way, i.e. vs. the original (uncompressed/raw) input (which should be the source art), not against some DXTc compressed version of the image
Jon Olick
8/20/2013 11:00:00 pm
and by "if you would like the original source art" I mean if you would like the data I used in these tests :) Which is the source DXT5 encoded data.
Jon Olick
8/20/2013 10:46:53 pm
In general, crunch would make odd encoding decisions which greatly affect MSE. It would move high frequency elements around at random, and do it differently on each of the 3 textures in a set! This caused inconsistencies between the normal maps and the diffuse that looked rather odd in some cases - any one specific texture would look OK, but put together for rendering they looked wrong.
Jon Olick
8/20/2013 10:56:40 pm
Oh also, I didn't output CRN files from crunch, but DDS data, from which I then stripped the DDS header to get the raw DXT data. This was also done to keep the comparison fair.
Jon Olick
8/20/2013 11:47:12 pm
Just read over the comment again, and I realized you may also be confused in thinking that I was computing the MSE of your program's output vs the *output* of my algorithm. That is not the case. All MSEs in the article are vs the source DXT5 data, so they are all directly comparable and meaningful in terms of how much distortion they add to the image.
RICH GELDREICH
8/21/2013 04:13:13 am
Cool, thanks. I can see why you would be interested in doing this, but crunch is not designed (and was never tested) to work with already compressed DXTc data. In the mode you're using, crunch itself is just a DXTc compressor that tries to cluster its output, so you're compressing each image twice with two radically different DXTc compressors.
Thanks for documenting your experiments; we will soon be looking at compressing our assets and this will be a great starting point!
8/26/2013 02:10:21 pm
I'll explain again what I said before, and I'll try to make it simple...
Jon Olick
8/26/2013 02:30:10 pm
> It's possible to encode the endpoints for a mip0 block as nothing more than a predefined riff on the endpoints in the corresponding block of mip1. No palette.
Rich Geldreich
6/11/2016 06:43:33 pm
BTW - Google, EA, Sony, Unity and a bunch of other companies now use crunch.