Welcome to part 4 of the DXT compression series. In this series I go over the techniques and results used to compress Firefall's texture data as they are uncovered and implemented. 

In this post I go over a lossy algorithm reducing the data size on disk - Or how I went from 2.5bpp to 1.5bpp with very little visible and measurable loss in quality. 

Previous posts on this topic:
Part 1 - Intro
Part 2 - The Basics
Part 3 - Transposes

In Part 2 we determined the baseline of optimized LZMA compression on DXT5 data, which is 2.28bpp on average for my test data set from Orbital Comm Tower in Firefall.

In Part 3, we went over various transposes of data and found that they only make a small impact to 2.25bpp.

Alright, here we go!
It turns out, sliding windows/code books/delta/etc probably won't help much (if at all). So I won't waste my time with that. link

What will help LZMA?
Attacking the heart of the problem. Reducing the entropy of the data.

What is entropy?
Entropy is simply a lack of order or predictability in a system. IE, if it's random data, it's very hard to compress.

Where is the entropy?
First, let's estimate the compressibility of each part of DXT so we can understand the situation better.

The total size of the color selection bits after LZMA is 242.49mb.  
The total size of the alpha selection bits after LZMA is 324.79mb. 
The total size of the color and alpha lines combined after LZMA is 189.99mb. 

In other words:
25% of the data is the color & alpha lines.
75% of the data is the color & alpha selection bits.

Clearly, the problem is not the color/alpha lines, it's the selection bits. So lets focus on that.

How do we reduce the entropy of the selection bits?
This can be most simply boiled down to reducing the number of different variations of selection bits that actually occur (aka vector quantization). 

For instance, each texture has 32x32x3 DXT5 blocks. If of those 4x4 pixel DXT5 blocks, there are 900 out of 3072 unique variations of the 4-byte color selection bits. Out of the 900, 600 variations occur only once and are typically only minor variations compared to other, more common, selection bits. The goal then is to switch out as many as we can of the 900 uncommon blocks of selection bits for more common blocks of selection bits if the distortion is reasonable.

This idea in the end got me 1.51bpp and a mean squared error (MSE) of 0.64! (vs the DXT5 data, not source art)

Lets first implement a naive greedy version first as an example. Then we can discuss on how to make it better.

// Reduce DXT5 Entropy
// Excuse the pseudo code format. I'm making it up as I go along.
const int numBlocks = 1024*3; // 32*32 DXT5 blocks * 3 textures

int numUniqueBits = 0;
unsigned unqiueBits[numBlocks];
int uniqueBitsCnt[numBlocks];
int blockToUniqueTbl[numBlocks];


// First find all unique selection bits and set up some other tables at the same time
// for later use.
for(int i = 0; i < numBlocks; ++i)
  unsigned *bp = std::find(uniqueBits, uniqueBits+numUniqueBits, blocks[i].colorBits);
  if(bp != uniqueBits+numUniqueBits) 
    uniqueBitsCnt[bp-uniqueBits]++;
    blockToUniqueTbl[i] = bp-uniqueBits;
    continue;
  blockToUniqueTbl[i] = numUniqueBits;
  uniqueBitsCnt[numUniqueBits] = 1;
  uniqueBits[numUniqueBits] = blocks[i].colorBits;


// Sort the uniqueBits table by uniqueBitsCnt with the help of uniqueBitsSortedByCnt
int uniqueBitsSortedByCnt[numBlocks] = {0,1,2, .. numBlocks-1};
sort(uniqueBitsSortedByCnt, 
sortBy(uniqueBitsCnt));

// Do the entropy reduction. IE, replace uncommon bits with common ones when
// the MSE is below a threshold.
auto decodeDXT5 = [](unsigned char *rgba, auto &dxt5, unsigned newBits) { ... };
auto calcMSE = [](unsigned char *rgba1, unsigned char *rgba2) { ... };
float mseLimit = 8.0; // mean squared error
for(int i = 0; i < numBlocks; ++i)
  unsigned char block1[16];
  decodeDXT5(block1, blocks[i], blocks[i].colorBits);
  for(int j = 0; j < numUniqueBits; ++j)
    int k = unqiueBitsSortedByCnt[j];
    unsigned char block2[16];
    decodeDXT5(block2, blocks[i], uniqueBits[k]);
    float mse = calcMSE(block1, block2);
    if(mse < mseLimit) 
      blocks[i].colorBits = uniqueBits[k];
      break;


This is the greedy approach. It accepts the first set of color selection bits that meet its error criteria. It works great, however it suffers from quality issues. Quality wise, it is too greedy. It sacrifices too much for the sake of bitrate. What we really want to do is select from the top N candidates the best match, then if not found fall back to the greedy approach. This produces much higher quality results for a relatively small increase in bitrate.

// Do the entropy reduction. IE, replace uncommon bits with common ones when
// the MSE is below a threshold.
// Note: simple for explanatory purposes. Can be optimized further.

// Helper functions.
// decodeDXT5 decodes a DXT5 block into RGBA, but overrides the color selection 
// bits of the DXT5 block with newBits.

auto decodeDXT5 = [](unsigned char *rgba, auto &dxt5, unsigned newBits) { ... };
// calcColorMSE calculates the mean squared error of the color channels.
auto calcColorMSE = [](unsigned char rgba1[16], unsigned char rgba2[16]) { ... };

// Do a normal greedy search after looking for a best case in the first N entries
const int greedyAfter = 16;

// How much MSE is acceptable?
const float mseLimit = 8.0;

// For each DXT5 block 
for(int i = 0; i < numBlocks; ++i)
  unsigned char block1[16];
  decodeDXT5(block1, blocks[i], blocks[i].colorBits);


  // Find the best match in the first N best selector bits. If less than MSE limit, trivial 
  // accept. Otherwise go one by one greedily and choose the first one that is under
  // the limit.
  unsigned bestBits = blocks[i].colorBits;
  float bestMSE = mseLimit;
  int j = 0;
  int jj = std::min(greedyAfter, numUniqueBits);
  for( ; bestMSE >= mseLimit; ++jj)
    for(; j < jj; ++j)
      int k = unqiueBitsSortedByCnt[j];
      unsigned char block2[16];
      decodeDXT5(block2, blocks[i], uniqueBits[k]);
      float mse = calcMSE(block1, block2);
      if(mse < mseLimit) 
        bestMSE = mse;
        bestBits = uniqueBits[k];
        break;
  blocks[i].colorBits = bestBits;


This achieves 1.43bpp and MSE of 0.84! not bad for a short piece of code.

If you look through the worst case textures you will find a few textures with blocking artifacts. Especially on gradients you may see visual blocking artifacts such as these:
Picture
Original on the left. Entropy reduced on the right.
I improved this (and other oddities) by binary searching greedyAfter to find a target MSE, a form of rate-distortion optimization. I also targeted a different MSE for both color and alpha groups of those textures independently. 
The blocky artifacts are greatly reduced, but not entirely eliminated. Pushing the MSE target lower would further improve quality, at the expense of a higher bitrate. For improved performance you could target a different MSE for each of the 3 textures independently. That would improve the accuracy of the MSE at a cost of encoding time. 

Example psuedo code for the binary search.

// What MSE are we aiming for?
float mseTarget = 0.5;
// quality level to be passed in to above algorithm
int greedyAfter = 0;
// 9 iterations for 9 bits of search space
for(int bit = 0; bit < 9; ++bit) 
  // This piece here defines the binary search, when the search space is a power of
  // 2, this boils down to setting a bit, testing, then setting a lower bit, testing, etc
  int test = greedyAfter | (0x100>>bit);
  // run the algorithm from above (test + 1 cause zero is erroneous)
  reduceEntropy(test + 1);
  // compute the mean squared error of the color channels
  float mse = computeColorMSE();
  // If the mse is greater than the target, then search in higher numbers 
  // else search in lower numbers. 

  if(mse > mseTarget)
    greedyAfter = test;


Binary search is probably not the fastest way to converge. It is however very simple.

Results

The method described above attains a MSE of 0.64 with a bitrate of 1.51bpp on the Orbital Comm Tower (OCT) data set. This is a 21.19:1 compression vs uncompressed data.

The implementation of the above algorithm for alpha line/selection bits is very similar. So similar that in my implementation I coded a single function to work on both alpha or color based on a function argument. As for other knobs, I target an MSE of 1.0 for the alpha bits, and 0.5 for color. The per-block MSE limit for alpha was left to 8.0 for both alpha and color.

Comparison to Crunch

Crunch is a library for compressing DXT data. It's used in a number of games. Link

There are two primary ways to configure crunch. One is to target a specific bitrate, the other is to set a quality parameter and try to achieve a specific bitrate on average over the whole dataset. Also when I LZMA compress the resulting textures, I do it like our baseline, except I use the parameters that Crunch uses internally for their LZMA compression in their rate/distortion calculations. This is different than crunch calculates bpp internally, which it calculates bpp for each texture individually, and not all 3 in a set. Compressing the 3 as a set turns out to be a way more efficient way to encode the data (as we learned from Part 2).

Results when specifying a target bitrate was less than stellar. With a target bitrate of 1.51bpp, Crunch achieved a MSE of 6.92. Thats pretty darn horrible (and it showed very visibly in the textures too). I compressed them in a set, like the baseline instead of individually like crunch, and is was much more efficient at about 1.1bpp. Thats a 37% difference in bpp!

Results when specifying a quality factor (124) while optimizing that quality factor to achieve an average bitrate of 1.52bpp gives a MSE of 2.77. 4.3 times worse than our algorithm, but overall not horrible.

While it's a more than fair comparison, I can help crunch to be even more competitive by applying what we learned from Part 2 and Part 3 of this blog series. After application of those ideas, and the looong task of recompressing 2.6gb of texture data to 1.48bpp. That gives some room to improve MSE, but nowhere near enough to be competitive.

Encoding speed... 

Using crunch while targeting a specific bpp took just over 24 hours to finish. Even when using crunch to target a single quality level it takes 4+ hours or so to complete. Despite its slowness, if computing is spread across multiple computers, I think it's doable, but the slow speed of crunch in general is not something to be used on a whim for your entire asset pipeline unless your pipeline is set up to be highly distributed. 

In contrast, when targeting a specific quality level my algorithm takes 15 minutes to process the 2.6gb set - and it's not parallelized. 

Lines of Code

Not even a fair comparison :) but I'm happy that I was able to do as well as I did in about 100 lines of code.

Overall I wish I had more time to experiment with crunch, but it was so slow it was prohibitive. In general, I don't feel crunch is up to the quality standards that I need for Firefall.

Update: Rich Geldreich clarified that Crunch was never designed to function from DXT encoded data, but only source RGB data.

Future Work

I could just keep going on this and continuing to improve the blog post & algorithm, but I really wanted to finish this blog post (and possibly series for at least a few months). Here is my laundry list of possible improvements.

- The code above should in some cases be more commented. I hoped that simplicity would help explain the algorithm, but I'm not sure that always works up there in the code. If anything is unclear, please comment below and I will clarify.

- Obviously the code above is not optimized. When implementing your own version you would need to spend some time with that. 

- A better search algorithm other than binary search should be implemented. The search space may indeed be linear or be made to be linear, which would mean finding optimal parameters could be many times faster. 

- Another tweakable parameter is the mseLimit in the inner loop. I set it to 8.0, but optimizing it for bitrate & quality is not exhaustively tested.

- Optimizing selection bits while also choosing different line end points would likely improve the MSE for bpp. It is however a very large search space, so more thinking would need to be done to figure out how to do it efficiently.

- Instead of only choosing selection bits from the pool of existing ones, you could instead fabricate new ones which can overall produce a better MSE for bpp - theoretically.

- The choice of selection bits could also pay more attention the internals of LZMA, which could in theory improve the MSE for bpp. 

- Mean squared error is not always the best predictor of image quality. This is especially so at higher MSEs. MSE is also subject to biases due to large flat areas in textures. A different metric may also be able to give a quality for bpp boost. 

- Improving the uncommon gradient block artifact may be better fixed by means other than just lowering MSE targets. I attempted error diffusion and other things, but it didn't have a significant effect. Low MSE was the most reliable predictor of quality that I've found.

- When reducing entropy, comparing MSE vs the original RGBA data sets would likely yield improved overall quality vs measuring from the post-encoded DXT5 data. In effect, combining the encoding of the DXT5 blocks with the entropy reduction in mind. 

- There are two parts to LZMA. The LZ part and the M part. Thus far this algorithm has focused on the LZ part as that has proven to be fruitful. However, you could also in addition focus on the distribution of the literal coder. Initial experiments were OK with this, and I feel that might also be a good avenue to pursue further. 

- As Bryan Mcnett suggested in the comments to my last post, It may be possible to adjust some of the parameters per mip-level and still achieve a visually pleasing result. I tried this early on with decent success, but later removed it due to some (very small) mip-mapping transition artifacts when viewing the textures in game. 

In closing, I've presented an algorithm with good quality and also low MSE which is over 4 times better than state of the art (crunch). I think that its pretty good, but can be made to be better with some of these suggestions. Overall, the artists are also happy with the quality thus far. Firefall will use this technique in the coming patches and people will see a pretty dramatic improvement in streaming performance. Win-win.
 


Comments

08/17/2013 12:54am

Nice series, and great results!

When I was doing my own experimentation with image processing, I found that MSE was indeed a very good quality metric, particularly if you convert the final product and original image to YUV space and weight the channels asymmetrically toward the Y, as the eye is more attuned to intensity than color errors. Obviously this is not going to work when encoding data into your textures, such as alpha or normal maps, in which case MSE may be a terrible metric anyway, so you might need some custom metric code based on the meaning of the data. However, some significant gains can come from custom tolerances for the importance of errors in the final texture data, if certain channels are less important than others (or ignored completely).

Reply
Jon Olick
08/17/2013 1:41pm

Thanks! Thats a great suggestion!

Reply
08/21/2013 6:42am

Really interesting work, and I'm glad that others are experimenting in this space. I wish more devs and companies would take compression seriously. Bandwidth and customer time is definitely not free or something to be taken for granted. I’m stuck on WiMax or 4G most of the week and I hate it when devs shove montly multi-GB updates or game downloads on Steam.

About your MSE figures: For your method you made it clear that you compute MSE vs the compressed DXT5 images: "This idea in the end got me 1.51bpp and a mean squared error (MSE) of 0.64! (vs the DXT5 data, not source art)". Your MSE figures seemed strangely low to me for DXTc, it wasn't until I read the "vs the DXT5" part (which is key) that I understood why. Unfortunately, if this is the metric you're using to evaluate the quality of results then crunch isn't the right thing to use (and I can see why your MSE results are so much better).

crunch optimizes for, computes and displays MSE/RMSE/etc. in the traditional way, i.e. vs. the original (uncompressed/raw) input (which should be the source art), not against some DXTc compressed version of the image. With the "vs DXT5" metric your results can't be reproduced (and really aren't meaningful) until I know exactly what DXTc compressor and settings where used to create your reference images. Even knowing what compressor/settings you used isn't that helpful in interpreting your results, because I may want or need to use a different DXTc compressor in my pipeline. In many cases there are many possible block encodings that result in the same or very nearly the same RMSE, so it's really a toss-up what you're optimizing against with this metric.

To make an extreme example: Let's say you used a DXTc compressor set to low quality mode (or a real-time compressor, or had dithering enabled, or used uniform distance metrics while crunch used perceptual, whatever), and you and your artists where perfectly happy with this compressor's output. You then decide to use this compressor's output as your "golden reference" used to judge all other alternatives. I could see how crunch, or some alternative method like JPEG+real time DXTc, or whatever are going to fall flat on their faces with this metric because they are minimizing error vs. the original image, not some particular DXTc compressed variant of the original image.

It would be much more interesting and useful to see the RMSE results comparing crunch's output vs. the original image, contrasted against this method's output vs. the original image (ideally with some content we can access or on some standard test images). Also, a R&D distortion plot showing the traditional RMSE of this method vs. crunch at .75-2bpp would be interesting.

Anyway, ignoring the MSE thing, crunch's current backend coder isn't that good. Its uses too many Huffman symbols, it wastes too many bits on encoding Huffman code lengths, and it doesn't use RLE let alone LZ. On images where RLE or LZ is able to exploit many runs/patterns I highly suspect LZMA will run circles around crunch. For real-world game textures the lack of a powerful backend in crunch could be a serious issue. In my testing with photos, satellite images and videos crunch's current backend performs relatively better vs. simpler game textures.

Yes, crunch's compressor is slow, especially when targeting specific bitrates (which it really wasn't designed to do but I put this feature in there anyway for testing). In the large scale/real world usages I've seen, the compression process is offloaded to the cloud or a local farm of boxes with Incredibuild or whatever. Compression throughput isn't all that interesting of an axis to work on when this much compute is easily available.

FWIW, I'm a huge fan of LZMA, but it's just too slow and energy inefficient for many real world use cases. crunch's backend is designed for use on low end CPU's or via Javascript by cross compiling its transcoder with Emscripten (in which case it can outperform even native JPEG decoders, and will run circles around LZMA). It would be interesting to compare various methods using something like a "RMSE per million decode/transcode CPU cycles" type metric.

-Rich

Reply
Jon Olick
08/21/2013 7:42am

> crunch optimizes for, computes and displays MSE/RMSE/etc. in the traditional way, i.e. vs. the original (uncompressed/raw) input (which should be the source art), not against some DXTc compressed version of the image

When I used crunch, I did not give it the original source art. I decoded the DXT data and fed it that. So it was a fair comparison in that respect.

Using original source art is unfortunately not available in most cases for Firefall's textures as we use many textures which are pre-compressed by artists (by hand unfortunately). We would like to break them of this habit, but like most things, it takes time.

If you would like the original source art for your testing, send me off an email (use the contact form).

Reply
Jon Olick
08/21/2013 8:00am

and by "if you would like the original source art" I mean if you would like the data I used in these tests :) Which is the source DXT5 encoded data.

Charles Bloom also asked for this so he could measure and compare vs his algorithms.

Jon Olick
08/21/2013 7:46am

In general, crunch would make odd encoding decisions which greatly effect MSE. It would move high frequency elements around at random, and do it differently on each of the 3 pairs of textures! This caused inconsistencies when rendering between normal maps and diffuse that looked rather odd in some cases. This resulted in any one specific textures looking ok, but when put together for rendering looking wrong.

Reply
Jon Olick
08/21/2013 7:56am

Oh also, I didn't output from crunch as CRN files, but as DDS data, which I then stripped the resulting DDS header to get the raw DXT data. This was also to be a fair comparison.

Reply
Jon Olick
08/21/2013 8:47am

Just read over the comment again and I realized that you also may be confused in that you thought I was computing MSE of your program vs the *output* of my algorithm. That is not the case. All MSE's in the article are vs the source DXT5 data. So they are all directly comparable and meaningful in terms of how much distortion they are adding to the image.

Reply
RICH GELDREICH
08/21/2013 1:13pm

Cool, thanks. I can see why you would be interested in doing this, but crunch is not designed (and was never tested) to work with already compressed DXTc data. In the mode you're using, crunch itself is just a DXTc compressor that tries to cluster its output, so you're compressing each image twice with two radically different DXTc compressors.

I guess our definitions of "source" data are different. But in your situation, if that's all you have then your approach sounds great. On the products I've worked on or seen our asset pipeline always preserved the source art as .PNG's or whatever so we could do experiments on the compression front without worrying about generational artifacts.

08/26/2013 11:10pm

I'll explain again what I said before, and I'll try to make it simple...

It's possible to encode the endpoints for a mip0 block as nothing more than a predefined riff on the endpoints in the corresponding block of mip1. No palette.

And, this doesn't generalize to other mips. mip0 is special because it's the one you walk past, not the one you scrutinize. Loss is more acceptable in mip0.

Eight bits of riff per mip0 block is plenty, and reduces the size of mip0 by 3/8 (BC1). This reduces size of whole texture by 3/8 * 3/4 = 9/32. And all the loss is concentrated in the mip you scrutinize least (in an action game.)

Does this make any sense? Because it doesn't seem to have made sense to you before.

Reply
Jon Olick
08/26/2013 11:30pm

> It's possible to encode the endpoints for a mip0 block as nothing more than a predefined riff on the endpoints in the corresponding block of mip1. No palette.

I understand, my situation is a bit tricky though (and perhaps non-standard unless your doing software virtual texturing). At decode time I don't have easy access to other mips, so I can't use that (or anything like it). Even if I did, we use a 1-pixel border on the VT tiles, so the usual 4 blocks for 1 block isn't the case. So for my data, each mip is really most simply encoded and decoded individually.

This is however, really very good advice for others who do have that mip information to work with.

> Loss is more acceptable in mip0.

Thats true in general if your game has a good texel per world unit ratio. However, with anisotropic filtering on I did notice small mip transition artifacts in some cases. I didn't do any visual tests with these artifacts to see if people would notice them. I made the judgement that if they happen in a small way in one place, then they may be more egregious somewhere else and I didn't want to have to verify every surface in the entire game for quality.

> Does this make any sense? Because it doesn't seem to have made sense to you before.

I do think this makes more sense now, yeah. Although the definition of a riff is somewhat unclear. It sounds like just 256 possible deltas on the end point which you search for the least objectionable one. Which in general sounds like a pretty good idea :)








Reply



Leave a Reply