In this post I go over a lossy algorithm for reducing the data size on disk - or how I went from 2.5bpp to 1.5bpp with very little visible or measurable loss in quality.

Previous posts on this topic:

Part 1 - Intro

Part 2 - The Basics

Part 3 - Transposes

In Part 2 we determined the baseline of optimized LZMA compression on DXT5 data, which is 2.28bpp on average for my test data set from Orbital Comm Tower in Firefall.

In Part 3, we went over various transposes of the data and found that they make only a small impact, bringing it to 2.25bpp.

Alright, here we go!

**What will help LZMA?**

Attacking the heart of the problem: reducing the entropy of the data.

**What is entropy?**

Entropy is simply a lack of order or predictability in a system. I.e., the closer data is to random, the harder it is to compress.

**Where is the entropy?**

First, let's estimate the compressibility of each part of DXT so we can understand the situation better.

The total size of the color selection bits after LZMA is 242.49MB.

The total size of the alpha selection bits after LZMA is 324.79MB.

The total size of the color and alpha lines combined after LZMA is 189.99MB.

In other words:

25% of the data is the color & alpha lines.

75% of the data is the color & alpha selection bits.

Clearly, the problem is not the color/alpha lines, it's the selection bits. So let's focus on those.

How do we reduce the entropy of the selection bits?

This most simply boils down to reducing the number of distinct variations of selection bits that actually occur (aka vector quantization).

For instance, each set of textures has 32x32x3 DXT5 blocks. Say that among those 3072 4x4-pixel blocks there are 900 unique variations of the 4-byte color selection bits. Of those 900, 600 occur only once and are typically only minor variations of other, more common selection bits. The goal then is to swap out as many of the uncommon selection bits as we can for more common ones, wherever the distortion is reasonable.

This idea in the end got me to 1.51bpp and a mean squared error (MSE) of 0.64! (vs. the DXT5 data, not the source art)

Let's first implement a naive greedy version as an example. Then we can discuss how to make it better.

```cpp
// Excuse the pseudo code format. I'm making it up as I go along.

const int numBlocks = 1024*3; // 32*32 DXT5 blocks * 3 textures

int numUniqueBits = 0;
unsigned uniqueBits[numBlocks];
int uniqueBitsCnt[numBlocks];
int blockToUniqueTbl[numBlocks];

// First find all unique selection bits and set up some other tables
// at the same time for later use.
for(int i = 0; i < numBlocks; ++i)
{
    unsigned *bp = std::find(uniqueBits, uniqueBits+numUniqueBits, blocks[i].colorBits);
    if(bp != uniqueBits+numUniqueBits)
    {
        uniqueBitsCnt[bp-uniqueBits]++;
        blockToUniqueTbl[i] = bp-uniqueBits;
        continue;
    }
    blockToUniqueTbl[i] = numUniqueBits;
    uniqueBitsCnt[numUniqueBits] = 1;
    uniqueBits[numUniqueBits] = blocks[i].colorBits;
    ++numUniqueBits; // grow the unique table
}

// Sort the uniqueBits table by uniqueBitsCnt (most common first) with the
// help of uniqueBitsSortedByCnt.
int uniqueBitsSortedByCnt[numBlocks] = {0,1,2, /* ... */ numBlocks-1};
sort(uniqueBitsSortedByCnt, sortBy(uniqueBitsCnt));
```

```cpp
// Do the entropy reduction. IE, replace uncommon bits with common ones when
// the MSE is below a threshold.
auto decodeDXT5 = [](unsigned char *rgba, auto &dxt5, unsigned newBits) { ... };
auto calcMSE = [](unsigned char *rgba1, unsigned char *rgba2) { ... };

float mseLimit = 8.0; // mean squared error

for(int i = 0; i < numBlocks; ++i)
{
    unsigned char block1[16];
    decodeDXT5(block1, blocks[i], blocks[i].colorBits);
    for(int j = 0; j < numUniqueBits; ++j)
    {
        int k = uniqueBitsSortedByCnt[j];
        unsigned char block2[16];
        decodeDXT5(block2, blocks[i], uniqueBits[k]);
        float mse = calcMSE(block1, block2);
        if(mse < mseLimit)
        {
            blocks[i].colorBits = uniqueBits[k];
            break;
        }
    }
}
// Note: simple for explanatory purposes. Can be optimized further.
```

An improved version first looks for the best match among the N most common entries, then falls back to plain greedy acceptance:

```cpp
// Helper functions.

// decodeDXT5 decodes a DXT5 block into RGBA, but overrides the color selection
// bits of the DXT5 block with newBits.
auto decodeDXT5 = [](unsigned char *rgba, auto &dxt5, unsigned newBits) { ... };

// calcColorMSE calculates the mean squared error of the color channels.
auto calcColorMSE = [](unsigned char rgba1[16], unsigned char rgba2[16]) { ... };

// Do a normal greedy search after looking for a best case in the first N entries.
const int greedyAfter = 16;

// How much MSE is acceptable?
const float mseLimit = 8.0;

// For each DXT5 block
for(int i = 0; i < numBlocks; ++i)
{
    unsigned char block1[16];
    decodeDXT5(block1, blocks[i], blocks[i].colorBits);

    // Find the best match in the first N most common selector bits. If less
    // than the MSE limit, trivially accept. Otherwise go one by one greedily
    // and choose the first one that is under the limit.
    unsigned bestBits = blocks[i].colorBits;
    float bestMSE = mseLimit;
    int j = 0;
    int jj = std::min(greedyAfter, numUniqueBits);
    for( ; bestMSE >= mseLimit && jj <= numUniqueBits; ++jj)
    {
        for( ; j < jj; ++j)
        {
            int k = uniqueBitsSortedByCnt[j];
            unsigned char block2[16];
            decodeDXT5(block2, blocks[i], uniqueBits[k]);
            float mse = calcColorMSE(block1, block2);
            if(mse < bestMSE) // track the best so far; also ends the outer loop
            {
                bestMSE = mse;
                bestBits = uniqueBits[k];
            }
        }
    }
    blocks[i].colorBits = bestBits;
}
```

If you look through the worst-case textures you will find a few with blocking artifacts. Especially on gradients you may see visible blocking artifacts such as these:

To pick the greedyAfter parameter automatically, I binary search for the smallest value that hits a target MSE over the whole set. Example pseudo code for the binary search:

```cpp
// target MSE for the whole data set
float mseTarget = 0.5;

// quality level to be passed in to the above algorithm
int greedyAfter = 0;

// 9 iterations for 9 bits of search space
for(int bit = 0; bit < 9; ++bit)
{
    // This piece defines the binary search. When the search space is a power
    // of 2, it boils down to setting a bit, testing, then setting a lower
    // bit, testing, etc.
    int test = greedyAfter | (0x100>>bit);

    // run the algorithm from above (test + 1 because zero is erroneous)
    reduceEntropy(test + 1);

    // compute the mean squared error of the color channels
    float mse = computeColorMSE();

    // If the mse is greater than the target then search in higher numbers,
    // else search in lower numbers.
    if(mse > mseTarget)
        greedyAfter = test;
}
```

## Results

The implementation of the above algorithm for the alpha line/selection bits is very similar - so similar that in my implementation a single function handles either alpha or color based on a function argument. As for the other knobs, I target an MSE of 1.0 for the alpha bits and 0.5 for color. The per-block mseLimit was left at 8.0 for both alpha and color.

## Comparison to Crunch

There are two primary ways to configure crunch. One is to target a specific bitrate; the other is to set a quality parameter and try to achieve a specific bitrate on average over the whole data set. When I LZMA compress the resulting textures, I do it like our baseline, except I use the parameters that crunch uses internally for its LZMA compression in its rate/distortion calculations. This differs from how crunch calculates bpp internally: it calculates bpp for each texture individually, not for all 3 in a set. Compressing the 3 as a set turns out to be a much more efficient way to encode the data (as we learned in Part 2).

Results when specifying a target bitrate were less than stellar. With a target bitrate of 1.51bpp, crunch achieved an MSE of 6.92. That's pretty darn horrible (and it showed very visibly in the textures too). When I compressed the textures as a set, like the baseline, instead of individually like crunch, it was much more efficient at about 1.1bpp. That's a 37% difference in bpp!

Results when specifying a quality factor (124), tuned to achieve an average bitrate of 1.52bpp, gave an MSE of 2.77. That's 4.3 times worse than our algorithm, but overall not horrible.

While it's already a more than fair comparison, I can help crunch be even more competitive by applying what we learned in Part 2 and Part 3 of this blog series. After applying those ideas - and the looong task of recompressing 2.6GB of texture data - crunch reaches 1.48bpp. That gives some room to improve MSE, but nowhere near enough to be competitive.

**Encoding speed...**

Using crunch while targeting a specific bpp took just over 24 hours to finish. Even targeting a single quality level takes 4+ hours to complete. Despite its slowness, if the computation is spread across multiple computers I think it's doable, but crunch's speed means it is not something to use on a whim for your entire asset pipeline unless that pipeline is set up to be highly distributed.

In contrast, when targeting a specific quality level my algorithm takes 15 minutes to process the 2.6GB set - and it's not parallelized.

**Lines of Code**

Not even a fair comparison :) but I'm happy that I was able to do as well as I did in about 100 lines of code.

Overall I wish I had more time to experiment with crunch, but it was so slow it was prohibitive. In general, I don't feel crunch is up to the quality standards that I need for Firefall.

**Update:** Rich Geldreich clarified that Crunch was never designed to work from DXT-encoded data, only from source RGB data.

## Future Work

- The code above could use more comments in places. I hoped that simplicity would help explain the algorithm, but I'm not sure it always does. If anything is unclear, please comment below and I will clarify.

- Obviously the code above is not optimized. When implementing your own version you would need to spend some time on that.

- A better search algorithm than binary search should be implemented. The search space may indeed be linear, or could be made linear, which would mean finding optimal parameters could be many times faster.

- Another tweakable parameter is the mseLimit in the inner loop. I set it to 8.0, but optimizing it for bitrate & quality has not been exhaustively tested.

- Optimizing selection bits while also choosing different line end points would likely improve the MSE for bpp. It is however a very large search space, so more thinking would need to be done to figure out how to do it efficiently.

- Instead of only choosing selection bits from the pool of existing ones, you could instead fabricate new ones which can overall produce a better MSE for bpp - theoretically.

- The choice of selection bits could also pay more attention to the internals of LZMA, which could in theory improve the MSE for bpp.

- Mean squared error is not always the best predictor of image quality. This is especially so at higher MSEs. MSE is also subject to biases due to large flat areas in textures. A different metric may also be able to give a quality for bpp boost.

- The uncommon gradient blocking artifact may be better fixed by means other than just lowering MSE targets. I attempted error diffusion and other things, but they didn't have a significant effect. Low MSE was the most reliable predictor of quality that I found.

- When reducing entropy, measuring MSE against the original RGBA source data would likely yield better overall quality than measuring against the post-encoded DXT5 data - in effect, combining the encoding of the DXT5 blocks with the entropy reduction in mind.

- There are two parts to LZMA: the LZ part and the M part. Thus far this algorithm has focused on the LZ part, as that has proven fruitful. However, you could additionally target the distribution of the literal coder. Initial experiments with this were OK, and I feel it might be a good avenue to pursue further.

- As Bryan Mcnett suggested in the comments to my last post, it may be possible to adjust some of the parameters per mip-level and still achieve a visually pleasing result. I tried this early on with decent success, but later removed it due to some (very small) mip-mapping transition artifacts when viewing the textures in game.

In closing, I've presented an algorithm with good quality and low MSE that is over 4 times better than the state of the art (crunch). I think that's pretty good, but it can be made better with some of the suggestions above. Overall, the artists are happy with the quality thus far. Firefall will use this technique in the coming patches, and players will see a pretty dramatic improvement in streaming performance. Win-win.