Jon Olick

DXT compression - part 3 - Transposes

7/16/2013

Welcome to part 3 of the DXT compression series. In this series I go over the techniques used to compress Firefall's texture data, and their results, as they are uncovered and implemented.

In this post I go over some simple data transpose options with some rather non-intuitive results.

Other posts in this series:
Part 1 - Intro
Part 2 - The Basics
Part 4 - Entropy

In the last post we established the baseline of straight-up LZMA compression on DXT5 data, which is 2.28bpp on average for my test data set from the Orbital Comm Tower in Firefall.
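
As a quick aside on the metric (my framing, not from the original post): bits per pixel here is presumably just the compressed size in bits divided by the number of texels, so raw DXT5 at 16 bytes per 4x4 block works out to 8bpp. A minimal helper, under that assumption:

// Minimal bpp helper, assuming bpp = compressed bits / texel count.
// Raw DXT5 is 16 bytes per 4x4 block = 8 bits per texel.
#include <stddef.h>

double bitsPerPixel(size_t compressedBytes, size_t texelCount)
{
    return (compressedBytes * 8.0) / (double)texelCount;
}

At that rate, a hypothetical 1024x1024 DXT5 texture goes from 1 MB raw to roughly 300 KB at 2.28bpp.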

Transposing DXT5

What is a transpose of the DXT5 blocks?
The conversion from an array of structures (AOS) to a structure of arrays (SOA).

// array of structures (AOS) -- one 16-byte DXT5 block per element
struct {
  unsigned char alphaLineStart;        // alpha endpoint 0
  unsigned char alphaLineEnd;          // alpha endpoint 1
  unsigned char alphaSelectionBits[6]; // 3-bit alpha index per pixel, 16 pixels
  unsigned short colorLineStart;       // RGB565 color endpoint 0
  unsigned short colorLineEnd;         // RGB565 color endpoint 1
  unsigned int colorSelectionBits;     // 2-bit color index per pixel, 16 pixels
} dxt5Blocks[1024];

// structure of arrays (SOA) -- the same fields, each gathered into its own array
struct {
  unsigned char alphaLineStart[1024];
  unsigned char alphaLineEnd[1024];
  unsigned char alphaSelectionBits[1024*6];
  unsigned short colorLineStart[1024];
  unsigned short colorLineEnd[1024];
  unsigned int colorSelectionBits[1024];
} dxt5Blocks;

Conversion to SOA should be better for compression because similar elements become adjacent to one another, rather than the default layout, where related elements with different statistical properties are interleaved.
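
As a concrete illustration of the transpose itself, here is a minimal sketch (not the actual Firefall code; the function name is made up, and the field names simply match the structs above). It is just a single pass over the blocks that scatters each field into its own array:

// AOS -> SOA transpose sketch. The SOA side uses pointers to
// caller-allocated arrays instead of fixed-size arrays.
#include <stddef.h>
#include <string.h>

typedef struct {
  unsigned char  alphaLineStart;
  unsigned char  alphaLineEnd;
  unsigned char  alphaSelectionBits[6];
  unsigned short colorLineStart;
  unsigned short colorLineEnd;
  unsigned int   colorSelectionBits;
} Dxt5BlockAOS;

typedef struct {
  unsigned char  *alphaLineStart;     // numBlocks entries
  unsigned char  *alphaLineEnd;       // numBlocks entries
  unsigned char  *alphaSelectionBits; // numBlocks * 6 bytes
  unsigned short *colorLineStart;     // numBlocks entries
  unsigned short *colorLineEnd;       // numBlocks entries
  unsigned int   *colorSelectionBits; // numBlocks entries
} Dxt5BlocksSOA;

void transposeDxt5(const Dxt5BlockAOS *in, Dxt5BlocksSOA *out, size_t numBlocks)
{
  for (size_t i = 0; i < numBlocks; ++i) {
    out->alphaLineStart[i]     = in[i].alphaLineStart;
    out->alphaLineEnd[i]       = in[i].alphaLineEnd;
    memcpy(&out->alphaSelectionBits[i * 6], in[i].alphaSelectionBits, 6);
    out->colorLineStart[i]     = in[i].colorLineStart;
    out->colorLineEnd[i]       = in[i].colorLineEnd;
    out->colorSelectionBits[i] = in[i].colorSelectionBits;
  }
}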

We have 3 textures of DXT5 blocks with 6 components for each block.
  1. color selection bits per pixel (C[0-2] bits)
  2. alpha selection bits per pixel (A[0-2] bits)
  3. a color line start (C[0-2] start)
  4. a color line end (C[0-2] end)
  5. an alpha line start (A[0-2] start)
  6. an alpha line end (A[0-2] end)
6 components times 3 textures = 18 total components.

I attempted compressing the following permutations of these components:
  • 2.28bpp - Compress all arrays at once. i.e., LZ(C0 bits, C1 bits, C2 bits, A0 bits, A1 bits, A2 bits, C0 start, C1 start, C2 start, C0 end, C1 end, C2 end, A0 start, A1 start, A2 start, A0 end, A1 end, A2 end).
  • 2.38bpp - Compress 18 different arrays individually. i.e., LZ(C0 bits), LZ(C1 bits), LZ(C2 bits), LZ(A0 bits), LZ(A1 bits), LZ(A2 bits), LZ(C0 start), LZ(C1 start), LZ(C2 start), LZ(C0 end), LZ(C1 end), LZ(C2 end), LZ(A0 start), LZ(A1 start), LZ(A2 start), LZ(A0 end), LZ(A1 end), LZ(A2 end)
  • 2.32bpp - Compress 6 different arrays, one for each DXT5 component, where the component data of all 3 textures is concatenated end to end. i.e., LZ(C[0-2] bits), LZ(A[0-2] bits), LZ(C[0-2] start), LZ(C[0-2] end), LZ(A[0-2] start), LZ(A[0-2] end)

An alternative variation is to group the color start and end points together and treat them as a pair. This gives 4 different components per DXT5 block:
  1. color selection bits per pixel (C[0-2] bits)
  2. alpha selection bits per pixel (A[0-2] bits)
  3. a color line start & end (C[0-2] line)
  4. an alpha line start & end (A[0-2] line)
4 components times 3 textures = 12 total components.

I tested the following permutations with this variation of logical DXT elements:
  • 2.27bpp - Compress all arrays at once. i.e., LZ(C bits, A bits, C line, A line)
  • 2.32bpp - Compress 12 different arrays individually. i.e., LZ(C0 bits), LZ(C1 bits), LZ(C2 bits), LZ(A0 bits), LZ(A1 bits), LZ(A2 bits), LZ(C0 line), LZ(C1 line), LZ(C2 line), LZ(A0 line), LZ(A1 line), LZ(A2 line)
  • 2.27bpp - Compress 4 different arrays, one for each component, where the data of all 3 textures is concatenated end to end. i.e., LZ(C bits), LZ(A bits), LZ(C line), LZ(A line)

There is also another variation that is better, at 2.26bpp, and also faster performance-wise: combine the color lines and alpha lines into one data set (non-interleaved), which leaves three components (sketched in code after the list below):
  1. color selection bits per pixel (C[0-2] bits)
  2. alpha selection bits per pixel (A[0-2] bits)
  3. a color/alpha line start & end (C[0-2] lines, A[0-2] lines)
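
A rough sketch of that three-stream split for one texture's worth of raw DXT5 blocks follows. This is my illustration only; none of the names come from the post, and in practice each stream would concatenate all 3 textures end to end before compression.

// Split raw DXT5 blocks (16 bytes each) into the three streams that
// compressed best. Illustrative sketch; buffer names are made up.
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned char *colorBits;  // numBlocks * 4 bytes of 2-bit color indices
    unsigned char *alphaBits;  // numBlocks * 6 bytes of 3-bit alpha indices
    unsigned char *lines;      // numBlocks * 4 bytes of color endpoints,
                               // then numBlocks * 2 bytes of alpha endpoints
} Dxt5Streams;

void splitDxt5IntoStreams(const unsigned char *blocks, size_t numBlocks,
                          Dxt5Streams *out)
{
    // Standard DXT5 block layout: bytes 0-1 alpha endpoints, 2-7 alpha
    // indices, 8-11 color endpoints (2 x RGB565), 12-15 color indices.
    unsigned char *colorLines = out->lines;
    unsigned char *alphaLines = out->lines + numBlocks * 4;
    for (size_t i = 0; i < numBlocks; ++i) {
        const unsigned char *b = blocks + i * 16;
        memcpy(&out->alphaBits[i * 6], b + 2,  6);
        memcpy(&out->colorBits[i * 4], b + 12, 4);
        memcpy(&colorLines[i * 4],     b + 8,  4);
        memcpy(&alphaLines[i * 2],     b + 0,  2);
    }
    // colorBits, alphaBits, and lines are then each handed to LZMA separately.
}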


I tried a few other variations, but overall 2.26bpp was the best.

The question still remains, though: which method described above is most complementary to the baseline? That is, which does the best job of compressing data that the baseline does not?
The conclusion...

The best complementary transpose, at 2.26bpp, is LZMA compression with the following knob settings (a sketch of setting these knobs with the LZMA SDK follows the list):
  • LZ(C[0-2] bits), default knobs
  • LZ(A[0-2] bits), default knobs
  • LZ(C[0-2] line, A[0-2] line), lc = 0, lp = 1, pb = 1
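
For reference, setting those knobs with the public LZMA SDK's one-call encoder would look roughly like the sketch below. This is my reading of the C SDK of that era (LzmaEnc.h), not the code behind the numbers above; the allocator callbacks and the LzmaEncode signature changed slightly in later SDK versions, so check your headers.

// Sketch: compress one transposed stream with lc=0, lp=1, pb=1 using the
// LZMA SDK's LzmaEncode. Illustrative only -- not the post's actual code.
#include <stdlib.h>
#include "LzmaEnc.h"

// Trivial allocator callbacks for the SDK's ISzAlloc interface.
static void *SzAlloc(void *p, size_t size) { (void)p; return malloc(size); }
static void SzFree(void *p, void *address) { (void)p; free(address); }
static ISzAlloc g_Alloc = { SzAlloc, SzFree };

static int compressLineStream(const Byte *src, SizeT srcLen,
                              Byte *dst, SizeT *dstLen)
{
    CLzmaEncProps props;
    LzmaEncProps_Init(&props);
    props.lc = 0;  // no literal context bits
    props.lp = 1;  // 1 literal position bit: even/odd bytes of the 16-bit
                   // color endpoints get separate literal statistics
    props.pb = 1;  // 1 position bit for the match/literal state contexts

    Byte propsEncoded[LZMA_PROPS_SIZE];
    SizeT propsSize = LZMA_PROPS_SIZE;
    SRes res = LzmaEncode(dst, dstLen, src, srcLen, &props,
                          propsEncoded, &propsSize, 0 /* writeEndMark */,
                          NULL, &g_Alloc, &g_Alloc);
    return res == SZ_OK;
}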

The downside is that, on average, transposing does not make compression significantly better than optimally tuned, straight-up LZMA.

The upside is that if you conditionally switch between the baseline and the best complementary transpose, you can get the size down a bit further, to 2.25bpp. A small, but expected, improvement of about 0.5%.

If, however, you do not have the option to optimally tune the LZMA knobs (for example, when transferring textures to a web browser over JavaScript via gzip), trying the best complementary layout may in some cases give better results than the more typical transpose-everything approach. As always, be sure to measure and test your own data sets.
8 Comments
Bryan McNett
7/16/2013 08:36:16 am

Nice rundown of LZMA's ability to compress DXT textures. Did you try isolating the individual color channels, too? I know you said that you didn't want to go lossy, but here's an idea nobody seems to have explored yet:

For the largest mip only, store block endpoints as a single 8 bit index, rather than 2x16 bit colors. This index indicates one of 256 variations on the 2x16 bit colors in the corresponding block in the next-largest mip.

The largest mip is now 2.5bpp lossy, and the other mips are 4bpp lossless. Since the largest mip usually appears only in extreme closeup, or just as you walk past a hunk of environment, the loss may be acceptable.

Jon Olick
7/16/2013 08:48:36 am

> Did you try isolating the individual color channels, too?
Nope, but I'll give that a shot real quick.

> palettize the endpoints
Good idea. This method is used by crunch as well I think.

> Only do lossy on the highest mip?
Also a good idea. I'm thinking lossy is the next thing to do as well, but only doing it on the highest mip is a good trade-off.

Bryan McNett
7/16/2013 10:42:59 am

>>palettize the endpoints
>Good idea. This method is used by crunch as well I think.

Read it again - I didn't mention palettizing endpoints.

Jon Olick
7/16/2013 11:24:03 am

Ah, nm. yeah. I don't believe anybody has tried that yet. I don't have easy access to the higher mip's data, so maybe not applicable to Firefall without additional RAM usage to keep that information around. Sounds like a really good area to investigate if you do have that though.

Jon Olick
7/16/2013 11:30:22 am

According to https://code.google.com/p/crunch/wiki/TechnicalDetails

crunch shares its selector/endpoint palettes between all mips. Not the same thing, but interesting to note.

cb
7/17/2013 01:51:58 am

Jon - LZMA creates sort of a weird bias with respect to transposes. Most other LZ compressors see a big win from transposes.

LZMA does not need the transposes because it has a large window. (small window LZ like ZLIB sees a big win from transposes)

LZMA also doesn't get as much win because it has position-dependent statistics; that is, the literal coder uses some bits of position to know that different byte positions have different average statistics.

A custom LZMA variant that works on a non-transposed stream and knows which literal statistics to use in each position should win.

Jon Olick
7/17/2013 02:37:22 am

That makes a lot of sense :) I was pretty surprised at how well LZMA performed, and wasn't sure exactly why.

nf
7/18/2013 10:26:07 pm

LZ on whole blocks allows it to operate like a context coder within that block; there is a strong contextual relationship between start and end points, for example. If you divide the blocks into two or more streams, you lose the intra-block context and replace it with inter-block context. By the very nature of LZ, it just can't cope with the latter that well on regular noisy DXT data.



