Oodle Texture is a new technology we've developed at RAD Game Tools which promises to dramatically shrink game sizes, reducing what you need to download and store on disk, and speeding up load times even further. Oodle Texture is specialized for what are called "block compressed textures". These are a form of compressed image data that GPUs use to provide the rendering attributes for surfaces in games. Oodle Texture works on BC1-BC7 textures, sometimes called "BCN textures". BC1 through BC7 are seven slightly different GPU formats for different bit depths and content types, and most games use a mix of BCN formats for their textures. Oodle Texture includes non-RDO encoders, which are very good maximum-quality encoders, along with RDO (Rate-Distortion Optimization) encoders, which let your textures compress further with an additional compressor such as Oodle Kraken or Zip while still maintaining extremely high quality. In this post I primarily want to cover the BC6H quality of our non-RDO maximum-quality encoder compared to a commonly used alternative.

First though, what is BC6H? BC6H is a 3-channel (RGB) half-float texture format. It turns out that BC6H is the same size as BC7, even though BC7 compresses only 8-bit data while BC6H compresses 16-bit floating point data. The magic that makes this possible is in the details of how it encodes the texture.

There are two variants of BC6H, a signed format and an unsigned format. This matters because the half-floats are encoded differently for each: in the unsigned format the half-float gets 5 bits of exponent and 11 bits of mantissa, whereas the signed format has 1 bit specifying positive or negative, 5 bits of exponent, and only 10 bits of mantissa. Thus if your data is always >= 0, you should probably use the unsigned format, as you will get better quality out of it. In the typical BC6H use cases I am aware of, the data is >= 0.

Like all other BCn formats, each texture is broken up into 4x4 blocks, and each BC6H block can be encoded with one of 14 possible encoding modes. The modes primarily specify the dynamic range (that is, the minimum and maximum value of all pixels in a block) and the precision of the block in different ways. While some of the modes can cover the entire possible range of a 16-bit half-float (at reduced, quantized encoding precision), most of them are delta encodings, where you have a base color in the dynamic range and the rest of the colors are offsets from that base color. The endpoint colors specify lines through the color space for each channel, and those lines are non-linear: the endpoints are stored as the integer values of half-floats, and those integer values are interpolated directly. That is, when you interpolate the integer representation of a half-float, you get a non-linear distribution of colors along that line. (I hope that's clear... it is kind of confusing.) It gets even more complicated; for more information on the specifics see https://docs.microsoft.com/en-us/windows/win32/direct3d11/bc6h-format

Suffice it to say, encoding these things optimally is highly non-trivial. The search space is enormous, and even the choice of how you measure what is good or not is fairly ill-defined for HDR textures. The reason is that if you just use straight-up squared error, errors in bright spots overwhelm everything around them, and the encoder prioritizes getting those exactly right. Your visual system, though, is essentially logarithmic in its intensity response -- meaning the brighter the values, the less you see small differences -- so squared error really messes up the colors on the edges of bright objects, because it treats those bright errors as just as important as the darker errors (which is not the case). Your choice of error metric in BC6H is thus very important. We spent a lot of time nailing that down, and it really shows in the quality of the results.
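To make the error-metric point a bit more concrete, here is a minimal sketch of plain squared error versus a log-space error for HDR values. This is only an illustration of the idea -- it is not the metric Oodle Texture actually uses, and the epsilon and log base are arbitrary choices for the sketch:

```cpp
#include <cmath>

// Plain squared error: a fixed-size miss on a very bright value counts just as
// much as the same miss on a dark value, even though the eye barely notices
// the bright one.
float squared_error(float a, float b) {
    float d = a - b;
    return d * d;
}

// Log-space error: compress intensities logarithmically first, so differences
// are weighted closer to how the eye perceives them.
float log_space_error(float a, float b) {
    const float eps = 1.0f / 4096.0f;  // avoid log(0) on black pixels
    float d = std::log2(a + eps) - std::log2(b + eps);
    return d * d;
}
```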
Below is my favorite example showing off the quality of Oodle Texture. Additionally, you can do what is called Rate-Distortion Optimization (RDO), which makes smarter encoding choices for a very large gain in the compressibility of the data. More on that in a future post. Charles has a really nice write-up of our RDO encoders here: https://cbloomrants.blogspot.com/2020/06/oodle-texture-slashes-game-sizes.html (Seriously, go read that and then come back.) The original maximum-quality DDS texture there can only be compressed by 2%! Here's the compression ratio table made from various RDO lambda values...
While those look identical, I assure you there are very subtle differences - but those mostly imperceptible differences make all the difference between no compression and 1.71:1 compression.
You can read more about Oodle Texture at the RAD Game Tools web site, along with the rest of the Oodle family of data compression solutions.
I spent some time recently determining the effect of Oodle on UE4 load time for various theoretical disk speeds. The Oodle compressors can speed up load times in two different ways: first, they decompress faster, taking less CPU time for decompression; second, they make the data smaller, which saves IO time. When disk speeds are slow, the smaller files that save IO time are the primary benefit. When disk speeds are very fast, using less CPU time for decompression is the main factor.

First, I patched the UE4 source to limit the number of cores used to something more reasonable than my full system. I then patched the source to artificially limit the disk IO speed to a specific rate. The data itself was loaded from a PCIe 4.0 SSD -- very, very fast -- so it needed to be artificially limited to reflect the typical performance of, say, a Blu-ray drive or a PS4/XB1 HDD. Of note, I did not emulate seek time, so seek time is assumed to be basically instant -- YMMV. Also, real-world load times will be affected by things like the disk cache, so we get more useful measurements by simulating disk speed.

Loading in UE4 is the sum of the time taken to load from disk, the time to decompress that data, and overhead time for level loading that isn't directly IO or decompression. Depending on how many cores are available, loading from disk and decompressing can sometimes be done in parallel -- for the purposes of these tests that was minimized through core affinity settings and mutexes.

What are we comparing? ZLib and Oodle. If you enable compression for pak files in Unreal, software zlib is used by default. Oodle provides a plugin that drops in and changes the pak file compression. Mostly we care about Oodle's Kraken encoder, as it has very desirable performance for its compression ratio, but I included the others (Selkie, Mermaid, Leviathan, Hydra) in my testing as well.

We are measuring three things: 1) time to first frame, 2) total time spent decoding, and 3) total time spent loading from disk. #1 is the most important overall score, but #2 and #3 tell us how much we can gain from the different Oodle compressors and which one we should use specifically.

How fast is the PS4/XB1 HDD? About 65-80 MB/s typically. How fast is a Blu-ray? About 10-20 MB/s (though seek times are horrendous). How did I measure time to first frame? With RAD Telemetry of course! :) (Seriously, an invaluable tool if you aren't familiar with it.)

How much data are we loading to get to first frame? ZLib: ~105 MB, Kraken: ~86 MB. Kraken has less data to load because of its higher compression ratio.

First up, just Zlib and Oodle time to first frame...
The time it takes to do just the decompression part (not counting disk speed - just decompression time) is also pretty interesting.
ZLib: 3.88 seconds
Kraken: 1.39 seconds

The other Oodle formats compare as follows on decompression time:

Selkie: 0.24 seconds
Mermaid: 0.64 seconds
Leviathan: 1.82 seconds
Hydra: 1 second

You heard that right: even Leviathan, the Oodle codec with LZMA-like compression ratios, is over twice as fast as Zlib here. In isolation Leviathan can decode 3x faster than Zlib; here we're timing not an ideal benchmark but the actual usage in Unreal, where the compressed buffers are sometimes small and the overhead means we don't reach the full speeds Oodle is capable of. The disk IO time, when measured, is basically the time to first frame minus the decompression time, plus a second or two depending on how many cores you have working.

In conclusion, Oodle does make a meaningful impact on load times. This is especially true for lower-end devices, which have fewer cores, and on systems with HDDs, which are typical for PC and current-gen console games. Presumably the Nintendo Switch will also benefit greatly from Oodle, since its game data is loaded from an SD card and those come in various speeds (sometimes really, really slow).

For more information on Oodle visit http://www.radgametools.com/oodle.htm
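As a rough mental model of how these numbers combine -- a back-of-the-envelope sketch only, not how UE4 actually schedules its loading, and with a made-up overhead constant -- you can think of it something like this:

```cpp
#include <algorithm>
#include <cstdio>

// Simplified model: time to first frame is roughly IO time plus decode time
// plus fixed level-load overhead. With enough spare cores IO and decode can
// overlap, so the serial sum turns into more of a max.
double estimate_ttff(double megabytes, double disk_mb_per_s,
                     double decode_seconds, bool overlapped) {
    const double overhead_seconds = 2.0;  // placeholder for non-IO, non-decode work
    double io_seconds = megabytes / disk_mb_per_s;
    double work = overlapped ? std::max(io_seconds, decode_seconds)
                             : io_seconds + decode_seconds;
    return work + overhead_seconds;
}

int main() {
    // Numbers from this post: ~105 MB and 3.88 s decode for ZLib,
    // ~86 MB and 1.39 s decode for Kraken, on a 65 MB/s HDD.
    std::printf("zlib   ~%.1f s\n", estimate_ttff(105.0, 65.0, 3.88, false));
    std::printf("kraken ~%.1f s\n", estimate_ttff(86.0, 65.0, 1.39, false));
    return 0;
}
```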
Trying out a new series of blog posts where I talk about different things I've found on GitHub that I think others might also find interesting or useful. This is the first post in the series. It may also be the last post in the series... who knows!

First up, ray tracers built in a bunch of different languages, profiled and compared: https://github.com/athas/raytracers I found this one somewhat interesting because it has language choices I was not familiar with, though it misses some other obvious (to me) choices, of course -- like straight-up C and/or C++. Note that I am in no way saying the author writes every language well for speed or clarity, so don't consider this an endorsement or anything ;). Perhaps not surprisingly, of the languages they chose to implement, Rust came out on top. Rust being a systems language made for performance similar to C, this was kind of expected. Still, it's interesting to check out Haskell and a few other uncommon language choices in there. I have a weird fondness for OCaml, in that I like to look at it from afar but have never actually used it in a real project (and I doubt I will), and I thought it was an odd choice to put in this comparison -- but maybe not! The OCaml implementation looks rather simple, but it usually kind of does, which is why I like the language.

Second, there is a database here of COVID-19 chest X-rays: https://github.com/ieee8023/covid-chestxray-dataset This could perhaps be used to train DNNs to detect the disease, so it may be useful for anybody interested in using machine learning to help with it.

Third, if you spend a lot of time in Linux, this breakdown of the command line has some pretty neat things in it, some of which I knew and some of which I did not: https://github.com/jlevy/the-art-of-command-line

Fourth, a paper repository. If you are looking for something specific, or just want to learn something new, this might be a good place to start: https://github.com/papers-we-love/papers-we-love

Fifth, PowerToys from Microsoft. This repo is a bunch of handy utilities to make your development life just a little bit easier, from right-click image resizing, to batch renaming, to new file types supported in the Explorer preview pane and more: https://github.com/microsoft/PowerToys

That's all for now! Stay safe and enjoy!

It's been a long time coming, but I finally got around to implementing subsampled U,V in the JPEG writer. This means many files are 20-30% smaller than before with very little visual quality loss. Subsampled UV is enabled automatically for quality levels <= 90. The new code functions exactly as before, with the same API. Drop it in and enjoy!
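For anyone curious what the subsampling does: each 2x2 block of U and V samples is averaged down to a single sample before encoding (4:2:0). Roughly like this sketch -- an illustration of the idea, not the actual jo_jpeg code:

```cpp
#include <vector>

// Average each 2x2 block of a chroma plane down to one sample (4:2:0).
// Assumes width and height are even; a real encoder also has to handle
// odd edges.
std::vector<float> subsample_420(const std::vector<float>& chroma,
                                 int width, int height) {
    std::vector<float> out((width / 2) * (height / 2));
    for (int y = 0; y < height; y += 2) {
        for (int x = 0; x < width; x += 2) {
            float sum = chroma[y * width + x] + chroma[y * width + x + 1] +
                        chroma[(y + 1) * width + x] + chroma[(y + 1) * width + x + 1];
            out[(y / 2) * (width / 2) + (x / 2)] = sum * 0.25f;
        }
    }
    return out;
}
```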
Oodle Lossless Image (OLI) version 1.4.7 was just released. This release has lots of improvements, specifically for palettized and 1/2-component images. Also in 1.4.7 is a basic Unity engine integration!

OLI now supports palettized images -- up to 2048 unique colors (it could go as high as 64k, but I didn't see a benefit in my test set from going higher than 2048). Implementing this was pretty interesting, in that the order of the colors in the palette matters quite a bit, and the reason is that if you get it right, it works with the prediction filters. That is, if the palettized color indexes are linearly predictable, there is a good chance you will get significantly better compression than with a random ordering. In practice this means trying a bunch of different heuristics (since computing the optimal order by brute force is prohibitively expensive). So you sort by luma, or by different channels, or by distance from the last color, for example (picking the most common color as the first one). I also implemented the mZeng palette ordering technique, which isn't commonly found in PNG compressors. Believe it or not, while that should theoretically produce really good results in most cases, sometimes the simpler heuristics win by a lot, so you can't just always use a single method when going for minimum file sizes.
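The "distance from the last color" heuristic mentioned above is essentially a greedy nearest-neighbor ordering. Here is a minimal sketch of just that one heuristic -- an illustration of the idea, not OLI's actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Color { uint8_t r, g, b; };

static int dist2(const Color& a, const Color& b) {
    int dr = a.r - b.r, dg = a.g - b.g, db = a.b - b.b;
    return dr * dr + dg * dg + db * db;
}

// Greedy ordering: start from a seed color (e.g. the most common one) and
// repeatedly append the remaining color closest to the last one appended,
// so neighboring palette indexes tend to hold similar colors.
std::vector<Color> order_by_nearest(std::vector<Color> palette, size_t seed_index) {
    std::vector<Color> ordered;
    ordered.reserve(palette.size());
    std::swap(palette[0], palette[seed_index]);
    ordered.push_back(palette.front());
    palette.erase(palette.begin());
    while (!palette.empty()) {
        auto nearest = std::min_element(palette.begin(), palette.end(),
            [&](const Color& a, const Color& b) {
                return dist2(a, ordered.back()) < dist2(b, ordered.back());
            });
        ordered.push_back(*nearest);
        palette.erase(nearest);
    }
    return ordered;
}
```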
Examples (some images I've seen used as examples on other sites). In all cases, the following arguments were used:

pngcrush -brute <input> <output>
cwebp -q 100 -lossless -exact -m 6 -mt <input> -o <output>
flif -E100 -K <input> <output>

Note that insane sometimes does slightly worse than super-duper; that happens due to the layered processes involved -- on average, insane is going to be better.

1/2-component images were just a matter of writing all the various SIMD routines to decode them. Other than that, nothing special here, except that having fewer components means smaller files and faster decoding. I may support more than 4 components in the future if there is demand for it, but for now it's 1, 2, 3 or 4 components of 8 or 16 bits per component.

There were also some general small encoding improvements. Coming soon are some new color spaces, which should further reduce file sizes, and a new encoding flag called "--insane", which actually compresses candidates to find the best choice instead of relying on heuristics in most places. I use it for dev, but it might be useful for people looking to squeeze out a few more percent in file size.

For more information on Oodle Lossless Image visit http://www.radgametools.com/oodlelimage.htm

Note: This is a work in progress and still being tested for possible distribution issues. I will update this blog post as the work progresses. Trying to simplify my life a bit over here, I am on a journey to eliminate my Mac from the build iteration cycle. The goal is to ship all binaries for both Bink and Oodle Lossless Image (OLI) entirely from my PC, rather than occasionally building on a Mac only to find that Apple broke yet another thing in the latest OSX update or iOS SDK release (seriously, stop that!).

First things first, you're going to need a toolchain. I used the toolchain from http://www.pmbaty.com/iosbuildenv/ which is claimed to be a native port of the Apple tools at opensource.apple.com/tarballs/. I also used MSYS (via http://mingw.org/) so that the same build scripts that work on OSX work nearly transparently on Windows as well, with very little modification. To build for OSX, iOS, tvOS and watchOS you are going to need some sysroots from a real Mac. You can find these, and some frameworks you are going to need, in each SDK release at the following paths
Next, use clang to build for Apple by specifying some additional parameters. The first of these is your target specification.
Second, specify your framework directory. This is located at {SDK}/System/Library/Frameworks, so it would be specified as "-F{SDK}/System/Library/Frameworks". Third, you need to specify your sysroot as "--sysroot {SDK}". The sysroot tells the compiler where your headers and libs are. That's about it for building stuff (I think?). Just use it as normal.
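Putting those pieces together, a compile line ends up looking something like the following. The target triple, deployment version, and file names here are just illustrative placeholders -- substitute whatever your project actually targets:

clang -target arm64-apple-ios12.0 --sysroot {SDK} -F{SDK}/System/Library/Frameworks -c foo.c -o foo.o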
To make a DMG file you need to do things a bit differently, since there is no hdiutil on Windows (it's closed-source Apple tech). Instead of hdiutil, you use mkisofs (you can get that with MinGW, or it's also provided right here...). The invocation would look something like
mkisofs -J -R -o {file}.dmg -mac-name -V "{title}" -apple -v -dir-mode 777 -file-mode 777 {dmg_directory}

As for signing executables, I haven't yet had to worry about that... hoping I won't! I would point you to the pmbaty iOS tools, which include an executable signer. If I missed anything, or something is not clear or not working for you, please let me know in the comments below and I'll help if I can!

A quick post about the results of my first comparison of a 2-layer fully connected network vs a DagNN. I've removed most of the random variables for this example so that the comparison is reasonably accurate. The only random variable left is the order in which samples are trained, due to SGD -- however, as I removed more and more random variables, the differences moved further in favor of the DagNN, not less. The conclusion of this test is that the DagNN is better node-for-node per epoch than the standard 2-layer fully connected network, at least in this example. This follows intuition a bit: more weights between the same number of nodes increases the overall computational power of the network.
More rigorous comparisons on some of the standard test cases still need to be done, but this is a good first step offering some preliminary credibility.

I had an idea the other day while reading a paper about passing residuals around layers to keep the gradient going in really deep networks -- to help alleviate the vanishing gradient issue. Then it occurred to me that perhaps this splitting of networks into layers is not the best way to go about it. After all, the brain isn't organized into strict layers of convolution, pooling, etc., so perhaps this is us humans trying to force structure onto an unstructured task. Thus the DagNN was born last weekend: the Directed Acyclic Graph Neural Network, or DagNN for short.
First, a quick description of how (and why) many deep neural networks are trained today, as I understand it. The vanishing gradient problem is a problem for neural networks that arises because of how back-propagation works. You take the difference between the output of a network and the desired output, take the derivative at that node, and pass it back through the network, weighted by the connections. Then you repeat for those connections on the next layer up. So you are passing a derivative of a derivative for a 1-hidden-layer network, a derivative of a derivative of a derivative for a 2-layer network, and so on. These numbers get "vanishingly" small very quickly -- so much so that you typically tend to get *worse* results with a network of 3 or more layers than with just 1 or 2.

So how do you train "deep" networks with many layers? Typically with unsupervised pre-training, usually via an auto-encoder. An auto-encoder is where you train a network one layer at a time, stacking layers on top of each other, with no specific training goal other than reproducing the input. Each time you add a layer you lock the weights of the prior layer. This means you're training a generic many-layer network to just "understand" images in general, as a combination of layered patterns, rather than to solve any particular task. Which is better than nothing, but certainly not as good as if you could actually train the *entire* network to solve a specific task (intuitively). The solution: if you could somehow pass the gradient further down into the network, then you could train it "deeper" to solve specific tasks.

Back to DagNNs. The basic premise follows that idea: if you pass the gradient further down the network, then you can train deeper networks to solve specific tasks. Win! But how? Simple: remove the whole concept of layers and just connect every node with every prior node, allowing any computation to build on any other prior computation to produce the output. This means the gradient filters through the entire network from the output in fewer hops. The way I like to think about DagNNs is the small-world phenomenon -- or the degrees of Kevin Bacon, if you prefer. You want your network to be able to get to useful information in 2-3 hops, or the gradient tends to vanish. Pro tip: if you want to bound computational complexity, limit each node to a random N prior connections.

I'm trying out this idea now, and at least initially it is showing promise. I can now train far bigger fully connected networks than I could before. I will release the source when I have more proof in the pudding -- and by proof, that means proof for me too! I need to train it on MNIST and compare results.
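To give a feel for the structure, here is a rough sketch of what a DagNN forward pass looks like under the idea described above, with each node free to read from any earlier node. This is only an illustration, not the code I'm actually running:

```cpp
#include <cmath>
#include <vector>

// One node of a DagNN: it reads from an arbitrary set of earlier nodes
// (in topological order). To bound complexity, each node could be limited
// to a random N prior connections instead of all of them.
struct DagNode {
    std::vector<int>   inputs;   // indices of earlier activations this node reads
    std::vector<float> weights;  // one weight per input
    float              bias;
};

// Forward pass: activations[0..input.size()) are the network inputs; every
// later node can build on any previous activation, so the output is only a
// few hops away from everything in the graph.
std::vector<float> forward(const std::vector<DagNode>& nodes,
                           const std::vector<float>& input) {
    std::vector<float> activations = input;
    activations.reserve(input.size() + nodes.size());
    for (const DagNode& n : nodes) {
        float sum = n.bias;
        for (size_t i = 0; i < n.inputs.size(); ++i)
            sum += n.weights[i] * activations[n.inputs[i]];
        activations.push_back(std::tanh(sum));  // squashing nonlinearity
    }
    return activations;  // the last few entries are the network outputs
}
```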
Just a quick post about the new 1.02 release of jo_mpeg.cpp. In this update the color space was fixed to be more accurate (thanks to r-lyeh for reporting this bug!). Also, fixing the above uncovered a different issue in the AC encoding code, now fixed as well. END OF LINE

Neural networks offer great promise with their ability to "create" algorithms to solve problems -- without the programmer knowing how to solve the problem in the first place. Example-based problem solving, if you will. I would expect that if you knew precisely how to solve a particular problem to the same degree, you could certainly do it perhaps many orders of magnitude faster, and possibly at higher quality, by coding the solution directly -- however, it's not always easy or practical to know such a solution.
One opportunity with NNs that I find most interesting is that, no matter how slow they are, you can use NNs as a kind of existence proof -- does an algorithm exist to solve this problem at all? Of course, when I'm talking about "problems" I'm referring to input-to-output mappings via some generic algorithm or the like. Not everything cleanly fits into this definition, but many things do. And of course there are many different kinds of NNs for solving various different kinds of problems too.

After working with NNs for a while (and anybody who has will agree), I can say that neural networks are asymmetric in complexity. That is, training a neural network to accomplish a task can take extreme amounts of time (days is common). However, executing a previously trained neural network is embarrassingly parallel and maps pretty well to GPUs. Running a NN can be done in real time if the network is simple enough!

I have spent considerable amounts of time figuring out how to train neural networks faster. The generally recommended practice these days is to use Stochastic Gradient Descent (SGD) with back-propagation (BP). What this means is you take a random piece of data out of your training set, train with it, and then repeat. SGD works, but is *incredibly* slow at converging. I endeavored to improve the training performance here (how could you not, when you spend a *lot* of time waiting...). There are many different techniques that improve upon plain BP (Adam, etc., etc.), however each of them is, in my measurements, slower: regardless of the steeper descent they provide, they take more computation to provide it, so when you measure not by epoch but by wall-clock time, it's actually slower.

So then came the theory that if you somehow knew the precise order in which to train the samples, you could perfectly train to the correct solution in some minimum amount of time. I don't know if there is a theorem about this or not, but if not, you have now heard of it -- it seemed like common sense to me. In any case, the question then becomes: is there a heuristic which can approximate this theoretical "perfect" ordering?

The first thing I tried turned out to be very hard to beat: calculate the error on all samples, sort the training order by decreasing error, and then train only the worst 25% of samples. The speedup from this approach was pretty awesome, but again I got bored waiting, so I went further. Essentially you don't waste time training the easy stuff and instead concentrate on learning the parts the network has problems with. I then tried many variations on this, but the one that ended up working even better (a 30% improvement in training time) was taking the sorted order, splitting it into three sections -- easy, medium, and hard -- and then reorganizing the training order into hard, medium, easy, hard, medium, easy, and so on. Not only did this improve the training time, it also trained to an overall lower error than without it. Another option that works pretty well is to just take the 25% highest-error samples and randomize their order. It's easier to implement and also works really well. It should also be the more robust approach overall (vs. unrandomized), as it seems to better handle training situations where the error explodes (which does happen in some cases). That's generally how I would approach a finite and small-ish data set.
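Here's a sketch of the interleaved hard/medium/easy ordering described above. This is the shape of the idea rather than my actual training code -- it assumes you've already computed a current error for every sample, and the "worst 25%" variant is then just the first quarter of the sorted order, optionally shuffled:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Build a training order for one epoch: sort samples by current error
// (hardest first), split into hard/medium/easy thirds, then interleave
// them as hard, medium, easy, hard, medium, easy, ...
std::vector<int> make_training_order(const std::vector<float>& errors) {
    std::vector<int> order(errors.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return errors[a] > errors[b]; });

    size_t third = order.size() / 3;
    std::vector<int> interleaved;
    interleaved.reserve(order.size());
    for (size_t i = 0; i < third; ++i) {
        interleaved.push_back(order[i]);              // hard
        interleaved.push_back(order[third + i]);      // medium
        interleaved.push_back(order[2 * third + i]);  // easy
    }
    // Any leftovers from the division by three go on the end.
    for (size_t i = 3 * third; i < order.size(); ++i)
        interleaved.push_back(order[i]);
    return interleaved;
}
```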
I am also developing a technique based on this that works for significantly larger data sets -- ones that cannot possibly fit in memory (hundreds or thousands of images). Thus far the setup is fairly similar, except you pick some small batch of images and do basically the same as above with that batch. There are some interesting relationships between batch size (number of images) and training time/quality. In my data set, a larger batch reduces the variance of the solution error across the training set and also appears, thus far, to reduce the number of epochs required to converge -- however it is also slower, so the jury is still out on whether a bigger batch is better. Certainly going too small makes it harder to converge on a general solution. That's all for now!