Jon Olick

Kodak Image set 2x super-resolution'd

12/11/2020

0 Comments

 
The Kodak set is great, but I needed higher-resolution versions of the images for some testing - hopefully more representative for compression work than a straight-up bilinear interpolation - so I ran the set through my super-resolution NN and got some 2x outputs. I figure this will probably be useful to others as well, so I'm posting the data set here for you!
kodak.zip
File Size: 84953 kb
File Type: zip
Download File


Oodle Texture BC6H Exposé

6/18/2020

0 Comments

 
​Oodle Texture is a new technology we've developed at RAD Game Tools which promises to dramatically shrink game sizes, reducing what you need to download and store on disk, and speeding up load times even more.

Oodle Texture is specialized for what are called "block compressed textures". These are a form of compressed image data that is used by GPUs to provide the rendering attributes for surfaces in games. Oodle Texture works on BC1-BC7 textures, sometimes called "BCN textures". The BC1-7 are seven slightly different GPU formats for different bit depths and content types, and most games use a mix of different BCN formats for their textures.

There are normal non-RDO encoders, which are very good maximum-quality encoders, along with an RDO (Rate-Distortion Optimization) mode which allows your textures to compress further with an additional compressor such as Oodle Kraken or Zip while still maintaining extremely high quality.

In this post, I want to primarily cover the BC6H quality of our non-RDO maximum quality encoders compared to a commonly used alternative.

First though, what is BC6H? BC6H is a 3-channel (RGB) half-float texture format. It turns out that BC6H is the same size as BC7, even though BC7 compresses only 8-bit data while BC6H compresses 16-bit floating point data. The magic that makes this possible is in the details of how it encodes the texture. 

There are two formats to BC6H, a signed format and an unsigned format. This matters because the half-floats are encoded differently for each. In the unsigned format, the half-float has 5 bits of exponent and 11 bits of mantissa, whereas the signed format has 1 bit specifying positive or negative, 5 bits of exponent, and only 10 bits of mantissa. Thus if your data is always >= 0, you should probably use the unsigned format, as you will get better quality out of it. In the typical use cases of BC6H that I am aware of, the data is >= 0.

Like all other BCn formats, the texture is broken up into 4x4 blocks, and each BC6H block can be encoded in one of multiple possible modes. The 14 encoding modes primarily specify, in different possible ways, the dynamic range (that is, the minimum and maximum value of all pixels in a block) and the precision of the block.

While some of the modes can cover the entire possible range of a 16-bit half float (at reduced quantized encoding precision), most of them are delta encodings, where you have a base color in the dynamic range and the rest of the colors are offsets from that base color.

The colors themselves specify non-linear lines through the color space for each channel. It's non-linear because the endpoints are specified as the integer values of half-floats, and those integer values are interpolated directly. That is, when you interpolate the integer value of a half-float, you get a non-linear distribution of colors along that line. (I hope that's clear... it is kind of confusing.)
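To make that concrete, here is a minimal sketch using standard IEEE half-floats (the real BC6H endpoint encodings and interpolation weights differ in the details, so treat this purely as an illustration): interpolating the decoded values gives a straight line, while interpolating the raw 16-bit integer patterns does not.

```cpp
// Minimal sketch using standard IEEE half-floats (NOT the exact BC6H endpoint
// format or interpolation weights): compare interpolating decoded values vs.
// interpolating the raw 16-bit integer patterns.
#include <cmath>
#include <cstdint>
#include <cstdio>

// Decode a standard IEEE 754 half-float from its 16-bit pattern.
static float decode_half(uint16_t h) {
    int sign = (h >> 15) & 1;
    int e    = (h >> 10) & 0x1F;
    int man  = h & 0x3FF;
    float v;
    if (e == 0)       v = man * powf(2.0f, -24.0f);                    // subnormal
    else if (e == 31) v = man ? NAN : INFINITY;                        // NaN / infinity
    else              v = (1.0f + man / 1024.0f) * powf(2.0f, (float)(e - 15));
    return sign ? -v : v;
}

int main() {
    uint16_t a = 0x3C00;  // 1.0 as a half-float
    uint16_t b = 0x4C00;  // 16.0 as a half-float
    for (int i = 0; i <= 4; i++) {
        float t = i / 4.0f;
        // Interpolating the decoded values: a straight line from 1 to 16.
        float valueLerp = decode_half(a) * (1 - t) + decode_half(b) * t;
        // Interpolating the raw integer bit pattern, as the text above describes.
        uint16_t bits = (uint16_t)(a + (b - a) * t + 0.5f);
        printf("t=%.2f  value-lerp=%6.2f  bit-lerp=%6.2f\n", t, valueLerp, decode_half(bits));
    }
    return 0;
}
```

With endpoints 1.0 and 16.0, the value-lerp column reads 1, 4.75, 8.5, 12.25, 16 while the bit-lerp column reads 1, 2, 4, 8, 16 - same endpoints, but a very different (roughly geometric) distribution of colors along the line.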

It gets even more complicated; for more information on the specifics see https://docs.microsoft.com/en-us/windows/win32/direct3d11/bc6h-format

Suffice it to say, encoding these things optimally is highly non-trivial. The search space is enormous, and even the choice of how you measure what is good or not is fairly ill-defined for HDR textures. The reason is that if you just use straight-up squared error, errors in bright spots overwhelm all of the surrounding data, and the encoder prioritizes getting those just right - while the visual system in your eye is essentially logarithmic in intensity response, meaning the brighter the values, the less you see small differences. Squared error thus really messes up the colors on the edges of bright objects, since it treats those bright errors as just as important as the darker errors (which is not the case). Your choice of error measure in BC6H is therefore very important. We spent a lot of time nailing that down, and it really shows in the quality of results.
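To give a feel for why the error measure matters so much, here is a toy comparison (explicitly not Oodle's metric, which I'm not describing here) of plain squared error against an error measured in a log-like domain:

```cpp
// Illustration only - this is NOT Oodle Texture's actual error metric, just a
// toy comparison of plain squared error vs. squared error in a log-like domain.
#include <cmath>
#include <cstdio>

// Plain squared error treats a miss on a very bright pixel the same as an
// equally sized miss on a dark one, so bright errors dominate the total.
float sqErr(float ref, float enc)    { float d = ref - enc; return d * d; }

// Measuring in a log-like domain roughly matches the eye's logarithmic
// response: errors are judged relative to the local intensity.
float logSqErr(float ref, float enc) {
    float d = logf(1.0f + ref) - logf(1.0f + enc);
    return d * d;
}

int main() {
    // A 10% miss on a dark pixel vs. a 10% miss on a very bright HDR pixel.
    printf("dark   0.1 -> 0.11  : sq=%g  log=%g\n", sqErr(0.1f, 0.11f),    logSqErr(0.1f, 0.11f));
    printf("bright 1000 -> 1100 : sq=%g  log=%g\n", sqErr(1000.f, 1100.f), logSqErr(1000.f, 1100.f));
    // With plain squared error the bright miss outweighs the dark one by ~10^8;
    // in the log domain the gap collapses to roughly 100x.
    return 0;
}
```

With plain squared error the bright miss dominates the dark one by a factor of about 10^8; in the log domain the gap shrinks to roughly 100x, which is much closer to how visible the two errors actually are.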

This is my favorite example showing off the quality of Oodle Texture. 
[Image: Source Data (Ground Truth)]
[Image: Common BC6H alternative encoder with max quality setting]
[Image: Oodle Texture non-RDO maximum quality encoding]
Additionally, you can do what is called Rate-Distortion Optimization (RDO) which will make smarter encoding choices for a very large gain in compressibility of the data. More on that in a future post.

Charles has a really nice write-up of our RDO encoders here: https://cbloomrants.blogspot.com/2020/06/oodle-texture-slashes-game-sizes.html

(Seriously, go read that then come back)

The original maximum quality DDS texture there can only be compressed by 2%!

Here's the compression ratio table made from various lambda RDO values...
non-RDO:       1.02:1
RDO lambda=10: 1.35:1
RDO lambda=30: 1.64:1
RDO lambda=40: 1.71:1
[Image: Oodle Texture non-RDO]
[Image: Oodle Texture RDO lambda=10]
[Image: Oodle Texture RDO lambda=30]
[Image: Oodle Texture RDO lambda=40]
While those look identical, I assure you there are very subtle differences - but those mostly imperceptible differences make all the difference between no compression and 1.71:1 compression. 

You can read more about Oodle Texture at the RAD Game Tools web site, along with the rest of the Oodle family of data compression solutions.

Oodle and UE4 Loading Time

5/14/2020

0 Comments

 
I spent some time recently determining the effect of Oodle on UE4 Load Time for various theoretical disk speeds. 

The Oodle compressors can speed up load time in two different ways: first, they decompress faster, taking less CPU time for decompression; second, they make the data smaller, which saves IO time. When disk speeds are slow, smaller files that save IO time are the primary benefit. When disk speeds are very fast, using less CPU time for decompression is the main factor.

First, I patched the UE4 source to limit the number of cores used to something more reasonable than everything the system has. I then patched the source to artificially limit the disk IO speed to something specific. The data itself was loaded from a PCIe 4 SSD - so very, very fast - and needed to be artificially limited to reflect the typical performance of, say, a Blu-ray or a PS4/XB1 HDD.

Of note, I did not emulate seek time, so seeks are assumed to be basically instant - YMMV. Real-world load times will also be affected by things like the disk cache, which is another reason we get more useful measurements by simulating disk speed.

Loading in UE4 is the sum of the time taken to load from disk, the time to decompress that data, and overhead time for level loading that's not directly IO or decompression. Depending on how many cores are available, loading from disk and decompressing can sometimes happen in parallel; for the purposes of these tests that overlap was minimized through core affinity settings and mutexes.

What are we comparing? 
ZLib and Oodle. If you enable compression for pak files in Unreal, software zlib is used by default. Oodle provides a plugin that drops in and changes the pak file compression. Mostly we care about Oodle's Kraken encoder as it has very desirable perf for compression ratio, but I included the others (Selkie, Mermaid, Leviathan, Hydra) as well in my testing.

We are measuring three things here:
1) We want to know time to first frame.
2) We want to know how much time total was spent decoding. 
3) We want to know how much time total was spent loading from disk. 

#1 is the most important overall score, but #2 and #3 inform us about how much we can gain from the different options of Oodle Compressors and which one we should use specifically. 

How fast is the PS4/XB1 HDD?
About 65-80 MB/s typical.

How fast is a Blu-ray?
About 10-20 MB/s (though seek times are horrendous)

How did I measure time to first frame?
With RAD Telemetry of course! :) (Seriously invaluable tool if you aren't familiar)

How much data are we loading to get to first frame?
ZLib: ~105 MB
Kraken: ~86 MB

Kraken has less data to load because of higher compression ratio.

First up, just Zlib and Oodle time to first frame...
Time to first frame (seconds) by simulated read speed:

Read speed                  Zlib   Kraken  Selkie  Mermaid  Leviathan
16 MB/s (Blu-ray est)       10     7       8.5     7        7
64 MB/s (PS4/XB1 HDD est)   7      3.75    4       4        4.75
512 MB/s (fast PC)          7      3.75    3.5     3.5      4.5
The time it takes to do just the decompression part (not counting disk speed - just decompression time) is also pretty interesting. 
ZLib: 3.88 seconds
Kraken: 1.39 seconds

The other Oodle formats here are as follows with regards to decompression time...
Selkie: 0.24 seconds
Mermaid: 0.64 seconds
Leviathan: 1.82 seconds
Hydra: 1 second

^^ you heard that right. Even Leviathan, Oodle's LZMA-like-ratio option, decompresses more than twice as fast as Zlib here...

In isolation Leviathan can decode 3x faster than Zlib; here we're timing not an ideal benchmark but the actual usage in Unreal, where the compressed buffers are sometimes small and the overhead means we don't reach the full speeds Oodle is capable of.

The disk IO time, when measured, is basically the time to first frame minus the decompression time, plus a second or two depending on how many cores you have working.
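As a rough sanity check on that relationship, here is a back-of-the-envelope model of the serialized path (a sketch with a guessed fixed overhead, not the actual UE4 timing code). Plugging in the sizes and decode times above only gets you in the ballpark of the measured table, since real loads overlap some of this work.

```cpp
// Back-of-the-envelope model of the serialized case measured here; this is an
// illustration with a made-up overhead constant, not UE4's actual loading code.
#include <cstdio>

// Time to first frame ~= IO time + decompression time + level-load overhead.
double timeToFirstFrame(double compressedMB, double diskMBps,
                        double decompressSec, double overheadSec) {
    return compressedMB / diskMBps + decompressSec + overheadSec;
}

int main() {
    const double overhead = 2.0;                 // guess at the non-IO, non-decode work
    const double disks[] = { 16.0, 64.0, 512.0 };
    for (double disk : disks) {
        // ~105 MB and 3.88 s decode for Zlib, ~86 MB and 1.39 s decode for Kraken,
        // taken from the measurements in this post.
        printf("%4.0f MB/s   Zlib %5.2f s   Kraken %5.2f s\n", disk,
               timeToFirstFrame(105.0, disk, 3.88, overhead),
               timeToFirstFrame(86.0,  disk, 1.39, overhead));
    }
    return 0;
}
```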

In conclusion, Oodle does make a meaningful impact on load times. This is especially true for lower-end devices with fewer cores, and on systems with HDDs, which are typical for PC and current-gen console games. Presumably the Nintendo Switch will also benefit greatly from Oodle, since game data is loaded from an SD card and those come in various speeds (sometimes really, really slow).

For more information on Oodle visit http://www.radgametools.com/oodle.htm


It Came From Github #1

4/15/2020

0 Comments

 
Trying out a new series of blog posts where I talk about different things that I found on github that I think others might also find interesting/useful. This being the first post in the series. This may also be the last post in the series... who knows! 

First up, ray-tracers built in a bunch of different languages - profiled and compared. 
 https://github.com/athas/raytracers 
I found this one somewhat interesting because it has language choices I was not familiar with, though it missed some other obvious (to me) choices - like straight-up C and/or C++. Note that I am in no way saying the author wrote every language's version well for speed or clarity, so don't consider this an endorsement or anything ;). Perhaps not surprisingly, of the languages they chose to implement, Rust came out on top. Rust being a systems language made for performance similar to C, this was kind of expected. Still, it's interesting to check out Haskell and a few other uncommon language choices in there. I have a weird fondness for OCaml, in that I like to look at it from afar but have never actually used it in a real project (and I doubt I will), and I thought it was an odd choice to put in this comparison - but maybe not! The implementation in OCaml looks rather simple, but it usually kind of does, which is why I like the language.

Second, there is a database here of Covid-19 chest xrays. 
https://github.com/ieee8023/covid-chestxray-dataset
This could perhaps be used by some DNNs to train for detecting the disease - so perhaps useful for anybody who is interested in using machine learning to help with this disease.

Third, if you spend a lot of time in linux - this breakdown of the command line has some pretty neat things in here - some of which I knew and some of which I did not.
https://github.com/jlevy/the-art-of-command-line

Fourth, a paper repository. If you are looking for something specific, or just want to learn something new, this might be a good place to start! 
https://github.com/papers-we-love/papers-we-love

Fifth, power toys from Microsoft. In this repo is a bunch of handy utilities to make your development life just a little bit easier. From right-click image resizing, to batch renaming, to new file types supported in the explorer preview pane and more... 
https://github.com/microsoft/PowerToys

That's all for now! Stay safe and Enjoy! 

jo_jpeg Release 1.60

11/27/2019

0 Comments

 
It's been a long time coming, but I finally got around to implementing sub-sampling of U,V in the JPEG writer. This means many files are 20-30% smaller than before with very little visual quality loss. Sub-sampled UV is enabled automatically for quality levels <= 90. The new code functions exactly like it did before, with the same API as before. Drop it in and enjoy!
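For anyone curious what subsampling the chroma actually means, here is a sketch of the idea (the concept, not the literal code inside jo_jpeg.cpp): each 2x2 block of U and V samples is reduced to one sample, so the chroma planes carry a quarter of the data before they ever hit the DCT.

```cpp
// Sketch of 4:2:0 chroma subsampling - the idea, not the literal jo_jpeg code.
// Each 2x2 block of a chroma plane (U or V) is averaged into a single sample,
// quartering the amount of chroma data that gets DCT'd and entropy coded.
#include <vector>

std::vector<float> subsampleChroma(const std::vector<float>& plane, int w, int h) {
    // Assumes even width/height for brevity; real code has to handle the odd edges.
    int w2 = w / 2, h2 = h / 2;
    std::vector<float> out(w2 * h2);
    for (int y = 0; y < h2; y++)
        for (int x = 0; x < w2; x++)
            out[y * w2 + x] = 0.25f * (plane[(2*y)   * w + (2*x)] + plane[(2*y)   * w + (2*x+1)] +
                                       plane[(2*y+1) * w + (2*x)] + plane[(2*y+1) * w + (2*x+1)]);
    return out;
}
```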
jo_jpeg.cpp
File Size: 19 kb
File Type: cpp
Download File


Oodle Lossless Image v1.4.7

4/12/2019

0 Comments

 
Oodle Lossless Image (OLI) version 1.4.7 was just released. This release has lots of improvements, specifically for palettized and 1- and 2-component images. Also new in 1.4.7 is a basic Unity engine integration!

OLI now supports palettized images -- up to 2048 unique colors (it could go as high as 64k, but I didn't see a benefit in my test set to going higher than 2048). Implementing this was pretty interesting in that the order of the colors in the palette matters quite a bit - the reason being that if you get it right, the indices work well with the prediction filters. That is, if the palettized color indexes are linearly predictable, there is a good chance you will get significantly better compression than with a random ordering. In practice this means trying a bunch of different heuristics (since computing an optimal ordering brute-force is prohibitively expensive). So you sort by luma, or by different channels, or by distance from the last color, for example (picking the most common color as the first one). I also implemented the mZeng palette ordering technique, which isn't commonly found in PNG compressors. Believe it or not, while that theoretically should produce really good results in most cases, sometimes the simpler heuristics win by a lot, so you can't always use a single method when going for minimum file sizes.
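As an illustration of the simplest of those heuristics (a sketch of the idea, not the OLI code): sorting the palette by luma tends to make neighboring indices map to visually similar colors, which is exactly what the linear predictors want.

```cpp
// Sketch of the "sort by luma" palette-ordering heuristic (not the OLI code):
// neighboring palette indices end up mapping to similar colors, so the index
// plane becomes much friendlier to the prediction filters.
#include <algorithm>
#include <cstdint>
#include <vector>

struct RGBA { uint8_t r, g, b, a; };

void sortPaletteByLuma(std::vector<RGBA>& palette) {
    auto luma = [](const RGBA& c) {
        return 0.299f * c.r + 0.587f * c.g + 0.114f * c.b;  // BT.601 luma weights
    };
    std::sort(palette.begin(), palette.end(),
              [&](const RGBA& a, const RGBA& b) { return luma(a) < luma(b); });
    // The image's palette indices then need to be remapped to the new order.
}
```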

Examples (some images I've seen used as examples on other sites):
[Image 1]
Source PNG: 36447
pngcrush -brute: 39018
WebP: 25042
FLIF: 24179
OLI --insane: 18909
OLI --super-duper: 18813

[Image 2]
Source PNG: 21603
pngcrush -brute: 21609
WebP: 18978
FLIF: 16114
OLI --insane: 14709
OLI --super-duper: 14881


In all cases, the following arguments were used
pngcrush -brute <input> <output>
cwebp -q 100 -lossless -exact -m 6 -mt <input> -o <output>
flif -E100 -K <input> <output>

Note that --insane sometimes does slightly worse than --super-duper; this happens due to the layered processes involved - on average, --insane is going to be better.

1- and 2-component images were just a matter of writing all the various SIMD routines to decode them. Other than that, nothing special here, except that having fewer components means smaller files and faster decoding. I may support more than 4 components in the future if there is demand for that, but for now it's 1, 2, 3, or 4 components of 8 or 16 bits per component.

There were also some general small encoding improvements. Coming up soon are some new color spaces which should further reduce file size. Also worth calling out is the new encoding flag "--insane", which actually compresses things in most places instead of using heuristics to find what's the best thing to do. I use this for dev, but it might be useful for people looking to squeeze out a few more percent in file sizes.

For more information on Oodle Lossless Image visit 
http://www.radgametools.com/oodlelimage.htm

Building for OSX,iOS,tvOS & watchOS on Windows

8/1/2018

0 Comments

 
Note: This is a work-in-progress and still being tested for possible distribution issues. I will update this blog post as the work progresses. 

Trying to simplify my life a bit over here, I am on a journey to eliminate my Mac from the build iteration cycle. The goal is to completely ship all binaries for both Bink and Oodle Lossless Image (OLI) directly from my PC rather than occasionally building on a mac only to find that Apple broke yet another thing in the latest OSX update or iSDK release (seriously, stop that!). 

First things first, you're gonna need a toolchain. I used the toolchain from http://www.pmbaty.com/iosbuildenv/ which is claimed to be a native port of the Apple tools from opensource.apple.com/tarballs/.

I also used MSYS (via http://mingw.org/) so that the same build scripts that work on OSX also work nearly transparently on Windows with very little modification.
To build for OSX, iOS, tvOS and watchOS you are going to need some sysroots from a real mac. 

You can find these and some frameworks you are going to need in each SDK release at the following paths
  • /Applications/Xcode.app/Contents/Developer/Platforms/AppleTVOS.platform/Developer/SDKs/AppleTVOS{version}.sdk
  • /Applications/Xcode.app/Contents/Developer/Platforms/AppleTVSimulator.platform/Developer/SDKs/AppleTVSimulator{version}.sdk
  • /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS{version}.sdk
  • /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator{version}.sdk
  • /Applications/Xcode.app/Contents/Developer/Platforms/WatchOS.platform/Developer/SDKs/WatchOS{version}.sdk
  • /Applications/Xcode.app/Contents/Developer/Platforms/WatchSimulator.platform/Developer/SDKs/WatchSimulator{version}.sdk
  • /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX{version}.sdk

Next use clang to build for Apple by specifying some additional parameters.

The first of which is your target specification.
  • When building for iOS use "--target=arm-apple-ios10"
  • When building for tvOS use "--target=arm-apple-tvos10"
  • When building for watchOS use "--target=arm-apple-watchos10"
  • When building for iPhoneSimulator use "--target=x86_64-apple-ios10"  
  • When building for AppleTVSimulator use "--target=x86_64-apple-tvos10"  
  • When building for WatchSimulator use "--target=x86_64-apple-watchos10"  
  • When building for 32-bit OSX use "--target=x86-apple-darwin-macho"
  • When building for 64-bit OSX use "--target=x86_64-apple-darwin-macho"

Second, specify your framework directory. This is located in your {SDK}/System/Library/Frameworks directory, so would be specified as "-F{SDK}/System/Library/Frameworks"

Third, you need to specify your sysroot as "--sysroot {SDK}". The sysroot tells the compiler where your headers and libs are. 
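Putting those together, an iOS build invocation would look something like the following (mysource.c is just a placeholder, and {SDK} points at whichever sysroot you copied over):

clang --target=arm-apple-ios10 --sysroot {SDK} -F{SDK}/System/Library/Frameworks -O2 -c mysource.c -o mysource.o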

That's about it for building stuff (I think?). Just use as normal.

To make a DMG file you need to do things a bit differently, since there is no hdiutil on Windows (it is closed-source Apple tech).

Instead of hdiutil, you use mkisofs (you can get that with MinGW, or it is also provided right here)...
mkisofs.exe
File Size: 344 kb
File Type: exe
Download File

The invocation would look something like:

mkisofs -J -R -o {file}.dmg -mac-name -V "{title}" -apple -v -dir-mode 777 -file-mode 777 {dmg_directory}

As for signing executables, I haven't yet had to worry about that... hoping I won't! I would point you to the pmbaty iOS tools, which include an executable signer.

If I missed anything, or something is not clear or not working for you, please let me know in the comments below and I'll help if I can!

DagNN vs Standard 2-Layer fully connected networks

4/11/2017

0 Comments

 
A quick post about the results for my first comparison here of a 2-layer fully connected network vs a DagNN. 

I've removed most of the random variables here for this example so that the comparison is pretty accurate. The only random variable left is the order in which things are trained due to SGD - however, as I removed more and more random variables the differences got more in favor of DagNN and not less.

The conclusion of this test is that DagNN is better node-for-node per epoch than the standard 2-layer fully connected network - at least in this example.
[Image: 64 * 2-layer fully connected network. Solution error at epoch 300: 685.]
[Image: 128-node DagNN fully connected network. Solution error at epoch 300: 380.]
This at least follows intuition a bit, that more weights between the same number of nodes increases overall computational power of the network.

More rigorous comparisons on some of the standard test cases need to be done, but this is a good first step offering some preliminary credibility.

DagNN - a "deeper" fully connected network

4/3/2017

0 Comments

 
I had an idea the other day while reading a paper about passing residuals around layers to keep the gradient going in really deep networks - to help alleviate the vanishing gradient issue. It occurred to me that perhaps this splitting of networks into layers is not the best way to go about it. After all, the brain isn't organized into strict layers of convolution, pooling, etc., so perhaps this is us humans forcing structure onto an unstructured task. Thus the DagNN - Directed Acyclic Graph Neural Network - was born over last weekend.

First, a quick description of why/how many Deep Neural Networks are trained today as I understand it. 

The vanishing gradient problem is a problem for neural networks that arises from how back-propagation works. You take the difference between the output of the network and the desired output, take the derivative at that node, and pass it back through the network weighted by the connections. Then you repeat for the connections on the next layer up. So you are passing a derivative of a derivative for a 1-hidden-layer network, a derivative of a derivative of a derivative for a 2-layer network, and so on. These numbers get "vanishingly" small very quickly (with sigmoid activations, for example, each extra layer multiplies the gradient by a derivative that is at most 0.25) - so much so that you typically get *worse* results with a network of 3 or more layers than with just 1 or 2.

So, how do you train "deep" networks with many layers? Typically with unsupervised pre-training, usually via an auto-encoder. With an auto-encoder you train the network one layer at a time, stacking layers on top of each other, with no specific training goal other than reproducing the input. Each time you add a layer, you lock the weights of the prior layers.

This means you're training a generic many-layer network to just "understand" images in general, as a combination of layered patterns, rather than to solve any particular task. Which is better than nothing, but (intuitively) certainly not as good as if you could actually train the *entire* network to solve a specific task.

The solution: If you could somehow pass the gradient further down into the network, then you can train it "deeper" to solve specific tasks. 

Back to DagNNs. 

The basic premise follows the idea that if you pass the gradient further down the network, then you can train deeper networks to solve specific tasks. Win! But how?

Simple: remove the whole concept of layers and just connect every node with every prior node, allowing any computation to build on any prior computation to produce the output. This means the gradient filters through the entire network from the output in fewer hops. The way I like to think about DagNNs is the small-world phenomenon - or the degrees of Kevin Bacon, if you prefer. You want your network to be able to get to useful information in 2-3 hops, or the gradient tends to vanish.

Pro tip: if you want to bound computational complexity, limit each node to a random set of N prior connections.
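Here is a sketch of what that connectivity looks like in code - my reading of the idea, not the actual (unreleased) implementation: nodes live in one topological order, the first slots are the inputs, and every later node reads all earlier nodes.

```cpp
// Sketch of the DagNN connectivity idea - my reading of it, not the actual
// (unreleased) implementation. Nodes live in one topological order; the first
// numInputs slots are the inputs, and every later node reads all earlier nodes.
#include <cmath>
#include <vector>

struct DagNN {
    int numInputs = 0;
    int numNodes  = 0;                          // includes the input slots
    std::vector<std::vector<float>> w;          // w[k][j] = weight from node j (< k) into node k
    std::vector<float> bias;                    // one bias per node

    std::vector<float> forward(const std::vector<float>& input) const {
        std::vector<float> act(numNodes, 0.0f);
        for (int i = 0; i < numInputs; i++) act[i] = input[i];
        for (int k = numInputs; k < numNodes; k++) {
            float sum = bias[k];
            for (int j = 0; j < k; j++)         // every prior node feeds this one
                sum += w[k][j] * act[j];
            act[k] = std::tanh(sum);            // any activation; tanh for the sketch
        }
        return act;                             // read the outputs off the last nodes
    }
};
```

The pro tip above would just restrict that inner loop to a random subset of N earlier nodes instead of all of them.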

I'm trying out this idea now and at least initially it is showing promise. I can now train far bigger fully connected networks than I could before. I will release source when I have more proof in the pudding - and by proof, that means proof for me too! I need to train it on MNIST and compare results.



MPEG1/2 Encoder Release 1.02

3/22/2017

2 Comments

 
Just a quick post about the new 1.02 release of jo_mpeg.cpp

In this update the color space was fixed to be more accurate. (Thanks to r-lyeh for reporting this bug!)

Also, fixing the above uncovered a different issue in the AC encoding code, now fixed as well.

END OF LINE
jo_mpeg.cpp
File Size: 9 kb
File Type: cpp
Download File

