In the ever-evolving field of digital image processing, new algorithms for image upscaling are constantly emerging. One such method, EDIZ (Error Diffusion Image Zooming), proposed by Saryazdi et al., claims to offer a simple yet effective approach to image enlargement. However, as with any new technique, it's crucial to examine its merits and limitations critically. As always, you will find an example implementation at the bottom.

The Basic Premise

EDIZ is based on the idea of using error information from a basic upscaling method to enhance the final result. While this concept sounds promising on paper, let's break down the algorithm and examine its potential shortcomings.

The Algorithm: Too Simple to Be Effective?

The EDIZ algorithm consists of the following steps:
1) Downsample the original image.
2) Upsample the downsampled image back to the original size.
3) Compute the error between the original image and this reconstruction.
4) Upsample the original image to the target size.
5) Upsample and scale the error, and add it to the upsampled image.
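As a minimal sketch of those steps (not the paper's code): grayscale float image, 2x zoom, box downsampling and nearest-neighbor upsampling for brevity, with an error weight 'alpha' that is my own assumption rather than anything from the paper.

#include <stdlib.h>

// Illustrative sketch of the EDIZ steps listed above -- not the paper's implementation.
// Assumes w and h are even; 'alpha' (the error weight) is an assumed free parameter.
void ediz_2x(const float *src, int w, int h, float *dst /* 2w x 2h */, float alpha) {
    int w2 = w / 2, h2 = h / 2;
    float *lo    = (float*)malloc(sizeof(float) * w2 * h2);  // downsampled image
    float *recon = (float*)malloc(sizeof(float) * w * h);    // down-then-upsampled reconstruction
    // 1) Downsample the original (2x2 box average).
    for (int y = 0; y < h2; ++y)
        for (int x = 0; x < w2; ++x)
            lo[y*w2 + x] = 0.25f * (src[(2*y)*w + 2*x]   + src[(2*y)*w + 2*x + 1] +
                                    src[(2*y+1)*w + 2*x] + src[(2*y+1)*w + 2*x + 1]);
    // 2) Upsample it back to the original size (nearest neighbor).
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            recon[y*w + x] = lo[(y/2)*w2 + x/2];
    // 3-5) Upsample the original to 2w x 2h, then add the scaled, upsampled error.
    for (int y = 0; y < 2*h; ++y)
        for (int x = 0; x < 2*w; ++x) {
            int sx = x / 2, sy = y / 2;                        // corresponding source pixel
            float err = src[sy*w + sx] - recon[sy*w + sx];     // reconstruction error
            dst[y*(2*w) + x] = src[sy*w + sx] + alpha * err;
        }
    free(lo);
    free(recon);
}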
At first glance, this process seems logical. However, it raises several questions about its effectiveness and theoretical foundation.

Critical Analysis

1. Loss of Information
The initial downsampling step inevitably results in a loss of information. Can we really expect to recover this lost data accurately in the subsequent steps? This fundamental issue casts doubt on the algorithm's ability to truly enhance image quality. It relies on the assumption that the error behaves the same way across upscaling passes, which may be somewhat relevant, but it has no way to create new detail.

2. Error Propagation
The method relies heavily on the calculated error to improve the final image. However, this error is based on a comparison between the original image and a reconstructed version that has undergone both down- and upsampling. Isn't this introducing compounded errors rather than genuine enhancements?

3. Lack of Theoretical Foundation
While the algorithm's simplicity might be appealing, it lacks a solid theoretical foundation. Unlike more advanced super-resolution techniques based on machine learning or sophisticated signal processing theory, EDIZ seems to be built on heuristics rather than proven mathematical principles.

4. Limited Scope for Improvement
The algorithm essentially redistributes existing information. It has no mechanism to infer or generate new, high-frequency details that weren't present in the original image. This severely limits its potential for significant quality improvements, especially at higher zoom factors.

5. Potential for Artifact Introduction
By adding scaled error information to the upsampled image, isn't there a risk of introducing new artifacts? This could lead to unnatural-looking results, especially in areas of the image with fine details or textures.

Comparative Shortcomings

When compared to state-of-the-art super-resolution methods, EDIZ falls short in several respects.
Conclusion: A Step Backwards?

While EDIZ might offer a computationally efficient approach to image upscaling, its simplistic nature and lack of theoretical grounding raise serious questions about its effectiveness. In an era where machine learning and advanced signal processing techniques are pushing the boundaries of what's possible in image enhancement, EDIZ seems like a step backwards.

It's crucial for the image processing community to critically evaluate new algorithms, no matter how appealing their simplicity might be. While EDIZ might have some niche applications where computational resources are limited, it's unlikely to compete with more sophisticated methods in terms of output quality. As always in science and technology, we should remain open to new ideas, but also maintain a healthy skepticism, especially when claims of effectiveness aren't backed by robust theoretical foundations or comprehensive empirical evidence.

I'll let the results speak for themselves. It definitely enhances edges and such, but it doesn't appear to make anything sharper or add any new details.
I've been looking at various error diffusion techniques for an internal use case and was first pointed to Structure-Aware Error Diffusion, which worked pretty well but was pretty slow. I then came across a different paper: Laplacian Based Structure-Aware Error Diffusion. This article is about that technique. It's from 2010, so not super recent, but it's new to me, so here we go.

First up, why error diffusion? You use it when you have binary colors or palettized colors and such. In the internal use case it was for a binary mask of sorts. I won't go into detail on that as I don't know how much I'm supposed to reveal or not about it - so I opt for revealing nothing (sorry).

Anyways, halftoning, the process of converting grayscale images to binary (black and white) images, has long been crucial for printing and display applications. The challenge has always been to preserve as much of the original image detail and structure as possible using only black and white pixels. The technique described in this paper built upon the classic Floyd-Steinberg error diffusion algorithm but introduced key modifications to better preserve image structure.
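For reference, here's a minimal sketch of the classic Floyd-Steinberg baseline that the paper modifies (plain error diffusion with the standard 7/16, 3/16, 5/16, 1/16 weights; grayscale stored in a float buffer so the error can go negative). This is the baseline only, not the Laplacian-based variant.

// Minimal Floyd-Steinberg error diffusion: grayscale (0..255) -> binary (0 or 255), in place.
void floyd_steinberg(float *img, int w, int h) {
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float old_val = img[y*w + x];
            float new_val = old_val < 128.0f ? 0.0f : 255.0f;   // simple threshold
            float err = old_val - new_val;                      // quantization error
            img[y*w + x] = new_val;
            // Push the error onto not-yet-processed neighbors.
            if (x + 1 < w)          img[y*w + x + 1]     += err * 7.0f / 16.0f;
            if (y + 1 < h) {
                if (x > 0)          img[(y+1)*w + x - 1] += err * 3.0f / 16.0f;
                                    img[(y+1)*w + x]     += err * 5.0f / 16.0f;
                if (x + 1 < w)      img[(y+1)*w + x + 1] += err * 1.0f / 16.0f;
            }
        }
    }
}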
At the time, the researchers tested their algorithm against then-state-of-the-art halftoning methods and found it produced superior results, especially for preserving fine details and textures. Notably, it performed about 25% better on structural similarity metrics (MSSIM) for low-contrast image regions. Despite the more complex processing, the technique maintained computational efficiency similar to basic error diffusion, and ran orders of magnitude faster than some optimization-based halftoning algorithms of that era. Visual comparisons showed the algorithm's ability to faithfully reproduce subtle textures that were lost with other methods of the time. For example, in a test image of a snail, the technique clearly preserved the shell texture that was blurred out by other algorithms.
For a great many years I've had a PCM WAV file writer which is super duper simple (<20 LoC). I recently came across a git repo where somebody made a super high quality ADPCM encoder called ADPCM-XQ (stands for Xtreme Quality ADPCM Encoder/Decoder). I looked at that code, rewrote it single-file STB style, greatly improved the algorithm it uses, fixed some bugs in its algorithms, etc. Compared to the original adpcm-xq, it is a lot faster at the same quality, and it now includes a non-greedy search which further improves ADPCM quality. The improved speed is in part due to an improved search function around the current heuristic choice. The new slow mode is similar in speed to adpcm-xq while having ~20% lower error per sample.

First, some examples:
Uncompressed PCM: 0 error per sample
Fastest ADPCM encode configuration: 1201 error per sample
Fast ADPCM encode configuration: 1196 error per sample (similar quality to adpcm-xq)
Slow ADPCM encode configuration: 1007 error per sample

Usage is something like:
jo_write_wav_adpcm_slow("adpcm_slow.wav", num_channels, 44100, samples, num_samples);

You can grab the code here:
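Separately, since the post mentions the <20 LoC PCM WAV writer, a minimal 16-bit PCM writer looks roughly like the sketch below. The name and exact signature are illustrative, not the actual jo_* API; num_samples here means samples per channel, and it assumes a little-endian host (WAV is little-endian).

#include <stdio.h>
#include <stdint.h>

// Illustrative minimal 16-bit PCM WAV writer (not the actual jo_* API).
static void write_wav_pcm16(const char *path, int channels, int sample_rate,
                            const int16_t *samples, int num_samples /* per channel */) {
    FILE *f = fopen(path, "wb");
    if (!f) return;
    uint32_t data_bytes  = (uint32_t)num_samples * channels * 2;
    uint32_t byte_rate   = (uint32_t)sample_rate * channels * 2;
    uint16_t block_align = (uint16_t)(channels * 2);
    uint32_t riff_size   = 36 + data_bytes;
    uint32_t fmt_size = 16; uint16_t fmt_tag = 1, nch = (uint16_t)channels, bits = 16;
    fwrite("RIFF", 1, 4, f); fwrite(&riff_size, 4, 1, f); fwrite("WAVE", 1, 4, f);
    fwrite("fmt ", 1, 4, f); fwrite(&fmt_size, 4, 1, f); fwrite(&fmt_tag, 2, 1, f);
    fwrite(&nch, 2, 1, f); fwrite(&sample_rate, 4, 1, f); fwrite(&byte_rate, 4, 1, f);
    fwrite(&block_align, 2, 1, f); fwrite(&bits, 2, 1, f);
    fwrite("data", 1, 4, f); fwrite(&data_bytes, 4, 1, f);
    fwrite(samples, 2, (size_t)num_samples * channels, f);
    fclose(f);
}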
Quality here is just incredible IMO. This is using a GAN, so it's generating details like brush strokes and other fine details that don't actually exist in the source DVD material, filling in what it thinks the original material looked like - and the results are stunning. I wish I could show you the entire film, but of course I can't. Instead I give you a 30s clip. Behold!

I wanted to write a blog post about an unusual usage of cross products. Namely, that you can take two points and cross them together to get a line, and take two lines and cross them to get the intersection point -- and this works in 3D too! Cross three points together and you get the plane they lie on; cross three planes and you get the point they intersect at!
I learned about this a great many years ago when learning about Computer Vision techniques for epipolar geometry and solving for the Essential Matrix. In epipolar geometry, a pixel in one view is a line in another view, and given a bunch of corresponding points you can calculate the Essential Matrix with the help of a cross product matrix. This blog post is not about that specifically, but about bringing the idea over to computer graphics in general.

First up, 2D, where you have points and lines. Let's say you have 2 points p0(x,y) and p1(x,y). You first make them into 3D points by adding a 1. I'm not going to explain why you do that right now, as it would just confuse the point. Suffice it to say, p0(x,y,1) and p1(x,y,1). Then you perform a 3D cross product of those points, cross3D(p0,p1), and you get... the line equation they represent as the solution. So, the orientation of the line (x,y), and the offset from the origin (z). One thing to note though: it is not normalized. So most of the time you want to take an additional step where you take the length of (x,y) and normalize the line equation such that |(x,y)| == 1. So... (x,y,z) = (x,y,z)/|(x,y)|.

Then, you can find the intersection point of two line equations by doing a 3D cross product of those two lines! So cross3D(line0,line1) = point of intersection. Note: the intersection point is unnormalized, so you have to divide by Z.

And of course this also works in 3D, where you have three 3D points: p0(x,y,z), p1(x,y,z), p2(x,y,z). Next up, you guessed it, you add a 1 as the W coordinate, so those points become p0(x,y,z,1), p1(x,y,z,1), p2(x,y,z,1). Then you cross product those three 4D points together to get the plane equation they represent! planeEquation = cross4D(p0,p1,p2). Again though, this result is unnormalized, so you probably want to normalize it via [x,y,z,w]/|[x,y,z]|. And of course if you have 3 plane equations, you can cross product them and get the point they intersect at! point = cross4D(plane0,plane1,plane2). Note: the intersection point is unnormalized, so you have to divide by W.

For easy reference, the calculation for a 4D cross product is specified thus...

void cross4D(const float v1[4], const float v2[4], const float v3[4], float result[4]) {
    result[0] = v1[1]*(v2[2]*v3[3]-v3[2]*v2[3]) - v1[2]*(v2[1]*v3[3]-v3[1]*v2[3]) + v1[3]*(v2[1]*v3[2]-v3[1]*v2[2]);
    result[1] = v1[0]*(v3[2]*v2[3]-v2[2]*v3[3]) - v1[2]*(v3[0]*v2[3]-v2[0]*v3[3]) + v1[3]*(v3[0]*v2[2]-v2[0]*v3[2]);
    result[2] = v1[0]*(v2[1]*v3[3]-v3[1]*v2[3]) - v1[1]*(v2[0]*v3[3]-v3[0]*v2[3]) + v1[3]*(v2[0]*v3[1]-v3[0]*v2[1]);
    result[3] = v1[0]*(v3[1]*v2[2]-v2[1]*v3[2]) - v1[1]*(v3[0]*v2[2]-v2[0]*v3[2]) + v1[2]*(v3[0]*v2[1]-v2[0]*v3[1]);
}

and of course for 3D it is...

void cross3D(const float v1[3], const float v2[3], float result[3]) {
    result[0] = v1[1]*v2[2] - v1[2]*v2[1];
    result[1] = v1[2]*v2[0] - v1[0]*v2[2];
    result[2] = v1[0]*v2[1] - v1[1]*v2[0];
}
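A quick usage sketch tying the 2D case together (made-up coordinates, using the cross3D above):

// Line A through (0,0) and (1,1), line B through (0,1) and (1,0).
float p0[3] = {0,0,1}, p1[3] = {1,1,1};   // points lifted to homogeneous coords
float q0[3] = {0,1,1}, q1[3] = {1,0,1};
float lineA[3], lineB[3], hit[3];
cross3D(p0, p1, lineA);                   // lineA = (-1, 1, 0)  ->  y = x
cross3D(q0, q1, lineB);                   // lineB = ( 1, 1,-1)  ->  x + y = 1
cross3D(lineA, lineB, hit);               // unnormalized intersection point
float x = hit[0] / hit[2];                // divide by Z...
float y = hit[1] / hit[2];                // ...gives (0.5, 0.5)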
It's pretty useful, or at least I find it useful, to know this trick. I bet you will too!

For educational purposes, this is being posted as a test of the real-time SRNN from the prior blog post. Everything is pretty decent - and this version again only looks at one frame at a time and doesn't consider past or future frames for the super-resolution (thus vast room for improvement). It has issues where it magnifies/interprets DCT artifacts as meaningful - but I suppose that's to be expected. Garbage in -> garbage out. Ideally you would use this with uncompressed data.

In theory you could train a NN to remove the DCT artifacts before processing, or maybe do an all-in-one - though that's probably a diminishing-returns kind of scenario. The original file, in case the host re-encodes it in less than 4k...
The following AI Image Upscaler is built from a single-file deep neural network library (jo_nn.h) and will be able to ship to many common user targets (CPU/GLES/OpenGL/Vulkan/DX11/DX12/Metal). The plan is to release jo_nn.h once it's more feature-complete with regard to the production targets. Currently, CPU/GLES/WebGL/OpenGL/Vulkan are in good shape.
Until the release, here's an example output of image super resolution...

This is an article going over, in more detail, the technique to compress index buffers presented in my 2007/2008 presentations.
Update: Fixed some bugs in my pseudo code. It's also important to point out that your choice of index buffer optimizer greatly affects the effectiveness of this technique. Hence, look for a future post where I talk about that a bit.

The process is as follows. First clean up the data:
1) Optimize the index buffer for cache coherency (another article on this later eventually; there are better ways than the typical methods).
2) Remove degenerate triangles (triangles with two or more identical indexes).
3) Remove duplicate triangles (triangles which occur more than once in the same index buffer, even if in a different order).
4) Re-order the vertexes so that they appear in the same order in which they are first referenced by the index buffer.

Next, note that each triangle typically re-uses indexes from the previous triangle - so don't repeat yourself! The most common patterns occur in the triangle fans and strips which the index buffer optimizer creates in abundance. These usually share two indexes with the previous triangle and introduce one new index. There are other cases with two new indexes and such, but they are rare enough that we just escape all other cases to 3 new indexes. Additionally, we can losslessly rotate the indexes of a triangle to reduce the set of possibilities we need to encode. So in effect we create a 2-bit "repeat table" with an entry per triangle describing the most common repeating patterns, and we also create a new index buffer with those repeats omitted. For the 2-bit table, each value means:
0 == first index same as last triangle's first index, second index same as last triangle's third index
1 == first index same as last triangle's third index, second index same as last triangle's second index
2 == first index same as last triangle's second index, second index same as last triangle's first index
3 == escape; encode all 3 indexes as if not repeated

In code this would be something like:

// Need a 2-bit entry for each triangle, except the first one which is assumed to be "3".
// Stored one entry per byte here; the entropy coder will do the right thing later.
char *repeat_table = (char*)malloc(num_indexes/3 - 1);
char *rptbl = repeat_table;
int *unq_idxs = unique_index_buffer;
int *idxs = index_buffer;
int *idxs_end = index_buffer + num_indexes;
// First triangle is always emitted in full.
*unq_idxs++ = *idxs++;
*unq_idxs++ = *idxs++;
*unq_idxs++ = *idxs++;
for(; idxs < idxs_end; idxs += 3) {
    int *prev_tri = idxs - 3;
    bool is_idx_old[3] = {false};
    int num_old_idxs = 0;
    for(int j = 0; j < 3; ++j) {
        if(idxs[j] == prev_tri[0] || idxs[j] == prev_tri[1] || idxs[j] == prev_tri[2]) {
            is_idx_old[j] = true;
            ++num_old_idxs;
        }
    }
    // The cleanup passes (duplicate/degenerate triangle removal) guarantee this can't happen.
    assert(num_old_idxs != 3);
    // If we have fewer than 2 repeated indexes, just encode an escape into the repeat table.
    if(num_old_idxs < 2) {
        memcpy(unq_idxs, idxs, sizeof(*idxs)*3);
        unq_idxs += 3;
        *rptbl++ = 3;
        continue;
    }
    // Rotate the triangle to match one of the encodings if possible.
    // We want the two repeated indexes in the first two slots of this triangle.
    while(!is_idx_old[0] || !is_idx_old[1]) {
        // Rotate the triangle's indexes left by one (lossless).
        int tmp = idxs[0];
        memmove(idxs, idxs+1, sizeof(*idxs)*2);
        idxs[2] = tmp;
        bool btmp = is_idx_old[0];
        memmove(is_idx_old, is_idx_old+1, sizeof(*is_idx_old)*2);
        is_idx_old[2] = btmp;
    }
    if(idxs[0] == prev_tri[0] && idxs[1] == prev_tri[2]) { *unq_idxs++ = idxs[2]; *rptbl++ = 0; continue; }
    if(idxs[0] == prev_tri[2] && idxs[1] == prev_tri[1]) { *unq_idxs++ = idxs[2]; *rptbl++ = 1; continue; }
    if(idxs[0] == prev_tri[1] && idxs[1] == prev_tri[0]) { *unq_idxs++ = idxs[2]; *rptbl++ = 2; continue; }
    // No matching pattern: escape and encode all 3 indexes.
    memcpy(unq_idxs, idxs, sizeof(*idxs)*3);
    unq_idxs += 3;
    *rptbl++ = 3;
}

The next step is to notice that the indexes which are left are typically sequential (0,1,2,3,4,5,6,7,8, etc.), because we re-ordered the vertex buffer up-front specifically to make this happen. We can encode that explicitly with a 1-bit table:
0 == not sequential
1 == sequential
The code for that would look something like:

int num_unique_indexes = unq_idxs - unique_index_buffer;
int *chaos_index_buffer = (int*)malloc(num_unique_indexes * sizeof(int)); // maximum possible size
int *chaos_idxs = chaos_index_buffer;
char *chaos_bits = (char*)malloc(num_unique_indexes);
for(int i = 0, j = 0; i < num_unique_indexes; ++i) {
    if(unique_index_buffer[i] == j) {
        chaos_bits[i] = 1;
        ++j;
    } else {
        chaos_bits[i] = 0;
        *chaos_idxs++ = unique_index_buffer[i];
    }
}

Next up, notice that the differences between the remaining indexes are typically small and can be represented in a small number of bits by delta encoding. A word of warning though: this was built in the era of SPUs, where you had to chunk your data up, so depending on your data the delta encoding below may not be worth the effort. I would bet there is still an average gain for most uses.

int num_chaos_indexes = chaos_idxs - chaos_index_buffer;
// Delta-encode the indexes.
for(int i = num_chaos_indexes-1; i >= 1; --i) {
    chaos_index_buffer[i] -= chaos_index_buffer[i-1];
}

Since the values can go positive or negative, you next need to shift the data so it is all positive.

// First find the min value.
int min_idx = INT_MAX;
for(int i = 0; i < num_chaos_indexes; ++i) {
    min_idx = min_idx < chaos_index_buffer[i] ? min_idx : chaos_index_buffer[i];
}
// Then shift everything up so it's 0-based.
for(int i = 0; i < num_chaos_indexes; ++i) {
    chaos_index_buffer[i] -= min_idx;
}

Finally, you need to encode the following data:
1) min_idx
2) chaos_bits
3) chaos_index_buffer
4) repeat_table
and you have a few choices for doing that:
1) You could just throw it at Kraken/Zip and let it handle all the heavy lifting.
2) You could just pack the data in a straightforward way: min_idx is 4 bytes (or 2 bytes if this is a known 16-bit index buffer), repeat_table is 2 bits per entry, chaos_bits is 1 bit per entry, and chaos_index_buffer is some number of bits per entry, typically well under the full 16 or 32 bits - so you can just find the minimum number of bits it fits in and encode that directly (see the sketch at the end of this post).
Obviously, Kraken/Zip is probably going to get you the biggest wins, but at a higher overall computational cost versus plain bit unpacking. When Kraken encoding, it's probably beneficial to split the high and low bytes so the compressor doesn't have to do any heavy lifting there (the high bytes should almost always be zero). For now I leave the decoding of the data as an exercise for the reader - it's pretty straightforward, just doing the operations in reverse.
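And since I mentioned finding the minimum number of bits above, here's a rough sketch of that straightforward packing option (illustrative only; assumes the output buffer is zeroed and large enough):

// Finds the minimum bit width that fits the largest value, then writes values LSB-first.
// Returns the bit width so it can be stored in the header for the decoder.
static int pack_min_bits(const int *vals, int n, unsigned char *out) {
    int max_v = 0;
    for (int i = 0; i < n; ++i) max_v = vals[i] > max_v ? vals[i] : max_v;
    int bits = 1;
    while ((1 << bits) <= max_v) ++bits;              // minimum bits per value
    int bitpos = 0;
    for (int i = 0; i < n; ++i)
        for (int b = 0; b < bits; ++b, ++bitpos)
            if (vals[i] & (1 << b))
                out[bitpos >> 3] |= (unsigned char)(1 << (bitpos & 7));
    return bits;
}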
The Kodak set is great, but I needed higher-resolution versions of the images for some testing, hopefully more representative for compression than straight-up bilinear interpolation - so I ran the set through my super-resolution NN and got 2x outputs. I figure this will probably be useful to others as well, so I'm posting the data set here for you!
Oodle Texture is a new technology we've developed at RAD Game Tools which promises to dramatically shrink game sizes, reducing what you need to download and store on disk, and speeding up load times even more. Oodle Texture is specialized for what are called "block compressed textures". These are a form of compressed image data used by GPUs to provide the rendering attributes for surfaces in games. Oodle Texture works on BC1-BC7 textures, sometimes called "BCN textures". BC1-7 are seven slightly different GPU formats for different bit depths and content types, and most games use a mix of BCN formats for their textures. There are normal non-RDO encoders, which are very good maximum-quality encoders, along with RDO (Rate-Distortion Optimization) encoders, which allow your textures to compress further with an additional compressor such as Oodle Kraken or Zip while still maintaining extremely high quality. In this post, I want to primarily cover the BC6H quality of our non-RDO maximum-quality encoders compared to a commonly used alternative.

First though, what is BC6H? BC6H is a 3-channel (RGB) half-float texture format. It turns out that BC6H is the same size as BC7, even though BC7 compresses only 8-bit data while BC6H compresses 16-bit floating point data. The magic that makes this possible is in the details of how it encodes the texture. There are two variants of BC6H, a signed format and an unsigned format. This matters, because the half-floats are encoded differently for each. In the unsigned format, the half-float has 5 bits of exponent and 11 bits of mantissa, whereas the signed format has 1 sign bit, 5 bits of exponent, and only 10 bits of mantissa. Thus if your data is always >= 0, you should probably use the unsigned format, as you will get better quality out of it. In the typical use cases of BC6H that I am aware of, the data is typically >= 0.

Like all other BCn formats, each texture is broken up into 4x4 blocks, and each BC6H block can be encoded in one of multiple possible modes. There are 14 encoding modes, which primarily specify the dynamic range (that is, the minimum and maximum value of all pixels in a block) and the precision of the block in different ways. While some of the modes can cover the entire possible range of a 16-bit half-float (at reduced quantized encoding precision), most of them are delta encodings, where you have a base color in the dynamic range and the rest of the colors are offsets from that base color. The colors themselves specify non-linear lines through the color space for each channel. It's non-linear because the colors are specified as the integer bit patterns of half-floats and those integer values are interpolated directly; i.e. when you interpolate the integer value of a half-float, you get a non-linear distribution of colors along that line. (I hope that's clear... it is kind of confusing.) It gets even more complicated; for more information on the specifics see https://docs.microsoft.com/en-us/windows/win32/direct3d11/bc6h-format

Suffice it to say, encoding these things optimally is highly non-trivial. The search space is enormous, and even the choice of how you measure what is good or not is fairly ill-defined for HDR textures.
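To make the measurement problem concrete, here's a rough sketch of the general direction (illustrative only, not Oodle Texture's actual error metric): compare texels in a roughly logarithmic space instead of linear space, so errors in very bright texels don't drown out equally visible errors in darker texels.

#include <math.h>

// Illustrative HDR block error -- not Oodle Texture's actual metric.
static float hdr_block_error(const float *a, const float *b, int num_texels) {
    float err = 0.0f;
    for (int i = 0; i < num_texels * 3; ++i) {               // 3 channels per texel
        float da = log1pf(a[i] > 0.0f ? a[i] : 0.0f);        // log(1+x) roughly matches the eye's response
        float db = log1pf(b[i] > 0.0f ? b[i] : 0.0f);
        float d = da - db;
        err += d * d;                                        // squared error, but in log space
    }
    return err;
}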
Why is plain squared error a poor choice? Because errors in bright spots will overwhelm all of the surrounding data, prioritizing getting those just right, while the visual system in your eye has an essentially logarithmic intensity response -- meaning the brighter the values, the less you see small differences. Thus squared error really messes up the colors on the edges of bright objects, because it treats those bright errors as just as important as the darker errors (which is not the case). Your choice of error measure in BC6H is thus very important. We spent a lot of time nailing that down, and it really shows in the quality of the results. This is my favorite example showing off the quality of Oodle Texture.

Additionally, you can do what is called Rate-Distortion Optimization (RDO), which makes smarter encoding choices for a very large gain in the compressibility of the data. More on that in a future post. Charles has a really nice write-up of our RDO encoders here: https://cbloomrants.blogspot.com/2020/06/oodle-texture-slashes-game-sizes.html (Seriously, go read that, then come back.) The original maximum-quality DDS texture there can only be compressed by 2%! Here's the compression ratio table made from various lambda RDO values...
While those look identical, I assure you there are very subtle differences - but those mostly imperceptible differences make all the difference between no compression and 1.71:1 compression.
You can read more about Oodle Texture at the RAD Game Tools web site, along with the rest of the Oodle family of data compression solutions.