JPG format sucks. We can do something about it
We all use JPG photos on our websites and apps. In this post I’ll show how you can reduce image sizes by an additional 20–50% with a single line of code. This is accomplished by carefully analyzing the way JPG works and changing its logic. If you don’t care about the details and just want to fiddle with a live demo, you can try out the result for yourself right here.
Consider a bear
Can you tell the difference between the two bear photos below? The first is a photo compressed with JPG at quality=90. The second is still a JPG photo that, for the life of me, looks identical, yet takes up half the space.
You can download them to see for yourself: view them in full screen and check their file sizes. No trickery involved; the output is still a JPG image that looks the same and takes up much less space. If you want to verify I'm not playing tricks with some cherry-picked example, you can play around and upload your own photos right here.
How come JPG sucks?
I remember almost 15 years ago, when I was an undergrad in Electrical Engineering, sitting in a classroom being taught how JPEG works. I was in awe at the ingenuity of the engineers who handcrafted a method that takes an image and describes its content with far fewer numbers than you'd get by, say, enumerating the value of each pixel independently.
I wasn't wrong in being impressed. Despite the provocative title of this post, I'm actually still impressed with the engineers who in the 1980s figured out a pretty good way to lower an image's file size while making only visually minor changes. But they left a lot of optimization opportunities untouched. And to see that, we need to understand how JPG works.
How does JPG work?
You can skip this portion if you're not interested in a trip down memory lane, but in order to improve JPG you really need to understand how it works. So I'm going to briefly skim over what happens in regular old JPG compression. All the numeric examples are ripped straight off of Wikipedia, which has an excellent writeup on this.
Step 1: divide the image into 8x8 blocks of pixels
JPG applies the same logic to every 8x8 block of pixel values, and to each color channel independently. So you can understand JPG just by looking at what it does to a single 8x8 block.
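In code, the block splitting is about as simple as it sounds. Here's a minimal NumPy sketch of it (a rough illustration, not an actual encoder), assuming the channel's dimensions are multiples of 8; a real encoder pads the edges:

```python
import numpy as np

def to_blocks(channel: np.ndarray) -> np.ndarray:
    """Split one color channel (H x W) into 8x8 blocks.
    Assumes H and W are multiples of 8; real encoders pad the edges."""
    h, w = channel.shape
    blocks = channel.reshape(h // 8, 8, w // 8, 8).swapaxes(1, 2)
    return blocks.reshape(-1, 8, 8)

# A fake 16x16 grayscale channel becomes four 8x8 blocks.
channel = np.arange(256, dtype=np.uint8).reshape(16, 16)
print(to_blocks(channel).shape)  # (4, 8, 8)
```

For example, let's look at one block and pretend these are its pixel values: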
Step 2: apply a DCT transform to the block, and round away low values for some components
The DCT transform is a very close cousin of the Fourier transform. It transforms pixel intensity values into spatial frequency coefficients. You can read about it here, but for our purposes we can focus on the upshot: after applying the DCT, you end up with another 8x8 block of values, but they tend to be larger in the upper left part of the block. For the example above, you'd get the following 8x8 values:
Not only do the values on the bottom right tend to be smaller, but they also tend to describe information about the block that is often less visually noticeable when omitted! This is because these values describe changes in the pixel intensity that go back and forth very quickly between neighboring pixels, and our visual system tends to “average them out”.
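If you want to see what this step looks like in code, here's a sketch using SciPy (an illustration of the standard math, including the usual shift of pixel values from [0, 255] down to [-128, 127] before the transform):

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block: np.ndarray) -> np.ndarray:
    """2D DCT-II of an 8x8 block of pixel values."""
    shifted = block.astype(np.float64) - 128.0   # center values around zero
    return dct(dct(shifted, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(coeffs: np.ndarray) -> np.ndarray:
    """Inverse transform: back from coefficients to pixel values."""
    return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho') + 128.0
```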
So engineers in the 1980s sat down and came up with something called a "quantization table". It's just a bunch of constants that express our belief that keeping track of the values in the upper left corner of the DCT coefficient block is more important than keeping information about what goes on in the bottom right.
We're just going to divide each value in the block by its corresponding quantization table entry and round the result. The table uses smaller divisors on the top left, and larger ones on the bottom right.
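In code, this step is just an element-wise divide and round. Here's a sketch using the standard JPEG luminance table (the quality-50 table from the spec, the same one used in the Wikipedia example):

```python
import numpy as np

# The standard JPEG luminance quantization table (quality 50).
Q50 = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

def quantize(dct_block: np.ndarray, table: np.ndarray = Q50) -> np.ndarray:
    """Divide each DCT coefficient by its table entry and round to the nearest integer."""
    return np.round(dct_block / table).astype(np.int32)

def dequantize(q_block: np.ndarray, table: np.ndarray = Q50) -> np.ndarray:
    """What the decoder does: multiply back. The rounding error is the 'lossy' part."""
    return q_block * table
```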
This process is done regardless of the content of the image, and regardless of how visually noticeable this rounding is in context. It just captures the intuition of engineers about which coefficients are going to be less noticeable than others, and doesn't take into account at all what the actual content of this specific image is.
Here’s the result when you do this division:
Get it? You get a bunch of 0s, which you can get away with not saving. The few values that aren't 0 tend to repeat or be very small, which takes fewer bits to save. There's a little more detail, like running Huffman encoding and run-length encoding, but those are details that exceed the scope of this post.
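If you're curious why all those zeros are so cheap, here's a toy sketch (not the real entropy coder) that serializes the block in the zigzag order JPEG uses and records how many zeros sit in front of each nonzero coefficient:

```python
def zigzag_order(n: int = 8):
    """Block indices in zigzag order: walk the anti-diagonals, alternating direction."""
    return sorted(
        ((i, j) for i in range(n) for j in range(n)),
        key=lambda ij: (ij[0] + ij[1], ij[0] if (ij[0] + ij[1]) % 2 else ij[1]),
    )

def toy_rle(q_block):
    """A toy stand-in for JPEG's (zero-run, value) coding: record how many zeros
    precede each nonzero coefficient, then stop after the last nonzero one."""
    flat = [int(q_block[i][j]) for i, j in zigzag_order(len(q_block))]
    pairs, zeros = [], 0
    for v in flat:
        if v == 0:
            zeros += 1
        else:
            pairs.append((zeros, v))
            zeros = 0
    pairs.append("EOB")  # "end of block": everything left is implied to be zero
    return pairs
```

You now know pretty much how JPG works.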
What are we missing here?
I think just from describing how the process works, it's a bit obvious what JPG misses out on.
- JPG doesn't look at the picture. It has no understanding of whether a certain loss of information would introduce an edge or not. It also doesn't care if we're losing information in the background or smack dab in the center of a person's eyes. Can't blame them: this format predates the time when that kind of understanding was technically something we could accomplish.
- JPG uses the same quantization table regardless of the image content. A machine learning model can solve that as well: you can directly optimize for a quantization table that empirically reduces visual distortion.
- JPG doesn't look at the entire block to optimize it further. What if we could lower one coefficient from 1 to 0, in exchange for another coefficient going up from 5 to 6, with the result looking almost exactly the same? We'd gain space, because it's easier to encode.
Deep learning to the rescue!
I might write a follow-up blog post on the details of the model we’ve trained.
But the upshot is that each weakness of the original JPG I've described above is an opportunity to improve results. Modern convolutional neural nets can do a great job at approximating the way we perceive images visually (with some strong disclaimers around adversarial examples, which fool them but not humans).
Once we can determine whether a change to the image content is visually noticeable, you can think of the very broad goal of "improve JPG" as a well-defined optimization process: "tweak every pixel in the image, and check that the result is visually unnoticeable and yields smaller DCT coefficients in a given block".
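That loop might look roughly like the sketch below. This is an illustration of the idea, not the actual algorithm: it reuses the dct2/quantize helpers from above and assumes a perceptual_distance function like the one sketched in the next section.

```python
import numpy as np

def optimize_block(block, perceptual_distance, table=Q50, threshold=0.01):
    """Greedily zero out small quantized coefficients in one 8x8 block, keeping
    each change only if the decoded result still looks the same to the
    perceptual model. (In practice the check would look at a larger patch
    around the block, not just the block itself; the threshold here is a
    made-up value for the sketch.)"""
    coeffs = quantize(dct2(block), table)
    # Try killing the smallest-magnitude nonzero coefficients first.
    candidates = sorted(zip(*np.nonzero(coeffs)), key=lambda ij: abs(coeffs[ij]))
    for i, j in candidates:
        saved = coeffs[i, j]
        coeffs[i, j] = 0
        decoded = idct2(dequantize(coeffs, table))
        if perceptual_distance(block, decoded) > threshold:
            coeffs[i, j] = saved  # too visible, put it back
    return coeffs
```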
Let's focus on how the perceptual part is achievable; the tweaking itself can be thought of as any numeric optimization procedure. What can we do to look at two patches of an image and ask, "do they look the same to the visual system that humans have?"
A taste of the solution: learning to see
So, how do we use deep learning models to tell us how visually similar two image patches are? It turns out that if you take a convolutional neural net that was trained on, say, image classification, and chop it off midway through, you get internal representations of the image that are useful for visual comparison.
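To make that concrete, here's a sketch in PyTorch of the "chop a classifier in half" idea, using an off-the-shelf ImageNet-trained VGG16 (an illustration of the approach, not our exact model):

```python
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

# Keep only the early convolutional stages of an ImageNet-trained VGG16.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()

def patch_features(patch: torch.Tensor) -> torch.Tensor:
    """patch: float tensor of shape (3, 128, 128) with values in [0, 1].
    Returns one feature vector describing how the patch 'looks' to the network."""
    x = TF.normalize(patch, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    with torch.no_grad():
        fmap = vgg(x.unsqueeze(0))            # feature maps from a middle layer
    return fmap.mean(dim=(2, 3)).squeeze(0)   # average-pool spatially -> 256 values
```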
So, think of a small patch of an image, let's say 128x128 pixels. You can tweak its pixel values as you see fit to make the JPG take up less space. Say you have some guess about a change that would make the image take up less space, like zeroing out a bunch of coefficients in a certain block. You end up with:
- The original image patch, which we can run through the model. The output might be 256 values that capture how the entire patch looks to the model.
- Another 256 values that represent how the model understands the modified image.
Now you can take some distance measure, like the Euclidean distance between them, and call that your measure of distortion. That already works pretty well, but you can do even better: have humans annotate "on a scale of 1 to 5, how different are these two patches?", and train a secondary model that learns to predict a single similarity score from these two sets of 256 numbers.
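Glued to the feature extractor above, the simple Euclidean version is just a couple of lines, and the learned alternative is a small head on top (again, a sketch; the real scoring model and its training are out of scope here):

```python
import torch

def perceptual_distance(original: torch.Tensor, modified: torch.Tensor) -> float:
    """Euclidean distance between the feature vectors of two 128x128 patches."""
    return torch.dist(patch_features(original), patch_features(modified)).item()

# The learned alternative: a small model that maps both feature vectors
# (256 + 256 values) to a single "how different does this look" score,
# trained on human 1-to-5 annotations.
score_head = torch.nn.Sequential(
    torch.nn.Linear(512, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)
```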
More to come
This post is getting a bit long, so I'll write more about the details of how we optimize the image in a future post. If you think reducing your image sizes might be useful for your website's page load speed, feel free to try us out live.