// you’re reading...

Immanence

Bit bombs and other dangers

Clever hack: a 42 kb .zip file that decompresses unto your hearts’ content.

Quoting freely from various Wikipedia articles in italic, the specification for ZIP indicates that files can be stored either uncompressed or using a variety of compression algorithms. However, in practice, ZIP is almost always used with Katz’s DEFLATE algorithm, except when files being added are already compressed or are resistant to compression”; where DEFLATE is a combination of the LZ77 algorithm and Huffman coding.

Whille DEFLATE, the LZ family and other compression systems are often referred as algorithms, what they really are is file specification standards. It’s theoretically possible that the standard DEFLATE algorithm isn’t the most efficient in terms of final size/uncompressed size such that the resulting file is a DEFLATE standard stream. Which is why such clever hacks as the 42kb zip bomb that uncompresses into petabytes of nonsense data are possible.

This extends in general to all kinds of compression systems, and even more for lossy ones. It’s actually  widely-known that there exist competing MP3 encoders which will produce a varying range of “qualities” while still generating valid MP3 files. (These qualities might be multi-dimensional — not “better”, but better for rock or classical music, or perhaps a spectrum between “best for a wide amplitude range” or “best for a wide frequency range”).

There’s a Dijkstra-like lesson on the importance of provability for reliability in this 42kb bomb story. As far as a cursory research could go, it was never proven that the MP3 spec does not allow for a byte bundle that will decompress into a very large byte bundle, thus crashing MP3 players and perhaps exploiting buffer overflow vulnerabilities in cell phones. I mean, the DEFLATE spec is very simple, and nevertheless “correctness” in full was never  provennot only of the DEFLATE scheme!, hence launching this clever zip bomb scare. And while actually writing a zip bomb took a long time despite the simplicity of DEFLATE, the combined ingeniousness of such very talented bit-level twiddlers should not be underestimated.

The economics of why such things happen in the computer industry is easy to  understand. Some of it was nailed by Dijsktra in various EWDs and his “Discipline in thought” documentary — the software industry thrives on maintenance and programmers get their thrill from not really knowing what they’re doing, seeing “magic” happen from their hands from deep libraries. The fact that DEFLATE zip-DEFLATE was never proven correct (and if there was a proof, it was invalid, as this very conterexample shows) derives in part from the standard economics of buggy software development, but in part from the fact that the real thrills (or marketable advances) in compression are in finding clever new compression algorithms, not in checking the correctness of every possible standard representation under the standard decoder.

What Djisktra used to call “loose thinking” puts the icing on the cake; we tend to think of MP3, JPEG, zip/DEFLATE as compression formats, when they’re in fact decompression formats, or more strictly protocols on how a stream of bytes should be decoded. The compression market is always a wide open road, which is good, on one side, because it allows for some evolution while keeping compatibility, but on the other side is dangerous. A (de)compression format can only be considered safe if it’s two-way, specifying both compression and decompression algorithms, and proving both of them and the representation correct.

The ironic part is how much attention and controversy was devoted to “security” in zip formats on the subject of cryptography. Dijkstra would chide Phil Katz, Nico Mak and peers for never have gotten the basics right in the first place!

UPDATE: From a rather critical Reddit comment which kind of doesn’t get the message but contributes a new fact that changes the scare part of the text:

It can be trivially shown that every MP3 frame contains only so many samples. All parts of MPEG, including MP3, was designed from the ground up with real-time hardware decoders in mind. That requires bounded memory usage, bounded memory throughput, and bounded processing power. And all of that means bounded bit rates.

That means the MP3 bomb scenario is off the table. I’m sure glad to be wrong on that one!

IMPORTANT CLARIFICATIONS, responding to comments here and on reddit/news.ycombinator:

  1. Reversibility only means losslessness. No one ever said ZIP was a lossy format. FLAC and TIFF are reversible, JPEG and MP3 aren’t.
  2. Original/deflated size per se is unbounded because DEFLATE depends on entropy. DEFLATE decompression per se is an anamorphic operation on a small-entropy encoding. Correctness of the ZIP format does not require boundedness.
  3. ZIP can be said to essentially consist of a (filesystem, encoding) pair. The filesystem has been slowly evolving and is mostly compatible over all decoders. The encoding is usually DEFLATE but could be something else. What is not correct here is that a ZIP file uncompresses into something incompatible with the underlying ZIP filesystem. This wasn’t a multi-petabyte file of zeroes, but a recursive evaluation hack. (see this reddit thread for the full monty.)
  4. If this was a multi-petabyte file of zeroes, compression-enabled filesystems like ZFS (or DoubleSpace, for those who remember the bad old days of tiny hard drives) would eat it up. Some guys running ZFS (again, the aforementioned reddit thread describes it) opened the bomb and saw different.

Breadcrumb trails:

, , , , , , , , , , , , , , , , , , , , , ,

Related posts:

Discussion

15 comments for “Bit bombs and other dangers”

  1. I don’t completely understand. What would a proof of correctness for DEFLATE consist of? A proof that the compressor compresses optimally? How would this fix your hypothetical cell phone MP3 vulnerability?

    Posted by Daniel Ehrenberg | July 28, 2008, 4:10 pm
  2. The proof of correctness would mean the deflate algorithm would be bounded and there would be no ’surprise’ such as this one, the 42kbyte zip causing a mushroom cloud.

    Posted by YU | July 28, 2008, 4:16 pm
  3. What good would proving that for an input of sizex the output would be <= c’+c*sizex ?
    That isn’t even true for simple RLE compression.

    Posted by Bart | July 28, 2008, 4:36 pm
  4. I read about half-way through this, but I had to stop because I couldn’t really handle the prose. Is there some sort of Non-profit organization that helps needy programmers learn how to write? I think we’ve found a client.

    Posted by WeaselAshes | July 28, 2008, 5:04 pm
  5. Wouldn’t such a proof be impossible? It sounds like the classic halting problem to me…, but I’m not a theoretical mathematician either.

    Posted by casey | July 28, 2008, 5:10 pm
  6. I’m not sure I really agree with you.
    Proving deflate correct in my mind is showing that it is a reversible mapping.

    Posted by Cheese Ninjas | July 28, 2008, 5:36 pm
  7. Proving deflate reversible means proving it lossless.

    “Correctness” is in the eyes of the beholder (actually, the theorem-maker), but for one, DEFLATE-zip should only be able to specify a ZIP filesystem-valid file.

    Remember, the whole point is that there’s nothing promised on the /representations/. A correct representation coupled with a standard decoder shouldn’t do something it isn’t intended to.

    The DEFLATE representation may as well be correct on its own if you extend its usefulness wide enough — someone might actually intend to use it to represent anamorphic operations other than file restoring, but the zip bomb shows the ZIP representation is not.

    Posted by Dayvan Cowboy | July 28, 2008, 5:54 pm
  8. As for the halting problem, that’s only a problem if we’re talking about Turing-complete representations/languages. Writing a (La)TeX file that crashes a system is feasible, and online “LaTeX renderers” indeed place restrictions on what TeX one’s allowed to compile. This is informal, but bounds on what “online LaTeX” actually should be do intuitively exist and are applicable.

    Posted by Dayvan Cowboy | July 28, 2008, 6:03 pm
  9. You misspelled “Dijkstra”.

    Corrected!

    Posted by Anonymous | July 28, 2008, 6:15 pm
  10. Just because it is possible to cram a stupidly large amount of data into 42Kb doesn’t mean the algorithm has a flaw which requires a proof. Like Cheese Ninjas said, as long as its a reversible mapping then the algorithm is perfectly correct.

    Posted by Malcolm | July 28, 2008, 6:25 pm
  11. Malcolm: what about an arbitrarily large amount of data?

    Posted by Cheese | July 28, 2008, 6:44 pm
  12. Cheese: You can store as much data as you want. Here’s a very simple algoritym: output 10 quadrillion zeros. That’s 28 characters that uncompresses to 10 quadrillion.

    The writer of the post has no idea what’s he’s talking about, proving deflate correct has zero to do with the size of a zipfile.

    Posted by Ariel | July 28, 2008, 8:27 pm
  13. Regarding Ariel’s comment, look at this reddit thread.

    Posted by Dayvan Cowboy | July 28, 2008, 8:31 pm
  14. Ok, look it’s simple. If the petabyte garbage file can be compressed back to 42kb, it is correct and there isn’t anything you can do about it.

    On the other hand if no zip compatible compressor can produce the 42kb, it is still correct, it just means you haven’t looked hard enough. Perhaps that particular file has only one specialized compressor algorithm which causes it to become 42kb.

    The only way it is not correct is the compression produces a 42kb file which is in now way similar to the original.

    Posted by milligence | July 28, 2008, 11:07 pm
  15. Again, this wasn’t a multi-petabyte redundant bundle compressed through ordinary means, but a recursive evaluation hack.

    And once again, reversibility only means losslessness. FLAC is reversible, MP3 isn’t.

    Posted by Dayvan Cowboy | July 29, 2008, 6:11 pm

Post a comment

You must be logged in to post a comment.