Zip - How not to design a file format.

2021-01-16

The Zip file format is now 32 years old. You'd think being 32 years old the format would be well documented. Unfortunately it's not.

I have a feeling this is like many file formats. They aren't designed, rather the developer just makes it up as they go. If the format gets popular, other people want to read and/or write it. They either try to reverse engineer the format OR they ask for specs. Even if the developer writes specs they often forget all the assumptions their original program makes. Those are not written down and hence the spec is incomplete. Zip is such a format.

Zip claims its format is documented in a file called APPNOTE.TXT which can be found here.

The short version is, a zip file consists of records, each record starts with some 4 byte marker that generally takes the form

0x50, 0x4B, ??, ??

Where the 0x50, 0x4B are the letters PK standing for "Phil Katz", the person who made the zip format. The two ?? are bytes that identify the type of the record. Examples

0x50 0x4b 0x03 0x04   // a local file record
0x50 0x4b 0x01 0x02   // a central directory file record
0x50 0x4b 0x05 0x06   // an end of central directory record

Records do NOT follow any standard pattern. To read or even skip a record you must know its format. What I mean is there are several other formats that follow some convention like each record id is followed by the length of the record. So, if you see an id, and you don't understand it, you just read the length, skip that many bytes (*), and you'll be at the next id. Examples of this type include most video container formats, jpgs, tiff, photoshop files, wav files, and many others.
(*) some formats require rounding the length up to the nearest multiple of 4 or 16.

Zip does NOT do this. If you see an id and you don't know how that type of record's content is structured there is no way to know how many bytes to skip.
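
To make the contrast concrete, here is a minimal sketch of the id-then-length convention in Python. The record layout (4 byte id, 4 byte little-endian length, payload) and the names make_chunk and walk_chunks are made up for illustration; they are not any particular real format.

```python
import struct

def make_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    """Build one record: 4-byte id, 4-byte little-endian length, payload."""
    return chunk_id + struct.pack("<I", len(payload)) + payload

def walk_chunks(data: bytes):
    """Yield (id, payload) for every record, whether we understand it or not."""
    pos = 0
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (length,) = struct.unpack_from("<I", data, pos + 4)
        yield chunk_id, data[pos + 8:pos + 8 + length]
        pos += 8 + length  # the length alone tells us where the next id is

stream = (make_chunk(b"HEAD", b"\x01\x02")
          + make_chunk(b"WHAT", b"not understood")
          + make_chunk(b"DATA", b"hello"))

KNOWN = {b"HEAD", b"DATA"}
understood = {cid: payload for cid, payload in walk_chunks(stream) if cid in KNOWN}
print(understood)
```

The reader never needs to know what a WHAT record means in order to get past it. With zip records there is no equivalent of the `pos += 8 + length` line.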

APPNOTE.TXT says the following things

4.1.9 ZIP files MAY be streamed, split into segments (on fixed or on removable media) or "self-extracting". Self-extracting ZIP files MUST include extraction code for a target platform within the ZIP file.

4.3.1 A ZIP file MUST contain an "end of central directory record". A ZIP file containing only an "end of central directory record" is considered an empty ZIP file. Files MAY be added or replaced within a ZIP file, or deleted. A ZIP file MUST have only one "end of central directory record". Other records defined in this specification MAY be used as needed to support storage requirements for individual ZIP files.

4.3.2 Each file placed into a ZIP file MUST be preceded by a "local file header" record for that file. Each "local file header" MUST be accompanied by a corresponding "central directory header" record within the central directory section of the ZIP file.

4.3.3 Files MAY be stored in arbitrary order within a ZIP file. A ZIP file MAY span multiple volumes or it MAY be split into user-defined segment sizes. All values MUST be stored in little-endian byte order unless otherwise specified in this document for a specific data element.

4.3.6 Overall .ZIP file format:

      [local file header 1]
      [encryption header 1]
      [file data 1]
      [data descriptor 1]
      . 
      .
      .
      [local file header n]
      [encryption header n]
      [file data n]
      [data descriptor n]
      [archive decryption header] 
      [archive extra data record] 
      [central directory header 1]
      .
      .
      .
      [central directory header n]
      [zip64 end of central directory record]
      [zip64 end of central directory locator] 
      [end of central directory record]
   

4.3.7 Local file header:

      local file header signature     4 bytes  (0x04034b50)
      version needed to extract       2 bytes
      general purpose bit flag        2 bytes
      compression method              2 bytes
      last mod file time              2 bytes
      last mod file date              2 bytes
      crc-32                          4 bytes
      compressed size                 4 bytes
      uncompressed size               4 bytes
      file name length                2 bytes
      extra field length              2 bytes

      file name (variable size)
      extra field (variable size)
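
As a rough illustration of the layout above (not part of APPNOTE.TXT), the 30 fixed bytes can be unpacked with Python's struct module. The function name parse_local_header and the hand-built sample record are assumptions for this sketch; it ignores encryption, data descriptors and zip64.

```python
import struct
import zlib

LOCAL_HEADER = struct.Struct("<4s5H3I2H")  # the 30 fixed bytes listed in 4.3.7

def parse_local_header(buf: bytes, pos: int = 0) -> dict:
    """Unpack one local file header and work out where its file data starts."""
    (sig, version, flags, method, mtime, mdate,
     crc, csize, usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(buf, pos)
    if sig != b"PK\x03\x04":
        raise ValueError("not a local file header")
    name_start = pos + LOCAL_HEADER.size
    return {
        "name": buf[name_start:name_start + name_len],
        "method": method,
        "compressed_size": csize,
        "uncompressed_size": usize,
        "crc32": crc,
        # file data follows the variable-size name and extra field
        "data_start": name_start + name_len + extra_len,
    }

# Hand-built sample: a stored (method 0) file "a.txt" containing b"hi".
name, data = b"a.txt", b"hi"
record = LOCAL_HEADER.pack(b"PK\x03\x04", 20, 0, 0, 0, 0x21,
                           zlib.crc32(data), len(data), len(data),
                           len(name), 0) + name + data
header = parse_local_header(record)
print(header["name"], record[header["data_start"]:])
```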
   

4.3.8 File data

Immediately following the local header for a file SHOULD be placed the compressed or stored data for the file. If the file is encrypted, the encryption header for the file SHOULD be placed after the local header and before the file data. The series of [local file header][encryption header] [file data][data descriptor] repeats for each file in the .ZIP archive.

Zero-byte files, directories, and other file types that contain no content MUST NOT include file data.

4.3.12 Central directory structure:

      [central directory header 1]
      .
      .
      . 
      [central directory header n]
      [digital signature] 
   

File header:

        central file header signature   4 bytes  (0x02014b50)
        version made by                 2 bytes
        version needed to extract       2 bytes
        general purpose bit flag        2 bytes
        compression method              2 bytes
        last mod file time              2 bytes
        last mod file date              2 bytes
        crc-32                          4 bytes
        compressed size                 4 bytes
        uncompressed size               4 bytes
        file name length                2 bytes
        extra field length              2 bytes
        file comment length             2 bytes
        disk number start               2 bytes
        internal file attributes        2 bytes
        external file attributes        4 bytes
        relative offset of local header 4 bytes

        file name (variable size)
        extra field (variable size)
        file comment (variable size)
   

4.3.16 End of central directory record:

      end of central dir signature    4 bytes  (0x06054b50)
      number of this disk             2 bytes
      number of the disk with the
      start of the central directory  2 bytes
      total number of entries in the
      central directory on this disk  2 bytes
      total number of entries in
      the central directory           2 bytes
      size of the central directory   4 bytes
      offset of start of central
      directory with respect to
      the starting disk number        4 bytes
      .ZIP file comment length        2 bytes
      .ZIP file comment       (variable size)
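
A small sketch of the record above, again using Python's struct module. Per 4.3.1 a file holding only this record is an empty zip, so the sample here is just 22 bytes. The name parse_eocd is made up for illustration, and the sketch ignores zip64 and multi-disk archives.

```python
import io
import struct
import zipfile

EOCD = struct.Struct("<4s4H2IH")  # the 22 fixed bytes listed in 4.3.16

def parse_eocd(buf: bytes, pos: int) -> dict:
    """Unpack an end of central directory record located at pos."""
    (sig, disk, cd_disk, entries_here, entries_total,
     cd_size, cd_offset, comment_len) = EOCD.unpack_from(buf, pos)
    if sig != b"PK\x05\x06":
        raise ValueError("not an end of central directory record")
    comment_start = pos + EOCD.size
    return {
        "entries_total": entries_total,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
        "comment": buf[comment_start:comment_start + comment_len],
    }

# An empty zip: nothing but the end of central directory record.
empty_zip = EOCD.pack(b"PK\x05\x06", 0, 0, 0, 0, 0, 0, 0)
info = parse_eocd(empty_zip, 0)

# Python's own zipfile module accepts the same 22 bytes as an empty archive.
with zipfile.ZipFile(io.BytesIO(empty_zip)) as zf:
    names = zf.namelist()
print(info, names)
```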
   

There are other details involving encryption, larger files, optional data, but for the purposes of this post this is all we need. We need one more piece of info, how to make a self extracting archive.

To do so we could look back to ZIP2EXE.exe which shipped with pkzip in 1989 and see what it does, but it's easier to look at Info-Zip to see what happens.

How do I make a DOS (or other non-native) self-extracting archive under Unix?

The procedure is basically described in the UnZipSFX man page. First grab the appropriate UnZip binary distribution for your target platform (DOS, Windows, OS/2, etc.), as described above; we'll assume DOS in the following example. Then extract the UnZipSFX stub from the distribution and prepend as if it were a native Unix stub:

> unzip unz552x3.exe unzipsfx.exe                // extract the DOS SFX stub
> cat unzipsfx.exe yourzip.zip > yourDOSzip.exe  // create the SFX archive
> zip -A yourDOSzip.exe                          // fix up internal offsets
> 

That's it. You can still test, update and delete entries from the archive; it's a fully functional zipfile.

So given all of that let's go over some problems.

How do you read a zip file?

This is undefined by the spec.

There are 2 obvious ways.

  1. Scan from the front, when you see an id for a record do the appropriate thing.

  2. Scan from the back, find the end-of-central-directory-record and then use it to read through the central directory, only looking at things the central directory references.

Scanning from the back is how the original pkunzip works. For one it means if you ask for some subset of files it can jump directly to the data you need instead of having to scan the entire zip file. This was especially important if the zip file spanned multiple floppy disks.
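
Here is a minimal sketch of option 2 in Python. It is not how pkunzip is implemented; it only shows the idea. The name list_from_back is made up, the archive is generated with Python's zipfile module just to have test data, and the sketch assumes a single-disk, non-zip64 archive with no prepended data.

```python
import io
import struct
import zipfile

EOCD = struct.Struct("<4s4H2IH")      # end of central directory record, 22 bytes
CDFH = struct.Struct("<4s6H3I5H2I")   # central directory file header, 46 bytes

def list_from_back(buf: bytes):
    """Return [(name, local_header_offset), ...] using only the central directory."""
    # Naive search: assumes these bytes appear nowhere after the real record.
    eocd_pos = buf.rfind(b"PK\x05\x06")
    if eocd_pos < 0:
        raise ValueError("no end of central directory record")
    fields = EOCD.unpack_from(buf, eocd_pos)
    total_entries, cd_offset = fields[4], fields[6]

    entries, pos = [], cd_offset
    for _ in range(total_entries):
        header = CDFH.unpack_from(buf, pos)
        if header[0] != b"PK\x01\x02":
            raise ValueError("bad central directory header")
        name_len, extra_len, comment_len = header[10:13]
        local_offset = header[16]
        name = buf[pos + CDFH.size:pos + CDFH.size + name_len]
        entries.append((name, local_offset))
        pos += CDFH.size + name_len + extra_len + comment_len
    return entries

source = io.BytesIO()
with zipfile.ZipFile(source, "w") as zf:
    zf.writestr("a.txt", "alpha")
    zf.writestr("b.txt", "beta")
buf = source.getvalue()

entries = list_from_back(buf)
print(entries)
```

Each offset points straight at a local file record, so a reader that wants only b.txt never has to touch a.txt.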

But, 4.1.9 says you can stream zip files. How is that possible? What if there is some local file record that is not referenced by the central directory? Is that valid? This is undefined.

4.3.1 states

Files MAY be added or replaced within a ZIP file, or deleted.

Okay? That suggests the central directory might not reference all the files in the zip file, because otherwise this statement about files being added, replaced, or deleted has no reason to be in the spec.

If I have file1.zip that contains files A, B, C and I generate file2.zip that only contains files A, B, those are just 2 independent zip files. It makes zero sense to put in the spec that you can add, replace, and delete files unless that knowledge somehow affects the format of a zip file.

In other words, if you have

  [local file A]
  [local file B]
  [local file C]
  [central directory file A]
  [central directory file C]
  [end of central directory]

Then clearly B is deleted as the central directory doesn't reference it. On the other hand, if there's no [local file B] then you just have an independent zip file, independent of some other zip file that has B in it. No need for the spec to even mention that situation.

Similarly if you had

  [local file A (old)]
  [local file B]
  [local file C]
  [local file A (new)]
  [central directory file A(new)]
  [central directory file B]
  [central directory file C]
  [end of central directory]

Then A (old) has been replaced by A (new) according to the central directory. If on the other hand there is no [local file A (old)] you just have an independent zip file.
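
To see the two readings disagree, here is a sketch that builds the "replaced" layout above by hand and reads it both ways. The helper names (local_record, central_record, end_record, forward_scan) and the file contents are invented for this example; the backward read uses Python's zipfile module, which works from the central directory.

```python
import io
import struct
import zipfile
import zlib

def local_record(name, data):
    return struct.pack("<4s5H3I2H", b"PK\x03\x04", 20, 0, 0, 0, 0x21,
                       zlib.crc32(data), len(data), len(data), len(name), 0) + name + data

def central_record(name, data, offset):
    return struct.pack("<4s6H3I5H2I", b"PK\x01\x02", 20, 20, 0, 0, 0, 0x21,
                       zlib.crc32(data), len(data), len(data), len(name),
                       0, 0, 0, 0, 0, offset) + name

def end_record(count, cd_size, cd_offset):
    return struct.pack("<4s4H2IH", b"PK\x05\x06", 0, 0, count, count, cd_size, cd_offset, 0)

files = [(b"A.txt", b"old"), (b"B.txt", b"bee"), (b"C.txt", b"sea"), (b"A.txt", b"new")]
body, offsets = b"", []
for name, data in files:
    offsets.append(len(body))
    body += local_record(name, data)

listed = [3, 1, 2]  # central directory: A (new), B, C. A (old) is not referenced.
cd = b"".join(central_record(files[i][0], files[i][1], offsets[i]) for i in listed)
archive = body + cd + end_record(len(listed), len(cd), len(body))

def forward_scan(buf):
    """Read local records from the front until something else shows up."""
    pos, found = 0, []
    while buf[pos:pos + 4] == b"PK\x03\x04":
        fields = struct.unpack_from("<4s5H3I2H", buf, pos)
        csize, name_len, extra_len = fields[7], fields[9], fields[10]
        start = pos + 30 + name_len + extra_len
        found.append((buf[pos + 30:pos + 30 + name_len], buf[start:start + csize]))
        pos = start + csize
    return found

from_front = forward_scan(archive)
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    from_back = {name: zf.read(name) for name in zf.namelist()}
print(from_front)
print(from_back)
```

The forward scan reports A.txt twice, first with the old contents. The central directory reader only ever sees the new one. The "deleted" layout behaves the same way if you drop an index from `listed`.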

You might think this is nonsense but you have to remember, pkzip comes from the era of floppy disks. Reading an entire zip file's contents and writing out a brand new zip file could be an extremely slow process. In both cases, the ability to delete a file just by updating the central directory, or to add a file by reading the existing central directory, appending the new data, then writing a new central directory, is a desirable feature. This would be especially true if you had a zip file that spanned multiple floppy disks; something that was common in 1989. You'd like to be able to update a README.TXT in your zip file without having to re-write multiple floppies.

In discussion with PKWare, they state the following

The format was originally intended to be written front-to-back so the central directory and end of central directory record could be written out last after all files included in the ZIP are known and written. If adding files, changes can applied without rewriting the entire file. This was how the original PKZIP application was designed to write .ZIP files. When reading, it will read the ZIP file end of central directory first to locate the central directory and then seek to any files it needs to access

Of course "add" is different than "delete" and "replace".

Whether or not it is valid to have local files that are not referenced by the central directory is undefined by the spec. It is only implied by the mention of:

Files MAY be added or replaced within a ZIP file, or deleted.

If it is valid for the central directory to not reference all the local files then reading a zip file by scanning from the front may fail. Without special care you'd get files that aren't supposed to exist or errors from trying to overwrite existing files.

But, that contradicts 4.1.9 which says zip files may be streamed. If zip files can be streamed then both of the examples above would fail, because in the first case we'd see file B and in the second we'd see file A (old) before we saw that the central directory doesn't reference them. If you have to wait for the central directory before you can correctly use any of the entries then functionally you cannot stream zip files.

Can the self extracting portion have any zip IDs in it?

Seeing the instructions for how to create a self extracting zip file above, we just prepend some executable code to the front of the file and then fix the offsets in the central directory.

So let's say your self extractor has code like this

switch (id) {
  case 0x06054b50:  // end of central directory record
    read_end_of_central_directory();
    break;
  case 0x04034b50:  // local file record
    read_local_file_record();
    break;
  case 0x02014b50:  // central directory file record
    read_central_file_record();
    break;
  ...
}

Given the code above, it's likely those values 0x06054b50, 0x04034b50, 0x02014b50 will appear in binary in the self extracting portion of the zip file at the front of the file. If you read a zip file by scanning from the front your scanner may see those ids and misinterpret them as zip records.

In fact you can imagine a self extractor with a zip file in it like this

// data for a zip file that contains
//   LICENSE.txt
//   README.txt
//   player.exe
const unsigned char runtimeAndLicenseData[] = {
  0x50, 0x4b, 0x03, 0x04, ??, ??, ...
};

int main() {
   extractZipFromFile(getPathToSelf());
   extractZipFromMemory(runtimeAndLicenseData, sizeof(runtimeAndLicenseData));
}

Now there's a zip file in the self extractor. Any reader that reads from the front would see this inner zip file and fail. Is that a valid zip file? This is undefined by the spec.
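
A small sketch of the simpler case, where the stub merely contains a record id as a constant. The stub bytes here are invented filler, not a real extractor. The archive is built with Python's zipfile module and simply concatenated after the stub, the same way `cat` would do it.

```python
import io
import struct
import zipfile

payload = io.BytesIO()
with zipfile.ZipFile(payload, "w") as zf:
    zf.writestr("real.txt", "the real contents")

# Hypothetical stub: filler bytes around the local file id, stored the way a
# 32-bit constant would sit in little-endian machine code.
stub = b"\x90" * 16 + struct.pack("<I", 0x04034b50) + b"\x90" * 16
sfx = stub + payload.getvalue()

# A scanner that searches for ids from the front lands inside the stub.
first_hit = sfx.find(b"PK\x03\x04")

# Python's zipfile locates the central directory from the end of the file
# and tolerates the prepended data, so it never looks at the stub.
with zipfile.ZipFile(io.BytesIO(sfx)) as zf:
    names = zf.namelist()
    contents = zf.read("real.txt")

print(first_hit, len(stub), names)
```

The first "record" a forward scanner finds starts 16 bytes into the stub, well before the real archive begins.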

I tested this. The original PKUnzip.exe in DOS, Windows Explorer, MacOS Finder, and Info-Zip (the unzip included in MacOS and Linux) all clearly read from the back and see the files after the self extractor. 7z and Keka see the embedded zip inside the self extractor.

Is that a failure of the reader or is that a bad zip file? The APPNOTE.TXT does not say. I think it should be explicit here and I think it's one of those unstated assumptions. PKunzip scans from the back and so this just happens to work, but the fact of how it happens to work is never documented. The issue that the data in the self-extractor might happen to resemble a zip file is just glossed over. Similarly streaming will likely fail if it hasn't already from the previous issues.

You might think this is a non issue but there are 100s of thousands of self extracting zip files out there from the 1990s in the archives. A forward scanner might fail to read these.

Can the zip comment contain zip IDs in it?

If you go look at 4.3.16 above you'll see the end of a zip file is a variable length comment. So, if you're doing backward scanning you basically read from the back of the file looking for 0x50 0x4B 0x05 0x06 but what if that sequence of bytes is in the comment?

I'm sure Phil Katz never gave it a second thought. He just assumed people would put the equivalent of a README.txt in there. As such it would only have values from 0x20 to 0x7F with maybe a 0x0D (carriage return), 0x0A (linefeed), 0x09 (tab) and maybe 0x07 (bell).

Unfortunately all of the values in the ids are valid ASCII, and therefore valid utf-8. We already went over 0x50 = P and 0x4B = K. 0x05 is "Enquiry" and 0x06 is "Acknowledge", both control characters in ASCII.

The APPNOTE.TXT should arguably explicitly specify if this is invalid. Indirectly 4.3.1 says

A ZIP file MUST have only one "end of central directory record"

But what does that mean? Does that mean the bytes 0x50 0x4B 0x05 0x06 can't appear in the comment or in the self extracting code? Does it mean the first time you see that scanning from the back you don't try to find a second match?

If you scan from the front and run into none of the issues mentioned before, then a forward scanner would successfully read this. On the other hand, pkunzip itself would fail.
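
Here is a sketch of the problem in Python, using an empty zip whose comment happens to contain the id. The two function names are made up. The "checked" version adds one sanity test that the spec does not require: the comment length stored in the record must run exactly to the end of the file. A deliberately crafted comment could still defeat it.

```python
import struct

EOCD = struct.Struct("<4s4H2IH")  # end of central directory record, 22 bytes
SIG = b"PK\x05\x06"

def find_eocd_naive(buf: bytes) -> int:
    """Take the last occurrence of the id, whatever it belongs to."""
    return buf.rfind(SIG)

def find_eocd_checked(buf: bytes) -> int:
    """Walk backwards over candidates until the comment length fits the file."""
    pos = buf.rfind(SIG)
    while pos >= 0:
        if pos + EOCD.size <= len(buf):
            comment_len = EOCD.unpack_from(buf, pos)[7]
            if pos + EOCD.size + comment_len == len(buf):
                return pos
        pos = buf.rfind(SIG, 0, pos + 3)  # next candidate strictly before pos
    return -1

comment = SIG + b" is just text inside the comment"
archive = EOCD.pack(SIG, 0, 0, 0, 0, 0, 0, len(comment)) + comment

print(find_eocd_naive(archive))    # 22, the start of the comment
print(find_eocd_checked(archive))  # 0, the real record
```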

What if the offset to the central directory is 101,010,256?

That offset is 0x06054b50. Values are stored little-endian, so on disk it is the bytes 0x50 0x4B 0x05 0x06 and it will appear to be an end of central directory header. I think a 100 meg zip file wasn't even on the radar when zip was created, and in fact extensions were required to handle files larger than 4 gig. But, it does show one more way the format is poorly designed.
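
Which number collides with the id depends on byte order. Since 4.3.3 says values are stored little-endian, a quick check in Python shows what a 4 byte offset field would have to hold for the bytes 0x50 0x4B 0x05 0x06 to end up on disk:

```python
signature = bytes([0x50, 0x4B, 0x05, 0x06])

as_little_endian = int.from_bytes(signature, "little")
as_big_endian = int.from_bytes(signature, "big")

print(f"{as_little_endian:,} = {as_little_endian:#010x}")  # 101,010,256 = 0x06054b50
print(f"{as_big_endian:,} = {as_big_endian:#010x}")        # 1,347,093,766 = 0x504b0506

# Writing the little-endian value back out reproduces the id bytes.
stored = as_little_endian.to_bytes(4, "little")
print(stored == signature)  # True
```

So a central directory starting about 101 megabytes into the file is enough to plant a fake id inside the end of central directory record itself.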

What's a good design?

There's certainly debate to be had about what a good design would be but some things are arguably easy to decide if we could start over.

  1. It would have been better if records had a fixed format like id followed by size so that you can skip a record you don't understand.

  2. It would have been better if the last record at the end of the file was just an offset-to-end-of-central-directory record as in

       0x504b0609 (id: some id that is not in use)
       0x04000000 (size of data of record)
       0x???????? (relative offset to end-of-central-directory)
       

    Then there would be no ambiguity for reading from the back.

    1. Read the last 12 bytes
    2. Check the first 8 are 0x50 0x4b 0x06 0x09 0x04 0x00 0x00 0x00. If not, fail.
    3. Read the offset and go to the end-of-central-directory

    Or, conversely, put the comment in its own record and write it before the central directory and put an offset to it in the end-of-central-directory-record. Then at least this issue of scanning over the comment would disappear.

  3. Be clear about what data can appear in a self extracting stub.

    If you want to support reading from the front it seems required to state that the self extracting portion can't appear to have any records.

    This is hard to enforce unless you specifically write a validator. If you just check based on whether your own app can read the zip file then, as it stands now, pkzip, pkunzip, Info-Zip (the zip in MacOS and Linux), Windows Explorer, and MacOS Finder all don't care what's in the self extracting portion, so they aren't useful for validation. You must either explicitly state in the spec that readers must scan from the back, or write a validator that rejects zips that are not forward scannable and state in the spec why.

  4. Be clear if the central directory can disagree with local file records

  5. Be clear if random data can appear between records

    A backward scanner does not care what's between records. It only cares that it can find the central directory and it only reads what that central directory points to. That means there can be any random data between records (or at least some records).

    Be explicit if this is okay or not okay. Don't rely on implicit diagrams.
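
The trailer record proposed in item 2 can be sketched in a few lines of Python. This is a hypothetical format, not anything in APPNOTE.TXT. The sketch assumes "relative offset" means the distance from the start of the trailer back to the start of the end-of-central-directory record, and the names make_trailer and find_eocd_via_trailer are made up.

```python
import struct

TRAILER_ID = bytes([0x50, 0x4B, 0x06, 0x09])        # the unused id proposed above
TRAILER_HEAD = TRAILER_ID + struct.pack("<I", 4)    # id + size of data (4 bytes)

def make_trailer(eocd_pos: int, trailer_pos: int) -> bytes:
    """Build the 12-byte trailer; the offset is measured back from the trailer."""
    return TRAILER_HEAD + struct.pack("<I", trailer_pos - eocd_pos)

def find_eocd_via_trailer(buf: bytes) -> int:
    # 1. Read the last 12 bytes. 2. Check the first 8. 3. Follow the offset.
    if len(buf) < 12 or buf[-12:-4] != TRAILER_HEAD:
        raise ValueError("missing trailer record")
    (distance,) = struct.unpack_from("<I", buf, len(buf) - 4)
    eocd_pos = len(buf) - 12 - distance
    if eocd_pos < 0 or buf[eocd_pos:eocd_pos + 4] != b"PK\x05\x06":
        raise ValueError("trailer does not point at an end-of-central-directory record")
    return eocd_pos

# An empty archive whose comment contains the end-of-central-directory id.
comment = b"PK\x05\x06 inside the comment is now harmless"
eocd = struct.pack("<4s4H2IH", b"PK\x05\x06", 0, 0, 0, 0, 0, 0, len(comment)) + comment
archive = eocd + make_trailer(0, len(eocd))

print(find_eocd_via_trailer(archive))  # 0
```

No searching is involved, so nothing in the comment or anywhere else in the file can be mistaken for the record.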

What to do, how to fix?

If I had to guess, all of these issues are implementation details that didn't make it into the APPNOTE.TXT. What I believe the APPNOTE.TXT really wants to say is "a valid zip file is one that pkzip can manipulate and pkunzip can correctly unzip". Instead it defines things in such a way that various implementations can make files that other implementations can't read.

Of course with 32 years of zip files out there we can't fix the format. What PKWare could do is get specific about these edge cases. If it was me I'd add these sections to the APPNOTE.TXT

4.3.1 A ZIP file MUST contain an "end of central directory record". A ZIP file containing only an "end of central directory record" is considered an empty ZIP file. Files MAY be added or replaced within a ZIP file, or deleted. A ZIP file MUST have only one "end of central directory record". Other records defined in this specification MAY be used as needed to support storage requirements for individual ZIP files.

The "end of central directory record" must be at the end of the file and the sequence of bytes, 0x50 0x4B 0x05 0x06, must not appear in the comment.

The "central directory" is the authority on the contents of the zip file. Only the data it references are valid to read from the file. This is because (1) the contents of the self extracting portion of the file are undefined and might appear to contain zip records when in fact they are not related to the zip file and (2) the ability to add, update, and delete files in a zip file stems from the fact that it is only the central directory that knows which local files are valid.

That would be one way. I believe this will read the 100s of millions of existing zip files out there.

On the other hand, if PKWare claims such files that have these issues don't exist then this would work as well

4.3.1 A ZIP file MUST contain an "end of central directory record". A ZIP file containing only an "end of central directory record" is considered an empty ZIP file. Files MAY be added or replaced within a ZIP file, or deleted. A ZIP file MUST have only one "end of central directory record". Other records defined in this specification MAY be used as needed to support storage requirements for individual ZIP files.

The "end of central directory record" must be at the end of the file and the sequence of bytes, 0x50 0x4B 0x05 0x06, must not appear in the comment.

There can be no [local file records] that do not appear in the central directory. This guarantee is required so reading a file front to back provides the same results as reading it back to front. Any file that does not follow this rule is an invalid zip file.

A self extracting zip file must not contain any of the sequences of record ids listed in this document as they may be misinterpreted by forward scanning zip readers. Any file that does not follow this rule is an invalid zip file.

I hope they will update the APPNOTE.TXT so that the various zip readers and zip creators can agree on what's valid.

Unfortunately I feel like pkware doesn't want to be clear here. Their POV seems to be that zip is an ambiguous format. If you want to read by scanning from the front then just don't try to read files you can't read that way. They're still valid zip files and the fact that you can't read them is irrelevant. It's just your choice to fail to support those.

I suppose that's a valid POV. Few if any zip libraries handle every feature of zip. Still, it would be nice to know if you're intentionally not handling something or if you're just reading the file wrong and getting lucky that it works sometimes.

The reason all this came up is I wrote a javascript unzip library. There are tons out there but I had special needs the other libraries I found didn't handle. In particular I needed a library that let me read a single file from a large zip as fast as possible. That means backward scanning, finding the offset to the desired file, and just decompressing that single file. Hopefully others find it useful.
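
For the curious, here is the single-file idea sketched in Python. This is not the library's code. It leans on Python's zipfile module to find the central directory entry, then does the last two steps by hand: jump to the local header and inflate just that one file. The name read_one and the sample file names are made up, and only stored and deflated entries are handled.

```python
import io
import struct
import zipfile
import zlib

def read_one(buf: bytes, name: str) -> bytes:
    """Return the contents of a single entry without touching the others."""
    with zipfile.ZipFile(io.BytesIO(buf)) as zf:   # reads the central directory from the back
        info = zf.getinfo(name)
    # The local header repeats the name and has its own extra field length.
    name_len, extra_len = struct.unpack_from("<2H", buf, info.header_offset + 26)
    start = info.header_offset + 30 + name_len + extra_len
    raw = buf[start:start + info.compress_size]
    if info.compress_type == zipfile.ZIP_STORED:
        return raw
    if info.compress_type == zipfile.ZIP_DEFLATED:
        return zlib.decompress(raw, -15)           # raw deflate stream, no zlib header
    raise NotImplementedError("compression method not handled in this sketch")

source = io.BytesIO()
with zipfile.ZipFile(source, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("big.bin", b"x" * 10000)
    zf.writestr("wanted.txt", b"just this one " * 50)
    zf.writestr("stored.txt", b"raw", compress_type=zipfile.ZIP_STORED)
archive = source.getvalue()

wanted = read_one(archive, "wanted.txt")
stored = read_one(archive, "stored.txt")
print(len(wanted), stored)
```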


You might find this history of Zip fascinating
