Reverse Engineering File Formats
Content Leader: Jess Garcia - Last Updated: December 24, 2006
Archives
Typical Structure
-
Header
-
Signature (archive format and version).
-
Index position: the index usually follows the header directly, but some formats store it at the end of the file, because the archive packer doesn’t know how big the index will be until all the files have been written (if the index is itself compressed, for instance). In that case, the header holds a pointer to where the index starts.
-
Index
-
Index size (or number of files in archive)
-
List of file entries: constant- or variable-length data structure depending on how the filenames are handled. Sometimes there will be a hierarchical structure of filepaths to represent a whole directory tree inside, too.
-
Filename/filepath: either a zero-terminated string or a string with an explicitly stored length. Optional: some formats store a hash of the filename instead.
-
Position: offset to the start of the file in the archive (from the start of the archive, the start of the index, or the start of the “file area”)
-
Size (can often be inferred from the offset to the next file). Sometimes there are two different sizes: an original size for the file, and the compressed size as it is stored in the archive.
-
Flags: compression, compression algorithm, encryption, encryption key, etc.
-
Checksum
-
Files (possibly compressed & encrypted).
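The structure above can be sketched in code. The layout below is entirely hypothetical (the "ARC1" signature, the field widths, the little-endian byte order and the single zlib flag bit are all assumptions made for illustration, not any real format). It places the index at the end of the file and points to it from the header, as described under "Index position".

```python
import struct
import zlib

# Hypothetical layout, invented for this sketch (all integers little-endian):
#   header: 4-byte signature, uint32 index offset (from start of archive)
#   files:  raw or zlib-compressed file data, back to back
#   index:  uint32 entry count, then per entry:
#           uint16 name length, name bytes (UTF-8),
#           uint32 data offset (from start of archive),
#           uint32 stored size, uint32 original size, uint8 flags
SIGNATURE = b"ARC1"
HEADER_SIZE = 8
FLAG_ZLIB = 0x01  # bit 0 set: data is zlib-compressed


def build_archive(files):
    """Pack a {name: bytes} mapping; the index is written after the file data."""
    body = bytearray()
    entries = []
    for name, data in files.items():
        stored, flags = zlib.compress(data), FLAG_ZLIB
        if len(stored) >= len(data):  # compression did not help: store raw
            stored, flags = data, 0
        entries.append((name.encode("utf-8"), HEADER_SIZE + len(body),
                        len(stored), len(data), flags))
        body += stored
    index_offset = HEADER_SIZE + len(body)
    index = bytearray(struct.pack("<I", len(entries)))
    for name, offset, stored_size, orig_size, flags in entries:
        index += struct.pack("<H", len(name)) + name
        index += struct.pack("<IIIB", offset, stored_size, orig_size, flags)
    return SIGNATURE + struct.pack("<I", index_offset) + bytes(body) + bytes(index)


def read_archive(blob):
    """Parse the hypothetical format and return a {name: bytes} mapping."""
    if blob[:4] != SIGNATURE:
        raise ValueError("unknown signature")
    (index_offset,) = struct.unpack_from("<I", blob, 4)
    (count,) = struct.unpack_from("<I", blob, index_offset)
    pos = index_offset + 4
    files = {}
    for _ in range(count):
        (name_len,) = struct.unpack_from("<H", blob, pos)
        pos += 2
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        offset, stored_size, orig_size, flags = struct.unpack_from("<IIIB", blob, pos)
        pos += 13  # 4 + 4 + 4 + 1 bytes
        data = blob[offset:offset + stored_size]
        if flags & FLAG_ZLIB:
            data = zlib.decompress(data)
        if len(data) != orig_size:
            raise ValueError("size mismatch for " + name)
        files[name] = data
    return files
```

When reversing an unknown format, a writer like build_archive is useful as a check: if re-packing the extracted files reproduces the original archive, the guessed layout is probably right.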
References
Compression
Compression Algorithms
-
zlib
-
Run-length compression
-
LZSS
-
Huffman encoding
-
Other
-
Arithmetic encoding
-
LZ77 (Lempel-Ziv ’77)
-
LZW (Lempel-Ziv-Welch)
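Of the algorithms listed, zlib is often the easiest to recognise, because a zlib stream starts with a two-byte header that can be checked cheaply (per RFC 1950: the low nibble of the first byte is 8 for deflate, and the two bytes read as a big-endian 16-bit value are a multiple of 31). A minimal sketch, assuming the data of interest is held in memory as a bytes object; the function name find_zlib_streams and its min_output parameter are made up for this example:

```python
import zlib


def find_zlib_streams(blob, min_output=1):
    """Return (offset, decompressed bytes) for each offset where a complete
    zlib stream decodes. Brute force: fine for small files, slow for big ones."""
    hits = []
    for offset in range(len(blob) - 1):
        cmf, flg = blob[offset], blob[offset + 1]
        # Cheap header test before attempting a full decompression.
        if cmf & 0x0F != 8 or ((cmf << 8) | flg) % 31 != 0:
            continue
        decomp = zlib.decompressobj()
        try:
            out = decomp.decompress(blob[offset:])
        except zlib.error:
            continue
        # eof is True only if the stream ended cleanly (checksum included);
        # any bytes after the stream end up in decomp.unused_data.
        if decomp.eof and len(out) >= min_output:
            hits.append((offset, out))
    return hits
```

Each hit gives a candidate file position inside the archive, which can then be matched against offsets and sizes found in the index. False positives are possible, so treat the hits as leads rather than proof.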
References
Encryption
Other References