dictzip, dictunzip - compress (or expand) files, allowing random access

Credits

dictzip was written by Rik Faith (faith@cs.unc.edu) and is distributed under the terms of the GNU General
       Public License.  If you need to distribute under other terms, write to the author.

       The main libraries used by this program (zlib, regex, libmaa) are distributed under different terms,  so
       you  may  be able to use the libraries for applications which are incompatible with the GPL -- please see
       the copyright notices and license information that come with the  libraries  for  more  information,  and
       consult with your attorney to resolve these issues.

Description

dictzip  compresses  files  using the gzip(1) algorithm (LZ77) in a manner which is completely compatible
       with the gzip file format.  An extension to the gzip file format (Extra Field, described  in  2.3.1.1  of
       RFC 1952) allows extra data to be stored in the header of a compressed file.  Programs like gzip and zcat
       will  ignore  this  extra  data.  However, dictd(8), the DICT protocol dictionary server, will make use of
       this data to perform pseudo-random access on the file.  Files in the dictzip format should end  in  ".dz"
       so  that  they  may  be  distinguished  from  common  gzip  files  that do not contain the special header
       information.

       From RFC 1952, the extra field is specified as follows:

              If the FLG.FEXTRA bit is set, an "extra field" is present in the header, with  total  length  XLEN
              bytes.  It consists of a series of subfields, each of the form:

              +---+---+---+---+==================================+
              |SI1|SI2|  LEN  |... LEN bytes of subfield data ...|
              +---+---+---+---+==================================+

              SI1  and  SI2  provide a subfield ID, typically two ASCII letters with some mnemonic value.  Jean-
              Loup Gailly <gzip@prep.ai.mit.edu> is maintaining a registry of subfield IDs; please send him  any
              subfield ID you wish to use.  Subfield IDs with SI2 = 0 are reserved for future use.

              LEN gives the length of the subfield data, excluding the 4 initial bytes.

       The  dictzip program uses 'R' for SI1, and 'A' for SI2 (i.e., "Random Access").  After the LEN field, the
       data is arranged as follows:

       +---+---+---+---+---+---+===============================+
       |  VER  | CHLEN | CHCNT |  ... CHCNT words of data ...  |
       +---+---+---+---+---+---+===============================+

       As per RFC 1952, all data is stored least-significant byte first.  For VER 1 of the data, all values  are
       16 bits long (2 bytes), and are unsigned integers.
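
       The layout above can be read back with a few lines of Python.  This is only a  minimal  sketch,  not
       the code used by dictzip or dictd: the function name parse_ra_subfield is hypothetical, and the sketch
       assumes the caller has already extracted the XLEN bytes of the extra field from the gzip header.

```python
import struct


def parse_ra_subfield(extra: bytes):
    """Return (ver, chlen, chunk_sizes) from a gzip extra field, or None.

    `extra` is assumed to hold exactly the XLEN bytes of the extra field.
    """
    pos = 0
    while pos + 4 <= len(extra):
        # Each subfield starts with SI1, SI2 and a 16-bit little-endian LEN.
        si1, si2, length = struct.unpack_from("<BBH", extra, pos)
        data = extra[pos + 4 : pos + 4 + length]
        if (si1, si2) == (ord("R"), ord("A")):
            # VER, CHLEN, CHCNT: three unsigned 16-bit values, LSB first.
            ver, chlen, chcnt = struct.unpack_from("<HHH", data, 0)
            if ver != 1:
                raise ValueError("only VER 1 is handled in this sketch")
            # CHCNT words follow, one compressed length per chunk.
            sizes = struct.unpack_from("<%dH" % chcnt, data, 6)
            return ver, chlen, list(sizes)
        pos += 4 + length  # skip subfields belonging to other programs
    return None
```

       Subfields with other IDs are skipped, and None is returned for a gzip file that carries no "RA"
       subfield.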

       XLEN  (which  is specified earlier in the header) is a two byte integer, so the extra field can be 0xffff
       bytes long, 2 bytes of which are used for the subfield ID (SI1 and SI2), and 2 bytes of  which  are  used
       for  the  subfield  length  (LEN).   This  leaves  0xfffb  bytes  (0x7ffd 2-byte entries or 0x3ffe 4-byte
       entries).  Given that the zip output buffer must be 10% + 12 bytes larger than the input buffer,  we  can
       store  58969  bytes per entry, or about 1.8GB if the 2-byte entries are used.  If this becomes a limiting
       factor, another format version can be selected and defined for 4-byte entries.
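
       The arithmetic in the preceding paragraph can be checked with a short Python sketch.  The  variable
       names  are  illustrative only, and the figure of 58969 bytes per entry is taken from the text as given
       rather than derived here.

```python
XLEN_MAX = 0xFFFF          # largest value of the two-byte XLEN field
SUBFIELD_HEADER = 4        # SI1 + SI2 + two-byte LEN

available = XLEN_MAX - SUBFIELD_HEADER   # bytes left for subfield data
entries_2byte = available // 2           # count of 2-byte entries
entries_4byte = available // 4           # count of 4-byte entries

BYTES_PER_ENTRY = 58969    # uncompressed bytes per entry, as stated in the text
total = entries_2byte * BYTES_PER_ENTRY  # addressable uncompressed bytes

print(hex(available), hex(entries_2byte), hex(entries_4byte))
print(total, round(total / 2**30, 2))
```

       Note that this count, like the text, treats every 2-byte slot as  a  chunk  entry;  the  VER,  CHLEN,
       and CHCNT words occupy three of those slots, so the usable total is marginally smaller.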

       For compression, the file is divided up into "chunks" of data.  Each chunk is less than 64kB, and  can  be
       compressed  into  an  area  that  is also less than 64kB long (taking incompressible data into account --
       usually the data is compressed into a block that is much smaller than the  original).   The  CHLEN  field
       specifies the length of a "chunk" of data.  The CHCNT field specifies how many chunks are present, and the
       CHCNT  words  of  data specify how long each chunk is after compression (i.e., in the current compressed
       file).

       To perform random access on the data, the offset and length of the data are provided to library routines.
       These routines determine the chunk in which the desired data begins, and decompress that chunk.
       Consecutive chunks are decompressed as necessary.
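
       The chunk lookup described above can be sketched in Python as follows.  This  is  an  illustration
       under  stated  assumptions,  not  the  dictd  library  routine: the function name locate_chunks and its
       return keys are hypothetical, and compressed offsets are counted from the start  of  the  compressed
       data (that is, after the gzip header), which the caller is assumed to have located.

```python
from itertools import accumulate


def locate_chunks(offset, length, chlen, chunk_sizes):
    """Map an uncompressed (offset, length) request onto chunk indices.

    chlen       -- uncompressed chunk length (CHLEN)
    chunk_sizes -- compressed length of each chunk (the CHCNT words)
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first = offset // chlen                  # chunk holding the first byte
    last = (offset + length - 1) // chlen    # chunk holding the last byte
    if last >= len(chunk_sizes):
        raise ValueError("request extends past the last chunk")
    # starts[i] is where chunk i begins within the compressed data.
    starts = [0] + list(accumulate(chunk_sizes))
    return {
        "first": first,
        "last": last,
        "compressed_start": starts[first],
        "compressed_end": starts[last + 1],
        "skip": offset - first * chlen,      # bytes to drop after inflating
    }
```

       A caller would read the bytes between compressed_start and compressed_end,  decompress  chunks  first
       through last, discard the first skip bytes of the result, and keep the next length bytes.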

Name

       dictzip, dictunzip - compress (or expand) files, allowing random access

Options

-d or --decompress
Decompress. This is the default if the executable is called dictunzip.

-c or --stdout
Write output on standard output; keep original files unchanged. This is only available when
decompressing (because parts of the header must be updated after a write when compressing).

-f or --force
Force compression or decompression even if the output file already exists.

-h or --help
Display help.

-k or --keep
Do not delete the original file.

-l or --list
For each compressed file, list the following fields:

type: dzip, gzip, or text (includes files in unknown formats)
crc: CRC checksum
date and time: from header
chunks: number of chunks in file
size: size of each uncompressed chunk
compr.: compressed size
uncompr.: uncompressed size
ratio: compression ratio (0.0% if unknown)
name: name of uncompressed file

Unlike gzip, the compression method is not detected.

-L or --license
Display the dictzip license and quit.

-t or --test
Check the compressed file integrity. This option is not implemented. Instead, it will list the
header information.

-n or --no-name
Don't save the original filename and timestamp.

-v or --verbose
Verbose. Display extra information during compression.

-V or --version
Version. Display the version number and compilation options then quit.

-s start or --start start
Specify the offset at which to start decompression, using decimal numbers. The default is the beginning
of the file.

-e size or --size size
Specify the size of the portion of the file to decompress, using decimal numbers. The default is
the whole file.

-S start or --Start start
Specify the offset at which to start decompression, using base64 numbers. The default is the beginning
of the file.

-E size or --Size size
Specify the size of the portion of the file to decompress, using base64 numbers. The default is
the whole file.

-p prefilter or --pre prefilter
Specify a shell command to execute as a filter before compression or decompression of a chunk.
The pre- and post-compression filters can be used to provide additional compression or output
formatting. The filters may not increase the buffer size significantly. The pre- and post-
compression filters were designed to provide the most general interface possible.

-P postfilter or --post postfilter
Specify a shell command to execute as a filter after compression or decompression.

Synopsis

dictzip [options] name
dictunzip [options] name

Tradeoffs

Speed  True random file access is not realized, since any access, even for a single byte, requires that a
              64kB chunk be read and decompressed.  This is slower than accessing a flat text file, but is much,
              much faster than performing serial access on a fully compressed file.

       Space  For  the textual dictionary databases we are working with, the use of 64kB chunks and maximal LZ77
              compression realizes a file which is only about 4% larger than the same  file  compressed  all  at
              once.