The program mkvtree constructs an index for a given set of sequences. These are given as a list of input
files. The sequences are referred to as database sequences. They can be over any given alphabet. The
alphabet can be the DNA alphabet, or the protein alphabet, or any other alphabet consisting of printable
characters. An alphabet is specified by a file storing a symbol mapping. The index consists of several
files, the index files. Each such file stores a different table. The user specifies which tables (i.e.
which parts of the index) are written to file, using one of eight output options, or a single option
specifying that all tables are written to file.
We support the following formats for the input files. They are recognized according to the first
non-whitespace symbol in the file.
• multiple FASTA format: If the file begins with the symbol ">", then this file is considered to be a
file in multiple FASTA format (i.e. it contains one or more sequences). Each line starting with the
symbol ">" contains the description of the sequence following it. Each line not starting with the
symbol ">" contains the sequence. Empty lines are allowed and ignored when reading the input.
• multiple EMBL/SWISSPROT format: If the file begins with the string "ID", then this file is considered
to be a file in multiple EMBL format (i.e. containing one or more sequences, each in EMBL format).
The information contained in the "ID" and "DE" lines is taken as the description of the corresponding
sequence. The EMBL format is identical to the SWISSPROT format (w.r.t. the information we need to
extract from such entries). So one can also use files in multiple SWISSPROT format as input.
• multiple GENBANK format: If the file begins with the string "LOCUS", then this file is considered to
be a file in multiple GENBANK format (i.e. containing one or more entries in GENBANK format). The
information contained in the "LOCUS" and the "DEFINITION" lines is taken as the description of the
corresponding sequence.
• plain format: If the file does not begin with the symbol ">" or the strings "ID" or "LOCUS", then the
file is taken verbatim. That is, the entire file is considered to be the input sequence (whitespace
is not ignored).
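The rules for the multiple FASTA format can be illustrated with a minimal Python sketch. It is not the code used by mkvtree; the function name parse_multi_fasta is hypothetical, and treating whitespace-only lines as empty is an assumption made here.

```python
def parse_multi_fasta(text):
    """Split multiple-FASTA text into (description, sequence) pairs.

    Illustrative sketch only; not the parser used by mkvtree.
    """
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        if not line.strip():
            # Empty lines are allowed and ignored (assumption: a line
            # holding only whitespace counts as empty).
            continue
        if line.startswith(">"):
            # A ">" line holds the description of the sequence that follows.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        else:
            if description is None:
                raise ValueError("sequence data before the first '>' line")
            # Every other line contributes to the current sequence.
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records
```

Applied to a text with two ">" lines, the function returns two pairs, each holding a description and the concatenated sequence lines that follow it.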
There is no special option necessary to tell the program the sequence format. It automatically detects
the appropriate format, according to the rules given above. If none of the above rules applies, then the
program cannot recognize the input format and exits with error code 1. In such a case, please check whether
your input files conform to the input formats above. Another good solution is to use a more versatile
sequence format conversion program (e.g. readseq) to first generate multiple FASTA files and then feed
these into mkvtree.
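The detection rules can be sketched in a few lines of Python. This is only an illustration of the rules as stated, not the implementation in mkvtree; the function name detect_format and the returned labels are hypothetical. Following the list of formats above, the sketch treats every input that matches none of the three prefixes as plain format, and it does not model the error exit.

```python
def detect_format(text):
    """Guess the input format from the first non-whitespace characters.

    Illustrative sketch only; the labels returned here are made up.
    """
    head = text.lstrip()  # skip leading whitespace
    if head.startswith(">"):
        return "fasta"    # multiple FASTA format
    if head.startswith("ID"):
        return "embl"     # multiple EMBL/SWISSPROT format
    if head.startswith("LOCUS"):
        return "genbank"  # multiple GENBANK format
    return "plain"        # assumption: everything else is taken verbatim
```

Note that under these rules a plain sequence file whose content happens to start with "ID" or "LOCUS" would be classified as EMBL or GENBANK, so such files are better converted to multiple FASTA format first.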
Many sequence files are nowadays distributed in compressed form, produced by the program gzip. To simplify the
use of these files, mkvtree also accepts gzipped input files. These files must have the ending ".gz". The
gzipped files are decompressed internally and then processed like any other file.
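The handling of gzipped input can be pictured with the following Python sketch, which selects the reader by the file name ending alone. It is an assumption-laden illustration using Python's standard gzip module, not the code in mkvtree; the function name read_input is hypothetical.

```python
import gzip


def read_input(path):
    """Return the text content of an input file.

    Files ending in ".gz" are decompressed on the fly; all other files
    are read directly. Illustrative sketch only.
    """
    if path.endswith(".gz"):
        # gzip.open in text mode ("rt") decompresses while reading.
        with gzip.open(path, "rt") as handle:
            return handle.read()
    with open(path, "r") as handle:
        return handle.read()
```

After this step the content is the same as that of the uncompressed file, so the format rules above apply unchanged.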