(Obtaining sub-alignments and combining alignments)
--start, -s <start_col>
Starting column of sub-alignment (indexing starts with 1). Default is 1. Note that coordinates
use the frame of reference of the entire alignment unless --refidx 1 is specified.
--end, -e <end_col>
Ending column of sub-alignment. Default is length of alignment. Note that coordinates use the
frame of reference of the entire alignment unless --refidx 1 is specified.
--seqs, -l <seq_list>
Comma-separated list of sequences to include (default) or exclude (if --exclude is given).
Indicate by sequence number or name (numbering starts with 1 and is evaluated *after* --order is
applied).
--exclude, -x
Exclude rather than include specified sequences.
--refidx, -r <ref_seq>
Index of reference sequence for coordinates. Use 0 to indicate the coordinate system of the
alignment as a whole (this is the default).
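
To make the two coordinate frames concrete, the following Python sketch maps a 1-based position
in a reference sequence to the 1-based alignment column that holds it, by skipping gap
characters in the reference row. It is an illustration only, not msa_view's own code; the
function name ref_pos_to_column and the toy row are hypothetical.

```python
def ref_pos_to_column(ref_row: str, pos: int) -> int:
    """Return the 1-based alignment column holding the pos-th non-gap
    character of ref_row (the alignment row of the reference sequence)."""
    seen = 0
    for col, ch in enumerate(ref_row, start=1):
        if ch != "-":
            seen += 1
            if seen == pos:
                return col
    raise ValueError("position lies beyond the end of the reference sequence")


ref_row = "AC--GT-A"
# The third base of the reference (the 'G') sits in alignment column 5.
print(ref_pos_to_column(ref_row, 3))
```

In the frame of the whole alignment (--refidx 0), a coordinate is simply the column number, so
no such mapping is needed.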
--aggregate, -A <name_list>
(Not compatible with --start or --end) Create an aggregate alignment from a set of alignment
files, by concatenating individual alignments. If used with --out-format SS and --unordered-ss,
the aggregate alignment will never be created explicitly (recommended for large data sets). The
argument <name_list> must be a list of sequence names, including all names in all specified
alignments (missing sequences will be replaced by rows of missing data). The standard <msa_fname>
argument should be replaced with a list of (whitespace-separated) file names.
--split-all, -X <filename_root>
Split output alignment into separate FASTA files by species. File naming convention is
filename_root.species.fa. If used with --gap-strip, gap characters will be stripped from all
output files. In this case, '--gap-strip <s>' should NOT be used (ALL and ANY should both work
fine).
(File formats, gap stripping, reordering, etc.)
--in-format, -i PHYLIP|FASTA|MPM|MAF|SS
Input file format (default is to guess format from file contents). FASTA is as usual. PHYLIP is
compatible with the formats used in the PHYLIP and PAML packages. MPM is the format used by the
MultiPipMaker aligner and some of Webb Miller's older tools. MAF ("Multiple Alignment Format") is
used by MULTIZ/TBA and the UCSC Genome Browser. SS is a simple format describing the sufficient
statistics for phylogenetic inference (distinct columns or tuples of columns and their counts).
Use --out-format SS with --in-format MAF for best efficiency (the explicit alignment is never
created). Also, use --unordered-ss if possible.
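
The idea behind the SS representation (distinct columns and their counts) can be sketched in a
few lines of Python. This is a conceptual illustration of unordered column counts only; it does
not read or write the actual SS file format, and the function name column_counts is
hypothetical.

```python
from collections import Counter


def column_counts(rows):
    """Count how often each distinct alignment column occurs.

    rows is a list of equal-length strings, one per sequence. Each column is
    represented as the string of its characters read from top to bottom."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows must have the same length")
    return Counter("".join(col) for col in zip(*rows))


rows = ["ACGA",
        "ACGA",
        "ATGA"]
counts = column_counts(rows)
for column, n in sorted(counts.items()):
    print(column, n)
```

Because identical columns are stored once with a count, a long alignment with few distinct
columns becomes much smaller than its explicit form.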
--out-format, -o PHYLIP|FASTA|MPM|SS
Output file format (default FASTA).
--alphabet, -a <alphabet_string>
Use the specified alphabet (default "ACGT"). In addition, '-' characters are assumed to represent
alignment gaps, and '*' and 'N' characters are allowed for missing data. Alphabetical letters not
in the alphabet will be converted to 'N's upon input. This option is ignored with SS input (the
alphabet is specified within SS files).
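
The conversion rule for letters outside the alphabet can be sketched as follows. This is a
minimal illustration of the rule as stated above, not msa_view's implementation; the function
name normalize is hypothetical, and the sketch compares characters literally, so how the tool
itself handles letter case is not modeled here.

```python
def normalize(seq, alphabet="ACGT"):
    """Replace alphabetical letters that are not in the alphabet with 'N'.

    '-' (gap), '*' and 'N' (missing data) are always allowed. Characters that
    are not letters are left untouched, since the rule only covers letters."""
    allowed = set(alphabet) | {"-", "*", "N"}
    return "".join(
        ch if ch in allowed or not ch.isalpha() else "N"
        for ch in seq
    )


print(normalize("ACRT-N*Y"))            # ambiguity codes R and Y become N
print(normalize("acgR", "ACGTNacgtn"))  # alphabet implied by --soft-masked
```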
--soft-masked, -f
Implies --alphabet 'ACGTNacgtn', useful for soft-masked sequences.
--unmask, -u
Remove soft-masking; convert to uppercase.
--pretty, -P
Pretty-print alignment (use '.' when a character matches the corresponding character in the first
sequence). Ignored if --out-format SS is selected.
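
The pretty-printing rule can be illustrated with a short Python sketch. It assumes every
matching character, including a gap that matches a gap, is shown as '.'; that detail and the
function name pretty are assumptions, not taken from the tool.

```python
def pretty(rows):
    """Return rows with characters equal to the first row replaced by '.'.

    The first row itself is returned unchanged."""
    first = rows[0]
    out = [first]
    for row in rows[1:]:
        out.append("".join("." if a == b else a for a, b in zip(row, first)))
    return out


for line in pretty(["ACGT", "ACGA", "TCGT"]):
    print(line)
```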
--gap-strip, -G ALL|ANY|<s>
Strip columns containing all gaps, any gaps, or a gap in the specified sequence (<s>). Indexing
starts at one and refers to the list *after* any sequences have been added or subtracted (via
--seqs and --exclude or --order).
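
The three stripping modes differ only in the test applied to each column, as the following
Python sketch shows. It is a conceptual illustration, not msa_view's code; the function name
gap_strip and the toy alignment are hypothetical.

```python
def gap_strip(rows, mode):
    """Remove alignment columns according to mode.

    mode is "ALL" (drop columns made up entirely of gaps), "ANY" (drop columns
    with at least one gap), or a 1-based sequence index (drop columns where
    that sequence has a gap)."""
    def drop(col):
        if mode == "ALL":
            return all(ch == "-" for ch in col)
        if mode == "ANY":
            return any(ch == "-" for ch in col)
        return col[mode - 1] == "-"

    kept = [col for col in zip(*rows) if not drop(col)]
    if not kept:
        return ["" for _ in rows]
    return ["".join(chars) for chars in zip(*kept)]


rows = ["A-C-",
        "A-CG",
        "A---"]
print(gap_strip(rows, "ALL"))
print(gap_strip(rows, "ANY"))
print(gap_strip(rows, 1))
```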
--collapse-missing, -p
(For use with -o SS) Convert all missing-data characters and gaps to "*" characters. Can be used
to make sufficient statistics more compact, which can improve the performance of phyloFit (all
missing data and gap characters are typically treated the same by phyloFit anyway).
--mark-missing, -K <maxlen>
Convert all gaps of length greater than <maxlen> to "*" characters. If --refidx is specified (with
a positive index), gaps in the designated reference sequence will not be altered. This is a useful
heuristic for distinguishing between microindels and regions of missing data (e.g., due to
large-scale indels, incomplete assemblies, or highly diverged sequences).
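
The heuristic can be sketched for a single sequence as follows. This is an illustration under
stated assumptions, not msa_view's implementation: the function name mark_missing is
hypothetical, and the exemption for the reference sequence is left to the caller (simply do not
call the function on that row).

```python
import re


def mark_missing(seq, maxlen):
    """Replace every run of more than maxlen consecutive gaps with '*'s.

    Shorter runs are treated as microindels and left as gaps."""
    long_gap = re.compile(r"-{%d,}" % (maxlen + 1))
    return long_gap.sub(lambda m: "*" * len(m.group()), seq)


# The run of two gaps is kept; the run of five is marked as missing data.
print(mark_missing("AC--GT-----A", 3))
```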
--missing-as-indels, -m
Convert all missing data characters (Ns and *s) to gap characters, except for Ns in a reference
sequence specified by --refidx, which will be replaced by randomly selected nucleotides. (This
allows the coordinate frame for the reference sequence to be maintained; this option is only
recommended if such Ns are rare.) If --refidx is not used, all Ns will be replaced by gaps. You
may want to use --gap-strip ALL with this option.
--order, -O <name_list>
Change order of rows in alignment to match sequence names specified in name_list. If a name
appears in name_list but not in the alignment, a row of gaps will be inserted. This option is
applied to the alignment *before* --seqs, --refidx, and --gap-strip are applied.
--reverse-complement, -V
Reverse complement output alignment.
--randomize, -R
Randomly permute the columns of the source alignment (done *before* taking sub-alignment).
Requires an ordered representation of the alignment (careful using with --in-format SS|MAF --
will create full alignment from sufficient statistics).
--fill-Ns, -N <s:b-e>
Fill sequence no. <s> with Ns, from <b> to <e>. Applied before --start, --end, --seqs,
--gap-strip, but after --order. Coordinate frame depends on --refidx. Can be used multiple
times.
--summary-only, -S
Report only summary statistics, rather than complete alignment. Statistics are for the alignment
that would otherwise be output (i.e., after other options have been applied).
--window-summary, -w <win_size>
Like -S, but output summary statistics for non-overlapping windows of the specified size.
(Sufficient statistics)
--tuple-size, -T <tup_size>
(For use with --out-format SS) Represent an alignment in terms of tuples of columns of the
designated size. Useful with context-dependent phylogenetic models.
--unordered-ss, -z
(For use with --out-format SS) Suppress the portion of the sufficient statistics concerned with
the order in which columns appear. Useful for analyses for which order is unimportant.
(MAF input)
--refseq, -M <fname>
Read the complete text of the reference sequence from <fname> (FASTA format) and combine it with
the contents of the MAF file to produce a complete, ordered representation of the alignment
(unaligned regions will be represented by gaps). Best used with --out-format SS. The reference
sequence of the MAF file is assumed to be the one that appears first in each block.
--keep-overlapping, -k
Keep blocks in MAF that have overlapping coordinates in the reference (1st) sequence (by default,
only the first one is kept). Useful in extracting unordered stats from a jumbled collection of
MAF blocks (e.g., output of Jim Kent's mafFrags program). Cannot be used with --refseq,
--features, or --cats-cycle.
(Site categories: all options require --out-format SS)
--features, -g <gff_fname>
(Requires --catmap) Read sequence annotations from the specified file (GFF) and label the columns
of the alignment accordingly. Note: UCSC BED and genepred formats are now recognized as well.
--catmap, -c <fname>|<string>
(optionally use with --features) Mapping of feature types to category numbers. Can either give a
filename or an "inline" description of a simple category map, e.g., --catmap "NCATS = 3 ; CDS 1-3"
or --catmap "NCATS = 1 ; UTR 1".
--cats-cycle, -Y <cycle_size>
(Alternative to --features and --catmap) Assign site categories in cycles of the specified size,
e.g., as 1,2,3,...,1,2,3 (for cycle_size == 3). Useful for in-frame coding sequence, or to
partition a data set into nonoverlapping tuples of columns (use with --do-cats).
--do-cats, -C <cat_list>
(For use with --features or --cats-cycle) Obtain sufficient statistics only for the specified
categories (comma-delimited list, by number).
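
The cyclic category assignment and the category selection described above can be sketched in
Python. This is a conceptual illustration only; the function names cycle_categories and
select_columns are hypothetical and are not part of msa_view.

```python
def cycle_categories(n_cols, cycle_size):
    """Assign categories 1..cycle_size to columns in a repeating cycle."""
    return [(i % cycle_size) + 1 for i in range(n_cols)]


def select_columns(rows, cats, wanted):
    """Keep only the columns whose category number is in wanted."""
    keep = [i for i, c in enumerate(cats) if c in wanted]
    return ["".join(row[i] for i in keep) for row in rows]


rows = ["ATGGCCA"]
cats = cycle_categories(len(rows[0]), 3)
print(cats)
# Keeping category 3 selects every third column, e.g. third codon positions
# when the sequence is in frame.
print(select_columns(rows, cats, {3}))
```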
--codons, -D
Extract sufficient statistics for in-frame codons. Implies --tuple-size 3 --cats-cycle 3
--do-cats 3. Not appropriate for use with --features/--catmap.
--reverse-groups, -W <tag>
(For use with --features) Group features by <tag> (e.g., "transcript_id" or "exon_id") and reverse
complement segments of the alignment corresponding to groups on the reverse strand. Groups must
be non-overlapping (see refeature --unique). Useful when extracting sufficient statistics for
strand-specific site categories (e.g., codon positions).
--4d, -4
(For use with --features; assumes coding regions have feature type 'CDS') Extract sufficient
statistics for fourfold degenerate synonymous sites. Implies --out-format SS --unordered-ss
--tuple-size 3 --reverse-groups transcript_id.