leaff - sequence library utilities and applications

Description

       LEAFF (Let's Extract Anything From Fasta) is a utility program for working with multi-fasta files. In
       addition to providing random access to the base level, it includes several analysis functions.

Examples

       1. Print the first 10 bases of the fourth sequence in file 'genes':
            leaff -f genes -e 0 10 -s 3

       2. Print the first 10 bases of the fourth and fifth sequences:
            leaff -f genes -e 0 10 -s 3 -s 4

       3. Print the fourth and fifth sequences reverse complemented, and the sixth
          sequence forward. The second -R -C pair toggles reverse-complement off:
            leaff -f genes -R -C -s 3 -s 4 -R -C -s 5

       4. Convert file 'genes' to a seqStore 'genes.seqStore':
            leaff -f genes --seqstore genes.seqStore

Name

       leaff - sequence library utilities and applications

Notes

       Please note that options are ORDER DEPENDENT. Sequences are printed whenever a SEQUENCE SELECTION option
       occurs on the command line. OUTPUT OPTIONS are not reset when a sequence is printed.

       SEQUENCES are numbered starting at ZERO, not one!
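
       The order dependence described above can be pictured with a small Python model. This is only a
       sketch of the documented behaviour, not leaff's actual code: the function name run() is
       hypothetical, only -R, -C and -s are modelled, and -s is assumed to take a zero-based number.

```python
# Toy model of order-dependent option handling (NOT leaff's implementation).
# Output options change persistent state; each selection option prints
# immediately, using whatever state is in effect at that point.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def run(args, sequences):
    """Process args left to right; return the sequences that would be printed."""
    reverse = False
    complement = False
    printed = []
    i = 0
    while i < len(args):
        opt = args[i]
        if opt == "-R":
            reverse = not reverse            # toggles, as in example 3
        elif opt == "-C":
            complement = not complement      # toggles, as in example 3
        elif opt == "-s":
            i += 1
            seq = sequences[int(args[i])]    # sequences are numbered from zero
            if complement:
                seq = seq.translate(COMPLEMENT)
            if reverse:
                seq = seq[::-1]
            printed.append(seq)
        else:
            raise ValueError(f"option not modelled in this sketch: {opt}")
        i += 1
    return printed

genes = ["AAAA", "CCCC", "GGGG", "AACG", "AACC", "GGTA"]
out = run(["-R", "-C", "-s", "3", "-s", "4", "-R", "-C", "-s", "5"], genes)
print(out)
```

       In this model the fourth and fifth sequences come out reverse complemented and the sixth comes
       out unchanged, because the second -R -C pair switches both flags off before -s 5 is seen.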

Options

       SOURCE FILES
          -f file:     use sequence in 'file' (-F is also allowed for historical reasons)
          -A file:     read actions from 'file'

       SOURCE FILE EXAMINATION
          -d:          print the number of sequences in the fasta
          -i name:     print an index, labelling the source 'name'

       OUTPUT OPTIONS
          -6 <#>:      insert a newline every 60 letters
                         (if the next arg is a number, newlines are inserted every
                         n letters, e.g., -6 80.  Disable line breaks with -6 0,
                         or just don't use -6!)
          -e beg end:  Print only the bases from position 'beg' to position 'end'
                         (space based, relative to the FORWARD sequence!)  If
                         beg == end, then the entire sequence is printed.  It is an
                         error to specify beg > end, or beg > len, or end > len.
          -ends n:     Print n bases from each end of the sequence.  One input
                         sequence generates two output sequences, with '_5' or '_3'
                         appended to the ID.  If 2n >= length of the sequence, the
                         sequence itself is printed, no ends are extracted (they
                         overlap).
          -C:          complement the sequences
          -H:          DON'T print the defline
          -h:          Use the next word as the defline ("-H -H" will reset to the
                         original defline)
          -R:          reverse the sequences
          -u:          uppercase all bases
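
       The coordinate rules for -e and -ends can be sketched in Python, whose slices are also space
       based. The function names are hypothetical, and two details are assumptions rather than
       documented behaviour: that '_5' holds the first n bases and '_3' the last n bases of the
       forward sequence, and that '_5' is emitted first.

```python
def extract(seq, beg, end):
    """Model of '-e beg end' with space-based (between-base) coordinates."""
    if beg > end or beg > len(seq) or end > len(seq):
        raise ValueError("need beg <= end, beg <= len and end <= len")
    if beg == end:
        return seq                  # per the text, beg == end prints everything
    return seq[beg:end]             # space-based coordinates behave like slices

def ends(seq_id, seq, n):
    """Model of '-ends n'; returns a list of (id, bases) pairs."""
    if 2 * n >= len(seq):
        return [(seq_id, seq)]      # the ends would overlap: print the sequence
    # ASSUMPTION: '_5' is the leading end and '_3' the trailing end.
    return [(seq_id + "_5", seq[:n]), (seq_id + "_3", seq[len(seq) - n:])]

print(extract("ACGTACGTACGT", 0, 10))
print(ends("gene1", "ACGTACGTAC", 3))
```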

       SEQUENCE SELECTION
          -G n s l:    print n randomly generated sequences, 0 < s <= length <= l
          -L s l:      print all sequences such that s <= length < l
          -N l h:      print all sequences such that l <= % N composition < h
                         (NOTE 0.0 <= l < h < 100.0)
                         (NOTE that you cannot print sequences with 100% N.
                          This is a useful bug.)
          -q file:     print sequences from the seqid list in 'file'
          -r num:      print 'num' randomly picked sequences
          -s seqid:    print the single sequence 'seqid'
          -S f l:      print all the sequences from ID 'f' to 'l' (inclusive)
          -W:          print all sequences (do the whole file)
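
       The half-open ranges used by -L and -N can be illustrated with a short Python sketch. The
       function names are hypothetical, and counting both 'N' and 'n' is an assumption. The sketch
       also shows why a sequence that is 100% N can never be selected: the upper bound h must be
       below 100.0 and is itself excluded.

```python
def percent_n(seq):
    """Percentage of the sequence that is N (ASSUMPTION: either case counts)."""
    if not seq:
        return 0.0
    return 100.0 * sum(1 for base in seq if base in "Nn") / len(seq)

def select_by_length(seqs, s, l):
    """Model of '-L s l': keep sequences with s <= length < l."""
    return [seq for seq in seqs if s <= len(seq) < l]

def select_by_n(seqs, lo, hi):
    """Model of '-N l h': keep sequences with lo <= %N < hi."""
    if not (0.0 <= lo < hi < 100.0):
        raise ValueError("need 0.0 <= l < h < 100.0")
    return [seq for seq in seqs if lo <= percent_n(seq) < hi]

seqs = ["ACGT", "ANNT", "NNNN"]
print(select_by_n(seqs, 0.0, 99.9))   # the all-N sequence is never included
```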

       LONGER HELP
          -help analysis
          -help examples

       ANALYSIS FUNCTIONS
          --findduplicates a.fasta
                       Reports sequences that are present more than once.  Output
                       is a list of pairs of deflines, separated by a newline.

          --mapduplicates a.fasta b.fasta
                       Builds a map of IIDs from a.fasta and b.fasta that have
                       identical sequences.  Format is "IIDa <-> IIDb"

          --md5 a.fasta
                       Don't print the sequence, but print the md5 checksum
                       (of the entire sequence) followed by the entire defline.
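
       As a rough illustration, a checksum line of this kind can be produced with Python's hashlib.
       This is a sketch only: the function names are hypothetical, and the single-space separator and
       the exact bytes that leaff hashes (case, line breaks) are assumptions. The second function
       shows one possible way such checksums could be used to group identical sequences; it is not
       claimed to be how --findduplicates works.

```python
import hashlib
from collections import defaultdict

def md5_line(defline, seq):
    """Return 'checksum defline' for one sequence (separator is an assumption)."""
    digest = hashlib.md5(seq.encode("ascii")).hexdigest()
    return f"{digest} {defline}"

def duplicate_groups(records):
    """Group deflines whose sequences share a checksum; records are (defline, seq)."""
    by_digest = defaultdict(list)
    for defline, seq in records:
        digest = hashlib.md5(seq.encode("ascii")).hexdigest()
        by_digest[digest].append(defline)
    return [names for names in by_digest.values() if len(names) > 1]

records = [(">a", "ACGT"), (">b", "GGGG"), (">c", "ACGT")]
print(md5_line(*records[0]))
print(duplicate_groups(records))
```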

          --partition     prefix [ n[gmk]bp | n ] a.fasta
          --partitionmap         [ n[gmk]bp | n ] a.fasta
                       Partition the sequences into roughly equal size pieces of
                       size nbp, nkbp, nmbp or ngbp; or into n roughly equal sized
                       partitions.  Sequences larger than the partition size are
                       in a partition by themselves.  --partitionmap writes a
                       description of the partition to stdout; --partition creates
                       a fasta file 'prefix-###.fasta' for each partition.
                       Example: -F some.fasta --partition parts 130mbp
                                -F some.fasta --partition parts 16

          --segment prefix n a.fasta
                       Splits the sequences into n files, prefix-###.fasta.
                       Sequences are not reordered; the first n sequences are in
                       the first file, the next n in the second file, etc.

          --gccontent a.fasta
                       Reports the GC content over a sliding window of
                       3, 5, 11, 51, 101, 201, 501, 1001, 2001 bp.
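
       A sliding-window GC calculation can be sketched as follows. The function name is hypothetical,
       and how leaff positions the window, treats the sequence ends, or formats its output is not
       stated here, so this version simply reports the GC fraction of every full window, moving one
       base at a time.

```python
def gc_windows(seq, size):
    """GC fraction of each full window of `size` bases, stepping by one base."""
    seq = seq.upper()
    if size <= 0 or size > len(seq):
        return []
    count = sum(1 for base in seq[:size] if base in "GC")
    fractions = [count / size]
    for i in range(size, len(seq)):
        # Slide by one base: add the base entering, drop the base leaving.
        count += (seq[i] in "GC") - (seq[i - size] in "GC")
        fractions.append(count / size)
    return fractions

for size in (3, 5):
    print(size, gc_windows("GGCATGCATT", size))
```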

          --testindex a.fasta
                       Test the index of 'a.fasta'.  If the index is up-to-date, leaff
                       exits successfully, else, leaff exits with code 1.  If an
                       index file is supplied, that one is tested, otherwise, the
                       default index file name is used.

          --dumpblocks a.fasta
                       Generates a list of the blocks of N and non-N.  Output
                       format is 'base seq# beg end len'.  'N 84 483 485 2' means
                       that a block of 2 N's starts at space-based position 483
                       in sequence ordinal 84.  A '.' is the end of sequence
                       marker.
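
       The block coordinates can be reproduced with a short Python sketch. The function name is
       hypothetical. Labelling a non-N block with its first base is an assumption, and the '.'
       end-of-sequence marker is left out because its fields are not described here.

```python
from itertools import groupby

def dump_blocks(seq, seq_index):
    """Return (base, seq#, beg, end, len) tuples for alternating N / non-N blocks."""
    blocks = []
    pos = 0
    for is_n, run in groupby(seq, key=lambda base: base in "Nn"):
        run = list(run)
        end = pos + len(run)
        # ASSUMPTION: non-N blocks are labelled with their first base.
        label = "N" if is_n else run[0]
        blocks.append((label, seq_index, pos, end, len(run)))
        pos = end
    return blocks

for block in dump_blocks("ACNNGT", 84):
    print(*block)
```

       With space-based coordinates, end - beg always equals len, as in the 'N 84 483 485 2' example.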

          --errors L N C P a.fasta
                       For every sequence in the input file, generate new
                       sequences including simulated sequencing errors.
                       L -- length of the new sequence.  If zero, the length
                            of the original sequence will be used.
                       N -- number of subsequences to generate.  If L=0, all
                            subsequences will be the same, and you should use
                            C instead.
                       C -- number of copies to generate.  Each of the N
                            subsequences will have C copies, each with different
                            errors.
                       P -- probability of an error.

                       HINT: to simulate ESTs from genes, use L=500, N=10, C=10
                                -- make C=10 sequencer runs of N=10 EST sequences
                                   of length 500bp each.
                             to simulate mRNA from genes, use L=0, N=10, C=10
                             to simulate reads from genomes, use L=800, N=10, C=1
                                -- of course, N should be increased to give the
                                   appropriate depth of coverage

          --stats a.fasta
                       Reports size statistics; number, N50, sum, largest.
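
       For reference, these statistics can be computed as in the Python sketch below. It uses the
       common definition of N50 (the length of the sequence at which the running total, taken from
       longest to shortest, first reaches half of the sum). Whether leaff uses exactly this
       convention is an assumption, and the function name is hypothetical.

```python
def size_stats(lengths):
    """Return number, N50, sum and largest for a list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    n50 = 0
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:    # reached at least half of all bases
            n50 = length
            break
    return {
        "number": len(ordered),
        "n50": n50,
        "sum": total,
        "largest": ordered[0] if ordered else 0,
    }

print(size_stats([2, 3, 4, 5, 6]))
```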

          --seqstore out.seqStore
                       Converts the input file (-f) to a seqStore file (for instance,
                       for use with the Celera assembler or sim4db).