
meryl - in- and out-of-core kmer counting and utilities

Description

meryl computes the kmer content of genomic sequences.  Kmer content is represented as a list of kmers and
       the number of times each occurs in the input sequences.  The kmer can be restricted to only  the  forward
       kmer,  only the reverse kmer, or the canonical kmer (lexicographically smaller of the forward and reverse
       kmer at each location).  Meryl can report the histogram of counts, the list of kmers and their counts, or
       can perform mathematical and set operations on the processed data files.

       The output of meryl is two binary files, called a meryl database, which can be quickly dumped to  provide
       a histogram of counts, or the actual counts.  A C++ library is supplied for direct access to the files.
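The counting described above can be illustrated with a minimal Python sketch. This is not meryl's implementation: the function names are hypothetical, "reverse kmer" is assumed to mean the reverse complement, and windows containing characters other than A, C, G and T are assumed to be skipped.

```python
from collections import Counter

# Translation table for complementing a DNA string.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of an ACGT string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def count_kmers(sequence: str, k: int, mode: str = "canonical") -> Counter:
    """Count kmers of length k in one sequence (hypothetical helper).

    mode is "forward", "reverse" or "canonical"; canonical keeps the
    lexicographically smaller of the forward and reverse kmer.
    """
    counts = Counter()
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        forward = seq[i:i + k]
        if set(forward) - set("ACGT"):
            continue  # assumption: windows with ambiguous bases are skipped
        reverse = reverse_complement(forward)
        if mode == "forward":
            counts[forward] += 1
        elif mode == "reverse":
            counts[reverse] += 1
        elif mode == "canonical":
            counts[min(forward, reverse)] += 1
        else:
            raise ValueError(f"unknown mode: {mode}")
    return counts
```

With canonical counting, a kmer and its reverse complement share one entry, so `count_kmers("ACGTT", 2)` records AC and GT under the single key AC.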

Example

Batch creation of a table
       Initialize  the  compute  with  -configbatch, which needs all the build options.  Execute all -countbatch
       jobs, then -mergebatch to complete.

              meryl -configbatch -B [options] -o file
              meryl -countbatch 0 -o file
              meryl -countbatch 1 -o file
              ...
              meryl -countbatch N -o file
              meryl -mergebatch N -o file
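The same sequence of commands can be generated programmatically, for example to hand each count job to a scheduler. The sketch below only builds argument lists and does not run anything. The function name and its parameters are hypothetical, the build options are passed through untouched, and N is used exactly as in the example above (the last -countbatch index, repeated for -mergebatch).

```python
def batch_commands(prefix, last_batch, build_options):
    """Return the meryl command lines for a batched build (hypothetical helper).

    prefix        -- value given to -o in every command
    last_batch    -- the N of the example: batches 0..N are counted
    build_options -- list of build options, supplied by the caller as-is
    """
    commands = [["meryl", "-configbatch", "-B", *build_options, "-o", prefix]]
    # One independent count job per batch; these may run in parallel.
    for n in range(last_batch + 1):
        commands.append(["meryl", "-countbatch", str(n), "-o", prefix])
    # The merge runs only after every count job has finished.
    commands.append(["meryl", "-mergebatch", str(last_batch), "-o", prefix])
    return commands
```

Each inner list is in the form accepted by `subprocess.run`, so the count jobs can be dispatched however the local environment requires.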

Name

       meryl - in- and out-of-core kmer counting and utilities

Options

-P     Estimate  memory  requirements. Given a sequence file (-s) or an upper limit on the number of mers
              in the file (-n), compute the table size (-t in build) to minimize the  memory  usage.  This  mode
              recognizes the following options:

              -m#   size of a mer (required)

              -c#   homopolymer compression (optional)

              -p     enable positions

              -s seq.fasta
                     Sequence file to be scanned to determine the number of mers

              -n#   compute params assuming file with this many mers in it

              Only one of -s and -n needs to be specified.  If both are given, -s takes priority.

       -B     Compute  the mer-count tables given a sequence file (-s) and lots of parameters.  By default, both
              strands are processed.

              -f     only build for the forward strand

              -r     only build for the reverse strand

              -C     use canonical mers (assumes both strands)

              -L#   DON'T save mers that occur less than # times

              -U#   DON'T save mers that occur more than # times

              -m#   size of a mer (required)

              -c#   homopolymer compression (optional)

              -p     enable positions

              -s seq.fasta
                     sequence to build the table for

              -o tblprefix
                     output table prefix

              -v     entertain the user

              The meryl process can run in one large memory batch, in many small memory batches,  or  under  SGE
              control, all with or without using multiple CPU cores.  By default, the computation is done as one
              large  sequential process.  Multi-threaded operation is possible, at additional memory expense, as
              is segmented operation, at additional I/O expense.

              Threaded operation
                     Split the counting into n almost-equally sized pieces.  This uses an extra h MB (from -P)
                     per thread.

                     -threads n
                            use n threads to build

              Segmented, sequential operation
                     Split the counting into pieces that will fit into no more than m MB of memory, or into n
                     equal sized pieces.  Each piece is computed sequentially, and the results are merged at the
                     end.  Only one of -memory and -segments is needed.

                     -memory m
                            use at most m MB of memory per segment

                     -segments n
                            use n segments

              Segmented, batched operation
                     Same as sequential, except this allows each segment to be manually  executed  in  parallel.
                     Only one of -memory and -segments is needed.  Also see the EXAMPLE section on this page.

                     -memory m
                            use at most m MB of memory per segment

                     -segments n
                            use n segments

                     -configbatch
                            create the batches

                     -countbatch n
                            run batch number n

                     -mergebatch
                            merge the batches

                     Batched mode can run on the grid.

                     -sge jobname
                            unique job name for this execution.  Meryl will submit jobs with name mpjobname,
                            ncjobname, nmjobname, for phases prepare, count and merge.

                     -sgebuild "options"
                     -sgemerge "options"
                            any additional options to qsub(1) (e.g., "-p -153 -pe thread 2 -A merylaccount")
                            N.B. - -N will be ignored
                            N.B. - be sure to quote the options

       -M     Given  a  list  of  tables, perform a math, logical or threshold operation.  Unless specified, all
              operations take any number of databases.  Math operations are:

              min    count is the minimum count for all databases.  If the mer does NOT exist in all  databases,
                     the mer has a zero count, and is NOT in the output.

              minexist
                     count is the minimum count for all databases that contain the mer

              max    count is the maximum count for all databases

              add    count is sum of the counts for all databases

              sub    count is the first minus the second (binary only)

              abs    count is the absolute value of the first minus the second (binary only)
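The first four math operations can be sketched in Python by treating each database as a plain dict from mer to count. This illustrates the definitions above and is not meryl's code: the function name is hypothetical, and sub and abs are left out because the text does not say how a mer missing from one database, or a negative result, is handled.

```python
def combine(databases, op):
    """Combine mer-count dicts with 'min', 'minexist', 'max' or 'add'.

    databases is a list of dicts mapping mer -> count (illustrative
    stand-ins for meryl databases).
    """
    all_mers = set().union(*databases)
    result = {}
    for mer in all_mers:
        # Counts from only those databases that actually contain the mer.
        present = [db[mer] for db in databases if mer in db]
        if op == "min":
            # A mer missing from any database has count zero: not output.
            if len(present) == len(databases):
                result[mer] = min(present)
        elif op == "minexist":
            result[mer] = min(present)
        elif op == "max":
            result[mer] = max(present)
        elif op == "add":
            result[mer] = sum(present)
        else:
            raise ValueError(f"unsupported operation: {op}")
    return result
```

The difference between min and minexist shows up for any mer that is absent from at least one input: min drops it, minexist keeps the smallest count it does have.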

              Logical operations are:

              and    outputs mer iff it exists in all databases

              nand   outputs mer iff it exists in at least one, but not all, databases

              or     outputs mer iff it exists in at least one database

              xor    outputs mer iff it exists in an odd number of databases
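The logical operations depend only on how many databases contain a mer, which the following Python sketch makes explicit. The function name is hypothetical, databases are modelled as dicts (or sets) of mers, and the sketch returns only the selected mers because the text does not state which count is attached to them.

```python
from collections import Counter


def logical(databases, op):
    """Select mers by membership: 'and', 'nand', 'or' or 'xor'."""
    n = len(databases)
    # For every mer, the number of databases that contain it.
    membership = Counter(mer for db in databases for mer in set(db))
    rules = {
        "and": lambda c: c == n,        # in all databases
        "nand": lambda c: 1 <= c < n,   # in at least one, but not all
        "or": lambda c: c >= 1,         # in at least one
        "xor": lambda c: c % 2 == 1,    # in an odd number of databases
    }
    keep = rules[op]
    return {mer for mer, c in membership.items() if keep(c)}
```

With exactly two inputs, nand and xor select the same mers; they differ only when three or more databases are given.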

              Threshold operations are:

              lessthan x
                     outputs mer iff it has count <  x

              lessthanorequal x
                     outputs mer iff it has count <= x

              greaterthan x
                     outputs mer iff it has count >  x

              greaterthanorequal x
                     outputs mer iff it has count >= x

              equal x
                     outputs mer iff it has count == x

              Threshold operations work on exactly one database.
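A threshold operation is a filter over a single database. As a minimal Python sketch, with a hypothetical function name and a dict from mer to count standing in for the database:

```python
import operator

# Operation names from the list above, mapped to comparison functions.
_THRESHOLDS = {
    "lessthan": operator.lt,
    "lessthanorequal": operator.le,
    "greaterthan": operator.gt,
    "greaterthanorequal": operator.ge,
    "equal": operator.eq,
}


def threshold(database, op, x):
    """Keep the mers of one database whose count satisfies `count op x`."""
    compare = _THRESHOLDS[op]
    return {mer: count for mer, count in database.items() if compare(count, x)}
```

For example, `threshold(db, "greaterthanorequal", 2)` drops every mer that occurs only once.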

              -s tblprefix
                     use tblprefix as a database

              -o tblprefix
                     create this output

              -v     entertain the user

       -D     Dump table (not all of these work)

              -Dd    Dump a histogram of the distance between the same mers.

              -Dt    Dump mers >= a threshold.  Use -n to specify the threshold.

              -Dc    Count the number of mers, distinct mers and unique mers.

              -Dh    Dump (to stdout) a histogram of mer counts.

              -s     Read the count table from here (leave off the .mcdat or .mcidx).
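The quantities reported by -Dc and -Dh can be illustrated on an in-memory table. This Python sketch does not reproduce meryl's output format. It assumes that "number of mers" is the sum of all counts, "distinct mers" is the number of different mers, and "unique mers" are those occurring exactly once; both function names are hypothetical.

```python
from collections import Counter


def count_summary(database):
    """Summarize a mer -> count dict (assumed meaning of the -Dc values)."""
    return {
        "total": sum(database.values()),    # every occurrence of every mer
        "distinct": len(database),          # different mers seen
        "unique": sum(1 for c in database.values() if c == 1),  # seen once
    }


def count_histogram(database):
    """Return sorted (count, number of mers with that count) pairs."""
    return sorted(Counter(database.values()).items())
```

The histogram and the summary are consistent by construction: the number of mers in the count-1 bin equals the "unique" value, and the bin sizes add up to "distinct".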

See Also

simple(1), mapMers(1), mapMers-depth(1), kmer-mask(1)

meryl 0~20150520+2004                               May 2015                                            MERYL(1)

Synopsis

   Estimating memory requirements
       meryl -P -m kmersize [-c#] [-p] -s seq.fasta

       meryl -P -m kmersize [-c#] [-p] -n mercount

   Building a table
       meryl -B -m kmersize [-c#] [-p] [-v] [-f|-r|-C] [-L minoccurrence] [-U maxoccurrence] [-threads n |
       {-segments segments | -memory megabytes} [-configbatch [-sge jobname]]] -s seq.fasta -o tblprefix

       meryl -countbatch number [-sgebuild "qsuboptionstring"] -o tblprefix

       meryl -mergebatch number [-sgemerge "qsuboptionstring"] -o tblprefix

   Performing operations on a table
       meryl -M operation [-v] -s tblprefix [-s tblprefix2 ...] -o output

   Dumping a table
       meryl -Dh -s tblprefix

       meryl -Dt -n mincount -s tblprefix
