A Seq object is a sequence with sequence features placed on it. The Seq object contains a PrimarySeq
object for the actual sequence and also implements its interface.
In Bioperl we have 3 main players that people are going to use frequently
Bio::PrimarySeq - just the sequence and its names, nothing else.
Bio::SeqFeatureI - a feature on a sequence, potentially with a sequence
and a location and annotation.
Bio::Seq - A sequence and a collection of sequence features
(an aggregate) with its own annotation.
Although Bioperl is not tied heavily to file formats these distinctions do map to file formats sensibly
and for some bioinformaticians this might help
Bio::PrimarySeq - Fasta file of a sequence
Bio::SeqFeatureI - A single entry in an EMBL/GenBank/DDBJ feature table
Bio::Seq - A single EMBL/GenBank/DDBJ entry
By having this split we avoid a lot of nasty circular references (sequence features can hold a reference
to a sequence without the sequence holding a reference to the sequence feature). See Bio::PrimarySeq and
Bio::SeqFeatureI for more information.
Ian Korf really helped in the design of the Seq and SeqFeature system.
Examples
A simple and fundamental block of code:
use Bio::SeqIO;
my $seqIOobj = Bio::SeqIO->new(-file=>"1.fa"); # create a SeqIO object
my $seqobj = $seqIOobj->next_seq; # get a Seq object
With the Seq object in hand one has access to a powerful set of Bioperl methods and related Bioperl
objects. This next script will take a file of sequences in EMBL format and create a file of the reverse-
complemented sequences in Fasta format using Seq objects. It also prints out details about the exons it
finds as sequence features in Genbank Flat File format.
use Bio::Seq;
use Bio::SeqIO;
$seqin = Bio::SeqIO->new( -format => 'EMBL' , -file => 'myfile.dat');
$seqout= Bio::SeqIO->new( -format => 'Fasta', -file => '>output.fa');
while((my $seqobj = $seqin->next_seq())) {
print "Seen sequence ",$seqobj->display_id,", start of seq ",
substr($seqobj->seq,1,10),"\n";
if( $seqobj->alphabet eq 'dna') {
$rev = $seqobj->revcom;
$id = $seqobj->display_id();
$id = "$id.rev";
$rev->display_id($id);
$seqout->write_seq($rev);
}
foreach $feat ( $seqobj->get_SeqFeatures() ) {
if( $feat->primary_tag eq 'exon' ) {
print STDOUT "Location ",$feat->start,":",
$feat->end," GFF[",$feat->gff_string,"]\n";
}
}
}
Let's examine the script. The lines below import the Bioperl modules. Seq is the main Bioperl sequence
object and SeqIO is the Bioperl support for reading sequences from files and to files
use Bio::Seq;
use Bio::SeqIO;
These two lines create two SeqIO streams: one for reading in sequences and one for outputting sequences:
$seqin = Bio::SeqIO->new( -format => 'EMBL' , -file => 'myfile.dat');
$seqout= Bio::SeqIO->new( -format => 'Fasta', -file => '>output.fa');
Notice that in the "$seqout" case there is a greater-than sign, indicating the file is being opened for
writing.
Using the
'-argument' => value
syntax is common in Bioperl. The file argument is like an argument to open() . You can also pass in
filehandles or FileHandle objects by using the -fh argument (see Bio::SeqIO documentation for details).
Many formats in Bioperl are handled, including Fasta, EMBL, GenBank, Swissprot (swiss), PIR, and GCG.
$seqin = Bio::SeqIO->new( -format => 'EMBL' , -file => 'myfile.dat');
$seqout= Bio::SeqIO->new( -format => 'Fasta', -file => '>output.fa');
This is the main loop which will loop progressively through sequences in a file, and each call to
$seqio->next_seq() provides a new Seq object from the file:
while((my $seqobj = $seqio->next_seq())) {
This print line below accesses fields in the Seq object directly. The $seqobj->display_id is the way to
access the display_id attribute of the Seq object. The $seqobj->seq method gets the actual sequence out
as string. Then you can do manipulation of this if you want to (there are however easy ways of doing
truncation, reverse-complement and translation).
print "Seen sequence ",$seqobj->display_id,", start of seq ",
substr($seqobj->seq,1,10),"\n";
Bioperl has to guess the alphabet of the sequence, being either 'dna', 'rna', or 'protein'. The alphabet
attribute is one of these three possibilities.
if( $seqobj->alphabet eq 'dna') {
The $seqobj->revcom method provides the reverse complement of the Seq object as another Seq object. Thus,
the $rev variable is a reference to another Seq object. For example, one could repeat the above print
line for this Seq object (putting $rev in place of $seqobj). In this case we are going to output the
object into the file stream we built earlier on.
$rev = $seqobj->revcom;
When we output it, we want the id of the outputted object to be changed to "$id.rev", ie, with .rev on
the end of the name. The following lines retrieve the id of the sequence object, add .rev to this and
then set the display_id of the rev sequence object to this. Notice that to set the display_id attribute
you just need call the same method, display_id(), with the new value as an argument. Getting and setting
values with the same method is common in Bioperl.
$id = $seqobj->display_id();
$id = "$id.rev";
$rev->display_id($id);
The write_seq method on the SeqIO output object, $seqout, writes the $rev object to the filestream we
built at the top of the script. The filestream knows that it is outputting in fasta format, and so it
provides fasta output.
$seqout->write_seq($rev);
This block of code loops over sequence features in the sequence object, trying to find ones who have been
tagged as 'exon'. Features have start and end attributes and can be outputted in Genbank Flat File
format, GFF, a standarized format for sequence features.
foreach $feat ( $seqobj->get_SeqFeatures() ) {
if( $feat->primary_tag eq 'exon' ) {
print STDOUT "Location ",$feat->start,":",
$feat->end," GFF[",$feat->gff_string,"]\n";
}
}
The code above shows how a few Bio::Seq methods suffice to read, parse, reformat and analyze sequences
from a file. A full list of methods available to Bio::Seq objects is shown below. Bear in mind that some
of these methods come from PrimarySeq objects, which are simpler than Seq objects, stripped of features
(see Bio::PrimarySeq for more information).
# these methods return strings, and accept strings in some cases:
$seqobj->seq(); # string of sequence
$seqobj->subseq(5,10); # part of the sequence as a string
$seqobj->accession_number(); # when there, the accession number
$seqobj->alphabet(); # one of 'dna','rna',or 'protein'
$seqobj->version() # when there, the version
$seqobj->keywords(); # when there, the Keywords line
$seqobj->length() # length
$seqobj->desc(); # description
$seqobj->primary_id(); # a unique id for this sequence regardless
# of its display_id or accession number
$seqobj->display_id(); # the human readable id of the sequence
Some of these values map to fields in common formats. For example, The display_id() method returns the
LOCUS name of a Genbank entry, the (\S+) following the > character in a Fasta file, the ID from a
SwissProt file, and so on. The desc() method will return the DEFINITION line of a Genbank file, the
description following the display_id in a Fasta file, and the DE field in a SwissProt file.
# the following methods return new Seq objects, but
# do not transfer features across to the new object:
$seqobj->trunc(5,10) # truncation from 5 to 10 as new object
$seqobj->revcom # reverse complements sequence
$seqobj->translate # translation of the sequence
# if new() can be called this method returns 1, else 0
$seqobj->can_call_new
# the following method determines if the given string will be accepted
# by the seq() method - if the string is acceptable then validate()
# returns 1, or 0 if not
$seqobj->validate_seq($string)
# the following method returns or accepts a Species object:
$seqobj->species();
Please see Bio::Species for more information on this object.
# the following method returns or accepts an Annotation object
# which in turn allows access to Annotation::Reference
# and Annotation::Comment objects:
$seqobj->annotation();
These annotations typically refer to entire sequences, unlike features. See Bio::AnnotationCollectionI,
Bio::Annotation::Collection, Bio::Annotation::Reference, and Bio::Annotation::Comment for details.
It is also important to be able to describe defined portions of a sequence. The combination of some
description and the corresponding sub-sequence is called a feature - an exon and its coordinates within a
gene is an example of a feature, or a domain within a protein.
# the following methods return an array of SeqFeatureI objects:
$seqobj->get_SeqFeatures # The 'top level' sequence features
$seqobj->get_all_SeqFeatures # All sequence features, including sub-seq
# features, such as features in an exon
# to find out the number of features use:
$seqobj->feature_count
Here are just some of the methods available to SeqFeatureI objects:
# these methods return numbers:
$feat->start # start position (1 is the first base)
$feat->end # end position (2 is the second base)
$feat->strand # 1 means forward, -1 reverse, 0 not relevant
# these methods return or accept strings:
$feat->primary_tag # the name of the sequence feature, eg
# 'exon', 'glycoslyation site', 'TM domain'
$feat->source_tag # where the feature comes from, eg, 'EMBL_GenBank',
# or 'BLAST'
# this method returns the more austere PrimarySeq object, not a
# Seq object - the main difference is that PrimarySeq objects do not
# themselves contain sequence features
$feat->seq # the sequence between start,end on the
# correct strand of the sequence
See Bio::PrimarySeq for more details on PrimarySeq objects.
# useful methods for feature comparisons, for start/end points
$feat->overlaps($other) # do $feat and $other overlap?
$feat->contains($other) # is $other completely within $feat?
$feat->equals($other) # do $feat and $other completely agree?
# one can also add features
$seqobj->add_SeqFeature($feat) # returns 1 if successful
# sub features. For complex join() statements, the feature
# is one sequence feature with many sub SeqFeatures
$feat->sub_SeqFeature # returns array of sub seq features
Please see Bio::SeqFeatureI and Bio::SeqFeature::Generic, for more information on sequence features.
It is worth mentioning that one can also retrieve the start and end positions of a feature using a
Bio::LocationI object:
$location = $feat->location # $location is a Bio::LocationI object
$location->start; # start position
$location->end; # end position
This is useful because one needs a Bio::Location::SplitLocationI object in order to retrieve the
coordinates inside the Genbank or EMBL join() statements (e.g. "CDS
join(51..142,273..495,1346..1474)"):
if ( $feat->location->isa('Bio::Location::SplitLocationI') &&
$feat->primary_tag eq 'CDS' ) {
foreach $loc ( $feat->location->sub_Location ) {
print $loc->start . ".." . $loc->end . "\n";
}
}
See Bio::LocationI and Bio::Location::SplitLocationI for more information.