Boulder::Genbank - Fetch Genbank data records as parsed Boulder Stones
Description
Boulder::Genbank provides retrieval and parsing services for NCBI Genbank-format records. It returns
Genbank entries in Stone format, allowing easy access to the various fields and values. Boulder::Genbank
is a descendant of Boulder::Stream, and provides a stream-like interface to a series of Stone objects.
>> IMPORTANT NOTE <<
As of January 2002, NCBI has changed their Batch Entrez interface. I have modified Boulder::Genbank so
as to use a "demo" interface, which fixes things, but this isn't guaranteed in the long run.
I have written to NCBI, and they may fix this -- or they may not.
>> IMPORTANT NOTE <<
Access to Genbank is provided by three different accessors, which together give access to remote and
local Genbank databases. When you create a new Boulder::Genbank stream, you provide one of the three
accessors, along with accessor-specific parameters that control what entries to fetch. The three
accessors are:
Entrez
This provides access to NetEntrez, accessing the most recent Genbank information directly from NCBI's
Web site. The parameters passed to this accessor are either a series of Genbank accession numbers,
or an Entrez query (see http://www.ncbi.nlm.nih.gov/Entrez/linking.html). If you provide a list of
accession numbers, the stream will return a series of stones corresponding to the numbers.
Otherwise, if you provide an Entrez query, the entries are returned in the order given by
Entrez.
File
This provides access to local Genbank entries by reading from a flat file (typically one of the .seq
files downloadable from NCBI's Web site). The stream will return a Stone corresponding to each of
the entries in the file, starting from the top of the file and working downward. The parameter in
this case is the path to the local file.
Yank
This provides access to local Genbank entries using Will Fitzhugh's Yank program. Yank provides fast
indexed access to a Genbank flat file using the accession number as the key. The parameter passed to
the Yank accessor is a list of accession numbers. Stones will be returned in the requested order.
By default the yank binary lives in /usr/local/bin/yank. To support other locations, you may define
the environment variable YANK to contain the full path.
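The lookup order described for the yank binary can be sketched in plain Perl as follows. This is only an illustration of the documented behavior; the helper name yank_path() is hypothetical and is not part of Boulder::Genbank.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Boulder::Genbank): pick the yank binary
# from the YANK environment variable, falling back to the documented
# default location when YANK is unset or empty.
sub yank_path {
    my $env = shift || \%ENV;    # allow a hash ref to be passed in for testing
    return (defined $env->{YANK} && length $env->{YANK})
        ? $env->{YANK}
        : '/usr/local/bin/yank';
}

print yank_path(), "\n";
```

Run with YANK unset, this prints the default path; run as, say, YANK=/opt/bin/yank perl script.pl, it prints the override instead.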
It is also possible to parse a single Genbank entry from a text string stored in a scalar variable,
returning a Stone object.
Boulder::Genbank methods
This section lists the public methods that the Boulder::Genbank class makes available.
new()
# Network fetch via Entrez, with accession numbers
$gb=new Boulder::Genbank(-accessor => 'Entrez',
-fetch => [qw/M57939 M28274 L36028/]);
# Same, but shorter and uses -> operator
$gb = Boulder::Genbank->new(qw(M57939 M28274 L36028));
# Network fetch via Entrez, with a query
$query = 'Homo sapiens[Organism] AND EST[Keyword]';
$gb=new Boulder::Genbank(-accessor => 'Entrez',
-fetch => $query);
# Local fetch via Yank, with accession numbers
$gb=new Boulder::Genbank(-accessor => 'Yank',
-fetch => [qw/M57939 M28274 L36028/]);
# Local fetch via File
$gb=new Boulder::Genbank(-accessor => 'File',
-fetch => '/usr/local/genbank/gbpri3.seq');
The new() method creates a new Boulder::Genbank stream on the accessor provided. The three possible
accessors are Entrez, Yank and File. If successful, the method returns the stream object. Otherwise
it returns undef.
new() takes the following arguments:
-accessor    Name of the accessor to use
-fetch       Parameters to pass to the accessor
-proxy       URL of an HTTP proxy, for use when the Entrez
             accessor must go through a firewall.
Specify the accessor to use with the -accessor argument. If not specified, it defaults to Entrez.
-fetch is an accessor-specific argument. The possibilities are:
For Entrez, the -fetch argument may point to a scalar, in which case it is interpreted as an Entrez
query string. See http://www.ncbi.nlm.nih.gov/Entrez/linking.html for a description of the query
syntax. Alternatively, -fetch may point to an array reference, in which case it is interpreted as a
list of accession numbers to retrieve. If -fetch points to a hash, it is interpreted as extended
information. See "Extended Entrez Parameters" below.
For Yank, the -fetch argument must point to an array reference containing the accession numbers to
retrieve.
For File, the -fetch argument must point to a string-valued scalar, which will be interpreted as the
path to the file to read Genbank entries from.
For Entrez (and Entrez only) Boulder::Genbank allows you to use a shortcut syntax in which you provide
new() with a list of accession numbers:
$gb = new Boulder::Genbank('M57939','M28274','L36028');
newFh()
This works like new(), but returns a filehandle. To recover each Genbank record, read from the
filehandle with the <> operator:
$fh = Boulder::Genbank->newFh('M57939','M28274','L36028');
while ($record = <$fh>) {
print $record->asString;
}
get()
The get() method is inherited from Boulder::Stream, and simply returns the next parsed Genbank Stone,
or undef if there is nothing more to fetch. It has the same semantics as the parent class, including
the ability to restrict access to certain top-level tags.
The object returned is a Stone::GB_Sequence object, which is a descendant of Stone.
put()
The put() method is inherited from the parent Boulder::Stream class, and will write the passed Stone
to standard output in Boulder format. This means that it is currently not possible to write a
Boulder::Genbank object back into Genbank flatfile form.
Extended Entrez Parameters
The Entrez accessor recognizes extended parameters that let you customize the search.
Instead of passing a query string scalar or a list of accession numbers as the -fetch argument, pass a
hash reference. The hashref should contain one or more of the following keys:
-query
The Entrez query to process.
-accession
The list of accession numbers to fetch, as an array ref.
-db The database to search. This is a single-letter database code selected from the following list:
m MEDLINE
p Protein
n Nucleotide
s Popset
-proxy
An HTTP proxy to use. For example:
-proxy => 'http://www.firewall.com:9000'
If you think you need this, get the correct URL from your system administrator.
As an example, here's how to search for ESTs from Oryza sativa that have been entered or modified since
1999.
my $gb = new Boulder::Genbank(-accessor => 'Entrez',
-fetch => {
-query => 'Oryza sativa[Organism] AND EST[Keyword] AND 1999[MDAT]',
-db => 'n'
});
Example Genbank Object
The following is an excerpt from a moderately complex Genbank Stone. The Sequence line and several other
long lines have been truncated for readability.
Authors=Spritz,R.A., Strunk,K., Surowy,C.S.O., Hoch,S., Barton,D.E. and Francke,U.
Authors=Spritz,R.A., Strunk,K., Surowy,C.S. and Mohrenweiser,H.W.
Locus=HUMRNP7011 2155 bp DNA PRI 03-JUL-1991
Accession=M57939
Accession=J04772
Accession=M57733
Keywords=ribonucleoprotein antigen.
Sequence=aagcttttccaggcagtgcgagatagaggagcgcttgagaaggcaggttttgcagcagacggcagtgacagcccag...
Definition=Human small nuclear ribonucleoprotein (U1-70K) gene, exon 10 and 11.
Journal=Nucleic Acids Res. 15, 10373-10391 (1987)
Journal=Genomics 8, 371-379 (1990)
Nid=g337441
Medline=88096573
Medline=91065657
Features={
Polya_site={
Evidence=experimental
Position=1989
Gene=U1-70K
}
Polya_site={
Position=1990
Gene=U1-70K
}
Polya_site={
Evidence=experimental
Position=1992
Gene=U1-70K
}
Polya_site={
Evidence=experimental
Position=1998
Gene=U1-70K
}
Source={
Organism=Homo sapiens
Db_xref=taxon:9606
Position=1..2155
Map=19q13.3
}
Cds={
Codon_start=1
Product=ribonucleoprotein antigen
Db_xref=PID:g337445
Position=join(M57929:329..475,M57930:183..245,M57930:358..412, ...
Gene=U1-70K
Translation=MTQFLPPNLLALFAPRDPIPYLPPLEKLPHEKHHNQPYCGIAPYIREFEDPRDAPPPTR...
}
Cds={
Codon_start=1
Product=ribonucleoprotein antigen
Db_xref=PID:g337444
Evidence=experimental
Position=join(M57929:329..475,M57930:183..245,M57930:358..412, ...
Gene=U1-70K
Translation=MTQFLPPNLLALFAPRDPIPYLPPLEKLPHEKHHNQPYCGIAPYIREFEDPR...
}
Polya_signal={
Position=1970..1975
Note=putative
Gene=U1-70K
}
Intron={
Evidence=experimental
Position=1100..1208
Gene=U1-70K
}
Intron={
Number=10
Evidence=experimental
Position=1100..1181
Gene=U1-70K
}
Intron={
Number=9
Evidence=experimental
Position=order(M57937:702..921,1..1011)
Note=2.1 kb gap
Gene=U1-70K
}
Intron={
Position=order(M57935:272..406,M57936:1..284,M57937:1..599, <1..>1208)
Gene=U1-70K
}
Intron={
Evidence=experimental
Position=order(M57935:284..406,M57936:1..284,M57937:1..599, <1..>1208)
Note=first gap-0.14 kb, second gap-0.62 kb
Gene=U1-70K
}
Intron={
Number=8
Evidence=experimental
Position=order(M57935:272..406,M57936:1..284,M57937:1..599, <1..>1181)
Note=first gap-0.14 kb, second gap-0.62 kb
Gene=U1-70K
}
Exon={
Number=10
Evidence=experimental
Position=1012..1099
Gene=U1-70K
}
Exon={
Number=11
Evidence=experimental
Position=1182..(1989.1998)
Gene=U1-70K
}
Exon={
Evidence=experimental
Position=1209..(1989.1998)
Gene=U1-70K
}
Mrna={
Product=ribonucleoprotein antigen
Position=join(M57928:358..668,M57929:319..475,M57930:183..245, ...
Gene=U1-70K
}
Mrna={
Product=ribonucleoprotein antigen
Citation=[2]
Evidence=experimental
Position=join(M57928:358..668,M57929:319..475,M57930:183..245, ...
Gene=U1-70K
}
Gene={
Position=join(M57928:207..719,M57929:1..562,M57930:1..577, ...
Gene=U1-70K
}
}
Reference=1 (sites)
Reference=2 (bases 1 to 2155)
=
perl v5.34.0 2022-06-08 Boulder::Genbank(3pm)
Methods Defined By The Genbank Stone Object
Each record returned from the Boulder::Genbank stream defines a set of methods that correspond to
features and other fields in the Genbank flat file record. Stone::GB_Sequence gives the full details,
but they are listed for reference here:
$length=$entry->length
Get the length of the sequence.
$start=$entry->start
Get the start position of the sequence, currently always "1".
$end=$entry->end
Get the end position of the sequence, currently always the same as the length.
@feature_list=$entry->features(-pos=>[50,450],-type=>['CDS','Exon'])
features() will search the entry feature list for those features that meet certain criteria. The
criteria are specified using the -pos and/or -type argument names, as shown below.
-pos
Provide a position or range of positions which the feature must overlap. A single position is
specified in this way:
-pos => 1500; # feature must overlap position 1500
or a range of positions in this way:
-pos => [1000,1500]; # 1000 to 1500 inclusive
If no criteria are provided, then features() returns all the features, and is equivalent to calling
the Features() accessor.
-type, -types
Filter the list of features by type or a set of types. Matches are case-insensitive, so "exon",
"Exon" and "EXON" are all equivalent. You may call with a single type as in:
-type => 'Exon'
or with a list of types, as in
-types => ['Exon','CDS']
The names "-type" and "-types" can be used interchangeably.
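To make the two criteria concrete, here is a minimal stand-alone sketch of the matching rules described above: an inclusive position overlap test and a case-insensitive type match. It works on plain hashes with hypothetical type, start and end keys, and the function name filter_features() is made up for this example. It is not how Boulder::Genbank stores or searches features.

```perl
use strict;
use warnings;

# Illustrative only: features are plain hashes here, not Stone objects.
my @features = (
    { type => 'Exon',   start => 1012, end => 1099 },
    { type => 'Intron', start => 1100, end => 1181 },
    { type => 'Source', start => 1,    end => 2155 },
);

# Return the features whose type is listed in $types (case-insensitive) and
# whose span overlaps the inclusive range [$lo, $hi]. Passing undef for
# $types or for $lo switches that criterion off.
sub filter_features {
    my ($feats, $types, $lo, $hi) = @_;
    my %want = map { lc($_) => 1 } @{ $types || [] };
    return grep {
        (!%want || $want{ lc($_->{type}) })
            && (!defined $lo || ($_->{start} <= $hi && $_->{end} >= $lo))
    } @$feats;
}

my @hits = filter_features(\@features, ['EXON', 'source'], 1000, 1500);
print join(',', map { $_->{type} } @hits), "\n";    # prints "Exon,Source"
```

A single position corresponds to a range whose two ends are equal, for example filter_features(\@features, undef, 1100, 1100).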
$seqObj=$entry->bioSeq;
Returns a Bio::Seq object from the Bioperl project. Dies with an error message unless the Bio::Seq
module is installed.
Name
Boulder::Genbank - Fetch Genbank data records as parsed Boulder Stones
See Also
Boulder, Boulder::Blast
Synopsis
use Boulder::Genbank;
# network access via Entrez
$gb = Boulder::Genbank->newFh( qw(M57939 M28274 L36028) );
while ($data = <$gb>) {
print $data->Accession;
@introns = $data->features->Intron;
print "There are ",scalar(@introns)," introns.\n";
$dna = $data->Sequence;
print "The dna is ",length($dna)," bp long.\n";
my @features = $data->features(-type=>[ qw(Exon Source Satellite) ],
-pos=>[90,310] );
foreach (@features) {
print $_->Type,"\n";
print $_->Position,"\n";
print $_->Gene,"\n";
}
}
# another syntax
$gb = new Boulder::Genbank(-accessor=>'Entrez',
-fetch => [qw/M57939 M28274 L36028/]);
# local access via Yank
$gb = new Boulder::Genbank(-accessor=>'Yank',
-fetch=>[qw/M57939 M28274 L36028/]);
while (my $s = $gb->get) {
# etc.
}
# parse a file of Genbank records
$gb = new Boulder::Genbank(-accessor=>'File',
-fetch => '/usr/local/db/gbpri3.seq');
while (my $s = $gb->get) {
# etc.
}
# parse flatfile records yourself
open (GB,"/usr/local/db/gbpri3.seq") or die "Can't open /usr/local/db/gbpri3.seq: $!";
local $/ = "//\n";
while (<GB>) {
my $s = Boulder::Genbank->parse($_);
# etc.
}
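The last idiom relies on each entry in a Genbank flat file ending with a line that contains only "//". The following stand-alone sketch shows just the record-splitting step, reading from an in-memory string so that it runs without Boulder installed. The two abbreviated records are made up for illustration; in real use each chunk would be handed to Boulder::Genbank->parse as in the synopsis.

```perl
use strict;
use warnings;

# Two abbreviated, made-up entries; real records carry many more fields.
my $flatfile = <<'END';
LOCUS       DEMO0001     12 bp    DNA
ORIGIN
        1 aagcttttcc ag
//
LOCUS       DEMO0002      8 bp    DNA
ORIGIN
        1 ggcagtgc
//
END

# Read from the string as if it were a file.
open(my $fh, '<', \$flatfile) or die "Can't open in-memory file: $!";

my @records;
{
    local $/ = "//\n";    # one read per Genbank entry
    while (my $record = <$fh>) {
        push @records, $record;
    }
}
close $fh;

printf "%d records\n", scalar @records;    # prints "2 records"
```

Localizing $/ inside a bare block keeps the changed record separator from leaking into later reads elsewhere in the program.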
