spidey is a tool for aligning one or more mRNA sequences to a given genomic sequence. spidey was written
with two main goals in mind: to find good alignments regardless of intron size, and to avoid being confused
by nearby pseudogenes and paralogs. Towards the first goal, spidey uses BLAST and DotView to find its
alignments; because both are local alignment tools, spidey does not
intrinsically favor shorter or longer introns and has no maximum intron size. To avoid mistakenly
including exons from paralogs and pseudogenes, spidey first defines windows on the genomic sequence and
then performs the mRNA-to-genomic alignment separately within each window. Because of the way the
windows are constructed, neighboring paralogs or pseudogenes should be in separate windows and should not
be included in the final spliced alignment.
Initial alignments and construction of genomic windows
spidey takes as input a single genomic sequence and a set of mRNA accessions or FASTA sequences. All
processing is done one mRNA sequence at a time. The first step for each mRNA sequence is a high-
stringency BLAST against the genomic sequence. The resulting hits are analyzed to find the genomic
windows.
The BLAST alignments are sorted by score and then assigned into windows by a recursive function which
takes the first alignment and then goes down the alignment list to find all alignments that are
consistent with the first (same strand of mRNA, both the mRNA and genomic coordinates are nonoverlapping
and linearly consistent). On subsequent passes, the remaining alignments are examined and are put into
their own nonoverlapping, consistent windows, until no alignments are left. Depending on how many gene
models are desired, the top n windows are chosen to go on to the next step and the others are deleted.
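The window assignment described above can be sketched roughly as follows. This is only an illustration, not spidey's actual code: the `Hit` fields, the `consistent` test, and the `max_windows` parameter are hypothetical names, the recursion is written as a loop, and each candidate is checked against every alignment already in the window (an assumption that is somewhat stricter than comparing with the first alignment only). Genomic coordinates are assumed to be reported on the plus strand for both orientations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    m_start: int   # mRNA start, inclusive
    m_stop: int    # mRNA stop, inclusive
    g_start: int   # genomic start, inclusive (plus-strand coordinates)
    g_stop: int    # genomic stop, inclusive
    strand: str    # '+' or '-'
    score: float


def consistent(a: Hit, b: Hit) -> bool:
    """True if a and b could be parts of one spliced alignment."""
    if a.strand != b.strand:
        return False
    first, second = (a, b) if a.m_start <= b.m_start else (b, a)
    if first.m_stop >= second.m_start:      # overlap on the mRNA
        return False
    if a.strand == '+':                     # genomic order follows mRNA order
        return first.g_stop < second.g_start
    return second.g_stop < first.g_start    # minus strand: order is reversed


def build_windows(hits: List[Hit], max_windows: int) -> List[List[Hit]]:
    """Group hits into windows, best-scoring seed first; keep the top n."""
    remaining = sorted(hits, key=lambda h: h.score, reverse=True)
    windows = []
    while remaining:
        window = [remaining[0]]             # seed with the best remaining hit
        leftover = []
        for h in remaining[1:]:
            if all(consistent(h, w) for w in window):
                window.append(h)
            else:
                leftover.append(h)          # left for a later pass
        windows.append(window)
        remaining = leftover
    return windows[:max_windows]
```

In this sketch, a paralog that matches the same mRNA region as the true gene overlaps it in mRNA coordinates, so its hits fail the consistency test and fall into a separate window.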
Aligning in each window
Once the genomic windows are constructed, the initial BLAST alignments are freed and another BLAST search
is performed, this time with the entire mRNA against the genomic region defined by the window, and at a
lower stringency than the initial search. spidey then uses a greedy algorithm to generate a high-
scoring, nonoverlapping subset of the alignments from the second BLAST search. This consistent set is
analyzed carefully to make sure that the entire mRNA sequence is covered by the alignments. When gaps
are found between the alignments, the appropriate region of genomic sequence is searched against the
missing mRNA, first using a very low-stringency BLAST and, if the BLAST fails to find a hit, using
DotView functions to locate the alignment. When gaps are found at the ends of the alignments, the BLAST
and DotView searches are actually allowed to extend past the boundaries of the window. If the 3' end of
the mRNA does not align completely, it is first examined for the presence of a poly(A) tail. No attempt
is made to align the portion of the mRNA that seems to be a poly(A) tail; sometimes there is a poly(A)
tail that does align to the genomic sequence, and these are noted because they indicate the possibility
of a pseudogene.
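As a rough sketch of the steps in this paragraph, the functions below pick a nonoverlapping subset greedily by score, list the mRNA intervals that remain uncovered, and measure a trailing run of A's. The function names, the tuple layout, and the `min_len` threshold are assumptions made for illustration; the real program also considers genomic coordinates and uses its own criteria for recognizing a poly(A) tail.

```python
from typing import List, Tuple

# (mRNA start, mRNA stop, score); coordinates are 1-based and inclusive
Aln = Tuple[int, int, float]


def greedy_subset(alns: List[Aln]) -> List[Aln]:
    """Take alignments in descending score order, skipping any that
    overlap (on the mRNA) an alignment already chosen."""
    chosen: List[Aln] = []
    for a in sorted(alns, key=lambda x: x[2], reverse=True):
        if all(a[1] < c[0] or a[0] > c[1] for c in chosen):
            chosen.append(a)
    return sorted(chosen)


def mrna_gaps(chosen: List[Aln], mrna_len: int) -> List[Tuple[int, int]]:
    """Return the mRNA intervals not covered by any chosen alignment."""
    gaps = []
    pos = 1                                  # next mRNA base needing coverage
    for start, stop, _ in sorted(chosen):
        if start > pos:
            gaps.append((pos, start - 1))
        pos = max(pos, stop + 1)
    if pos <= mrna_len:
        gaps.append((pos, mrna_len))         # unaligned 3' end
    return gaps


def polya_tail_length(seq: str, min_len: int = 10) -> int:
    """Length of the trailing run of A's, or 0 if shorter than min_len
    (min_len is an arbitrary illustrative threshold)."""
    n = len(seq) - len(seq.rstrip("Aa"))
    return n if n >= min_len else 0
```

Each interval returned by `mrna_gaps` would then be the target of the lower-stringency searches described above, and a gap at the 3' end would first be compared with the poly(A) tail length.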
Now that the mRNA is completely covered by the set of alignments, the boundaries of the alignments (there
should be one alignment per exon now) are adjusted so that the alignments abut each other precisely and
so that they are adjacent to good splice donor and acceptor sites. Most commonly, two adjacent exons'
alignments overlap by as much as 20 or 30 base pairs on the mRNA sequence. The true exon boundary may
lie anywhere within this overlap, or (as we have seen empirically) even a few base pairs outside the
overlap. To position the exon boundaries, the overlap plus a few base pairs on each side is examined for
splice donor sites, using functions that have different splice matrices depending on the organism chosen.
The top few splice donor sites (by score) are then evaluated as to how much they affect the original
alignment boundaries. The site that affects the boundaries the least is chosen, and is evaluated as to
the presence of an acceptor site. The alignments are truncated or extended as necessary so that they
terminate at the splice donor site and so that they do not overlap.
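The boundary adjustment can be illustrated with a toy version of the donor search. Everything here is an assumption made for the sketch: a six-base consensus string stands in for the organism-specific splice matrices, the "effect on the original boundaries" is approximated as the distance of a site from the overlap region, and the acceptor check only looks for an AG dinucleotide. Positions are 0-based indices into the genomic string.

```python
from typing import Optional

DONOR_CONSENSUS = "GTAAGT"  # toy stand-in for a real splice-site matrix


def donor_score(genome: str, pos: int) -> int:
    """Score a putative intron start at genome[pos]; 0 if it lacks GT."""
    site = genome[pos:pos + len(DONOR_CONSENSUS)]
    if not site.startswith("GT"):
        return 0
    return sum(1 for a, b in zip(site, DONOR_CONSENSUS) if a == b)


def pick_donor(genome: str, ov_start: int, ov_end: int,
               flank: int = 3, top_k: int = 3) -> Optional[int]:
    """Choose an intron start within the overlap [ov_start, ov_end]
    plus `flank` bases on each side.

    The top_k sites by score are kept; among them, the site that lies
    closest to (or inside) the overlap is returned."""
    lo = max(0, ov_start - flank)
    hi = min(len(genome) - 2, ov_end + flank)
    scored = [(donor_score(genome, p), p) for p in range(lo, hi + 1)]
    scored = [sp for sp in scored if sp[0] > 0]
    top = sorted(scored, key=lambda sp: -sp[0])[:top_k]
    if not top:
        return None

    def shift(p: int) -> int:
        # 0 inside the overlap, otherwise distance to its nearest edge
        return max(ov_start - p, 0, p - ov_end)

    _, pos = min(top, key=lambda sp: (shift(sp[1]), -sp[0]))
    return pos


def has_acceptor(genome: str, exon_start: int) -> bool:
    """True if the two bases before exon_start are AG."""
    return exon_start >= 2 and genome[exon_start - 2:exon_start] == "AG"
```

Under these assumptions, the upstream exon would be trimmed or extended to end just before the returned position, and the downstream exon would start immediately after on the mRNA, with `has_acceptor` reporting whether an acceptor site precedes it.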
Final result
The windows are examined carefully to get the percent identity per exon, the number of gaps per exon, the
overall percent identity, the percent coverage of the mRNA, presence of an aligning or non-aligning
poly(A) tail, number of splice donor sites and the presence or absence of splice donor and acceptor sites
for each exon, and the occurrence of an mRNA that has a 5' or 3' end (or both) that does not align to the
genomic sequence. If the overall percent identity and percent length coverage are above the user-defined
cutoffs, a summary report is printed, and, if requested, a text alignment showing identities and
mismatches is also printed.
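A minimal sketch of how such statistics and cutoffs might be computed is shown below. It assumes each exon is available as a pair of equal-length aligned strings with `-` marking gaps, counts every gap column as one gap, and uses hypothetical names (`exon_stats`, `summarize`, `min_identity`, `min_coverage`); spidey's own definitions may differ in detail.

```python
from typing import Dict, List, Tuple

# One exon: (aligned mRNA row, aligned genomic row), '-' marks a gap
ExonAln = Tuple[str, str]


def exon_stats(mrna_row: str, genomic_row: str) -> Dict[str, float]:
    """Per-exon matches, alignment columns, gap columns, and mRNA bases."""
    assert len(mrna_row) == len(genomic_row)
    pairs = list(zip(mrna_row, genomic_row))
    matches = sum(1 for a, b in pairs if a == b and a != "-")
    gaps = sum(1 for a, b in pairs if a == "-" or b == "-")
    mrna_bases = sum(1 for a, _ in pairs if a != "-")
    return {
        "matches": matches,
        "columns": len(pairs),
        "gaps": gaps,
        "mrna_bases": mrna_bases,
        "identity": 100 * matches / len(pairs),
    }


def summarize(exons: List[ExonAln], mrna_len: int,
              min_identity: float, min_coverage: float) -> Dict[str, object]:
    """Overall identity and coverage, plus whether both cutoffs are met."""
    per_exon = [exon_stats(m, g) for m, g in exons]
    matches = sum(e["matches"] for e in per_exon)
    columns = sum(e["columns"] for e in per_exon)
    covered = sum(e["mrna_bases"] for e in per_exon)
    identity = 100 * matches / columns
    coverage = 100 * covered / mrna_len
    return {
        "per_exon": per_exon,
        "identity": identity,
        "coverage": coverage,
        "report": identity >= min_identity and coverage >= min_coverage,
    }
```

In this sketch, the `report` flag plays the role of the decision to print the summary: a model that falls below either cutoff is not reported.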
Interspecies alignments
spidey is capable of performing interspecies alignments. The major difference in interspecies alignments
is that the mRNA-genomic identity will not be close to 100% as it is in intraspecies alignments; also,
the alignments have numerous and lengthy gaps. If spidey is used in its normal mode to do interspecies
alignments, it produces gene models with a large number of very short exons. When the interspecies flag is set,
spidey uses different BLAST parameters that permit more and longer gaps and penalize mismatches less
heavily. This way, the alignments for the exons are much longer and more closely approximate the
actual gene structure.
Extracting CDS alignments
When spidey is run in network-aware mode or when ASN.1 files are used for the mRNA records, it is capable
of extracting a CDS alignment from an mRNA alignment and also printing the CDS information. Since the
CDS alignment is just a subset of the mRNA alignment, it is relatively straightforward to truncate the
exon alignments as necessary and to generate a CDS alignment. Furthermore, the untranslated regions are
now defined, so the percent identity for the 5' and 3' untranslated regions is also calculated.
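The truncation step can be sketched as follows, under simplifying assumptions: exon alignments are ungapped, lie on the plus strand, and are given as 1-based inclusive coordinates, and the CDS is a single interval on the mRNA. The function name and tuple layout are hypothetical.

```python
from typing import List, Tuple

# (mRNA start, mRNA stop, genomic start, genomic stop), 1-based inclusive
Exon = Tuple[int, int, int, int]


def clip_exons_to_cds(exons: List[Exon], cds_start: int,
                      cds_stop: int) -> List[Exon]:
    """Trim exon alignments to the CDS interval given in mRNA coordinates.

    Exons entirely within the untranslated regions are dropped; exons
    straddling a CDS boundary are shortened on both sequences."""
    clipped = []
    for m_start, m_stop, g_start, g_stop in exons:
        new_start = max(m_start, cds_start)
        new_stop = min(m_stop, cds_stop)
        if new_start > new_stop:
            continue                          # exon lies wholly in a UTR
        clipped.append((
            new_start,
            new_stop,
            g_start + (new_start - m_start),  # shift genomic start equally
            g_stop - (m_stop - new_stop),     # shift genomic stop equally
        ))
    return clipped
```

The portions removed by this clipping correspond to the 5' and 3' untranslated regions, whose percent identity could then be computed separately.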