Appendix
Illuminator alignment algorithm
General considerations
Illuminator was designed with the aim of
screening individual genes for mutations, rather than screening large regions
containing many genes. A number of practical considerations therefore influenced
the program design. For example, in a hunt for an unknown disease-causing
mutation, there is often in practice no way to prove that rare intronic sequence
variants are disease-causing. Therefore, the program has a facility to create
reference sequences that only include exons (along with limited flanking
intronic sequences). The relatively small size of such reference sequences makes
it possible to quickly align Illumina reads using a mismatch-tolerant algorithm.
Also, due to the comparatively small region to be aligned, it becomes fairly
straightforward to screen a number of genes by pooling several long PCR products
in a single GA-II lane and then screening each gene in turn.
Indexing the reference sequence
Central to the alignment algorithm is an index of all 65536 possible 8-nt DNA
sequences (octamers). The position of each successive
octamer within the reference sequence is linked to the entry for the
corresponding octamer sequence in this index. This
means that any 8-nt test sequence can be rapidly looked up in the index,
defining its location(s) in the reference file (Figure 1). Some octamers will
not occur in the reference sequence, while others will be relatively abundant.
Since the sequence reads originate from both the forward and reverse complement
of the reference DNA sequence, the program automatically reverse-complements the
genomic reference sequence and appends this to the forward sequence. This
obviates the need to reverse-complement each experimental read and search for
homology twice.
Mapping sequence reads against the reference file
The mapping process includes two cycles of data alignment. In the first
cycle, information is collected on the locations of SNPs, which is then used to
update the reference index. The data is then read again to allow the detection
of closely spaced SNPs during the second cycle. After this, the locations of
indels are found and mapped, using a subset of reads collected during the second
cycle. These steps are performed as follows:
First cycle
Each read is broken down into 4 non-overlapping octamers and the octamer
positions are found in the reference sequence by consulting the index of
octamers. The program then searches for a set of positions whose locations in
the reference file increase stepwise by 8 nt for each successive octamer (Figure
2). The sequence read is then mapped to the reference sequence, to identify the
positions of any sequence variants within the read that lie outside the region
containing the four octamers.
If the read does not contain a run of 4 sequential matching octamers as
described above, Illuminator next tries to identify
sets of octamers where the first and fourth octamers have the expected
separation, along with one or other of the internal octamers (Figure 3). Again,
the sequence read is then mapped to the reference sequence to identify the
positions of any sequence variants.
Finally, any read which has not yet been mapped is checked for sets of
octamers that include 3 consecutive positions in the reference sequence, as
shown in Figures 4 & 5. As before, these reads are mapped on to the
reference sequence and the position of any sequence variants are noted.
Second cycle
All the sequence variants found in the first cycle are now superimposed on
the reference sequence, which is then re-indexed to include both the original
reference sequence and the sequence variants found in the first alignment cycle.
By re-indexing the reference sequence in this way, it is possible to map reads
which contain multiple genuine sequence variants (such as a mutation close to a
SNP). Such reads would have failed to be aligned in the first cycle, because
only two octamers would have been mapped close to each other.
The alignment process is now re-run as before, except that this time any read
that is found to be mismatched at either end of the sequence (i.e. Figure
4 or 5) is stored to a temporary file. These reads are important since they are
the only class of read containing insertions or deletions that can easily be
mapped. Once the second round of analysis is complete, the locations of all the
sequence variants are noted. This information is then used to identify single
nucleotide variants, including both SNPs and potential pathogenic point
mutations. Also, the read depth is calculated from this alignment data, with the
data split into three "phases"; the first phase comprises all the reads aligned
as shown in Figures 1 and 2, the second phase comprises reads aligned with a 5′
mismatch (Figure 4), and the third phase consists of reads with a 3′ mismatch
(Figure 5). This information can all be inspected within the Data view window, by selecting the "Phase" value using
its list-box. (See Figure 10 in the
Illuminator guide).
Mapping indels
Through empirical observations of alignment data, we found that short regions
containing a large number of variants are strongly suggestive of the presence of
an insertion or deletion. Therefore, Illuminator
checks the positions of variants to identify regions where an unusual number of
substitutions occur together within a region of 16 nt. (These variants are
required to be present both in reads mapped to the forward strand and in reads
mapped to the reverse complement strand of the reference sequence.) The pattern
of a 3-nt deletion in a poly(A) stretch can be seen in Figures 6 (5′-mismatched
sequences) and 7 (3′-mismatched sequences) which are screen-shots from the Data view window of Illuminator.
The location of the possible indel is noted and the flanking sequence is
aligned to the sequence reads placed in the temporary file during the second
cycle of read alignment. This alignment is done in the same manner as shown in
Figures 4 and 5, but the alignments for the 5′ and 3′ mismatches to the forward
and reverse-complement sequences are kept separate, and used to create four
sequences of about 50 nt in length. The 5′- and 3′-mismatched sequences are then
aligned to create a consensus sequence across the indel (for both the forward
and reverse-complement sequences). These consensus sequences are then aligned to
the reference sequence, in order to identify the location and sequence of an
indel. Only indels that are independently identified on the forward and
reverse-complement sequences are annotated and exported. The sequence of these
alignments can be viewed in the Indels window of Illuminator, as described in the Indels window
section of the Illuminator guide.
|