Insilicase
deus ex computa
     Skip Navigation Links.

Appendix

Illuminator alignment algorithm

General considerations

Illuminator was designed with the aim of screening individual genes for mutations, rather than screening large regions containing many genes. A number of practical considerations therefore influenced the program design. For example, in a hunt for an unknown disease-causing mutation, there is often in practice no way to prove that rare intronic sequence variants are disease-causing. Therefore, the program has a facility to create reference sequences that only include exons (along with limited flanking intronic sequences). The relatively small size of such reference sequences makes it possible to quickly align Illumina reads using a mismatch-tolerant algorithm. Also, due to the comparatively small region to be aligned, it becomes fairly straightforward to screen a number of genes by pooling several long PCR products in a single GA-II lane and then screening each gene in turn.

Indexing the reference sequence

Central to the alignment algorithm is an index of all 65536 possible 8-nt DNA sequences (octamers). The position of each successive octamer within the reference sequence is linked to the entry for the corresponding octamer sequence in this index. This means that any 8-nt test sequence can be rapidly looked up in the index, defining its location(s) in the reference file (Figure 1). Some octamers will not occur in the reference sequence, while others will be relatively abundant. Since the sequence reads originate from both the forward and reverse complement of the reference DNA sequence, the program automatically reverse-complements the genomic reference sequence and appends this to the forward sequence. This obviates the need to reverse-complement each experimental read and search for homology twice.

Figure 1

Figure 1

Mapping sequence reads against the reference file

The mapping process includes two cycles of data alignment. In the first cycle, information is collected on the locations of SNPs, which is then used to update the reference index. The data is then read again to allow the detection of closely spaced SNPs during the second cycle. After this, the locations of indels are found and mapped, using a subset of reads collected during the second cycle. These steps are performed as follows:

First cycle

Each read is broken down into 4 non-overlapping octamers and the octamer positions are found in the reference sequence by consulting the index of octamers. The program then searches for a set of positions whose locations in the reference file increase stepwise by 8 nt for each successive octamer (Figure 2). The sequence read is then mapped to the reference sequence, to identify the positions of any sequence variants within the read that lie outside the region containing the four octamers.

Figure 2

Figure 2

If the read does not contain a run of 4 sequential matching octamers as described above, Illuminator next tries to identify sets of octamers where the first and fourth octamers have the expected separation, along with one or other of the internal octamers (Figure 3). Again, the sequence read is then mapped to the reference sequence to identify the positions of any sequence variants.

Figure 3

Figure 3

Finally, any read which has not yet been mapped is checked for sets of octamers that include 3 consecutive positions in the reference sequence, as shown in Figures 4 & 5. As before, these reads are mapped on to the reference sequence and the position of any sequence variants are noted.

Figure 4

Figure 4

Figure 5

Figure 5

Second cycle

All the sequence variants found in the first cycle are now superimposed on the reference sequence, which is then re-indexed to include both the original reference sequence and the sequence variants found in the first alignment cycle. By re-indexing the reference sequence in this way, it is possible to map reads which contain multiple genuine sequence variants (such as a mutation close to a SNP). Such reads would have failed to be aligned in the first cycle, because only two octamers would have been mapped close to each other.

The alignment process is now re-run as before, except that this time any read that is found to be mismatched at either end of the sequence (i.e. Figure 4 or 5) is stored to a temporary file. These reads are important since they are the only class of read containing insertions or deletions that can easily be mapped. Once the second round of analysis is complete, the locations of all the sequence variants are noted. This information is then used to identify single nucleotide variants, including both SNPs and potential pathogenic point mutations. Also, the read depth is calculated from this alignment data, with the data split into three "phases"; the first phase comprises all the reads aligned as shown in Figures 1 and 2, the second phase comprises reads aligned with a 5′ mismatch (Figure 4), and the third phase consists of reads with a 3′ mismatch (Figure 5). This information can all be inspected within the Data view window, by selecting the "Phase" value using its list-box. (See Figure 10 in the Illuminator guide).

Mapping indels

Through empirical observations of alignment data, we found that short regions containing a large number of variants are strongly suggestive of the presence of an insertion or deletion. Therefore, Illuminator checks the positions of variants to identify regions where an unusual number of substitutions occur together within a region of 16 nt. (These variants are required to be present both in reads mapped to the forward strand and in reads mapped to the reverse complement strand of the reference sequence.) The pattern of a 3-nt deletion in a poly(A) stretch can be seen in Figures 6 (5′-mismatched sequences) and 7 (3′-mismatched sequences) which are screen-shots from the Data view window of Illuminator.

Figure 6

Figure 6

Figure 7

Figure 7

The location of the possible indel is noted and the flanking sequence is aligned to the sequence reads placed in the temporary file during the second cycle of read alignment. This alignment is done in the same manner as shown in Figures 4 and 5, but the alignments for the 5′ and 3′ mismatches to the forward and reverse-complement sequences are kept separate, and used to create four sequences of about 50 nt in length. The 5′- and 3′-mismatched sequences are then aligned to create a consensus sequence across the indel (for both the forward and reverse-complement sequences). These consensus sequences are then aligned to the reference sequence, in order to identify the location and sequence of an indel. Only indels that are independently identified on the forward and reverse-complement sequences are annotated and exported. The sequence of these alignments can be viewed in the Indels window of Illuminator, as described in the Indels window section of the Illuminator guide.



Copyright © 2020 Insilicase.