User guide
Introduction
AgileGenotyper determines genotypes for over 0.5 million SNPs that have been identified by the 1000 Genomes Project and are located in protein-coding
exons and their closely flanking intron sequences. The analysis is performed in the same manner as AgileAnnotator, except that
AgileGenotyper only genotypes specific positions known to be polymorphic and compares the deduced genotype with the known variants. If the
deduced genotype is consistent with the known alleles, it is stored. However, if the deduced genotype includes unknown alleles, or the position cannot be called for
reasons related to read depth, the position is recorded as a “Nocall”. Triallelic or indel variants are not genotyped.
Compared to AgileAnnotator, the default read depth and minimum minor allele frequency required by AgileGenotyper
are more stringent, with a minimum read depth set at 7 reads for the two most common alleles at a given position. A position is called as heterozygous if 25% or more of these reads
identify the minor allele. Table 1 highlights the variant calling criteria:
Variant genotyping criteria, by example
A | C | G | T | Read depth (minor+major) | Percent minor allele | Comments |
0 | 0 | 0 | 6 | 6 | 0% | No call, since read depth < 7 |
0 | 0 | 0 | 7 | 7 | 0% | Homozygous |
2 | 0 | 0 | 5 | 7 | 28.6% | Heterozygous A/T, as minor allele is >25% of major+minor allele read depth |
1 | 1 | 0 | 5 | 4 | 25% | No call, as combined read depth of major and a single minor allele <7 |
1 | 4 | 1 | 7 | 11 | 36% | No call, as minor allele read depth is not more than twice the read depth of the remaining two (presumptively erroneous) bases (A+G) |
0 | 25 | 0 | 75 | 100 | 25% | Heterozygous C/T, since minor allele is 25% of major+minor allele read depth |
0 | 25 | 0 | 76 | 101 | 24.7% | Homozygous T, since minor allele is <25% of major+minor allele read depth |
0 | 25 | 0 | 74 | 99 | 25.2% | Heterozygous C/T, since minor allele is >25% of major+minor allele read depth |
0 | 25 | 10 | 75 | 100 | 25% | Heterozygous C/T, since minor allele is 25% of major+minor allele read depth and more than twice the read depth of the remaining base (G) |
Table 1
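The criteria illustrated in Table 1 can be sketched in a few lines of Python. This is only an illustration inferred from the worked examples above, not AgileGenotyper's own code: the function name call_genotype, its parameters and the "T/T" style of output are all hypothetical. It also assumes that the check against the remaining (presumptively erroneous) bases applies only to candidate heterozygous calls, which the table does not state explicitly.

```python
NO_CALL = "Nocall"

def call_genotype(counts, min_depth=7, min_minor_pct=25, known_alleles=None):
    """Call a genotype from per-base read counts, e.g. {"A": 2, "C": 0, "G": 0, "T": 5}.

    Hypothetical sketch of the rules shown in Table 1.
    """
    # Rank the bases by read count; the top two are the major and minor alleles.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    (major, major_n), (minor, minor_n) = ranked[0], ranked[1]

    # Read depth is counted over the two most common alleles only.
    depth = major_n + minor_n
    if depth < min_depth:
        return NO_CALL

    # Integer arithmetic avoids rounding problems exactly at the 25% boundary.
    if minor_n * 100 >= min_minor_pct * depth:
        # The minor allele must have more than twice the reads of the other bases.
        other_n = sum(n for _, n in ranked[2:])
        if minor_n <= 2 * other_n:
            return NO_CALL
        alleles = sorted([major, minor])
    else:
        alleles = [major, major]

    # A genotype that includes an allele not known for this SNP is not stored.
    if known_alleles is not None and not set(alleles) <= set(known_alleles):
        return NO_CALL
    return "/".join(alleles)
```

For example, call_genotype({"A": 2, "C": 0, "G": 0, "T": 5}) gives "A/T", while passing known_alleles={"C", "T"} for the same counts gives "Nocall" because A is not a known allele at that position.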
Creating an annotation file
Figure 1: AgileGenotyper user interface
AgileGenotyper is designed to derive genotype information from exome pulldown sequence data. In keeping with this aim,
AgileGenotyper refers to an annotation file that contains the locations of protein coding exons and their genomic sequences.
This file can be created by either AgileGenotyper itself or the related program AgileAnnotator.
To create an annotation file, press Create in the Create annotated feature file panel (Figure 1).
This causes the Annotation file creation window to be displayed (Figure 2).
Figure 2: The annotation file creation window
The positions of the SNPs in the Access database are defined relative to the hg19 reference sequence build; therefore, the genomic sequences in the
annotation files and the CCDS data file MUST both be derived from the hg19 build. If discordant reference sequences are used, the analysis will fail!
The location of the uncompressed FASTA-format genomic reference files is selected using the Chromosomes button under
Chromosome sequence files. These reference sequence files must follow a specific naming convention: each file name starts with "chr",
followed by either the chromosome number or "X" or "Y", and has the .fa file extension. For example, autosome files are named chr1.fa, chr5.fa and so on, while the X chromosome file may be named chrX.fa or chr23.fa and the Y chromosome file chrY.fa or chr24.fa. (Note that while it is possible to include Y chromosome data in the annotation file, the Access
database does not contain any Y-specific SNPs. Most other programs designed to handle SNP genotype data also ignore the Y chromosome.)
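Before creating the annotation file, it may be useful to check the reference file names against this convention. The following Python sketch is one way to do so; the function name is hypothetical, and the sketch assumes that names are case-sensitive and that chromosome numbers run from 1 to 24 (neither detail is stated above).

```python
import re

# "chr" + chromosome number (1-24, an assumed range) or X or Y + ".fa"
_REFERENCE_NAME = re.compile(r"chr(?:[1-9]|1[0-9]|2[0-4]|X|Y)\.fa")

def is_valid_reference_name(file_name):
    """Return True if file_name follows the chr<N|X|Y>.fa naming convention."""
    return _REFERENCE_NAME.fullmatch(file_name) is not None
```

Here is_valid_reference_name("chr23.fa") is accepted as an alternative name for the X chromosome, whereas a name such as "chr1.fasta" is rejected because of its extension.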
Next press the CCDS button in the CCDS data files panel and select the file containing the positions
of the coding sequences as described by the Consensus CDS (CCDS) project. These files can be downloaded from the NCBI
CCDS web page or FTP site.
Finally, press the Create button under Create annotation file and enter a name for the genomic annotation
file. Since AgileGenotyper has to read all of the sequences in the genomic reference files and then write a large amount of data,
the creation of the annotation file may take several minutes.
Genotyping an exome-derived SAM file
Figure 3: Adjusting the variant calling parameters
Before a SAM file is screened for sequence variants, it is necessary to select either Solexa- or Sanger-type quality scores and to adjust
the variant calling cut-off parameters (quality and read depth). These settings are all accessible from the
program's menu (Figure 3).
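The choice of quality score type matters because the same character in a SAM file represents different qualities under the two schemes. The Python sketch below shows the commonly used conventions: Sanger-type scores are Phred values stored with an ASCII offset of 33, while Solexa-type scores use an offset of 64 and a slightly different scale. It is an illustration only; the function names are hypothetical and the guide does not describe how AgileGenotyper decodes the scores internally.

```python
import math

def sanger_to_phred(ch):
    """Decode a Sanger-type quality character (Phred score, ASCII offset 33)."""
    return ord(ch) - 33

def solexa_to_phred(ch):
    """Decode a Solexa-type quality character (ASCII offset 64) to a Phred score."""
    q_solexa = ord(ch) - 64
    # Standard conversion from the Solexa scale to the Phred scale.
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)
```

Under these conventions the character "+" corresponds to a Sanger-type quality of 10, so selecting the wrong score type would shift every base quality relative to the chosen cut-off.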
Figure 4: Entering data
Once the analysis cut-off parameters have been set, press the Access db button in the Exome SNP database
panel (Figure 4) and select the Access file containing the SNP genotype and position data (available as a separate download). Note that while this file is an Access
database, it is not necessary to have the Microsoft Access program installed on the computer.
To select an ORDERED SAM file, press Alignment file → Select (Figure 4)
and enter the name of the SAM file to be genotyped. The aligned sequence reads in this file must be ordered by chromosome and chromosomal position. Finally, press
Screen alignment data → Screen (Figure 4) and enter the name of the file to save the
genotype data to. AgileGenotyper will then export the SNP genotypes as it reads the SAM file, showing the progress
of the analysis in its title bar (Figure 5). Since AgileGenotyper only stores the sequence reads for one gene at a time, it is not memory-hungry,
and the speed of the analysis is limited by the speed at which it can read the SAM file (which should therefore be located on a local hard drive).
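Because the reads must be ordered by chromosome and chromosomal position, it can be worth checking a SAM file before screening it. The following Python sketch tests that each reference sequence appears as a single contiguous block and that positions never decrease within a block. The function name is hypothetical, and the sketch does not check any particular chromosome order, since the order expected by AgileGenotyper is not specified here.

```python
def is_coordinate_ordered(sam_lines):
    """Return True if SAM alignment lines are grouped by reference and sorted by position."""
    finished = set()   # reference names whose block has already been seen
    current = None     # reference name of the block being read
    last_pos = 0
    for line in sam_lines:
        # Skip header lines and blank lines.
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        rname, pos = fields[2], int(fields[3])  # SAM columns 3 (RNAME) and 4 (POS)
        if rname == "*":
            continue  # unmapped read
        if rname != current:
            if rname in finished:
                return False  # this reference re-appears after another one
            finished.add(rname)
            current, last_pos = rname, pos
        elif pos < last_pos:
            return False      # position went backwards within a reference
        else:
            last_pos = pos
    return True
```

A file can be checked with, for example, is_coordinate_ordered(open("sample.sam")), where "sample.sam" is a placeholder file name; the file is read line by line, so memory use stays low.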
Comparison of exome-derived SNP data with Affymetrix SNP 6.0 data
Exome-derived SNP data differ from Affymetrix SNP 6.0 data in two important ways. Firstly, in a typical Affymetrix microarray, a quarter to a third of the SNPs are
heterozygous. In contrast, we observe that less than 4% of SNPs derived from an exome sequencing experiment are heterozygous (and less than 2.5% of SNPs are homozygous for
the non-reference allele). This means that only about 25,000 SNPs derived from an exome sequence will differ from the reference sequence. Secondly, coverage of a region by
exome-derived SNPs is strongly affected by the region’s gene density. Consequently, in many regions of the genome, it is not possible to accurately identify small-scale
features (<10 Mb). Figure 6 shows a comparison of exome-derived data (lanes 1 to 3) against Affymetrix SNP 6.0 microarray data (lane 4), for chromosomes 1, 6, 8,
12 and 13. In each case, the exome genotypes were extracted using a minimum quality score of 10, a read depth of 7 and a minimum minor allele frequency of 10% (lane 1), 25% (lane 2)
or 40% (lane 3). Although autozygous regions can be identified in the exome data in many cases (red bars), this task is generally easier when
using microarray data. Figure 6E, for example, shows data for chromosome 13; an extended region of low gene density is revealed by its very low SNP coverage, which makes it
very difficult to assess for autozygosity.