AgileGenotyper guide

User guide

Introduction

AgileGenotyper determines genotypes for over 0.5 million SNPs that have been identified by the 1000 Genome Project and are located in protein-coding exons and their closely flanking intron sequences. The analysis is performed in the same manner as AgileAnnotator, except that AgileGenotyper only genotypes specific positions known to be polymorphic and compares the deduced genotype with the known variants. If the deduced genotype is consistent with the known alleles, it is stored. However, if the deduced genotype includes unknown alleles, or the position cannot be called for reasons related to read depth, the position is recorded as a “Nocall”. Triallelic or indel variants are not genotyped.

Compared to AgileAnnotator, the default read depth and minimum minor allele frequency required by AgileGenotyper are more stringent, with a minimum read depth set at 7 reads for the two must common alleles at a given position. A position is called as heterozygous if 25% or more of the reads identify the minor allele. Table 1 highlights the variant calling criteria:

Variant genotyping criteria, by example

A	C	G	T	Read depth (minor+major)	Percent minor allele	Comments
0	0	0	6	6	0%	No call, since read depth < 7
0	0	0	7	7	0%	Homozygous
2	0	0	5	7	28.6%	Heterozygous A/T, as minor allele is >25% of major+minor allele read depth
1	1	0	5	4	25%	No call, as combined read depth of major and a single minor allele <7
1	4	1	7	11	36%	No call, as minor allele read depth is not more than twice the read depth of the remaining two (presumptively erroneous) bases (A+G)
0	25	0	75	100	25%	Heterozygous A/T, since minor allele is 25% of major+minor allele read depth
0	25	0	76	101	24.7%	Homozygous T, since minor allele is <25% of major+minor allele read depth
0	25	0	74	99	25.2%	Heterozygous A/T, since minor allele is >25% of major+minor allele read depth
0	25	10	75	100	25%	Heterozygous A/T, since minor allele is 25% of major+minor allele read depth

Table 1

Creating an annotation file

Figure 1: AgileGenotyper user interface

AgileGenotyper is designed to derive genotype information from exome pulldown sequence data. In keeping with this aim, AgileGenotyper refers to an annotation file that contains the locations of protein coding exons and their genomic sequences. This file can be created by either AgileGenotyper itself or the related program AgileAnnotator. To create an annotation file, press Create in the Create annotated feature file panel (Figure 1). This causes the Annotation file creation window to be displayed (Figure 2).

Figure 2: The annotation file creation window

The positions of the SNPs in the Access database are defined relative to the hg19 reference sequence build; therefore, the genomic sequences in the annotation files and the CCDS data file MUST both be derived from the hg19 build. If discordant reference sequences are used, the analysis will fail!

The location of the uncompressed FASTA-format genomic reference files is selected using the Chromosomes button under Chromosome sequence files. These reference sequence files must follow the specific naming convention where each file name starts with "chr" followed by either the chromosome number or "X" or "Y", and has the .fa file extension. Permitted names include chr1.fa, chr5.fa for an autosome, while the X and Y may be named chrX.fa or chr23.fa and chrY.fa or chr24.fa, respectively. (Note that while it is possible to include Y chromosome data in the annotation file, the Access database does not contain any Y-specific SNPs. Most other programs designed to handle SNP genotype data also ignore the Y chromosome.)

Next press the CCDS button in the CCDS data files panel and select the file containing the positions of the coding sequences as described by the Consensus CDS (CCDS) project. These files can be downloaded from the NCBI CCDS web page or FTP site.

Finally, press the Create button under Create annotation file and enter a name for the genomic annotation file. Since AgileAnnotator has to read all of the sequences in the genomic reference files and then write a large amount of data, the creation of the annotation file may take several minutes.

Genotyping an exome-derived SAM file

Figure 3: Adjusting the variant calling parameters

Before a SAM file is screened for sequence variants, it is necessary to select Solexa- or Sanger-type quality scores, using the Type of quality score option, and to adjust the variant calling cut-off parameters (quality and read depth), all accessible under the Options menu (Figure 3).

Figure 4: Entering data

Once the analysis cut-off parameters have been set, press the Access db button in the Exome SNP database panel (Figure 4) and select the Access file containing the SNP genotype and position data (download here). Note that while this file is an Access database, it is not necessary to have the Microsoft Access program installed on the computer.

To select an ORDERED SAM file, press Alignment file → Select (Figure 4) and enter the name of the SAM file to be genotyped. The aligned sequence reads in this file must be ordered by chromosome and chromosomal position. Finally, press Screen alignment data → Screen (Figure 4) and enter the name of the file to save the genotype data to. AgileGenotyper will then export the SNP genotypes as it reads the SAM file, showing the progress of the analysis in its title bar (Figure 5). Since AgileGenotyper only stores the sequence reads for one gene at a time, it is not memory-hungry, and the speed of the analysis is limited by the speed at which it can read the SAM file (which should therefore be located on a local hard drive).

Figure 5: The title bar of AgileGenotyper shows the status of the analysis.

Comparison of exome-derived SNP data with Affymetrix SNP 6.0 data

Exome-derived SNP data differ from Affymetrix SNP 6.0 data in two important ways. Firstly, in a typical Affymetrix microarray, a quarter to a third of the SNPs are heterozygous. In contrast, we observe that less than 4% of SNPs derived from an exome sequencing experiment are heterozygous (and less than 2.5% of SNPs are homozygous for the non-reference allele). This means that only about 25,000 SNPs derived from an exome sequence will differ from the reference sequence. Secondly, coverage of a region by exome-derived SNPs is strongly affected by the region’s gene density. Consequently, in many regions of the genome, it is not possible to accurately identify small scale features (<10 Mb). Figure 6 shows a comparison of exome-derived data (lanes 1 to 3) against Affymetrix SNP 6.0 microarray data (lane 4), for Chromosomes 1, 6, 8, 12 and 13. In each case, the exome genotypes were extracted using a minimum quality score of 10, read depth of 7 and minor allele frequency of lane 1: 10%, lane 2: 25% and lane 3: 40%. While it can be seen that in many cases it is possible to identify autozygous regions in the exome data (red bars), this task is generally easier when using microarray data. Figure 6E, for example, shows data for chromosome 13; an extended region of low gene density is revealed by its very low SNP coverage, which makes it very difficult to assess for autozygosity.

Figure 6: A comparison of exome-derived SNP genotyping data against Affymetrix SNP 6.0 microarray data. Figures A to E repesent Chromosomes 1, 6, 8, 12 and 13 respectively. The minor allele frequency cut-off value was 10%, 25% and 40% for lanes 1 , 2 and 3 respectively. The red bars show regions of autozygosity.