User guide
Introduction
AgileVariantViewer allows sequence variants identified by AgileAnnotator (and optionally pre-filtered by
AgileKnownSNPFilter) to be interactively filtered in order to facilitate identification of genuine (and possibly pathogenic) mutations.
The filtering process can be performed by adjusting the cut-off parameters for read depth and the minor allele frequency. Variants can also be selected based on their position
within a gene (exon, intron, splice site or Kozak start site). It is also possible to select variants by type (point mutation or indel) and by predicted functional severity.
If the sequence variants have been pre-filtered by AgileKnownSNPFilter, variants may also be selected according to whether they have an
RS number, have been identified by the 1000 Genomes Project, or have not been described before. By adjusting the read depth and the minor allele frequency parameters and observing
the effect of this on known genuine sequence variants with RS numbers, it is possible to assess how the chosen cut-off values affect the detection of true variants.
Once the desired cut-off values have been settled upon, the resulting set of sequence variants, from a specific region, from a single chromosome, or from the entire genome,
can then be exported. It is also possible to export only homozygous variants or heterozygous variants.
Data used in this guide
The download page contains a link to the PXDN sequence variant data used for the examples in this guide. This dataset was used to identify a
PXDN mutation in consanguineous patients with a rare congenital eye defect (citation). The data is derived from an Agilent
capture reagent that was used to select regions for which the patient was autozygous. However, the reagent also enriched for other regions for which the patient was not autozygous.
Although the actual disease-causing variant is located on the tip of Chromosome 2p, this guide will concentrate on a larger autozygous region on Chromosome 20 (Figures 11 to 23)
and on a heterozygous region on Chromosome 1 (Figures 1 to 6). Figure 14 does show the region containing the deleterious variant on Chromosome 2. The variants have been pre-filtered
by AgileKnownSNPFilter.
ATOH7 sequence analysis walk-through
While this document describes in detail the use of AgileVariantViewer with data used to identify the PXDN mutation, a second guide that
demonstrates the identification of a mutation in ATOH7 can be found here. Unlike the PXDN example, to identify the disease-causing
ATOH7 mutation, it proved necessary to screen the sequence variants found in this second data set using AgileGeneFilter.
File formats
A description of the file formats for the variant and read depth files used by AgileVariantViewer can be found here.
Entering data files
AgileVariantViewer requires three files: the genomic annotation file, the sequence variants file, and the read depth file, each
of which is produced by AgileAnnotator.
Genomic annotation files contain
the sequences and positional information of the coding exons, as described by the Consensus CDS (CCDS)
project, and are used in the detection and annotation of sequence variants by AgileAnnotator. It is VERY important that
the genomic annotation files used to create and then view sequence variants refer to the same version of the CCDS and genomic reference data.
Figure 1: Selecting the data files
To select a genomic annotation file, press Select in the Genomic annotation file panel
(Figure 1) and select the correct file. Since the file is large it may take a few moments to read.
Next press Select in the Variant file panel (Figure 1) and select the file containing the sequence variants identified by
AgileAnnotator.
Finally, press Select in the Read depth file panel (Figure 1) and select the file containing the read depth data calculated
by AgileAnnotator. Once all three files have been selected, press Screen in the View variant data
panel to display a graphical view of the data (Figure 2).
Viewing the sequence variant data
When the Data view window opens, it displays a graphical view of the sequence variant data on Chromosome 1 in its upper panel (Figure 2).
However, an alternative view, of the exon read depth as a histogram, can be selected (Region view options → Show
→ “Read depth”). This option is described in greater detail below under Viewing the exonic read depth data (see Figure 15).
Below the graphical display panel are five “options” panels which
allow the sequence variants to be filtered and then exported. Each of these panels is described in detail below:
Description of the sequence variant display panel
The upper panel displays the analysis data organised as five horizontal strips (Figure 3).
- Strip A: shows the location of any genes (black rectangles) in the selected region, with the green and orange rectangles representing exons transcribed from
the positive and negative DNA strands respectively.
Placing the cursor over any gene in this strip causes the gene’s name to appear in the window’s title bar.
- Strip B: shows the positions of exons whose typical read depth is above (grey block) or below (red block) the cut-off value selected from the
Exon read depth cut off list in the Region view options panel (see below).
- Strip C: shows the locations of sequence variants that are likely to affect the function of a protein. These include those that create in-frame stop codons,
frameshift variants (indels), splice site variants and those that disrupt the gene’s normal stop codon.
The vertical lines in the upper part of the strip represent homozygous
variants, while those in the lower part of the strip correspond to heterozygous variants.
- Strip D: shows variants that alter the amino acid sequence of a protein, as well as those close to the start codon which might affect protein translation. These
variants have a intermediate (and indeterminate) risk of altering the protein’s function.
The vertical lines in the upper part of the strip represent homozygous variants, while those in the lower part of the strip correspond to heterozygous variants.
- Strip E: shows sequence variants that are less likely to affect a protein’s function, since they are either in an intron or do not alter the amino acid sequence.
Again, the vertical lines in the upper part of the strip represent homozygous variants, while those in the lower part of the strip correspond to heterozygous variants.
Selecting a genomic region to view
The Region view options panel contains the controls that allow different genomic regions to be viewed (Figure 4). The
Chromosome list (Figure 4, labelled A) is used to select the chromosome to view. A chromosomal region may be selected by entering the region’s
coordinates, by entering the gene names of the genes flanking the region, or by mouse-clicking on the graphical view in the upper panel.
- To select a region by entering its coordinates, select the “Physical position” option in the Search method list (Figure 4, labelled B)
and enter the basepair coordinates of the region in the Start and End boxes (Figure 4, labelled C and D). Next,
press the Go to button; this should place two vertical black lines on the graphical data view, identifying the selected region. Finally, if
the selected region is correct, press the Zoom button to view the region.
- To select a region using names of the flanking genes, select “Gene name” from the Search method list and enter the names
of the genes at the region’s start and end points in the Start and End boxes (Figure 4 C and D). (To view a single
gene, enter the same name in both text boxes). Next, press Go to (Figure 4, labelled E); this should place two vertical black lines on the graphical
data view, identifying the selected region. Finally, if the selected region is correct, press Zoom (Figure 4, labelled F) to view the region.
- To select a region by mouse-clicking on the graphical view panel, select “Manual” under Search method (Figure 4, labelled B),
right mouse-click on the graphical view panel at the end of the region, and then left mouse-click at the start of the region. Finally, if the selected region is correct,
press the Zoom button (Figure 4, labelled F) to view the region.
Pressing the < and > buttons (Figure 4 F) moves the selected region to the left or right of its current position
by 90% of its width, so that 10% of the previous view is retained.
The Region view options panel also contains the Show options list (Figure 4, labelled G). If the “Graphic” option
is selected from this list, the sequence variants graphical view shown above is displayed. Alternatively, the “Read depth” option will invoke a histogram display of the exonic
read depths. See Viewing the exonic read depth data and Figure 15.
Adjusting the sequence variant filtering parameters
The Read depth options panel allows the sequence variants to be filtered by adjusting the minimum read depth and minor allele cut-off parameters, which are used to
genotype a sequence variant. When either of these parameters is altered, the graphical display in the upper panel is updated, allowing the effect of each change to be seen. To adjust the minimum
read depth at which a sequence variant is called, select a new value from the Minimum read depth list (Figure 5, A), while to change the minor allele frequency
cut-off value, select the appropriate value from the Minor allele cut off list (Figure 5, B).
Changing either of these values causes the graphical panel to update both the view of sequence variants that pass the selected parameters and the list of exons whose read depth is
above the new minimum read depth.
Whether an exon passes the minimum read depth depends on the minimum read depth parameter used to screen the sequence variants and on the exon read depth cut-off.
This parameter can be set to 5%, 10% or 50% using the Exon read depth cut off list (Figure 5, labelled D). If set to 5%, then for an exon to pass, 95% of all
its protein-coding positions must be covered by a read depth equal to or greater than the minimum read depth value used to filter the sequence variants. (Similarly, if the value is set
to 10% or 50% then, 90% or 50% of the protein coding positions in an exon must have a read depth equal or greater than the minimum read depth parameter.)
Exons which fail this test are drawn as red rectangles in the strip labelled B in Figure 3, while those that pass the cut-off are drawn in grey.
The effect of adjusting the sequence variant cut-off parameters
Figure 6A shows the graphical display of a heterozygous (non-autozygous) region, viewed with the default cut-off settings (note values underlined by the blue lines). If the cut-off
parameters are now increased (Figure 6B, adjusted values underlined by blue lines) the proportion of sequence variants and exons that fall below the new cut-offs increases.
This is seen as an increase in the number of red rectangles in the exon read depth strip (Figure 3, A) and a decrease in the number of vertical lines in the pink, blue and green
sequence variant strips (Figure 3 B, C, D).
Spurious heterozygous sequence variants may arise due to basecalling errors at positions that are actually homozygous either for the reference allele or a variant allele.
As the minor allele frequency cut-off is increased, heterozygous sequence variants will progressively decrease in number and will convert to homozygous, either for the reference
sequence or for a variant allele. Homozygous wildtype positions are not displayed, but homozygous variant positions are. Therefore as the Minor allele cut off
value is increased, the number of heterozygous variants decreases, while the number of homozygous variants may increase. The red circles in Figure 6 highlight a variant that changes
from heterozygous to homozygous as the cut-off values change.
The effect of changing the Exon read depth cut off from 5% (Figure 6A) to 50% can be seen in Figure 6C. Since only 50% (rather than 95% in Figure 6A)
of the protein-coding positions in an exon must now be above the cut-off for a exon to pass, there are fewer red rectangles in the exon read depth strip (Figure 3, A) than in Figure 6A and 6B.
Adjusting the read depth and allele frequency cut-offs using the “Variant reads depths” window
In addition to the above, it is possible to interactively modify the read depth and allele frequency parameters using a second graphical interface, the
Variant reads depths window. In this, sequence variants are plotted according to the read depths of their reference and variant alleles (Figures 7 to 10).
The window is displayed by pressing the Read depth options → Graphic button (Figure 5, labelled C).
The reference allele’s read depth (Y-axis) is plotted against the variant allele’s read depth, for each single nucleotide sequence variant. Differently coloured regions
identify the sequence variants which are homozygous for the variant allele (green), heterozygous (white) and homozygous for the reference allele (pink).
The blue region corresponds to positions at which the minor allele read depth was below 10% and so the position was not called. The smaller red triangle close to the origin of the graph
shows the region where the read depth is too low to call a position. If the sequence variants were previously filtered by
AgileKnownSNPFilter, each variant is drawn as a green dot (unknown variant) or a black dot (present in the 1000 Genomes data set). (If not previously
filtered, all the variants will be drawn as green dots.) The black diagonal from bottom left to top right shows the line around which heterozygous sequence variants should fall.
It is possible to display only known (Figure 8A) or only unknown (Figure 8B) sequence variants by ticking the corresponding box in the Options panel.
When only known SNPs are displayed (Figure 8A), it can be seen that most of the variants lie either close to the X-axis (homozygous for a variant allele) or along the black diagonal (heterozygous).
In contrast, when only unknown SNPs are displayed, it may be noted that the distribution is less well defined, with a large proportion of apparent variants lying closer to the Y-axis.
This suggests that most of these novel variants are actually false positives.
The read depth and allele frequency ratio cut-off values can be interactively adjusted using the Minor allele cut off and
Minimum read depth boxes on the Options panel. When these parameters are changed, the graphics both in this window and the
Data view window are updated. Figure 9 shows the effect of changing these values on the extent of the coloured regions of the graph. As the minor allele frequency value
is increased, the areas of the pink and blue sectors (encompassing homozygous variants) increases, while the white, heterozygous region shrinks. Similarly, as the minimum read depth cut-off
increases, the size of the red triangle in the bottom left corner of the graph increases. Since most known SNP variants are probably true positives, it is better to use such known variants
as a guide for adjusting the minor allele frequency, so that the white region is as small as possible, while still including most of the variants close to the diagonal line.
Viewing an autozygous region
In the figures above, a heterozygous region is shown; for such a region, it is often not possible to know whether a detected variant is a true or a false positive. In contrast, the
remaining figures below relate to an autozygous region, within which it is reasonable to believe that all heterozygous variants represent false positives.
Viewing different classes of filtered SNPs
If the sequence variants have been filtered by AgileKnownSNPFilter choices are available to view sequence variants that have
a RS number, sequence variants that are in the 1000 Genomes Project data set but have no RS number, previously unknown sequence variants, or all sequence variants.
Figure 10 demonstrates the effect of viewing each of these classes of sequence variant, across an autozygous region (on Chromosome 20, in the PXDN data set).
Since this region is autozygous, all apparently heterozygous variants are false positives.
The Only Unknown SNPs, Only SNPs with a RS number, Only SNPs from 1000 genomes without a RS number and
All SNPs options on the SNP status panel allow each class of filtered sequence variant to be displayed (Figure 10). When all sequence
variants are displayed (default setting), both homozygous and heterozygous sequence variants appear to be present (Figure 10A). However, since this region is autozygous, it can be assumed
that the heterozygous variants are actually false positives. Furthermore, when only unknown sequence variants are selected for display (Figure 10B), it can be seen that only heterozygous SNPs
remain shown, suggesting that most of these variants are again false positives. Conversely, when only sequence variants with an RS number are displayed (Figure 10C) only homozygous variants
are now visible, suggesting that most of these variants are true positives.
Finally, selecting variants that have been seen in the 1000 Genomes Project but do not have an RS number leaves only two variants, one of which is probably a false positive.
This result is interesting, given that 55% (296,456 / 538,332) of the sequence variants used by AgileKnownSNPFilter
to filter the sequence variants fall into this category.
Viewing sequence variants by location relative to genomic features
The location of a sequence variant within a functional feature of a gene (exon, intron, splice site or Kozak consensus site) can be a strong predictor of the variant’s severity.
Therefore, the program can display the set of variants identified in each of these locations.
The Splice site, Exonic, Intronic and Kozak consensus site options on the
Variant location and type panel allow sequence variants to be displayed depending on their location within one of these elements (Figure 11).
By default, variants in the exons, splice sites and Kozak consensus sequence are displayed. However, it is possible to display sequence variants located within any combination of gene
features. For example, Figure 11A shows variants from any of the features, whereas in 11B to 11E those from each of the categories in turn are shown.
Since variants within splice site consensus sequences probably affect mRNA splicing, they are always displayed in the pink strip of the display. Conversely, intronic variants
outside splice sites are more likely to be benign, and so are always placed in the lower green strip. Since the effect of a Kozak site variant cannot easily be predicted, these are
displayed in the blue strip. Exonic variants may be placed in any of the three strips, with silent changes placed in the green strip. Those that disrupt the protein sequence (frame
shifts and those that create or destroy stop codons) are placed in the upper pink strip, with the remaining (missense) variants displayed in the blue strip.
Displaying sequence variants based on their possible severity
Since sequence variants that alter or disrupt a protein sequence are the most likely to be pathogenic, the displayed variants can be filtered according to simple categories that may
influence severity of impact (Figure 12).
While it is possible to display all the sequence variants identified in a region (Figure 12A) it is also possible to select variants based on their possible severity using the
Indels, SNPs, All variants and Variant severity options on the
Variant location and type panel. By default, the All variants option is selected (Figure 12A). If the
Indels option is selected, only insertions or deletions are shown (Figure 12B), with intronic indels always appearing in the lower green strip and exonic and splice
site indels appearing in the upper red strip, since they are very likely to disrupt protein structure and function. Given that the region shown in Figure 12 is autozygous, so that all heterozygous
variants are likely to be false positives, it is noteworthy that all the indels in the region are heterozygous and so probably not genuine.
Selection of the SNPs options displays the single base change variants, which since they vastly outnumber the other variant classes, creates a display very
similar to the default All variants option.
When using the Variant severity option, the severity of each sequence variant is calculated using a matrix derived from the scoring system used by the BLASTP
alignment algorithm to align protein sequences. The matrix scores how likely it is, during evolution, that one amino acid will be substituted, and how often a particular substitution occurs.
A conservative change like alanine to glycine scores 1, whereas a tryptophan to cysteine change scores 25. This scoring system is simplistic and does not take into account any structural
information. For example, glutamate to aspartate scores 4, suggesting a modest effect, but if an individual glutamate is used to bind a zinc atom, this substitution might
nonetheless disrupt an enzymatic activity. Despite these restrictions, since the BLASTP scoring system is easily performed, it can be useful for quick screening of variants during a
preliminary analysis. The cut-off value for this function is set using the options list to the right of the Variant severity option.
Pressing the Matrix button allows the scoring matrix to be saved to disk as a web page.
Selecting the Variant severity option and setting the cut-off value to 2 causes the variants in the green strip to be hidden, along with a small number of
variants in the blue strip (Figure 12D). However, setting the cut-off value to 17 causes the majority of sequence variants to be discarded; furthermore, those that remain are all
heterozygous, suggesting that they are false positives (Figure 12E).
Exporting filtered sequence variant data
It is possible to export the sequence variants, using the same criteria as that used to display them in the graphical display, by using the setting in the
Export data options (Figure 13). For instance to export sequence variants that have not been found by the 1000 Genome project and do not occur in intronic sequence,
select the Only Unknown SNPs option in the SNP status panel (Figure 10) and select the Splice site,
Exonic, Kozak consensus site and All variants options on the Variant location and type
panel.
Once the various parameters have been set, it is possible to export the sequence variants that meet these cut off values by pressing the Export button
in the Export data options panel (Figure 13). If there is no positional information available, it is possible to export sequence variants from the whole genome by
selecting the Whole genome option in the Export data options panel (Figure 13). Otherwise it is possible to export sequence variants
from either the currently selected chromosome or chromosomal region, by selecting the appropriate Current chromosome or Selected region
option in the Export data options panel (Figure 13). If the disease causing variant is believed to be homozygous, it is possible to export only homozygous variants
by ticking the Homozygous variants boxes. Similarly, if the patient is believed to be a compound heterozygote, ticking the
Heterozygous variants from genes with 2 or more heterozygous variants will export heterozygous variants only if two or more heterozygous variants are present in the
gene. Selecting both of the options will export both sets of variants. If the condition is believed to be dominantly inherited then do not select either of these options.
Considerations when setting the selecting parameters
The majority of unknown sequence variants appear to be false positives, while the majority of sequence variants with a RS number appear to be true positives. Similarly a severe sequence variant
is more likely to be a false positive than a benign sequence variant. Therefore, the cut off parameters should be set so that they discard as many unknown variants as possible while keeping as
many variants with a RS number. Similarly, the parameters should discard as many severe variants as possible while lossing as few benign variants as possible. When the cut off parameters have been
set, only export unknown variants.
This guide shows data for two regions from a single sequencing run on a DNA sample selectively enriched for specific regions. Neither of these regions contained the disease causing variants, however
the mutation is present in the PXDN data file and is located in a small autozygous region at the start of chromosome 2 (Figure 14). Figure 14A shows the display with the default cut off parameters, while
Figure 14B show the display after the parameters have been adjusted to display unknown, coding variants tha affect the proteins's amino acid sequence. Initially, the variant was identified as heterozygous
due to the presence of a small proportion of erroneous base reads at this position.
Viewing the exonic read depth data
By selecting the 'Read depth' option from the Show list in the Region view option panel (Figure 14) the read depth data for each exon is shown
as a histogram. As well as changing the graphical display, selecting this option removes the Minor allele cut off list and the Graphic button
from the Read depth options panel, replaying them with the Set highest graph vale option. Also selecting this option dispbles the
SNP status, Variant location and type and Export data options panels. As before it is possible to select different
genomic regions using the options in the region view options.
When the exonic read depth for an exon is above the minimum read depth value selected in the Minimum read depth list, in the Read depth
options panel, it is drawn as a grey rectangle, otherwise it is draw in red. The affect of changing this value can be seen by comparing Figure 16A (minimum read depth = 5) to Figure 16B
(minimum read depth = 100). The maximum read depth value shown by the graph is set using the Minimum read depth list in the Read depth options
panel. The affect of changing this value from 500 to 250 is shown between Figures 16B and 16C. The text to the right of the Minimum read depth list shows the read depth
interval of the horizontal lines in the graph.