AgileSMPoint guide

User guide

Introduction

AgileSMPoint identifies somatic sequence variants occurring at specific positions in unaligned next generation sequence data. Somatic mutations at specific nucleotide positions have been found to lead to or play a role in the progression of many diseases most notably in cancers. For instance mutations at codons 12 and 13 in KRAS are often associated with the development of many types of cancer. The detection of somatic mutations is hindered by the presence of varying amounts of normal tissue in the DNA sample. Consequently, important somatic mutations may occur at levels of only 1 or 2% in the analysed DNA sample. To aid the detection of somatic mutations we have developed AgileSMPoint a program which screens next generation sequence data derived from PCR amplicons that contain the mutation hotspots of interest. AgileSMPoint does not require the data to be prealigned and is able to import sequence data from either *.fasta or *.fastq files. The sequence data may contain reads from a large number of different amplicons each containing a number of different hotspots.

Importing sequence read data and somatic mutation hotspot location data

Figure 1: AgileSMPoint user interface

The file containing the locations of somatic mutational hotspots of interest is selected by pressing the Target button and navigating to the target file. The structure of this file is described below. Next select the folder that contains the sequence read data in either *.fasta or *.fastq file format by pressing the Folder button. Each data file may contain data from a number of different amplicons, but each amplicon should have been amplified from a single sample. The analysis is started by pressing the Analyse button button and entering the name of the export file. AgileSMPoint will read and analyse the data in each file in turn and export the results to two results files, one contains the mutation report and the other contains the raw data used to create the report.

Imported files formats

AgileSMPoint can import sequence data in either *.fasta or *.fastq files. The information concerning the location of the somatic mutational hotspots is imported from a *.fasta file, as described below:

Structure of the file describing the somatic mutation hotspots.

Figure 2: Each target has a description line and a sequence line as described below.

Sequence line:

This line contains the sequence flanking the positions that are to be analysed (positions of interest). The Positions of interested are identified by been written as upper case 'N's (highlighted in red above). If it is possible for the sequence reads to originate from an homologous sequence such as a pseudogene, it is possible to ignore the pseudogene reads if there reads differ at positions close to the positions of interest. To mark a position as one that can distinguish between pseudogene and transcribed sequence, enter the base of transcribed sequence at the divergent position in lower case (highlighted in yellow above). The rest of the sequence must be upper case and have at least 20 bases 5' of the first 'N' and 3' of the last 'N'.

Description line:

This line starts with a > symbol and then has four fields. The first is just a label that identifies that particular set of results from the others in the exported data file. Then separated by a tab character is a list of the positions of interest in a reference file i.e. their position in a cDNA sequence. There should be one number for each 'N' in the sequence, with each separated by a comma. This value is only used by the program when it is annotating the results for export in the output file. After another tab character, there is a list of the wild type nucleotides at the positions of interest, again one base for each 'N' in the sequence. Each base is separated by a comma. Finally, after another tab character is a 'SET' name, this is used to group target sequences together that are very close to each other to occur on the same sequence read . Normally if a read is found to match a target, the program does not attempt to match it to any other target since it is assumed that they are mutually exclusive. However, if two targets are so close enough that it is possible for a sequence read to contain information on both targets the program will attempt to match a sequence read to all the targets in a 'Set'(i.e. they have the same 'Set' name) irrespective of whether it has all ready been matched a target in the same 'Set'. Typically, a set of mutational hotspots are separated in to different targets if the distance between the first 'N' and the last 'N' in the sequence is greater than 30 base pairs. if the distance between the 'N's becomes to large it becomes increasingly difficult to map all the positions.

Exported data file formats

AgileFastaVariantFinderScreenshot raw data file format

Figure 3: The somatic mutation data file format.

The exported data is saved to two files with one containing a variant report while the second contains the raw data for each of the mutational hotspots. The report file lists the positions which have more than 1% of the reads mapping to a non-wild type base (Figure 3). A variant may be annotated with an 'N' (i.e. 223A>N), this suggests that a high number of uncalled bases were found at this position. This variant could be ignored or the data prefiltered using AgileQualityFilter to reduce the proportion of low quality reads. If the 'N' variants persist or there are a large number of them it could suggest the quality of the sequencing is low and should be repeated. This file lists the variants by the name of the sequence read data file, then by each targets name. If no variants are found the export data contains the phrase 'No Mutations'. If variants are found then each is annotated relative to the reference number given in the 'Description line' of the target file (Figure 2) and gives the number of variant reads, total number of reads and the percentage of variant reads to total read depth. If a sequence read appears to contain a indel, AgileSMPoint will note the possible indel's position and sequence and then if more than 1% of reads are found to contain the same indel, it is annotated and exported in the results file. Reads with indels are not used to detect single base somatic mutations, even so samples with an indel are also associted with a number of false positive variants that occur at a low frequency. An example of this type of output file can be found here.

Raw data file

The other file contains the raw data which enumerates the counts for each nucleotide at each position of interest. It also lists the occurrence of indels across the positions of interest and shows the number of times a sequence read contained a deletion, insertion or neither (correct length). While this file is a tab-delimited text it is best viewed in a spread sheet program such as Excel. An example of this type of output file can be found here.

Figure 4: The raw data file format.