Insilicase
deus ex computa

 

Variant file format

File format description

Introduction

The data files used by programs in the Agile suite to store the variant and read depth data are simple tab-delimited plain text. Data for each variant is located on at least two lines, with the first line containing data on the sequence variant and the subsequent lines containing data on the effect the variant has on the gene's transcripts (one line per transcript). Single base changes and deletions use a common line format; insertion variants use a different format. Lines describing a single base variant start with an 'S' and lines containing data on inserts begin with an 'I'. Each format is described below:
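The layout described above (one variant line followed by its transcript lines) can be read with a short loop. The Python sketch below is illustrative only and is not part of the Agile suite. The function name `group_variant_records` is hypothetical, and the sketch assumes that the count in the second cell of each 'S' or 'I' line (item B in the descriptions below) gives the exact number of transcript lines that follow it. The cells of the transcript lines are kept as raw strings, because no assumption is made about how they are laid out.

```python
from typing import Iterable, Iterator, List, Tuple

# (cells of the variant line, list of cells for each transcript line)
Record = Tuple[List[str], List[List[str]]]


def group_variant_records(lines: Iterable[str]) -> Iterator[Record]:
    """Group tab-delimited lines into one record per variant.

    Assumption: the second cell of every 'S' or 'I' line holds the number
    of transcript lines that belong to that variant.
    """
    # Split every non-blank line into its tab-delimited cells.
    rows = (line.rstrip("\r\n").split("\t") for line in lines if line.strip())
    for fields in rows:
        if fields[0] not in ("S", "I"):
            raise ValueError(f"expected a line starting with S or I, got {fields[0]!r}")
        transcript_rows = []
        for _ in range(int(fields[1])):
            row = next(rows, None)  # consume the next line from the same iterator
            if row is None:
                raise ValueError("file ended before all transcript lines were read")
            transcript_rows.append(row)
        yield fields, transcript_rows
```

A file object can be passed directly, for example `for variant, transcripts in group_variant_records(open(path)):`, where `path` is a placeholder for the location of a variant file.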

Single base variants

When opened in a spreadsheet program, the data for a single base variant occupies a number of cells, labelled A to U in Figure 1. The data used to describe single base sequence variants are identified by an 'S' in the first cell of the first data line of a sequence variant (see A in Figure 1A).


Figure 1: The file format for a single base sequence variant.

  • A: This cell contains the row's data format type, with 'S' referring to a single base variant format and 'I' indicating an insertion variant. All data in this format starts with an 'S'.
  • B: This value identifies the number of novel transcript variants created. For example, suppose a gene has three transcripts, where transcript A has a different start sequence to the other two, transcript C has a different end sequence to the others, and the variant is present in all three transcripts. The annotation for the variant in transcripts B and C will then be the same, while it will differ for transcript A. There will therefore be two novel transcript variants, and so two transcript lines of data after this line (Figure 1B shows a variant with two transcript data lines).
  • C: This identifies the chromosome's number.
  • D: If this cell contains the word 'TRUE' the gene is on the forward strand, while 'FALSE' indicates the gene is on the reverse strand.
  • E: This number indicates if the variant is a substitution (0) or deletion (1).
  • F: This is the variant's chromosomal position.
  • G: The name of the gene linked to the variant is in this cell.
  • H: This value identifies the variant nucleotide, with possible values of A, C, G or T for substitutions and B, D, H or U for del A, del C, del G or del T respectively. Figure 1C shows the data for a deletion (del C).
  • I: The reference sequence nucleotide.
  • J, K, L and M: The reads mapping to each nucleotide in the order A, C, G and T.
  • N: The number of reads suggesting a deletion.
  • O: The variant's status. If the variant has an RS number it is shown; otherwise the value can be 'U' (not found in the 1000 Genomes Project), 'T' (found in the 1000 Genomes Project, but has no RS number) or 'N' (the data has not been filtered by AgileKnownSNPFilter).
  • P: The number of the transcript variants, 0 = first variant, 1 = 2nd variant etc. (see item B for details.)
  • Q: Type and position of the mutation with reference to the protein's sequence. WT = wild type, In = intronic, Sp = splice site, KS = Kozak consensus sequence. If the variant changes the amino acid sequence the substitution is shown, e.g. I>G is an isoleucine to glycine substitution and I>FS is a frameshift mutation in a codon coding for isoleucine.
  • R: A list of the CCDS transcript IDs.
  • S: Number indicating the location of the variant, with possible values of 0 (Intron (5')), 1 (Intron (3')), 2 (Splice site (3')), 3 (Splice site (5')), 4 (Exon) and 5 (Kozak site).
  • T: The variant's distance from the start codon for transcripts on the chromosome's forward strand, or from the stop codon for transcripts on the chromosome's reverse strand. If the variant is not in the coding sequence this value is set to -1.
  • U: The number of amino acids between the affected codon and the start codon for transcripts on the chromosome's forward strand, or the stop codon for transcripts on the chromosome's reverse strand. If the variant is not in the coding sequence this value is set to -1.
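The coded values in items E, H, O and S above lend themselves to simple lookup tables. The Python sketch below is a minimal illustration, not code from the Agile suite: the dictionary and function names are hypothetical, and it assumes that any status value other than 'U', 'T' or 'N' is an RS number. The value `rs334` in the usage note is only an example.

```python
# Item E: variant type (2 is only used by the insert format described later).
VARIANT_TYPES = {0: "substitution", 1: "deletion", 2: "insertion"}

# Item H: variant nucleotide; B, D, H and U encode single base deletions.
VARIANT_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "B": "del A", "D": "del C", "H": "del G", "U": "del T",
}

# Item S: location of the variant relative to the transcript.
LOCATIONS = {
    0: "Intron (5')", 1: "Intron (3')",
    2: "Splice site (3')", 3: "Splice site (5')",
    4: "Exon", 5: "Kozak site",
}

# Item O: status flags used when the variant has no RS number.
STATUS_FLAGS = {
    "U": "not found in the 1000 Genomes Project",
    "T": "found in the 1000 Genomes Project, no RS number",
    "N": "not filtered by AgileKnownSNPFilter",
}


def describe_status(cell: str) -> str:
    """Assumption: anything that is not U, T or N is an RS number."""
    return STATUS_FLAGS.get(cell, f"RS number {cell}")


def describe_change(type_code: str, variant_code: str, reference: str) -> str:
    """Turn the raw cells E, H and I into a readable description."""
    kind = VARIANT_TYPES[int(type_code)]
    change = VARIANT_NUCLEOTIDES[variant_code]
    return f"{kind}: reference {reference}, variant {change}"
```

For the deletion shown in Figure 1C, `describe_change("1", "D", "C")` gives `"deletion: reference C, variant del C"`, and `describe_status("rs334")` gives `"RS number rs334"`.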

Insert variants

When opened in a spreadsheet program, the data for a DNA insert variant occupies a number of cells, labelled A to U in Figure 2. The data used to describe insert variants are identified by an 'I' in the first cell of the first data line of a sequence variant (see A in Figure 2).


Figure 2: The file format for an insert sequence variant.

  • A: This cell contains the row's data format type, with 'S' referring to a single base variant format and 'I' indicating an insertion variant. Data for insert variants always starts with an 'I'.
  • B: This identifies the number of novel transcript variants created. For example, suppose a gene has three transcripts, where transcript A has a different start sequence to the other two, transcript C has a different end sequence to the others, and the variant is present in all the transcripts. The annotation for the variant in transcripts B and C will then be the same, while it will differ for transcript A. There will therefore be two novel transcript variants, and so two transcript lines of data after this line (Figure 1B shows a variant with two transcript data lines).
  • C: This identifies the chromosome's number.
  • D: If this cell contains the word 'TRUE' the gene is on the forward strand while 'FALSE' indicates the gene is on the reverse strand.
  • E: This number indicates the variant type. Insertions have a value of 2, so this format always has a value of 2.
  • F: This is the variant's chromosomal position.
  • G: This cell contains the name of the gene linked to the variant.
  • H: This lists the different inserts identified at this position and the number of reads each insert was found in. In Figure 2 the value is C:23-N:1 which indicates that 23 reads suggested that a C was inserted and a single read suggested a single base was inserted, but the quality score was too low to call the nucleotide, which was set to N.
  • I, J, K and L: The read depths of each nucleotide in the order A, C, G, T.
  • M: The number of reads suggesting a deletion at this location.
  • N: The number of reads suggesting an insert (of any sequence) at this position.
  • O: If 'TRUE' the variant is homozygous, while if 'FALSE' the variant is heterozygous. These values are used to set the initial state of the variant and are overridden when the read depth and/or allele frequency parameters are changed.
  • P: The variant's status. If the variant has an RS number it is shown; otherwise the value can be 'U' (not found in the 1000 Genomes Project), 'T' (found in the 1000 Genomes Project, but has no RS number) or 'N' (the data has not been filtered by AgileKnownSNPFilter).
  • Q: The number of the transcript variants, 0 = first variant, 1 = 2nd variant etc. (see item B for details.)
  • R: A list of the CCDS transcript IDs.
  • S: Number indicating the location of the variant, with possible values of 0 (Intron (5')), 1 (Intron (3')), 2 (Splice site (3')), 3 (Splice site (5')), 4 (Exon) and 5 (Kozak site).
  • T: The variant's distance from the start codon for transcripts on the chromosome's forward strand, or from the stop codon for transcripts on the chromosome's reverse strand. If the variant is not in the coding sequence this value is set to -1.
  • U: The number of amino acids between the affected codon and the start codon for transcripts on the chromosome's forward strand, or the stop codon for transcripts on the chromosome's reverse strand. If the variant is not in the coding sequence this value is set to -1.
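The insert list in item H (for example C:23-N:1) can be split into per-insert read counts. The Python sketch below is a guess at the syntax based on that single example: it assumes that entries are separated by '-' and that each entry is an inserted sequence and a read count separated by ':'. The function name `parse_insert_counts` is hypothetical.

```python
from typing import Dict


def parse_insert_counts(cell: str) -> Dict[str, int]:
    """Map each inserted sequence to the number of reads supporting it.

    Assumed syntax (inferred from the example C:23-N:1):
    entries separated by '-', sequence and read count separated by ':'.
    """
    counts: Dict[str, int] = {}
    for entry in cell.split("-"):
        sequence, _, reads = entry.partition(":")
        counts[sequence] = int(reads)
    return counts
```

With the value from Figure 2, `parse_insert_counts("C:23-N:1")` returns `{"C": 23, "N": 1}`: 23 reads supporting an inserted C and one read with an uncalled base.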

Read depth file format

The read depth file format is shown in Table 1 below. The file is a tab-delimited plain text file, with each line containing the read depth information for a single exon. When opened in a spreadsheet application, the first column identifies the chromosome that contains the gene named in the second column. The third column identifies the exon, with numbering starting at 0 rather than 1. Exons are numbered from the p telomere end of the gene, so the exons of genes encoded on the reverse strand of a chromosome are numbered in the opposite direction to that expected. The remaining three columns contain the read depth values that 95%, 90% and 50% of the positions in each exon meet or exceed. For example, row one of Table 1 relates to the first exon (as judged by its closeness to the p telomere) of SAMD11: 95% of the coding positions have a read depth of 62 reads or more, 90% of the positions have a read depth of 66 reads or more and 50% of the positions have a read depth of 78 reads or more. The last value is equivalent to the median read depth of the coding sequences of the exon. If a gene has no reads mapped to its exons, the gene will not appear in this list and all of its exon read depth values will be set to 0.

Chromosome  Gene name  Exon number  95% read depth  90% read depth  50% read depth
1           SAMD11     0            62              66              78
1           SAMD11     1            3               3               11
1           SAMD11     2            13              14              17
1           SAMD11     3            6               8               35
1           SAMD11     4            33              34              48
1           SAMD11     5            6               6               10
1           SAMD11     6            5               6               8
1           SAMD11     7            0               0               0
1           SAMD11     8            0               0               0
1           SAMD11     9            0               0               1
1           SAMD11     10           15              16              21
1           SAMD11     11           7               8               11
1           SAMD11     12           3               3               18
1           NOC2L      0            0               1               5
1           NOC2L      1            311             330             400
1           NOC2L      2            238             263             422
1           NOC2L      3            42              48              57
1           NOC2L      4            32              33              39
1           NOC2L      5            9               9               27
1           NOC2L      6            132             144             184
1           NOC2L      7            14              18              26
1           NOC2L      8            44              56              102
1           NOC2L      9            32              40              78
1           NOC2L      10           52              54              110
1           NOC2L      11           9               12              15
1           NOC2L      12           37              37              49
1           NOC2L      13           107             126             319
1           NOC2L      14           114             118             161
1           NOC2L      15           367             414             542
1           NOC2L      16           19              31              71
1           NOC2L      17           9               12              21
1           NOC2L      18           0               0               0

Table 1
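A line of the read depth file can be turned into a small record, which makes it easy to pick out poorly covered exons. The Python sketch below is illustrative and not part of the Agile suite. The names `ExonDepth`, `parse_read_depth_line` and `poorly_covered` are hypothetical, the file is assumed to have six tab-delimited cells per line and no header row, and the default threshold of 10 reads is an arbitrary example value.

```python
from typing import Iterable, List, NamedTuple


class ExonDepth(NamedTuple):
    chromosome: str
    gene: str
    exon: int      # numbered from 0, starting at the p telomere end
    depth_95: int  # depth met or exceeded by 95% of positions
    depth_90: int  # depth met or exceeded by 90% of positions
    depth_50: int  # depth met or exceeded by 50% of positions (median)


def parse_read_depth_line(line: str) -> ExonDepth:
    """Parse one tab-delimited line (assumed: six cells, no header row)."""
    chrom, gene, exon, d95, d90, d50 = line.rstrip("\r\n").split("\t")
    return ExonDepth(chrom, gene, int(exon), int(d95), int(d90), int(d50))


def poorly_covered(records: Iterable[ExonDepth], min_depth: int = 10) -> List[ExonDepth]:
    """Return exons where fewer than 95% of positions reach min_depth reads."""
    return [r for r in records if r.depth_95 < min_depth]
```

For the first row of Table 1, `parse_read_depth_line("1\tSAMD11\t0\t62\t66\t78")` yields a record with `depth_95 == 62` and `depth_50 == 78`, so that exon would not be reported by `poorly_covered` at the default threshold.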



Copyright © 2011 Insilicase.

 

   