Revise Simple Tandem repeat Error Reads
Revise Simple Tandem repeat Error Reads
ReviSTER
ReviSTER

 

Introduction

ReviSTER (Revise Simple Tandem repeat Error Reads) is an automated pipeline using a local mapping reference reconstruction method to revise mismapped (mapped to incorrect position) or partially misaligned (mapped to correct position but one of ends misaligned) reads at STR (Simple Tandem Repeat) loci. It takes FASTQ-formatted files, a reference sequence file and a list file containing STR locations as inputs and utilizes BWA as an initial mapping program.

License: GPL V3 (http://www.gnu.org/licenses/gpl-3.0.html)

 

 

How to use ReviSTER

Table of Contents

Introduction. 1

Methods used in ReviSTER.. 1

Getting started. 3

Searching STR loci 3

1. Using TRF. 3

2. Searching repeats of short motifs. 3

Run ReviSTER.. 3

1. Input files. 4

2. Other options. 4

3. Output file. 4

4. Test with simulated data. 4

 

Methods used in ReviSTER

ReviSTER utilizes three programs including BWA, BLAT and SAMTools installed in the user's environment. It takes FASTQ format files, a reference sequence file and a list file containing STR locations as inputs. If the list file is not submitted, it searches STR loci from the reference file. ReviSTER employs a novel approach called 'local mapping reference reconstruction method' to revise mismapping and misalignment of sequence reads derived from the STR loci.

 

 

The whole process to revise the read alignments is composed of six steps. The first step is to map reads to a reference sequence using BWA. The ReviSTER takes the '-n' option which is used for BWA mapping to record multiple mapping candidates for reads derived from repeat sequences. At the second step, BLAT is used to remap unmapped reads to temporary mapping reference sequences which are extracted from the original reference sequence only around the STR loci. Because BLAT generates many false alignments for a read, ReviSTER realigns them and choose the best alignment from several alignment candidates. The third step is a local assembly step using the reads mapped to each STR locus. It generates paths of reads overlapping at least 30 bases to each other, chooses a given number of paths corresponding to allele candidates, extracts sequences of the allele candidates and creates local mapping reference sequences containing the allele candidates. In this step, sequence reads containing more than one mismatch/INDEL bases or showing abnormal pair distances are saved in a separated file along with unmapped reads. At the fourth step, the reads saved in the separated file are mapped to the local mapping reference sequences by BWA (with the -n option). And at the fifth step, mapping positions of a read on the local mapping reference sequences are converted to the positions on the original reference. Then a mapping position with the most optimized pair distance and the lowest mismatch number is chosen among all mapping candidates identified at the first step and the fifth step for the read. The last step is to revise reads partially misaligned at the STR loci and it is an independent process from the previous steps. Some reads may have been incorrectly aligned to the STR loci containing long INDELs and not be revised by the previous steps. The reads are realigned to another read which has been mapped to the same STR locus and spanning long enough the flanking sequences of the locus

Getting started

ReviSTER is a PERL package and available at the Download page. The current release requires BWA, BLAT and SAMtools, and runs in the LINUX environment.

The package file is compressed. To decompress,

 

$ tar xzf revister.[VERSION].tar.gz

 

Searching STR loci

ReviSTER takes a list file containing STR locations as an input. If a user does not submit the file, it searches STR loci from the reference file and checks for uniqueness of flanking sequences. ReviSTER uses STR loci of which at least one flanking sequence is unique. For a human genome, we created a STR list file [Download].

 

Users can create their own list file. We provide two methods to search STR loci. The first method is to use TRF (Tandem Repeat Finder, http://tandem.bu.edu/trf/trf.html) which searches imperfect repeats of short motifs from reference sequences. The second method is to search perfect repeats of short motifs from reference sequences directly.

 

1. Using TRF

To convert the output of TRF to the input file of ReviSTER, we provide makeListFromTRF.pl.

 

$ trf  reference.fa 2 7 5 80 10 14 6 -h

$ [ReviSTER_path]/makeListFromTRF.pl -r  reference.fa  -t  reference.fa.2.7.5.80.10.14.6.dat  -o  str.tmp.lst

$ [ReviSTER_path]/checkFlankingUniq.pl -r  reference.fa  -l  str.tmp.lst  -o   str.new.lst

 

 

2. Searching repeats of short motifs

We also provide searchTandemRepeats.pl to search repeats of short motifs from reference sequences directly.

 

$ [ReviSTER_path]/searchTandemRepeats.pl  reference.fa  -o  str.tmp.lst

$ [ReviSTER_path]/checkFlankingUniq.pl -r  reference.fa  -l  str.tmp.lst  -o   str.new.lst

 

Run ReviSTER

 

$ [ReviSTER_path]/revister.pl  -l  str.lst  -r  reference.fa  -f  readPair_1.fastq -f  readPair_2.fastq  -o  myout

 

Usage: ./revister.pl -o <header of output>  -r <reference FASTA file>

 

  --- query sequences (choose only one among FASTQ and SAM/BAM) -----------

    -f <FASTQ file> [-f <paired FASTQ file (if available)> ]

    -s <SAM/BAM file including 'XA:' tags and unmapped reads 1> [-s <SAM/BAM file reads 2>]

  ---------------------------------------------------------------------

  --- other options (can be skipped) ----------------------------------

    -l <list file for tandem repeats in the reference sequence>

       : tab delimited format containing [ref. name] [starting pos.] [length].

         If this option is skipped, a new file containing tandem repeats in the reference will be generated.

    -p <ploidy of the target genome> (Default: 2)

    -n <maxinum number of mapping candidates for a read. Corresponding to -n of BWA> (Default: 10)

    -p1 <program for remapping of unmapped reads (1. BLAT(more accurate) or 2. BWA(faster))> (Default: 1)

    -GATK <full path of GenomeAnalysisTK.jar> (Default: '')

        If it is given, GATK will be used for the final local alignment instead of the native program,

        and the final file will be “<header of output>/<header of output>.final.bam” instead of a SAM file.

    -force (force to clean old data in the working directory)

 ---------------------------------------------------------------------

 

1. Input files

ReviSTER takes a STR list file (which requires '-l') and FASTQ format read files or a SAM/BAM file containing mapped reads (We recommend to submit FASTQ format files for paired-end reads). The STR list file has the following format.

 

[chromosome], [starting position], [length] in a tab-delimited format

 

Ex)

chr1                26454    12

chr1                28589    15

chr1                44836    32

chr1                722365  20

 

2. Other options

-p INT [1 ≤] : ploidy of the target genome; The copy number of a chromosome in the target genome.(Default : 2)

-n INT [1 ≤] : maxinum number of mapping candidates for a read. Corresponding to -n of BWA. (Default : 10)

-p1 INT [1 or 2] : program for remapping of unmapped reads (1. BLAT(more accurate) or 2. BWA(faster)) (Default: 1)

-GATK STR : full path of GenomeAnalysisTK.jar (Default: '')

    If it is given, GATK will be used for the final local alignment instead of the native program,

    and the final file will be “<header of output>/<header of output>.final.bam” instead of a SAM file.

-force : force to clean old data in the working directory

 

3. Output file

The output file of ReviSTER is a SAM format file named "<header of output>/<header of output>.final.sam".

 

4. Test with simulated data

The simulated data is available online [Download]. The data was created from human genome data with artificial STR sequences of 1~8-mer motifs and composed of six subsets. In each subset, N={1, 2, 5, 10, 15, 20} STR loci have same left flanking sequences while the right flanking sequences are unique for each locus.

 

 

Figure. Creating simulated data. (A) A reference sequence was created by inserting each STR locus for every 1,400 bases. While right flanking sequences of STR loci are unique, left flanking sequences of N={1, 2, 5, 10, 15, 20}  STR loci were replicated. (B) DNA fragments with 210 bases in length, which contain reference or non-reference alleles and completely cover a STR locus, were created for every 2 bases. While the DNA fragment at the most left side spanned left two bases of the right flanking sequence of the STR locus, the fragment at the most right side spanned right two bases of the left flanking sequence. (C) 100 base sequences at the both ends of the fragment sequences were used as paired end sequencing reads.

 

To test ReviSTER for a subset,

 

$ tar xzf revister.simulated.tar.gz

$ cd revister.simulated/repeat1

$ [ReviSTER_path]/revister  -l  test.repeat.lst  -f  test_1.fq  -f  test_2.fq   -r  test.ref.fa  -o test.realign