SNAP is a program that is part of a gene sequencing pipeline. It takes data from gene sequencing hardware, which consists of short chunks of DNA called reads (typically 70-300 base pairs long), and determines where, how well and how unambiguously they match a given reference genome. This is a computationally challenging problem because reference genomes are big (the human genome is over 3 billion base pairs long) and are often highly repetitive.

SNAP is 2-5x faster than commonly used aligners like BWA-mem2 and Bowtie2, and 20x to nearly 30x faster than Novoalign. When used with Haplotype Caller from the Genome Analysis Toolkit, SNAP produces better concordance with known-truth sets than other aligners for most of the Genome in a Bottle and Illumina Platinum genomes.

SNAP is also more full-featured than other aligners. In addition to taking FASTQ (unprocessed reads) as input, it accepts SAM and BAM (aligned reads). Other aligners produce unsorted SAM (or, in the case of Novoalign, unsorted BAM) output and require other tools to compress, sort, mark duplicates and index the final output file. SNAP does all of these tasks in a single tool, and is usually more than 10x faster than the standard samtools/Picard pipeline.

SNAP was developed by a team from Microsoft Research, the UC Berkeley AMP Lab, and UCSF. SNAP is available under an Apache 2 license at /amplab/snap. In addition, you can download binaries for Windows, Linux and OSX:

SNAP v2.0.1 for Windows (64-bit)
SNAP v2.0.1 for Linux (64-bit)
SNAP v2.0.0 for OSX (64-bit)

SNAP has one additional utility, the SNAPCommand program, which sends alignment jobs to SNAP when it is running in daemon mode:

SNAPCommand for Linux
SNAPCommand for Windows

Faster and More Accurate Sequence Alignment with SNAP. Matei Zaharia, William J. Bolosky, Kristal Curtis, Armando Fox, David Patterson, Scott Shenker, Ion Stoica, Richard M. Karp, and Taylor Sittler.
Fuzzy set intersection based paired-end short-read alignment. William J. Bolosky, Arun Subramaniyan, Matei Zaharia, Ravi Pandya, Taylor Sittler, and David Patterson.
ASHG 2014 SNAP Presentation, Ravi Pandya
SNAP Quick Start Guide

FAQ

What is sequence alignment, and why is it important?

As cheap DNA sequencing and a growing number of uses for sequence data increase the amount of sequence data available, there is a growing need for tools that can efficiently analyze large bodies of sequence data. With the cost of a whole-genome-sequenced (WGS) human genome below $1000, this technology is entering the realm of routine clinical practice. For example, more and more cancer patients are having their germline and tumor genomes sequenced. However, current high-throughput sequencing technologies produce large numbers of short (~100-250 base) reads from random locations in the genome. Putting these reads together into a coherent whole is a significant computational challenge, with current pipelines taking many hundreds of CPU-hours per genome. The first step of this process is aligning each read to a known reference genome, so that later stages of the pipeline can view all the DNA for a specific location in the reference at once.

SNAP leverages a combination of three insights: increasing read lengths, which allow for fast hash-based location of reads using larger "seed" sequences; increasing server memories, which allow trading memory to save CPU time (SNAP is designed for server machines with tens of gigabytes of RAM); and novel set-intersection and edit-distance algorithms, together with a pruning methodology, that allow SNAP to reject most candidate locations without fully scoring them, dramatically reducing the cost of local alignment checks. Please refer to the SNAP paper for details.

SNAP was written by skilled, professional computer systems programmers who have a deep understanding of the intricacies of computer architecture, which helps the code perform well independently of good algorithm design. In addition, SNAP includes code for sorting, duplicate marking and writing to BAM format. This code is typically about an order of magnitude faster than the typical samtools/Picard pipeline. In part this is because SNAP avoids repeatedly reading and writing data files between stages that are implemented as different executable binaries. It is also because the same care was put into these parts of the pipeline as into the alignment stage.
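The seed-and-verify idea mentioned in the FAQ (hash-based lookup of "seed" sequences, followed by an edit-distance check that gives up early on poor candidates) can be illustrated with a toy sketch. This is not SNAP's implementation: the function names, the seed length of 8 and the distance limit of 2 are assumptions chosen for illustration, and a real aligner uses much larger seeds, a compact in-memory index and far more careful scoring.

```python
from collections import defaultdict

SEED_LEN = 8  # illustrative value only; not taken from SNAP


def build_index(reference, seed_len=SEED_LEN):
    """Map every seed_len-long substring of the reference to its positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def bounded_edit_distance(a, b, limit):
    """Levenshtein distance of a and b, or None as soon as it must exceed limit."""
    if abs(len(a) - len(b)) > limit:
        return None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # match or substitute
        # The final distance can never be smaller than the smallest value
        # in any row, so a row entirely above the limit lets us stop early.
        if min(cur) > limit:
            return None
        prev = cur
    return prev[-1] if prev[-1] <= limit else None


def align(read, reference, index, seed_len=SEED_LEN, max_dist=2):
    """Return (start, distance) for the best candidate location, or None."""
    # Each seed hit votes for the read start position it implies.
    votes = defaultdict(int)
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        for hit in index.get(read[offset:offset + seed_len], ()):
            start = hit - offset
            if start >= 0:
                votes[start] += 1

    best = None
    # Check the most-voted candidates first and tighten the limit as we go,
    # so later candidates are rejected without being fully scored.
    for start in sorted(votes, key=votes.get, reverse=True):
        limit = max_dist if best is None else best[1] - 1
        if limit < 0:
            break
        # Simplification: compare against a window of the read's length,
        # which is exact for substitutions and approximate for indels.
        window = reference[start:start + len(read)]
        dist = bounded_edit_distance(read, window, limit)
        if dist is not None:
            best = (start, dist)
    return best
```

To try it, build the index once with `build_index(reference)` and call `align(read, reference, index)` for each read. The index costs extra memory but makes each seed lookup a single hash probe, which is the memory-for-CPU trade the text describes.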