I'm a PhD student at the EBC in Uppsala. I'm working with metagenomic data.
As the title implies, I would like to talk about the pre-processing of sequencing reads. I have been looking around the web and testing the current tools, and it just seems like no solution really cuts it. In particular, none of the code seems to contain tests. I have summarized my short review in a blog post.
What do you guys suggest or use in your lab? Or simply, what do you think about the topic?
asked 22 Oct '12, 15:49
I am uncertain how good a job sffinfo does (I haven't used it much), but I am very happy with sff_extract. Just make sure you extract the quality values too and use them.
My view on pre-processing reads is as follows.
First, removing adapters and other types of "contaminant" sequences is recommended. With SFF files, this should be handled automatically by the program that extracts the reads into FASTA format. I have good experience with sff_extract. With Illumina I prefer Trimmomatic, as it can do quality screening too if preferred, and it retains paired reads, which is a must for some downstream applications.
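To illustrate what adapter removal actually does, here is a minimal Python sketch of 3' adapter clipping. It only handles exact matches of the adapter (or an adapter prefix at the read end); real trimmers like Trimmomatic additionally tolerate mismatches and score partial alignments, and the adapter sequence and minimum overlap here are made-up illustrative values:

```python
def clip_adapter(read, adapter, min_overlap=5):
    """Remove a 3' adapter from a read (exact-match sketch).

    At each position, check whether the read's tail is a prefix of the
    adapter (adapter runs off the read end) or starts with the full
    adapter. Requires at least min_overlap matching bases.
    """
    for i in range(len(read) - min_overlap + 1):
        tail = read[i:]
        if adapter.startswith(tail) or tail.startswith(adapter):
            return read[:i]
    return read
```

For example, `clip_adapter("ACGTACGTAGATCG", "AGATCGGAAG")` trims the partial adapter at the end and returns `"ACGTACGT"`, while a read without the adapter is returned unchanged.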
Second, when it comes to quality screening, it very much depends on what you intend to do with your data. Many commonly used programs prefer non-cleaned data. For example, if you have 454 data, the assembler Newbler works directly on the SFF files. MIRA prefers non-cleaned 454 and Illumina data too. If you have Illumina RNA-seq data, the transcript assembler Trinity prefers non-cleaned data. These programs, and I am sure there are many others too, remove bad-quality bases internally, so as long as you have the quality values you are fine. In some cases I can see a need for quality screening, for example when mapping reads to a known genome, and in those cases I use Trimmomatic with settings based on FastQC analyses.
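To make the quality-screening idea concrete, here is a rough Python sketch of sliding-window trimming, the approach behind Trimmomatic's SLIDINGWINDOW step: decode the FASTQ quality string to Phred scores, then cut the read at the first window whose mean quality falls below a threshold. The window size and threshold are illustrative defaults, not a recommendation, and Trimmomatic's actual implementation differs in details:

```python
def phred_scores(qual_string, offset=33):
    """Decode a FASTQ quality string (Sanger / Illumina 1.8+, offset 33)."""
    return [ord(c) - offset for c in qual_string]

def sliding_window_trim(seq, quals, window=4, threshold=20):
    """Cut the read at the first window whose mean quality < threshold."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < threshold:
            return seq[:i]
    return seq
```

So a read whose quality collapses halfway through, e.g. qualities `[30, 30, 30, 30, 10, 10, 10, 10]`, gets cut as soon as the low-quality tail drags a window mean under 20.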
But now to what I think is the problem.
My experiences are based on projects that are not metagenomic. I assemble genomes or transcriptomes, map RNA-seq reads and so on, and I do believe that most tools out there are designed for these purposes. In general, there is something to map against, or reads will be assembled together, and quality values can be used directly in these processes. Also, quality checks can often be done further downstream, for example by removing reads that are mapped with low quality or by removing contaminant contigs in an assembly using BLAST-based methods.
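As an example of such a downstream check, filtering out low-quality mappings amounts to dropping SAM records whose MAPQ (column 5) is below a cutoff. In practice one would simply run `samtools view -q 20`, but a stdlib-only Python sketch of the same idea (the cutoff of 20 is illustrative) looks like this:

```python
def high_quality_alignments(sam_lines, min_mapq=20):
    """Yield SAM alignment lines whose mapping quality passes the cutoff.

    Skips header lines (which start with '@'); MAPQ is the fifth
    tab-separated field of an alignment record.
    """
    for line in sam_lines:
        if line.startswith('@'):
            continue
        fields = line.rstrip('\n').split('\t')
        if int(fields[4]) >= min_mapq:
            yield line
```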
Working directly on single reads is a totally different approach, and I can see how you would like to remove low-quality data. Here I am uncertain about which program to recommend, as I have never done this on 454 data. I hope someone else can make a good suggestion. I will make our metagenomics expert aware of this thread; he might have an opinion.
Anyway, I am not surprised that you see little discussion about pre-processing and that some programs are not updated much anymore, as this is often a non-issue for many downstream applications. Also, a large proportion of the bioinformaticians out there work with data from humans or some other model organism, and for them there is so much data to compare with that any result stemming from bad-quality data will usually be discovered easily.
You describe two problems in your post: 1) the lack of well-designed, well-tested tools for quality control of 454 reads and 2) the poor programming and packaging principles followed by many bioinformatics tools.
I can only agree with you regarding point 2: I often come across published tools that are virtually impossible to get working after download.
My experience with tools for quality control of 454 reads from shotgun libraries is limited. In the projects I've been involved in, we either ran CD-HIT-454 to reduce the number of duplicates prior to annotation and quantification of reads, or we based our analysis on assembled data. (BTW, quickly scanning the literature I find the interesting-looking DRISEE (Keegan et al., "A Platform-independent Method for Detecting Errors in Metagenomic Sequencing Data: DRISEE." PLoS Computational Biology 8, no. 6 (June 2012): e1002541).)
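The intuition behind that duplicate removal: artificial duplicates from 454 emulsion PCR typically start at the same position, so they share their initial bases. A crude Python stand-in that keeps one (the longest) read per identical prefix is sketched below; CD-HIT-454 itself is far more sophisticated, clustering reads while allowing mismatches and length differences, and the prefix length here is an arbitrary illustrative choice:

```python
def remove_prefix_duplicates(reads, prefix_len=20):
    """Keep one read per identical prefix, preferring the longest.

    A naive sketch of artificial-duplicate removal: reads that begin
    with the same prefix_len bases are treated as duplicates of each
    other, and only the longest representative is retained.
    """
    best = {}
    for read in reads:
        key = read[:prefix_len]
        if key not in best or len(read) > len(best[key]):
            best[key] = read
    return list(best.values())
```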
In your list you miss a group of tools, since you focus on shotgun libraries: those focused on eliminating artefacts (homopolymer errors from 454 sequencing, substitutions from PCR and chimeras from PCR) from amplicon (mostly 16S rRNA) sequences. Here the state of affairs is slightly brighter than for shotgun libraries. Chris Quince's AmpliconNoise is, if somewhat difficult to piece together into a running pipeline and very computationally demanding, based on sound principles and appears to perform well. A recently published tool, Acacia, appears promising (Bragg et al., "Fast, Accurate Error-correction of Amplicon Pyrosequences Using Acacia." Nature Methods 9, no. 5 (May 2012): 425–426).
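To show the kind of artefact these tools target: 454 pyrosequencing miscalls the length of long homopolymer runs. A toy Python illustration that simply caps run lengths is below; this hard cap is only to make the artefact concrete (the cutoff of 8 is arbitrary), and real denoisers like AmpliconNoise and Acacia instead model the flowgram signal rather than clipping runs:

```python
from itertools import groupby

def cap_homopolymers(seq, max_run=8):
    """Truncate homopolymer runs longer than max_run bases.

    groupby collapses the sequence into (base, run) pairs; each run is
    re-emitted at no more than max_run copies.
    """
    return ''.join(base * min(len(list(run)), max_run)
                   for base, run in groupby(seq))
```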
answered 05 Nov '12, 11:04