|
I'm a PhD student at the EBC in Uppsala, working with metagenomic data. As the title implies, I would like to talk about the pre-processing of sequencing reads. I have been looking around the web and testing the current tools, and it seems that no solution really cuts it. In particular, none of the code seems to contain tests. I have summarized my short review in a blog post. What do you guys suggest or use in your lab? Or simply, what do you think about the topic? |
|
I am uncertain how good a job sffinfo does (I haven't used it much), but I am very happy with sff_extract. Just make sure you extract the quality values too and use those.
My view on pre-processing reads is as follows. First, removing adapters and other types of "contaminant" sequences is recommended. With SFF files, this should be handled automatically by the program that extracts the reads into FASTA format. I have good experience with sff_extract. With Illumina I prefer Trimmomatic, as it can also do quality screening if desired, and it retains paired reads, which is a must for some downstream applications.
Second, when it comes to quality screening, it very much depends on what you intend to do with your data. Many commonly used programs prefer non-cleaned data. For example, if you have 454 data, the assembler Newbler works directly on the SFF files. MIRA prefers non-cleaned 454 and Illumina data too. If you have Illumina RNA-seq data, the transcript assembler Trinity prefers non-cleaned data. These programs, and I am sure there are many others too, remove bad-quality bases internally, so as long as you have the quality values you are fine. In some cases I can see a need for quality screening, for example when mapping reads to a known genome, and in those cases I use Trimmomatic with settings based on FastQC analyses.
But now to what I think is the problem. My experiences are based on projects that are not metagenomic. I assemble genomes or transcriptomes, map RNA-seq reads and so on, and I do believe that most tools out there are designed for these purposes. In general, there is something to map against, or reads will be assembled together, and here quality values can be used directly in these processes. Also, quality checks can often be done further downstream, for example by removing reads that are mapped with low quality or by removing contaminant contigs in an assembly using BLAST-based methods.
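
As a rough illustration of the quality screening discussed above, here is a minimal Python sketch of sliding-window trimming on a single read. It is not Trimmomatic's implementation. The function names, the window of 4, the mean-quality threshold of 15 and the minimum length of 36 are illustrative assumptions; real settings should come from inspecting your own data (for example with FastQC).

```python
from typing import List, Optional, Tuple


def sliding_window_cut(quals: List[int], window: int = 4,
                       min_mean: float = 15.0) -> int:
    """Return how many bases to keep from the 5' end.

    Scans windows from the start of the read and cuts at the start of the
    first window whose mean Phred quality drops below `min_mean`.
    The defaults are illustrative assumptions, not recommendations.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    # Reads shorter than the window are judged as one (short) window.
    for start in range(0, max(len(quals) - window + 1, 1)):
        chunk = quals[start:start + window]
        if not chunk:
            return 0  # empty read: nothing to keep
        if sum(chunk) / len(chunk) < min_mean:
            return start
    return len(quals)


def trim_read(seq: str, quals: List[int], window: int = 4,
              min_mean: float = 15.0,
              min_len: int = 36) -> Optional[Tuple[str, List[int]]]:
    """Trim one read; return None if the result is shorter than `min_len`."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    cut = sliding_window_cut(quals, window, min_mean)
    if cut < min_len:
        return None
    return seq[:cut], quals[:cut]


# Ten good bases followed by five poor ones.
example_seq = "ACGTACGTACGTACG"
example_quals = [30] * 10 + [5] * 5
print(trim_read(example_seq, example_quals, min_len=5))
```

Note that this sketch cuts at the start of the failing window, so it can discard a good base or two just before the quality drop. Whether that is acceptable depends on the downstream analysis, which is the point made above.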
Working directly on single reads is a totally different approach, and I can see why you would want to remove low-quality data. Here I am uncertain about which program to recommend, as I have never done this on 454 data. I hope someone else can make a good suggestion. I will make our metagenomics expert aware of this thread; he might have an opinion. Anyway, I am not surprised that you see little discussion about pre-processing and that some programs are rarely updated anymore, as this is often a non-issue for many downstream applications. Also, a huge number of the bioinformaticians out there work with data from humans or from some other model organism, and for them there is so much data to compare with that any result stemming from bad-quality data will usually be discovered easily.
Thanks for your answer. I agree with what you wrote. I suppose I will just proceed by using one of the tools I listed with some minor fine-tuning of parameters, maybe based on previous publications I can find. I'm just still a bit worried about all this code that we trust and that doesn't include any visible form of tests.
(25 Oct '12, 16:02)
xApple
Hmm... This answer is great and well written, but following the guidelines I'm used to on Stack Overflow or Biostars, I wouldn't select it as the accepted answer, since there is no answer yet on what to do.
(26 Oct '12, 10:57)
xApple
Sorry, my bad, I only read the first half of the answer and it seemed pretty solid. I'll leave it as not-accepted.
(26 Oct '12, 11:40)
dahlo ♦♦
|
|
You describe two problems in your post: 1) the lack of well-designed, well-tested tools for quality control of 454 reads and 2) the poor programming and packaging principles followed by many bioinformatics tools. I can only agree with you regarding point 2; I often come across published tools that are virtually impossible to get working after download. My experience with tools for quality control of 454 reads from shotgun libraries is limited. In projects I've been involved in, we either ran CD-HIT-454 to reduce the number of duplicates prior to annotation and quantification of reads, or we based our analysis on assembled data. (BTW, quickly scanning the literature I find the interesting-looking DRISEE (Keegan et al. “A Platform-independent Method for Detecting Errors in Metagenomic Sequencing Data: DRISEE.” PLoS Computational Biology 8, no. 6 (June 2012): e1002541.))
Since you focus on shotgun libraries, your list misses a group of tools: those focused on eliminating artefacts (homopolymer runs from 454 sequencing, substitutions from PCR and chimeras from PCR) from amplicon (mostly 16S rRNA) sequences. Here the state of affairs is slightly brighter than for shotgun libraries. Chris Quince's AmpliconNoise, although somewhat difficult to piece together into a running pipeline and very computationally demanding, is based on sound principles and appears to perform well. A recently published tool, Acacia, appears promising (Bragg et al. “Fast, Accurate Error-correction of Amplicon Pyrosequences Using Acacia.” Nature Methods 9, no. 5 (May 2012): 425–426.).
/Daniel
I am happy you agree with me. I also settled for CD-HIT-454 in my pipeline for now. That DRISEE publication is definitely interesting, but it doesn't seem that the software actually cleans the reads. Here is a quote from the paper: "DRISEE informed read trimming is currently under development".
Yes, I ignored amplicon-type studies and focussed on shotgun libraries, but Acacia looks good for amplicon data, though it's strange that the software is only distributed on Softpedia?
(05 Nov '12, 13:34)
xApple
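
To make the idea of duplicate reduction mentioned in this answer concrete, here is a minimal Python sketch that removes only the simplest kind of duplicate: reads that are identical to, or an exact prefix of, another read. This is not CD-HIT-454's algorithm, which works by clustering; the function name and the exact-prefix rule are assumptions made for illustration.

```python
from typing import Dict


def remove_prefix_duplicates(reads: Dict[str, str]) -> Dict[str, str]:
    """Drop reads that equal, or are an exact prefix of, another read.

    `reads` maps read id -> sequence. For identical sequences the first
    id encountered is kept. Comparison is case-insensitive.
    Hypothetical helper for illustration only.
    """
    # Collapse identical sequences, remembering the first id seen.
    first_id: Dict[str, str] = {}
    for read_id, seq in reads.items():
        first_id.setdefault(seq.upper(), read_id)

    # In lexicographic order, a sequence that is a prefix of any other
    # sequence is always a prefix of its immediate successor.
    ordered = sorted(first_id)
    kept: Dict[str, str] = {}
    for i, seq in enumerate(ordered):
        has_next = i + 1 < len(ordered)
        if has_next and ordered[i + 1].startswith(seq):
            continue  # seq is a prefix of a longer read: treat as duplicate
        kept[first_id[seq]] = seq
    return kept


example = {
    "r1": "ACGTACGT",
    "r2": "ACGT",      # prefix of r1
    "r3": "ACGTACGT",  # identical to r1
    "r4": "TTGA",
    "r5": "acgtac",    # prefix of r1, lower case
}
print(remove_prefix_duplicates(example))
```

Sorting keeps the check at O(n log n) instead of comparing every pair of reads. Sequencing errors mean that real duplicates are often near-identical rather than exact, which this sketch deliberately ignores.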
|
Hi! Before we get into the details I have some questions.
First, did you use sff_extract to extract the FASTA reads? The SFF files contain information on where the adapters are, and these should already be removed at the extraction stage. Second, what programs do you intend to run downstream? Some programs prefer non-cleaned data (for example the assembly program MIRA).
Cheers, Henrik
No, I used Roche's own sffinfo -s reads.sff program to extract the reads. Thought I could trust that more than some python script. We were thinking of, at first, unassembled analysis. That would include masking RNAs, ORF finding, gene calling, protein family assignment and metabolic description directly on the individual reads (like the RAMMCAP pipeline).
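
Whichever extractor is used, the advice elsewhere in this thread is to keep the quality values alongside the sequences. As a minimal sketch, assuming the extractor produced a FASTA file plus a matching quality file with whitespace-separated integer Phred scores under the same headers, the two can be paired into FASTQ like this. The function names and the assumed file layout are illustrative; check them against what your extractor actually writes.

```python
from typing import Dict, List


def parse_records(text: str) -> Dict[str, List[str]]:
    """Parse FASTA-style text into {read name: list of body lines}."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""  # name is the first header token
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return records


def fasta_qual_to_fastq(fasta_text: str, qual_text: str,
                        offset: int = 33) -> str:
    """Combine sequences and integer Phred scores into FASTQ text.

    Raises if a read has no quality record or the lengths disagree,
    rather than silently emitting a corrupt record.
    """
    seqs = {n: "".join(b) for n, b in parse_records(fasta_text).items()}
    quals = {n: [int(q) for q in " ".join(b).split()]
             for n, b in parse_records(qual_text).items()}
    out = []
    for name, seq in seqs.items():
        if name not in quals:
            raise KeyError(f"no quality values for read {name!r}")
        scores = quals[name]
        if len(scores) != len(seq):
            raise ValueError(f"length mismatch for read {name!r}")
        encoded = "".join(chr(score + offset) for score in scores)
        out.append(f"@{name}\n{seq}\n+\n{encoded}\n")
    return "".join(out)


fasta = ">r1 length=4\nACGT\n"
qual = ">r1 length=4\n40 30 20 10\n"
print(fasta_qual_to_fastq(fasta, qual), end="")
```

Failing loudly on a missing or mismatched quality record is a deliberate choice here: it is the kind of check the original question found lacking in many tools.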