|
We will be getting data from human whole-exome sequencing done on an Illumina GAIIx. I am planning to put together a pipeline: alignment (BWA?; optional, as we may get the sequence already aligned) | variant detection (Samtools?) | variant classification and deleteriousness prediction (something like SIFT/PolyPhen). I know that there are various options out there, but I would like to know, based on your experience, what the best collection of these tools is. If you had the chance to start from scratch, what kind of pipeline would you put together? I also know that big sequencing centers have in-house tool sets for squeezing out the last drop, but I am looking for what is available to the general public. Thanks |
|
Here is our basic pipeline. We are novices, but this seems to work for us. Please investigate all of the tools carefully, as (I'll repeat) I am not an expert in this analysis. Note that I am using mostly default values for the analysis; advice I have gotten from experienced hands is that it is routine to use the intersection of multiple SNP calling methods, and that indel calling is still an art. Note that this is set up for single reads, not paired reads, but the same basic pipeline should apply.
PREPROCESS:
FOR EACH SAMPLE:
SAMPLE SCRIPT
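Separately from the pipeline itself, the advice above about using the intersection of multiple SNP calling methods can be sketched in Python. This is only an illustration under stated assumptions: the function names `read_variant_keys` and `intersect_calls` are hypothetical, and each caller is assumed to write standard tab-delimited VCF records (CHROM, POS, ID, REF, ALT as the first five columns).

```python
def read_variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from an iterable of VCF lines.

    Assumption: tab-delimited VCF; header lines start with '#'.
    Multi-allelic ALT fields (comma-separated) are split into one key each.
    """
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def intersect_calls(*call_sets):
    """Return only the variants reported by every caller."""
    if not call_sets:
        return set()
    return set.intersection(*call_sets)
```

In practice one would pass open file handles, for example `read_variant_keys(open("caller_a.vcf"))`, where the file name is a placeholder. Keeping only calls that agree on position and alleles is a deliberately strict choice; it trades sensitivity for specificity.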
Very interesting and detailed suggestion. Will try this out once our exome data is ready.
(21 May '12, 17:05)
Customer
Great start, thanks. This was what I was looking for when I asked this question. I hope we can build on this and share our experiences; we are not experts either, but together I believe we can become experts much faster. I am still hung up on the preprocessing step. How did you convert Illumina reads to FASTQ format and convert Illumina 1.6 read quality scores to standard Sanger scores? Do you have to convert pipeline 1.6 quality scores to Sanger scores?
(21 May '12, 17:05)
Customer
To convert Illumina to FASTQ I think I used MAQ (maq sol2sanger in.sol.fastq out.sanger.fastq). See http://maq.sourceforge.net/fastq.shtml and http://seqanswers.com/forums/showthread.php?t=1801. It may be that the quality scores are not relevant for BWA; please chime in if you know. My first alignment used MAQ; I switched to BWA because it can do gapped alignment and it's faster. seqanswers.com is highly recommended for getting started.
(21 May '12, 17:05)
SupportRep
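For anyone who wants to see what the quality-score conversion amounts to, here is a minimal Python sketch. It assumes the input uses the Phred+64 encoding written by Illumina pipeline versions 1.3 through 1.7 (which would cover 1.6), not the older Solexa log-odds scale, and that the FASTQ file has strict four-line records. The function names are hypothetical and this is not what MAQ does internally.

```python
def illumina64_to_sanger(qual):
    """Re-encode one quality string from Phred+64 to Phred+33 (Sanger)."""
    converted = []
    for ch in qual:
        q = ord(ch) - 64  # decode the Phred score from the +64 offset
        if q < 0:
            raise ValueError("character %r is not valid Phred+64" % ch)
        converted.append(chr(q + 33))  # re-encode with the Sanger offset
    return "".join(converted)


def convert_fastq(lines):
    """Yield FASTQ lines with every fourth line (the qualities) re-encoded.

    Assumption: strict 4-line records (header, sequence, '+', qualities).
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 3:
            yield illumina64_to_sanger(line)
        else:
            yield line
```

Since the two encodings differ only by a constant offset, the Phred scores themselves are unchanged; only the ASCII characters shift.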
I haven't yet; our region of interest was quite small, and we only used single reads for that experiment. I think the discussion at biostar.stackexchange.com/questions/3925 is a good place to start; the respondents know a lot more than I do about this question.
(21 May '12, 17:05)
Customer
Do you do base quality score recalibration? If not, why not? Thanks for sharing!
(21 May '12, 17:05)
Customer
|
|
This answer would serve better as a comment on David Quigley's, which is great, but to emphasize the recent improvements I am still giving it as a separate answer. Between steps 7 and 8, it is recommended to add an additional step:
This will substantially improve SNP specificity. A few other comments on David Quigley's pipeline:
BAQ has been added to the GATK pipeline. See more detailed information about how to use it yourself here: http://www.broadinstitute.org/gsa/wiki/index.php/Per-base_alignment_qualities_(BAQ)_in_the_GATK
(21 May '12, 17:06)
SupportRep
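For readers applying BAQ with samtools instead of GATK, a minimal Python sketch of wrapping `samtools calmd` might look like the following. Everything specific here is an assumption on my part and not taken from the answer: the flag combination `-Abr` (`-r` computes BAQ, `-A` applies it to the base qualities, `-b` writes BAM), the helper names, and the file names. It also assumes `samtools` is on the PATH.

```python
import subprocess


def build_calmd_command(in_bam, ref_fasta):
    """Build the argument list for a BAQ-applying calmd run.

    Assumed flags: -A apply BAQ to base qualities, -b output BAM,
    -r compute BAQ. Check these against your samtools version.
    """
    return ["samtools", "calmd", "-Abr", in_bam, ref_fasta]


def run_calmd(in_bam, ref_fasta, out_bam):
    """Run calmd and write its standard output to out_bam."""
    with open(out_bam, "wb") as out:
        subprocess.run(
            build_calmd_command(in_bam, ref_fasta),
            stdout=out,
            check=True,  # raise if samtools exits with an error
        )
```

A call such as `run_calmd("aln.bam", "ref.fa", "aln.baq.bam")` (placeholder names) would produce a BAM file with BAQ-capped qualities.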
Hi Hemp, I am confused about which analyses/commands one needs to run after calmd. Is the .baq.bam file the input to the next step? Sorry, I am a newbie to this analysis and am not sure what I am supposed to do next with the calmd output.
(21 May '12, 17:06)
Customer
|