We will be getting data from human whole exome sequencing done on Illumina GAIIx. I am planning to put together a pipeline to align (BWA?)(optional as we may get the sequence already aligned)|variant detection (Samtools?)|variant classification-deleteriousness prediction(something like SIFT/PolyPhen)|. I know that there are various options out there but I would like to know based on your experiences what the best collection of these tools is? If you had the chance to start from scratch what kind of a pipeline would you put together? I also know that big sequencing centers have their in house tool sets for these for squeezing the last drop but what is available to the general public is what I am looking for.

Thanks

asked 21 May '12, 17:03

Customer's gravatar image

Customer
295
accept rate: 0%


Here is our basic pipeline. We are novices, but this seems to work for us. Please investigate all of the tools carefully, as (I'll repeat) I am not an expert in this analysis. Note that I am using mostly default values for the analysis; advice I gotten from experienced hands is it's routine to use the intersection of multiple SNP calling methods, and that indel calling is still an art. Note that this is set up for single reads, not paired reads, but the same basic pipeline should apply.

PREPROCESS:

  • Index human genome (Picard), we used HG18 from UCSC.
  • Convert Illumina reads to Fastq format
  • Convert Illumina 1.6 read quality scores to standard Sanger scores

FOR EACH SAMPLE:

  1. Align samples to genome (BWA), generates SAI files.
  2. Convert SAI to SAM (BWA)
  3. Convert SAM to BAM binary format (SAM Tools)
  4. Sort BAM (SAM Tools)
  5. Index BAM (SAM Tools)
  6. Identify target regions for realignment (Genome Analysis Toolkit)
  7. Realign BAM to get better Indel calling (Genome Analysis Toolkit)
  8. Reindex the realigned BAM (SAM Tools)
  9. Call Indels (Genome Analysis Toolkit)
  10. Call SNPs (Genome Analysis Toolkit)
  11. View aligned reads in BAM/BAI (Integrated Genome Viewer)

SAMPLE SCRIPT

/bin/BWA/bwa aln -f /output/FOO.sai -t 3 /seq/REFERENCE/human_18.fasta /seq/FQ/FOO.sanger.fq
/bin/BWA/bwa samse -f /output/FOO.sam /seq/REFERENCE/human_18.fasta /output/FOO.sai /seq/FQ/FOO.sanger.fq
/bin/SAMTOOLS/samtools import /seq/REFERENCE/human_18.fasta /output/FOO.sam /output/FOO.bam
/bin/SAMTOOLS/samtools sort /output/FOO.bam /output/FOO.sorted
/bin/SAMTOOLS/samtools index /output/FOO.sorted.bam /output/FOO.sorted.bam.bai
java -jar /bin/GTK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /seq/REFERENCE/human_18.fasta -I /output/FOO.sorted.bam  -o /output/FOO.intervals
java -jar /bin/GTK/GenomeAnalysisTK.jar -T IndelRealigner -R /seq/REFERENCE/human_18.fasta -I /output/FOO.sorted.bam -targetIntervals /output/FOO.intervals --output /output/FOO.sorted.realigned.bam
/bin/SAMTOOLS/samtools index /output/FOO.sorted.realigned.bam /output/FOO.sorted.realigned.bam.bai
java -jar /bin/GTK/GenomeAnalysisTK.jar -T IndelGenotyperV2 -R /seq/REFERENCE/human_18.fasta -I /output/FOO.sorted.realigned.bam -O /output/FOO_indel.txt --verbose -o /output/FOO_indel_statistics.txt
java -jar /bin/GTK/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /seq/REFERENCE/human_18.fasta -I /output/FOO.sorted.realigned.bam -varout /output/FOO.geli.calls -vf GELI -stand_call_conf 30.0 -stand_emit_conf 10.0 -pl SOLEXA
link

answered 21 May '12, 17:04

SupportRep's gravatar image

SupportRep
2314
accept rate: 32%

Very interesting and detailed suggestion. Will try this out once our exome data is ready

(21 May '12, 17:05) Customer

great start. thanks this was what I was looking for when I asked this question. I hope we can build on this and share our experiences as we are not experts either but together I believe we can become experts much faster. I am still hung up on the preprocessing step. how did you Convert Illumina reads to Fastq format and Convert Illumina 1.6 read quality scores to standard Sanger scores? Do you have to convert pipeline 1.6 quality scores to sanger scores?

(21 May '12, 17:05) Customer

To convert Illumina to fastq I think I used MAQ (maq sol2sanger in.sol.fastq out.sanger.fastq). See http://maq.sourceforge.net/fastq.shtml, http://seqanswers.com/forums/showthread.php?t=1801 . It may be that the quality scores are not relevant for BWA, please chime in if you know this. My first alignment used MAQ; I switched to BWA because it can do gapped alignment, and it's faster. seqanswers.com is highly recommended for getting started.

(21 May '12, 17:05) SupportRep

I haven't yet; our region of interest was quite small, and we only used single reads for that experiment. I think the discussion at biostar.stackexchange.com/questions/3925 is a good place to start; the respondents know a lot more than I do about this question.

(21 May '12, 17:05) Customer

Do you do base quality score recalibration? If not, why so? Thanks for sharing!

(21 May '12, 17:05) Customer

My this answer better serves as a comment to David Quigley's, which is great, but to emphasize the recent improvements, I still give it as separate answer.

Between step 7 and 8, it is recommended to add an additional step:

samtools calmd -Abr FOO.sorted.bam human_18.fasta > FOO.baq.bam

This will substantially improve SNP specificity.

A few other comments to David Quigley's pipeline are:

  1. It is recommended to use Dindel for indel calling. It is the few indel callers that properly model indels (most not, including IndelGenotyperV2). Broad is reimplementing the Dindel model in GATK. The latest samtools also models indels properly, but it is not evaluated as widely as Dindel.

  2. For the human genome build 36, it is recommended to use the version used by the 1000 genomes project. You will lose genes in the pseudoautosomal regions if you use hg18. Hg19/b37 is fine.

  3. The samtools and GATK SNP callers are converging in algorithms and in results. GATK is arguably better than MAQ, which has been confirmed by several independent studies.

link

answered 21 May '12, 17:06

SupportRep's gravatar image

SupportRep
2314
accept rate: 32%

BAQ has been added to the GATK pipeline. See more detailed information about how to use it yourself here: http://www.broadinstitute.org/gsa/wiki/index.php/Per-base_alignment_qualities_(BAQ)_in_the_GATK

(21 May '12, 17:06) SupportRep

Hi Hemp, I am confused as to what set of analysis/commands once needs to do after running calmd? Is the .baq.bam file going to input to what's next? Sorry I am newbie to this analysis and am confused what I am supposed to do next with calmd output.

(21 May '12, 17:06) Customer
Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "Title")
  • image?![alt text](/path/img.jpg "Title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Tags:

×6
×4
×1
×1

Asked: 21 May '12, 17:03

Seen: 1,738 times

Last updated: 21 May '12, 17:06

powered by OSQA