|
Being a novice in the field of RNA-seq I've run into some trouble when going attempting to do transcript assembly from paired-ended Illumina Script-Seq v2 library in a rather dense genome (a linear DNA virus). The issue is that I'm only getting one transcript which spans almost the entire genome - a result which I find highly unlikely given existing annotations of the same genome. There is a number of samples, where each sample is part of a time series, and the aim is to look at alternate splicing over time. In more detail, this is what I have done: I have aligned the data using tophat, with the following command:
According to this (http://seqanswers.com/forums/archive/index.php/t-17767.html) fr-secondstrand should be used with script seq v2, but I have no idea of how reliable this is. I then try to assemble the transcripts using cufflinks in this way:
I've been using the "--max-bundle-frags" option since I have a very high coverage in some of the samples. And the "--upper-quartile-norm" in hope that it would make the program more sensitive to low abundance transcripts. This is where the trouble starts. First of all I get the following warning for the samples with fewer reads:
However, setting the recommended --frag-len-mean does not work, as this is now a option only allowed for unpaired reads. Hoping that this would work anyway, I still checked the results. This is where I discover that I only find one transcript for most of the samples. In some cases I get no transcripts at all. My intuition is that this is caused be the genome being very dense, which leads to problems for the program to correctly identify where one transcript end and another one begins. However, this might not be right at all. :) Another thing that I think might be causing the trouble is the relatively low number or reads in some of the samples. The number of mapped reads ranges from 16819 to 42077004, yielding a coverage between ca. < 60x to 3561x. So, to try to summarize I'm looking for information relating to these points:
Of course any other information that might help, other than the points above, would be highly welcome. Sorry about the slightly lengthy post, but I wanted to try to get as much information as possible in here, since I'm not experienced enough in this field to be more precise in my question. |
|
I think you will have a hard time assembling transcripts using Cufflinks on a dense genome like this, since cufflinks will not understand where one gene starts and the other ends, just as you mention. 60x coverage should be enough, I do not think that is the problem. We should be able to solve your problem though, but first some questions. 1) What type of data do you have? You have chosen SOLiD settings for the TopHat run, so I am guessing SOLiD. 2) What type of virus is it? RNA, DNA? 3) You mention you have several samples. Are these several rna-seq samples, or are they different genomes too? 4) What is it you intend to do with the results? Is it an expression analysis, do you want to annotate the genome, or something else? 5) Is there already an annotation of the genome? Cheers, Henrik I have tried to update the question to answer your questions. A quick answer: 1) Illumina script seq v2 2) Linear DNA Virus 3) Same genome, time series data 4) Analyze alternate splicing over time 5) Yes.
(23 Oct '12, 15:56)
johan
This is tricky. First, am uncertain if you are using the correct library-type parameter. Searching on the net, there seems to be a lot of confusion about this too. I suggest asking the bowtie/tophat developers about this using the support-mail mentioned by Samuel above. Just to be sure you are doing this right. Second, my knowledge about viral biology is poor, so I am having a hard time answering this question. As I understand it you have several genes in the genome, and some of these are alternatively spliced. Is each gene transcribed separately, or is one gigantic mRNA transcribed that later is split? Are there splice-sites to be recognized? Do the genes have introns? Top Hat is usually used in eukaryotic genomes to splice transcripts onto the genome, and I am uncertain how this would work on a virus. Without knowing more, I would suggest making certain you get the library-type right, and supply tophat with the annotation file using the -G option. This should aid tophat in differing between genic and intergenic regions.
(24 Oct '12, 10:56)
HenrikLantz ♦
|
|
Could this comment from Cole Trapnell (http://seqanswers.com/forums/showthread.php?p=76430#post76430) explain part of the problem? "As far as the transcript merging issues that others have mentioned: Cufflinks will merge transcripts, especially in particularly gene-dense organisms like some fungi, because it tries to fill in coverage gaps (by default of 50bp or less) in order to make more full length structures in low-abundance genes. It's a tradeoff - by not filling in such gaps, your overall transcriptome assembly will be more fragmented, but you'll get fewer erroneous merges. You can disable this behavior by setting --olap-radius 0 in the Cufflinks assembler. We haven't exposed this option yet for cuffmerge but we can do so in a future release." |
Hi, have you been trying to ask this on the support mail for tophat/cufflinks (tophat.cufflinks@gmail.com, as they announce here: http://cufflinks.cbcb.umd.edu/), and did you get any answers on this there? (please share here if so :) )
I will share any answers that I come across. Right now I have cross posted this here: http://www.biostars.org/post/show/55400/transcript-assembly-in-dense-genome-using-cufflinks/