Being a novice in the field of RNA-seq I've run into some trouble when going attempting to do transcript assembly from paired-ended Illumina Script-Seq v2 library in a rather dense genome (a linear DNA virus). The issue is that I'm only getting one transcript which spans almost the entire genome - a result which I find highly unlikely given existing annotations of the same genome. There is a number of samples, where each sample is part of a time series, and the aim is to look at alternate splicing over time.
In more detail, this is what I have done:
I have aligned the data using tophat, with the following command:
According to this (http://seqanswers.com/forums/archive/index.php/t-17767.html) fr-secondstrand should be used with script seq v2, but I have no idea of how reliable this is.
I then try to assemble the transcripts using cufflinks in this way:
I've been using the "--max-bundle-frags" option since I have a very high coverage in some of the samples. And the "--upper-quartile-norm" in hope that it would make the program more sensitive to low abundance transcripts.
This is where the trouble starts. First of all I get the following warning for the samples with fewer reads:
However, setting the recommended --frag-len-mean does not work, as this is now a option only allowed for unpaired reads. Hoping that this would work anyway, I still checked the results.
This is where I discover that I only find one transcript for most of the samples. In some cases I get no transcripts at all. My intuition is that this is caused be the genome being very dense, which leads to problems for the program to correctly identify where one transcript end and another one begins. However, this might not be right at all. :)
Another thing that I think might be causing the trouble is the relatively low number or reads in some of the samples. The number of mapped reads ranges from 16819 to 42077004, yielding a coverage between ca. < 60x to 3561x.
So, to try to summarize I'm looking for information relating to these points:
Of course any other information that might help, other than the points above, would be highly welcome. Sorry about the slightly lengthy post, but I wanted to try to get as much information as possible in here, since I'm not experienced enough in this field to be more precise in my question.
I think you will have a hard time assembling transcripts using Cufflinks on a dense genome like this, since cufflinks will not understand where one gene starts and the other ends, just as you mention.
60x coverage should be enough, I do not think that is the problem.
We should be able to solve your problem though, but first some questions.
1) What type of data do you have? You have chosen SOLiD settings for the TopHat run, so I am guessing SOLiD. 2) What type of virus is it? RNA, DNA? 3) You mention you have several samples. Are these several rna-seq samples, or are they different genomes too? 4) What is it you intend to do with the results? Is it an expression analysis, do you want to annotate the genome, or something else? 5) Is there already an annotation of the genome?
answered 23 Oct '12, 15:37
Could this comment from Cole Trapnell (http://seqanswers.com/forums/showthread.php?p=76430#post76430) explain part of the problem?
"As far as the transcript merging issues that others have mentioned: Cufflinks will merge transcripts, especially in particularly gene-dense organisms like some fungi, because it tries to fill in coverage gaps (by default of 50bp or less) in order to make more full length structures in low-abundance genes. It's a tradeoff - by not filling in such gaps, your overall transcriptome assembly will be more fragmented, but you'll get fewer erroneous merges. You can disable this behavior by setting --olap-radius 0 in the Cufflinks assembler. We haven't exposed this option yet for cuffmerge but we can do so in a future release."
answered 07 Nov '12, 14:15