Hi!

I am interested in hearing what you would recommend for normalization of RNASeq data. How many replicates (if any) do you run to get enough statistical power?

I am particularly interested in discussing whether you think it is necessary to take the length of the transcript into account, or whether doing so generates new biases in the dataset.

cheers,

Dag

asked 01 Oct '12, 14:50

DagAhren ♦
1445
accept rate: 33%

edited 01 Oct '12, 20:05


Dear Dag, you are asking several different and very relevant questions. I will try to provide inputs separately for each one of them.

1) Number of samples: This question comes up in any experimental design, and the answer is always "it depends". If you plan to use samples from human cohorts, you will need quite a few samples, because the biological variability will be large. If you have samples from cell lines, fewer samples will be required. My impression (though there may be better approaches) is that the best strategy is to download samples/experimental designs from public resources (e.g. GEO) that are as close as possible to your system/model and run a power analysis on them. You can then estimate the minimum number of samples required.
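To make the power-analysis idea concrete, here is a minimal sketch using only the Python standard library. It assumes a simple two-group comparison of log2 expression for a single gene with a normal approximation; the function name, the standard deviations and the target fold change are illustrative assumptions, not values from any real data set, and no multiple-testing correction is included. In practice the variability would be estimated from a public data set close to your system.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(log2_fold_change, sd_log2, alpha=0.05, power=0.8):
    """Approximate replicates per group for a two-sided, two-group comparison.

    Normal approximation: n = 2 * ((z_alpha + z_power) * sd / effect)^2,
    applied to the log2 expression of one gene.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    n = 2 * ((z_alpha + z_power) * sd_log2 / log2_fold_change) ** 2
    return ceil(n)


# Hypothetical standard deviations: a tight cell-line system versus a
# much more variable human cohort, both aiming to detect a 2-fold change.
for label, sd in [("cell line", 0.3), ("human cohort", 1.0)]:
    print(label, samples_per_group(log2_fold_change=1.0, sd_log2=sd))
```

The point of the sketch is only the scaling: the required number of replicates grows with the square of the variability, which is why cohort samples need so many more replicates than cell lines.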

2) Length of transcript: Here further clarification is needed. The length of the transcript would be beneficial in the mapping to the reference genome; however, it introduces some biases. When you mention the length of the transcript as a bias, which of these do you mean: a- Are you aiming to integrate several data sets with different lengths? b- Do you have a single data set with several lengths because you trimmed some reads for quality reasons? c- Are you specifically talking about "normalization" dependent on your read length? I assume you mean the last one, so in that case go to the next point.

3) Normalization: RNA-Seq normalization has been studied in considerable detail over the last 3 years. As it stands, I cannot see a clear agreement in the literature, although there are preferences. There have been two major approaches:

a- Approaches that include a model-based normalization, that is, the differential expression method assumes a specific model for the count data and uses it in the analysis. One specific example is DESeq: "The count values must be raw counts of sequencing reads. This is important for DESeq's statistical model to hold, as only the actual counts allow assessing the measurement precision correctly. Hence, please do not supply other quantities, such as (rounded) normalized counts, or counts of covered base pairs - this will only lead to nonsensical results".

b- A second approach is to normalize the data beforehand and then analyze it with classical microarray methodologies such as limma. For this case there are several normalization methods; some examples are:

i. RPKM: Reads Per Kilobase per Million mapped reads. It helped in the beginning, but it has been shown not to be enough.

ii. TMM, which computes the trimmed mean of M-values, used to compute a scaling factor that gives an "effective library size". (Recommended read: Robinson MD, Oshlack A (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11, R25. & edgeR package)

iii. Include other technical/biological biases: GC content, sequence length and others. GC bias is much more evident than sequence-length bias. In addition, my impression is that TMM and other methodologies somehow counteract the sequence-length effect.
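As a rough illustration of the first two methods in the list above, the sketch below computes RPKM and a simplified TMM-style scaling factor in plain Python. The TMM part is deliberately reduced: it only trims the extreme M-values and takes an unweighted mean, whereas the published method also trims on average expression and uses precision weights, so this is not the edgeR implementation. The function names and the toy counts are assumptions made up for the example.

```python
from math import log2


def rpkm(count, length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return count * 1e9 / (length_bp * total_mapped_reads)


def trimmed_mean_m_factor(sample, reference, trim=0.3):
    """Simplified TMM-style scaling factor of `sample` relative to `reference`.

    M-value per gene = log2 ratio of library-size-scaled counts. The most
    extreme `trim` fraction of M-values at each end is dropped and the rest
    are averaged without weights (a simplification of the published method).
    """
    n_s, n_r = sum(sample), sum(reference)
    m_values = sorted(
        log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0  # genes with a zero count have no finite M-value
    )
    k = int(len(m_values) * trim)
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))


# Same read count, different transcript lengths: the longer transcript
# gets the lower RPKM.
print(rpkm(500, 2000, 10_000_000), rpkm(500, 4000, 10_000_000))

# Toy libraries: nine unchanged genes plus one gene that dominates the sample.
reference = [100] * 10
sample = [100] * 9 + [10_000]
factor = trimmed_mean_m_factor(sample, reference)
print(factor, sum(sample) * factor)  # scaling factor, effective library size
```

In the toy example, plain library-size scaling would make the nine unchanged genes look strongly down-regulated in the sample; the trimmed mean ignores the single dominant gene, which is the motivation for an "effective library size".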

In this respect two suggestions:

S1- Use several different DE methods (edgeR, DESeq, NOISeq) and see how much they agree. Ideally, they are just methods to identify DE genes, but no one is 100% sure about the validity of their read model. Compare them, and maybe you will get a robust selection.

S2- Omit the sequence length for now, BUT once you have done the DE analysis, check whether your DE genes are enriched by length. If they are, evaluate whether there is a need for normalization and go back to methods such as those shown in: Hansen, K. D., Irizarry, R. A., & Wu, Z. (2012). Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics (Oxford, England), 204–216. doi:10.1093/biostatistics/kxr054. There you can look into which covariate you want to consider.
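The two suggestions above could be sketched along the following lines. This is only one possible way to do it: the consensus rule (a gene called by at least two methods) and the permutation test on median transcript length are choices made for the example, and the gene names, lengths and DE calls are invented. Real gene lists would come from the output of each DE package.

```python
import random
from collections import Counter
from statistics import median


def consensus_genes(calls_by_method, min_methods=2):
    """Genes called DE by at least `min_methods` of the supplied methods."""
    votes = Counter(g for genes in calls_by_method.values() for g in set(genes))
    return {g for g, v in votes.items() if v >= min_methods}


def length_bias_pvalue(lengths, de_genes, n_perm=2000, seed=0):
    """One-sided permutation p-value: are DE genes longer than random sets?

    Statistic: median length of the DE set. The null distribution comes from
    random gene sets of the same size drawn from all tested genes.
    """
    rng = random.Random(seed)
    genes = sorted(lengths)
    de = [g for g in genes if g in de_genes]
    observed = median(lengths[g] for g in de)
    hits = 0
    for _ in range(n_perm):
        sampled = rng.sample(genes, len(de))
        if median(lengths[g] for g in sampled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Invented example: 40 genes with lengths from 500 to 4400 bp.
lengths = {f"g{i}": 500 + 100 * i for i in range(40)}
calls = {
    "edgeR": {"g1", "g35", "g36", "g37", "g38", "g39"},
    "DESeq": {"g34", "g35", "g36", "g37", "g38", "g39"},
    "NOISeq": {"g2", "g36", "g37", "g38", "g39"},
}
robust = consensus_genes(calls)
print(sorted(robust))
print(length_bias_pvalue(lengths, robust))
```

A small p-value here would suggest that the robust DE set is skewed towards long transcripts, which is the signal to revisit length-aware normalization; a large one suggests length is not a major issue for this data set.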

Hope that helps. Best regards, David

answered 08 Oct '12, 09:00

lunacab
442
accept rate: 50%

PS. David, I do wonder if some insights as to the best way of normalizing RNAseq could be generated by comparing matched CAGE and RNAseq data from ENCODE?

with best regards, Lukasz

(09 Oct '12, 19:00) lucash ♦

PS: Lukasz, there can be. However, the challenge is that the two data types are quite different, with their own normalization issues/challenges, and the correlation between them is not perfect. In case you try it, make sure you are getting enough depth for both and very good quality data.

(09 Oct '12, 19:04) lunacab

Hi!

Thank you for all the interesting points. I will look into the power analysis approach for estimating sequencing requirements.

I especially like your pragmatic approach to evaluating the effect of read length. If no bias appears to have been generated, then length bias is not a major issue.

Thanks for all your interesting thoughts and ideas!

Dag

(09 Oct '12, 22:49) DagAhren ♦

Great to see biosupport.se becoming a hub for lively discussion and exchange of hot ideas in bioinformatics!!!

I think ENCODE has both RNAseq and Affy Exon Arrays for their cell lines, so it might be possible to directly compare these two platforms with regard to both their sensitivity to expression patterns and alternative splicing!

http://encodeproject.org/ENCODE/downloads.html

The whole point of ENCODE was to do these assays in depth and in a very well controlled fashion, so it is probably the best dataset out there for cross-platform and cross-methodology benchmarking of this type!

answered 10 Oct '12, 19:26

lucash ♦
211
accept rate: 0%
