Hi!

I need to construct a catalogue of non-redundant genes at 95% identity and 90% coverage of the shorter sequence. The starting set is ~10 million genes. This can be done with CD-HIT, but what I don't like about it is that it is a heuristic method that tends to create many clusters.

Another approach I have read about is to run an all-against-all BLAT search and merge sequences that fulfill the criteria. The question is then how to do the clustering: single linkage, complete linkage, etc. Does anyone know of a script that parses BLAT output and creates these clusters, or have any comments on this approach? I guess the all-against-all search will take a very long time?
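To make the single-linkage variant concrete, here is a minimal sketch of what such a parser could look like. It assumes the BLAT output is in the default tab-separated PSL format, and it uses simplified definitions that are my own assumptions: identity = (matches + repMatches) / aligned bases, and coverage = aligned bases / length of the shorter sequence, both taken per PSL line. The function names (`passes`, `find`, `single_linkage`) are hypothetical and not part of any existing tool.

```python
from collections import defaultdict


def passes(fields, min_identity=0.95, min_coverage=0.90):
    """Check one PSL hit against the identity/coverage thresholds.

    Assumed definitions (not taken from BLAT itself):
      identity = (matches + repMatches) / aligned bases
      coverage = aligned bases / min(qSize, tSize)
    """
    matches, mismatches, rep_matches = (int(fields[i]) for i in (0, 1, 2))
    q_size, t_size = int(fields[10]), int(fields[14])
    aligned = matches + mismatches + rep_matches
    if aligned == 0:
        return False
    identity = (matches + rep_matches) / aligned
    coverage = aligned / min(q_size, t_size)
    return identity >= min_identity and coverage >= min_coverage


def find(parent, name):
    """Union-find lookup with path compression; registers unseen names."""
    parent.setdefault(name, name)
    root = name
    while parent[root] != root:
        root = parent[root]
    while parent[name] != root:
        parent[name], name = root, parent[name]
    return root


def single_linkage(psl_lines, min_identity=0.95, min_coverage=0.90):
    """Merge every pair of sequences joined by a hit that passes the thresholds."""
    parent = {}
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the PSL header and malformed lines (a PSL record has 21 columns).
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        q_name, t_name = fields[9], fields[13]
        q_root, t_root = find(parent, q_name), find(parent, t_name)
        if q_root != t_root and passes(fields, min_identity, min_coverage):
            parent[q_root] = t_root
    clusters = defaultdict(list)
    for name in parent:
        clusters[find(parent, name)].append(name)
    return sorted(sorted(members) for members in clusters.values())
```

It could be called as `single_linkage(open("all_vs_all.psl"))`, where the file name is only a placeholder. Two caveats: sequences without any hit never appear in the PSL file and so have to be added as singletons separately, and single linkage chains, so two members of a cluster may fall below the thresholds when compared directly. Complete linkage would avoid that but needs every pair within a cluster to pass.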

Do you have any other suggestions or comments on this?

Thanks! Fredrik

asked 01 Nov '12, 08:33

Fredrik


Maybe you already saw this post on biostars.org, which suggests UCLUST over CD-HIT and also provides some info on how to go about it with BLAST.

answered 01 Nov '12, 11:28

saml ♦♦

edited 01 Nov '12, 11:29
