I need to construct a catalogue of non-redundant genes at 95% identity and 90% coverage of the shorter sequence. The starting set is ~10 million genes. CD-HIT can do this, but what I don't like about it is that it is a heuristic method that tends to create many clusters.
Another approach that I have read about is an all-against-all BLAT search, merging sequences that fulfil the criteria. The question is then how to do the clustering: single linkage, complete linkage, etc. Does anyone know of a script to parse BLAT output and create these clusters, or have any comments on this approach? I guess the all-against-all search will take a very long time?
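To make the clustering step concrete, here is a minimal sketch of single-linkage clustering over BLAT hits in the standard 21-column PSL format, using a union-find structure. The function names (`passes`, `single_linkage`) and the way identity and coverage are computed are my own assumptions, not something BLAT provides: identity is taken as (matches + repMatches) / aligned bases, and coverage as aligned bases / length of the shorter sequence, ignoring gaps. Other definitions are possible and would change the clusters.

```python
def passes(fields, min_identity=0.95, min_coverage=0.90):
    """Check one PSL row against the identity/coverage thresholds.

    Assumed definitions (simplified, gaps ignored):
      identity = (matches + repMatches) / aligned bases
      coverage = aligned bases / length of the shorter sequence
    """
    matches, mismatches, rep = int(fields[0]), int(fields[1]), int(fields[2])
    q_size, t_size = int(fields[10]), int(fields[14])
    aligned = matches + mismatches + rep
    if aligned == 0:
        return False
    identity = (matches + rep) / aligned
    coverage = aligned / min(q_size, t_size)
    return identity >= min_identity and coverage >= min_coverage


def find(parent, x):
    """Return the cluster representative of x, compressing the path."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        nxt = parent[x]
        parent[x] = root
        x = nxt
    return root


def single_linkage(psl_lines):
    """Group sequence names into single-linkage clusters from PSL rows."""
    parent = {}
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip PSL header lines and anything that is not a 21-column data row.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        q_name, t_name = fields[9], fields[13]
        parent.setdefault(q_name, q_name)
        parent.setdefault(t_name, t_name)
        if q_name != t_name and passes(fields):
            # Single linkage: one qualifying hit is enough to merge clusters.
            parent[find(parent, q_name)] = find(parent, t_name)

    clusters = {}
    for name in parent:
        clusters.setdefault(find(parent, name), []).append(name)
    return list(clusters.values())
```

It could be run as `single_linkage(open("all_vs_all.psl"))`, where the file name is hypothetical. Genes that never appear in the PSL file would have to be added as singleton clusters separately. Single linkage can chain A to C through B even when A and C themselves do not meet the thresholds, which is exactly why the choice of linkage matters here. Complete linkage would need the full pairwise hit table rather than a union-find pass.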
Do you have any other suggestions or comments?
asked 01 Nov '12, 08:33