|
I have a 19GB fastq file (~70 million reads) and I want to find ~10,000 reads (by looking for their name) and pull out their sequence and quality. Looping through the file 10,000 times is obviously not a solution as it would easily take quite a few days! By searching the net, I found cdbfasta that couldn't, however, index such a big file because the index file got really big and the program crashed. Other than this, there's also a BioPerl module (Bio::Index::fastq) which I didn't try because I read somewhere that you can run into the same problem when dealing with huge fastq files. How do you usually deal with such huge fastq files? |
|
if you have only 10,000 reads you want to pull out, you can just iterate over the file once. For each record you check if the current read header is in the list (or hash) of 10,000. Something like this:
wouldn't it be better to say if (! exists $headers{})
(21 May '12, 17:07)
SupportRep
you mean (! exists $headers{$rec->display_id}) ? That is more explicit, but it won't make a difference.
(21 May '12, 17:08)
SupportRep
I took your advice brentp and started writing this Perl script. Weirdly, though, I discovered that BioPerl is VERY slow so I decided to do the file parsing "manually" (assuming there are only 4 lines per sequence) and it turned out to be A LOT faster (more than 60 times faster)!
(21 May '12, 17:08)
Customer
|
|
I would use python (no dependencies): 1. read you read names into list1 and change list to set (it's hashable, so checking for present of element is much faster than in list) 2. parse you fastq file and check if each name is present in set of read names of interest
My test set was 10k: first and last 5k headers of the fq. Fq was 4.9GB, 115,959,528 lines, seq between 31-75bases. I used 15k SAS drive - read speed is ~200MB/s, access time ~5ms. |