I have a 19GB fastq file (~70 million reads) and I want to find ~10,000 reads (by looking for their name) and pull out their sequence and quality. Looping through the file 10,000 times is obviously not a solution as it would easily take quite a few days!

By searching the net, I found cdbfasta that couldn't, however, index such a big file because the index file got really big and the program crashed. Other than this, there's also a BioPerl module (Bio::Index::fastq) which I didn't try because I read somewhere that you can run into the same problem when dealing with huge fastq files.

How do you usually deal with such huge fastq files?

asked 21 May '12, 17:07

Customer's gravatar image

Customer
295
accept rate: 0%


if you have only 10,000 reads you want to pull out, you can just iterate over the file once. For each record you check if the current read header is in the list (or hash) of 10,000. Something like this:

use Bio::SeqIO;
use Bio::Seq::Quality;

# read in 10K headers into a hash with value of 1 or whatever.
# you'll have to add your own code to do this.
my %headers = ();

$seqio  = Bio::SeqIO->new('-format'=>'fastq' , '-file'=>'some.fasq');

my $out_fastq = Bio::SeqIO->new( -format => 'fastq', '-file'=> 'subset.fastq');

while((my $rec = $seqio->next_seq())) {
    # if the current record doesn't exist in the 10,000, skip it.
    if(! $headers{$rec->display_id} ) { next; }
    # otherwise, write it to the output file.
    $out_fastq->write_seq($rec);
}
link

answered 21 May '12, 17:07

SupportRep's gravatar image

SupportRep
2314
accept rate: 32%

wouldn't it be better to say if (! exists $headers{})

(21 May '12, 17:07) SupportRep

you mean (! exists $headers{$rec->display_id}) ? That is more explicit, but it won't make a difference.

(21 May '12, 17:08) SupportRep

I took your advice brentp and started writing this Perl script. Weirdly, though, I discovered that BioPerl is VERY slow so I decided to do the file parsing "manually" (assuming there are only 4 lines per sequence) and it turned out to be A LOT faster (more than 60 times faster)!

(21 May '12, 17:08) Customer

I would use python (no dependencies): 1. read you read names into list1 and change list to set (it's hashable, so checking for present of element is much faster than in list) 2. parse you fastq file and check if each name is present in set of read names of interest

#!/usr/bin/python
import sys

#get fname from parameter
idfile=sys.argv[1]

#load ids
ids = set( x.strip() for x in open(idfile) )

#read stdid 
handle=sys.stdin

while ids:
  #parse fastq
  idline=handle.readline()
  seq   =handle.readline()
  spacer=handle.readline()
  quals =handle.readline()
  #check
  id=idline[:-1]
  if id in ids:
    #print fastq
    sys.stdout.write( '%s%s%s%s' % ( idline, seq, spacer, quals ) )
    #update ids
    ids.remove( id )

My python implementation is over 3x faster than c++ by Pierre. It's due to implementation of set (hash-able) instead of usual list.

#c++
time cat PL429_q10_filtered_1.fastq | ./parse PL429_q10_filtered_1.fastq.lids > PL429_q10_filtered_1.fastq.lids.fqcpp

real    4m21.137s
user    4m15.950s
sys 0m10.510s

#python (3.4x faster than c++)
time cat PL429_q10_filtered_1.fastq | python fastq_id_parser.py PL429_q10_filtered_1.fastq.lids > PL429_q10_filtered_1.fastq.fqSet

real    1m11.155s
user    1m5.270s
sys 0m10.920s

#python list instead of set (56x slower than python using set!)
time cat PL429_q10_filtered_1.fastq | python fastq_id_parser.noBioPython.noSet.py PL429_q10_filtered_1.fastq.lids > PL429_q10_filtered_1.fastq.lids.fqNoSet

real    70m54.534s
user    70m18.650s
sys 0m16.740s

#parsing file
time cat PL429_q10_filtered_1.fastq | wc -l
115959528

real    0m30.373s
user    0m1.990s
sys 0m8.250s

My test set was 10k: first and last 5k headers of the fq. Fq was 4.9GB, 115,959,528 lines, seq between 31-75bases. I used 15k SAS drive - read speed is ~200MB/s, access time ~5ms.

link

answered 21 May '12, 17:08

SupportRep's gravatar image

SupportRep
2314
accept rate: 32%

Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "Title")
  • image?![alt text](/path/img.jpg "Title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Tags:

×6
×2

Asked: 21 May '12, 17:07

Seen: 924 times

Last updated: 21 May '12, 17:08

powered by OSQA