Downloading
Recently, I had to retrieve sequencing data in FASTQ format belonging to a paper from Law et al. It was for one of the two remaining mini-projects standing between me and my PhD.
Mainly they’re for applying my gene-centric approach (watch out for the next part, it’ll be released soon!) to a time-series dataset of total RNA and to an enriched reactor core.
So it begins with the following line in the publication:
All raw metagenome, metatranscriptome and amplicon sequencing data used in this study are publicly available from NCBI under BioProject ID: PRJNA320780 (http://www.ncbi.nlm.nih.gov/bioproject/320780).
- metagenome, i.e. DNA
- metatranscriptome, i.e. total RNA
- amplicon, i.e. 16S only
Sounds easy, doesn’t it: go to the link, click on download and you’ll get everything you need. Well, it wasn’t. =(
Previously, my experience with downloading from NCBI had mostly been through their web portal in a browser, not much programmatically.
Day 1: Getting the Files
K, calm down. All I need now is a link to wget or curl the files. No problem, I’ve heard of the SRA format (SRA stands for Sequence Read Archive); nothing is gonna stop me.
On the BioProject’s page I saw I had about 40 SRA files to fetch…
Hmmmm. Do I click and download them by hand? “Of course not, I know how to write scripts why should I do this by hand”, I thought.
After some digging around for ways to get the download links I found this: Entrez Direct: E-utilities on the UNIX Command Line.
To install the tool you’ll need to install some Perl modules first (forgive the Perl, everyone knows Perl is basically never going away in bioinformatics).
You’ll probably need to CPAN some modules, such as Net::FTP (I recommend installing cpanminus, aka cpanm, Perl’s unofficial package manager).
cd ~
perl -MNet::FTP -e \
'$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1);
$ftp->login; $ftp->binary;
$ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
gunzip -c edirect.tar.gz | tar xf -
rm edirect.tar.gz
export PATH=$PATH:$HOME/edirect
./edirect/setup.sh
After installing this, well, you could finally start downloading the SRAs… (you wish)
Digging through the website, it was easy to find the button to download the SRAs, but getting the links to all 40 SRAs programmatically, not so easy!
And yup, I was pretty much right, until I found a way to get the runInfo.csv.
Day 2: The saga continues: you don’t need to download the files by hand
EDIRECT=/path/2/eDirect
export PATH=$PATH:$EDIRECT
esearch -db sra -query PRJNA320780 | efetch -format runinfo
which looks like this:
Run | ReleaseDate | LoadDate | spots | bases | spots_with_mates | avgLength | size_MB | AssemblyName | download_path | Experiment | LibraryName | LibraryStrategy | LibrarySelection | LibrarySource | LibraryLayout | InsertSize | InsertDev | Platform | Model | SRAStudy | BioProject | Study_Pubmed_id | ProjectID | Sample | BioSample | SampleType | TaxID | ScientificName | SampleName | g1k_pop_code | source | g1k_analysis_group | Subject_ID | Sex | Disease | Tumor | Affection_Status | Analyte_Type | Histological_Type | Body_Site | CenterName | Submission | dbgap_study_accession | Consent | RunHash | ReadHash |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SRR3501849 | 2016-05-18 11:35:07 | 2016-05-13 11:31:37 | 25818676 | 7797240152 | 25818676 | 302 | 4224 | NA | https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501849 | SRX1759558 | 844 | WGS | RANDOM | METAGENOMIC | PAIRED | 0 | 0 | ILLUMINA | Illumina HiSeq 2500 | SRP075031 | PRJNA320780 | 2 | 320780 | SRS1435427 | SAMN04957382 | simple | 942017 | activated sludge metagenome | UPWRP_SW_d1_r1 | NA | NA | NA | NA | NA | NA | no | NA | NA | NA | NA | NA | SRA425235 | NA | public | 8C81A9CE61F9010A73220794D655E084 | 0AE4D27EB24ECF49E094557AD7255216 |
SRR3501850 | 2016-05-18 11:50:28 | 2016-05-13 11:46:22 | 31189839 | 9419331378 | 31189839 | 302 | 5112 | NA | https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501850 | SRX1759559 | 845 | WGS | RANDOM | METAGENOMIC | PAIRED | 0 | 0 | ILLUMINA | Illumina HiSeq 2500 | SRP075031 | PRJNA320780 | 2 | 320780 | SRS1435428 | SAMN04957383 | simple | 942017 | activated sludge metagenome | UPWRP_SW_d2_r1 | NA | NA | NA | NA | NA | NA | no | NA | NA | NA | NA | NA | SRA425235 | NA | public | E649F6CDCC80915B98BE85CD437B7EFE | B58C5296FB135FCF2E9BFD8544C33B29 |
SRR3501851 | 2016-05-18 11:47:02 | 2016-05-13 11:42:17 | 31966019 | 9653737738 | 31966019 | 302 | 5244 | NA | https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501851 | SRX1759560 | 945 | WGS | RANDOM | METAGENOMIC | PAIRED | 0 | 0 | ILLUMINA | Illumina HiSeq 2500 | SRP075031 | PRJNA320780 | 2 | 320780 | SRS1435429 | SAMN04957392 | simple | 942017 | activated sludge metagenome | UPWRP_SW_d1_r2 | NA | NA | NA | NA | NA | NA | no | NA | NA | NA | NA | NA | SRA425235 | NA | public | 81EC07EC8BC6509DBCB00BC4FA7401A9 | 9AD8B926EF9D20E3A2FD10582C72B592 |
SRR3501852 | 2016-05-18 12:02:10 | 2016-05-13 11:57:54 | 29331148 | 8858006696 | 29331148 | 302 | 4854 | NA | https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501852 | SRX1759561 | 946 | WGS | RANDOM | METAGENOMIC | PAIRED | 0 | 0 | ILLUMINA | Illumina HiSeq 2500 | SRP075031 | PRJNA320780 | 2 | 320780 | SRS1435430 | SAMN04957393 | simple | 942017 | activated sludge metagenome | UPWRP_SW_d2_r2 | NA | NA | NA | NA | NA | NA | no | NA | NA | NA | NA | NA | SRA425235 | NA | public | 63B30D9EC717121777A138CECA1F1ACA | 35A116CCE17CBA7F425465AA9D7DBB6B |
SRR3501853 | 2016-05-18 11:50:18 | 2016-05-13 11:46:11 | 34045865 | 10281851230 | 34045865 | 302 | 5630 | NA | https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501853 | SRX1759562 | 947 | WGS | RANDOM | METAGENOMIC | PAIRED | 0 | 0 | ILLUMINA | Illumina HiSeq 2500 | SRP075031 | PRJNA320780 | 2 | 320780 | SRS1435431 | SAMN04957394 | simple | 942017 | activated sludge metagenome | UPWRP_SW_d3_r2 | NA | NA | NA | NA | NA | NA | no | NA | NA | NA | NA | NA | SRA425235 | NA | public | 3AEB6D6C4FE383F80D1E16E588C2D374 | 876D1E61221339EF202EAAEC93AD0C5C |
SRR3501854 | 2016-05-18 11:46:21 | 2016-05-13 11:41:11 | 29717524 | 8974692248 | 29717524 | 302 | 4935 | NA | https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501854 | SRX1759563 | 948 | WGS | RANDOM | METAGENOMIC | PAIRED | 0 | 0 | ILLUMINA | Illumina HiSeq 2500 | SRP075031 | PRJNA320780 | 2 | 320780 | SRS1435432 | SAMN04957395 | simple | 942017 | activated sludge metagenome | UPWRP_SW_d4_r2 | NA | NA | NA | NA | NA | NA | no | NA | NA | NA | NA | NA | SRA425235 | NA | public | D467387C3A275485CC8EA2025E6044ED | 9EB031A8BDAD3C2135E92CF3DBB29169 |
Great, the links to the SRA files are in the download_path column.
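Assuming the column order shown in the table above (Run first, download_path tenth), a quick cut is enough to pull the URLs out. The two-line CSV below is just a trimmed stand-in for the real runinfo output, which you’d save from the esearch | efetch pipe:

```shell
# Stand-in for the real file: esearch ... | efetch -format runinfo > runinfo.csv
cat > runinfo.csv <<'EOF'
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path
SRR3501849,2016-05-18,2016-05-13,25818676,7797240152,25818676,302,4224,NA,https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501849
EOF

# download_path is column 10; drop the header row, one URL per line
cut -d',' -f10 runinfo.csv | tail -n +2 > urls.txt
cat urls.txt
```

You could then feed urls.txt straight to wget (wget -i urls.txt) instead of a custom downloader.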
By the way, I found this awesome download script which combines pycurl + tqdm (a friend recommended it to me; if you were wondering what tqdm stands for, it means “progress” in Arabic: taqadum).
import os
import pycurl
from tqdm import tqdm

downloader = pycurl.Curl()

def sanitize(c):
    # conservative transfer settings: safe auth only, TLS verification on,
    # follow redirects (SRA download URLs redirect)
    c.setopt(pycurl.UNRESTRICTED_AUTH, False)
    c.setopt(pycurl.HTTPAUTH, pycurl.HTTPAUTH_ANYSAFE)
    c.setopt(pycurl.ACCEPT_ENCODING, b'')
    c.setopt(pycurl.TRANSFER_ENCODING, True)
    c.setopt(pycurl.SSL_VERIFYPEER, True)
    c.setopt(pycurl.SSL_VERIFYHOST, 2)
    c.setopt(pycurl.SSLVERSION, pycurl.SSLVERSION_TLSv1)
    c.setopt(pycurl.FOLLOWLOCATION, True)

def do_download(url, local, *, safe=True):
    rv = False
    with tqdm(desc=url, total=1, unit='b', unit_scale=True) as progress:
        xfer = XferInfoDl(url, progress)
        # when safe, download to a .tmp file and rename only on success
        if safe:
            local_tmp = local + '.tmp'
        else:
            local_tmp = local
        c = downloader
        c.reset()
        sanitize(c)
        c.setopt(pycurl.NOPROGRESS, False)
        c.setopt(pycurl.XFERINFOFUNCTION, xfer)
        c.setopt(pycurl.URL, url.encode('utf-8'))
        with open(local_tmp, 'wb') as out:
            c.setopt(pycurl.WRITEDATA, out)
            try:
                c.perform()
            except pycurl.error:
                os.unlink(local_tmp)
                return False
        if c.getinfo(pycurl.RESPONSE_CODE) >= 400:
            os.unlink(local_tmp)
        else:
            if safe:
                os.rename(local_tmp, local)
            rv = True
        progress.total = progress.n = progress.n - 1
        progress.update(1)
    return rv

class XferInfoDl:
    '''Progress callback: forwards libcurl's download counters to tqdm.'''
    def __init__(self, url, progress):
        self._tqdm = progress

    def __call__(self, dltotal, dlnow, ultotal, ulnow):
        n = dlnow - self._tqdm.n
        self._tqdm.total = dltotal or guess_size(dlnow)
        if n:
            self._tqdm.update(n)

def guess_size(now):
    '''Return a number strictly greater than `now`, used as a stand-in
    total while the server has not yet reported a content length.'''
    return 1 << now.bit_length()
K, so I’ve downloaded the SRA files; I just need to extract the FASTQ from the SRA. Which brings us to the SRA Toolkit.
It’s basically a collection of command-line tools for dealing with SRA files; the one we’re really interested in is fastq-dump.
It’s not exactly clear in NCBI’s README, but here’s what it does: fastq-dump tries to automatically download the SRA again even though you’ve got the local file ready.
Running fastq-dump -v shows you it’s trying to download from NCBI.
The rationale for this, I assume, is to prevent corrupted files, since there’s another tool in the toolkit, vdb-validate ./<filename>.sra, which checks the file’s integrity.
You could read the whole issues thread, but I think this user’s frustration sums it up for me as well:
@klymenko That is unacceptable. I do not need alignments. just the raw fastq files. This has nothing to do with RefSeq files. Further, neither fastq-dump -h nor online man pages say anything about accompanying refseq files. It simply says you can act on local SRA files. Further, all of the above validation tools approve of the downloaded SRA file
The owner of the repo goes on to threaten the poor fella who, just like me, is only trying to download files:
If you want help, please ask. If you want to flame, then I’ll close the issue.
LONG SIGH
So the prescribed way of doing this is actually to run the following if you haven’t downloaded the SRA yet:
prefetch <SRA ID>
fastq-dump <SRA ID>
Yes, you won’t even have to go through downloading 1. the Entrez tool to get 2. the runInfo.csv with the links to get 3. the SRA files.
And if you’ve already downloaded a local SRA file like me, you will have to run prefetch on it to check the local file; my guess is it stores the location for fastq-dump to recognise.
prefetch <localFile>
fastq-dump <localFile>
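The two prescribed steps can be scripted over the whole run list. As a sketch (the accession list and the jobs.sh name are made up for illustration), I’d generate the commands into a file first so you can eyeball them before running anything:

```shell
# Stand-in accession list; in real use this would come from column 1 of runinfo.csv
printf '%s\n' SRR3501849 SRR3501850 > accessions.txt

# One prefetch + fastq-dump pair per run, written out rather than executed
while read -r acc; do
  echo "prefetch $acc && fastq-dump --split-files --gzip $acc"
done < accessions.txt > jobs.sh
cat jobs.sh
```

Once the list looks right, `sh jobs.sh` (or feeding it to a job scheduler) does the actual work.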
The story deepens: it turns out the extraction of the FASTQ from the SRA is excruciatingly slow, and it’s not just me:
It’s been running for about 3 hours and so far extracted ~15GB of what I expect to be around 60GB. An improvement, but still not exactly fast…
I looked around for other solutions to speed this up and came across the GNU parallel tool:
parallel fastq-dump --split-files -F --gzip {} ::: *.sra
but it doesn’t really solve anything, since each file still has to be extracted by a single thread.
Thank god, I later stumbled across parallel-fastq-dump, which makes use of the -N and -X flags of the original fastq-dump to split the extraction over different spot ranges so it can be parallelised.
parallel-fastq-dump --sra-id SRR3501865 -F --threads 20 --outdir ../unzipped --split-files --gzip --tmpdir /scratch/uesu/
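My understanding (a sketch of the idea, not parallel-fastq-dump’s actual code) is that it just carves the spot range 1..total into chunks and runs one fastq-dump -N/-X job per chunk. The arithmetic might look something like this, using the spot count of SRR3501849 from the runinfo table above:

```shell
total=25818676   # spot count for SRR3501849, from the runinfo table
threads=4
chunk=$(( (total + threads - 1) / threads ))   # ceiling division

for i in $(seq 0 $(( threads - 1 ))); do
  start=$(( i * chunk + 1 ))
  end=$(( (i + 1) * chunk ))
  if [ "$end" -gt "$total" ]; then end=$total; fi
  echo "fastq-dump -N $start -X $end --split-files --gzip SRR3501849"
done > ranges.txt
cat ranges.txt   # each line is an independent job you could run in parallel
```

Since the ranges don’t overlap, the per-chunk outputs can simply be concatenated afterwards, which is why this parallelises so cleanly.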
The results are stunning
Conclusion
That’s all folks. The moral of the story: try to avoid downloading through NCBI if you can, and get the data straight from the source where possible. Have a good one!