Softpanorama May the source be with you, but remember the KISS principle ;-)	Home	Switchboard	Unix Administration	Red Hat	TCP/IP Networks	Neoliberalism	Toxic Managers
	(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and bastardization of classic Unix

Compression of FASTA/FASTQ files

We will discuss here only one topic: FASTA/FASTQ compression. Currently there is no standardized compressor for FASTA/FASTQ files and often general purpose archivers are used as a substitute. Out of troika the most popular archivers (gzip/pigz bzip/pbzip2 and xz) pbzip2 has an edge.

	Sample	Compressor	Parameters	Orig size	Compressed size	Compress time		Compress speed	Decompress time		Speed relative to gzip -6	Size relative to gzip	Qratio (speed divided on square of size) All relative to gzip	% of space occupied by archive	Compression ratio
				(GB)	(GB)	Min	sec	MB/sec	Min	sec
1	FASTQ	gzip	-6	7.43	2.08	18	3	6.86	1	14	1.00	1.00	1.00	0.28	3.58
2	FASTQ	pigz		7.43	2.08	1	16	97.78	0	51	14.39	1.00	14.37	0.28	3.57
3	FASTQ	pigz	-9	7.43	2.04	2	32	48.89	0	51	7.20	0.98	7.47	0.27	3.64
4	FASTQ	bzip2		7.43	1.53	20	42	5.98	7	21	0.88	0.74	1.63	0.21	4.86
5	FASTQ	pbzip2		7.43	1.60	1	51	66.95	0	56	9.86	0.77	16.63	0.22	4.64
6	FASTQ	xz	-9	7.43	1.34	400	5	0.31	3	19	0.05	0.64	0.11	0.18	5.56
7	FASTQ	xz		7.43	1.53	206	1	0.60	3	19	0.09	0.74	0.16	0.21	4.86

There are two major categories of FASTA/FASTQ files compression:

reference-free methods. They in turn can be generic and specialized. Generic also can be single threaded or multi-threaded that can take advatage fo multi-core CPU that became dominant (for example, gzip is a single threaded and pigz is parallelized version that produced and can decompress gz files)
- generic Typically this is gzip/pigz (the power of inertia), more rarely bzip2/pbzip2 (the best combination of speed and compression ratio for this type of data), even more rarely xz; the latter has much slower compression speed than pbzip2). Those are generally limited by compression ration of five (archived file size is 20% or more of the original). Of them pbzip2 is currently the undisputable leader. Here is a relevant quote about bzip2 from the chapter 8 (Contextual Data Transforms Practical Implementations) of the book Understanding Compression Data Compression for Modern Developers Colt by McAnlis, Aleks Haecky (O'Reilly, 2016)
  
  BWT and DNA
  
  BWT has always been an edge-case of compression. Its initial existence showed really good results for text-based data, but it could never compete from a performance perspective with other algorithms such as GZIP. As such, BWT (or bzip2, the dominant BWT encoder) never really took the compression world by storm.
  
  That is, until humans began sequencing deoxyribonucleic acid, or DNA.
  
  Human DNA has a pretty simple setup with only four basic nucleotide bases, labeled A, C, G, and T. A given genome is basically a massive string containing these four symbols in various orderings. How much? Well, the human genome contains about 3.1647 billion DNA base pairs.
  
  It turns out that BWT’s block-sorting algorithm is an ideal transform that could be applied to DNA to make it more compressible, searchable, and retrievable. (There’s actually a boat-load of papers proving this.) The reduction in size and availability for fast reads are of high importance when aligning reads of new genomes against a reference.
  
  This just goes to show how there’s no single silver bullet when it comes to data compression. Each stream of information has its own variable characteristics and responds differently to different transforms and encoders. Although BWT might not have taken the web away from its cousin, GZIP, it stands alone as an important factor in the next few decades of bioinformatics.
- specialized that are based on the spcific characteristics of the FASTA/FASTQ files (FASTQZ, etc) such as small alphabet size (i.e., mainly four, nucleotides A (adenine), T (thymine), C (cytosine) and G (guanine)), repeats and palindromes. They can achieve higher compression ratio, but often at the cost of additional time both during compression and decompression. The simplest reference-free specialized method is based on conversion of sequences into 2 bit values with exceptions (for example letters N) encoded via special commands after this compressed string which overwrite the decoded string in specified places. It can achieve the compression ratio of 4 but postprocessing of encoded sequence with generic archivers does not improve it much. See for example fapack written by Mahoney (fastqz - FASTQ compressor)
  - fapack is a program for packing FASTA files 4 bases (A,C,G,T) per byte. fapacks also accepts lowercase (a,c,g,t) used to indicate repeats. It produces a larger reference genome but generally better compression.
    fapack.cpp v1.0 (Mar. 12, 2012).
    fapacks.cpp v1.0 (Mar. 13, 2012).
reference-based methods, that exploit one or a set of reference sequence(s) iether from some open source database, or one of FASTA/FASTQ file in a set for particular organism. They generally can provide higher compression ration.

In FASTA N bases need to stored via special commands for inserting substring of them into particular place. For FASTQ N bases generally do not need to be stored and they can be replaced with the most suitable letters - whichever takes the least space to encode. They can be replaced with N during the decompression phase based on the fact that only N has the quality is zero. (This should be enforced.)

Algorithms that do well with text compression are probably worth investigating, insofar as uncompressed FASTQ is structured text.

Good overview of compression of FASTQ is given in LFQC a lossless compression algorithm for FASTQ files Bioinformatics Oxford Academic

Compression of nucleotide sequences has been an interesting problem for a long time. Cox etal. (2012) apply Burrows–Wheeler Transform (BWT) for compression of genomic sequences. GReEn (Pinho etal., 2012) is a reference-based sequence compression method offering a very good compression ratio. (Compression ratio refers to the ratio of the original data size to the compressed data size). Compressing the sequences along with quality scores and identifiers is a very different problem. Tembe etal. (2010) have attached the Q-scores with the corresponding DNA bases to generate new symbols out of the base-Q-score pair. They encoded each such distinct symbol using Huffman encoding (Huffman, 1952). Deorowicz and Grabowski (2011) divided the quality scores into three sorts of quality streams:

quasi-random with mild dependence on quality score positions.

quasi-random quality scores with strings ending with several # characters.

quality scores with strong local correlations within individual records.

To represent case 2 they use a bit flag. Any string with that specific bit flag is processed to remove all trailing #. They divide the quality scores into individual blocks and apply Huffman coding. Tembe etal. (2010) and Deorowicz and Grabowski (2011) also show that general purpose compression algorithms do not perform well and domain specific algorithms are indeed required for efficient compression of FASTQ files. In their papers they demonstrate that significant improvements in compression ratio can be achieved using domain specific algorithms compared with bzip2 and gzip.

The literature on FASTQ compression can be divided into two categories, namely lossless and lossy. A lossless compression scheme is one where we preserve the entire data in the compressed file. On the contrary, lossy compression techniques allow some of the less important components of data to be lost during compression. Kozanitis etal. (2011) perform randomized rounding to perform lossy encoding. Asnani etal. (2012) introduce a lossy compression technique for quality score encoding. Wan etal. (2012) proposed both lossy and lossless transformations for sequence compression and encoding. Hach etal. (2012) presented a ‘boosting’ scheme which reorganizes the reads so as to achieve a higher compression speed and compression rate, independent of the compression algorithm in use. When it comes to medical data compression it is very difficult to identify which components are unimportant. Hence, many researchers believe that lossless compression techniques are particularly needed for biological/medical data. Quip (Jones etal., 2012) is one such lossless compression tool. It separately compresses the identifier, sequence and quality scores. Quip makes use of Markov Chains for encoding sequences and quality scores. DSRC (Deorowicz and Grabowski, 2011) is also considered a state of the art lossless compression algorithm which we compare with ours in this article. Bonfield and Mahoney (2013) have come up with a set of algorithms (named Fqzcomp and Fastqz) to compress FASTQ files recently. They perform identifier compression by storing the difference between the current identifier and the previous identifier. Sequence compression is performed by using a set of techniques including base pair compaction, encoding and an order-k model. Apart from hashing, they have also used a technique for encoding quality values by prediction.

Top Visited <p>Your browser does not support iframes.</p>					Switchboard
					Latest
					Past week
					Past month

NEWS CONTENTS

20180606 : Reliable Fastq compression programs ( Jan 03, 2016 , www.biostars.org )
20180521 : Time to zip very large (100G) files ( May 21, 2018 , superuser.com )
20180520 : How should I compress FastQ-format files ( May 20, 2018 , uppmax.uu.se )
20180520 : The Bioinformatics Adventure Comparing fastq compression with gzip, bzip2 and xz ( May 20, 2018 , dskernel.blogspot.com )
20180520 : Performance comparison of sequential and parallel compression applications for DNA raw data by Aníbal Guerra, Jaime Lotero, Sebastián Isaza ( Jun 10, 2016 , link.springer.com )

Old News ;-)

[Jun 06, 2018] Reliable Fastq compression programs

Notable quotes:

"... -Q {0,3,5} -s5+ -e -q 3 ..."

Jan 03, 2016 | www.biostars.org

2.6 years ago by Eric Normandeau ♦ 9.8k Quebec, Canada

Eric Normandeau ♦ 9.8k wrote:

I would like to use another compression algorithm on fastq files than gzip for long term storage. Decompressing and re compressing could free about half the disk space I am using, which would save me approximately 20-30 To of space now and more in the future. (Funny, for some people this is probably big and for others ridiculously small :) The bottom line here is that disk space is money and up-scaling means the costs won't be linear but probably more expensive.

I have found a few contenders, but the most solid of them is fqzcomp-4.6 ( http://sourceforge.net/projects/fqzcomp/files/) . It boasts very good compression rates (about twice smaller compressed files when compared to gzip), is fast enough, installs easily.

Now these are important data, so I need to be sure that in 2 or 5 or 10 years I will be able to get them back.

I saw this other question ( https://www.biostars.org/p/156303/#163646) , but I really want to know if one of these tools is a) good enough to improve compression b) dependable. The article cited by Charles Plessy doesn't help much in this regard.

For those of you that work in big groups, institutes, organisations... What fastq compression tool would you recommend from the point of view of data safety? What have you used with success?

** EDIT **

I did a quick compression ration comparison for gzip, bz2, and fqzcomp. I used the default parameter values for gzip and bz2 and -Q {0,3,5} -s5+ -e -q 3 for fqzcomp. Here are the compression ratios (compared to the non compressed files):
Algo     ratio
gzip     0.272
bz2      0.218
fqzcomp  0.101 - 0.181 (depending on the Q param value)

bz2 / gzip      0.801
fqzcomp / gzip  0.371
fqzcomp / bz2   0.463
From these figures, we can see that bz2 reduces files only by an additional ~20% when compared to gzip.

However, fqzcomp files can be as much as 2.7 times smaller than gzip ones and 2.2 times than bz2 files. This is why I am really considering this algorithm. Of course, these figures will change with different fastq files and it assumes you do not care much about the quality values (for Ion Proton, which we sequence in high volume, we actually don't really care that much) but the potential gain is considerable.
compression reliability fastq • 6.3k views ADD COMMENT • link •

Follow via messages

Follow via email

Do not follow

modified 12 months ago by Petr Ponomarenko • 2.5k • written 2.6 years ago by Eric Normandeau ♦ 9.8k 2
Even assuming the absolute worst case of a computer 5-10 years from now being entirely incompatible with your toolkit of choice -- which frankly seems so unlikely a possibility as to be a non-issue -- so long as you have the source code for "custom" or specialized compression and extraction tools, consider that 5-10 years from now, you could run current operating systems (and older compilers) within a VM hypervisor, like VirtualBox, Docker, etc. to extract archives. To help rebuild the environment down the road, it might help to keep a manifest with an archive, a README that describes the host, development tools and versions of things.
ADD REPLY • link modified 2.6 years ago • written 2.6 years ago by Alex Reynolds ♦ 24k
Is that enough to define dependability? What about bugs that are hard to spot? I don't know, I just want to find a tool I can trust with our group's data :) Mainly, I would like to get recommendations from big users too.
ADD REPLY • link modified 2.6 years ago • written 2.6 years ago by Eric Normandeau ♦ 9.8k 2
I guess it may depend on the details of your use case, but to me this seems like a lot of worry over a very very tiny part of the problem. I think that as long as you pick a reasonable compression tool where you have the source code and enough instructions to build it on a clean system, you'll be fine.

I would be much more worried about other aspects of the long-term storage problem, like adequate off-site backups and catastrophic fail-safe plans at the physical level. For instance, what if your backup provider goes out of business, the AC and emergency power in your datacenter fails and all your hard drives are damaged, or your systems are compromised and a rogue user tries to permanently delete many files? Or maybe something simpler, like system or backup account credentials being lost or forgotten as people move on? At your timescale and data sizes, I might even be worried about (relatively) far-fetched problems like bit rot .

So in sum, I guess I think that if decompressing the data is the only problem you have after X years, you would be lucky. I would personally focus on having enough independent off-site backups and then think about applying redundancy at other filesystem layers, like maybe with ZFS.
ADD REPLY • link written 2.6 years ago by matted ♦ 6.7k
A non-lossy compression tool should be deterministic, which I'd think is a minimum standard for dependability. In other words, if you run the same input bytes through the same compression or extraction algorithm in the same environment, you should get the same expected bytes as output on repeated trials. If you can take the environment out of the equation with a VM, then you just need to worry about the compression tool, which you could probably set up post-compression extraction tests to verify functionality.
ADD REPLY • link modified 2.6 years ago • written 2.6 years ago by Alex Reynolds ♦ 24k
Considering the files are on different servers, how would you go about spinning up a VM to (de)compress hundreds of files across different systems weighting a few dozen To? It doesn't seem too fun to me. I need something that will work, now and in the foreseeable future, on *NIX machines without giving me or a possible descendant headaches.

Also, I don't see the link between deterministic and dependable. It just means it will give the same output for the same input, not that the code is well written and not bug ridden.

I am looking for input about quality fastq compression tools (a very specific need) that I can depend on. Your answer is basically that all tools are equal as long as they work in a VM. I don't think I can agree.
ADD REPLY • link modified 2.6 years ago • written 2.6 years ago by Eric Normandeau ♦ 9.8k
Good idea about the manifest and info about the system.
ADD REPLY • link written 2.6 years ago by Eric Normandeau ♦ 9.8k
So what did you settle on?
ADD REPLY • link written 10 months ago by Yannick Wurm • 2.3k
Yannick, I am very interested too since we made our lossless compression algorithm Lossless ALAPY Fastq Compressor (now for MacOS X with 10-20% improved speed and compression ratio) and we think it is worth mentioning.
ADD REPLY • link written 9 months ago by Petr Ponomarenko • 2.5k 1
I never felt 100% sure about any of the alternative compression softwares, so I continued depending on gzip at the cost of having to buy more disk space. I would still love to find a better way but a major tradeoff is that fasta.gz and fastq.gz can be read by most bioinfo pieces of software so deviating from that format means a bit more work. This could be fine for long term storage though.

Anybody adopted something other than gzip or bz2 and would like to report?
ADD REPLY • link written 9 months ago by Eric Normandeau ♦ 9.8k 1
Have you tried using compressed files with decompression in process substitution in Linux <(...) or read from stdin?
fastqc <(decompress compressed.file)
ADD REPLY • link written 9 months ago by Petr Ponomarenko • 2.5k
You know, Clumpify's output is still just a gzipped fastq/fasta. The only difference is that the order of the reads is changed. So it's 100% compatible with all software that can read gzipped fastq. Or bzipped, for that matter.
ADD REPLY • link modified 9 months ago • written 9 months ago by Brian Bushnell ♦ 15k 2 23 months ago by John ♦ 12k Germany John ♦ 12k wrote:
I tried LFQC, but it had a bug where if the bundled precompiled binaries (for lpaq and zpac) didn't work, the ruby script that controls them would still print "created successfully!", then delete the work space after moving an empty tar file over the top of your original fastq.

I then tried fqzcomp, and its really really fast, and the output is tiny (for ENCODE's ENCFF001LCY.fq, which is 600Mb+ after gzip, fqzcomp has got it down to 250Mb) - but it has an unfortunate condition where it can't decompress what it wrote :/

You do not want to be stuck with a Floating point exception 8 10 years from now, thats for sure - so i think the answer is "everything is terrible, just stick to lzma" :) You'll only squeeze out a few more Mb with the other tools. If however you start using some of the lossy functions of the other tools, filesize will drop quickly. For example, binning quality scores, renaming the ids to just 1, 2, 3, 4... , converting all Ns to low-quality As, not retaining the original order of the fastq and instead sort entries by what compresses best, etc. But these methods all lose some information, so it might not be that exciting.
ADD COMMENT • link modified 23 months ago • written 23 months ago by John ♦ 12k 2
I would be sort of wary of any "lossless" compression that uses floating point anywhere...

I like lzma for personal use, but it seems pretty slow compared to low-compression or multithreaded gzip, for a big-data production environment. Is there a multithreaded implementation?

Also, when you say "not retaining the original order of the fastq and instead sort entries by what compresses best"... technically that's kind of not lossy. At least with Illumina, I think you can probably recover the original order (or close to it) by looking at the read names, and I've never thought the order was important except when diagnosing machine problems.

But if you are willing to discard that information, you might want to try Clumpify , a tool I wrote. It re-orders reads so that sequences sharing kmers are close together, quickly and without any mapping, and using an arbitrarily small amount of memory (how small depends on your system's file handle limit). This allows gzip to compress error-free reads generated from a bacterial genome down to the size of around the bacterial genome, even if there is, say, 40x coverage. This near-perfect compression requires you to replace the quality scores with a fixed value, and give the reads very short names (like 1, 2, 3, etc), and it works much better on long, single-ended reads (or merged pairs). But even with paired reads containing sequencing errors, and using the raw quality scores and names, you get a substantial increase in compression. The output fastq file is still a valid fastq file, and for purposes where you don't care about the order of the sequences (which are most purposes I care about), it will be no different... except faster in most cases like mapping, assembly, or kmer-counting, due to improved caching and branch prediction from similar reads being adjacent. Of course if you rename the reads with numbers you get better compression and can easily recover the original order as well.

P.S. I should note that core-counts are increasing, while IPC and frequencies have stagnated, and essentially been flat for 4 years on workloads I care about. 10 years from now, I expect multithreaded compression and decompression to be very important; a fast program capable of using 128 threads is crippled if it can only decompress at 160 MB/s, roughly the current limit of gzip... let alone lzma, which on my computer is many times slower.
ADD REPLY • link modified 23 months ago • written 23 months ago by Brian Bushnell ♦ 15k 2
Yes, a floating point error for something thats lossless and doesn't contain floating point numbers is a bit weird, but hey - at least it didn't delete my data -_-;

For parallel lzma, the official xz tool (which is the new compressor that does lzma2 compression) has a --threads option, but i've never used it. I'm unsure if you need to decompress with the same number of threads, or what exactly threads means here in terms of speedup. Theres are also a github project called pxz for "parallel xz", which looks like its the lzma cousin of pigz, but however you slice it your point of it not being anywhere close to as fast as gzip for big-data is valid.

Clumpify on the otherhand is an approach I haven't seen anywhere else. Usually in the encoding step before compression, data is split into names/dna/quality to help out the compressor, and converted to binary. You only sort on the DNA and don't convert to binary - which I actually prefer since that's where obscure "floating-point-esk" errors usually crop up, and it's going to be really fast compared to the other methods. And the output is valid FASTQ, and that FASTQ will be faster at being processed by downstream tools. I think thats a really really neat idea, and should probably be something built into the sequencers that are outputting FASTQs in the first place. It might not get the compressed file size down as low as fqzcomp, but I think it answers the OPs question perfectly by being undoubtedly the most reliable method.
ADD REPLY • link written 23 months ago by John ♦ 12k 1
Note: Clumpify.sh is part of BBMap suite .
ADD REPLY • link written 23 months ago by genomax ♦ 48k 2 12 months ago by Petr Ponomarenko • 2.5k United States / Los Angeles / ALAPY.com Petr Ponomarenko • 2.5k wrote:
Eric, I am very interested in what have you selected for fastq compression. We developed ALAPY Compressor Lossless ALAPY Fastq Compressor (now with stdin/stdout support) and technically it is reliable based on about 2000 different files that were compressed, decompressed, md5 sums compared and found to be exact in all cases. It is available on GitHub for free https://github.com/ALAPY/alapy_arc as a compiled binary for Linux and Windows. We hope GitHub will be around in 10 years and it will provide the current functionality of distributing these files. Could you please tell us what do you need to consider compression tool reliable and dependable?

Overall I am very interested in your thought about ALAPY Compressor in general, ie if the compression ratio is good, memory usage, features, etc.
ADD COMMENT • link written 12 months ago by Petr Ponomarenko • 2.5k 1 2.6 years ago by Brian Bushnell ♦ 15k Walnut Creek, USA Brian Bushnell ♦ 15k wrote:
BAM gives you some compression over fastq.gz (particularly if you map and sort first). And pigz produces gzip files, faster than gzip, which allows you to increase the compression level. There's also bz2, which has a parallel implementation and gives better compression than gz.

If you want to be confident in recovering the data at some point in the future... I would go with one of those rather than something that is not widely used.
ADD COMMENT • link written 2.6 years ago by Brian Bushnell ♦ 15k 1
I've gone the bz2 way for longer term storage. I haven't really studied the compression algorithm in detail so I don't know how much your data affects the level of compression. As recent examples, 2 x 8.9G fastq files compressed into a 3.3G tar.bz2 archive and 2 x 21G fastq files compressed into a 6.9G tar.bz2 archive. So, overall ca. 6X reduction in size. In these cases, the reads were quite heterogenic (QC'd HiSeq-sequenced metagenomes). LZMA could perhaps achieve a better ratio still although memory requirements may prevent its use..
ADD REPLY • link modified 2.6 years ago • written 2.6 years ago by 5heikki ♦ 7.3k
Hi, by fastq do you actually mean fastq.gz? Because if you mean fastq, then 8.9G to 3.3G seems less efficient than compressing with gz, or am I wrong?
ADD REPLY • link written 2.3 years ago by toharl • 0
2x8.9G fastq (total 17.8G) into 3.3G (total), i.e. ca. 6X reduction.
ADD REPLY • link modified 2.3 years ago • written 2.3 years ago by 5heikki ♦ 7.3k
In most cases, I do not have access to a reference genome or it would be incomplete and I would lose sequences, so BAM is not an option.

I know about pigz and use it on my computer but compression speed is not the major issue here. Space is. I am not convinced yet I want to go the more risky route of using a less known compression tool without some encouraging user stories from people who handle lots of data.

Thanks for the opinion.
ADD REPLY • link written 2.6 years ago by Eric Normandeau ♦ 9.8k
You can store unaligned reads in BAM just fine (all the mapping-related information is just blank). The sequences, quality scores, and read names are all there, so it's effectively lossless.

This has been proposed/pushed before (see here ) and discussed in various places like here and here , if you want to see some responses to it. For me, I like the cleanliness of the approach (a "universal" format), but it hasn't really caught on with the datasets and experiments that we see or do.
ADD REPLY • link written 2.6 years ago by matted ♦ 6.7k
Actually, If you save them as aligned BAM, the BAM files will be larger than fastq.gz, usually. I think. However, maybe you are right non-aligned BAM will be a little smaller compared with fastq.gz.
ADD REPLY • link written 18 months ago by Shicheng Guo • 4.8k

[May 21, 2018] Time to zip very large (100G) files

My experience suggest that compression rations in gzip are non-linear -- for example, you can sometime get slightly better compression ratio with -5 then with -6. but experiment itself is pretty educational.

May 21, 2018 | superuser.com

I find myself having to compress a number of very large files (80-ish GB), and I am surprised at the (lack of) speed my system is exhibiting. I get about 500 MB / min conversion speed; using top , I seem to be using a single CPU at approximately 100%.

I am pretty sure it's not (just) disk access speed, since creating a tar file (that's how the 80G file was created) took just a few minutes (maybe 5 or 10), but after more than 2 hours my simple gzip command is still not done.

In summary:
tar -cvf myStuff.tar myDir/*
Took <5 minutes to create an 87 G tar file
gzip myStuff.tar
Took two hours and 10 minutes, creating a 55G zip file.

My question: Is this normal? Are there certain options in gzip to speed things up?

Would it be faster to concatenate the commands and use tar -cvfz ?

I saw reference to pigz - Parallel Implementation of GZip - but unfortunatly I cannot install software on the machine I am using, so that is not an option for me. See for example this earlier question .

I am intending to try some of these options myself and time them - but it is quite likely that I will not hit "the magic combination" of options. I am hoping that someone on this site knows the right trick to speed things up.

When I have the results of other trials available I will update this question - but if anyone has a particularly good trick available, I would really appreciate it. Maybe the gzip just takes more processing time than I realized...

UPDATE

As promised, I tried the tricks suggested below: change the amount of compression, and change the destination of the file. I got the following results for a tar that was about 4.1GB:
flag    user      system   size    sameDisk
-1     189.77s    13.64s  2.786G     +7.2s 
-2     197.20s    12.88s  2.776G     +3.4s
-3     207.03s    10.49s  2.739G     +1.2s
-4     223.28s    13.73s  2.735G     +0.9s
-5     237.79s     9.28s  2.704G     -0.4s
-6     271.69s    14.56s  2.700G     +1.4s
-7     307.70s    10.97s  2.699G     +0.9s
-8     528.66s    10.51s  2.698G     -6.3s
-9     722.61s    12.24s  2.698G     -4.0s
So yes, changing the flag from the default -6 to the fastest -1 gives me a 30% speedup, with (for my data) hardly any change to the size of the zip file. Whether I'm using the same disk or another one makes essentially no difference (I would have to run this multiple times to get any statistical significance).

If anyone is interested, I generated these timing benchmarks using the following two scripts:
#!/bin/bash
# compare compression speeds with different options
sameDisk='./'
otherDisk='/tmp/'
sourceDir='/dirToCompress'
logFile='./timerOutput'
rm $logFile

for i in {1..9}
  do  /usr/bin/time -a --output=timerOutput ./compressWith $sourceDir $i $sameDisk $logFile
  do  /usr/bin/time -a --output=timerOutput ./compressWith $sourceDir $i $otherDisk $logFile
done
And the second script ( compressWith ):
#!/bin/bash
# use: compressWith sourceDir compressionFlag destinationDisk logFile
echo "compressing $1 to $3 with setting $2" >> $4
tar -c $1 | gzip -$2 > $3test-$2.tar.gz
Three things to note:

Using /usr/bin/time rather than time , since the built-in command of bash has many fewer options than the GNU command

I did not bother using the --format option although that would make the log file easier to read

I used a script-in-a-script since time seemed to operate only on the first command in a piped sequence (so I made it look like a single command...).

With all this learnt, my conclusions are

Speed things up with the -1 flag (accepted answer)

Much more time is spend compressing the data than reading from disk

Invest in faster compression software ( pigz seems like a good choice).

Thanks everyone who helped me learn all this! You can change the speed of gzip using --fast --best or -# where # is a number between 1 and 9 (1 is fastest but less compression, 9 is slowest but more compression). By default gzip runs at level 6.

robingrindrod
The reason tar takes so little time compared to gzip is that there's very little computational overhead in copying your files into a single file (which is what it does). gzip on the otherhand, is actually using compression algorithms to shrink the tar file.
The problem is that gzip is constrained (as you discovered) to a single thread.

Enter pigz , which can use multiple threads to perform the compression. An example of how to use this would be:
tar -c --use-compress-program=pigz -f tar.file dir_to_zip
There is a nice succint summary of the --use-compress-program option over on a sister site .
Floris Feb 29 '16 at 12:55

Thanks for your answer and links. I actually mentioned pigz in the question. –

David Spillett 21.6k 39 62

I seem to be using a single CPU at approximately 100%.

That implies there isn't an I/O performance issue but that the compression is only using one thread (which will be the case with gzip).

If you manage to achieve the access/agreement needed to get other tools installed, then 7zip also supports multiple threads to take advantage of multi core CPUs, though I'm not sure if that extends to the gzip format as well as its own.

If you are stuck to using just gzip for the time being and have multiple files to compress, you could try compressing them individually - that way you'll use more of that multi-core CPU by running more than one process in parallel.

Be careful not to overdo it though because as soon as you get anywhere near the capacity of your I/O subsystem performance will drop off precipitously (to lower than if you were using one process/thread) as the latency of head movements becomes a significant bottleneck.

Floris May 9 '13 at 15:09

Thanks for your input. You gave me an idea (for which you get an upvote): since I have multiple archives to create I can just write the individual commands followed by a & - then let the system take care of it from there. Each will run on its own processor, and since I spend far more time on compression than on I/O, it will take the same time to do one as to do all 10 of them. So I get "multi core performance" from an executable that's single threaded... –

[May 20, 2018] How should I compress FastQ-format files

Notable quotes:

"... However, this tool is experimental and is not recommended for general, everyday use. ..."

May 20, 2018 | uppmax.uu.se

Short answer: The best compression using a widely available format is provided by bzip2 and its parallel equivalent pbzip2. The best compression ratio for FastQ is provided by fqz_comp in the fqzcomp/4.6 module. However, this tool is experimental and is not recommended for general, everyday use.
Long answer
We conducted an informal examination of two specialty FastQ compression tools by recompressing an existing fastq.gz file. The first tool fqzcomp (available in the module fqzcomp/4.6) uses a compiled executable (fqz_comp) that works similar to e.g., gzip, while the second tool (LFQC in the module lfqc/1.1) uses separate ruby-language scripts for compression (lfqc.rb) and decompression (lfqcd.rb). It does not appear the LFQC scripts accept piping but the documentation is limited.
module load bioinfo-tools
module load fqzcomp/4.6
module load lfqc/1.1
Both modules have 'module help' available for more info. The help for fqzcomp gives the location of their README which is very helpful in describing minor changes that might occur to the FastQ file during decompression (these do not affect the read name, sequence or quality data).

One thing changed from the 'standard' implementation of LFQC was to make the scripts stand-alone with #! header lines, rather than requiring e.g., 'ruby lfqc.rb ...' as you see in their documentation.

Since piping is not available with LFQC, it is preferable to avoid creating a large intermediate decompressed FastQ file. So, create a named pipe using mkfifo that is named like a fastq file.
mkfifo UME_081102_P05_WF03.se.fastq
zcat UME_081102_P05_WF03.se.fastq.gz > UME_081102_P05_WF03.se.fastq &
lfqc.rb UME_081102_P05_WF03.se.fastq
rm UME_081102_P05_WF03.se.fastq
This took a long time, 310 wall seconds.

Next,fqz_comp from fqzcomp/4.6. Since this works like gzip, just use it in a pipe.
zcat UME_081102_P05_WF03.se.fastq.gz | fqz_comp > UME_081102_P05_WF03.se.fastq.fqz
This used a little multithreading (up to about 150% CPU) and was much faster than LFQC, just 2-3 seconds. There are other compression options (we tried -s1 and -s9+) but these did not outperform the default (equivalent to -s3). This is not necessarily a surprise; stronger compression means attempting to make better guesses and sometimes these guesses are not correct. No speedup/slowdown was noticed with other settings but the input file was relatively small.
-rw-rw-r-- 1 28635466 Mar 10 12:53 UME_081102_P05_WF03.se.fastq.fqz1
-rw-rw-r-- 1 29271063 Mar 10 12:52 UME_081102_P05_WF03.se.fastq.fqz9+
-rw-rw-r-- 1 46156932 Jun 6   2015 UME_081102_P05_WF03.se.fastq.gz
-rw-rw-r-- 1 28015892 Mar 10 12:53 UME_081102_P05_WF03.se.fastq.fqz
-rw-rw-r-- 1 24975360 Mar 10 12:45 UME_081102_P05_WF03.se.fastq.lfqc
We also compared against bzip2 and xz, which are general-use compressors. These both function like gzip (and thus like fqz_comp) and both outperform gzip, as expected. xz is becoming a more widely-used general compressor like bzip2, but for this file it required perhaps 20x as much time as bzip2 and did worse.
-rw-rw-r-- 1 35664555 Mar 10 13:10 UME_081102_P05_WF03.se.fastq.bz2
-rw-rw-r-- 1 36315260 Mar 10 13:10 UME_081102_P05_WF03.se.fastq.xz
Neither of these improved general-use compressors did as well with FastQ as the specialty compressors. This makes sense given the specialty compressors can take advantages of the restrictions of the format.

Which is the best method in this trial?
From the results of this trial, the tool that provides the best compression ratio in a reasonable amount of time is fqz_comp in the fqzcomp/4.6 module. It is as fast as bzip2 which is also much better than gzip but does a much better job of compressing FastQ. However, fqz_comp is experimental so we do not recommend using fqz_comp for everyday use. We recommend using bzip2 or its parallel equivalent, pbzip2.

The fqz_comp executable could be used to decompress FastQ within a named pipe if FastQ is required for input:
... <(fqz_comp -d < file.fastq.fqz) ...
Note that fqz_comp is designed to compress FastQ files alone, and neither method here provides the blocked compression format suitable for random access that bgzip does; see http://www.uppmax.uu.se/support/faq/resources-faq/which-compression-format-should-i-use-for-ngs-related-files/ for more on that subject.

Why not LFQC?
Though LFQC has the best compression of FastQ, there are some strong disadvantages. First, it takes quite a long time, perhaps 50x longer than fqz_comp. Second, it apparently cannot be used easily within a pipe like many other compressors. Third, it contains multiple scripts with multiple auxiliary programs, rather than a single executable. Fourth, it is quite verbose during operation, which can be helpful but cannot be turned off. Finally, it was difficult to track down for installation; two different links were provided in the publications and neither worked. It was finally found in a github repository, the location of which is provided in the module help.

[May 20, 2018] The Bioinformatics Adventure Comparing fastq compression with gzip, bzip2 and xz

May 20, 2018 | dskernel.blogspot.com

2012-11-06 Comparing fastq compression with gzip, bzip2 and xz
Storage is expensive. Compression is a lossless approach to reduce the storage requirements.
Sébastien Boisvert 2012-11-05

Table 1: Comparison of compression methods on SRR001665_1.fastq -- 10408224 DNA sequences of length 36. Tests were on Fedora 17 x86_64 with a Intel Core i5-3360M processor and a Intel SSDSC2BW180A3 drive. Tests were not run in parallel. The time is the real entry from the time command. Each test was done twice.

Compression
Time
Size (bytes)

none
0
2085533064
(100%)

time cat SRR001665_1.fastq | gzip -9 > SRR001665_1.fastq.gz
7m31.519s
7m20.340s
376373619
(18%)

time cat SRR001665_1.fastq | bzip2 -9 > SRR001665_1.fastq.bz2
3m12.601s
3m25.243s
295000140
(14%)

time cat SRR001665_1.fastq | xz -9 > SRR001665_1.fastq.xz
32m45.933s 257621508
(12%)

Table 2: Decompression tests. Each test was run twice.

Decompression
Time

time cat SRR001665_1.fastq.gz | gunzip > SRR001665_1.fastq
0m14.612s
0m13.247s

time cat SRR001665_1.fastq.bz2 | bunzip2 > SRR001665_1.fastq
1m3.412s
1m4.337s

time cat SRR001665_1.fastq.xz | unxz > SRR001665_1.fastq
0m24.194s
0m23.923s

It is strange that bzip2 is faster than gzip for compression.
Posted by Sébastien Boisvert at 00:07 1 comment:

Mark Ziemann said...

pbzip2 is very fast.

Monday, November 26, 2012 at 11:45:00 PM EST

Post a Comment

[May 20, 2018] Performance comparison of sequential and parallel compression applications for DNA raw data by Aníbal Guerra, Jaime Lotero, Sebastián Isaza

Jun 10, 2016 | link.springer.com
The Journal of Supercomputing
December 2016 , Volume 72, Issue 12 , pp 4696–4717 | Cite as

We present an experimental performance comparison of lossless compression programs for DNA raw data in FASTQ format files. General-purpose (PBZIP2, P7ZIP and PIGZ) and domain-specific compressors (SCALCE, QUIP, FASTQZ and DSRC) were analyzed in terms of compression ratio, execution speed, parallel scalability and memory consumption. Results showed that domain-specific tools increased the compression ratios up to 70 %, while reducing the runtime of general-purpose tools up to 7× during compression and up to 3×during decompression. Parallelism scaled performance up to 13× when using 20 threads.

Our analysis indicates that QUIP, DSRC and PBZIP2 are the best tools in their respective categories, with acceptable memory requirements.

Nevertheless, the end user must consider the features of available hardware and define the priorities among its optimization objectives (compression ratio, runtime during compression or decompression, scalability, etc.) to properly select the best application for each particular scenario.

Etc

Society

Groupthink : Two Party System as Polyarchy : Corruption of Regulators : Bureaucracies : Understanding Micromanagers and Control Freaks : Toxic Managers : Harvard Mafia : Diplomatic Communication : Surviving a Bad Performance Review : Insufficient Retirement Funds as Immanent Problem of Neoliberal Regime : PseudoScience : Who Rules America : Neoliberalism : The Iron Law of Oligarchy : Libertarian Philosophy

Quotes

War and Peace : Skeptical Finance : John Kenneth Galbraith :Talleyrand : Oscar Wilde : Otto Von Bismarck : Keynes : George Carlin : Skeptics : Propaganda : SE quotes : Language Design and Programming Quotes : Random IT-related quotes : Somerset Maugham : Marcus Aurelius : Kurt Vonnegut : Eric Hoffer : Winston Churchill : Napoleon Bonaparte : Ambrose Bierce : Bernard Shaw : Mark Twain Quotes

Bulletin:

Vol 25, No.12 (December, 2013) Rational Fools vs. Efficient Crooks The efficient markets hypothesis : Political Skeptic Bulletin, 2013 : Unemployment Bulletin, 2010 : Vol 23, No.10 (October, 2011) An observation about corporate security departments : Slightly Skeptical Euromaydan Chronicles, June 2014 : Greenspan legacy bulletin, 2008 : Vol 25, No.10 (October, 2013) Cryptolocker Trojan (Win32/Crilock.A) : Vol 25, No.08 (August, 2013) Cloud providers as intelligence collection hubs : Financial Humor Bulletin, 2010 : Inequality Bulletin, 2009 : Financial Humor Bulletin, 2008 : Copyleft Problems Bulletin, 2004 : Financial Humor Bulletin, 2011 : Energy Bulletin, 2010 : Malware Protection Bulletin, 2010 : Vol 26, No.1 (January, 2013) Object-Oriented Cult : Political Skeptic Bulletin, 2011 : Vol 23, No.11 (November, 2011) Softpanorama classification of sysadmin horror stories : Vol 25, No.05 (May, 2013) Corporate bullshit as a communication method : Vol 25, No.06 (June, 2013) A Note on the Relationship of Brooks Law and Conway Law

History:

Fifty glorious years (1950-2000): the triumph of the US computer engineering : Donald Knuth : TAoCP and its Influence of Computer Science : Richard Stallman : Linus Torvalds : Larry Wall : John K. Ousterhout : CTSS : Multix OS Unix History : Unix shell history : VI editor : History of pipes concept : Solaris : MS DOS : Programming Languages History : PL/1 : Simula 67 : C : History of GCC development : Scripting Languages : Perl history : OS History : Mail : DNS : SSH : CPU Instruction Sets : SPARC systems 1987-2006 : Norton Commander : Norton Utilities : Norton Ghost : Frontpage history : Malware Defense History : GNU Screen : OSS early history

Classic books:

The Peter Principle : Parkinson Law : 1984 : The Mythical Man-Month : How to Solve It by George Polya : The Art of Computer Programming : The Elements of Programming Style : The Unix Hater’s Handbook : The Jargon file : The True Believer : Programming Pearls : The Good Soldier Svejk : The Power Elite

Most popular humor pages:

Manifest of the Softpanorama IT Slacker Society : Ten Commandments of the IT Slackers Society : Computer Humor Collection : BSD Logo Story : The Cuckoo's Egg : IT Slang : C++ Humor : ARE YOU A BBS ADDICT? : The Perl Purity Test : Object oriented programmers of all nations : Financial Humor : Financial Humor Bulletin, 2008 : Financial Humor Bulletin, 2010 : The Most Comprehensive Collection of Editor-related Humor : Programming Language Humor : Goldman Sachs related humor : Greenspan humor : C Humor : Scripting Humor : Real Programmers Humor : Web Humor : GPL-related Humor : OFM Humor : Politically Incorrect Humor : IDS Humor : "Linux Sucks" Humor : Russian Musical Humor : Best Russian Programmer Humor : Microsoft plans to buy Catholic Church : Richard Stallman Related Humor : Admin Humor : Perl-related Humor : Linus Torvalds Related humor : PseudoScience Related Humor : Networking Humor : Shell Humor : Financial Humor Bulletin, 2011 : Financial Humor Bulletin, 2012 : Financial Humor Bulletin, 2013 : Java Humor : Software Engineering Humor : Sun Solaris Related Humor : Education Humor : IBM Humor : Assembler-related Humor : VIM Humor : Computer Viruses Humor : Bright tomorrow is rescheduled to a day after tomorrow : Classic Computer Humor

The Last but not Least Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand ~Archibald Putt. Ph.D

Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...

You can use PayPal to to buy a cup of coffee for authors of this site

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. You you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: March, 12, 2019

Compression of FASTA/FASTQ files

BWT and DNA

NEWS CONTENTS

Old News ;-)

[Jun 06, 2018] Reliable Fastq compression programs

Notable quotes:

"... -Q {0,3,5} -s5+ -e -q 3 ..."

Jan 03, 2016 | www.biostars.org

[May 21, 2018] Time to zip very large (100G) files

My experience suggest that compression rations in gzip are non-linear -- for example, you can sometime get slightly better compression ratio with -5 then with -6. but experiment itself is pretty educational.

May 21, 2018 | superuser.com

[May 20, 2018] How should I compress FastQ-format files

Notable quotes:

"... However, this tool is experimental and is not recommended for general, everyday use. ..."

May 20, 2018 | uppmax.uu.se

[May 20, 2018] The Bioinformatics Adventure Comparing fastq compression with gzip, bzip2 and xz

May 20, 2018 | dskernel.blogspot.com

[May 20, 2018] Performance comparison of sequential and parallel compression applications for DNA raw data by Aníbal Guerra, Jaime Lotero, Sebastián Isaza

Jun 10, 2016 | link.springer.com

Recommended Links

Google matched content

Softpanorama Recommended

Top articles

Sites

Etc

Compression	Time	Size (bytes)
none	0	2085533064 (100%)
time cat SRR001665_1.fastq \| gzip -9 > SRR001665_1.fastq.gz	7m31.519s 7m20.340s	376373619 (18%)
time cat SRR001665_1.fastq \| bzip2 -9 > SRR001665_1.fastq.bz2	3m12.601s 3m25.243s	295000140 (14%)
time cat SRR001665_1.fastq \| xz -9 > SRR001665_1.fastq.xz	32m45.933s	257621508 (12%)

Decompression	Time
time cat SRR001665_1.fastq.gz \| gunzip > SRR001665_1.fastq	0m14.612s 0m13.247s
time cat SRR001665_1.fastq.bz2 \| bunzip2 > SRR001665_1.fastq	1m3.412s 1m4.337s
time cat SRR001665_1.fastq.xz \| unxz > SRR001665_1.fastq	0m24.194s 0m23.923s