The answer to that is datasets are growing at a faster rate than storage throughput or network bandwidth.
100 MB of data can be compressed to 10 MB in less than a second and transferred in under a second over a 100 Mbps connection.
Most decompression systems run at 100 to 500 MB/s meaning the compression overhead isnt enough to burden the wall clock time needed to deliver the original 100 MB of content.
Not only does this make best use of the network connection bottleneck, it frees up resources for other payloads.
Also consider that if you can only read at 100 MB/s off a mechanical drive but your CPU can decompress data at ~500 MB/s then the mechanical drive is able to provide 5x the throughput youd otherwise expect thanks to compression.
The world of compression is a rabbit hole of intellectual curiosity.
In this post Ill describe the highlights Ive come across while trying to improve compression performance in database systems I deploy.
From Entropy to Lempel-Ziv Entropy, an Information Theory term coined by Claude Shannon in 1948, describes the minimum number of bits, on average, needed to encode a dataset.
Think of entropy as the number of yes/no questions that would need to be answered in order to describe any one piece of information within a dataset.
Compression aims to reduce that ceiling of bits by transforming data in such a way that patterns emerge exposing duplication within a given dataset.
The act of producing optimally large, repeating tokens within a dataset was the key concept behind Abraham Lempel and Yaakov Ziv work on the LZ77 and LZ78 algorithms developed in 1977 and 1978 respectively.
Their work has had a huge impact on the field of lossless compression.
Many algorithms used in popular software today are variations and tweaks to those two algorithms developed in the late 1970s.
Below is a family tree of related dictionary-based compression algorithms which trace their linages back to LZ77 and LZ78.
Ive also included the year of earliest publication I could find for each of them and a few noteworthy pieces of software and file format implementations.
├── LZ77 (1977) │ ├── LZR (1981) │ ├── LZSS (1982, WINRAR) │ │ ├── LZB (1987) │ │ └── LZH (1987) │ ├── DEFLATE (1989, ZIP, gzip, PNG, zlib) │ │ └── DEFLATE64 (1999) │ ├── ROLZ (1991) │ │ ├── LZP (1995) │ │ └── LZRW1 (1991) │ │ ├── LZJB (1998, ZFS) │ │ └── LZRW1-A, 2, 3, 3-A, 4, 5 (<1996 and onward) │ ├── LZS (1994, Cisco IOS) │ ├── LZX (1995, CHM files) │ ├── LZO (1996, Nintendo 64, PlayStation) │ ├── LZMA (1998, 7ZIP, xz) │ │ ├── LZMA2 (2009) │ │ └── LZHAM (2009) │ ├── Statistical Lempel-Ziv (2001) │ ├── SLZ (<2009) │ ├── LZ4 (2011, Hadoop, ZFS, bcolz) │ │ └── ZStandard (2015, Redshift, ORC, Parquet, RocksDB, bcolz) │ ├── Snappy (2011, ORC, Parquet, MariaDB, Cassandra, MongoDB, Lucene, bcolz) │ └── Brotli (2013, Google Chrome, Firefox, nginx) │ └── LZFSE (2015, iOS, Mac OSX) └── LZ78 (1978) ├── LZW (1984, PKZIP, GIF, TIFF, PDF) │ ├── LZMW (1985) │ │ └── LZAP (1988) │ ├── LZC (2007) │ │ └── LZT (1987) │ └── LZWL (2006) └── LZJ (1985) And for contrast, these are a few non-dictionary-based compression systems.
├── PPM (1984) │ └── PAQ (2002) │ └── 20+ Variants (2003-2009) └── bzip2 (1996) As you can see, the compression world may not be fast-moving but rarely does a year past without some iteration and improvement.
Lossless Compression Benchmarks In 2015, Przemyslaw Skibinski started a project on GitHub called lzbench.
This project aims to compile the source code of a wide range of lossless compression algorithms into a single binary and then benchmark them against various workloads.
The projects README contains benchmark results for a 212 MB file extracted from the Silesia compression corpus.
Each of the compression methods used are run with a variety of settings; this gives an idea of what sort of results could be expected from a performance tuning exercise.
In the above benchmark results, blosclz, density, lz4, lz4fast, lzo1x and shrinker were all tuned to compress at more than 400 MB/s while maintaining at least a 5:3 compression ratio.
The benchmark shows memcpy, which simply copied the data from one memory location to another, running at ~8.
This transfer rate can be seen as a ceiling on the hardware used given the computational overheads of each of the compressors (C-Blosc claims it can get past memory-bound performance issues but Ive yet to study their claim in detail).
The density and lz4fast benchmarks stand out as their compression speeds of 800 and 785 MB/s respectively, were the fastest of any compressor able to achieve a 3:2 compression ratio.
Of the compressors that were able to achieve a 3:1 compression ratio, only Zstandard could do so at 88 MB/s.
More than 75% of the other compressors that could achieve that compression ratio couldnt do so at any more than 15 MB/s, some 6x slower than Zstandard.
The vast majority of decompressors could break the 250 MB/s barrier, 25% of the decompressors broke the 1 GB/s barrier and a few, including LZ4, could be tuned to decompress in excess of 2 GB/s.
Decompression at this rate would demand either RAM drives, RAID 0 or NVMe storage in order to keep up with these levels throughput.
The SATA bus on most systems is limited to 1.
2 GB/s so this would be a bottleneck in need of addressing if it were included in the storage pipeline.
Lastly, the compression ratio of more than 4:1 that xz achieved is interesting.
This compressor is popular in packaging software.
A software package could be downloaded 100s of millions of times.
Its worth the effort to find the best compression ratio possible given the amount of bandwidth and the diversity of network connections involved in the distribution of said software.
This can excuse the 2 MB/s compression rate xz managed during the compression process.
Do bear in mind some decompression tools can require an excessive amount of memory and compute power; some of the high-ratio compression systems suffer from this greatly so this trade off should be considered as well.
Sort, then Compress Many lossless compression systems can be aided by being fed sorted data.
The sliding window compressors use rarely cover the entire dataset and the more clustered the values the easier it is to detect repeating patterns.
Most SQL-based systems dont guarantee the order of rows returned without an ORDER BY clause so the sorted form of the data on disk should be of little concern (with Redshifts sort key a notable exception).
Below Ive setup a standalone Hadoop system running HDFS and Presto using the instructions from my Hadoop 3 Install Guide.
Ive taken the first 6 ORC files representing 120 million of the 1.
1 billion taxi rides that have taken place in New York City over six years.
This post describes how I produced this dataset in CSV format and Ive run a large number of benchmarks where Ive converted that CSV data into ORC format before examining query performance.
Below are a few modifications Ive made to the Presto configuration in my stand alone guide.
First, to sort 120M rows of data in Presto will require a memory limit of at least 8 GB.
$ sudo vi /opt/presto/etc/config.
properties coordinator=true node-scheduler.
uri=http://localhost:8080 $ sudo /opt/presto/bin/launcher restart Next, Ill create a warehouse folder as the tables created below will run via the Hive connector and it defaults to store tables it helps create along with Presto in /user/hive/warehouse on HDFS.
$ hdfs dfs -mkdir -p /user/hive/warehouse Ill then copy the first six ORC files I have saved in my home folder onto HDFS.
$ hdfs dfs -mkdir /trips_orc $ hdfs dfs -copyFromLocal ~/orc/00000[0-5]_0 /trips_orc/ Ill create a schema for the trips_orc table in Hive.
This lets Presto do schema-on-read and understand the column layout of the ORC files.
$ hive CREATE EXTERNAL TABLE trips_orc ( trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag STRING, rate_code_id SMALLINT, pickup_longitude DOUBLE, pickup_latitude DOUBLE, dropoff_longitude DOUBLE, dropoff_latitude DOUBLE, passenger_count SMALLINT, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type STRING, trip_type SMALLINT, pickup STRING, dropoff STRING, cab_type STRING, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel STRING, pickup_borocode SMALLINT, pickup_boroname STRING, pickup_ct2010 STRING, pickup_boroct2010 STRING, pickup_cdeligibil STRING, pickup_ntacode STRING, pickup_ntaname STRING, pickup_puma STRING, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel STRING, dropoff_borocode SMALLINT, dropoff_boroname STRING, dropoff_ct2010 STRING, dropoff_boroct2010 STRING, dropoff_cdeligibil STRING, dropoff_ntacode STRING, dropoff_ntaname STRING, dropoff_puma STRING ) STORED AS orc LOCATION '/trips_orc/'; The following SQL will create four new tables.
Each will be in Snappy-compressed, ORC format.
Each one will pick a different field to sort on.
Note Im only storing 4 columns from the original table for the sake of both time and memory consumption during this operation.
$ vi sort.
sql CREATE TABLE sorted_by_vendor_id WITH (format='ORC') AS SELECT trip_id, vendor_id, pickup_datetime, pickup_longitude FROM trips_orc ORDER BY vendor_id; CREATE TABLE sorted_by_trip_id WITH (format='ORC') AS SELECT trip_id, vendor_id, pickup_datetime, pickup_longitude FROM trips_orc ORDER BY trip_id; CREATE TABLE sorted_by_pickup_datetime WITH (format='ORC') AS SELECT trip_id, vendor_id, pickup_datetime, pickup_longitude FROM trips_orc ORDER BY pickup_datetime; CREATE TABLE sorted_by_pickup_longitude WITH (format='ORC') AS SELECT trip_id, vendor_id, pickup_datetime, pickup_longitude FROM trips_orc ORDER BY pickup_longitude; Ill execute the above with Presto.
$ presto –server localhost:8080 –catalog hive –schema default –file sort.
sql The resulting table sizes were as follows: GB | sorted by ———————- 0.
91 | pickup_longitude 1.
06 | trip_id 1.
12 | pickup_datetime 1.
33 | vendor_id The largest table is 1.
46x bigger than the smallest.
There is an argument that one should test the first 50K-odd records of any table against all possible sort keys when compressing a given dataset.
There is no speeding up a solid state drive by 1.
46x but the above happily reduced the throughput requirements by that amount.
If the memory usage of sorting your entire dataset exceeded your clusters capacity you could look to sort on each table partition one at a time instead.
Good is not the enemy of perfect and reclaiming storage and throughput capacity is helping you make the most of your clusters hardware.
An Attack Vector Compression has also been an attack vector for decades now.
There have been "Compression Fuzzing" efforts to help uncover vulnerabilities but given decompression utilities are some of the most widely-deployed software in the world, any exploit can have a global impact.
A few examples include the "ZIP of death" which is a 42 KB ZIP file that extracts 4.
5 PB of data.
Unsuspecting web applications that decompressed user-submitted content need to be hardened for this sort of attack.
BREACH exploited an HTTPS compression vulnerability and CRIME was another exploit disclosed around the same time that worked over HTTPS and SPDY connections that used compression.
Incidents like the above are making compression designers contemplate what attack vectors could result from forth-coming dictionary encoding methods.
.. More details