dirkmeister.de: 11/01/2008

How is data compression fundamentally different from data de-duplication? I really don't see it. But I'm not convinced that there is no need for a new word either.

Jon Bentley presented a compression algorithm for finding und eliminating common long strings in 1999. The approach works by dividing a stream of bytes into chunks of static size (he uses 1K chunks for the evaluation) and then hash them using Rabin's fingerprinting method. He maintains a table of all fingerprints seen up to date, and there he lookup up the hash value of a new chunk to check if a chunk with the same value was seen before. If there is such a collision, he checks these two chunks byte for byte. If the chunk was not seen before, the hash value is added to the table. If the chunk was there already, a link to an existing occurrences is stored.

The hash-based data de-duplication approach works as follows: The divide a stream of bytes into chunks (often using a more elaborate chunking method that is often falsely called "Rabin Fingerprints" because it used that fingerprinting technique), and then hash them using a cryptographic fingerprinting method (like SHA-1). Dedup systems maintain an index that contains all fingerprints seen up to data, and there they lookup up the fingerprint of the new chunk to check if the chunk was seen before. If the chunk was not seen before, the chunk is stored. If a chunk contains the same data (whp.) the data is not stored.

Looking similar?

In contrast to Jon Bentleys approach, it is common to skip the byte-by-byte comparison for performance reasons claiming that the collision probability using a cryptographic hash value is much lower than e.g. a random byte flip in memory. If there is a data loss, it is whp. not caused by a fingerprint collision. However, even that is not inherent to the de-duplication approach. You could easily perform a byte-by-byte-comparision in dedup systems, too.

One problem — from a students perspective writing a master thesis about that topic — is that before 2008 I found no evidence of the term "data de-duplication" (in a storage context, not in a data mining context) in research literature. The first usage I found was Zhu's paper about Data Domain's ways to avoid the disk bottleneck of hash-based data de-duplication.
There was an interesting discussion about the difference between the storage blogs "Backup Central" and "StorageMojo".
A StorageMojo author says:

I still don’t get why the industry refers to “de-duplication” rather than compression - why use a well-understood term when you can invent a new one? - but they did make the point that compression rates depend on your data types, backup policies and retention policies. Basically the more stuff stays the same the higher your back up compression rate.

W. Curtis Preston of "Backup Central" takes on that.

There are different definitions to distinguish compression from data de-duplication. Here are few tries:

Most often (Zhu, Kulkarni) it is claimed that the difference is that compression founds data only in a single file, while data de-duplication looks at multiple files. This is true e.g. for the zip compression application, it is a bit blurred by tar.gz, but the claim still seems to hold. But fundamentally, this is not more than an implementation detail. I have no trouble writing a small compression app that tries to find common strings from multiple files. Is that a compression application? Sure! In general compression is defined as working over a stream of bytes. If that stream is a single file, multiple files, or what ever is not important at a conceptually level. If the byte-stream stops producing new data for a week and than continue to produce new bytes like in a backup scenario also seems not important to me at a conceptually level.
In a recent SNIA white paper defines data de-duplication "as the process to examining a data set or byte stream at the sub-file level and storing and/or sending only unique data". Well, this differentiates dedup from "Content Addressable Storage"(CAS) or "Single Instance Storage"(SIS), but not from compression. The "unique data"-term wonders me a bit. All dedup systems I aware of (hash-based systems for sure and also Diligent's implementation of "Delta Encoding via Resemblance Detection") are very course grained approaches for finding redundancy. E.g. hash-based systems miss every redundancy smaller than a chunk size. They may miss large chunks of data if static-sized chunks are used (like in the Venti system) and a small shift has changed the data. Claiming that only unique data is stored, is not exactly true. A dedup systems stores no redundant data that it has classified as redundant. That is a difference.
The same SNIA paper defines compression as "the encoding of data to reduce its storage requirements". The referral to the "encoding" is a good point. Dedup systems use a permanent table to lookup, while most compression approaches use a temporary table and that is used to find a small encoding for the table, e.g., Huffman encodings. But wait a moment. A filesystem de-duplication system may consist of three parts a) A storage of all as new classified chunks b) an index with a mapping from the chunk fingerprint to the storage component and c) an index mapping from an Inode to a list of chunk fingerprints (together with offset and sizes and so). Can the value of an entry in the last index not be seen as an encoding of the file? I don't know. The point is not bad.
In a whitepaper from Diligent, they differentiate the two terms by claiming that compression finds redundancies only in a short "sliding window". Well, I have found no sliding window in Jon Bentleys compression algorithm.
W. Curtis Preston from Backup Central refers also to the file-by-file- or backup-by-backup property of compression, while de-duplication finds redundancies between multiple backups. Another difference is that de-duplication compression ratios are based on the types of data and how the backup is done. "Repeated incremental backups of brand new data (e.g., seismic data) would not de-dupe at all". This limits data de-duplication to the one (important) area of backup- and archival storage. De-duplication is very effective in that area, but not limited to it.

There are some approaches to differentiate the terms, but there is at least no clear and conscience definition and separation.

A commentator at "Backup Central" pointed out that data de-duplication "has more in common with image compression than text compression" and both dedup classes "look like out-of-order MPEG-4 compression". I honestly have no clue about image compression and MPEG-4 (I should!), but maybe he is right. It seems like the text compression approaches based on LW and the implementations like ZIP are simply stated as "compression". The whole concept of compression is reduced to that.

For me the goals are the same: To save storage capacity or bandwidth usage by finding and eliminating redundancies. The approaches are sometimes remarkable similar, while the implementations are not. May be de-duplication is just a subfield of compression

in that redundancies are found over all seen data,
used to build storage systems like block devices and file systems.

I'm really not sure about the formulation and the scope. I included the second point because a compression app that finds redundancies over all seen data and stores them in a single file, would I not call deduplication application. The first point is important because a storage system that applies only "classical" text compression wouldn't be a deduplication system.

dirkmeister.de

Thursday, November 20, 2008

Monty Python on YouTube

Monday, November 17, 2008

I will start at the Paderborn Center for Parallel Computing

Thursday, November 13, 2008

Es wurde ein Zeichen gesetzt

Thursday, November 06, 2008

Compress, Encrypt and then remove redundancies:Really?

Tuesday, November 04, 2008

Compression vs. Data Deduplication