I just released some tools, developed to evaluate chunking-based data deduplication techniques on various systems and to evaluate new chunking methods, as the new open source project "fs-c" (for file system chunking).
The fs-c tools allow you to analyze the internal and temporal redundancy of file system directories, as found by content-defined chunking using Rabin's fingerprinting method and by static chunking with different chunk sizes.
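For readers unfamiliar with the two approaches, here is a minimal Python sketch of the idea (this is not the fs-c code itself; the rolling hash below is a simple polynomial hash rather than a full Rabin fingerprint, and the window size, average mask, and min/max chunk sizes are illustrative assumptions): static chunking cuts at fixed offsets, while content-defined chunking cuts wherever a rolling hash over a small window matches a bit pattern, so boundaries depend on the content instead of the byte offset.

```python
def static_chunks(data: bytes, chunk_size: int = 8 * 1024):
    """Split data into fixed-size chunks (static chunking)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def content_defined_chunks(data: bytes,
                           window: int = 48,          # rolling-hash window (assumed)
                           avg_mask: int = 0x1FFF,    # ~8 KiB expected chunk size (assumed)
                           min_chunk: int = 2 * 1024,
                           max_chunk: int = 64 * 1024):
    """Split data where a rolling hash over the last `window` bytes matches a
    bit pattern, so an insertion only shifts boundaries locally."""
    MOD = (1 << 31) - 1                  # keep the hash small
    BASE = 257
    POW = pow(BASE, window - 1, MOD)     # weight of the byte leaving the window

    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # roll the hash: drop the byte leaving the window, add the new byte
        if i - start >= window:
            h = (h - data[i - window] * POW) % MOD
        h = (h * BASE + b) % MOD
        length = i - start + 1
        # boundary when the hash matches the mask, within min/max limits
        if (length >= min_chunk and (h & avg_mask) == 0) or length >= max_chunk:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

A real implementation would of course stream the data instead of chunking whole byte strings, but the sketch shows why the content-defined variant is more robust against shifted data.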
The goal is to give users a rough estimate of the redundancy a deduplication system would find for their concrete workload, and to provide a basis for further enhancements of the tools, e.g. application-specific chunking methods.
Currently the analysis is done using only an in-memory hash table, which limits the analyzed data set to a few hundred GB (or you need a large shared-memory system). I have also developed Hadoop MapReduce jobs to calculate the redundancies, but that code is not yet ready for publication.
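Conceptually, the in-memory analysis boils down to something like the following sketch (again, not the actual fs-c code: the directory walk, the SHA-1 chunk fingerprints, and the dict-based index are illustrative assumptions). Every chunk is fingerprinted, and bytes belonging to chunks whose fingerprint has been seen before count as redundant. The dict holding one entry per unique chunk is exactly the structure whose size limits the approach.

```python
import hashlib
import os


def estimate_redundancy(root: str, chunker) -> float:
    """Walk `root`, chunk every file with `chunker`, and return the fraction of
    bytes covered by chunks whose fingerprint was already seen."""
    seen = {}            # fingerprint -> chunk length: the in-memory table
    total_bytes = 0
    duplicate_bytes = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    data = f.read()
            except OSError:
                continue                     # skip unreadable files
            for chunk in chunker(data):
                fp = hashlib.sha1(chunk).digest()
                total_bytes += len(chunk)
                if fp in seen:
                    duplicate_bytes += len(chunk)
                else:
                    seen[fp] = len(chunk)
    return duplicate_bytes / total_bytes if total_bytes else 0.0


# e.g. compare the chunkers sketched above on the same directory:
#   estimate_redundancy("/home", content_defined_chunks)
#   estimate_redundancy("/home", static_chunks)
```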
Comment (automatically imported)
Author: Prakti
Date: Monday, 06 April 2009
Nice pun with the fs-c! Intended?