This post will introduce the concept of deduplication and a Go package I have written to help you implement it. It is inspired by the work on the zpaq journaling archiver done by Matt Mahoney. If you are looking for a pure commandline tool, I can also recommend SuperRep by Bulat Ziganshin. This package will, however, provide you with a streaming/indexed format you can use in your own Go programs.

For the impatient, you can find the package here:

**What is deduplication?**

Deduplication divides an input into blocks and creates a hash for each of them. If it re-encounters a block with the same hash, it simply inserts a reference to the previous block. If you consider an input, it will look like this:

While this may sound like ordinary compression, it has some key differences. First of all, only complete blocks are checked. In compression like deflate, snappy, etc., you are looking for matches for each individual byte, in a limited window, for example 32KB-1MB. Deduplication will work across gigabytes and even terabytes of data, since it only needs to store a hash for each block.

The main downside is that data will only be deduplicated if the complete block matches. In contrast to compression, you will not get references to partial data inside the block. However, the speed difference to ordinary compression is significant: you can expect more than 1GB/s to be deduplicated with fixed block sizes. In fact, it is often a good idea to compress the deduplicated output, since the content of the original file is kept as-is. That means you still get good compression, but have to compress less data, giving you overall smaller output.
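To make the idea concrete, here is a small, self-contained sketch of the concept (not the package itself, just an illustration): it splits the input into fixed-size blocks, hashes each block, and stores only a reference whenever a block is seen again.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// deduplicate splits data into fixed-size blocks and replaces repeated
// blocks with a reference to the first occurrence. It returns the unique
// blocks and, for every input block, the index of the unique block it maps to.
func deduplicate(data []byte, blockSize int) (unique [][]byte, refs []int) {
	seen := make(map[[sha256.Size]byte]int) // block hash -> index into unique
	for len(data) > 0 {
		n := blockSize
		if n > len(data) {
			n = len(data)
		}
		block := data[:n]
		data = data[n:]

		h := sha256.Sum256(block)
		idx, ok := seen[h]
		if !ok {
			// First time we see this block: store it.
			idx = len(unique)
			unique = append(unique, block)
			seen[h] = idx
		}
		// Either way, the output only needs a reference to the stored block.
		refs = append(refs, idx)
	}
	return unique, refs
}

func main() {
	input := []byte("AAAABBBBAAAACCCCAAAABBBB") // six 4-byte blocks, several repeated
	unique, refs := deduplicate(input, 4)
	fmt.Printf("%d input blocks stored as %d unique blocks, refs: %v\n",
		len(refs), len(unique), refs)
}
```

For that 24-byte example input, the 6 blocks are stored as 3 unique blocks with the reference list [0 1 0 2 0 1], which is exactly the kind of saving deduplication aims for.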
**When should I use deduplication?**

*"Benson Triplets" by Christopher Allison Photography (cc-ny-nd). Sometimes you wish you only had one.*

Deduplication will only give you an advantage if you expect there are duplicates. This is of course obvious, but let's list a few possible scenarios where that could occur. Of course duplicate data is unnecessary, but it will happen unintentionally.

You will often find duplicated data in harddisk images, since deleted files aren't zeroed out. This includes virtual machine images. If you know the block size of your file system and deduplicate based on that size, you will find a surprising amount of duplicate data.

Often when you do bulk backups you will have duplicate data in various places on your computer. A lot of data streams that synchronize servers are more optimized for speed than size, so they are likely to have many duplicate entries. It could be a copy of a database you've made, etc.

If we look at the previous illustration, we can see how we can manage memory. When we are decoding, we keep blocks 3 & 4 in memory until they have been written. This means we can deallocate blocks like this when decoding:

*Input stream divided into blocks. Below, in red, is marked when we can de-allocate a block when decoding from indexed streams.*

But you may have noticed that for this to work, we need to know in advance if we will need a block again. So, since we cannot predict the future, there is an index created that basically tells the decoder when it is OK to throw away a block. That is why there exists an indexed mode, which I strongly recommend you use if possible.

The non-indexed mode must have a maximum decoder RAM specified when encoding. Since it doesn't know if a block will be used in the future, it will simply keep the number of blocks in memory that it is allowed to.

A third option when decoding is to use a `NewSeekReader`. This will never keep any blocks in memory, but will instead need to be able to seek on the input content. This is of course sub-optimal compared to the indexed mode. To determine if you should use a SeekReader, you can use `MaxMem()` once the index has been parsed to get the real memory usage. If it is too high, you can simply use a `NewSeekReader` instead of a `NewReader` (a small sketch of this decision is shown after the chart below).

There are two types of blocks: Fixed Size blocks, which all have the same size, and Dynamic blocks, which have a variable size. "Fixed size" blocks are easier to deal with in terms of memory use and are significantly faster. Dynamic blocks offer better deduplication. The maximum size of each block can also be adjusted. Smaller block sizes typically offer better deduplication, but are a little slower and add indexing overhead. So again you have to adjust to your input.

To simplify your choice when encoding I have made this chart:
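To make the reader choice above concrete, here is a minimal sketch of deciding between `NewReader` and `NewSeekReader` with `MaxMem()`. The import path, file names, memory budget and exact function signatures are assumptions for illustration only, so check the package documentation before copying this.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"

	"github.com/klauspost/dedup" // assumed import path for the package described in this post
)

// openDedupReader parses the index first and checks how much memory the
// indexed reader would need. If that exceeds the given budget, it falls back
// to a seeking reader, which keeps no blocks in memory but must be able to
// seek in the block data. Signatures here are assumptions based on this post.
func openDedupReader(index, blocks *os.File, memLimit int) (io.Reader, error) {
	r, err := dedup.NewReader(index, blocks)
	if err != nil {
		return nil, err
	}
	if r.MaxMem() <= memLimit {
		// Indexed mode fits our budget; this is the preferred mode.
		return r, nil
	}
	// Too much memory would be needed: rewind both inputs and use the
	// seeking reader instead.
	if _, err := index.Seek(0, io.SeekStart); err != nil {
		return nil, err
	}
	if _, err := blocks.Seek(0, io.SeekStart); err != nil {
		return nil, err
	}
	return dedup.NewSeekReader(index, blocks)
}

func main() {
	// "stream.idx" and "stream.blocks" are hypothetical file names.
	index, err := os.Open("stream.idx")
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()
	blocks, err := os.Open("stream.blocks")
	if err != nil {
		log.Fatal(err)
	}
	defer blocks.Close()

	r, err := openDedupReader(index, blocks, 64<<20) // allow up to 64 MB for decoding
	if err != nil {
		log.Fatal(err)
	}
	// Consume (or restore) the original, deduplicated stream here.
	n, err := io.Copy(io.Discard, r)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("decoded bytes:", n)
}
```

The 64 MB budget is just an example; pick whatever your environment can spare for decoding.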