Efficiently synchronize copies of a large sparse file locally. I deal with a large number of large sparse files because of virtualization and other technologies. Because of their size, often only a small number of blocks contain data and, of these, only a small number are changed and need to be backed up. Using a log-based (snapshotting) file system on a USB 2 backup device, I only want to write blocks if absolutely necessary.

Out of the box there are two methods to achieve this, both with pros and cons.  The first is cp.

cp -p --sparse=always oldfile newfile

This has the advantage of creating the most compact destination file. Any blocks that have been written as all zeros will become sparse in the destination file and take up no space on disk. Unfortunately, the whole file is rewritten. On a log-based file system this wastes a lot of space: any space beyond that of the stored files is used to hold old copies of data, so rewriting the whole file severely diminishes the amount of old snapshot data we can retain.

Next is rsync.

rsync --archive --existing --inplace --no-whole-file --progress --human-readable --stats --verbose oldfile newfile

This overcomes the need to rewrite the whole file, but there are some issues.  Some types of data confuse the delta algorithm, for example disk images of Oracle ASM disks; rsync then rewrites the whole file, including the sparse areas, which wastes even more space than using cp.  Another problem is performance.  The program first compares the files, keeps track of which sections need to be synchronized, and only then performs the synchronization.  For a file of many hundreds of gigabytes, rsync grinds to a near halt and never completes.  Of course, we only run the synchronization because we already know the files differ, so the first pass over the file is a needless slowdown.

So what's the solution?  Some simple custom code that

  1. checks that both file sizes are identical;
  2. verifies that some metadata has changed (e.g. time stamp, permissions, or owner/group);
  3. reads both files block-by-block;
  4. writes only changed blocks to the destination file, and
  5. updates any changed metadata.

Note that I don't guarantee the correctness or quality of the code below!  It is simple to look up, and compare, the files' metadata.  The stat(2) function provides everything we need (see syncutl.c and syncchk.c).  The data from the source file's stat can be used to update the destination file's metadata at the end of the process with just a handful of function calls.
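
For illustration only, a minimal sketch of those calls might look like the following.  The srcpath and dstpath variables are placeholders, the real syncchk.c and syncutl.c may be structured quite differently, and error checking is omitted for brevity.

/* Sketch only -- placeholder paths; the author's syncchk.c/syncutl.c may differ. */
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <utime.h>

const char *srcpath = "oldfile", *dstpath = "newfile";
struct stat ss, ds;

stat(srcpath, &ss);
stat(dstpath, &ds);

/* steps 1 and 2: identical size, but some metadata differs */
int sizes_match   = (ss.st_size == ds.st_size);
int metadata_diff = (ss.st_mtime != ds.st_mtime ||
                     ss.st_mode  != ds.st_mode  ||
                     ss.st_uid   != ds.st_uid   ||
                     ss.st_gid   != ds.st_gid);

/* step 5, after the block copy: carry the source metadata across */
chmod(dstpath, ss.st_mode);
chown(dstpath, ss.st_uid, ss.st_gid);
struct utimbuf times = { ss.st_atime, ss.st_mtime };
utime(dstpath, &times);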

It is important to choose an appropriate block size when writing the file.  If the block size is too big then there is a chance that data could be written to blocks that should remain sparse.  If the block size is too small then the operating system may re-read data from the file, causing a huge slow-down in processing.  An appropriate block size can be determined by, again, running stat against the destination file.
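
One way to do this, assuming the st_blksize member (the preferred I/O size reported by stat) is an acceptable granularity, is shown below; it also allocates the sb and db buffers used by the copy loop.  This is a sketch under those assumptions, not necessarily how the included source does it.

/* Assumption: use st_blksize as the copy block size. */
#include <sys/stat.h>
#include <stdlib.h>

struct stat ds;
stat("newfile", &ds);
size_t block_size = (size_t)ds.st_blksize;
char *sb = malloc(block_size);   /* source block buffer */
char *db = malloc(block_size);   /* destination block buffer */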

A simple loop is required to read, compare, and write the data (see syncrun.c).

/* sb and db point to buffers of at least block_size bytes; the sizes of
   sf and df have already been verified to be identical. */
size_t b, d;

while ((b = fread(sb, 1, block_size, sf)) > 0) {
   d = fread(db, 1, block_size, df);
   if (d != b || memcmp(sb, db, b) != 0) {
      /* step back over the destination block and rewrite only this block */
      fseek(df, -(long)d, SEEK_CUR);
      fwrite(sb, 1, b, df);
      fflush(df);   /* required before the next fread on a stream just written */
   }
}

All the source code is included, and this has been tested on various files, including a 699 gigabyte file that rsync was never able to process.
