A White Paper
Andy Henson, 9 February 2004 (modified 17 February 2004)
There is a need to copy files to a remote machine for archive or backup purposes. The remote machine may not be trusted, so the archive stored on there must be encrypted. To reduce the amount of data sent over the wide area network, a protocol such as rsync [1] is needed. But these two goals contradict: most forms of encryption will reduce the ability to transfer changes using the rsync algorithm. This paper presents an encryption technique which allows such files to be transferred using rsync, and a slight modification to the rsync algorithm which allows a plaintext file to be encrypted and efficiently copied to a remote encrypted file without keeping a local copy of the encrypted file, and without transferring the key to the remote machine. The encrypted archive is no more than 1% larger than the original, and the rsync algorithm is as efficient as for unencrypted files.
In a typical situation, a number of files need to be archived, either all at the same time or in individually. Typically single master key will be specified for them all.
A single file can be trivially encrypted using a symetric key using a cipher such as RC4 [2] or Blowfish [3]. For good cryptographic security we should never use the same key to encode different information, so a small random seed or key modifier, typically 128 bits (16 bytes) long, is needed per file to ensure each file encrypted with the same master key is encrypted slightly differently. The combination of the two generates a cipher stream which can be combined with the plaintext by exclusive-or or similar operation to yield a cipher text. Each block of data in the file needs to be encrypted with part of that cipher stream in such a way that even if it moves in position within the file between versions it will still be encrypted using the same part of the cipher stream, ie, it will yield an identical block for rsync matching purposes. New data which is introduced when the file is modified needs to be encrypted using a new part of the cipher stream which has never been used before.
The encrypted file thus contains the cipher text (same size as the plaintext) plus the “cipher table” which contains the random seed or key modifer, and the table mapping offsets into the cipher stream to blocks of the file. The next available cipher stream offset is stored in the table as well – this can ensure the next time data is added it is not accidentally encrypted reusing part of the cipher stream already used in a previous version even if later deleted.
It should be noted that the cipher table contains no key or plaintext information and need not normally be encrypted.
We can design the cipher table to be no bigger than 1% of the cipher text, plus a small overhead. If a typical block is 1024 bytes, this gives us a limit of 10 bytes to descibe the cypher stream for that block which is quite reasonable. Larger blocks are more appropriate for larger files – these blocks should correspond to those used by the rsync algorithm to transfer the file. An overhead for a magic number, size of cipher table and 16 byte random seed can be kept down to 30 bytes.
Note that if a change is made within a block (or a number of small changes made close together) it is best to consider the entire block to be all “new data” and encrypt it using a new part of the cipher stream. This keeps the number of cipher table entries down (often just one per block) and mimmics the way that rsync algorithm works, thus not increasing the amount of data which would be transferred by rsync if using the same block size.
The rsync algorithm [1] allows a file to be transferred by only sending blocks which have changed. The sender tells the (remote) receiver which file to consider and the reciever then sends a list of block checksums for that file. The sender then scans its copy of the file looking for those blocks and either sends “Use block N” or else sends the new data.
If we encrypt the file as descibed above to a local encrypted file, it can can be efficiently be transferred by the rsync algorithm (using the same blocksize). This requires a local copy of the encrypted file, which is not desirable.
But how is the remote encrypted file is to be compared with the local unencrypted data? It would be insecure to send the key to the remote, untrusted, machine to decrypt the archive before computing the checksum. Providing a copy of the cipher table can be transferred to the sender (easy since it is small), the details of the cipher stream can be computed locally, yet encrypting each block locally when doing a rolling checksum would be too expensive.
A small modification to the rsync checksum allows a checksum, generated locally from the cipher stream, to be combined with that from the block of remote ciphertext to yield a plaintext checksum for use in the normal way. When the checksum matches, a full encryption of the local data using the remote cipher table entry will be needed for MD5 computation, but this only occurs on probable-matches and the expense is acceptable. On non-matches, the encrypted byte is sent as normal and an entry is made in the new cipher table.
In order to prevent blocksize mismatches, it is recommended that blocksize for transfer and encrypted file computation is always a power of two.
The original rsync rolling checksum is defined by sum of bytes:
a(k,l) = SUM(i=k..l OF X[i]) mod M
b(k,l) = SUM(i=k..l OF (l-i+1)X[i]) mod M
Thus: a(k+1,l+1) = (a(k,l) - X[k] + X[l+1]) mod M
and: b(k+1,l+1) = (b(k,l) – (l-k+1)X[k] + a(k+1,l+1)) mod M
For simplicity M=216
So the checksum s(k,l) = a(k,l) + 216 b(k,l)
We begin by defining a plaintext P[i] encrypted ciphertext X[i] and cipher C[i] such that
X[i] = P[i] + C[i] mod M
where M = 216
Decryption occurs by the reverse proceedure where the cipher text is subtracted.
In order to make this work reversably, X, P, and C are streams of 16 bit elements - each element is two bytes. Little endian is used because that makes odd size streams easier to handle. If the data being encrypted is an odd number of bytes long, then the final element is just one byte.
The normal rsync checksums are now applied to the streams of 16 bit elements (rather than 8 bit ones in the traditional rsync implementation). Thus, the esync rolling checksum is defined by sum of these elements:
a(k,l) = SUM(i=k..l OF X[i]) mod M
b(k,l) = SUM(i=k..l OF (l-i+1)X[i]) mod M
and thus we see that:
a(k,l) = SUM(i=k..l OF P[i]) mod M + SUM(i=k..l OF C[i]) mod M
and hence checksums of ciphertext and cipher stream can be computed separately and recombined later.
An implementation of this algorithm has been produced as a prototype program “esync” [4] and the associated file format [5].
Written by Andy Henson. Copyright © 2004 by Zexia Access Ltd
If you want to contact me regarding this, please email <esync@zexia.co.uk>
Please feel free to link to this document.
[1] “The rsync algorithm”, Andrew Tridgell & Paul Mackerras, November 1998, http://samba.anu.edu.au/rsync/tech_report/ also see http://rsync.samba.org/
[2] “What is RC4?” http://www.rsasecurity.com/rsalabs/faq/3-6-3.html also “A Stream Cipher Encryption Algorithm "Arcfour"”. Thayer, R. and K. Kaukonen, July 1997, http://www.mozilla.org/projects/security/pki/nss/draft-kaukonen-cipher-arcfour-03.txt
[3] “Description of a New Variable-Length Key, 64-Bit Block Cipher (Blowfish)”, Bruce Schneier, Fast Software Encryption, Cambridge Security Workshop Proceedings (December 1993), Springer- Verlag, 1994, pp. 191-204., http://www.schneier.com/paper-blowfish-fse.html
[4] “The esync program”, Andy Henson, 2004, http://www.zexia.co.uk/esync/esync.html
[5] “The esync file format”, Andy Henson, 2004, http://www.zexia.co.uk/esync/esync-fileformat.html