August 23, 2010

A New Approach to Syncing Folder Deletions

One of the goals of SpiderOak sync is that it will never destroy data in a way
that cannot be retrieved, even if the sync goes wrong.  So, as a design
goal, SpiderOak sync will never delete a file or folder that is not already
backed up.

Every time SpiderOak deletes a file, it checks that the file already exists
in the folder’s journal, and that the timestamp of the file currently on disk
matches that of the file in the journal or, if the timestamps differ, that the
cryptographic fingerprints match.  This allows only the narrowest of what
programmers call “race conditions” when deleting a file because of a sync.
Here’s how the race works:

  1. SpiderOak checks that the timestamp or cryptographic fingerprint matches the journal (i.e. the file is backed up and could be retrieved from the backup set).
  2. SpiderOak deletes the file

The trouble is that there is a very small time window between steps 1 and 2.
The user could potentially save new data into the file during this window.
If that happens, the save and the deletion are racing to completion.

Since the time window is so very small (milliseconds or less), this is an
acceptable risk.
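
The check-then-delete sequence can be sketched roughly as follows. This is a minimal illustration, not SpiderOak's actual code: the `JournalEntry` record, the choice of SHA-256 as the fingerprint, and all function names are assumptions made for the example.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class JournalEntry:
    """Hypothetical journal record for a backed-up file."""
    mtime: float
    sha256: str


def fingerprint(path):
    """Return the SHA-256 hex digest of the file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_backed_up(path, entry):
    """Step 1: match by timestamp, or by fingerprint if timestamps differ."""
    if entry is None:
        return False  # the file is not in the journal at all
    if os.stat(path).st_mtime == entry.mtime:
        return True
    return fingerprint(path) == entry.sha256


def safe_delete(path, entry):
    """Delete path only if it is backed up; return True if it was removed."""
    if not is_backed_up(path, entry):
        return False
    # The race window: a save that lands between the check above and the
    # removal below would be lost.
    os.remove(path)
    return True
```

The comment inside `safe_delete` marks the window discussed above: nothing in the sketch prevents a write from arriving between the check and the removal.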

Now consider the same scenario for deleting a folder.  Again, SpiderOak
makes a pass through the folder and verifies that the complete contents are
available in the journals, then it removes the folder.  The trouble now
is that the window between steps 1 and 2 is much larger.  A very large folder
could take minutes for SpiderOak to scan through and verify.  The folder may
be modified again at any point after we start scanning and before the
deletion begins.

Even though SpiderOak is plugged into the OS’s system for notification of
changes to the file system, such notifications are not guaranteed to be
immediate or to happen at all (for example, on a network volume).

So there is a larger “race condition,” or opportunity for data to be saved
to the folder between step 1 and 2 in the case of a large folder.

So, SpiderOak again tries to be conservative.  Instead of deleting the
folder, it tries to rename it out of the way.  Then, later, it can verify that
nothing has changed inside the folder after it has been renamed out of the way.

Syncing folder deletions most commonly fails at this renaming step.
Sometimes SpiderOak simply can’t rename the folder.  There are some
differences in how Windows and Unix platforms handle open files in these
cases, and the rename solution tends to work well on Unix and has greater
opportunity for error on Windows.  There are also some cases in which it
categorically fails, such as trying to rename across drive letters in Windows
or (in Unix) across different file systems.
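
For illustration, here is a rough sketch of the rename-out-of-the-way step and how its failures might be reported. The function name and the `holding_dir` parameter are hypothetical. On Unix, `os.rename` across file systems fails with `errno.EXDEV`; the code reported on Windows differs, so the sketch treats any other `OSError` as a generic failure.

```python
import errno
import os


def rename_aside(folder, holding_dir):
    """Try to move folder into holding_dir with a single rename.

    Returns (new_path, None) on success or (None, reason) on failure.
    """
    name = os.path.basename(os.path.normpath(folder))
    target = os.path.join(holding_dir, name)
    try:
        os.rename(folder, target)
    except OSError as exc:
        if exc.errno == errno.EXDEV:
            # A rename cannot cross file system boundaries.
            return None, "different file system"
        # Other causes: missing folder, permissions, files held open, etc.
        return None, "rename failed: %s" % exc.strerror
    return target, None
```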

We could fix those, but I think an entirely new approach is probably better.

Starting in the next version, instead of approaching the “delete a folder”
action as the deletion of an entire folder, SpiderOak will approach it as the
deletion of each individual item contained within the folder and all of its
subfolders, recursively.  We will use the same sequence described above for
individual file deletions on each file, working from the lowest subfolders on
up, and prune folders once they are free of files.
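
A minimal sketch of this bottom-up approach, assuming a caller-supplied `can_delete(path)` predicate that stands in for the journal check described earlier (a hypothetical name; SpiderOak's real interface is not shown in this post):

```python
import os


def delete_folder_contents(root, can_delete):
    """Delete verified files under root, deepest folders first.

    Returns the list of files that were left in place.
    """
    remaining = []
    # topdown=False makes os.walk yield the deepest subfolders first.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not can_delete(path):
                remaining.append(path)
                continue
            try:
                os.remove(path)
            except OSError:
                # For example, the file is in use or permissions changed.
                remaining.append(path)
        # Prune the folder only once nothing is left inside it.
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
    return remaining
```

Any path in the returned list is a file that could not be safely removed, and its parent folders stay on disk with it.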

This eliminates the need for the rename step and reduces the race condition
down to milliseconds for each removed file.  Most importantly, this
means that the files causing problems (i.e. files that are in use, or that are
changing too fast to back up and that SpiderOak therefore refuses to delete,
etc.) will be obvious: they will be the only files remaining.

We’ll have a beta available with this behavior soon, announced in the release notes (rss).

  1. While the historical archiving of deleted items is an excellent safeguard against inadvertent loss of data, is there a means of truly deleting and wiping clean the storage of any items that you positively and definitely want to delete, and leave zero trace? Or are any uploaded files essentially undeleteable, given the ability to restore through multiple levels of deletes?