I was the lead on Syncplicity's desktop client. File synchronization has a myriad of corner cases that are difficult and non-intuitive to think through, and non-programmers often thoroughly underestimate just how hard they are to anticipate and mitigate.
The fact that they found bugs that rely on sensitive timing doesn't surprise me.
Can you share some of those difficult and non-intuitive corner cases? I'm guessing debouncing, etc.
The way I used to explain it:
Imagine that you are on a plane (and don't have an internet connection). You edit a file.
At the same time, I edit that file.
What should we do? We can't possibly know every file format out there, and implement operational transform for all of them.
Now, imagine that we both edit the same file, at the same instant. One of us is going to submit the change first, and the other will submit it second. It's the same use case, and there's no way to avoid this.
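Most sync products punt on this the same way: since arbitrary file formats can't be merged, whichever upload arrives second is preserved as a side-by-side conflict copy rather than silently overwriting the first. A toy sketch of that policy (the function, names, and the integer-versioning scheme are my assumptions, not Syncplicity's actual logic):

```python
def resolve_upload(server_version: int, base_version: int, filename: str):
    """Return (action, name) for an upload from a client whose last
    known server state was base_version.

    If the server file hasn't moved since the client's base, the upload
    wins. Otherwise someone else submitted first, so keep this edit
    under a conflict name instead of discarding either version.
    """
    if base_version == server_version:
        return ("accept", filename)
    stem, dot, ext = filename.rpartition(".")
    if dot:
        return ("conflict", f"{stem} (conflicted copy).{ext}")
    return ("conflict", f"{filename} (conflicted copy)")

# First editor lands while the server is still at the shared base:
first = resolve_upload(server_version=3, base_version=3, filename="report.xlsx")
# Second editor's base is now stale, so their edit is kept alongside:
second = resolve_upload(server_version=4, base_version=3, filename="report.xlsx")
```

Neither edit is lost, but a human still has to reconcile the two copies, which is exactly the "we can't read minds" point.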
---
Renaming folders was a lot weirder, because you get situations like:
I rename a folder, but you save a new change to a file in that renamed folder and your computer doesn't know about the renamed folder.
Or, I rename a folder and you have a file open. That application holds an open file handle to the file, so we can't just rename the folder. What do we do? (This is how Excel behaves: it keeps the file open while you edit it.)
Or, I rename a folder and you have a file open, but that application doesn't have an open file handle to that file. What happens when you try to save the file and it's been moved? (This is how most applications do it.)
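One pragmatic answer to the open-handle case is to defer the rename and retry. A rough sketch under my own assumptions (the retry queue and the choice of errno values are illustrative, not any product's real policy):

```python
import errno
import os
import tempfile

def try_apply_rename(src: str, dst: str, pending: list) -> bool:
    """Attempt a remote-initiated folder rename locally. If the OS
    refuses (e.g. on Windows an application holds an open handle to a
    file inside the folder), queue the rename to retry on a later scan
    instead of dropping the remote change."""
    try:
        os.rename(src, dst)
        return True
    except OSError as e:
        if e.errno in (errno.EACCES, errno.EPERM, errno.EBUSY, errno.ENOTEMPTY):
            pending.append((src, dst))  # retry once the handle is closed
            return False
        raise  # anything else is a real error, not a transient lock

# Demo: the rename goes through when nothing is holding the folder.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "old-name"))
pending: list = []
ok = try_apply_rename(os.path.join(root, "old-name"),
                      os.path.join(root, "new-name"), pending)
```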
---
Application bundles (on Mac) were weird because we didn't support the metadata needed to sync them.
---
The general "Merge" use case, which stemmed from the fact that Syncplicity could sync folders anywhere on disk. (As opposed to the way Dropbox, Google Drive, and OneDrive stick everything into a single folder.) We'd have customers disconnect a folder and then re-add it to the same location. The problem was that if they had been disconnected for a long time, they would "merge" the old version of the folder into the new one:
If you edited a file while disconnected, it hit the same "multiple editors" use case that I mentioned above.
If someone deleted a file but you still had it, we'd recreate it. (We can't read minds, you know!)
If someone renamed a folder, but you still had the old path, we'd re-add it.
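The merge rules above boil down to a small decision table. A toy sketch (not the actual product logic) that makes the core limitation explicit: a scanner can't distinguish "deleted on the server while you were away" from "created locally while detached," so anything the server doesn't have gets re-uploaded.

```python
def merge_action(local_changed: bool, on_server: bool, server_changed: bool) -> str:
    """Decide what to do with one path when re-attaching a
    long-disconnected folder to its server-side counterpart."""
    if not on_server:
        return "upload"         # resurrects remotely deleted files -- no mind reading
    if local_changed and server_changed:
        return "conflict-copy"  # the multiple-editors case again
    if local_changed:
        return "upload"
    return "download"           # take the server's (possibly renamed) state
```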
I remember overhearing non-programmer product managers trying to talk through these use cases, getting overwhelmed by the complexity, and realizing they were in way over their heads.
---
A lot of these corner cases were smoothed over when we wrote "SyncDrive", which was a virtual disk drive, because all of the IO came through us. (Instead of scanning a folder to understand what the user did.)
Oh, and the parent folder is on a shared NAS with some caching.
We had to add logic to block network and USB drives. (They were an ever-present source of customer issues.)
The root cause of the problem is a bug in .NET's File.Exists: if there is a filesystem / network error, instead of throwing an exception, the error is swallowed and the call just returns false. I'm not sure whether newer versions of .NET fix it; I only learned about this when we were implementing a driver / filesystem.
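For what it's worth, the same foot-gun exists in Python: os.path.exists swallows every error (OSError and even ValueError) and returns False, which a naive sync engine reads as a deletion. A sketch of the distinction a sync engine actually needs, "definitely absent" vs. "could not check":

```python
import os
import tempfile

def exists_strict(path: str) -> bool:
    """Return True/False only when we actually know; let real I/O
    errors propagate instead of masquerading as 'file is gone'."""
    try:
        os.stat(path)
        return True
    except FileNotFoundError:
        return False
    # PermissionError, other OSErrors, and ValueError all propagate,
    # so a flaky mount can't be mistaken for a mass delete.

with tempfile.NamedTemporaryFile() as f:
    present = exists_strict(f.name)  # True: the file really is there
absent = exists_strict(f.name)       # False: now it really is gone
```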
You can just mount network drives as folders.
Which we detected and blocked, due to phantom deletions. (I explain the .net bug higher in this thread.)
Doesn't surprise me; a network drive is the typical example of something that doesn't "just work".
I used a paid SaaS sync service 10 years ago (not Dropbox) that had the following failure mode, even though it had been around for a few years:
You could have it mirror an entire subdirectory, including external drives.
If you were booted up long enough with that external drive unmounted, the service registered that as a subdirectory delete (bad). When you then mounted the drive again, the sync agent saw it as out of sync with the newer server-side delete and proceeded to clear the local external drive.
They also implemented versioning so poorly that a deleted directory was not versioned, only the files within it. So you could recover the raw files, but without the directory structure: a giant bundle of thousands of files. Horrible.
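That failure mode is avoidable with one guard: never treat an absent root as an empty root. A sketch under my own assumptions (the `scan_root` API and its return values are hypothetical):

```python
import os
import tempfile

def scan_root(root: str, expect_mounted: bool = False):
    """Refuse to interpret a missing sync root as mass deletion.
    If the root directory is gone -- or is supposed to be a mount
    point and currently isn't -- report 'unavailable' rather than
    emitting a delete for every file under it."""
    if not os.path.isdir(root):
        return ("unavailable", [])
    if expect_mounted and not os.path.ismount(root):
        return ("unavailable", [])  # drive not mounted, not deleted
    return ("ok", sorted(os.listdir(root)))

root = tempfile.mkdtemp()
open(os.path.join(root, "keep.txt"), "w").close()
status_ok = scan_root(root)                           # normal scan
status_missing = scan_root(os.path.join(root, "no"))  # root vanished
status_unmounted = scan_root(root, expect_mounted=True)  # not a mount point
```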
See: https://dynamicsgpland.blogspot.com/2011/11/one-significant-...
Examples I've seen in similar systems:
- Receiver tried to create a file before receiving attributes of the directory containing the file. Receiver author assumed it would always receive directory attributes first and create the directory, so it crashed.
- Receiver created a file before receiving attributes of the directory containing the file. Parent directory was created automatically, but with default attributes so the file was too accessible on the receiver when it should not have been.
- Bidirectional sync peers got into a non-terminating protocol loop (livelock) when trying to agree if a directory deep in a tree should be empty or removed (garbage collected) after synchronising removal of contents. It always worked if one side changed and sync settled before the next change, but could fail if both sides had concurrent changes.
- Mesh sync among multiple peers, some of them acting as publish-subscribe proxies that forwarded changes to others as quickly as possible, merged with their own changes, got into a more complicated non-terminating protocol loop when trying to broadcast and reconcile overlapping changes observed on three or more nodes concurrently. The solution was similar to distributed garbage collection and the spanning tree protocols used in Ethernet switch networks.
- Transmission of commands halted due to head-of-line blocking (deadlock) on a multiplexed sync stream: a data channel was going to a receiver process whose buffer filled while it waited for a command on the command channel, which the transmitter process had issued but couldn't transmit. The fault was that separate, modular tasks assumed the data for each channel flowed independently. The solution was to multiplex correctly with per-channel credits, as HTTP/2 and QUIC do, instead of incorrectly assuming you can just mix formatted messages over TCP.
- Rendered pages built from mesh data-synchronised components (similar to Dropbox-style synced files, but with a mesh of 1000s of peers) showing flashes of inconsistent data: tables whose columns should always add up to 100% showing a different total (e.g. "110% (11050 of 10000) devices online"), displayed addresses showing the wrong country, numbers of devices exceeding the total number shipped, devices showing error flags yet also a "green - all good" indication, comment counts not matching the shown comments, row counts not matching the rows in a table, etc. Usually for only a few seconds, but sometimes staying on screen for a long time if the 3G network went down, or permanently if rendered into a PDF report. Such glitches made the underlying systems look buggy when they really weren't, and completely undermined trust in the presented data. All for want of a more careful synchronisation protocol.
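The first two bullets are the classic ordering fix: make the dependency explicit rather than assumed. A minimal sketch of ordering transfer records by path depth, so every directory (and its attributes) is applied before its contents; the record shape here is made up:

```python
def order_operations(records):
    """Sort create/attribute records so parents strictly precede
    children: depth first, then path for determinism. A receiver that
    trusts arrival order instead either crashes on a missing parent or
    fabricates one with default, possibly too-permissive attributes."""
    return sorted(records, key=lambda r: (r["path"].count("/"), r["path"]))

ops = [
    {"path": "a/b/file.txt", "kind": "file"},
    {"path": "a", "kind": "dir", "mode": 0o700},
    {"path": "a/b", "kind": "dir", "mode": 0o700},
]
ordered = [r["path"] for r in order_operations(ops)]
```

Real systems also have to order deletes in the opposite direction (children before parents), which is where the livelocks in the later bullets start to creep in.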
>Receiver tried to create a file before receiving attributes of the directory containing the file. Receiver author assumed it would always receive directory attributes first and create the directory, so it crashed.
This case, and a bunch of the others, are variations on failing to correctly implement dependency analysis. I'm not saying it's easy, it is far from easy, but this has been part of large systems design (anything that involves complex operations on trees of dependent objects) for years, especially in the networking space.
Indeed, your fourth bullet gets to some of the very ancient techniques (though STP isn't a great example) to address parts of the problem.
The last bullet is very hard. Honestly, I'd be happy if iCloud and Dropbox just got the basics right in the single-writer case and stopped fucking up my cloud-synced .sparsebundle directory trees. I run mtree on all of these and routinely find sync issues in Dropbox and iCloud Drive, from minor (crazy timestamp changes that make no sense and are impossible, but with the data still complete and intact) to serious (one December, Dropbox decided to revert about a third of the files to the previous October's versions).
The single writer case (no concurrency, large gaps in time between writers) _is_ easy and yet they continue to fuck it up. I check every week with mtree and see at least one significant error a year (and since I mirror these to my NAS and offline external storage, I am confident this is not a user error or measuring error).
There was a discussion of a self-built Dropbox on the front page (https://news.ycombinator.com/item?id=47673394). This is just to show that Dropbox is thoroughly tested for all kinds of weird interactions and behaviours across OSes using a very formal testing framework.
This is the kind of thing I think about when I see the mindset of "we'll just replace all the SaaS with vibe code" pitches.
Not everything is a CRUD app website.
I was running my own hacky sync thing to the cloud a decade ago. I would never in my boldest dreams have compared it to Dropbox.
Even if you know the use cases, the edge cases could be 99% of the work.
POCs are 100x easier than working production multi-user applications. Don’t confuse getting to a POC in 2 hours with getting a final product in 4 hours.
Anything written by John Hughes is worth a read. He also co-wrote QuickCheck.
Seconded. All the Haskell people from Chalmers have produced very interesting work. Another example is Agda.
One of the authors, John Hughes, did a talk on property-based testing at Clojure/West some years back. Worth a watch if you're interested: https://www.youtube.com/watch?v=zi0rHwfiX1Q
John Hughes is also one of the inventors of property-based testing.
So, from a brief look over it, the only cases in which data loss seemed to occur were when two clients edited the same file temporally close to each other? I.e., you end up with something similar to a git merge conflict, which cannot be resolved well automatically, and thus can cause data loss.
Yes. Because these are not network drives but ordinary folders on your local filesystem, there's no perfect way to lock files the way a network drive can.
Business idea: file sync software run by a company that promises to fire any employee who suggests adding a "feature."
[deleted]
Great paper! I'm glad I avoided OwnCloud after discovering how much of a hot PHP mess it is (and that it was about 10x slower for LAN sync than Seafile on the same machine).
I would love to have all the file sync solutions tested with this suite.