This continues from my previous post on the various online storage/sync solutions available today.
I’ve been a Dropbox (and Box, and Google Drive) user for a while now, and I like these services for their convenience. They are easy to use and set up, and they let you keep multiple devices in sync with next to no effort. However, I’ve always had some concerns over privacy and security. In light of the recent attack on the service provider, I started wondering how safe my files and accounts really are (not just with Dropbox, but with any online storage solution, including a home-brewed one).
I also have some concerns regarding the privacy of my documents. Say, I’ve got some sensitive data uploaded to an online storage service. Who’s to say these documents are safe from data mining, or (god forbid) human eyes? (I’m not pointing fingers at any individual storage provider here. Some may respect your privacy, others may not.) Many people would be extremely wary of the possibility of information harvesting (even if it is completely anonymized and automated) and/or leakage.
Then of course, there are some less critical, but nevertheless important limitations:
- Only x GB of (free) storage space. One can always upgrade to a paid package, but I don’t want to pay for 50 GB of storage when I’m only going to use 10 GB in the foreseeable future. There are services that provide a large amount of storage space for free, but most of them still charge you for bandwidth once your usage goes above a fraction of that amount.
- No support for multiple profiles. You have to put EVERYTHING you want to sync under one single top-level folder. This may not be a suitable or acceptable restriction in all situations.
- Lack of flexibility - you don’t get to move your repository around if you need to. Once you subscribe to a service, you’re locked into using their storage infrastructure exclusively.
Not every service has all of the limitations I’ve described so far, and not all of them will matter to everybody. These are just a few issues that got me going on a personal quest to find a better alternative.
There are actually quite a few ways of setting up your own personal online storage and sync solution, whose security is limited only by your ability to configure it. But the most visible benefit over any existing service is the flexibility -
- to use a storage infrastructure of your choice, and
- to manage multiple profiles.
The rest of this post documents my experiments with one such solution, named bitpocket. It performs 2-way sync by using a wrapper script to run rsync twice (once on the master, once on the slave). It can also detect and correctly propagate file deletions. It does have one limitation: it doesn’t handle conflict resolution. You have been warned. (Unison is supposedly capable of this, but that is another post :-).)
The basic setup instructions are right on the project landing page. Follow them and you’re all set. I’ll elaborate on two things here -
- how to do a multi-profile setup, and
- how to alleviate the problem of repeated remote lockouts when multiple slaves always try to sync at the same time.
I’ve got two folders on my laptop that I want to sync:
I want these two folder profiles to be self-contained, without requiring the tracking to be done at the common parent. Following the instructions on the project page, I did a bitpocket init inside each of the above folders. On the master side (I’m running an EC2 micro-instance on a 64-bit Amazon Linux AMI), I’ve got one folder, /home/ec2-user/syncroot, where I want to track all synced profiles. So in the config file of each profile folder on the slave machine, I set the REMOTE_PATH variable as follows:
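For illustration, here is roughly what the two config files could look like. This is only a sketch under a few assumptions: the two profiles are taken to be ~/Documents and ~/scripts (the folders used in the crontab entries later in this post), each profile gets its own subfolder under syncroot, the host value is a placeholder, and the variable names follow the bitpocket config template as I understand it, so check them against your own .bitpocket/config.

```shell
# ~/Documents/.bitpocket/config  (assumed profile folder)
REMOTE_HOST=ec2-user@master.example.com   # placeholder, not a real host
REMOTE_PATH=/home/ec2-user/syncroot/Documents

# ~/scripts/.bitpocket/config  (assumed profile folder)
REMOTE_HOST=ec2-user@master.example.com   # placeholder, not a real host
REMOTE_PATH=/home/ec2-user/syncroot/scripts
```

The only thing that differs between profiles is the last path component, which is what keeps them separate on the master.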
That’s it! You can manage as many profiles as you want, with each slave deciding where to keep its local copy of each profile.
Preventing remote lockouts
Say all your slaves sync their system clocks against a network time source, so they agree with each other to the second (or finer). If every cron job is configured to run at 5-minute intervals, all the slaves attempt to connect to the master at exactly the same time. The first one to establish a connection starts syncing, and all the others get locked out. This happens on every cron run. The problem is exacerbated by the fact that even an empty sync (nothing to transfer) takes a few seconds at the very least, and the lockout is in force for that duration. We’re thus left with a very inefficient system that can sync ONLY one slave per cron run. If one slave is on a network that enjoys consistently lower lag to the master than all the others, the others basically never get a chance to connect! Even if that is not the case, only 1 of the N slaves succeeds in each cron run. Not good.
One way to alleviate this (though not entirely) is to introduce a random delay (less than the cron interval) between when cron initiates and when the connection is actually attempted. Over several cron runs, this scheme spreads out the odds evenly (duh!), for each slave, of running into a remote lockout. Local lockouts are not a problem. Bitpocket uses a locking mechanism to prevent two local processes from syncing the same tracked directory at the same time. If a new process encounters a lock on a tracked directory, meaning the previously spawned process hasn’t finished syncing yet, it simply exits. The random delay is introduced as shown below (assuming a cron frequency of 5 min):
#! /usr/bin/env bash
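A minimal sketch of what such a wrapper could look like is given below. It is not necessarily the exact script, just one way to do it under stated assumptions: bitpocket is on cron's PATH, "bitpocket cron" is the cron-friendly sync command, and MAX_DELAY and BPSYNC_CMD are names I made up for this example.

```shell
#!/usr/bin/env bash
# Sketch of a bpsync wrapper: wait a random time, then sync one profile.
# Assumptions: bitpocket is on cron's PATH and "bitpocket cron" is the
# cron-friendly sync command; MAX_DELAY and BPSYNC_CMD are made-up names.

MAX_DELAY="${MAX_DELAY:-300}"                # cron interval in seconds (5 min)
BPSYNC_CMD="${BPSYNC_CMD:-bitpocket cron}"   # command run inside the profile folder

# Print a random whole number of seconds in [0, max).
random_delay() {
    local max="$1"
    echo $(( RANDOM % max ))
}

# Sleep for a random delay, then run the sync inside the profile folder.
sync_profile() {
    local dir="$1"
    sleep "$(random_delay "$MAX_DELAY")"
    cd "$dir" && $BPSYNC_CMD
}

# Only sync when a profile folder is given, e.g. "bpsync ~/Documents".
if [ "$#" -ge 1 ]; then
    sync_profile "$1"
fi
```

Keeping the delay strictly below the cron interval means each run still starts its sync before the next run is scheduled; if a sync overruns, the local lock described above makes the next process exit.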
That’s it! Assuming you’ve saved this file in /usr/bin/bpsync, edit your crontab entries like so, and you’re done:
*/5 * * * * bpsync ~/Documents
*/5 * * * * bpsync ~/scripts
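To get a feel for how much the random delay helps, here is a toy simulation of a single cron run. The model is my own simplification, not something bitpocket provides: every sync holds the master lock for a fixed number of seconds, and a slave that finds the lock taken gives up until the next run. The function name and its parameters are made up for this example.

```shell
#!/usr/bin/env bash
# Toy model of one cron run. Assumptions (for illustration only): every
# sync holds the master lock for a fixed number of seconds, and a slave
# that finds the lock taken gives up until the next cron run.

# simulate_run N WINDOW DURATION
# N slaves each start at a random second in [0, WINDOW). Prints how many
# of them find the master unlocked.
simulate_run() {
    local n="$1" window="$2" duration="$3"
    local i=0 ok=0 busy_until=0 start starts
    # Draw one start time per slave and process them in time order.
    starts=$(
        while [ "$i" -lt "$n" ]; do
            echo $(( RANDOM % window ))
            i=$(( i + 1 ))
        done | sort -n
    )
    for start in $starts; do
        if [ "$start" -ge "$busy_until" ]; then
            ok=$(( ok + 1 ))
            busy_until=$(( start + duration ))
        fi
    done
    echo "$ok"
}

simulate_run 4 1 3      # no random delay: everyone starts at second 0
simulate_run 4 300 3    # start times spread over the 5-minute interval
```

With a window of 1 second all slaves collide and only one gets through. With the window set to the full cron interval, most runs let every slave in, though an occasional collision is still possible, which is why the delay alleviates the problem rather than eliminating it.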
EDIT: I ran into trouble with stale server-side locks preventing further syncs with any slave. This happens when a slave disconnects mid-sync for whatever reason. Lock cleanup is currently the responsibility of the slave process that created it. There is no mechanism on the server to detect and expire stale locks (See https://github.com/sickill/bitpocket/issues/16). This issue needs to be fixed before this syncing tool can be left to run indefinitely, without supervision.
EDIT #2: One quick way to dispose of stale master locks is by periodically running a little script on the server that checks each sync directory for any open files (i.e. some machine is currently running a sync). If none are found, it simply deletes the leftover lock files. The script and the corresponding crontab entries are as below:
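# Assumes the working directory is the sync root holding one folder per profile.
for DIR in *; do
    # lsof +D lists open files under $DIR; no output means no sync is running there.
    OUT=$(/usr/sbin/lsof +D "$DIR")
    if [ -z "$OUT" ]; then
        rm -rf "$DIR/.bitpocket/tmp/lock"
    fi
done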
And the corresponding crontab:
*/5 * * * * /usr/bin/cleanup.sh
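If lsof is not available on the server, a different technique is to expire locks by age instead of checking for open files. The sketch below is an alternative I am suggesting, not part of bitpocket: it assumes the lock lives at .bitpocket/tmp/lock inside each profile folder (as in the script above), and the function name and the 30-minute default are made up. It relies on the -mindepth, -maxdepth and -mmin options, which GNU and BSD find support.

```shell
#!/usr/bin/env bash
# expire_stale_locks ROOT [MINUTES]
# Remove ROOT/<profile>/.bitpocket/tmp/lock entries that were last
# modified more than MINUTES minutes ago (default: 30).
expire_stale_locks() {
    local root="$1" minutes="${2:-30}"
    # Depth 4 below ROOT is <profile>/.bitpocket/tmp/lock; -maxdepth also
    # stops find from descending into a lock it is about to delete.
    find "$root" -mindepth 4 -maxdepth 4 -path '*/.bitpocket/tmp/lock' \
        -mmin "+$minutes" -exec rm -rf {} +
}

# Example: expire_stale_locks /home/ec2-user/syncroot 30
```

The trade-off is that a legitimate sync running longer than the chosen age would lose its lock, so the threshold should sit comfortably above the longest sync you expect.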