## Sunday, October 3, 2010

Many would be surprised to hear that several common GNU/Linux programs don't handle symlinks properly.  By that, of course, I mean that they don't handle them the way I would want them to, but close enough.  For instance, if you want to copy a directory from one server to another, the command scp -r source-dir target-dir looks very attractive.  Unfortunately, scp follows symlinks: instead of copying a link to some other part of the file system, it copies that other part of the file system itself.  For a heavily symlinked directory this can be disastrous.
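The difference between following and preserving a link can be seen locally with a small sketch. Here cp -RL is only a stand-in for a copy that dereferences links (the behavior described above for scp -r), cp -RP is the link-preserving counterpart, and every directory name is made up for the demo.

```shell
#!/bin/sh
# Local sketch: what "following" a symlink does to a copy.
# cp -RL is only a stand-in for a dereferencing copy such as scp -r;
# all names below are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/shared" "$demo/project"
echo "big shared data" > "$demo/shared/data.txt"
ln -s ../shared "$demo/project/shared-link"   # project/ links out to shared/

# Dereferencing copy: the link is replaced by a full copy of its target.
cp -RL "$demo/project" "$demo/followed"

# Link-preserving copy: the link stays a link.
cp -RP "$demo/project" "$demo/preserved"

ls -l "$demo/followed" "$demo/preserved"
```

In the listing, shared-link should show up as a real directory under followed/ and as a link under preserved/.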

The correct and foolproof way to grab a portion of a file system from a server is to use tar.  Don't worry, this doesn't mean you have to actually create a tar file; you can use tar to pipe the output over ssh and untar it on the other side.

tar -c some-files some-dirs \
| ssh -C my-server "tar -C path/to/extract/root -x"

ssh -C my-server \
"tar -C path/to/archive/root -c some-files some-dirs" | tar -x

The -C switch to tar tells it to change directories prior to performing the operation.  The -C switch to ssh tells it to compress the traffic with gzip-like compression.  If you have a slow connection to the server or a pay-by-the-bit plan, you can use stronger compression by including lzma or p7zip in the pipe, or by just passing a -j switch (bzip2) to both tar commands.  By the way, p7zip also treats symlinks badly, so you need to protect any hierarchy by wrapping it in a tar archive first.
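To try the pipe without a server, here is a minimal local sketch. It assumes sh -c as a stand-in for ssh -C my-server, uses made-up paths, and spells out -f - so that it does not depend on tar's default archive being stdin/stdout.

```shell
#!/bin/sh
# Local sketch of the tar pipe. `sh -c` stands in for `ssh -C my-server`
# so the example runs without a server; all paths are hypothetical.
work=$(mktemp -d)
mkdir -p "$work/src/some-dir" "$work/dest"
echo "hello" > "$work/src/some-dir/file.txt"
ln -s file.txt "$work/src/some-dir/link.txt"

# The sender archives relative to its root (-C) and writes to stdout (-f -);
# the receiver changes to its own root and extracts from stdin.
tar -C "$work/src" -c -f - some-dir | sh -c "tar -C '$work/dest' -x -f -"

ls -l "$work/dest/some-dir"
```

The listing should show link.txt still pointing at file.txt. With a real server, the sh -c part becomes ssh -C my-server and the quoted command runs on the remote side.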

In case you are wondering why scp defaults to bad behavior: not all file systems are created equal.  Since you are copying to a server with who knows what file system (for instance, it could be FAT), you might not be able to create symlinks there.  So it is a defensible decision to copy only files and not links to files.  If you only deal with, hmmm, how to put it, modern file systems, this sure seems like incorrect behavior.  Maybe someday this will change, but in the meantime, the tar method works great and has been the method of choice for as long as tar, pipes, and networks have existed.

But wait, there's more.  Even if you don't have symlinks, piping a tar archive over ssh might be a good idea.  Since scp operates on individual files, it incurs an overhead on each one.  If you have many small files you want to transfer, small enough that the actual transfer time is almost insignificant, this overhead can become quite costly.  In these cases the tar method will be faster.

smithzv@ciabatta:~$ ssh scandal "ls -R kappa-slices-3d | wc"
   3993    3958  117715
smithzv@ciabatta:~$ ssh scandal "du -sh kappa-slices-3d"
36M     kappa-slices-3d

smithzv@ciabatta:~$ time scp -qr scandal:./kappa-slices-3d dat/

real    0m8.004s
user    0m1.152s
sys     0m1.184s

smithzv@ciabatta:~$ time ssh scandal "tar -c kappa-slices-3d" \
| tar -x -C ~/dat/

real    0m2.442s
user    0m0.824s
sys     0m0.728s

This directory on our scandal cluster has about 4000 small files in it, totaling 36 MB.  The piped tar method takes about a third of the time of the recursive scp copy.  Also, I should point out that, as far as I know, scp will at best be as fast as the tar procedure.  Note that we didn't use compression here: this is a transfer of already compressed files over a fast connection, and compression would just slow both commands down.  If you ever need to back up your computer over your home LAN so you can reinstall an OS or something, this is a lifesaver (or at least a time saver).
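To judge whether a directory falls into this many-small-files case before picking a method, a rough summary like the sketch below can help. The tree is generated on the spot so the example is self-contained, the names are hypothetical, and a low average size per file is only a hint that per-file overhead will dominate.

```shell
#!/bin/sh
# Sketch: summarize a tree to see whether it is "many small files".
# The tree is generated here so the example runs anywhere; names are hypothetical.
tree=$(mktemp -d)
i=1
while [ "$i" -le 50 ]; do
    echo "slice $i" > "$tree/slice-$i.dat"
    i=$((i + 1))
done

# Number of regular files (tr strips the padding some wc versions add).
files=$(find "$tree" -type f | wc -l | tr -d ' ')
# Total disk usage in KiB.
kib=$(du -sk "$tree" | cut -f1)

echo "$files files, $kib KiB total, about $((kib / files)) KiB per file"
```

On a remote tree, the same two commands can be wrapped in ssh, just like the ls and du calls in the session above.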

So, piping a tar archive over ssh is a great tool.  That being said, there is a program that does so much more and might be a better choice as long as it is installed on both systems; it's called rsync.  By default, rsync does not preserve symlinks either (it skips them rather than following them as scp does), but it has a switch, -a for archive mode, that gives it the symlink-preserving behavior seen above.  rsync has other benefits over a plain ssh or scp copy (like incremental updates, i.e. only transmitting data that has changed) and really should be preferred in most cases if it is an option.  But you have to read the man page first or it will bite you, especially if you have heavily internalized the way cp and scp work.
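A local sketch of archive mode follows. It also shows one of the ways rsync can bite: a trailing slash on the source changes what gets created at the destination. The paths are made up, rsync is assumed to be installed (the script just skips the demo otherwise), and a remote target such as host:/dir1 would take the place of the local destination.

```shell
#!/bin/sh
# Local sketch of rsync -a; all paths are hypothetical.
if ! command -v rsync >/dev/null 2>&1; then
    echo "rsync not installed; skipping demo"
else
    r=$(mktemp -d)
    mkdir -p "$r/src/dir2" "$r/with-slash" "$r/without-slash"
    echo "data" > "$r/src/dir2/file.txt"
    ln -s file.txt "$r/src/dir2/link.txt"

    # No trailing slash: dir2 itself is created inside the destination,
    # and -a keeps link.txt as a symlink.
    rsync -a "$r/src/dir2" "$r/without-slash"

    # Trailing slash: only the contents of dir2 land in the destination.
    rsync -a "$r/src/dir2/" "$r/with-slash"

    ls -l "$r/without-slash/dir2" "$r/with-slash"
fi
```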

1. This is the simplest, clearest explanation of this topic I have found. Thanks!

2. Or you could use rsync like this:

rsync -avz /dir1/dir2 host:/dir1

and it will copy dir2 and contents into dir1 on remote host.

1. I mention rsync in the post, and it is undoubtedly the best tool for the job. However, I didn't want to suggest it outright as it can be tricky. Namely, while:

rsync -avz /dir1/dir2 host:/dir1

...does the right thing...

rsync -avz /dir1/dir2/ host:/dir1

...does something wildly different (it copies all the files in /dir1/dir2 into host:/dir1; dir2 is not created on host). In every other CLI I can think of, a trailing slash on a directory pathname doesn't significantly alter the action that the command performs (okay, it does if you are dealing with soft links to directories, but it makes sense there). Because of this behavior, rsync violates the Principle of Least Astonishment (https://en.wikipedia.org/wiki/Principle_of_least_astonishment), which is why I suggest that any potential user read the man page thoroughly, and play around with it a bit, before starting to use it.

3. I like that principle. I have never heard of it before but it is a powerful fact in systems programming and administration.