Many would be surprised to hear that several common GNU/Linux programs don't handle symlinks properly. By that, of course, I mean that they don't handle them the way I would want them to, but close enough. For instance, if you want to copy a directory from one server to another, the command

scp -r source-dir target-dir

looks very attractive. Unfortunately, scp follows symlinks, meaning that instead of copying a link to some other part of the file system, it copies that other part of the file system. For a heavily symlinked directory this can be disastrous.
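If you want to see this for yourself, here is a minimal sketch (my-server stands in for any host you can ssh to, and the file names are made up):

mkdir -p demo && echo "some data" > demo/data
ln -s data demo/link
scp -r demo my-server:
ssh my-server "ls -l demo"

On the remote side, demo/link shows up as a regular file holding a full copy of data, not as a symlink.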
The correct and foolproof way to grab a portion of a file system from a server is to use tar. Don't worry, this doesn't mean you have to actually create a tar file; you can use tar to pipe the output over ssh and untar it on the other side.
tar -c some-files some-dirs \
    | ssh -C my-server "tar -C path/to/extract/root -x"
If you want to download from a server…
ssh -C my-server \
    "tar -C path/to/archive/root -c some-files some-dirs" | tar -x
The -C switch to tar tells it to change directories prior to performing the operation. The -C switch to ssh tells it to compress the traffic with gzip-like compression. You can even use better compression, if you have a slower connection to the server or a pay-by-the-bit plan, by including p7zip in the pipe, or just by passing a -j switch to both tar commands. By the way, p7zip also treats symlinks badly, so you need to protect any hierarchy inside a tar archive.
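For concreteness, here is roughly what those two options look like. This is a sketch, with xz standing in as one example of a stronger external compressor; swap in the compressor of your choice, though exact invocations differ:

tar -cj some-files some-dirs \
    | ssh my-server "tar -C path/to/extract/root -xj"

tar -c some-files some-dirs | xz \
    | ssh my-server "xz -d | tar -C path/to/extract/root -x"

The first uses tar's built-in bzip2 support via -j; the second compresses the archive in the pipe before it hits the network. If you compress in the pipe, drop ssh's -C so the traffic isn't compressed twice.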
In case you are wondering why scp defaults to this bad behavior: well, not all file systems are created equal. Since you are copying to a server, and who knows what file system it has (for instance, it could be FAT), you might not be able to create symlinks there. So it is an alright decision to only copy files and not links to files. If you only deal with, hmmm, how to put it, modern file systems, this sure seems like incorrect behavior. Maybe someday this will change, but in the meantime, the tar method works great and has been the method of choice for as long as tar, pipes, and networks have existed.
But wait, there's more. Even if you don't have symlinks, piping a tar archive over ssh might be a good idea. Since scp operates on individual files, it incurs an overhead on each one. If you have many small files to transfer, small enough that the actual transfer time is almost insignificant, this overhead can become quite costly. In these cases the tar method will be faster.
smithzv@ciabatta:~$ ssh scandal "ls -R kappa-slices-3d | wc"
   3993    3958  117715
smithzv@ciabatta:~$ ssh scandal "du -sh kappa-slices-3d"
36M     kappa-slices-3d
smithzv@ciabatta:~$ time scp -qr scandal:./kappa-slices-3d dat/

real    0m8.004s
user    0m1.152s
sys     0m1.184s
smithzv@ciabatta:~$ time ssh scandal "tar -c kappa-slices-3d" \
    | tar -x -C ~/dat/

real    0m2.442s
user    0m0.824s
sys     0m0.728s
This directory on our scandal cluster has 4000 small files in it which total up to 36 MB. The piped tar method takes about a third the time of the recursive scp copy. Also, I should point out that the scp process will, as far as I know, at best be as fast as the tarring procedure. Of course, note that we didn't use compression here, as this is a transfer of already compressed files over a fast connection and compression just slows both commands down. If you ever need to back up your computer over your home LAN so you can reinstall an OS or something, this is a lifesaver (or at least a time saver).
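As a sketch of that kind of backup (backup-box, my-user, and /mnt/backup are hypothetical placeholders):

tar -c -C /home my-user \
    | ssh backup-box "tar -x -C /mnt/backup"

This streams your whole home directory over as one continuous archive, symlinks and all.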
So, piping a tar archive over ssh is a great tool. That being said, there is a program that does so much more and might be a better choice as long as it is installed on both systems; it's called rsync. rsync mishandles symlinks by default, just like scp (and for the same reasons), but it has a switch, -a for archive mode, that allows it to perform the symlink-preserving behavior seen above.
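A minimal sketch of an archive-mode copy (my-server is again a placeholder):

rsync -az source-dir my-server:path/to/target/

Here -z compresses in transit, much like ssh -C above. One gotcha worth knowing up front: a trailing slash on source-dir makes rsync copy the directory's contents rather than the directory itself.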
rsync has other benefits over a plain scp copy (like incremental updates, i.e. only transmitting data that has changed) and really should be preferred in most cases if it is an option, but you have to read the man page first or it will bite you, especially if you have heavily internalized the way