
Parsync, a tool for parallel SSH transfers – 7x faster than rsync

The claim of being 7x faster than rsync is very dubious. I would like to know the test conditions for such a result.

I use rsync over SSH every day, and even between computers that are 7 to 10 years old it reaches the maximum link speed over 2.5 Gb/s Ethernet.

So in order to need something faster than rsync, and to be able to test it, one must use at least 10 Gb/s Ethernet, and there I do not know how fast a CPU must be to reach link speed.

For 7x faster, one would need at least 25 Gb/s Ethernet, and that is the worst case for rsync, i.e. assuming it were no faster on higher-speed Ethernet than what I see on cheap 2.5 Gb/s Ethernet.

If on higher-speed Ethernet the link speed cannot be reached because an ancient CPU is too slow for AES-GCM or for AES-CTR with UMAC, then using multiple connections would not improve the speed either. If the speed is not limited by encryption, then changing TCP parameters, like window sizes, would probably have the same effect as using multiple connections, even when using just rsync over ssh.
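To make the window-size point concrete, here is the bandwidth-delay product arithmetic. The link speed and round-trip time below are illustrative assumptions, not figures from this thread:

```shell
# A single TCP connection can only keep a link full if its window
# is at least bandwidth x round-trip time (the bandwidth-delay product).
# Assumed example figures: 2.5 Gb/s link, 50 ms WAN RTT.
awk 'BEGIN { printf "%.0f bytes\n", 2.5e9 * 0.050 / 8 }'
```

If the configured window is smaller than this (about 15.6 MB in the example), one connection cannot fill the pipe, and either raising the window or opening parallel connections recovers the throughput.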

If the transfers are done over the Internet, then the speed is throttled by some ISP and is not determined by your computers. There are some cases where a small number of connections, e.g. 2 or 3, may have a higher aggregate throughput than 1, but in most cases that I have seen, the ISPs limit the aggregate throughput of the traffic going to one IP address, so if you open more connections you get the same throughput as with fewer.

adrian_b, a day ago

> I use every day rsync over SSH, and even between 7 to 10 years old computers it reaches the maximum link speed over 2.5 Gb/s Ethernet.

What are you rsyncing? Is it Maildirs for 5000 users? Or a multi-TB music and movie archive? The former might benefit greatly if the filesystem and its flash backing store are bottlenecking on metadata lookup, not bandwidth. The latter, not so much.

I too would like to know the test conditions. This is probably one of those tools that is lovely for the right use case, useless for the wrong one.

i_think_so, a day ago

Maildirs too, though not for that many users, so usually only a few thousand files are transferred; more often, some big files of tens of GB each are transferred.

The syncs are done most frequently between (Gentoo) Linux using XFS and FreeBSD using UFS, both on NVMe SSDs (Samsung PRO).

As I have said, on 2.5 Gb/s Ethernet the bottleneck is clearly the network link, so rsync, ssh, sshd and the filesystems are all faster than this even on old Coffee Lake CPUs or first-generation Epyc CPUs.

The screen capture in the linked parsync repository shows extremely slow transfer speeds of a few MB per second, which seems possible only when there is no local connection between the computers and rsync runs over the Internet. In that case the speed is greatly influenced by whatever policies the ISP uses to control the flow, and much less by what your computers do. Over a local connection, even older 1 Gb/s Ethernet should show a constant speed of around 110 MB/s for all files transferred by rsync.
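As a sanity check on that ~110 MB/s figure, the theoretical TCP payload ceiling of 1 Gb/s Ethernet can be computed from the framing overhead (1460 payload bytes per 1538 bytes on the wire at the standard 1500-byte MTU):

```shell
# TCP payload share of a 1 Gb/s line rate: 1460/1538 of the raw rate.
# On the wire: 1500-byte MTU + 38 bytes of Ethernet framing, preamble
# and inter-packet gap; payload: 1500 minus 40 bytes of IP+TCP headers.
awk 'BEGIN { printf "%.1f MB/s\n", 1e9 / 8 * 1460 / 1538 / 1e6 }'
```

ssh framing and disk activity shave off a little more, which lands right around the ~110 MB/s observed in practice.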

When the ISP limits the speed per connection without limiting the aggregate throughput, then indeed transferring many files in parallel can be a great win. However, the ISPs I interact with have not done such a thing in decades; they limit the aggregate bandwidth, so multiple connections do not increase the throughput.

adrian_b, a day ago

Anecdote: I have rsync’d maildirs and I recall managing a ~7x perf improvement by combining rsync with GNU parallel (trivial to fan out on each maildir)
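The fan-out pattern is simple. Here is a runnable sketch with `echo` standing in for the real transfer command; the rsync invocation shown in the comment is hypothetical, and GNU parallel does the same job as `xargs -P` with nicer job control:

```shell
# One job per maildir, up to 4 running at once.
# In practice the per-directory command would be something like
#   rsync -a /var/mail/{}/ backup:/var/mail/{}/    (hypothetical paths)
printf '%s\n' alice bob carol dave |
  xargs -P4 -I{} echo "syncing {}" |
  sort
```

With `-P4` the jobs finish in nondeterministic order, hence the `sort` at the end of the demonstration pipe.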

wolttam, a day ago

Awww yeah. +1 for GNU parallel.

When I think of those obscenely ugly scripting hacks I used to do back in the day....

"Well, trust me, this way's easier." -- Bill Weasley

i_think_so, a day ago

I've used parsyncfp2, which I think is just another implementation of the same thing, and I've definitely seen a 2x-3x throughput improvement when transferring over large distances.

As you mentioned, it definitely depends on how the ISP handles traffic.

I have yet to try but I've heard good things about hpn-ssh as well.

magixx, a day ago

This is less of a usable tool and more of a concept right now, but there are algorithmic ways to do better than rsync (for incremental transfers, ymmv).

https://github.com/google/cdc-file-transfer

Hint: I really like the animated gifs on that page but they are best viewed frame-by-frame like a presentation.

ilyagr, a day ago

Why not tar.gz and send as a single stream?

nurettin, 2 days ago

Because (afaik) the single-threaded ssh program is the bottleneck.

exceptione, 2 days ago

It used to be possible in openssh to use -c none and skip the overhead of encryption for the transport (while retaining the protection of rsa keys for authentication). Even the deprecated blowfish-cbc was often faster than aes-ni for bulk transfers. I remember cutting off hours of wait time in backup jobs using these options.

Sadly it appears those days are gone now. 3des is still supported, probably for some governmental environments, but it was always a slower algorithm. Unless there are undocumented hacks I think we're stuck with using proper crypto. Oh darn.

i_think_so, a day ago

If Blowfish was faster than AES, then it is certain that either the CPU did not support the AES instructions (AES-NI, i.e. AES New Instructions), or the ssh program, at either the client or the server, was compiled without support for the AES instructions.

Blowfish is many times slower than AES on any CPU with AES instructions. Old CPUs with AES support needed around 1 clock cycle per byte, but many modern CPUs need only around 0.5 clock cycles per byte, and some CPUs are even twice as fast as that.

0.5 clock cycles per byte means 10 GB/s = 80 Gb/s of encryption throughput per core at 5 GHz, so only for 100 Gb/s or faster Ethernet might you need to distribute encryption across multiple cores to reach full network link throughput.
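The arithmetic behind that figure, taking the 0.5 cycles/byte and 5 GHz numbers above as given:

```shell
# cycles per second divided by cycles per byte = bytes per second.
awk 'BEGIN {
  bps = 5.0e9 / 0.5   # 5 GHz core at 0.5 cycles/byte
  printf "%.0f GB/s = %.0f Gb/s\n", bps / 1e9, bps * 8 / 1e9
}'
```

Whether real sshd sessions sustain this rate end to end is a separate question (a sibling comment asks for measurements); this is only the per-core cipher ceiling implied by the cycle count.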

For full AES speed, one must not use obsolete modes of operation like CBC or obsolete authentication methods like HMAC. For maximum speed, one must use either aes128-gcm@openssh.com or aes128-ctr + umac-64@openssh.com.

For increased security at lower speed, one can use aes256-gcm@openssh.com or aes192-ctr or aes256-ctr with umac-128@openssh.com.

In general, one should never use the default ssh configuration for cipher and MAC algorithms; one should delete all obsolete algorithms and allow only the few without known problems, unless one has to connect to legacy systems.
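A hedged sketch of what that looks like in `~/.ssh/config`; the host alias is a placeholder, and the algorithm lists are just the ones named above:

```
Host fasthost
    # Fast modern set only; everything obsolete is dropped.
    Ciphers aes128-gcm@openssh.com,aes256-gcm@openssh.com,aes128-ctr,aes256-ctr
    MACs umac-64@openssh.com,umac-128@openssh.com
```

The MACs line only matters for the CTR ciphers; GCM modes carry their own authentication.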

adrian_b, a day ago

I don't remember which generation of Sparc CPU we were working with, but yes, Blowfish was faster. I made a matrix of relevant options and benchmarked all combinations.

i_think_so, 18 hours ago

Do you have citations for sustained 0.5 core-cycles/byte (80 Gb/s)? The benchmarks I have seen are closer to ~20-30 Gb/core-s, though I have heard claims of 40-50 Gb/core-s.

Veserv, a day ago

`-c none` hasn't worked in SSH for at least a decade.

The `none` option was for SSHv1, which was already quite old when it was fully removed in OpenSSH 7.6 in 2017[1].

https://www.openssh.com/releasenotes.html

mmh0000, a day ago

Now that you mention it, I think that probably was around the last time I used it. Time flies.

i_think_so, a day ago

It is a bottleneck for multiple files, but will it speed things up for a single file? This is how we sent files for decades: archive, transfer, unarchive. So I'm wondering what the point is.
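For reference, the classic single-stream idiom looks like `tar -czf - src/ | ssh remotehost 'tar -xzf - -C /dest'` (host and paths are placeholders). The sketch below runs the same pipe locally so it is self-contained:

```shell
# Archive, transfer (here just a local pipe), unarchive.
src=$(mktemp -d); dst=$(mktemp -d)
echo hello > "$src/file.txt"
tar -C "$src" -czf - . | tar -C "$dst" -xzf -
cat "$dst/file.txt"   # the copied file arrives intact
```

In the remote version, the pipe carries exactly one TCP/ssh stream no matter how many files are inside the archive, which is the single-stream property being discussed.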

nurettin, a day ago

It depends on the size of the file, of course. For copying your 90 line .bashrc, probably not noticeable in the noise. For copying an 800GB database? Um, yeah. :-)

I see this project's main value in turning loose the power of multiple cores on a filesystem full of manifold directories, backed by flash based storage that only runs optimally at queue depth >1 (which is most of them). On spinning rust this will probably just thrash the heads.

Hmmm. I wonder how 2 or 3 threads perform with zfs and a reasonable sized ARC?

i_think_so, a day ago

An 800 GB database is still a single file; if it only parallelizes at the file level, you shouldn't see a difference.

nurettin, 19 hours ago

We definitely need more information. 9 or 10 years ago, under Solaris/Sparc with 1000BaseT connections, things were quite different than even the most boring and average Linux environment today.

This discussion should drive home the importance of not listening to conventional wisdom and presuming optimal performance without actually testing your options. And particularly, don't presume that a blog post (or an LLM that scraped it) knows what's right for your particular use case.