Bit-twiddling optimizations in Zed's Rope

I really like all the blog posts and videos the Zed team has put out, thank you if you’re reading this!

Unrelated to this specific post I’m such a fan of Zed. It’s the first feature complete text editor in recent memory that I’ve truly enjoyed using (i.e. it stays out of the way, is really fast, feels well engineered). I’m coming to Zed after years of Emacs which I still have love for but no longer feels like a competitive piece of software (it does not take full advantage of how good computers are today, e.g. gpu rendering or multicore). I really hope Zed stays a fast and lightweight text editor instead of becoming some bloated growth-at-all-cost VC ware (not that they’ve exhibited any signs of that happening). I’d also happily pay for Zed without a subscription based thing for access to LLM features (which I do not use).

> it does not take full advantage of how good computers are today, e.g. gpu rendering or multicore

Why does Emacs need that though? I hear people say this all the time and I don't get it. Multicore kind of works against the structure that Emacs touts as a feature. And GPU rendering? In many applications, I totally agree with these complaints. But it's a text editor.

I tried Zed myself, and it's good. But it doesn't dethrone Emacs (for me personally).

> But it's a text editor.

Long time emacs user here (+20 years, yikes). I've used it on all kinds of computers during this time. Even a relatively modest computer from 2024 is an absolute beast compared to something from the year 2000.

With that said, there are text editing operations that can cause it to grind to a complete halt, especially working with large files (or very long lines). And it shouldn't, you know? I think emacs users sort of internalize which operations they should avoid. It's kind of ridiculous to have to do that in a text editor with the massive amounts of compute that our computers have today.

> (or very long lines)

Long line handling has greatly improved in emacs-29. Multi-megabyte lines are not a problem anymore.

from Emacs 29.1 NEWS file, https://raw.githubusercontent.com/emacs-mirror/emacs/refs/he...

    ** Emacs is now capable of editing files with very long lines.
    The display of long lines has been optimized, and Emacs should no
    longer choke when a buffer on display contains long lines.  The
    variable 'long-line-threshold' controls whether and when these display
    optimizations are in effect.

    A companion variable 'large-hscroll-threshold' controls when another
    set of display optimizations are in effect, which are aimed
    specifically at speeding up display of long lines that are truncated
    on display.

> Multicore kind of works against the structure that Emacs touts as a feature.

I have consistent issues with emacs locking up when executing network requests. I'm sure there's a specific bug that could be hunted down and addressed, but this sort of thing shouldn't happen much in an editor that's multicore by default.

I'm not trying to dismiss emacs' reasoning, of course, but I can understand being disgruntled with it.

The actual rendering I've been quite please by, though!

Yeah this is one reason, or Emacs freezing for up to a minute when updating packages. Also when using an LSP I notice latency.

I use Emacs GUI (outside of the terminal) and comparing performance for rending to something like Zed or Sublime is definitely noticeable. It’s great that Emacs is so resource efficient but sometimes I wish it used more of my beefy computer(s).

Like I said I still love Emacs and it’s okay for it to make a different set of trade-offs. I honestly didn’t think I’d ever switch editors but here we are!

Removing the interpreter lock for a few specialized tasks (without sweeping runtime changes to Emacs) would be enough to fix most of these issues -- parsing JSON from process output into lisp data in a background thread is one candidate. [1]

Installing packages does not need to block either, there is no architectural limitation here. The Elpaca package manager for Emacs provides async, parallel package updates. Loading packages into the Lisp image will block though, there's no way around that.

The other big source of input lag is garbage collection, and there are some ongoing efforts to use the MPS library in Emacs for a copying, concurrent GC. This is a big change and I don't know if this experiment will go anywhere, but Eli Zaretskii and co are trying.

[1]: https://github.com/emacs-lsp/emacs

Someone made a alternative way of optimizing LSP by converting the JSON to elisp bytecode on the fly: https://github.com/blahgeek/emacs-lsp-booster

That's fair I guess. In the case of IO that can be an issue. When I hear multicore, I assume we're talking about parallelism, not concurrency.

As for LSP, other than the Nim langserver, I've been quite satisfied with Eglot's performance. I'm curious what your setup is like. To be fair, I'm running a highly customized Doom Emacs config, so it's possible I inherited some non-vanilla optimizations I'm not aware of.

> When I hear multicore, I assume we're talking about parallelism, not concurrency.

Parallelism trivially enables concurrency. The lack of parallelism magnifies the issues emacs has with concurrency.

“Need” is strong but using GPU rendering is definitely better than CPU rendering and makes things feel very snappy

Most new TTY projects use GPUs for that reason

GPU rendering simplifies smooth text scrolling which used to be a thing on some dumb terminals and microcomputers like Amiga that supported it in hardware. Most emulators are locking character cells on a fixed grid and we miss out on such niceties.

AMIIIIGAAAAA!

I really want to admire the Zed folks even though they are a new faction in the editor wars where I’m an emacs guy: they take mostly all the things I care about seriously.

The are serious about local vs. remote vs. shared. They are serious about hardware acceleration because they care about users who type fast and edit big files. They care about real computer science on how to push the current hardware to serve the user, rather than treating the hardware as a crutch to give the user a slightly worse experience at a fraction of the development cost. They care about code highlighting the snippets on their blog like very similar to the default Zed theme.

These are cool, serious people. If they give me emacs key bindings with full paredit-everywhere, I might switch my daily driver.

And this is about using modern SIMD-style stuff, branch less stuff, Lemire stuff in a modern and relevant context. Other commenters have pointed out that you can do the mask or popcount better with intrinsically, and yeah, but they probably know that, and B. They got the massive win, the remaining factor of K on the pop count is coming.

Long Zed.

Why not store just a small u8 count of newlines in a chunk instead of their u128 positions and then only loop through the last chunk for precision?

You don't need information about the position of newlines in all the chunks located before the one your offset lands on

As I understand it, they do exactly what you say. TFA is about optimizing the last chunk's loop.

Maybe I got confused, but how do they then count the newlines in all the previous chunks? That information is still needed to calculate the line for a specific position in the last chunk

It's not you who's confused, it's that this part of the process is not described. We only have some hints, here:

> If you called rope.offset_to_point(7234), the Rope would traverse its SumTree to find the Chunk that contains offset 7234 and then, on that Chunk, it would call offset_to_point again.

And here:

> while the Rope can get us to the right Chunk in O(log(n))

I would guess that each node of the SumTree includes the number of newlines of all the chunks before it.

It's outlined in their previous post on the Rope/SumTree data structure they use, which this article links to: https://zed.dev/blog/zed-decoded-rope-sumtree.

SIMD can work quite well here too - with 128-bit SIMD (available on both baseline x86-64 and aarch64) this can be just ≤8 loop iterations checking for the newline character (each iteration counting the number of newline characters encountered, and a lzcnt on the last iteration), and similar for characters (assuming valid UTF-8, it's a single comparison to test if a byte starts a new char).

  >  // Parallel bit count intermediates
  >  let a = v - ((v >> 1) & (u64::MAX / 3));
  >  let b = (a & (u64::MAX / 5)) + ((a >> 2) & (u64::MAX / 5));
  >  let c = (b + (b >> 4)) & (u64::MAX / 0x11);
  >  let d = (c + (c >> 8)) & (u64::MAX / 0x101);

That "parallel bit count" is almost certainly slower than using two POPCNT instructions on a modern cpu. Should just call __builtin_popcount() and let the compiler do it the most optimal way. Luckily, people do this sort of thing so often that many modern compilers will try (and often succeed) to detect you trying this insanity and convert it to a POPCOUNT (or a pair of POPCOUNTs as the case may be here)

What that code does is a per-byte-pair popcount, which is not what the POPCNT instruction does (it computes the popcount for the whole word).

On processors with BMI2 the whole algorithm reduces to a PDEP as mentioned in another comment, but if you don't have that this is pretty much the best you can do (unless you use lookup tables but those have pros and cons).

Which compilers support __builtin_popcount()? From memory, it's a gcc extension. If the compiler selects a CPU POPCOUNT instruction for it, are you sure it will work on all machines that you want to run it on?

The above code is completely source- and binary-portable and reasonably fast -- certainly faster than naively looping through the bits, and within a small constant factor of a CPU POPCOUNT instruction.

Everything supports __builtin_popcount or some variant these days (__popcnt for MSVC). It's a complete non-issue, really.

And the compiler is not required to lower it to a single instruction. It will if the target architecture is specified appropriately, but there's nothing that says it has to explode if it can't. In fact, by doing it this way, the compiler is actually more free to generate code in a way that's optimal for the architecture in all cases, because all the implementation details are hidden e.g. loads of large constants may be avoided if the compiler is allowed to choose the exact implementation, while using the portable version may tie its hands more depending on how it's feeling on that day. Here's __builtin_popcount working just fine while targeting a ~20yo architecture without native support for SSE4.2; it can generate this code knowing what the proper instructions and schedules are: https://godbolt.org/z/ra7n5T5f3

The moral here is that the primitives are there for you to use. Just use them and save yourself and your would-be code reviewer's time.

Most vaguely recent compilers will convert naively looping through bits into a native POPCOUNT instruction. The parallel bit count algorithm was not reliably detected until more recently and therefore would sometimes produce unoptimized code, though current versions of gcc/clang/msvc can all detect it now.

Also, pretty much every compiler for a very long time has supported __builtin_popcount or equivalent.

Your compiler will know the best way to popcount, that is the point of that builtin. It'll use the best method - sometimes this one. GCC does this, MSVC does this, clang does this, i think even rust has some way to do it (EDIT: it does: count_ones()). On archs which lack POPCNT, it will use this method or another, based on knowing the target. On x86 this approach is OK as is. On arm64, for example, it will be suboptimal due to all the literals needed. On armv6m, this method is bad and table lookups are faster.

Note that by default rustc targets x86-64-v1 when compiling for x86-64, and that lacks the popcount instruction. You need to change the target_cpu to at least x86-64-v2 or enable the popcount target_feature. This means that even if your cpu is relatively new and you intend to run your code on relatively new cpus, rustc will still generate older and slower code for count_ones() using bitshifts and masks. That said, I don't see the point in writing them manually if the compiler can generate them for you.

It's not unreasonable to think that Rust will change the minimum version and you should always override the target cpu anyway for C++-like toolchains when building production binaries (`-Ctarget-cpu` for rust, `march=` for clang/gcc).

I once wrote that algorithm, divided into single lines, intending each line to be a single 64-bit ARM instruction. The compiler did idiom detection, transforming it to "builtin popcnt" and (because 64-bit ARMv8.0 lacks a POPCNT instruction) back to the same algorithm. Only that the emitted code was one instruction larger than my code.

64-bit ARM's actually has a very peculiar encoding of immediates to arithmetic instructions. It supports only recurring bit patterns such as used by this algorithm. For example "add x2, x3, #3333333333333333" is encoded as one four-byte instruction.

> because 64-bit ARMv8.0 lacks a POPCNT instruction

It does have this: https://developer.arm.com/documentation/ddi0596/2021-09/SIMD...

And GCC happily uses it https://godbolt.org/z/dTW46f9Kf

> Which compilers support __builtin_popcount()?

Clang supports __builtin_popcount() too. And MSVC has __popcnt().

In addition to the other comments, the iso C23 standard added the <stdbit.h> header to the standard library with a stdc_count_ones() function, so compiler support will become standard.

The only one that exists. This is rust and rustc supports popcnt on all platforms that have it and emulates it on those that don't

I guess it depends how many different compilers you are targeting but generally speaking, compilers look for trivial implementations of popcount and can replace it with the single instruction.

However, as someone already mentioned, this is not equivalent to a pair of popcounts.

nth_set_bit_u64: wouldn't that be __builtin_ctzll(_pdep_u64(1<<n, v)) with BMI2?

That's my guess as well.

Bitstring rank/select is a well-known problem, and the BMI and non-BMI (Hacker's Delight) versions are available as a reference.

That's assuming you're ok with your program not running on some older cpus.

That and that you're not willing to entertain splitting the manual version as #[cfg(not(target_feature = "bmi2"))] fallback implementation. For something already down to ~ 1 ns both of those may well be very reasonable assumptions of course.

AMD machines prior to Zen 3 had a micro-coded implementation of pdep and pext, so they're actually relatively expensive for those earlier Zen machines (as well as Bulldozer). Some people still have Ryzen 3000 series chips.

On the Intel side, pdep has been fast since its release with the Haswell in 2013, so pretty much everyone using Intel should be fine in this regard.

I believe the equivalent ARM64 instructions are in SVE2 which isn’t yet supported on Apple’s M-series chips as of M4, sadly.

> Turns out CPUs are pretty good with zeros and ones.

I've been saying this a lot recently, that CPU's are powerful 1-bit vector machines!

Is there a way to adjust text contrast in light mode in Zed yet? The editor is unfortunately unusable for me, because of how washed out the colors are.

There are themes now and a UI to install them, I also didn’t like the washed out colors.

Moved to zed a while ago. I’m using a light theme - Xcode low key now (With some settings to turn off the bold fonts).

Have you found a light theme you're happy with? Which one?

Does anyone know of any good presentations about the Zed architecture and internals around?

I know that they have some recordings from their coding session, but I am looking for something more overall.

This is all really cool, but I just want soft wrap to be fixed

> opacity: 0;

> filter: blur(1px);

Wonderful styling!

Isnt the tab example wrong? Id assume it to be

aa -> -> bb -> -> bb

It only takes up two spaces, after all