Mergiraf: Syntax-Aware Merging for Git

- Related in fine-grained diffing approach: Git heatmap: diff viewer for code reviews

> Heatmap color-codes every diff line/token by how much human attention it probably needs. Unlike PR-review bots, we try to flag not just by “is it a bug?” but by “is it worth a second look?” (examples: hard-coded secret, weird crypto mode, gnarly logic).

https://0github.com/

Hmm, it would be nice to just see a heatmap over how many times a line has been changed. There must be some easy-ish way to do that right?

I think you'd need to write a tool that goes through all revisions of a file and keeps a count, but if that's cached then it's doable. There are a few tools to view that by file though, including some Git commands. It's a valuable way to determine which files are edited the most (see also the word "churn").
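For the per-file version, here is a minimal sketch that sums added and deleted lines per path from the output of `git log --numstat --format=`. The function name `churn_by_file` is made up for illustration, and this counts churn per file, not per line, so it is only the coarse variant of the heatmap idea.

```python
from collections import Counter


def churn_by_file(numstat_output: str) -> Counter:
    """Sum added + deleted lines per path from `git log --numstat --format=` output."""
    churn = Counter()
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank lines and anything that is not a numstat row
        added, deleted, path = parts
        if added == "-" or deleted == "-":
            continue  # binary files are reported as "-<TAB>-<TAB>path"
        churn[path] += int(added) + int(deleted)
    return churn
```

To run it on a real repo, something like `subprocess.run(["git", "log", "--numstat", "--format="], capture_output=True, text=True, check=True).stdout` should give you the input string. Renamed files show up under their rename notation, so they are counted separately here.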

The idea is cool but boy does it make you blind to anything the AI doesn't deem noteworthy. Comes down to whether you trust a human reviewer more, or the LLM

Have been using Mergiraf for the past 4 months. It's automatically solved about 70% of my conflicts and, luckily, I've never contested any of them. Pretty pleased.

This is my experience as well. Not a gamechanger, but definitely on the positive side.

> luckily, I've never contested any of them.

That's to be expected. The philosophy behind git merges is that it will merge only if it is absolutely and unambiguously sure that the resolution is correct. That's when there is only one solution for the merge. It will just throw its hands up and leave it to the developer if there is any ambiguity - that's if there's more than one way to do the merge.

Every single chunk of merge is a potential conflict. But have you ever contested the regular merge algorithm (ort by default) when it did work? Like when the merge was fully successful, or the successfully merged chunks within a conflicted merge? You can expect the same experience with any merge algorithm that sticks to the git philosophy of being a git [1]. Problems will happen only if they start using some complex heuristics or LLM or something unpredictable like that for the merge.

> It's automatically solved about 70% of my conflicts

At the risk of explaining the obvious, I'm going to try to explain this. (So please don't get angry at me if you already know this.) Imagine that you're trying to manually merge 2 branches without any sort of merge algorithm. For the first case, just assume that you don't know the programming language (imagine that it's in some foreign script). All you have to go by is the record of when each line was added in each branch. The best 'dumb' strategy you have to go with, is the 3-way merge [2]. The referenced page illustrates this. It clearly shows you the advantage of the 3-way merge algorithm over the traditional 2-way merge that we all are familiar with.
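The core rule of the 3-way merge fits in a few lines. This is only a sketch under a big assumption: the three versions are already aligned line by line (same length, no inserted or deleted lines). Real tools run a diff first to do that alignment, and git additionally treats edits on adjacent lines as a conflict, which this sketch does not. The names `merge_chunk` and `merge_aligned` are made up.

```python
def merge_chunk(base, ours, theirs):
    """Return (merged, conflict) for one chunk using the 3-way rule."""
    if ours == theirs:
        return ours, False    # both sides agree (or neither changed it)
    if ours == base:
        return theirs, False  # only their side changed it
    if theirs == base:
        return ours, False    # only our side changed it
    return None, True         # both changed it differently: ambiguous


def merge_aligned(base, ours, theirs):
    """Apply the rule line by line to three equal-length lists of lines."""
    merged, conflicts = [], []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs)):
        line, conflict = merge_chunk(b, o, t)
        if conflict:
            conflicts.append(i)
            merged.append(o)  # placeholder: keep our side, flag the index
        else:
            merged.append(line)
    return merged, conflicts
```

Without the base version you only have the first check (`ours == theirs`), so every difference looks like a conflict. That is the advantage the referenced page illustrates.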

But this method still has a disadvantage. You are looking at the source files simply as a bunch of lines, without the knowledge of their more granular structures like the syntax. (Note: That assumption itself may be wrong. That's why merges and git in general don't work well on binary files.) At best, all you can hope for is that the two branches don't contain any edits on the same or the adjacent lines. You won't even know the order in which the lines should be arranged. Now you have a conflict - a merge that you're leaving for someone else to solve.

Now assume a second case. You know the programming language this time. But you have no idea what the program does - it's not your project. Even with that limitation, you'll still be able to do a better job than just comparing the lines blindly. The Mergiraf docs have a page full of these examples [3]. You can see how obvious the merges look - there is no way you can go wrong. See if you can resolve them just by looking at the lines. That's why mergiraf gives you much better performance without any errors.

There is of course a deeper level of knowledge - the semantic level. The knowledge of what the program does. You need that knowledge to resolve 100% of the merges. And that ultimate merge algorithm is ... you.

> Pretty pleased.

Understandable. But I see a potential problem here. As you are aware, the files to submit to mergiraf are specified in the gitattributes file. There are two ways this can go wrong. First, someone else with your repo may not have or even know about mergiraf. The second, even bigger problem is that some people have global gitattributes files [4] where they place their default attributes. It's possible to set up mergiraf there. But if you do so, your colleagues may not even get a clue as to why certain merges succeed for you, but fail for all of them.

The above problem becomes a bigger issue because merge and rebase conflicts sometimes reappear in later merges or rebases. If that's something mergiraf can solve and you have it, then everything's fine. But if the conflict reappears for someone without mergiraf, they will have to repeat the manual resolution again and again. This happens because git simply won't commit a merge or rebase until we resolve the conflict manually. Therefore, git has no idea what we did in between to resolve it - that is not recorded anywhere. (Well, git-rerere [5] records it if we ask it to. But that's a local-only solution. Everyone will have to do it once on their system.)
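To make the "record it once, replay it later" idea concrete, here is a toy sketch. It is not git-rerere's actual storage format or matching logic, just the general shape: key a resolution by the two conflicting sides, and look it up when the same conflict shows up again. The class `ResolutionCache` and its methods are hypothetical.

```python
import hashlib


class ResolutionCache:
    """Toy in-memory store mapping a conflict to the resolution chosen for it."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(ours: str, theirs: str) -> str:
        # Hash both sides together; the NUL byte keeps ("ab", "c") != ("a", "bc").
        payload = ours.encode() + b"\0" + theirs.encode()
        return hashlib.sha1(payload).hexdigest()

    def record(self, ours: str, theirs: str, resolution: str) -> None:
        self._store[self._key(ours, theirs)] = resolution

    def replay(self, ours: str, theirs: str):
        # Returns None when this conflict has never been resolved before.
        return self._store.get(self._key(ours, theirs))
```

Because the store lives on one machine, a colleague hitting the same conflict gets `None` back, which is exactly the local-only limitation described above.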

There is actually a known solution to the problem. It's called 'first class conflicts' [6]. The idea is to record the conflicts and their resolutions in the repo itself (the same info that rerere stores, but in the shared repo). This means that a conflict once resolved will not come back again, because the structured information to resolve it is available in the repo. This means not everyone needs mergiraf and nobody needs to repeat a completed manual resolution. It has other advantages too. You can just continue working after a conflicted merge and leave the resolution for later. Or you could send the conflicts to someone else more specialized in that area of the code.

I have seen this feature in Jujutsu [6] and Pijul [7]. Git doesn't have it probably because this wasn't around when it was developed. But Jujutsu uses git repository format and they somehow managed to implement first-class conflicts on it. Meanwhile, the concept is already there in git as rerere. So perhaps first-class conflicts are possible in Git too. It would be awesome if we had that in Git too. So if anybody who sees this knows how to do it, please please take it up as a wish!

[1] https://github.com/git/git/blob/e83c5163316f89bfbde7d9ab23ca...

[2] https://blog.git-init.com/the-magic-of-3-way-merge/

[3] https://mergiraf.org/conflicts.html

[4] https://git-scm.com/docs/git-config#Documentation/git-config...

[5] https://git-scm.com/docs/git-rerere

[6] https://jj-vcs.github.io/jj/latest/conflicts/

[7] https://pijul.com/manual/why_pijul.html#modeling-conflicts

fyi, comes configured in jj by default. Just `jj resolve --tool mergiraf` and some conflicts go away :)

Very impressive enhancement. Not a panacea though. It uses a tree-sitter based approach to handle situations where two users change the same line of code. For example, one changes a function name and the other adds a new argument. It will merge that without conflicts. It still has some trouble with complex cases, since it doesn't know the authors' intentions. But it can significantly simplify developers' lives. Not sure if it would land in git very soon. That would require git to know all the parsers you need. But definitely worth adding.

This is a separate tool that one can tell git to use.

> After extracting a list of every merge conflict in the kernel's Git history, I tried using Mergiraf to resolve them. 6,987 still resulted in conflicts, but 428 were resolved successfully. A much larger fraction of merge conflicts were still partially resolved.

Take this with a grain of salt as I haven't tested this claim, but I think C might be a pretty weak language for this tool because you can't really parse it without running the whole preprocessor, which it can't do:

https://codeberg.org/mergiraf/mergiraf/issues/612#issuecomme...

So I think in a more sensible language you might get much better results than this.

Another aspect is the fact this repo reflects Torvalds’ view of the world. He operates in large-ish changesets.

I wish there were a lot more syntax aware merges built into git (et al). Why are separate columns on the same row of a CSV or multiple appends to a list (in any language you don't want a trailing comma) so annoying to merge?

It could be so much better.
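For the CSV case, a structure-aware merge can be as simple as applying the usual 3-way rule per cell instead of per line. A rough sketch, assuming the rows are already parsed into lists and all three versions have the same number of columns (the function name `merge_row` is made up):

```python
def merge_row(base, ours, theirs):
    """3-way merge one CSV row cell by cell.

    Returns (merged_row, conflicting_column_indices).
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("sketch assumes all three rows have the same number of columns")
    merged, conflicts = [], []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t or t == b:
            merged.append(o)  # sides agree, or only our side touched this cell
        elif o == b:
            merged.append(t)  # only their side touched this cell
        else:
            merged.append(o)  # both changed it differently: keep ours, flag it
            conflicts.append(i)
    return merged, conflicts
```

With this, edits to separate columns of the same row merge cleanly, while a line-based merge would report the whole row as a conflict.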

This is a very interesting idea that could save a lot of time and pain in big projects.

The example shown reminds me of Zed's CRDTs [1], and their journey to build a fine-grained version control system for agentic development [2]. I imagine this work could prove useful to the Zed/Cursor team, and likely shares a lot of functionality with DeltaDB [2].

- [1]: https://zed.dev/blog/crdts

- [2]: https://zed.dev/blog/sequoia-backs-zed

I’m pretty sure one of the Zed founders wrote tree-sitter, so I’m sure there’s some overlap

It’s really cool to see tree-sitter unlock so many of these use cases. I love using [difftastic] for my diffing tool to get context aware diffs. So in the example from the article, the diff would highlight the `void` and `int` changes with a heavier background of red and green respectively

[difftastic]: https://github.com/Wilfred/difftastic

Max Brunsfeld in fact, yep. He went along to Zed from the Atom team.

But curiously Zed hasn't been very interested in Tree-sitter. They don't seem to see it as having much strategic value to their company, which is odd because lots of other people do see it as a valuable platform. You have Tweag building code formatting on it, you had GitHub building stack graphs on it, you have Mergiraf. You even have some really "out there" stuff like the Software Evolution Library!

I really liked the last section of your article, thanks for the numbers

Way back in the day when I primarily wrote C# I used to use a tool called SemanticMerge. It was pretty cool, it actually parsed the code and could pick up refactors like moving a method to a different class and whatnot. This kinda reminds me of that a bit.

Yeah, the article mentions a similar project for Java; I'm a bit surprised / disappointed that there's no more language specific merge tools tbh, or a super-tool that has plugins for individual languages. Maybe this article will attract more attention though.

Very interesting to see Tree-sitter starting to get used for more things.

finally...

I've been using 1-arg-1-line to avoid most conflicts

I've been doing some SQL again and one technique I learned years ago was having each thing on its own line, both to reduce churn in version control and allow for easier reordering and commenting out.

Instead of

    SELECT foo, bar, quux FROM baz WHERE storge = 'grault';

you write

    SELECT
       foo
      ,bar
      ,quux
    FROM
      baz
    WHERE
      storge = 'grault'
    ;

It's pretty hideous in this example but for bigger queries maintained over a long period of time it can be beneficial. I assume, it's been nearly 20 years since I did anything more serious with SQL.

claude "resolve merge conflicts"

Using 30s worth of H100 GPU instead of <10ms worth of an entry-level CPU, for a worse result.

Well done.

"Compositing text into graphical data to display it on a 2D array of millions of 32bit RGB pixels instead of just using a pencil and a 50 cent notebook."

Actually I've done this a hundred times now and it has yet to make a single mistake. I don't give a crap how much GPU it uses, grandpa.

OK, I'm going to try and resolve these merge conflicts for you!

First, let me pull up the diff and git status

......

....

...

Hmm, that didn't quite work, let me try that again!

I've resolved hundreds of conflicted merges this way and I don't remember it making a single mistake.