Mergiraf 0.4 has been released. The project develops a merge driver for Git that implements three-way merging. Mergiraf can resolve various types of merge conflicts and supports a range of programming languages and file formats. It can either be called separately to handle conflicts that arise when working with standard Git, or replace Git's merge handler to expand the capabilities of commands such as merge, revert, rebase and cherry-pick. The code is distributed under the GPLv3 license. The new version adds support for Python, TOML, Scala and TypeScript, and also improves performance.
Below is a detailed description of the problems solved by mergiraf:
Software is a prime example of an extremely complex system. Complex systems have one thing in common: they are complex, and you cannot expect the desired complex behavior to emerge by chance. Instead, these systems evolve over time, step by step, with each mutation carefully tested. Achieving this requires a well-defined framework and appropriate tools. The evolution of any complex system can be visualized as a directed tree, where the root is an empty set of features, and each node except the root is the result of applying a mutation to its parent.
In the context of products, each node is called a "version," which represents a particular set of features and anti-features. Any change to this set is considered a mutation, forming an edge in our directed acyclic graph. These features are abstract in nature; they do not directly reflect the way physical systems function, but rather represent how intelligent agents perceive the usefulness of those systems. Translating these ideas into real-world implementations requires rolling up your sleeves and diving into *low-enough* low-level details that can be used to express and explain how things work. In software development, these low-level details are typically represented by source code.
To gradually bring source code to a state that exhibits the desired behavior, and to document how it got there, programmers think of their work in terms of snapshots and changesets. A snapshot represents a particular state of the product with all its low-level details, while a changeset represents a transition between snapshots. A snapshot is typically produced by applying a single changeset to its parent, so snapshots are almost always labeled by what the changeset that created them does, and the two terms are often used interchangeably.
Sometimes a snapshot results from multiple transitions; such snapshots are merge commits. They are difficult to work with, so they are usually avoided. Modern open source version control systems like Git provide very basic capabilities for managing development workflows. They allow developers to organize snapshots as directed acyclic graphs, annotate them with comments, and reorder them as needed.
This functionality allows developers to write a semantically meaningful history of a project, which is crucial for debugging and for answering questions like "Why was this low-level detail (e.g. a variable) introduced?", "Approximately what percentage of this project is my contribution?", "Who was hacked by backdoor injections, and when?", and "What low-level change broke this feature (even though it shouldn't have, we checked everything!)?"
Version control systems complement this with the concept of a branch, a low-level concept that simply means a continuous piece of low-level project history that is semantically meaningful to the developer. A branch is typically used for a specific implementation of a feature, and sometimes several branches are created for different candidate implementations of the same feature. With branching workflows, which are the de facto standard in development, each developer can effectively manage many conflicting branches of the project, each differing in maturity or quality. This allows developers to combine their own and others' work without having to re-type everything manually each time.
Typically, there is a main branch representing the "official" product, from which side branches for each feature are branched off. These side branches are regularly (ideally after each commit) synced with the main branch. This allows developers to work on the latest version of the product while implementing the features they are currently developing, and to detect problems caused by other developers as early as possible.
Problems arise when trying to combine the features of different snapshots. Combining boils down to finding a common ancestor and applying the changesets that produced the snapshots sequentially on top of each other, an operation called rebase. A merge is almost the same as a rebase; it just structures the commit graph differently, which makes the graph awkward to manipulate, and this is why merges are being abandoned in favor of rebases. Modern VCSs use built-in merging algorithms that simply split files into separate lines, treat each line as a single symbol and each file as a sequence of such symbols, and then combine the sequences using algorithms that come from bioinformatics.
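To make the "line as a single symbol" idea concrete, here is a minimal sketch of a line-based diff built on a longest-common-subsequence table, the same dynamic-programming scheme used for sequence alignment. It is an illustration only and not Git's actual implementation; the function name line_diff and the sample inputs are made up for this example.

```python
def line_diff(old, new):
    """Return an edit script of (op, line) pairs; op is ' ', '-' or '+'."""
    n, m = len(old), len(new)
    # lcs[i][j] = length of the longest common subsequence of old[i:] and new[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:  # lines are compared as opaque symbols
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the top-left corner to recover the edit script.
    script, i, j = [], 0, 0
    while i < n and j < m:
        if old[i] == new[j]:
            script.append((" ", old[i]))
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            script.append(("-", old[i]))
            i += 1
        else:
            script.append(("+", new[j]))
            j += 1
    script += [("-", line) for line in old[i:]]
    script += [("+", line) for line in new[j:]]
    return script


# Hypothetical inputs: a rename plus a tabs-to-spaces reindent.
OLD = ["foo = 1", "def main():", "\tprint(foo + 2 + 3)"]
NEW = ["bar = 1", "def main():", "    print(bar + 2 + 3)"]

script = line_diff(OLD, NEW)
for op, line in script:
    print(op, repr(line))
```

Only the "def main():" line survives as common; the algorithm has no notion that the other two lines are the same statements with a renamed variable and different indentation.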
Unfortunately, such a line-by-line representation of the source code has nothing to do with its content. Its only advantage is that it is simple and universal. This mismatch leads to conflicts, which are a constant source of headaches for developers. Resolving a conflict requires the developer to carefully study both versions of the code, and not only the sections that the line-by-line comparison algorithm marked as "changed" or "conflicting", but possibly the entire project.
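The gap between lines and content can be illustrated with Python's standard ast module, used here purely as a stand-in for a syntax-aware model (mergiraf is not involved). The two sample sources are hypothetical: they differ in blank lines and in tabs versus spaces, so every indented line differs textually, yet they parse to the same syntax tree.

```python
import ast

# Hypothetical sample: tab-indented, with extra blank lines.
TABS = "foo = 1\n\n\ndef main():\n\tprint(foo + 2 + 3)\n"
# The same program after reformatting: space-indented, blank lines removed.
SPACES = "foo = 1\ndef main():\n    print(foo + 2 + 3)\n"


def same_lines(a, b):
    """Line-level view: compare the files as sequences of lines."""
    return a.splitlines() == b.splitlines()


def same_tree(a, b):
    """Syntax-level view: compare the parsed trees, ignoring positions."""
    # ast.dump omits line/column attributes by default.
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


print("identical as lines:", same_lines(TABS, SPACES))
print("identical as trees:", same_tree(TABS, SPACES))
```

A line-based tool sees two different files here, while a tree-based one sees no change at all.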
The developer must understand the changes, manually write the merged code, and fix any inconsistencies. The problems are compounded when the line-by-line tool misidentifies the changes, which often happens with large changes, including trivial ones like refactoring. If subsequent changes then fail to apply to the manually merged code, the situation becomes a nightmare. Despite these horrific cases, the line-by-line algorithm works most of the time, especially if developers actively try not to cause problems for it. One way to minimize such problems is to require source code to be processed by canonicalizing formatters like black.
Of course, the correct solution to the horrific cases is to use the correct internal model. This applies beyond the horrific cases as well: the line-by-line algorithm is a heuristic, and it can trivially produce non-working code. For example, if one developer renames a variable while another writes new code that uses the old name, there is no merge or rebase conflict, but the result does not work.
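The rename scenario can be reproduced with a simplified line-based three-way merge. The sketch below is an assumption-laden illustration, not Git's algorithm: it uses difflib.SequenceMatcher from the Python standard library to find unchanged lines, and the function names and the three sample versions are made up for this example.

```python
from difflib import SequenceMatcher


def matched_lines(base, other):
    """Map each base line index to the index of its unchanged copy in other."""
    pairs = {}
    matcher = SequenceMatcher(None, base, other, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            pairs[block.a + k] = block.b + k
    return pairs


def merge3(base, ours, theirs):
    """Simplified line-based three-way merge: returns (merged, conflicts)."""
    in_ours, in_theirs = matched_lines(base, ours), matched_lines(base, theirs)
    # Base lines kept by both sides serve as synchronization points.
    sync = [i for i in range(len(base)) if i in in_ours and i in in_theirs]
    merged, conflicts = [], []
    b = o = t = 0  # start of the current chunk in base, ours, theirs
    for i in sync + [len(base)]:  # len(base) is an end-of-file sentinel
        oi = in_ours.get(i, len(ours))
        ti = in_theirs.get(i, len(theirs))
        chunk_base, chunk_ours, chunk_theirs = base[b:i], ours[o:oi], theirs[t:ti]
        if chunk_ours == chunk_base:
            merged += chunk_theirs  # only "theirs" changed (or nobody did)
        elif chunk_theirs == chunk_base or chunk_ours == chunk_theirs:
            merged += chunk_ours  # only "ours" changed, or both agree
        else:
            conflicts.append((chunk_base, chunk_ours, chunk_theirs))
        if i < len(base):
            merged.append(base[i])
        b, o, t = i + 1, oi + 1, ti + 1
    return merged, conflicts


BASE = ["foo = 1", "def main():", "    return foo + 2", "result = main()"]
# One developer renames foo to bar.
OURS = ["bar = 1", "def main():", "    return bar + 2", "result = main()"]
# Another developer adds new code that still uses foo.
THEIRS = BASE + ["doubled = foo * 2"]

merged, conflicts = merge3(BASE, OURS, THEIRS)
print("conflicts:", conflicts)

error = None
try:
    exec("\n".join(merged), {})
except NameError as err:
    error = err
print("running the merged code fails with:", error)
```

The merge reports no conflicts because the two edits touch disjoint lines, yet the merged program fails at run time: the appended line still refers to the old name.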
Although research in this area has been going on for about 30 years and has resulted in several proprietary commercial products, until recently it had not been turned into practical open source products. The bulk of FOSS solutions began to develop in the early 2010s and focused mainly on the Java language.
The most prominent open source implementation of that period, GumTree, was created by a researcher with an academic background and is written in Java. It has its own abstract internal representation that predates tree-sitter, and has backends based both on tree-sitter and on other tools for parsing source code into abstract representations. This system can only generate and visualize changes. Changes are generated as a text event log, and there is also an API that can be trivially called from any programming language that has bindings to Java. However, out of the box it is applicable neither to merging changes nor to viewing the diff files it generates (although loading diffs can likely be implemented via the API).
A younger and more practical implementation, difftastic, is written in Rust, based on tree-sitter, and focused on generating highlighted diffs in the console. This system is also aimed at visualizing diffs and does not aim to merge changes or apply patches at all.
The mergiraf project has appeared recently and is under active development. This tool, written in Rust (it takes up 21 MiB!), is also based on tree-sitter, which has already become as much of a standard for context-free grammar parsers in development tools as LLVM has for optimizing low-level representations of instructions. Unlike its competitors, mergiraf provides functions not for generating diffs, but for automatically resolving merge conflicts. Under the hood, mergiraf uses an implementation of the algorithm used in GumTree to generate patches, and an implementation of the algorithm used in spork, adapted to tree-sitter structures, to apply them.
Unfortunately, serialization of patches into files that can be applied later is not implemented (but could plausibly be implemented by parsing event logs generated by GumTree). Another promising approach may be to apply differences not through patches, but through the refactoring functionality of LSP servers, which can help detect conflicts at the project level. Visualization is supported only for conflicts.
Example of work:

The common ancestor "base.py" (indented with tabs, extra line at the beginning):

    foo = 1
    def main():
        print(foo + 2 + 3)

"a.py" (still indented with tabs, 2 extra lines at the beginning instead of one, the "icecream" library is used for debug printing, the class "baz" is added):

    from icecream import ic
    foo = 1
    def main():
        ic(foo + 2 + 3)
    class baz:
        def __init__(self):
            """baz"""

"b.py" (the variable "foo" is renamed to "bar", the file is processed with "black" after the changes, so it is indented with spaces and the extra lines are cut):

    bar = 1
    def main():
        print(bar + 2 + 3)

Calling

    ./mergiraf merge ./base.py ./a.py ./b.py -x a.py -y b.py -s base.py -o ./res.py

gives the following output:

    from icecream import ic
    bar = 1
    def main():
        ic(bar + 2 + 3)
    class baz:
        def __init__(self):
            """baz"""

In the result, the "icecream" library is used for debug printing, the variable "foo" is renamed to "bar", and the extra lines are cut. The indentation is a mixture of tabs and spaces, but in an allowed form.
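Whether a given mixture of tabs and spaces is in an "allowed form" can be checked with Python's built-in compile(), which parses the source without running it. The sketch below rests on an assumption: the exact whitespace of "res.py" is not reproduced here, so the MERGED sample simply supposes that the body of main() is indented with spaces and the class with tabs. The AMBIGUOUS sample and the helper name are likewise hypothetical.

```python
# Assumed layout of the merged file: spaces in main(), tabs in class baz.
MERGED = (
    "from icecream import ic\n"
    "bar = 1\n"
    "def main():\n"
    "    ic(bar + 2 + 3)\n"
    "class baz:\n"
    "\tdef __init__(self):\n"
    '\t\t"""baz"""\n'
)

# Counter-example: a tab and eight spaces mixed inside the same nested block.
AMBIGUOUS = "def main():\n\tif True:\n        pass\n"


def is_valid_python(source):
    """Return True if the source parses; the code is compiled, never executed."""
    try:
        compile(source, "<merged>", "exec")
        return True
    except SyntaxError:  # TabError and IndentationError are subclasses
        return False


print("merged sample parses:", is_valid_python(MERGED))
print("ambiguous sample parses:", is_valid_python(AMBIGUOUS))
```

Python accepts different indentation characters in separate blocks, but rejects indentation whose meaning would depend on the width of a tab.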
The drawback of the tool is immediately apparent. The document style is usually configured in ".editorconfig" files, and global style changes, such as switching from tabs to spaces and adopting the black style, as was done in "b.py", are usually accompanied by changes in ".editorconfig". Therefore, to apply such changes more correctly, the tool must have a concept of a global "default" style and be able to pull settings from ".editorconfig".
Source: opennet.ru
