Converting Complex SVN Repositories to Git - Part 3
Resolving Branches and Calculating Merges
The most important part of the repository conversion I did was resolving all of the branches and calculating the merge points. The majority of the rest of the process is easily automated with other tools.
The main part of this section was determining what had happened to all of the branches. One of the important differences between Git and SVN is that if a branch is deleted in Git, any commits that only existed in that branch are permanently lost. With SVN, the deleted branches still exist in the repository history. git-svn can't delete branches when importing them, because that would be losing information. So all of the branches that existed throughout the history of the repository will exist in a git-svn import and must be dealt with.
There are four possibilities for what happens to branches. The simplest are the branches that currently exist. These we obviously want to maintain as branches. Some branches are merged and then deleted. Once the merge is recorded, we can delete these branches in Git. Others don't have any real commits in them, consisting only of the commit that created the branch and commits updating it to the current trunk. These can just be deleted. The last are branches that existed and had real changes committed to them, but were then thrown away for various reasons. These can't be deleted without losing information, so I just filed them into a sub-directory 'trash'. Without knowing the full history of the project, I couldn't know how valuable these branches were.
At the end of this process, the only branches that should have existed were the current branches, and anything marked as trash. So I created the unresolved-branches script, noting all of the current branches in it. It simply reports the branches that I hadn't found a resolution for.
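A minimal sketch of what such a report might look like. This is not the original unresolved-branches script: `filter_unresolved` is a hypothetical helper, and the names in `CURRENT_BRANCHES` are placeholders for whatever branches are still live in your repository.

```shell
#!/bin/sh
# Sketch of an "unresolved branches" report (not the original script).
# CURRENT_BRANCHES is a hypothetical, hand-maintained list of live branches.
CURRENT_BRANCHES='master
some_current_branch'

# Read branch names on stdin; print those that are neither listed as
# current nor filed under trash/.
filter_unresolved() {
    while IFS= read -r branch; do
        case "$branch" in
            trash/*) continue ;;
        esac
        if printf '%s\n' "$CURRENT_BRANCHES" | grep -Fxq -- "$branch"; then
            continue
        fi
        printf '%s\n' "$branch"
    done
}

# Inside a repository this would be fed from the branch list:
#   git for-each-ref --format='%(refname:short)' refs/heads | filter_unresolved
```

An empty report means every branch has either been kept, deleted, or archived.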
Next, I used another part of nothingmuch's git-svn-abandon (60.delete-merged-branches) to delete all branches that had been merged into others:
# remove merged branches
git for-each-ref --format='%(refname)' refs/heads | while read branch; do
    git rev-parse --quiet --verify "$branch" > /dev/null || continue # make sure it still exists
    git symbolic-ref HEAD "$branch"
    git branch -d $( git branch --merged | grep -v '^\*' | grep -v 'master' )
done
git checkout master
This points HEAD at each branch in turn (without touching the working tree), finds all of the branches that have been merged into it, and deletes them.
This will only be effective once all of the proper merges have been recorded, though. git-svn records some of the merges during the import, using the SVN and SVK merge information, but sometimes this information is missing, so I had to find it myself. The first approach I used was matching commit messages. The format of the SVK commit messages was specific enough that I was able to extract information from them and match it to other commits (40.graft-merges-rev-matching). An example commit message:
r13301@evoc8 (orig r2696): dyfrgi | 2006-08-21 10:33:04 -0500
Change _cond_for_update_delete to handle more complicated queries through recursing on internal hashes.
Add a test which should succeed and fails without this change.
r13302@evoc8 (orig r2697): blblack | 2006-08-21 12:33:02 -0500
bugfix to Oracle columns_info_for
r13321@evoc8 (orig r2716): dwc | 2006-08-22 00:05:58 -0500
use ref instead of eval to check limit syntax (to avoid issues with Devel::StackTrace)
This was basically a merge of three commits from one branch into another. The piece of information I needed from a message like this was the latest revision number that had been merged in, in this case 2716. There were also cases where commit messages like this had been copied into other SVK commit messages, so the relevant information would be indented further. Those deeper entries could only be used if there weren't any 'orig' notations at a shallower indent level. That resulted in the first section of the script:
# Candidate merges: non-merge commits whose message contains an SVK 'orig' note
my @merges = `git log --all --no-merges --format='%H' --grep='(orig r'`;
chomp @merges;
open my $fh, '>>', "$GIT_DIR/info/grafts"
    or die "can't open $GIT_DIR/info/grafts: $!";
print { $fh } "# Revision matching\n";
for my $commit (@merges) {
    my $commit_data = `git cat-file commit $commit`;
    # Highest 'orig' revision among entries indented by one space
    my @matched = $commit_data =~ /^[ ]r\d+\@[^\n]+\(orig[ ]r(\d+)\)/msxg;
    my ($parent_rev) = sort { $b <=> $a } @matched;
    unless ($parent_rev) {
        # Fall back to entries indented by two spaces
        @matched = $commit_data =~ /^[ ][ ]r\d+\@[^\n]+\(orig[ ]r(\d+)\)/msxg;
        ($parent_rev) = sort { $b <=> $a } @matched;
        unless ($parent_rev) {
            # Then to entries indented by three spaces
            @matched = $commit_data =~ /^[ ][ ][ ]r\d+\@[^\n]+\(orig[ ]r(\d+)\)/msxg;
            ($parent_rev) = sort { $b <=> $a } @matched;
            unless ($parent_rev) {
                warn "odd commit $commit. merge but wrong format\n";
                next;
            }
        }
    }
Ugly and obviously copied and pasted, but I didn't put any real work into generalizing it. Once that revision was found, I needed to find the commit that corresponded to it. In the simple case, this is a single statement:
my $parent_commit = `git log --all --format='%H' -E --grep='git-svn-id: .*\@$parent_rev '`;
chomp $parent_commit;
I also had code to handle the case where the parent revision touched multiple branches, but this wasn't needed in the end. It only had an impact when my initial import was incomplete.
With the parent commit found, the merge commit could be added to the grafts file, recording both its current parent and adding the new one.
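The matching steps above can be condensed into a short shell sketch. This is not the original Perl: `highest_orig_rev` and `graft_line` are hypothetical helper names, and the sketch assumes SVK entries are indented by one to three spaces, as in the regexes above. A grafts line is the commit hash followed by its parent hashes, separated by spaces.

```shell
#!/bin/sh
# Read a commit message on stdin and print the highest "(orig rN)" revision
# found at the shallowest indent level (1, 2 or 3 spaces) that has any SVK
# entries. Returns non-zero if no entry is found at any level.
highest_orig_rev() {
    msg=$(cat)
    for indent in ' ' '  ' '   '; do
        rev=$(printf '%s\n' "$msg" |
            sed -n "s/^${indent}r[0-9][0-9]*@.*(orig r\([0-9][0-9]*\)).*/\1/p" |
            sort -n | tail -n 1)
        if [ -n "$rev" ]; then
            printf '%s\n' "$rev"
            return 0
        fi
    done
    return 1
}

# Print a grafts-file line for commit $1: its hash, its existing parents,
# then the extra parent $2 that turns it into a merge.
graft_line() {
    printf '%s %s\n' "$(git log -1 --format='%H %P' "$1")" "$2"
}

# Possible use inside a repository (names are placeholders):
#   rev=$(git cat-file commit "$commit" | highest_orig_rev) || exit 1
#   graft_line "$commit" "$parent_commit" >> "$(git rev-parse --git-dir)/info/grafts"
```

Looping over the indent levels replaces the three nested `unless` blocks while keeping the same rule: deeper entries are only considered when no shallower ones exist.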
This left a number of branches to be figured out manually. The first valuable piece of information was how each branch had been deleted from SVN. That information wasn't actually maintained by the import, so I wrote a script (find-branch-deletion) to find the revision in which each branch was deleted, by doing a binary search between the last revision in the branch and the latest revision.
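The binary search might be sketched as follows. This is an illustration, not the find-branch-deletion script itself: `branch_exists_at`, `find_deletion_rev` and the `BRANCH_URL` variable are assumed names, and the sketch assumes the branch was deleted once and never recreated within the searched range.

```shell
#!/bin/sh
# Hypothetical predicate: succeeds if the branch path exists at revision $1.
# BRANCH_URL is an assumed variable holding the branch's SVN URL; the
# URL@REV form asks SVN for the path as of that (peg) revision.
branch_exists_at() {
    svn ls "$BRANCH_URL@$1" >/dev/null 2>&1
}

# Binary search between a revision where the branch still exists ($1) and
# one where it is gone ($2). Prints the first revision in which the branch
# is missing, i.e. the revision that deleted it.
find_deletion_rev() {
    lo=$1
    hi=$2
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if branch_exists_at "$mid"; then
            lo=$mid
        else
            hi=$mid
        fi
    done
    printf '%s\n' "$hi"
}

# Possible use: find_deletion_rev <last branch revision> <latest revision>
```

Each probe halves the range, so even a repository with tens of thousands of revisions needs only a handful of `svn ls` calls per branch.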
For branches that I found had no valuable information, I simply deleted them (50.delete-empty-branches). For branches that weren't merged but were deleted, I renamed them with a 'trash/' prefix (55.archive-deleted-branches). For branches that were merged, I needed to find the merge point. This usually consisted of finding some changes unique to the branch, then using Git's pickaxe search to find where else they existed. Once I figured out how a branch had been merged, I recorded this in the 41.graft-merges-manual file. Since the Git commit hashes could easily change depending on the import process, I couldn't use them directly, so I instead used various pieces of the commit messages that I knew were unique. For example:
git --no-pager log --format="%H %P $(git rev-parse doc_mods)" --grep='DBIx-Class/0.08/trunk@5014'
This records the commit hash and parent commit hashes corresponding to revision 5014 in trunk, adding the commit hash for the doc_mods branch as a second parent.
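The archiving and pickaxe steps described earlier are plain Git commands. A small sketch, assuming it is run inside the imported repository; `archive_branch` and `find_string_changes` are hypothetical wrapper names, not the contents of the original scripts.

```shell
#!/bin/sh
# Move an unmerged, deleted branch under trash/ instead of dropping it,
# so its commits stay reachable.
archive_branch() {
    git branch -m "$1" "trash/$1"
}

# Pickaxe search: list commits on any branch that change how many times
# the given string occurs, i.e. commits that add or remove it.
find_string_changes() {
    git log --all --format='%H %s' -S"$1"
}

# Possible use (branch name and search string are placeholders):
#   archive_branch abandoned_idea
#   find_string_changes 'some code unique to the branch'
```

If the pickaxe search turns up a commit outside the branch that introduces the same text, that commit is a likely merge point.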
With this work done, the resolution of every branch had been determined and all of the merges were recorded. But many of the merges carried extraneous commit information that made the history hard to work with, so I went about cleaning them up, giving a better representation of the intentions of the merges instead of showing the particulars of the tools used.
Next: Cleaning up the merges