
Converting Complex SVN Repositories to Git - Part 4

Cleaning up and simplifying merges

After the previous steps, the git repository has an accurate history of what was done to the SVN repository. It is a direct translation though, and reflects the process and tools that were used more than the developers' intent. I proceeded to simplify how the merges were recorded to eliminate the convoluted mess that existed and make the history usable.

Two main classes of these problems existed. Some branches were merged one commit at a time, as that was one way of preserving the history in SVN. The other case was trunk being merged into a branch, and that branch then immediately being merged back into trunk. Some other issues match up with those two merge styles, and the same cleanup applies to them.

Here is a section of the history of the 'DBIx-Class-resultset' branch being merged, one commit at a time. Obviously not ideal, but you can mostly tell what is happening.

resultset-ugly.png

The merge of the 'DBIx-Class-current' branch was somewhat less straightforward.

current-ugly-end.png

...

current-ugly-middle.png

...

current-ugly-start.png

This smaller example of the 'resultset_cleanup' branch helps show how these can be dealt with.

resultset_cleanup-before.png

If we search for merges, starting from the earliest point in the repository history, we will find the commit noted as 4. We don't want to remove the record of this branch being merged, so initially we will leave it alone. The next merge we find however, 1, makes the first redundant. There is no need to maintain the first merge now that we know that this one exists. This process continues forward, eventually resulting in a single merge commit for the branch.

The code for this is in 43.graft-merges-simplified.

# get a list of all of the merge commits and their parent commits, space separated
my @merges = `git log --all --merges --pretty=format:'%H %P'`;
# to record all of the commits we intend to alter
my %altered;
# to record all of the merges we've seen so far
my %merges;
# start at the earliest point
for my $merge ( reverse @merges ) {
    chomp $merge;
    my ($commit, @parents) = split / /, $merge;
    $merges{$commit} = \@parents;
    # checking our merge [1]
    # this repo only contains merges with two parents
    my ( $left_parent, $right_parent ) = @parents;
    # check if our first parent [3] is a merge
    if ( my $left_grandparents = $merges{ $left_parent } ) {
        # find the grandparent [4] on the opposite side of the merge [2]
        my $right_grandparent
            = `git show -s --pretty='format:%P' $right_parent | cut -d' ' -f1`;
        chomp $right_grandparent;
        # if it is the same as the grandparent ([4] again) on the left side
        if ($right_grandparent eq $left_grandparents->[1]) {
            # we know we want to simplify this merge
            $altered{$commit}++;
            # switch the left parent (was [2]) to the left grandparent [5]
            $parents[0] = $left_grandparents->[0];
            # our left parent shouldn't be part of the history anymore,
            #   so we don't want to match it
            delete $merges{ $left_parent };
            # nor do we need to change it
            delete $altered{ $left_parent };
        }
    }
}

# many of these merges exist only because they were calculated in previous steps
# we don't want duplicate grafts, so we simply comment out the old ones.
my $regex = '(?:' . (join '|', keys %altered) . ')';
system "perl -i -pe's/^($regex )/# \$1/' $GIT_DIR/info/grafts";

# record the grafts
open my $fh, '>>', "$GIT_DIR/info/grafts";
print { $fh } "# Simplified merges\n";
for my $commit ( keys %altered ) {
    print { $fh } join(q{ }, $commit, @{ $merges{$commit} }) . "\n";
}
close $fh;

# we're modifying these merge commits.  whatever their commit
# messages were initially won't be accurate anymore.
# later, when we rewrite the commit messages, we want to just
# record these as branch merges.
# this just keeps track of which commits we want to simplify the
# commit messages in this manner.

use Data::Dumper;
$Data::Dumper::Indent = 1;
$Data::Dumper::Terse = 1;
$Data::Dumper::Sortkeys = 1;

# ($simplified_merges is defined elsewhere in the full script, not in this excerpt)
@altered{ keys %$simplified_merges } = values %$simplified_merges;
open $fh, '>', "$BASE_DIR/cache/simplified-merges.pl";
print { $fh } Dumper(\%altered);
close $fh;

The end result is obviously much nicer.

resultset_cleanup-after.png

It turned out that while these calculations caught the majority of the cases, a couple of complex, ugly cases were missed. The 'DBIx-Class-current' case was one of these. Rather than spend the extra effort to find an additional strategy to automatically detect such cases (if that was even possible), I manually figured out the best way to record the merges and put them in the 42.graft-merges-simplified-manual file.

Here we see a merge into a branch, followed immediately by a merge into trunk.

rsrc_in_storage-before.png

Another case that makes the history harder to follow. And while this example is relatively straightforward, cleaning up this type of merge helps in much uglier cases as well. The process for simplifying these merges may eliminate the commits our branches are referring to, but we don't have any need to maintain the branches that have been merged, so we delete them here (46.delete-merged-branches, the same script as 60.delete-merged-branches).

The 47.graft-merges-redundant script simplifies these. It follows a similar structure to the previous simplification script.

my @merges = `git log --all --merges --pretty=format:'%H %P'`;
my %altered;
my %merges;
for my $merge ( reverse @merges ) {
    chomp $merge;
    my ($commit, @parents) = split / /, $merge;
    my $f;
    # for each merge [1]
    $merges{$commit} = \@parents;
    # check each parent [2] in turn ([3] will be checked first, but fail
    #   a later test)
    PARENT: for my $p ( 0 .. 1 ) {
        my $parent = $parents[ $p ];
        # check against the other parent [3]
        my $check_ancest = $parents[ 1 - $p ];
        # we only care if it is a merge
        my $ancest = $merges{ $check_ancest } || next;

        ANCEST: for my $c ( 0 .. 1 ) {
            # if the first parent [3] is also a parent of the second parent [2]
            if ($parent eq $ancest->[ $c ]) {
                $altered{$commit}++;
                # we don't need the current second parent [2], so switch
                # it to that commit's other parent [4]
                $parents[1 - $p] = $ancest->[1 - $c];
                # don't match or change the commit we are clipping out
                delete $merges{ $check_ancest };
                delete $altered{ $check_ancest };
                # and skip to the next commit
                last PARENT;
            }
        }
    }
}

The redundant merge is now gone.

rsrc_in_storage-after.png

The history simplification is now basically complete. Instead of the convoluted mess that resulted from a direct translation of the SVN repository, it now has a mostly understandable history showing what the developers intended, rather than the exact method they used to do so. All that is left to do is clean up the commit messages and attribution, fix the tags, and a few other minor cleanups.

Next: Commit message and other final cleanups, and baking in grafts

Converting Complex SVN Repositories to Git - Part 3

Resolving Branches and Calculating Merges

The most important part of the repository conversion I did was resolving all of the branches and calculating the merge points. The majority of the rest of the process is easily automated with other tools.

The main part of this section was determining what had happened to all of the branches. One of the important differences between Git and SVN is that if a branch is deleted in Git, any commits that only existed in that branch are permanently lost. With SVN, the deleted branches still exist in the repository history. git-svn can't delete branches when importing them, because that would be losing information. So all of the branches that existed throughout the history of the repository will exist in a git-svn import and must be dealt with.

There are four possibilities for what happens to branches. The simplest are the branches that currently exist. These we obviously want to maintain as branches. Some branches are merged then deleted. Once the merge is recorded, we can delete these branches in Git. Others don't have any real commits in them, consisting just of commits creating the branches and then being updated to the current trunk. These can just be deleted. The last are branches that existed and had real changes committed to them, but were then thrown away for various reasons. These can't be deleted without losing information, so I just filed them into a sub-directory 'trash'. Without knowing the full history of the project I couldn't know how valuable these branches were.

At the end of this process, the only branches that should have existed were the current branches, and anything marked as trash. So I created the unresolved-branches script, noting all of the current branches in it. It simply reports the branches that I hadn't found a resolution for.
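As a rough illustration of that kind of report (not the actual unresolved-branches script), here is a minimal sketch. It assumes the branch names arrive one per line on standard input and that the known current branches are listed one per line in a file. The function name and the file are hypothetical.

```shell
# Hypothetical sketch: print branches that still need a resolution.
# $1: file listing the current branches, one per line (assumed format)
# stdin: all branch names, one per line
unresolved_branches() {
    # drop anything already archived under trash/, then drop exact
    # matches (-x) of the fixed strings (-F) listed in the file (-f)
    grep -v '^trash/' | grep -v -x -F -f "$1"
}
```

It could be fed from something like `git for-each-ref --format='%(refname:short)' refs/heads`. An empty result would mean every branch has been accounted for.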

Next, I used another part of nothingmuch's git-svn-abandon to delete all branches that had been merged into others (60.delete-merged-branches):

# remove merged branches
git for-each-ref --format='%(refname)' refs/heads | while read branch; do
    git rev-parse --quiet --verify "$branch" > /dev/null || continue # make sure it still exists
    git symbolic-ref HEAD "$branch"
    git branch -d $( git branch --merged | grep -v '^\*' | grep -v 'master' )
done

git checkout master

This checks out each branch in turn, finds all of the branches that have been merged into it, and deletes them.

This will only be effective after all of the proper merges have been recorded though. git-svn will record some of the merges during the import process. It uses the SVN and SVK merge information to do this, but sometimes this information isn't recorded, so I had to find the information myself. The first approach I used was matching commit messages. The format of the SVK commit messages was specific enough that I was able to extract information from them and match that to other commits (40.graft-merges-rev-matching). An example commit message:

 r13301@evoc8 (orig r2696):  dyfrgi | 2006-08-21 10:33:04 -0500
 Change _cond_for_update_delete to handle more complicated queries through recursing on internal hashes.
 Add a test which should succeed and fails without this change.
 r13302@evoc8 (orig r2697):  blblack | 2006-08-21 12:33:02 -0500
 bugfix to Oracle columns_info_for
 r13321@evoc8 (orig r2716):  dwc | 2006-08-22 00:05:58 -0500
 use ref instead of eval to check limit syntax (to avoid issues with Devel::StackTrace)

This was basically merging three commits from one branch into another. The piece of information I needed from a message like this was the latest revision number that had been merged in, in this case 2716. There were also cases where commit messages like this were copied into other SVK commit messages, so the relevant information would be indented. Those indented entries could only be used if there weren't any unindented 'orig' notations. That resulted in the first section of the script:

my @merges = `git log --all --no-merges --format='%H' --grep='(orig r'`;
chomp @merges;

open my $fh, '>>', "$GIT_DIR/info/grafts";
print { $fh } "# Revision matching\n";
for my $commit (@merges) {
    my $commit_data = `git cat-file commit $commit`;
    my @matched = $commit_data =~ /^[ ]r\d+\@[^\n]+\(orig[ ]r(\d+)\)/msxg;
    my ($parent_rev) = sort { $b <=> $a } @matched;
    unless ($parent_rev) {
        @matched = $commit_data =~ /^[ ][ ]r\d+\@[^\n]+\(orig[ ]r(\d+)\)/msxg;
        ($parent_rev) = sort { $b <=> $a } @matched;
        unless ($parent_rev) {
            @matched = $commit_data =~ /^[ ][ ][ ]r\d+\@[^\n]+\(orig[ ]r(\d+)\)/msxg;
            ($parent_rev) = sort { $b <=> $a } @matched;
            unless ($parent_rev) {
                warn "odd commit $commit.  merge but wrong format\n";
                next;
            }
        }
    }

Ugly and copied and pasted obviously, but no real work has been put in to generalize it. Once that revision was found, I needed to find the commit that corresponded to it. In the simple case, this is a single statement:

my $parent_commit = `git log --all --format='%H' -E --grep='git-svn-id: .*\@$parent_rev '`;
chomp $parent_commit;

I also had code to attempt to resolve the case where the parent revision touched multiple branches, but this wasn't needed in the end. It only had an impact when my initial import was incomplete.

With the parent commit found, the merge commit could be added to the grafts file, recording both its current parent and adding the new one.
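The revision extraction described above can be written without repeating the match for each indentation level. The following is only a sketch of that idea in shell, not the script's actual code. The function name is made up, and it assumes the commit message is supplied on standard input with SVK entries indented by one to three spaces.

```shell
# Hypothetical helper: print the highest "(orig rN)" revision found in a
# commit message read from stdin, preferring the least-indented entries.
highest_orig_rev() {
    msg=$(cat)
    # try one, two, then three spaces of indentation
    for indent in ' ' '  ' '   '; do
        rev=$(printf '%s\n' "$msg" |
            sed -n "s/^${indent}r[0-9][0-9]*@.*(orig r\([0-9][0-9]*\)).*/\1/p" |
            sort -n | tail -n 1)
        if [ -n "$rev" ]; then
            printf '%s\n' "$rev"
            return 0
        fi
    done
    return 1   # no recognizable SVK merge line at any level
}
```

More deeply indented entries are only consulted when no shallower ones exist, which mirrors the nested `unless` blocks.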

This left a number of branches to be figured out manually. The first valuable piece of information was how each branch was deleted from SVN. That information wasn't actually maintained by the import, so I wrote a script (find-branch-deletion) to find the revision in which each branch was deleted by doing a binary search between the last revision in the branch and the latest revision.
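The binary search itself is simple enough to sketch. This is an illustration under stated assumptions, not the find-branch-deletion script. It assumes the branch exists at the lower revision, is gone at the upper one, and was never recreated in between. The existence check is passed in as a command name, and both names used here are hypothetical.

```shell
# Hypothetical sketch: find the first revision at which a branch no longer
# exists, by bisecting between a revision where it is known to exist ($2)
# and one where it is known to be gone ($3).
# $1 names a command that takes a revision number and exits 0 while the
# branch still exists at that revision.
find_deletion_rev() {
    exists=$1 lo=$2 hi=$3
    # invariant: the branch exists at $lo and is gone at $hi
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if "$exists" "$mid"; then
            lo=$mid
        else
            hi=$mid
        fi
    done
    printf '%s\n' "$hi"
}
```

The check could, for instance, wrap an `svn ls` of the branch path at a given revision. Each probe halves the range, so even thousands of revisions need only a dozen or so lookups.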

For branches that I found that had no valuable information, I simply deleted them (50.delete-empty-branches). For branches that weren't merged but were deleted, I renamed them prefixed with 'trash/' (55.archive-deleted-branches). For branches that were merged, I needed to find the merge point. This usually consisted of finding some changes unique to the branch, then doing a search using Git's pickaxe search to find where else it existed. Once I figured out how it had been merged, I recorded this in the 41.graft-merges-manual file. Since the git commit hashes could easily change depending on the import process, I couldn't use them directly, so instead used various pieces of the commit messages that I knew were unique. For example:

git --no-pager log --format="%H %P $(git rev-parse doc_mods)" --grep='DBIx-Class/0.08/trunk@5014'

This records the commit hash and parent commit hashes corresponding to revision 5014 in trunk, adding the commit hash for the doc_mods branch as a second parent.
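Matching on commit messages only works if the pattern really is unique, so a small guard can help before appending to the grafts file. The function below is a hypothetical sketch, not part of the original scripts. It assumes a candidate graft arrives on standard input and accepts it only if it is a single line made of a commit hash followed by at least one parent hash.

```shell
# Hypothetical guard: pass a graft candidate through only if stdin holds
# exactly one line of full 40-character hashes (commit plus parents).
single_graft() {
    lines=$(cat)
    # zero matches or several matches both mean the pattern was not unique
    [ "$(printf '%s\n' "$lines" | grep -c '')" -eq 1 ] || return 1
    # a commit hash followed by one or more parent hashes, nothing else
    printf '%s\n' "$lines" | grep -E -x '[0-9a-f]{40}( [0-9a-f]{40})+'
}
```

The output of a `git log --format=... --grep=...` command like the one above could be piped through it, appending to the grafts file only when it succeeds.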

With this work done, the resolution of every branch had been determined and all of the merges were recorded. But many of the merges carried extraneous commit information and made the history hard to work with, so I went about cleaning them up, giving a better representation of the intentions of the merges instead of showing the particulars of the tools used.

Next: Cleaning up the merges

Converting Complex SVN Repositories to Git - Part 2

Initial Import into Git

Creating a mirror

SVN is slow, and git-svn is slower. The amount of network traffic needed by SVN makes everything slow, especially since git-svn needs to walk the history multiple times. Even if I made no mistakes and only had to run the import once, having a local copy of the repository makes the process much faster. svnsync will do this for us:

# create repository
svnadmin create svn-mirror
# svn won't let us change revision properties without a hook in place
echo '#!/bin/sh' > svn-mirror/hooks/pre-revprop-change && chmod +x svn-mirror/hooks/pre-revprop-change
# do the actual sync
svnsync init file://$PWD/svn-mirror http://dev.catalyst.perl.org/repos/bast/
svnsync sync file://$PWD/svn-mirror

Importing with git-svn

Next, we have to import it with git-svn:

mkdir DBIx-Class
cd DBIx-Class

05.import:

git init

git svn init \
    -TDBIx-Class/0.08/trunk \
    -ttags/DBIx-Class \
    -tDBIx-Class/tags \
    -bbranches/DBIx-Class \
    -bDBIx-Class/0.08/branches \
    -bDBIx-Class/0.08/branches/_abandoned_but_possibly_useful \
    -bbranches \
    --prefix=svn/ \
    file://$BASE_DIR/svn-mirror

git config svn.authorsfile $BASE_DIR/authors
git svn fetch --authors-prog=$BASE_DIR/author-generate

A number of parts go together for this. The most important part is the locations of all of the branches. The current branch locations (DBIx-Class/0.08/branches and .../_abandoned_but_possibly_useful) were simple. And trunk (DBIx-Class/0.08/trunk) would be tracked back past when it had been moved. But past branches wouldn't be found. For this, I manually searched through the repository for past branches. Another option would be searching the entire history for any files ending with the path 'lib/DBIx/Class.pm' and assuming each one marks a branch. With the configuration given, branches also get imported for other projects that kept their branches in the same directories. These can just be deleted after the fact.

The second part is defining an authors file. This lists the mappings between SVN user names and a name and email as used by Git. We don't have this information yet, so the author-generate script is used, which generates a fake name and records it. That recorded list of names will later be used to re-write the authors using the correct information.
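For illustration, a placeholder generator along these lines might look like the sketch below. It is not the author-generate script itself. git-svn runs an authors-prog with the SVN user name as its first argument and expects a single "Name <email>" line back. The function name, the `AUTHORS_SEEN` variable, and the `example.invalid` address are assumptions made for this example.

```shell
# Hypothetical stand-in for an authors-prog: given an SVN user name,
# print a placeholder "Name <email>" and remember the mapping.
# AUTHORS_SEEN is an assumed variable naming the file that records names.
generate_author() {
    user=$1
    entry="$user <$user@example.invalid>"
    # record each generated mapping once, in authors-file format
    grep -q -x -F -e "$user = $entry" "$AUTHORS_SEEN" 2>/dev/null ||
        printf '%s = %s\n' "$user" "$entry" >> "$AUTHORS_SEEN"
    printf '%s\n' "$entry"
}
```

Wrapped in an executable script that calls `generate_author "$1"`, this would leave behind a list of every SVN user name encountered, ready to be filled in with real names and addresses.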

The 'git svn fetch' operation takes many hours to run, but as long as the branch locations are correct, this only needs to be done once. Running 'svnsync sync' and 'git svn fetch' again will update the git repository with any later changes to the SVN repo (10.update). All of the steps past this are much faster, but are also destructive. At this stage I just created a backup of the Git repository to be restored as I made corrections to the later scripts.

Initial Cleanup

The next step is to remove some of the extra branches created during the import. There are some branches that existed in the same branch root but weren't actually part of the DBIx::Class project. The 20.delete-non-branches script removes these by searching through each branch and deleting any that don't contain the file lib/DBIx/Class.pm.

There are also some duplicate branches created when they were found in two different branch roots. These are labeled with an @ and revision number at the end. I initially made a script to delete all of these duplicate branches if they were actually duplicates, and not different branches that had been given the same name (collapse-past-branches). I found that they were all duplicates though, so I ended up just deleting all of the branches marked with @ symbols (25.delete-past-branches).
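Picking out those branches is a matter of matching the trailing "@" and revision number. The sketch below is hypothetical and is not either of the scripts named above. It assumes ref names arrive one per line on standard input, and prints each marked ref next to the name it duplicates so that the pair can be compared before anything is deleted.

```shell
# Hypothetical filter: for each ref name on stdin that ends in
# "@<revision>", print the ref followed by its base name.
past_branch_pairs() {
    sed -n 's/^\(.*\)@[0-9][0-9]*$/& \1/p'
}
```

Names containing an "@" that is not followed by a revision number are left alone.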

The last step in the initial rough import was to create standard git branches and tags for all of the imported branches. The 30.fix-refs script does this work. Most of it is taken from nothingmuch's git-svn-abandon project, which does a similar task to my scripts, but without as much cleanup. For branches, all that is done is to create normal local branches rather than the svn/ prefixed remote branches created by the import. Because SVN doesn't differentiate between branches and tags, git-svn doesn't create real tags when importing. So the fix-refs script searches backward from the tag to find what commit it refers to and tags that. Due to the reorganization that had been done to the SVN repository this wasn't entirely adequate, so I had to manually fix some of the tags later.

The repository is now starting to resemble a real Git repository.

Next: Calculating the many merge points to record as grafts.

Converting Complex SVN Repositories to Git - Part 1

In May and June, I worked on converting the DBIx::Class repository from SVN to Git. I’ve had a number of people ask me to describe the process and show the code I used to do so. I had been somewhat busy with various projects, including working on the web client for The Lacuna Expanse, but I’ve finally had some time to write up a bit about it. The code I used to make the conversion is on my github account, although not in a form meant for reuse.

Having previously done the git conversion for WebGUI, JT Smith mentioned to me that the DBIx::Class developers wanted to move to git. The somewhat convoluted history of the DBIx::Class repository and the extensive use of SVK made it a bit more complex than the existing tools could handle automatically. I ended up using git-svn to do the import of the raw data, a set of scripts I wrote or modified from others, and a bit of manual digging to create a pretty accurate history of the project.

git-svn

git-svn is a tool included with git allowing you to work with SVN repositories using Git. While its bidirectional capabilities aren’t useful when just doing a conversion, it does a serviceable job importing the history into Git. The main problem areas are branch locations and merge tracking. For many projects, branch locations won’t present a problem. For DBIx::Class though, the repository layout had been changed a few times. This meant I had to search through the project history to find the old locations, but this was relatively easy to do. The larger problem, merge tracking, isn’t as easy to resolve. Newer versions of SVN will record extra information about merges, as will SVK. But this was an older repository, and in many cases the recorded merge information wasn’t adequate. Additional work was needed to track down the merges, or to smooth over the recorded ones.

Grafts and filter-branch

History in git is tracked by each commit listing its parent commits. Merges are represented by commits with multiple parents. Git’s storage model prevents you from altering a commit directly without changing all of its descendants, but you can record an alternate set of parent commits using grafts. Grafts aren’t part of the normal repository data, and aren’t suitable for redistribution. They can be ‘baked in’ by the filter-branch command, allowing you to redistribute the result, as well as make any other changes to a commit.

Tracking down the branch locations, importing everything into git, and cleaning up commit messages were all relatively straightforward. Most of my effort was spent on creating all of the needed grafts. This involved creating scripts to automatically find merges missed by git-svn, making tools to find and fix merges that were recorded in convoluted ways, as well as manually tracking down what happened to almost every branch in the repository history. Some of this may not have strictly been needed, but the goal was to create a repository where you didn’t have to think about the fact that it previously had existed in SVN. I think the result is about as good as can be done at that.

Next: Initial import from SVN to Git


About Graham Knop
