Carp::Always::EvenObjects - Include stack traces with all errors/warnings

Edit: Since writing this, I've decided on a proper name. Devel::Confess is the name for this module going forward. Carp::Always::EvenObjects exists now only as a wrapper around Devel::Confess.

Carp::Always is a very useful module that will force all errors and warnings to include a full stack trace to help with debugging. However, it has some limitations. If an exception object is thrown rather than a string, the stack trace can't simply be appended to it. die, Carp, and Carp::Always just pass the object through unmodified. Some exception systems include stack traces in their objects, but for those that don't, this hurts the ability to debug. As more libraries use exception objects, this becomes more problematic (e.g. autodie).

With that in mind, I've written Carp::Always::EvenObjects. It works similarly to Carp::Always, but will also attach stack traces to objects. This is done by re-blessing the object into a subclass that knows how to include the stack trace when the object is stringified. Most exception systems the same as the originals.

It will also attach stack traces to plain non-object refs, although their use as exceptions is rather rare.

Normally you would use the module on the command line, as:

perl -MCarp::Always::EvenObjects script.pl

As a bonus, since the name is rather long, the dist includes the module Devel::Confess as an alias, allowing you to use the shorter:

perl -MDevel::Confess script.pl

or even

perl -d:Confess script.pl

Using system or exec safely on Windows

Passing a list of arguments to another program on Windows in perl is much more complicated than it should be. There are several different issues that combine that lead to this.

(mostly copied from a post I made on PerlMonks)

First is that argument lists are always passed as a single string in Windows, as opposed to arrays like on other systems. This is less of a problem than it appears, because 95% of programs use the same rules for parsing that string into an array. Roughly speaking, the rules are that arguments can be quoted with double quotes, and backslashes can escape any character.

The second issue is that cmd.exe uses different quoting rules than the normal parsing routine. It uses a caret as the escape character instead of backslash.

The result of this is that you can't create a string that will be treated the same for both of these cases. This becomes a larger problem, because perl switches between using cmd.exe vs calling directly based on if they have shell meta-characters in them. And that involves a third, different set of quoting rules. There isn't any good way to check which way perl is going to treat a command without reimplementing the code to detect them that exists inside perl. So here is a routine that will quote arguments correctly to use with system on Windows:

sub quote_list {
    my (@args) = @_;

    my $args = join ' ', map { quote_literal($_) } @args;

    if (_has_shell_metachars($args)) {
        # cmd.exe treats quotes differently from standard
        # argument parsing. just escape everything using ^.
        $args =~ s/([()%!^"<>&|])/^$1/g;
    }
    return $args;
}

sub quote_literal {
    my ($text) = @_;

    # basic argument quoting.  uses backslashes and quotes to escape
    # everything.
    if ($text ne '' && $text !~ /[ \t\n\v"]/) {
        # no quoting needed
    }
    else {
        my @text = split '', $text;
        $text = q{"};
        for (my $i = 0; ; $i++) {
            my $bs_count = 0;
            while ( $i < @text && $text[$i] eq "\\" ) {
                $i++;
                $bs_count++;
            }
            if ($i > $#text) {
                $text .= "\\" x ($bs_count * 2);
                last;
            }
            elsif ($text[$i] eq q{"}) {
                $text .= "\\" x ($bs_count * 2 + 1);
            }
            else {
                $text .= "\\" x $bs_count;
            }
            $text .= $text[$i];
        }
        $text .= q{"};
    }

    return $text;
}

# direct port of code from win32.c
sub _has_shell_metachars {
    my $string = shift;
    my $inquote = 0;
    my $quote = '';

    my @string = split '', $string;
    for my $char (@string) {
        if ($char eq q{%}) {
            return 1;
        }
        elsif ($char eq q{'} || $char eq q{"}) {
            if ($inquote) {
                if ($char eq $quote) {
                    $inquote = 0;
                    $quote = '';
                }
            }
            else {
                $quote = $char;
                $inquote++;
            }
        }
        elsif ($char eq q{<} || $char eq q{>} || $char eq q{|}) {
            if ( ! $inquote) {
                return 1;
            }
        }
    }
    return;
}

The information about the quoting rules on Windows is from the article Everyone quotes command line arguments the wrong way. I attempted to use this to improve ExtUtils::MakeMaker's quoting, but that also has to deal with Makefile quoting rules. Additionally, other parts of the code (or at least tests) assume that you can generate a string and have it work both when passed to system and when placed in a Makefile. I almost never use perl on Windows, so I eventually gave up on the effort.

Converting Complex SVN Repositories to Git - Part 4

Cleaning up and simplifying merges

After the previous steps, the git repository has an accurate history of what was done to the SVN repository. It is a direct translation though, and shows more the process and tools that were used, rather than developer intent. I proceeded to simplify how the merges were recorded to eliminate the convoluted mess that existed and make the history usable.

Two main classes of these problems existed. There were branches were merged one commit at a time, as that was one way of preserving the history in SVN. The other case was trunk being merged into a branch, and immediately merging that back into trunk. Some other issues match up with those two merge styles and the same cleanup will apply to them.

Here is a section of the history of the 'DBIx-Class-resultset' branch being merged, one commit at a time. Obviously not ideal, but you can mostly tell what is happening.

resultset-ugly.png

The merge of the 'DBIx-Class-current' branch was somewhat less straightforward. current-ugly-end.png

...

current-ugly-middle.png

...

current-ugly-start.png

This smaller example of the 'resultset_cleanup' branch helps show how these can be dealt with.

resultset_cleanup-before.png

If we search for merges, starting from the earliest point in the repository history, we will find the commit noted as 4. We don't want to remove the record of this branch being merged, so initially we will leave it alone. The next merge we find however, 1, makes the first redundant. There is no need to maintain the first merge now that we know that this one exists. This process continues forward, eventually resulting in a single merge commit for the branch.

The code for this is in 43.graft-merges-simplified.

# get a list of all of the merge commits and their parent commits, space separated
my @merges = `git log --all --merges --pretty=format:'%H %P'`;
# to record all of the commits we intend to alter
my %altered;
# to record all of the merges we've seen so far
my %merges;
# start at the earliest point
for my $merge ( reverse @merges ) {
    chomp $merge;
    my ($commit, @parents) = split / /, $merge;
    $merges{$commit} = \@parents;
    # checking our merge [1]
    # this repo only contains merges with two parents
    my ( $left_parent, $right_parent ) = @parents;
    # check if our first parent [3] is a merge
    if ( my $left_grandparents = $merges{ $left_parent } ) {
        # find the grandparent [4] on the opposite side of the merge [2]
        my $right_grandparent
            = `git show -s --pretty='format:%P' $right_parent | cut -d' ' -f1`;
        chomp $right_grandparent;
        # if it is the same as the grandparent ([4] again) on the left side
        if ($right_grandparent eq $left_grandparents->[1]) {
            # we know we want to simplify this merge
            $altered{$commit}++;
            # switch the left parent (was [2]) to the left grandparent [5]
            $parents[0] = $left_grandparents->[0];
            # our left parent shouldn't be part of the history anymore,
            #   so we don't want to match it
            delete $merges{ $left_parent };
            # nor do we need to change it
            delete $altered{ $left_parent };
        }
    }
}

# many of these merges exist only because they were calculated in previous steps
# we don't want duplicate grafts, so we simple comment out the old ones.
my $regex = '(?:' . (join '|', keys %altered) . ')';
system "perl -i -pe's/^($regex )/# \$1/' $GIT_DIR/info/grafts";

# record the grafts
open my $fh, '>>', "$GIT_DIR/info/grafts";
print { $fh } "# Simplified merges\n";
for my $commit ( keys %altered ) {
    print { $fh } join(q{ }, $commit, @{ $merges{$commit} }) . "\n";
}
close $fh;

# we're modifying these merge commits.  whatever their commit
# messages were initially won't be accurate anymore.
# later, when we rewrite the commit messages, we want to just
# record these as branch merges.
# this just keeps track of which commits we want to simplify the
# commit messages in this manner.

use Data::Dumper;
$Data::Dumper::Indent = 1;
$Data::Dumper::Terse = 1;
$Data::Dumper::Sortkeys = 1;

@altered{ keys %$simplified_merges } = values %$simplified_merges;
open $fh, '>', "$BASE_DIR/cache/simplified-merges.pl";
print { $fh } Dumper(\%altered);
close $fh;

The end result is obviously much nicer.

resultset_cleanup-after.png

It turned out that while these calculations caught the majority of the cases, a couple complex, ugly cases were missed. The 'DBIx-Class-current' case was one of these. Rather than spend the extra effort to find an additional strategy to automatically detect such cases (if it was even possible), I manually figured out the best way to record the merges and put them in the 42.graft-merges-simplified-manual file.

Here we see a merge into a branch, followed immediately by a merge into trunk.

rsrc_in_storage-before.png

Another case that makes the history harder to follow. And while this example is relatively straightforward, cleaning up this type of merge helps in much uglier cases as well. The process for simplifying these merges may eliminate the commits our branches are referring to, but we don't have any need to maintain the branches that have been merged, so we delete them here (46.delete-merged-branches, the same script as 60.delete-merged-branches).

The 47.graft-merges-redundant script simplifies these. It follows a similar structure to the previous simplification script.

my @merges = `git log --all --merges --pretty=format:'%H %P'`;
my %altered;
my %merges;
for my $merge ( reverse @merges ) {
    chomp $merge;
    my ($commit, @parents) = split / /, $merge;
    my $f;
    # for each merge [1]
    $merges{$commit} = \@parents;
    # check each parent [2] in turn ([3] will be checked first, but fail
    #   a later test)
    PARENT: for my $p ( 0 .. 1 ) {
        my $parent = $parents[ $p ];
        # check against the other parent [3]
        my $check_ancest = $parents[ 1 - $p ];
        # we only care if it is merge
        my $ancest = $merges{ $check_ancest } || next;

        ANCEST: for my $c ( 0 .. 1 ) {
            # if the first parent [3] is also a parent of the second parent [2]
            if ($parent eq $ancest->[ $c ]) {
                $altered{$commit}++;
                # we don't need the current second parent [2], so switch
                # it to that commit's other parent [4]
                $parents[1 - $p] = $ancest->[1 - $c];
                # don't match or change the commit we are clipping out
                delete $merges{ $check_ancest };
                delete $altered{ $check_ancest };
                # and skip to the next commit
                last PARENT;
            }
        }
    }
}

The redundant merge is now gone.

rsrc_in_storage-after.png

The history simplification is now basically complete. Instead of the convoluted mess that resulted from a direct translation of the SVN repository, it now has a mostly understandable history showing what the developers intended, rather that the exact method they used to do so. All that is left to do is clean up the commit messages and attribution, fix the tags, and a few other minor cleanups.

Next: Commit message and other final cleanups, and baking in grafts

Converting Complex SVN Repositories to Git - Part 3

Resolving Branches and Calculating Merges

The most important part of the repository conversion I did was resolving all of the branches and calculating the merge points. The majority of the rest of the process is easily automated with other tools.

The main part of this section was determining what had happened to all of the branches. One of the important differences between Git and SVN is that if a branch is deleted in Git, any commits that only existed in that branch are permanently lost. With SVN, the deleted branches still exist in the repository history. git-svn can't delete branches when importing them, because that would be losing information. So all of the branches that existed throughout the history of the repository will exist in a git-svn import and must be dealt with.

There are four possibilities for what happens to branches. The simplest are the branches that currently exist. These we obviously want to maintain as branches. Some branches are merged then deleted. Once the merge is recorded, we can delete these branches in Git. Others don't have any real commits in them, consisting just of commits creating the branches and then being updated to the current trunk. These can just be deleted. The last are branches that existed and had real changes committed to them, but were then thrown away for various reasons. These can't be deleted without losing information, so I just filed them into a sub-directory 'trash'. Without knowing the full history of the project I couldn't know how valuable these branches were.

At the end of this process, the only branches that should have existed were the current branches, and anything marked as trash. So I created the unresolved-branches script, noting all of the current branches in it. It simply reports the branches that I hadn't found a resolution for.

Next, I used another part of nothingmuch's git-svn-abandon to delete all branches that had been merged into others 60.delete-merged-branches:

# remove merged branches
git for-each-ref --format='%(refname)' refs/heads | while read branch; do
    git rev-parse --quiet --verify "$branch" > /dev/null || continue # make sure it still exists
    git symbolic-ref HEAD "$branch"
    git branch -d $( git branch --merged | grep -v '^\*' | grep -v 'master' )
done

git checkout master

This checks out each branch in turn, finds all of the branches that have been merged into it, and deletes them.

This will only be effective after all of the proper merges have been recorded though. git-svn will record some of the merges during the import process. It uses the SVN and SVK merge information to do this, but sometimes this information isn't recorded, so I had to find the information myself. The first process I used to do this was by matching commit messages. The format of the SVK commit messages was specific enough I was able to extract information from them and match that to other commits 40.graft-merges-rev-matching. As example commit message:

 r13301@evoc8 (orig r2696):  dyfrgi | 2006-08-21 10:33:04 -0500
 Change _cond_for_update_delete to handle more complicated queries through recursing on internal hashes.
 Add a test which should succeed and fails without this change.
 r13302@evoc8 (orig r2697):  blblack | 2006-08-21 12:33:02 -0500
 bugfix to Oracle columns_info_for
 r13321@evoc8 (orig r2716):  dwc | 2006-08-22 00:05:58 -0500
 use ref instead of eval to check limit syntax (to avoid issues with Devel::StackTrace)

This basically was merging three commits from one branch into another. The piece of information I needed from a message like this was the latest revision number that had been merged in, in this case 2716. There were also cases where commit messages like this were copied into other SVK commit messages, so the relevant information would be idented. That could only be done if there weren't any unindented 'orig' notations. That resulted in the first section of the script:

my @merges = `git log --all --no-merges --format='%H' --grep='(orig r'`;
chomp @merges;

open my $fh, '>>', "$GIT_DIR/info/grafts";
print { $fh } "# Revision matching\n";
for my $commit (@merges) {
    my $commit_data = `git cat-file commit $commit`;
    my @matched = $commit_data =~ /^[ ]r\d+\@[^\n]+\(orig[ ]r(\d+)\)/msxg;
    my ($parent_rev) = sort { $b <=> $a } @matched;
    unless ($parent_rev) {
        @matched = $commit_data =~ /^[ ][ ]r\d+\@[^\n]+\(orig[ ]r(\d+)\)/msxg;
        ($parent_rev) = sort { $b <=> $a } @matched;
        unless ($parent_rev) {
            @matched = $commit_data =~ /^[ ][ ][ ]r\d+\@[^\n]+\(orig[ ]r(\d+)\)/msxg;
            ($parent_rev) = sort { $b <=> $a } @matched;
            unless ($parent_rev) {
                warn "odd commit $commit.  merge but wrong format\n";
                next;
            }
        }
    }

Ugly and copied and pasted obviously, but no real work has been put in to generalize it. Once that revision was found, I needed to find the commit that corresponded to it. In the simple case, this is a single statement:

my $parent_commit = `git log --all --format='%H' -E --grep='git-svn-id: .*\@$parent_rev '`;
chomp $parent_commit;

I also had code to attempt to resolve this case for if the parent revision touched multiple branches, but this wasn't needed in the end. It only had an impact when my initial import was incomplete.

With the parent commit found, the merge commit could be added to the grafts file, recording both its current parent and adding the new one.

This left a number of branches to be manually figured out. The first valuable piece of information was to find how it was deleted from SVN. That information wasn't actually maintained by the import, so I wrote a script (find-branch-deletion) to find the revision each branch was deleted by doing a binary search between the last revision in the branch and the latest revision.

For branches that I found that had no valuable information, I simply deleted them (50.delete-empty-branches). For branches that weren't merged but were deleted, I renamed them prefixed with 'trash/' (55.archive-deleted-branches). For branches that were merged, I needed to find the merge point. This usually consisted of finding some changes unique to the branch, then doing a search using Git's pickaxe search to find where else it existed. Once I figured out how it had been merged, I recorded this in the 41.graft-merges-manual file. Since the git commit hashes could easily change depending on the import process, I couldn't use them directly, so instead used various pieces of the commit messages that I knew were unique. For example:

git --no-pager log --format="%H %P $(git rev-parse doc_mods)" --grep='DBIx-Class/0.08/trunk@5014'

This records the commit hash and parent commit hashes corresponding to revision 5014 in trunk, adding the commit hash for the doc_mods branch as a second parent.

With this work done, the resolution of every branch had been determined and all of the merges were recorded. But many of the merges had extraneous commits information and made the history hard to work with, so I went about cleaning up them up, giving a better representation of the intentions of the merges instead of showing the particulars of the tools used.

Next: Cleaning up the merges

Converting Complex SVN Repositories to Git - Part 2

Initial Import into Git

Creating a mirror

SVN is slow, and git-svn is slower. The amount of network traffic needed by SVN makes everything slow, especially since git-svn needs to walk the history multiple times. Even if I made no mistakes and only had to run the import once, having a local copy of the repository makes the process much faster. svnsync will do this for us:

# create repository
svnadmin create svn-mirror
# svn won't let us change revision properties without a hook in place
echo '#!/bin/sh' > svn-mirror/hooks/pre-revprop-change && chmod +x svn-mirror/hooks/pre-revprop-change
# do the actual sync
svnsync init file://$PWD/svn-mirror http://dev.catalyst.perl.org/repos/bast/
svnsync sync file://$PWD/svn-mirror

Importing with git-svn

Next, we have to import it with git-svn:

mkdir DBIx-Class
cd DBIx-Class

05.import:

git init

git svn init \
    -TDBIx-Class/0.08/trunk \
    -ttags/DBIx-Class \
    -tDBIx-Class/tags \
    -bbranches/DBIx-Class \
    -bDBIx-Class/0.08/branches \
    -bDBIx-Class/0.08/branches/_abandoned_but_possibly_useful \
    -bbranches \
    --prefix=svn/ \
    file://$BASE_DIR/svn-mirror

git config svn.authorsfile $BASE_DIR/authors
git svn fetch --authors-prog=$BASE_DIR/author-generate

A number of parts go together for this. The most important part is the locations of all of the branches. The current branch locations (DBIx-Class/0.08/branches and .../abandonedbutpossiblyuseful) were simple. And trunk (DBIx-Class/0.08/trunk) would be tracked back past when it had been moved. But past branches wouldn't be found. For this, I manually searched through the repository for past branches. Another option for that would be searching the entire history for and files ending with the path 'lib/DBIx/Class.pm' and assuming that is a branch. With the configuration given, branches also get imported for other projects that kept their branches in the same directories. These can just be deleted after the fact.

The second part is defining an authors file. This lists the mappings between SVN user names and a name and email as used by Git. We don't have this information yet, so the author-generate script is used, which generates a fake name and records it. That recorded list of names will later be used to re-write the authors using the correct information.

The 'git svn fetch' operation takes many hours to run, but as long as the branch locations are correct, this only needs to be done once. Running 'svnsync sync' and 'git svn fetch' again will update the git repository with any later changes to the SVN repo 10.update. All of the steps past this are much faster, but are also destructive. At this stage I just created a backup of the Git repository to be restored as I made corrections to the later scripts.

Initial Cleanup

The next step is to remove some of the extra branches created during the import. There are some branches that existed in the same branch root but weren't actually part of the DBIx::Class project. The 20.delete-non-branches script removes these by searching through each branch and deleting any that don't contain the file lib/DBIx/Class.pm.

There are also some duplicate branches created when they were found in two different branch roots. These are labeled with an @ and revision number at the end. I initially made a script to delete all of these these duplicate branches if they were actually duplicates, and not different branches that had been given the same name (collapse-past-branches). I found that they were all duplicates though, so I ended up just deleting all of the branches marked with @ symbols (25.delete-past-branches).

The last step in the initial rough import was to create standard git branches and tags for all of the imported branches. The 30.fix-refs script does this work. Most of it is taken from nothingmuch's git-svn-abandon project, which does a similar task to my scripts, but without as much cleanup. For branches, all that is done is to create normal local branches rather than the svn/ prefixed remote branches created by the import. Because SVN doesn't differentiate between branches and tags, git-svn creates doesn't create real tags when importing. So the fix-refs script searches backward from the tag to find what commit it refers to and tags that. Due to the reorganization that had been done to the SVN repository this wasn't entirely adequate, so I had to manually fix some of the tags later.

The repository is now starting to resemble a real Git repository.

Next: Calculating the many merge points to record as grafts.