Please Don't Throw Me into that Briar Patch ...

By Buddy Burden on September 29, 2012 9:47 PM under CPAN, Git

[Quick note: This is mostly a Git story, although there’s plenty of links to Perl code and Perl discussions. It also turned out to be quite a bit longer story than I originally intended. But, even though it gets pretty deep into Git features, I think it should be interesting enough for anyone who has a CPAN module and uses Git. I hope.]

The Backstory: So, a couple months back, I look in my email and spy a message from Damian Conway. Now, some of you fine readers probably know Damian personally, and I’m sure I’ll get several comments pointing out how he’s just an ordinary bloke and all, but to you I say: hush! You may be all jaded and world-weary, but I’m still just a regular Perl schmoe, and when I see an email from The Damian addressed personally to me, I get all goosebumpy. So hush up and allow me my fanboy gushes.

Especially since the purpose of his email was to send me a patch for Method::Signatures (for which I’m a co-maintainer) and to tell me:

I adore Method::Signatures and am now using it in nearly all my own development...and in absolutely all my teaching.

Ooohh ... high praise indeed, from a Perl luminary. Coolio, man!

Anyways, this missive just happened to have the ill fortune to land at the same time that $work began to (uncharacteristically) demand a larger chunk of my life than normal. So I didn’t have time to review the patch too closely, but I certainly didn’t want to lose the opportunity to discuss a few larger design issues with Mr. Conway. Thus began a long, winding discussion among me, Damian, and Schwern (who must be given full credit as the original author of MS), both in email and in GitHub issues, and even spilling over into my blog. And I learned something about Damian: he thinks in code. So, every time he comes up with a new idea, he sends me a new patch. By the time it’s all over, and $work is settled enough that I can really start devoting some time to this issue, I’ve got six, fairly hefty patchfiles from Damian—some of them overwriting or cancelling out the direction of others, the first four of them patched against current HEAD (of master) on GitHub, and the last two as both patches against current HEAD and against the previous patch.

What to do, what to do?

A Brief Tangent on Contributions: At some point, Schwern noted that we’d finally come to a general agreement on basic design, and that he’d made Damian a contributor, so Damian should just go ahead and make a branch, and put his code on it. To which Damian replied:

Re branching repo: Probably won’t. Expect annoyingly complete patches. (I know. Not properly social. Not a team player. Not git-ish. But also not a new phenomenon. Every single report card I ever received included some variation of: “Does not play well with others”.)

Now, at the risk of putting words in The Damian’s mouth, I read this as “hey, I’ve been delivering patches to CPAN authors for years, and I’m not going to change now just because of the latest newfangled method.” Which leads me to ponder a bit on the relationship between author and contributor.

Like so many things in life, this involves a balance. The contributor has a feature they want added. The author wants people to fix their code for them (for free, even). This is a mutually beneficial arrangement, but there’s still all sorts of things to work out. Perhaps the author uses Dist::Zilla and the contributor doesn’t, or perhaps the author adheres to strict TDD and the contributor has sent no tests for the new feature. Or, maybe, the contributor has sent a patch when the author would really prefer a GitHub pull request.

Now the two can have some negotiation. In the TDD example, perhaps the author wants the new feature badly enough that they’re willing to write the tests themselves. Or perhaps the contributor wants the new feature badly enough that they’re willing to go back and add the tests upon request. But, the point is, everybody has to evaluate how bad they want it.

So ... how bad do I want a pack of new features and refactorings from The Damian? Yeah, you bet your ass.

The Meat of the Matter: So here’s what I suggested: I’ll make a branch, apply the patches to it, one at a time, review the code (not that I feel Damian’s code needs a lot of review, but, as he said: “Long and bitter experience indicates I’m every bit as fallible as a real human. Better check that code. ;-)”), make sure it all fits together, and then offer up the results for review.

This sounded simple enough at the time.

Of course, the tricky part is that, for the first four patches at least, they’re all patches against HEAD. That means that, while the first patch is no problem, the next three are going to be annoying, because the source I’m patching against isn’t the source originally used to generate the patchfile. Happily, I foresee this, and start to formulate a plan to work around it. Here’s how it went, and hopefully this is helpful to someone else out there.

First, I make a new branch off head, which I creatively name “damian_features”. Next, I apply patch #1 to my new branch. Since it was an exact copy of master’s HEAD, it applied cleanly ... once I remembered how to work patch. Been a while since I had any cause to apply a patchfile. Looking at the structure of the file helped me jog my memory: each file consists of a set of context diffs from several files. Each file is identified with a relative path. Within each file’s section are one or more “hunks”; that is, the individual context diffs themselves. The patch command (which, you may recall, was Larry Wall’s original claim to fame, before he invented Perl) simply finds the files indicated in the patchfile, then applies each “hunk” of context diffs, one at a time, to said files. If the source of the file being patched matches the source of the file the patchfile was generated from, the hunk “applies cleanly” and all is well. If not ... then you’ve got a problem to solve.

Since patchfiles are nearly always generated a directory above the source directory, but nearly always applied in the source directory, the filenames in the patchfile generally have an extra directory component in them that you need to strip off. Happily, patch can do that for you. In fact, it can strip off an arbitrary number of directories for you, but we’ll just tell it to strip one. The only other mildly tricky bit is that patchfiles are read from stdin, not as args:

    patch -p1 </tmp/Method-Signatures-Conway.patch

And it all works just fine. Next a simple git diff to review the changes—making notes of anything I want to change or ask about, but not actually changing anything yet—then I stage and commit. For the commit message, I make up a brief summary line, but I pull the guts of the message from Damian’s description of the patch in his email. Then I make any little tweaks I want to, as separate commits (that way Damian’s changes and my changes aren’t comingled). I could hold off on this until the end, but doing it in between each commit preserves the history better, and plus the next patch isn’t going to work straight up anyway, so there’s no advantage in waiting.

Okay, the next few patches aren’t going to apply cleanly, so let’s see how bad I want them. Looks like patch #3 pretty much wipes out everything patch #2 did (hey, when you’re thinking out loud in code, you’re going to go down a few blind alleys here and there), but, other than that, all the patches are pretty useful, at least in terms of showing a gradual progression to how we got to the end product. So I decide to toss patch #2 but keep the rest.

So, how do I proceed? Simple:

I switch back to the master branch.
I apply patch #3 there, which is the same HEAD the patchfile was generated from, so it will apply cleanly.
I then stash the changes.
Then I switch back to the damian_features branch.
Then I unstash.

Voilà. Now, this sounds simple(ish) in theory, but of course in practice it isn’t perfect. First, I want to stash any files which the patchfile adds as well as the ones it modifies. To do this, I need git stash -u, but that’s only available in git 1.7.7 or later, which I don’t actually have, so I have to upgrade git. That’s okay; I probably needed to do that anyway.

Next problem I run into is what happens when patch #1 adds a file, and then patch #3 adds the same file. This blows the unstash up (because it’s trying to add a file that already exists), and I have to move the existing file out of the way, then re-unstash, then do a manual diff to see if the two “new” files are identical or not. And, of course, if the first unstash added any really new files before it blew up, I have to remove those before the re-unstash or it’ll blow up again (this takes me a few tries to get right). Overall, this is an annoyance, but luckily none of the changes between patches to any such files are any big deal.

Final problem I run across is, what happens if patch #1 adds some code in a place, and patch #3 doesn’t touch the code in that same place? Well, the code really should be removed, because patch #3 doesn’t contain it, and it’s a full replacement for patch #1. But my method is going to leave it, because an unstash is essentially a merge, so it has no idea that code shouldn’t just stay right where it is. Luckily this only happens once throughout the process (if you think about it, it makes sense that this would be rare, since the patches are all messing with the same features), and I catch it easily.

So basically it’s just rinse and repeat this whole shebang for patch #4, and then patches #5 and #6 have versions against the previous patches, so I just apply those directly.

Now, you may be asking yourself: why go through all this rigamarole with the stashing and unstashing? If the patch doesn’t apply cleanly, you have a mess to clean up. On the other hand, if the unstash doesn’t merge cleanly, you have a mess to clean up there too. What’s the difference?

Well, when the patch doesn’t apply cleanly, it often keels over right there and doesn’t move on to the next file. Plus, what it generates is a file where as many of the hunks as it could apply cleanly are applied, the hunks it couldn’t apply cleanly are stuffed into a .rej file, and the original source is saved as a .orig file. Which is fine, as far as it goes. But it means that you have to pick the rejected hunks out of the .rej, find where they were supposed to go in the file based on the context (because the line numbers likely aren’t right if there are earlier hunks which were applied), insert the rejected hunk, which is actually a diff, which means it’s got little +‘s and -‘s in front of all the lines ... it’s just a little messy, is all. But if the unstash doesn’t apply cleanly, it’s just a merge conflict, and those are simple to fix.

At the end of this whole process, I’ve got 11 commits (5 substantial ones from Damian, and 6 little ones from me), and I review them all before I push:

    git log origin/damian_features..HEAD

That shows me all the commits that I haven’t pushed yet (that command is so handy and yet so hard to remember that I have it scripted). In my review, the first thing I notice is that I can’t really tell Damian’s commits from mine, at least not at a glance. I’ve done all the committing, so they all show up under my name. But that’s wrong: I really want them to show up under Damian’s name. And I know that git has the concept of the author and committer being separate, which is perfect for this application. If only I’d remembered to use the --author switch when I committed!

But, wait: this is git we’re talking about. I can fix this ... right?

Of course I can:

    git rebase -i HEAD^^^^^^^^^^^

Yes, that’s 11 carets, and, yes, there’s a way to abbreviate that, but I don’t recall it off the top of my head (I’m sure someone will supply it in the comments). You might think it should be 10 carets, because HEAD counts as one—at least I always think that—but it’s actually not. This puts me in an editor with all 11 commits listed, with “pick” in front of each. Now all I need to do is change “pick” to “edit” for each of the Damian commits. When I save that file, git reapplies each commit, stopping at each one that I changed to edit (actually, stopping just after committing each of those), then I just have to:

    git commit --amend --author "Damian's info here"
    git rebase --continue

with just a quick ZZ in between to resave the commit messages (which are fine), rinse and repeat 4 more times for the other Damian commits, and, Bob’s yer uncle, it’s all fixed. One last review to make sure it all looks good, then git push to GitHub and send Damian an email with a link to review (look at the commits on Sep 26, 2012, in case there happen to be more by the time you read this), and I’m all done.

Wrap-up: I hope you found something in this long post useful, or interesting, or at least entertaining. I found it to be an interesting and challenging exercise in making two different forms of contributing work together, and I bet it’s not the last time I’ll need to do that.

Certainly, as long as Damian is willing to keep sending me patches, I’ll keep integrating them. Look at the awesomeness we’ve gotten so far!

[Notes: 1) Excerpts from Damian’s emails used by permission. 2) If you’re interested in getting to use the actual code that Damian wrote for Method::Signatures, don’t worry: I’m going to be doing a bit more testing on different Perl versions, but you should see it hit CPAN soon. 3) If you didn’t catch the reference in the post title, you could read the classic American folktale online.]

9 comments

9 Comments

Joel Berger | September 30, 2012 4:39 AM | Reply

Perhaps I'm not following, but rather than go back to master, apply the patch, switch back to "damian features" and unstash, why not make several branches from master as "damian feature N" for N=1..4 (or 6), then apply the patches to each and then merge up? To me that sounds a little less "fly by the seat of your pants", but then again, I'm actually pretty bad at the more advanced git stuff (see: https://github.com/run4flat/App-Prima-REPL/pull/2). I'm curious if you think that that plan would be any better/different?

Ask Bjørn Hansen | September 30, 2012 9:56 PM | Reply

I'm with Joel – I'd have made a branch for each patch (or patch series), too. That way git can help you too conflicts, etc. If you don't want the merges in the history, you can always rebase/cherry-pick etc to clean that up.

Another tip is that if you set the GIT_AUTHOR_EMAIL and GIT_AUTHOR_NAME environment variables to "fake" the author when you run 'git commit'.

Did you mean "patch -p1" instead of "-d1"? (At least that's how I usually do it, maybe the Linux version of patch is different than the BSD one I have a man page for here).

Buddy Burden | September 30, 2012 11:06 PM | Reply

I hear you guys, but, remember: an unstash is a merge. Having them all in separate branchces and merging them all together doesn't seem to me like it would be any different in terms of outcome, but it does seem like it would be more effort/more steps. I agree that it's a perfectly viable alternate method, but I'm not sure I see it as advantageous in any way. But I could be missing something.

Another tip is that if you set the GIT_AUTHOR_EMAIL and GIT_AUTHOR_NAME environment variables to "fake" the author when you run 'git commit'.

True, although it would have been just as easy (perhaps even easier) to just toss in the --author switch, if I'd just remembered at the time. ;->

Did you mean "patch -p1" instead of "-d1"?

Aww, damn ... I sure did. Thanx for the catch; fixed.

Aristotle | October 1, 2012 4:07 AM | Reply

git checkout -b damian-features master
for p in *.patch ; do
    git checkout -b tmp
    git apply --index -p1 "$p" || break
    git checkout damian-features
    git checkout tmp -- . # magic
    git commit -m "$p"
    git branch -D tmp
    rm "$p"
done

Buddy Burden replied to comment from Aristotle | October 1, 2012 7:35 AM | Reply

Okaaaay ... not sure I followed all that, but let me see how far I get:

First off, I never knew that git could apply patches directly, so that's a handy one to have.

Now, secondly, when you say git checkout -b tmp, do you mean git checkout -b tmp master? or perhaps that's the default? Because if the tmp branch is created as a copy of the damian_features branch, I'm not sure how the git apply would work any better than the patch, so it seems like that || break would fire a lot ...

Finally, I think you have to explain the # magic because it flew right over my head. :-)

Joel Berger replied to comment from Buddy Burden | October 1, 2012 5:38 PM | Reply

I know its not conceptually any different, I just personally try not to mess with master unless I know everything is ready. Learned that the hard way too many times. Not saying your method doesn't work. But now I have learned tons about git from your post and the other comments!

Buddy Burden replied to comment from Joel Berger | October 1, 2012 6:27 PM | Reply

I just personally try not to mess with master unless I know everything is ready.

well, fair enough, but note that I'm never actually modifying master. That is, I modify my working copy while it happens to be a checkout of master, but then I promptly stash those changes away. There are never any commits or merges or anything else of a write nature on the master branch itself. So it feels pretty safe to me. YMMV, natch.

Aristotle replied to comment from Buddy Burden | October 2, 2012 5:59 AM | Reply

Now, secondly, when you say git checkout -b tmp, do you mean git checkout -b tmp master?

Oops. Yes I do. See, you’re following well enough to catch my mistake. :-)

Finally, I think you have to explain the # magic because it flew right over my head. :-)

It uses the “git checkout commit -- path” form of git-checkout to copy the given path from the given commit to the index and working tree. If the path is . and you are at the root of your working tree, the result is that your index and working tree contain an exact copy of the state of the commit you named, regardless of what was in your index and working tree before.

Instead of trying to find a way to concoct a diff that will apply to damian-features, you apply each patch to master and then paste a copy of that commit’s working tree into a commit on top of the damian-features branch.

Then you get to look at diffs between these snapshots.

Just make sure the patches are applied in the order they were created.

(Remember, Git does not store deltas. It stores a full tree snapshot for each commit and only computes diffs at display time. Actually this would work even it stored deltas, but I find it helpful to think of commits as tarballs which I can ask Git to diff against each other, and the fact that Git uses that model internally makes it easy to remember.)

Aristotle replied to comment from Buddy Burden | October 2, 2012 6:22 AM | Reply

Btw, this is the porcelain-based procedure. You can do this with plumbing too, and it may be illustrative to see how that looks:

git branch damian-features master
git checkout master
for p in *.patch ; do
    git apply --index -p1 "$p" || break
    TREE=$( git write-tree )
    COMMIT=$( git commit-tree -m "$p" -p damian-features "$TREE" )
    git symbolic-ref damian-features "$COMMIT"
    git reset --hard
    rm "$p"
done

This skips the entire index and working tree rigmarole. After the patch is applied to the index, the git write-tree creates a tree object from the index; the git commit-tree creates a commit object for that tree object, with the current tip of damian-features as its parent commit (and a commit message, obviously); finally, the git symbolic-ref sets the damian-features pointer to point to this new commit object.

So the damian-features branch just advanced by one commit.

Then the current state of the index is thrown away, and the whole thing can start over.

So each time through the loop, you start with a clean checkout of master, and you end with a commit pasted onto the top of the damian-features branch. When you think of a series of commits as a series of tarballs, there isn’t anything really unobvious about the procedure.

(To my mind this is actually much more straightforward and comprehensible than trying to get porcelain commands to give you the effect you are after. And if you add an echo in there you will see that TREE and COMMIT are SHA1 hashes, which you can pass to git show to look at the objects: nothing surprising is going on. But it may seem scarier to users who use Git only lightly.)

Name

Email Address

URL

Remember personal info?

Comments (You may use HTML tags for style)

About Buddy Burden

14 years in California, 25 years in Perl, 34 years in computers, 55 years in bare feet.

More info »

Buddy Burden