Introducing Git::Database

Git::Database is yet another module I wrote to interact with Git. It wraps an OO-layer around Git objects (blobs, trees, commits, tags), in a way that's very similar to what Git::PurePerl does. It has no opinion on the actual means to get the data from the Git object database: that bit is done with the Perl Git wrapper of your choice.

At the moment, there's only one supported wrapper for fetching data from Git: my own Git::Repository. I already have branches with working code (i.e. passing all the relevant tests) for Git::Wrapper, Git::Sub, Git::PurePerl and even the venerable Git.pm, which I'll publish when they are more feature-complete. Git::Class is missing a critical feature I need to get the data from Git, and I couldn't figure out how to get the data using Git::Raw. Patches welcome!

The release is a version 0.001, as I expect the interface to have some rough edges that need some polishing. Depending on the feedback I receive, a version 1.000 should appear in a few months.

It's a long story...

Simply by looking at the dates of the various stages of this project, you can tell I'm a procrastinator and a perfectionist who will die with folders full of interesting ideas and not a single line of code written for any of them. ;-( Maybe I should blog about those, and let others run with them, if they have the time and inclination. (Obviously, one of these projects is a blogging tool...)

To be honest, I don't really need something like Git::Database: I just thought it was a cool idea that could provide some useful functionality. That also explains the slow pace of development.

Anyway, here's Git::Database's lineage:

Git::Simple

The first commit in the Git-Database repository dates back to August 2013. At the time, I called it Git::Simple, after a discussion with GETTY, who was frustrated at how cumbersome it was to set up test repositories for testing the interactions with Git in one of his projects. In a way, Git::Database is one of the yaks on the path to Git::Simple.

Glow

Most of the ideas and code for Git::Database come from another, older project: Glow (for Git-Like Object Warehouse), which I started in June 2012 as an attempt to force myself to learn Moose. One of the main ideas was to use the Git principles (using hash digests as the identity of the objects, and linking objects by using those digests as pointers) to build some projects similar to Git.

One of my targets was a backup system, where the tree objects would support a larger scope of permissions (possibly extended attributes), and the storage would be handled by distributed key-value stores. Many people have tried to use Git as a backup system, but the limitations of Git (especially regarding large files, the digest size, and the fact that commits lack some of the metadata that's interesting for backups) meant they sometimes had to work around Git. My idea was to steal the best ideas from Git to build a framework for making your own "Git-like" store, with slightly different objects (e.g. using SHA-256, or splitting large files into smaller chunks using the rsync algorithm, like bup does).
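To make the "Git-like store" idea concrete, here is a minimal Python sketch of a content-addressed store with a configurable digest. It is only an illustration of the principle, not Glow's actual code or API: the class name `ObjectStore`, its methods and the toy "manifest" object are all assumptions made for this example. The only Git-specific fact it relies on is that Git hashes the header `<type> <size>\0` followed by the content.

```python
import hashlib


class ObjectStore:
    """Toy in-memory content-addressed store (hypothetical, for illustration)."""

    def __init__(self, algorithm="sha1"):
        self.algorithm = algorithm  # any name accepted by hashlib.new()
        self._objects = {}

    def digest(self, kind, content):
        # Same layout Git uses: "<type> <size>\0" followed by the raw content.
        header = f"{kind} {len(content)}".encode() + b"\0"
        return hashlib.new(self.algorithm, header + content).hexdigest()

    def put(self, kind, content):
        # The digest is the identity of the object, so it is also its key.
        key = self.digest(kind, content)
        self._objects[key] = (kind, content)
        return key

    def get(self, key):
        return self._objects[key]


store = ObjectStore("sha1")
blob_key = store.put("blob", b"hello\n")
# -> ce013625030ba8dba906f756967f9e9ca394464a (same as `git hash-object`)

# A toy "manifest" (NOT Git's real tree format) that links to the blob
# by using its digest as a pointer.
manifest_key = store.put("manifest", f"hello.txt {blob_key}\n".encode())
```

Swapping the digest is a one-argument change (`ObjectStore("sha256")`), which is the kind of "slightly different objects" variation described above.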

The nice trick was that the first store I'd write with Glow would be compatible with Git, which would make it easy to test against realistic data.

git-test-repo-tool

In January 2013, as I was trying to compare the speed of Glow with that of the other Perl Git wrappers (by having them perform the equivalent of git log on various repositories), Git::PurePerl choked on a commit (probably 8ead1bfe111085ef1ad7759e67340f074996b244 from git.git), which contained a mergetag. Since I wanted Git::Repository to be the best Perl Git wrapper available, I fixed Git::Repository::Log (which was at the time still part of the Git-Repository distribution) the next month.

A few months later, my perfectionist mind came up with the idea of building a pathological Git repository containing all the unusual things I could find in Git repositories (result here), in order to make it possible to easily feed some weird but realistic data to any Git-related code.

In fact, I wrote an unpublished predecessor of Git::Database (named GitDb) to properly parse the raw content data from Git objects and produce the test data for Git::Repository::Log. Some of that code ended up in a pull request for one of the many Git::PurePerl repositories.

Git::PurePerl

While working on Glow, I took a lot of inspiration (and code) from Git::PurePerl. One of the things that annoyed me the most with Git::PurePerl is the split between the Git::PurePerl::Object::* and Git::PurePerl::NewObject::* class hierarchies. NewObject represents objects that are not in the Git object database yet, and have to be put there via the put_object method. Object objects, on the other hand, represent objects that come from the git database (obtained with get_object). They don't even have the same interface!

I was certain that I could work with a single set of Object classes (i.e. get rid of NewObject), and use lazy builders to construct whatever piece of data I needed. With the digest and a git repository, I could build the content, and with the content I could build the digest or whatever other data was needed.

In fact, at some point I thought about providing patches for Git::PurePerl that would merge the Object and NewObject hierarchies into one. But that would be a major, and probably incompatible, change for Git::PurePerl. On the other hand, those objects could be useful for people using other Git wrappers, so why limit myself to Git::PurePerl?

Attributes like commit_info and tag_info were inspired by the directory_entries attribute of the Git::PurePerl tree objects.

Below are some graphs that show the data-generation paths that I first designed in Glow and later ported (and fixed) to Git::Database:

  • blob

    ┌──▶ digest ───▶ content ─┬──▶ size
    └─────────────────────────┘
    
  • tree

                  ┌── directory_entries ◀──┐
    ┌──▶ digest ──┴──────▶ content ────────┼──▶ size
    └──────────────────────────────────────┘
    
  • commit

                  ┌── commit_info ◀──┐
    ┌──▶ digest ──┴───▶ content ─────┼──▶ size
    └────────────────────────────────┘
    
  • tag

                  ┌── tag_info ◀──┐
    ┌──▶ digest ──┴──▶ content ───┼──▶ size
    └─────────────────────────────┘
    

Basically, you can compute everything from the content, and you get the content from the Git database using the digest (if it actually points to something in the object database).
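The blob graph can be sketched in a few lines of Python, with `functools.cached_property` standing in for lazy builders. This is an illustration under stated assumptions, not the Git::Database implementation: the `Blob` class, its constructor arguments and the plain dictionary used as an object store are all hypothetical.

```python
import hashlib
from functools import cached_property


class Blob:
    """Hypothetical blob: built from content, or from a digest plus a store."""

    def __init__(self, content=None, digest=None, store=None):
        if content is None and digest is None:
            raise ValueError("need at least one of content or digest")
        self._store = store
        # Seed whatever the caller supplied; the missing attributes are
        # computed lazily (and cached) on first access.
        if content is not None:
            self.content = content
        if digest is not None:
            self.digest = digest

    @cached_property
    def content(self):
        # digest -> content: only possible if there is a store to ask.
        if self._store is None:
            raise LookupError("no store to fetch the content from")
        return self._store[self.digest]

    @cached_property
    def digest(self):
        # content -> digest: SHA-1 of Git's "blob <size>\0" header + content.
        header = b"blob %d\0" % len(self.content)
        return hashlib.sha1(header + self.content).hexdigest()

    @cached_property
    def size(self):
        # content -> size
        return len(self.content)


# A dict stands in for the Git object database in this sketch.
fake_db = {"ce013625030ba8dba906f756967f9e9ca394464a": b"hello\n"}

from_content = Blob(content=b"hello\n")
from_digest = Blob(digest="ce013625030ba8dba906f756967f9e9ca394464a", store=fake_db)
```

Both objects expose the same interface: `from_content.digest` is computed from the content, while `from_digest.content` and `from_digest.size` are fetched through the digest. Trees, commits and tags would add one more lazily built attribute each, derived from the content.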

I couldn't have done it without...

People

I'm happy to repeat that the best ideas I ever had were the ones I bounced off other people. Having to defend an idea against people who don't like it is a good way to either prove its strengths, or realize that it doesn't actually work. It's also a good way to hear about things I hadn't thought of myself and integrate them into the larger project.

The #moose and #london.pm IRC channels have been very helpful, especially with knocking off some of my not-so-great ideas (to be fair, I usually went to them asking for a beating, and I usually got what I asked for ;-). The design is much better thanks to those online discussions.

The last big push of design work on Git::Database happened in April 2016, on the sidelines of the QA Hackathon in Rugby. There I could enjoy chatting extensively (and in French) about Git::Database with Olivier Mengué (DOLMEN). He suggested that I split the "backends" interfaces between various sets of operations (reading, writing) on different types of objects (actual Git objects and references), so that any backend could still support a subset of roles (and the user could test for it) if for some reason they were unable to perform the whole set. His proposal was much better than my initial idea. He also talked me into making the Git::Database object much simpler than I had envisioned.
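The idea of splitting a backend into small roles that a user can test for can be sketched in Python with `typing.Protocol`. The role and method names below (`ObjectReader`, `ObjectWriter`, `RefReader`, `get_object`, `put_object`, `refs`) are chosen for this illustration and should not be read as the module's actual interface.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ObjectReader(Protocol):
    def get_object(self, digest: str) -> bytes: ...


@runtime_checkable
class ObjectWriter(Protocol):
    def put_object(self, kind: str, content: bytes) -> str: ...


@runtime_checkable
class RefReader(Protocol):
    def refs(self) -> dict: ...


class ReadOnlyBackend:
    """Hypothetical backend that can read objects and refs, but not write."""

    def __init__(self, objects, refs):
        self._objects = objects
        self._refs = refs

    def get_object(self, digest):
        return self._objects[digest]

    def refs(self):
        return dict(self._refs)


def store_blob(backend, content):
    # Callers test for the role before relying on it.
    if not isinstance(backend, ObjectWriter):
        raise NotImplementedError("this backend cannot write objects")
    return backend.put_object("blob", content)


backend = ReadOnlyBackend({"abc": b"data"}, {"refs/heads/master": "abc"})
can_read = isinstance(backend, ObjectReader)   # True
can_write = isinstance(backend, ObjectWriter)  # False
```

A backend that only implements part of the full set is still usable for the operations it does support, and code that needs more can fail early with a clear error.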

Olivier has made his own Git wrapper, Git::Sub, which I wanted to include in the backend selection as soon as I decided that Git::Database would not be a Git::Repository subclass anymore. It's a little different from the other wrappers on CPAN, as it's the only one that does not wrap the Git repository in a Perl object, but operates from the current working directory.

Tests

I'm using Git bundles for testing: this way I can manually create a Git repository with the specific features I want to test, and then bundle it in a small file so that I don't have to do that setup work ever again. At first I used git bundle unbundle to unpack the bundle, but that only printed the references on standard output, so more work was needed to make them point to the right places. Since Git version 1.6.5, git clone works with a bundle, and it will set up all the references too. I have another file holding a Perl data structure that contains all the parts I want to test. That way, adding new tests for a specific Git idiosyncrasy is just a matter of building the appropriate bundle and the data structure containing the expected data.

My test suite will be complete when it ships with the pathological bundle, and all backends pass all the tests.


About BooK

Pink.