Dist::Zilla, Pod::Weaver and bin

I use Dist::Zilla for managing my distributions. It's awesome and useful, although it lacks some bits of documentation every now and then; this lack is compensated in other ways, e.g. IRC.

I also use the PodWeaver plugin to automatically generate boilerplate POD stuff in the modules. Some time ago I needed to add some programs to a distribution of mine (which I also managed to forget at the moment, but this is another story), and this is where I got hit by all the voodoo.

The first program was actually a Perl program, consisting of a minimal script to call the appropriate run() method in one of the modules of the distro:

$ cat bin/perl-program 
#!/usr/bin/env perl
use prova;
prova->run();

This led Dist::Zilla to complain like this:

$ dzil build
[DZ] beginning to build prova
[DZ] guessing dist's main_module is lib/prova.pm
couldn't determine document name for bin/perl-program at ...

which isn't the most helpful message in the world (the complaint actually comes from Pod::Weaver), but let's ignore that for the moment. After a bit of googling - or whatever, I actually don't remember - I found that there were basically two options:

  • put an explicit package declaration in the driver program, like this:

    $ cat bin/perl-program 
    #!/usr/bin/env perl
    package prova;
    use prova;
    prova->run();
    
  • put a comment with a PODNAME:

    $ cat bin/perl-program 
    #!/usr/bin/env perl
    # PODNAME: prova
    use prova;
    prova->run();
    

The latter seems slightly less dumb so I opted for it. I say slightly because IMHO the bottom line should be that the name is equal to the filename, but this is (again) another story. Yes, I know, it's open source and I can propose patches - did I say I'm not complaining?

Anyway, I then needed to add a shell script to the lot:

$ cat bin/script.sh 
#!/bin/bash
echo 'Hello, World!'

and again the error popped up:

$ dzil build
[DZ] beginning to build prova
[DZ] guessing dist's main_module is lib/prova.pm
[PodWeaver] [@Default/Name] couldn't find abstract in bin/perl-program
couldn't determine document name for bin/script.sh at ...

Now, I could use the PODNAME trick above:

$ cat bin/script.sh
#!/bin/bash
# PODNAME: script.sh
echo 'Hello, World!'

but it turns out - with little surprise - that the file is considered a Perl one, with the consequence that in the distribution it gets POD added to it:

#!/bin/bash
# PODNAME: script.sh
echo 'Hello, World!'

__END__
=pod

=head1 NAME

script.sh

=head1 VERSION

version 0.1.0

=head1 AUTHOR

Flavio Poletti <polettix@cpan.org>

=head1 COPYRIGHT AND LICENSE

blah blah blah...

=cut

The only thing is that - ehr... - bash does not like POD very much.

Google was not my friend in this case, but I found a friend in Dist::Zilla's IRC channel (which is #distzilla on irc.perl.org, by the way): someone who I think (hope) is Christopher J. Madsen, the original contributor of Dist::Zilla::Plugin::FileFinder::ByName.

He instructed me to use that module - which entered the core as of version 4.300003 - to obtain what I was after: keep the PodWeaver plugin working on Perl stuff, while ignoring shell stuff. But, more importantly, he explained why it works.

Many plugins - including the PodWeaver one - rely upon the Finder role to do their work. This role is something that finds files to be used by the plugin; in the case of PodWeaver, the default is to take whatever module will be installed and whatever stuff is in the bin directory. OK, it can be a bit more complicated than this, but most of the time it's OK. This default is the same as if we configured the following in the dist.ini file:

[PodWeaver]
finder = :InstallModules
finder = :ExecFiles

It turns out that if we explicitly configure a finder all the defaults are wiped away, so - for example - if our bin directory contained shell scripts only we could be happy with this:

[PodWeaver]
finder = :InstallModules

In our case, anyway, this would disable PodWeaver for Perl programs as well, which is not acceptable. This is where FileFinder::ByName kicks in:

[FileFinder::ByName / BinNotShell]
dir = bin
skip = .*\.sh$

This plugin lets us create new finders. In the example above, we are creating a finder that finds stuff in the bin directory, skipping all files that end with .sh (note that skip sets a regular expression). At this point, we're ready for the configuration of the PodWeaver plugin:

[PodWeaver]
finder = :InstallModules
finder = BinNotShell

That's right: the name of the new finder takes no leading colon. Now, when it's time for the PodWeaver plugin to find files, it will get all the modules that will be installed AND the files in the bin directory whose name does not end with .sh. Yay!
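To get a feeling for what the skip expression lets through, here is a minimal Perl sketch. It is not how FileFinder::ByName works internally (for instance, it makes no attempt to reproduce whether the plugin tests the full path or just the file name); it only applies the same regular expression to a hypothetical list of names, of which bin/another-tool is made up for the example:

```perl
use strict;
use warnings;

# Hypothetical candidates; bin/another-tool is invented for illustration.
my @candidates = qw(bin/perl-program bin/script.sh bin/another-tool);

# The same expression used for "skip" in the finder configuration.
my $skip = qr/.*\.sh$/;

# Keep only the names that do NOT match the skip expression.
my @kept = grep { $_ !~ $skip } @candidates;

print "$_\n" for @kept;
```

Only bin/script.sh is dropped, which is exactly the file we do not want Pod::Weaver to touch.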

Exclusive Perl Archive Nook

I started working on epan, a (somewhat thin) wrapper around cpanminus to create a version of CPAN trimmed down to your needs for installing specific stuff.

This is what the cool guys probably call DPAN these days, but I found that the whole concept of DarkPAN revolves much around getting your private stuff into the "normal" Perl toolchain, while in this case I need to be able to easily install modules in machines that are out of Internet reach.

To start with an example, suppose you have to install Dancer and a couple of its plugins in a machine that - for good reasons - is not connected to the Internet. It's easy to get the distribution files for Dancer and the plugins... but what about the dependencies? It can easily become a nightmare, forcing you to go back and forth with new modules as soon as you discover the need to install them.

Thanks to cpanminus, this is much easier these days: it can actually do what's needed with a single command:

      # on the machine connected to the Internet or to a minicpan
      $ cpanm -L xxx --scandeps --save-dists dists \
           Dancer Dancer::Plugin::FlashNote ...

which places all the modules in subdirectory dists (thanks to option --save-dists) with an arrangement similar to what you would expect from a CPAN mirror. Alas, on the target machine, you still have to do some work - e.g. you should collect the output from the invocation of cpanm above to figure out the order to use for installing the distribution files.

Additionally, the directory structure that is generated lacks a proper index file (located in modules/02packages.details.txt.gz) so it would be difficult to use the normal toolchain.

epan aims at filling up the last mile to get the job done, providing you with a subdirectory that is ready for deployment, with all the bits in place to push automation as much as possible. So you can do this:

      # on the machine connected to the Internet or to a minicpan
      $ epan create Dancer Dancer::Plugin::FlashNote ...
      $ tar cvzf epan.tar.gz epan

transfer epan.tar.gz to the target machine and...

      # on the target machine
      $ tar xvzf epan.tar.gz
      $ cd epan
      $ ./install.sh

optionally providing an installation target directory:

      $ ./install.sh /path/to/local/perl

The epan directory that is generated should be compatible with other tools - in particular, the modules/02packages.details.txt.gz file is generated, so the whole toolchain (including cpan) should play nicely with it.

DotCloud::Environment

I'm in the process of releasing DotCloud::Environment, a module that should ease the developer's life by providing a unified entry point to get dotCloud's configurations for an application.

A typical case I had while playing with dotCloud was that I could easily deploy an application, but I had no simple way to set up a basic test environment on my development machine. This is unfortunate because it shifts all testing onto the deployed infrastructure.

When you create an application on dotCloud, you're probably going to have some services that resolve to code you have to write, and other ones that resolve to data you're going to populate or use. The link that allows a code service to access a data service is the file /home/dotcloud/environment.json (or its equivalent YAML representation /home/dotcloud/environment.yml), so you know where to look when you are in the deployed environment.

What happens when you are in your development environment? You're basically on your own - you have to figure out that you're not in dotCloud, find somewhere to put the configuration, etc. etc. In a few words: boring stuff that you don't need.

DotCloud::Environment's goal is to streamline and simplify the process, providing a unified interface to access dotCloud's configuration in whatever environment you are in. It has a default mode that should fit the typical case, while still letting you customise things when your setup needs it.

A Typical Directory Layout

(for some definition of typical, of course)

In order to keep your code clean, you will probably be dividing it depending on the functional block that will be deployed as a service in dotCloud. Suppose that you have a frontend service, a backend service and a database; you probably have the following directory layout:

project
+- dotcloud.yml
+- backend
|  | ...
|  +- lib
|     +- Backend.pm
+- frontend
|  | ...
|  +- lib
|     +- FrontEnd.pm
+- lib
   +- Shared.pm

Each service is put into a separate directory and all the code that they both use (e.g. functions to connect to databases) is put in a common lib directory.

Where Should I Put My Local Configuration?

The main goal is to let DotCloud::Environment find the right environment.json (or, equivalently, environment.yml) depending on the environment you are in. If you are in dotCloud there is actually no problem, because by default the right /home/dotcloud/environment.json file is selected; for your local development the best thing to do is to put the configuration file in the project's root directory, which becomes like this:

project
+- dotcloud.yml
+- backend
|  | ... 
|  +- lib
|     +- Backend.pm
+- frontend
|  | ... 
|  +- lib
|     +- FrontEnd.pm
+- lib
|  +- Shared.pm
|     
+- environment.json

Putting the file in that position lets DotCloud::Environment find it by default when no /home/dotcloud/environment.json file (or the equivalent YAML file) is found in the system. Which hopefully is the case of your development environment.

Of course you should customise this environment.json/environment.yml file to suit your needs in the development environment, following the same rules that dotCloud uses to generate it. It's quite straightforward so you should not have problems with this.
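The lookup order described above can be sketched in a few lines of Perl. Beware that find_environment_file is a hypothetical helper written for this note, not a function of DotCloud::Environment, and that the module's real search logic may well differ; the second parameter exists only to make the sketch easy to try out:

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper, NOT part of DotCloud::Environment: returns the first
# existing configuration file, looking in the deployed location first and
# falling back to the project's root directory.
sub find_environment_file {
   my ($project_root, $system_dir) = @_;
   $system_dir = '/home/dotcloud' unless defined $system_dir;

   for my $dir ($system_dir, $project_root) {
      for my $name (qw< environment.json environment.yml >) {
         my $candidate = File::Spec->catfile($dir, $name);
         return $candidate if -e $candidate;
      }
   }
   return;    # nothing found
}

# Assumes the current directory is the project's root
my $file = find_environment_file('.');
print defined $file ? "using $file\n" : "no environment file found\n";
```

On a development machine there is normally no /home/dotcloud directory, so the file in the project's root is the one that gets picked up.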

And Now?

Now you're ready to use DotCloud::Environment!

In your code services you will probably need to access the shared library only. DotCloud::Environment helps you find the directory to provide to use lib via the path_for function:

# -- in Backend.pm and FrontEnd.pm --
use DotCloud::Environment 'path_for';
use lib path_for('lib');
use Shared ...;

On the other hand, in your shared library you will probably need to access the actual configurations for the environment you are in. The most straightforward way to do this is via the dotvars function, which gives you a hash (or a reference to an anonymous hash, if that is what you ask for) containing the relevant configurations for the service you need:

# -- in Shared.pm --
use DotCloud::Environment 'dotvars';

# ... when you need it... 
my $vars = dotvars('service-name');

For example, suppose that you want to implement a function to connect to a Redis service called redisdb and a function to connect to a MySQL service called sqldb:

use DotCloud::Environment 'dotvars';

# Connect to the Redis service called "redisdb"
sub get_redis {
   my $vars = dotvars('redisdb');  # getting an anonymous hash

   require Redis;
   my $redis = Redis->new(server => "$vars->{host}:$vars->{port}");
   $redis->auth($vars->{password});
   return $redis;
}

# Connect to the MySQL service called "sqldb"
sub get_sqldb {
   my %vars = dotvars('sqldb'); # getting a hash
   my ($host, $port, $user, $pass)
         = @vars{qw< host port login password >};

   require DBI;
   my $dbh = DBI->connect("dbi:mysql:host=$host;port=$port;database=db",
                          $user, $pass, {RaiseError => 1});
   return $dbh;
}

Of course you can use dotvars directly in FrontEnd.pm and Backend.pm, but you will probably benefit from refactoring your common code to avoid duplication.

Minor Customisations

If you don't like how dotvars (or any other function in the functional interface) is named, you can take advantage of the fact that Sub::Exporter is used behind the scenes. This lets you do this:

use DotCloud::Environment
   dotvars => { -as => 'dotcloud_variables_for' };
my %vars = dotcloud_variables_for('my-service');

Major Customisations

DotCloud::Environment lets you play with an interface that is wider than just path_for and dotvars, of course, in addition to not strictly requiring you to put the development configuration file in the suggested position. You're encouraged to take a look at the full documentation; in case you want to keep your file in some fixed location, you can use the environment variable DOTCLOUD_ENVIRONMENT.

Conclusions

Do you like dotCloud? I do, and now I think that I'll find it a bit easier to develop stuff that is (dot)Cloud-ready (to use some buzz-expression).

Quick note on using module JSON

This Unicode stuff tried to drive me crazy; hopefully I'll record something useful here, because the docs are a bit too intricate to understand.

The underlying assumption is that data is "moved" using utf8 encoding, i.e. files and/or stuff transmitted on the network contain data that is utf8 encoded. This boils down to the fact that some characters will be represented by two or more bytes if you look at the raw byte stream.

There are two ways you can obtain such a stream in Perl, depending on what you want to play with at your interface: string of bytes or string of characters. It's easy to see whether a string is the first or the second, because it suffices to call utf8::is_utf8:

$is_string_of_characters = utf8::is_utf8($string);

The name is not a very happy choice, in my opinion, because what it actually tells you is whether Perl has activated its internal utf8 representation. I would have preferred to have something called is_characters or whatever.

There's a small corner case in which the test above is false but you are dealing with a string of characters: it's the plain ASCII case, in which the string of bytes and the string of characters are exactly the same. But you understand that this is not a problem.

However you want to deal with data at the interface, we will always assume that you use strings of characters inside your program, i.e. if your string contains accented characters (for example) the test above will be true.
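A quick way to see the test at work is the following sketch, which uses only core Perl. The two sample strings are made up for the example; the second one contains U+263A, a character that cannot fit in a single byte:

```perl
use strict;
use warnings;

my $ascii = 'plain text';        # ASCII only: bytes and characters coincide
my $chars = "smile \x{263A}";    # contains U+263A, outside the one-byte range

printf "ascii flagged: %s\n", utf8::is_utf8($ascii) ? 'yes' : 'no';
printf "chars flagged: %s\n", utf8::is_utf8($chars) ? 'yes' : 'no';

# length counts characters here, not bytes
printf "chars length:  %d\n", length $chars;
```

The first test is false and the second one is true, while the length of the second string is 7 because it is measured in characters.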

String of Bytes

If you have a string of bytes that represents valid utf8, your data is already in the right shape to be transmitted and/or saved without doing further transformations. In this case, to save it to a file you have to set the filehandle in raw mode, so that you rule out any additional transformation on the data:

binmode $outfh, ':raw';
print {$outfh} $string_of_bytes;

The same applies for stuff that you want to read, of course:

binmode $infh, ':raw';
$string_of_bytes = do { local $/; <$infh> }; # poor man's slurp

String of Characters

If you have a string of characters, either you convert it to the string of bytes representation using Encode::encode and go back to the previous case:

my $string_of_bytes = encode('utf8', $string_of_characters);

or you tell your interface to do this transformation for you by setting up the proper encoding:

binmode $outfh, ':encoding(utf8)';
print {$outfh} $string_of_characters;

The same goes for your inputs:

binmode $infh, ':encoding(utf8)';
$string_of_characters = do { local $/; <$infh> };

and, of course, if you want your string of characters from a raw bytes representation you can use Encode::decode:

my $string_of_characters = decode('utf8', $string_of_bytes);

The funny thing is that Encode::decode lets you choose the encoding you start from, not only utf8, but always under the assumption that the input is the raw bytes representation. The same applies to Encode::encode, of course; anyway, in my opinion new stuff should stick to utf8 so I won't dig into this further.
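The round trip between the two representations can be tried with this small sketch. The sample string is invented for the example, and the encoding is spelled 'UTF-8' (the strict variant known to Encode) instead of 'utf8' as in the snippets above, which is a choice of this sketch:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $string_of_characters = "caff\x{e8}";   # 5 characters, the last is U+00E8
my $string_of_bytes = encode('UTF-8', $string_of_characters);

# The accented letter takes two bytes in UTF-8 (0xC3 0xA8)
printf "characters: %d, bytes: %d\n",
   length($string_of_characters), length($string_of_bytes);

# Going back yields the original string of characters
my $back = decode('UTF-8', $string_of_bytes);
print "round trip ok\n" if $back eq $string_of_characters;
```

The string of bytes is one element longer than the string of characters, which is the "two or more bytes" effect mentioned at the beginning.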

What About JSON?

Now that we have set the baseline:

  • all internal stuff will be using Perl's internal Unicode support, which means strings of characters, which means that scalars containing stuff outside ASCII will have their flag set;
  • communications with the external world will be done using the utf8 encoding;

we can finally move on to using the JSON module properly. We have to cope with four different cases, depending on the following factors:

  • format: is the JSON representation a string of bytes or a string of characters?
  • direction: are we converting from JSON to Perl or vice-versa?

We can't say this too many times: in all cases, all the string scalars in the Perl data structure will always be strings of characters (which utf8::is_utf8 may or may not report, depending on whether they contain data outside ASCII, as we already noted).

String of Bytes

If you're playing with raw bytes on the JSON side, decode_json and encode_json are the functions for you:

$data_structure = decode_json($json_as_string_of_bytes);
$json_as_string_of_bytes = encode_json($data_structure);

String of Characters

On the other hand, if your JSON scalar is (or has to be) a string of characters, you have to use from_json and to_json:

$data_structure = from_json($json_as_string_of_characters);
$json_as_string_of_characters = to_json($data_structure);
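The difference between the two flavours shows up as soon as the data contains something outside ASCII. The sketch below uses JSON::PP, which ships with Perl, so that it runs without installing anything; this is an assumption of the example, not what the snippets above use. JSON::PP provides encode_json and decode_json, while the object built with JSON::PP->new (no utf8 option) plays the role that to_json and from_json have in the JSON module:

```perl
use strict;
use warnings;
use JSON::PP;

my $data_structure = { face => "\x{263A}" };   # one character, U+263A

# Bytes on the JSON side: the UTF-8 encoding step is done for us
my $json_as_string_of_bytes = encode_json($data_structure);

# Characters on the JSON side: no encoding step is applied
my $codec = JSON::PP->new;
my $json_as_string_of_characters = $codec->encode($data_structure);

printf "bytes: %d, characters: %d\n",
   length($json_as_string_of_bytes), length($json_as_string_of_characters);

# Both ways lead back to the same Perl string of characters
my $from_bytes = decode_json($json_as_string_of_bytes);
my $from_chars = $codec->decode($json_as_string_of_characters);
print "same data\n" if $from_bytes->{face} eq $from_chars->{face};
```

The bytes version is two elements longer, because U+263A takes three bytes in utf8 but is a single character in the other representation.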

Summary

A code fragment is worth a billion words. For stuff that you read:

$json = read_in_some_consistent_way();
$data_structure = utf8::is_utf8($json)
   ? from_json($json)
   : decode_json($json);

For stuff that you have to write:

$json = want_characters()
   ? to_json($data_structure)
   : encode_json($data_structure);

As a final note, if you want to be precise in your new projects you should always stick to using the utf8 Perl IO layer, in order to properly enforce checks on your inputs and forget about encoding issues in your outputs. This means of course that you end up using from_json/to_json.

git for me

I use git for my own projects, and the day-to-day stuff does not involve much messing with remote repositories. Anyway, it's a fantastic tool to keep track of the code and avoid losing anything.

getting started (installation apart!)

In the directory for the new project, just init:

  git init
  Initialized empty Git repository in ...

At this point, there's nothing tracked, so after editing some files you can add whatever you want to track, either by filename or by directory name. The fastest thing to do is probably:

  git add .

Note that at this point nothing has actually been put inside the tracking system - you'll have to commit your change:

  git commit

You'll be asked to enter a commit log line and possibly something more, but it's highly probable that at this stage you'll always want to use the same message, like this:

  git commit -m 'initial import'

We'll talk a bit more about commit in seeking and committing below.

excluding stuff

When there are files in your directory that you don't want to track (e.g. some tar archive), git will keep telling you about them. You can turn this whining off by adding elements to the .git/info/exclude file:

  echo '*.tar.gz' >> .git/info/exclude
  echo '*.swp'    >> .git/info/exclude

seeking and committing

To see what's going on:

  git status

or, to have an idea of the changes:

  git diff --color

In the latter case, you can also provide some file names to restrict the diffing:

  git diff --color path/to/file

After you have an idea of the log line to write, you can commit:

  git commit path/to/file -m "your log line here"

Without the -m parameter you enter interactive mode to provide a log line and some more log details.

To commit all the modified files that are already tracked, just use -a instead of passing the file name.

If you want to choose which files go into the commit, you can add them first:

  git add file1
  ...
  git add file2
  git add path/to/file3

and then just commit:

  git commit -m 'some complex commit'

tag

Sometimes it's useful to set a tag in certain stable conditions. This can be done automatically by Dist::Zilla, for example.

To add a tag at the current HEAD:

  git tag name-of-tag

If you want to put a tag in the past, just find out the SHA1 digest of the commit with the log subcommand:

  git log
  ...
  commit 1a9529d56959f7ad9287faaf3b143649516fa63f
  Author: Flavio Poletti <flavio@polettix.it>
  Date:   Mon May 31 17:36:22 2010 +0200

     turned complete-check in pure Perl
  ...

and use that:

  git tag name-of-tag 1a9529d56959f7ad9287faaf3b143649516fa63f

You don't usually have to use the full SHA1, as long as git can be sure of what you're referring to. Usually some 8-10 characters suffice:

  git tag name-of-tag 1a9529d569

getting a previous version of a file

If you want to get some previous version of a file, you first have to seek the right version. You can either get a commit or a tag, then:

  git checkout name-of-tag path/to/file

You can use the commit SHA1 instead of name-of-tag. If you want to refer to some near version, you can refer to HEAD:

  git checkout HEAD^ path/to/file   # one step back
  git checkout HEAD^^ path/to/file  # HEAD - 2
  git checkout HEAD~5 path/to/file  # HEAD - 5

branching

If you want to make some experiments without messing up the main trunk you can create a branch. To fork at the current point, you can just create the branch and switch to it:

  git branch branch-name-here
  git checkout branch-name-here

To do these two operations in a single step (which is what you want 99% of the time) just add -b to checkout:

  git checkout -b branch-name-here

You can also branch somewhere in the past, i.e. at a tag or specific SHA1:

  git checkout -b branch-name-here name-of-tag

When you're happy with the branch status and your modifications, you can merge them back into the main trunk:

  git checkout master
  git merge branch-name-here

At this point, you can also get rid of the branch if you want:

  git branch -d branch-name-here

If you don't want to get all the changes in a branch, but just some of them, you can cherry-pick instead. Cherry-picking means picking exactly the commits that you want, so you have to know them beforehand (e.g. via the log command, see example above in the tag section):

  git cherry-pick 1a9529d569
  git cherry-pick c8878c0b

In this case, the master branch and the new branch will not be aligned, so git will complain if you try to get rid of the new branch. If you're sure you want to get rid of it anyway, you can use -D instead of -d:

  git branch -D branch-name-here

Note that there's no turning back, so be sure that you can actually get rid of the branch.

aggregating commits

When you're merging a set of changes from a branch (either merge-ing or cherry-pick-ing) you might want to aggregate them all under a single commit, i.e. you might not be interested in the whole history for the change (which might include lots of trial-and-error commits) but you would rather set a single, comprehensive log message.

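In this case, what to do depends on the commits you're interested in. If you want to do a merge from some branch-name-here, then you can use the --no-ff switch (which is explained perfectly here); it always records a merge commit carrying your message, while the individual commits remain reachable through the history of the branch: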

  git merge --no-ff branch-name-here -m 'Added frozzbuzz feature, yay!'

On the other hand, if you just want to cherry-pick some modifications, --no-commit will be of help:

  git cherry-pick --no-commit 1a9529d569
  git cherry-pick --no-commit c8878c0b
  git commit -a -m 'Added long-wanted frozzbuzz feature, yay!'

In cherry-pick this option can be abbreviated to -n. Beware that merge has a --no-commit option too, but it has no effect when the merge turns out to be a fast-forward (no merge commit is created in that case), so it might not do what one would expect; stick to --no-ff in the merge case.

[Edit] I know that there is another method that is more streamlined and cleaner, but I just don't remember where I saw it at the moment!

remote repositories

It's not unusual that I have to use more than one computer - I have two at work and at least another one at home. You could argue that using three laptops is a bit weird, but this is life.

One thing that I tend to do quite early is to duplicate a repository that I create locally on one of the three computers onto a remote server, in order to be able to work on it from the other computers when needed. For this, I found a blog post that is very useful; I'll try to mirror some of its contents here.

We're assuming that you already have your local repository local:/home/foo/project and you have ssh access to the remote server. First of all create a bare repository in the remote server:

remote:/home/bar$ mkdir project
remote:/home/bar$ cd project
remote:/home/bar/project$ git init --bare

Now you can perform other initialisation stuff, e.g. permissions, configurations, etc.

Back on the local computer:

local:/home/foo/project$ git remote add origin ssh://user@remote/home/bar/project
local:/home/foo/project$ git push origin master

Of course you can also push whatever other branch you're interested in replicating on the remote server. At this point you need to set the branch as tracking the remote one, which can be done in several ways but I always stick to the suggestion in the link above:

local:/home/foo/project$ git checkout origin/master
local:/home/foo/project$ git branch -f master origin/master
local:/home/foo/project$ git checkout master

Again, repeat the first two steps for whatever branch you need. Done!

Some time ago I read that there should also be some other method, but one is sufficient and I can't find where I read it!