Perl Archives

Graphics::Potrace

A few months ago I released Graphics::Potrace, which provides Perl bindings for the potrace library. So, if you want to convert raster images into vectors from Perl... you know where to go.

origami envelopes

I've always been fond of origami, and in some periods I also had time to fold some as a hobby. Alas, this is not the case any more... most of the time.

I'm also proud to produce my own greeting cards for birthdays and occasions, when I remember to actually make one (which happens seldom... but it happens). Some time ago I stumbled upon a neat design for an origami envelope - although I don't remember where I saw it, I've found a couple of web sites that include it (e.g. here). So... two of "my" things coming together...

Then I'm fond of Perl, of course. So why not kick it in and use it to add an image to the back of the envelope... automatically?

Parse::RecDescent and number of elements read on the fly

I recently had to develop a small parser for some coworkers and I turned to Parse::RecDescent to handle it. The grammar was not particularly difficult, but it had some fields that behave like arrays whose number of elements is declared on the fly, so it's not generally possible to use the repetition facilities provided by Parse::RecDescent, because they require the number of repetitions to be known when the grammar is written.
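One possible workaround - a sketch, and not necessarily the approach I ended up with - exploits the fact that Parse::RecDescent rules accept arguments (available inside the called rule as @arg), so a counted repetition can be expressed as a recursive rule that decrements its argument. The rule and item names below are made up:

```
record: count item_list[$item[1]]
   { $return = $item[2] }

count: /\d+/

item_list:
     <reject: $arg[0] == 0> item item_list[$arg[0] - 1]
   { $return = [ $item[2], @{$item[3]} ] }
   | <reject: $arg[0] != 0>
   { $return = [] }
```

Here <reject: ...> fails a production when its condition is true, so the first alternative only applies while the countdown is above zero and the second one terminates the recursion exactly at zero.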

Logging in Dancer

I don't remember whether I blogged about Dancer::Logger::Log4perl or not, but a recent post by Ovid on Dancer's mailing list made me think that it would fit his use case. Unfortunately it seems that some of my messages did not make it into the mailing list (I didn't find them in the archived thread, anyway), so I'm blogging it here for a wider audience to bother.

If I understood Ovid's needs correctly, he needs an additional logging level in Dancer that lets messages through whatever the log level. Semantic considerations about levels aside, the use case seemed a perfect fit for Log::Log4perl, because it has a wide range of logging levels and can send different logs to different places.

A possible proof-of-concept is the following example:

#!/usr/bin/env perl
use strict;
use warnings;

use Dancer;
use Log::Log4perl qw< :easy >;

setting log4perl => {
   tiny => 0,
   config => '
      log4perl.logger                      = DEBUG, OnFile, OnScreen
      log4perl.appender.OnFile             = Log::Log4perl::Appender::File
      log4perl.appender.OnFile.filename    = sample-debug.log
      log4perl.appender.OnFile.mode        = append
      log4perl.appender.OnFile.layout      = Log::Log4perl::Layout::PatternLayout
      log4perl.appender.OnFile.layout.ConversionPattern = [%d] [%5p] %m%n
      log4perl.appender.OnScreen           = Log::Log4perl::Appender::ScreenColoredLevels
      log4perl.appender.OnScreen.color.ERROR = bold red
      log4perl.appender.OnScreen.color.FATAL = bold red
      log4perl.appender.OnScreen.color.OFF   = bold green
      log4perl.appender.OnScreen.Threshold = ERROR
      log4perl.appender.OnScreen.layout    = Log::Log4perl::Layout::PatternLayout
      log4perl.appender.OnScreen.layout.ConversionPattern = [%d] >>> %m%n
   ',
};
setting logger => 'log4perl';

get '/' => sub {
   warning 'just a plain warning here';
   error 'just a plain error here';
   content_type 'text/plain';
   return "normal here\n";
};

get '/special' => sub {
   debug 'inside special...';
   ALWAYS 'this is a special BUSINESS-LOGIC message!';
   warning 'inside special...';
   content_type 'text/plain';
   return "special here\n";
};

dance();

Having the full power of Log::Log4perl's configuration capabilities allows sending all the log messages to a file, while keeping the most important ones on the screen. Additionally, the usage of the ALWAYS stealth logger - which sends messages at the OFF logging level - also allows being sure that they are - guess what? - ALWAYS emitted whatever the log level.

As a nice add-on, these messages (that are business-logic related) can be visually distinguished on the terminal from e.g. error messages: in the example, ERRORs are shown in red, while ALWAYS messages are shown in light green. This is an example of what is shown on the screen:

$ perl sample.pl -p 3333

Dancer 1.3093 server 32177 listening on http://0.0.0.0:3333
== Entering the development dance floor ...
[2012/03/28 19:08:03] >>> just a plain error here
[2012/03/28 19:08:04] >>> this is a special BUSINESS-LOGIC message!

and, of course, the file contains the full log:

[2012/03/28 19:23:18] [ INFO] loading Dancer::Handler::Standalone handler
[2012/03/28 19:23:18] [ INFO] loading handler 'Dancer::Handler::Standalone'
[2012/03/28 19:23:25] [ INFO] request: GET / from 127.0.0.1
[2012/03/28 19:23:25] [ INFO] Trying to match 'GET /' against /^\/$/ (generated from '/')
[2012/03/28 19:23:25] [ INFO]   --> got 1
[2012/03/28 19:23:25] [ WARN] just a plain warning here
[2012/03/28 19:23:25] [ERROR] just a plain error here
[2012/03/28 19:23:25] [ INFO] response: 200
[2012/03/28 19:23:28] [ INFO] request: GET /special from 127.0.0.1
[2012/03/28 19:23:28] [ INFO] Trying to match 'GET /special' against /^\/$/ (generated from '/')
[2012/03/28 19:23:28] [ INFO] Trying to match 'GET /special' against /^\/special$/ (generated from '/special')
[2012/03/28 19:23:28] [ INFO]   --> got 1
[2012/03/28 19:23:28] [DEBUG] inside special...
[2012/03/28 19:23:28] [  OFF] this is a special BUSINESS-LOGIC message!
[2012/03/28 19:23:28] [ WARN] inside special...
[2012/03/28 19:23:28] [ INFO] response: 200

So, if you want all the power of Log::Log4perl while playing with Dancer... be sure to check Dancer::Logger::Log4perl out!

Sets operations

To help some coworkers I whipped up a program to perform set operations in Perl. It's quite basic but it's been pretty effective so far and it's on github.

Sets are assumed to be files where each line is a distinct element. Duplicated lines are assumed to be either absent or safe to filter out. The inner working assumes that at a certain point the input files are sorted, and in general the external sort program is invoked automatically, which limits applicability on some platforms.

The three basic operations that are supported are union, intersection and difference.

# intersect two files, also with "intersect", "i",
# "I" (uppercase "i") and "^"
sets file1 & file2

# union of two files, also with "union", "u", "U",
# "v", "V" and "|"
sets file1 + file2

# subtraction of second file from first one, also
# with "minus", "less" and "\"
sets file1 - file2

Other operations, e.g. symmetric difference, can be obtained with a combination of the predefined ones. Operations can be grouped, and in general the expression is best provided as a single string, to keep the shell from creeping in:

# symmetric difference, alternative 1
sets '(file1 - file2) + (file2 - file1)'

# symmetric difference, how you probably saw it somewhere
sets '(file1 + file2) - (file1 & file2)'

Operations associate from left to right, so the first group above is not needed. Anyway I usually prefer to be explicit.

As anticipated, at some point the program needs to work with sorted input. The basic motivation for the program is handling operations on files with a few million elements, so loading everything into memory is not an option; on the other hand, sort is quite efficient and reinventing the wheel is not an option either!
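The reason sorted inputs make this tractable is the classic merge over sorted streams: walk both sides in lockstep, advancing whichever one compares lower, and emit elements according to the operation at hand. A minimal sketch of the intersection case, using arrays in place of sorted filehandles (the function and variable names are made up for illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sorted-merge intersection: memory usage stays constant regardless
# of input size, which is the point of leaning on sort(1) instead of
# loading everything into a hash.
sub intersection {
   my ($A, $B) = @_;   # array refs standing in for sorted streams
   my @out;
   my ($i, $j) = (0, 0);
   while ($i < @$A && $j < @$B) {
      my $cmp = $A->[$i] cmp $B->[$j];
      if    ($cmp < 0) { ++$i }               # only in A, skip
      elsif ($cmp > 0) { ++$j }               # only in B, skip
      else             { push @out, $A->[$i]; ++$i; ++$j }
   }
   return @out;
}

my @common = intersection([qw< bar baz foo >], [qw< baz foo quux >]);
print "@common\n";   # baz foo
```

Union and difference follow the same loop with different emit rules, which is why a single pass over sorted inputs suffices for all three operations.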

Sorting is usually handled automatically with a call to the external sort utility (with the -u option, because sets are assumed not to contain duplicates); anyway, this can be a time-consuming activity that is unnecessary if you already know that your inputs are sorted, so you can tell the program when this is the case:

sets -s sorted-file1 ^ sorted-file2

When sorting is performed, it is usually done on the fly without saving the intermediate sorted files. These can be useful for subsequent set operations, or when the same input is used multiple times (e.g. in the symmetric difference examples above), so it is possible to save the sorted files under the same name with a suffix appended:

sets -S .sorted '(file1 - file2) + (file2 - file1)'

If the sorted version of a file is found (i.e. file1.sorted and/or file2.sorted in the example above) it is used with no further sorting, speeding things up automatically.

Sometimes inputs might come from different platforms, so the line terminators would differ. In our case we don't need leading or trailing whitespace, so there is a trimming option to avoid problems:

sets -t file1-unix - file2-dos

If you think that it can be useful for you, it's possible to download a bundled version that does not need external modules installed anywhere: enjoy sets!

Dist::Zilla, Pod::Weaver and bin

I use Dist::Zilla for managing my distributions. It's awesome and useful, although it lacks some bits of documentation every now and then; this lack is compensated in other ways, e.g. IRC.

I also use the PodWeaver plugin to automatically generate boilerplate POD stuff in the modules. Some time ago I needed to add some programs to a distribution of mine (which I also managed to forget at the moment, but this is another story), and this is where I got hit by all the voodoo.

The first program was actually a Perl program, consisting of a minimal script to call the appropriate run() method in one of the modules of the distro:

$ cat bin/perl-program 
#!/usr/bin/env perl
use prova;
prova->run();

This led Dist::Zilla to complain like this:

$ dzil build
[DZ] beginning to build prova
[DZ] guessing dist's main_module is lib/prova.pm
couldn't determine document name for bin/perl-program at ...

which isn't the best advice in the world (it actually complains about Pod::Weaver), but let's ignore it for the moment. After a bit of googling - or whatever, I actually don't remember - I found that there were basically two options:

  • put an explicit package declaration in the driver program, like this:

    $ cat bin/perl-program 
    #!/usr/bin/env perl
    package prova;
    use prova;
    prova->run();
    
  • put a comment with a PODNAME:

    $ cat bin/perl-program 
    #!/usr/bin/env perl
    # PODNAME: prova
    use prova;
    prova->run();
    

The latter seems slightly less dumb so I opted for it. I say slightly because IMHO the bottom line should be that the name is equal to the filename, but this is (again) another story. Yes, I know, it's open source and I can propose patches - did I say I'm not complaining?

Anyway, I then needed to add a shell script to the lot:

$ cat bin/script.sh 
#!/bin/bash
echo 'Hello, World!'

and again the error popped up:

$ dzil build
[DZ] beginning to build prova
[DZ] guessing dist's main_module is lib/prova.pm
[PodWeaver] [@Default/Name] couldn't find abstract in bin/perl-program
couldn't determine document name for bin/script.sh at ...

Now, I could use the PODNAME trick above:

$ cat bin/script.sh
#!/bin/bash
# PODNAME: script.sh
echo 'Hello, World!'

but it turns out - with little surprise - that the file is considered a Perl one, with the consequence that POD gets added to it in the distribution:

#!/bin/bash
# PODNAME: script.sh
echo 'Hello, World!'

__END__
=pod

=head1 NAME

script.sh

=head1 VERSION

version 0.1.0

=head1 AUTHOR

Flavio Poletti <polettix@cpan.org>

=head1 COPYRIGHT AND LICENSE

blah blah blah...

=cut

The only thing is that - ehr... - bash does not like POD very much.

Google was not my friend in this case, but I found one in Dist::Zilla's IRC channel (which is #distzilla on irc.perl.org, by the way), which I think (hope) is Christopher J. Madsen, the original contributor of Dist::Zilla::Plugin::FileFinder::ByName.

He instructed me to use that module - which entered the core distribution as of version 4.300003 - to obtain what I was after: keep the PodWeaver plugin working on Perl stuff, while ignoring shell stuff. But, more importantly, he explained why.

Many plugins - including the PodWeaver one - rely upon the Finder role to do their work. This role is something that finds files for the plugin to use; in the case of PodWeaver, the default is to take whatever module will be installed and whatever is in the bin directory. OK, it can be a bit more complicated than this, but most of the time it's OK. This default is the same as if we had configured the following in the dist.ini file:

[PodWeaver]
finder = :InstallModules
finder = :ExecFiles

It turns out that if we explicitly configure a finder, all the defaults are wiped away - so, for example, if our bin directory contained shell scripts only, we could be happy with this:

[PodWeaver]
finder = :InstallModules

In our case, anyway, this would disable PodWeaver for Perl programs as well, which is not acceptable. This is where FileFinder::ByName kicks in:

[FileFinder::ByName / BinNotShell]
dir = bin
skip = .*\.sh$

This plugin lets us create new finders. In the example above, we are creating a finder that finds stuff in the bin directory, skipping all files that end with .sh (note that skip sets a regular expression). At this point, we're ready for the configuration of the PodWeaver plugin:

[PodWeaver]
finder = :InstallModules
finder = BinNotShell

You read that correctly: the new finder does not want the initial colon. Now, when it's time for the Pod::Weaver plugin to find files, it will get all the modules that will be installed AND the files in the bin directory whose names do not end with .sh. Yay!

Exclusive Perl Archive Nook

I started working on epan, a (somewhat thin) wrapper around cpanminus to create a version of CPAN trimmed down to your needs for installing specific stuff.

This is what the cool guys probably call DPAN these days, but I found that the whole concept of DarkPAN revolves around getting your private stuff into the "normal" Perl toolchain, while in this case I need to be able to easily install modules on machines that are out of Internet reach.

To start with an example, suppose you have to install Dancer and a couple of its plugins in a machine that - for good reasons - is not connected to the Internet. It's easy to get the distribution files for Dancer and the plugins... but what about the dependencies? It can easily become a nightmare, forcing you to go back and forth with new modules as soon as you discover the need to install them.

Thanks to cpanminus, this has become much easier: it can actually do what's needed with a single command:

      # on the machine connected to the Internet or to a minicpan
      $ cpanm -L xxx --scandeps --save-dists dists \
           Dancer Dancer::Plugin::FlashNote ...

which places all the modules in the subdirectory dists (thanks to option --save-dists) with an arrangement similar to what you would expect from a CPAN mirror. Alas, on the target machine you still have some work to do - e.g. you should collect the output from the cpanm invocation above to figure out the order in which to install the distribution files.

Additionally, the directory structure that is generated lacks a proper index file (located in modules/02packages.details.txt.gz), so it would be difficult to use the normal toolchain.
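For the curious, that index is just a gzipped text file with a small header followed by one "package version path" line per module. A hypothetical sketch of generating a minimal one with core modules only - the module entry and its author path below are made up:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use IO::Compress::Gzip qw< gzip $GzipError >;
use File::Path qw< make_path >;

# One "package  version  path" line per indexed module; the entry
# below is entirely hypothetical, just to show the shape.
my @entries = (
   'Acme::Example  0.01  A/AU/AUTHOR/Acme-Example-0.01.tar.gz',
);

# Header fields follow the layout used on CPAN mirrors
my $index = join "\n",
   'File:         02packages.details.txt',
   'Description:  Package names found in directory $CPAN/authors/id/',
   'Columns:      package name, version, path',
   'Line-Count:   ' . scalar(@entries),
   'Last-Updated: ' . scalar(gmtime) . ' GMT',
   '',            # blank line separates header from body
   @entries, '';

make_path('modules');
gzip \$index => 'modules/02packages.details.txt.gz'
   or die "gzip failed: $GzipError";
```

With such a file in place, tools that expect a CPAN-like directory have the index they look for.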

epan aims to fill up the last mile and get the job done, providing you with a subdirectory that is ready for deployment, with all the bits in place to push automation as far as possible. So you can do this:

      # on the machine connected to the Internet or to a minicpan
      $ epan create Dancer Dancer::Plugin::FlashNote ...
      $ tar cvzf epan.tar.gz epan

transfer epan.tar.gz to the target machine and...

      # on the target machine
      $ tar xvzf epan.tar.gz
      $ cd epan
      $ ./install.sh

optionally providing an installation target directory:

      $ ./install.sh /path/to/local/perl

The epan directory that is generated should be compatible with other tools - in particular, the modules/02packages.details.txt.gz file is generated, so the whole toolchain (including cpan) should play nicely with it.

DotCloud::Environment

I'm in the process of releasing DotCloud::Environment, a module that should ease the developer's life by providing a unified entry point to dotCloud's configuration for an application.

A typical problem I had while playing with dotCloud was that I could easily deploy an application, but I had no simple way to set up a basic test environment on my development machine. This is unfortunate, because it shifts all testing onto the deployed infrastructure.

When you create an application on dotCloud, you're probably going to have some services that resolve to code you have to write, and others that resolve to data you're going to populate or use. The link that allows a code service to access a data service is the file /home/dotcloud/environment.json (or its YAML equivalent /home/dotcloud/environment.yml), so you know where to look when you are in the deployed environment.

What happens when you are in your development environment? You're basically on your own: you have to figure out that you're not on dotCloud, find somewhere to put the configuration, etc. In a few words: boring stuff that you don't need.

DotCloud::Environment's goal is to streamline and simplify the process, providing a unified interface to access dotCloud's configuration in whatever environment you are. It has a default mode that should fit the typical case, while leaving room for customisation when your setup needs it.

A Typical Directory Layout

(for some definition of typical, of course)

In order to keep your code clean, you will probably divide it depending on the functional block that will be deployed as a service on dotCloud. Suppose that you have a frontend service, a backend service and a database; you would probably have the following directory layout:

project
+- dotcloud.yml
+- backend
|  | ...
|  +- lib
|     +- Backend.pm
+- frontend
|  | ...
|  +- lib
|     +- FrontEnd.pm
+- lib
   +- Shared.pm

Each service is put into a separate directory and all the code that they both use (e.g. functions to connect to databases) is put in a common lib directory.

Where Should I Put My Local Configuration?

The main goal is to find the right environment.json (or, equivalently, environment.yml) depending on the environment you are in. If you are on dotCloud there is actually no problem, because the right /home/dotcloud/environment.json file is selected by default; for local development the best thing to do is to put the configuration file in the project's root directory, which becomes like this:

project
+- dotcloud.yml
+- backend
|  | ... 
|  +- lib
|     +- Backend.pm
+- frontend
|  | ... 
|  +- lib
|     +- FrontEnd.pm
+- lib
|  +- Shared.pm
|     
+- environment.json

Putting the file in that position lets DotCloud::Environment find it by default whenever no /home/dotcloud/environment.json file (or the equivalent YAML file) is found on the system - which is hopefully the case in your development environment.

Of course you should customise this environment.json/environment.yml file to suit your needs in the development environment, following the same rules that dotCloud uses to generate it. It's quite straightforward so you should not have problems with this.
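Purely as a reference for the shape, a development environment.json might look something like the following. The flat, uppercase key naming mirrors dotCloud's convention as far as I recall it, and all the values are made up - double-check against a file generated by a real deployment:

```json
{
   "DOTCLOUD_PROJECT": "project",
   "DOTCLOUD_ENVIRONMENT": "default",
   "DOTCLOUD_SQLDB_MYSQL_HOST": "localhost",
   "DOTCLOUD_SQLDB_MYSQL_PORT": "3306",
   "DOTCLOUD_SQLDB_MYSQL_LOGIN": "root",
   "DOTCLOUD_SQLDB_MYSQL_PASSWORD": "secret",
   "DOTCLOUD_REDISDB_REDIS_HOST": "localhost",
   "DOTCLOUD_REDISDB_REDIS_PORT": "6379",
   "DOTCLOUD_REDISDB_REDIS_PASSWORD": "secret"
}
```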

And Now?

Now you're ready to use DotCloud::Environment!

In your code services you will probably need to access the shared library only. DotCloud::Environment helps you find the directory to provide to use lib via the path_for function:

# -- in BackEnd.pm and FrontEnd.pm --
use DotCloud::Environment 'path_for';
use lib path_for('lib');
use Shared ...;

On the other hand, in your shared library you will probably need to access the actual configuration for the environment you are in. The most straightforward way to do this is via the dotvars function, which provides you with a hash (or an anonymous hash, on request) containing the relevant configuration for the service you name:

# -- in Shared.pm --
use DotCloud::Environment 'dotvars';

# ... when you need it... 
my $vars = dotvars('service-name');

For example, suppose that you want to implement a function to connect to a Redis service called redisdb and a function to connect to a MySQL service called sqldb:

use DotCloud::Environment 'dotvars';
sub get_redis {
   my $vars = dotvars('redisdb');  # getting an anonymous hash

   require Redis;
   my $redis = Redis->new(server => "$vars->{host}:$vars->{port}");
   $redis->auth($vars->{password});
   return $redis;
}
sub get_sqldb {
   my %vars = dotvars('sqldb');   # getting a hash
   my ($host, $port, $user, $pass)
         = @vars{qw< host port login password >};

   require DBI;
   my $dbh = DBI->connect("dbi:mysql:host=$host;port=$port;database=db",
                          $user, $pass, {RaiseError => 1});
   return $dbh;
}

Of course you can use dotvars directly in FrontEnd.pm and BackEnd.pm, but you will probably benefit from refactoring your common code to avoid duplications.

Minor Customisations

If you don't like how dotvars (or any other function in the functional interface) is named, you can take advantage of the fact that Sub::Exporter is used behind the scenes, which lets you do this:

use DotCloud::Environment
   dotvars => { -as => 'dotcloud_variables_for' };
my %vars = dotcloud_variables_for('my-service');

Major Customisations

DotCloud::Environment exposes a wider interface than just path_for and dotvars, of course, and it does not strictly require you to put the development configuration file in the suggested position. You're encouraged to take a look at the full documentation; in case you want to keep your file in some fixed position, you can point to it via the environment variable DOTCLOUD_ENVIRONMENT.

Conclusions

Do you like dotCloud? I do, and now I think that I'll find it a bit easier to develop stuff that is (dot)Cloud-ready (to use some buzz-expression).

Quick note on using module JSON

This Unicode stuff tried to drive me crazy; hopefully I'll record something useful here, because the docs are a bit too intricate to understand at a first pass.

The underlying assumption is that data is "moved" using the utf8 encoding, i.e. files and/or stuff transmitted over the network contain data that is utf8-encoded. This boils down to the fact that some characters will be represented by two or more bytes if you look at the raw byte stream.

There are two ways you can obtain such a stream in Perl, depending on what you want to handle at your interface: strings of bytes or strings of characters. It's easy to see which of the two a string is, because it suffices to call utf8::is_utf8:

$is_string_of_characters = utf8::is_utf8($string);

The name is not a very happy one, in my opinion, because it only tells you whether Perl has activated its internal utf8 representation; I would have preferred something called is_characters or the like.

There's a small corner case in which the test above is false but you are nevertheless dealing with a string of characters: plain ASCII, where the string of bytes and the string of characters are exactly the same. But you can see that this is not a problem.

Independently of how you deal with things at the interface, anyway, we will always assume that you use strings of characters inside your program, i.e. if your string contains accented characters (for example) the test above will be true.
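The bytes/characters distinction can be checked with a tiny self-contained script (Encode has been in core for a long time); "café" makes a handy specimen because its 'é' takes two bytes in utf8:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw< decode >;

# "café" as raw utf8 bytes: 'é' is the two bytes 0xC3 0xA9
my $bytes = "caf\xC3\xA9";
print utf8::is_utf8($bytes) ? "characters\n" : "bytes\n";   # bytes

# decode() turns it into a string of characters: one single 'é'
my $chars = decode('utf8', $bytes);
print length($bytes), ' vs ', length($chars), "\n";         # 5 vs 4
print utf8::is_utf8($chars) ? "characters\n" : "bytes\n";   # characters
```

Same data, two representations: length() counting bytes on one side and characters on the other is the quickest way to see the difference.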

String of Bytes

If you have a string of bytes that represents valid utf8, your data is already in the right shape to be transmitted and/or saved without further transformations. In this case, to save the file you set the handle to raw mode, so as to rule out any additional transformation:

binmode $outfh, ':raw';
print {$outfh} $string_of_bytes;

The same applies for stuff that you want to read, of course:

binmode $infh, ':raw';
$string_of_bytes = do { local $/; <$infh> }; # poor man's slurp

String of Characters

If you have a string of characters, either you move to the string-of-bytes representation using Encode::encode and revert to the previous case:

my $string_of_bytes = encode('utf8', $string_of_characters);

or you tell your interface to do this transformation for you by setting up the proper encoding:

binmode $outfh, ':encoding(utf8)';
print {$outfh} $string_of_characters;

The same goes for your inputs:

binmode $infh, ':encoding(utf8)';
$string_of_characters = do { local $/; <$infh> };

and, of course, if you want your string of characters from a raw bytes representation you can use Encode::decode:

my $string_of_characters = decode('utf8', $string_of_bytes);

The funny thing is that Encode::decode lets you decide which encoding you start from - not only utf8 - but always under the assumption that you start from the raw representation. The same applies to Encode::encode, of course; anyway, in my opinion new stuff should stick to utf8, so I'll not dig into this further.

What About JSON?

Now that we have set the baseline:

  • all internal stuff will be using Perl's internal Unicode support, which means strings of characters, which means that scalars containing stuff outside ASCII will have their flag set;
  • communications with the external world will be done using the utf8 encoding;

we can finally move on to using the JSON module properly. We have to cope with four different cases, depending on the following factors:

  • format: is the JSON representation a string of bytes or a string of characters?
  • direction: are we converting from JSON to Perl or vice-versa?

We'll never say this enough: in all cases, all the string scalars in the Perl data structure will always be strings of characters (which may be flagged by utf8::is_utf8 or not, depending on whether they contain data outside ASCII, as already noted).

String of Bytes

If you're playing with raw bytes on the JSON side, decode_json and encode_json are the functions for you:

$data_structure = decode_json($json_as_string_of_bytes);
$json_as_string_of_bytes = encode_json($data_structure);

String of Characters

On the other hand, if your JSON scalar is (or has to be) a string of characters, you have to use from_json and to_json:

$data_structure = from_json($json_as_string_of_characters);
$json_as_string_of_characters = to_json($data_structure);

Summary

A code fragment is worth a billion words. For stuff that you read:

$json = read_in_some_consistent_way();
$data_structure = utf8::is_utf8($json)
   ? from_json($json)
   : decode_json($json);

For stuff that you have to write:

$json = want_characters()
   ? to_json($data_structure)
   : encode_json($data_structure);

As a final note, if you want to be precise in your new projects you should always stick to using the utf8 Perl IO layer, in order to properly enforce checks on your inputs and forget about encoding issues in your outputs. This means of course that you end up using from_json/to_json.

To Depend Or Not To Depend

When starting an application, I always strive to reduce dependencies as much as possible, especially if I know there's some alternative route that does not take me too much time.

It turns out that I chose one of the worst possible examples. Check out the comments below and - more importantly - check out lib documentation!

This happens when I use lib, for example to add ../lib to @INC. For my own stuff I usually head towards Path::Class:

use Path::Class qw( file );
use lib file(__FILE__)->parent()->parent()->subdir('lib')->stringify();

but when I want to reduce dependencies I revert to File::Spec::Functions (which is CORE since perl 5.5.4 and now part of PathTools):

use File::Spec::Functions qw( splitpath splitdir catdir catpath );
my $path;
BEGIN {
   my ($volume, $directories) = splitpath(__FILE__);
   my @dirs = splitdir($directories);
   push @dirs, qw( .. lib ); # ../lib - naive but effective
   $path = catpath($volume, catdir(@dirs), '');
}
use lib $path;

Alas, how I miss the Path::Class solution!
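For the record, there is a third core-only route, somewhere between the two above in terms of typing: FindBin, which figures out the directory containing the running script. A sketch:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# FindBin (core for ages) exposes the script's directory in
# $FindBin::Bin, so reaching ../lib needs no path dissection at all
use FindBin;
use lib "$FindBin::Bin/../lib";

print "added to \@INC: $FindBin::Bin/../lib\n";
```

Note that FindBin works from the script's location, while the __FILE__-based versions above work from the current file's, so they are not interchangeable when the code lives in a module rather than in the script itself.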

About Flavio Poletti

I blog about Perl.