February 2014 Archives

Stupid Lucene Tricks: Storing Non-Documents

Lucene's search capabilities are so powerful that it is tempting to store more than documents -- and that is OK. Here are some hints to make storing non-documents easier:

  • Do you want to allow phrase searches on your fields? Phrase searches have a drawback when you keep the synonyms for a field value in that same field for ease of searching (which may well be the right strategy for the Lucene default field). For example, if you are indexing information about sugar beets, you could end up with many synonyms about the "sugariness" of the beets when you care mostly about …

1-line Endianness Detection in the C Preprocessor

Yeah, it's evil (or at least chaotic), but...

Go see 1-line Endianness Detection in the C Preprocessor.

(As someone who had to write a C preprocessor (we needed a consistent preprocessor across several architectures), I appreciate this trick.)

Xerces-C++ for Validating Against Multiple Schemas

Xerces-C++ is Apache's C++ implementation of the Xerces XML parser. It turns out that it ships with a simple example program, stdinparse, that can validate your XML (which many tools do) against multiple schemas simultaneously (which few Open Source tools do).

A sample command line could be:

$ ./stdinparse -n -s < /tmp/10.5072__FK250925-xerces.xml

POE::Session object_states: handlers are sub names not CODEREFs

This works as expected:

    sub _poll_start {
        my $self = $_[OBJECT];
        'object_states' => [
            $self => {
                '_start' => '_poll_start',
                'Work'   => '_poll_work',
                '_stop'  => '_poll_stop',

This, on the other hand, calls the handlers but without filling the @_ array:

    sub _poll_start {
        my $self = $_[OBJECT];    # $self will be undef
        'object_states' => [

pmtools v2.0.0 - Now with pmtools::new_pod_iterator()!

pmtools (Perl Module Tools) v2.0.0 has been unleashed upon the unsuspecting world.

v2.0.0 accommodates the case where the POD (.pod) file is separate from the module (.pm) file. (I gather that this is the case in upcoming Debian.) As I had to modify both pman and podpath for this change, it was easier to just push that functionality into an iterator-generator routine in the pmtools module itself (the first time the pmtools module has contained any useful code). I only have 1 data point (my con…

Set::Jaccard::SimilarityCoefficient v0.5.1

Set::Jaccard::SimilarityCoefficient lets you calculate the Jaccard Similarity Coefficient for either arrayrefs or Set::Scalar objects.

Briefly, the Jaccard Similarity Coefficient is a simple measure of how similar 2 sets are. The calculation is (in pseudo-code):

count(intersection(SET-A, SET-B)) / count(union(SET-A, SET-B))
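
For readers who want to try the calculation, here is a minimal sketch in Python (not the module's Perl API). It uses the standard definition of the coefficient: the size of the intersection divided by the size of the union. The function name `jaccard_similarity` is made up for illustration, and raising an error for two empty sets is an assumption of this sketch, not a statement about what the module does.

```python
def jaccard_similarity(set_a, set_b):
    """Return |A intersect B| / |A union B| for two iterables of hashable items."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        # Assumption of this sketch: the coefficient is undefined
        # when both sets are empty, so refuse to divide by zero.
        raise ValueError("both sets are empty")
    return len(a & b) / len(union)


# Two of the four distinct items are shared, so this prints 0.5.
print(jaccard_similarity(["beet", "sugar", "root"], ["beet", "sugar", "cane"]))
```

Identical sets score 1.0 and disjoint sets score 0.0.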

There is a Jaccard Similarity Coefficient routine already in CPAN, but it is specialized for use by Text::NS…

About Mark Leighton Fisher

Perl/CPAN user since 1992.