YAML vs INI (Again) and the plan for yet another INI module

For the past several years, I'd been set on YAML as the format for configuration files. It's human-readable, pretty, portable, and supports arbitrary data structures. But for future projects, I'm planning to use the INI format. Why?

First of all, YAML is "too complex" for users. There are subtle syntax rules, like the requirement for the list item character (-) or the mapping character (:) to be followed by a space. And then there are special literals like Yes/No/true/false, ~, date/time, etc. And the various ways to write multi-line strings (heredocs). It would take at the very least an hour to explain the syntax to first timers, and days to familiarize with it.

Second, and this is more important for me, I have not found any round-trip parsers/emitters for YAML. You can't modify data without reformatting the whole file and removing all comments.

INI doesn't have these problems. It's pretty readable, it's familiar to most users including Windows users, and it's easy to write a round-trip parser for it. But INI has a different set of problems. One, there is no standard/specification. Each implementation differs in some ways and has different features. And two, it is limited in the data structures it can encode.

This brings me to a plan for an INI reader/writer module that I can use for my projects. It needs the following features, and so far I've not found one on CPAN which satisfies these requirements. (But maybe I can patch an existing one instead of starting from scratch.)

  • Round-trip. It needs to preserve formatting, including comments, blank lines, indentation, spacing around the "=", and so on.

  • Support storing deep arrays and hashes.

  • Support arbitrary section names, property names, & property values (e.g. property name containing "=", section name containing "]" or newlines).
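
To make the round-trip requirement concrete, here is a minimal sketch in Python (used only for illustration; it is not the planned module). The class name RoundTripIni and its methods are hypothetical. It assumes simple unquoted "name = value" lines, keeps every original line as-is, and rewrites only the value part of the one line being changed:

```python
import re

# A simple "name = value" line. The whitespace around "=" is captured so it
# can be written back unchanged. Quoting is deliberately not handled.
PROP_RE = re.compile(r'^(\s*)([^=;#\[\s][^=]*?)(\s*=\s*)(.*?)(\s*)$')
SECTION_RE = re.compile(r'^\s*\[([^\]]*)\]\s*$')


class RoundTripIni:
    """Hypothetical line-preserving INI document (illustration only)."""

    def __init__(self, text):
        # Keep every original line, including comments and blank lines.
        self.lines = text.split('\n')

    def set(self, section, name, value):
        """Replace the value of an existing property, touching only that line."""
        current = None  # properties before any section header have no section
        for i, line in enumerate(self.lines):
            m = SECTION_RE.match(line)
            if m:
                current = m.group(1)
                continue
            m = PROP_RE.match(line)
            if m and current == section and m.group(2) == name:
                indent, key, eq, _old, trail = m.groups()
                self.lines[i] = f'{indent}{key}{eq}{value}{trail}'
                return True
        return False  # property not found; this sketch does not append

    def dump(self):
        # Unmodified lines come back byte-for-byte identical.
        return '\n'.join(self.lines)
```

Loading a file and calling dump() without any set() returns the input unchanged, which is the property a round-trip module has to guarantee before anything else.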

So far here's the specification for the INI format:

  • UTF-8?

  • Comments are only allowed at the beginning of a line (with or without indent) and are not allowed after a property value.

  • Properties outside any section are allowed (since they are common) and are assumed to belong to a default section (configurable).

  • Quoting using double quotes (") is allowed in section name, property name, and property value. All problematic characters should be escaped. Example:

    ; a section with empty name
    [""]
    "property name containing = and \"" = value
    property2 = "value\n\0"
    
  • Whitespace before a property or section name, or before/after the equal sign, is allowed and ignored. To include such whitespace, use quoting.

  • Duplicate sections are allowed. Their properties are merged, with later sections taking precedence.

  • Deep hashes/arrays should be represented by a section name containing multiple path elements:

    ["hash" "subhash" 0]
    name=value
    
  • Arrays are represented by duplicate property names:

    names=John
    names=Paul
    names=James
    
  • To differentiate between a string and an array of a single element, use a comment:

    A string, "val".

    name=val
    

    An array of one element: ["val"].

    ; array=1
    name=val
    
  • To differentiate between an empty string and an array of zero elements, use a comment:

    An empty string, "".

    name=
    

    An array of zero elements: [].

    ; array=0
    name=
    
  • To specify an empty hash, use a section:

    ["hash"]
    
  • To specify a null value and differentiate it from an empty string, use a comment:

    ; null
    name=
    

This way, INI files can contain arbitrary data structures, just like JSON, except that the basic data structure must be a hash of hashes (HoH).
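
As a rough illustration of how a reader could apply the array, null and empty-hash rules above, here is a minimal sketch in Python (again only for illustration, not the planned module). The function name parse_ini and the default section name GLOBAL are assumptions of mine. Quoting, escapes and multi-element section names are not handled, and only ";" comments are recognized:

```python
import re


def parse_ini(text, default_section='GLOBAL'):
    """Parse a simplified subset of the format into a dict of dicts.

    Assumptions: the default section is called 'GLOBAL', section names are
    taken literally, and quoting/escapes are not supported.
    """
    data = {}
    section = default_section
    seen = set()  # property names seen in the current section block
    hint = None   # pending "; array=N" or "; null" for the next property
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(';'):
            directive = line[1:].strip()
            if directive == 'null' or re.fullmatch(r'array=\d+', directive):
                hint = directive
            continue  # ordinary comments are simply skipped
        if line.startswith('[') and line.endswith(']'):
            section = line[1:-1].strip()
            data.setdefault(section, {})  # an empty section is an empty hash
            seen = set()
            hint = None
            continue
        name, sep, value = line.partition('=')
        if not sep:
            raise ValueError(f'not a property line: {raw!r}')
        name, value = name.strip(), value.strip()
        props = data.setdefault(section, {})
        if hint == 'null':
            value = None
        elif hint == 'array=0':
            value = []
        elif hint is not None:  # "; array=1" and up: start a list
            value = [value]
        hint = None
        if name in seen:
            # Duplicate name within the same section block: build an array.
            old = props[name]
            props[name] = (old if isinstance(old, list) else [old]) + [value]
        else:
            # First occurrence, or an override from a later duplicate section.
            props[name] = value
            seen.add(name)
    return data
```

With this sketch, three "names=" lines in one section yield a three-element list, "; array=0" followed by "name=" yields an empty list, "; null" followed by "name=" yields None, and a property repeated in a later duplicate section replaces the earlier value. Note that this reader throws comments away, so it shows only the data model, not the round-trip behaviour.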

UPDATE 2011-09-30 02:22 UTC: Thanks for all the comments and suggestions, but note that what I need is a human-readable/-editable format AND a parser/emitter which preserves comments and formatting. I put a lot of comments in my config files, and they are as valuable as the configuration itself.

UPDATE 2011-11-04: My implementation is Config::Ini::OnDrugs. It's still very early and incomplete, but you can see the updated specification there.

28 Comments

No. Just no. Use Config::General.

Generally, I store the configuration in a *.pm file. Any maintenance programmer should know enough Perl to change it. If the user has to change it, I create a (G)UI for them.

Storing the configuration in a different format means you have to include a parser and a writer for it; a wasted effort if only programmers are going to see it.

The project I spend most of my time on, WebGUI, uses JSON for configuration. There's a Config::JSON module for easy reading and manipulation, and most developers nowadays understand JavaScript.

Hi Steven

"It would take at the very least an hour to explain the syntax to first timers, and days to familiarize with it.". Yep. That's what killed YAML - A /beginner/ can't just sit down and type it.

I, and I suspect, many others, have returned to the INI file style.

"* Support arbitrary section names, property names, & property values (e.g. property name containing "=", section name containing "]" or newlines).". Nope. This is offering pathological flexibility.

"* Arrays are represented by duplicate property names:". Why? Because someone else did it that way, IIRC?

Why not:

name=[one
two
three]

with the '[' and ']' also possible on a separate line.

This more-or-less eliminates the YAML-like, and ridiculous, complexity of your rules about distinguishing between a scalar and an array.

Hence: An empty string:
name=
and an empty array:
name=[]
and a null
name=\0
and an array containing a null:
name=[\0]

Really, it's not that difficult :-)).

What I miss in INI parsers, and you don't mention, is nested sections. So what I use is a global section containing the name of the section to process. E.g.:

[global]

# host:
# o Specifies which section to use after the [global] section ends.
# o Values are one of localhost || webhost.
# o Values are case-sensitive.
#
# Warning:
# o This file is processed by Config::Tiny.
# o See App::Office::Contacts::Util::Config.
# o So, do not put comments at the ends of lines.
# o 'key=value # A comment' sets key to 'value # A comment' :-(.

host=localhost

[localhost]

# Template stuff
# --------------
# This is a disk path.

tmpl_path=/dev/shm/html/assets/templates/app/office/contacts

# CSS stuff
# ---------
# This is a URL.

css_url=/assets/css/app/office/contacts

# Javascript stuff
# ----------------
# This is a URL.

yui_url=/assets/js/yui

[webhost]

# TBA.


Cheers
Ron

For me the one thing all formats beyond pure Perl code are missing is the ability to define variables:

    $WEB_ROOT= '/var/www';
    $CSS= "$WEB_ROOT/styles";

Whether it's INI, JSON or YAML, I haven't found a way to do this properly. I actually patched an old version of YAML.pm to allow for this, but it was very limited, with a simple regexp that replaced $var by the value associated with the var key:

    ---
    bar: $foo/bar
    foo: bar

Which doesn't look that good anyway as YAML outputs keys in alphanumeric order.

Also for complex data structures you then have to manage scope.

I wonder if there is a module that would do this, or if it would be worth writing one. And if this could be applied to a nicer format than YAML (I am not too familiar with INI).

A project of mine resolves a similar issue. Each top-level key in the config file is a variable name. Instead of declaring variables, as you propose, it has a list of abbreviations that get substituted, even in the abbreviations list. This allows the following config to be properly handled:


abbreviations:
  24-stereo: s24_le,2,frequency,i
  frequency: 44100
devices:
  jack:
    signal_format: f32_le,N,frequency
record_format: 24-stereo

A list of variables is used to determine that 'record_format' is a scalar and 'devices' is a hash and that both are legal keys.

Not finding anything appropriate on CPAN, I ended up writing my own code for assigning a list of variables from the deserialized reference. (Surprising, as I would think this requirement to be common.)

I thought Shlomi stated somewhere that Config::IniFiles preserved comments? Might want to check with him. He might be able to add some bits that you need.

Do none of the thousand config file packages preserve formatting? I thought that Config::IniFiles had a switch that allowed that.

Hi Steven

I meant sections entirely nested within sections.

I realize this adds complexity, which I'm uneasy about, but I feel this corresponds to real world usage enough to be appropriate.

The other way to emulate it is by naming sections:
[outer]
...
[outer.inner]
...

But you've still got the problem of how to treat the /next/ section: Is it at the level of [outer] or is it nested?

Data-only perl config isn't actually that hard for non-programmers to understand. It is basically JSON with '=>' instead of ':'. I've used it for years and had no complaints from our customer deployment support crew.

These days I tend to use JSON for files that aren't developer-specific since so many people are familiar with it. Disallow /**/ comments and transliteration from JSON to perl is a doddle - slurp in your json and translate : to =>, // to #, true to 1 etc. 'eval' your string (Ignore the PBP police on this one) and life is good.

True - utf8 presents more of a challenge. Personally I've not had to deal with utf8 config files.

The only worthwhile design constraint I see here is round-tripping. But even the very desire to write the configuration file is somewhat telling. It is possible that you merely want to write GUI to assist in configuration, of course, or to remember the state of a few checkboxes – and that is fine. But it could also mean you are trying to use the configuration file as a persistent store for some internal state data structures, in which case of course you want a capable format so that your code won’t have to perform mapping – it’s much better for you if the user has to!

The right thing to do for configuration is use vanilla INI and live with its limitations. Sometimes that means spending some time coming up with a way to express a configuration using just the means of INI, but that is time well spent. It will make your configuration easier to use.

To persist internal state too complex to fit a configuration format, you should really be serialising it in some other format, not writing it into your configuration.

Configuration is not state serialisation.

Why be so prescriptive about "the" right thing to do? Would trying to configure something as sophisticated as apache, e.g. multiple virtual hosts & mod_rewrite rules, with ini-style notation, really be time well spent?

I'd say it would be pretty hard to nail down exactly what configuration is. You could argue the perl interpreter is infinitely configurable - its run-time behaviour can be completely determined by configuration files, usually referred to as 'scripts'.

If you just want a programmatic means to update values in a config file you may not need anything more sophisticated than Perl one-liners + (bash|your shell of choice). It's all just text after all.

I am prescriptive because I’ve seen what works and what doesn’t.

mod_rewrite is an awful programming language masquerading as configuration. Have you seen its documentation? You get to branch and loop by using hard to write syntax. There are also variables, though you can only assign and use them in very limited ways. If you want the user to write code, which for this use case he has to do for any need beyond the trivial, then use a programming language. Don’t try to gin up a hackneyed mini-language within the confines of a configuration file syntax, all the while ignoring all lessons of language design. The result will be painful for you and painful for your users.

When I wrote Plack::Middleware::Rewrite I was pleasantly surprised at how easy it was to offer all the more advanced features of mod_rewrite with almost no effort (except the proxying features), and yet still the “configuration” is more readable and far more easily writeable, esp. in expressing more complicated intents.

"You could argue the perl interpreter is infinitely configurable – its run-time behaviour can be completely determined by configuration files, usually referred to as “scripts”."

Answered like an architecture astronaut. Yes, you can nitpick all distinctions out of existence. Control flow is nothing but GOTOs and IFs too. Yet I doubt that you think in terms of IF and GOTO when you structure your programs. Why not? All you achieve by nitpicking like this is to lose all sense of how to think of things.

Configuration is not program state. Sure, there’s a thin strip of grey area in the middle, but don’t let that little grey engulf the large expanses of black and white to either side.

I *think* you have described the standard perl JSON module:
http://search.cpan.org/~makamaka/JSON-2.53/lib/JSON.pm

...give or take the choice of formatting (i.e. JSON style vs INI style)? It claims to round-trip the file?

Also JSON is reasonably easy for beginners to write and is approximately (give or take an argument) a subset of YAML.

Apache config style is also nice - I don't know if there is a round trip config library for that?

Why not invest some time looking at adding round trip of comments to an appropriate library that is already close rather than starting again?

@Steven Haryanto: I'll be interested to see how you resolve the tradeoff between capability and simplicity for the end user. For me, the laziness factor also figures in: I'd rather let my users write Perl code than come up with a whole new mini-language.

About Steven Haryanto

A programmer (mostly Perl 5 nowadays).