A deep dive into the Perl type systems

People usually don't think about Perl's type system. Some would even mistakenly claim it doesn't have one. It is, however, a most unusual one that doesn't really look like anything else.

What is a type anyway? And what is a type system? I'm not going to precisely define it here, that's for academics, but generally speaking a type is a fundamental property of a variable or value that determines what operations can and can not be done with it and what invariants it must hold. In a strong type system it is a stable trait: it can't change over the lifetime of the value/variable.

In some type systems containers are typed (such as in C) and values don't really exist separately from containers. In other type systems containers are typeless but values are typed (e.g. Python, JavaScript, …). There are also languages where both values and containers are typed (e.g. Java, C#); typically this means that the container constrains the values in it.

Contrary to what you might expect, Perl has the latter sort of type system, but with a twist.

The Perl static type system

When you got up this morning you might not have expected to read that Perl is a statically typed language, but here we are. To be precise it is a closed, strong and statically typed language. There are exactly 7 types, all variables/values are exactly one of them, and they can never change into another type. Though only 5 of them are directly accessible so one could argue there are only 5 true types.

Scalar

Typically identified by the sigil $, this is the most famous type as it does most of the work. Almost all operations in the language work on scalars, which provides enormous flexibility.

Arrays

Arrays are identified by the @ sigil. They're an ordered series of scalars, and support various operations typical of arrays such as indexing using the [] postfix operator, push, pop, splice, … that all do what anyone would expect from them.
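The array operations mentioned above can be sketched as follows. This is only an illustration; the variable names are made up for the example.

```perl
use strict;
use warnings;

my @queue = (10, 20, 30);

push @queue, 40;                    # append to the end
my $last  = pop @queue;             # remove from the end again
my $first = $queue[0];              # index with the postfix [] operator

# splice removes (and optionally replaces) a run of elements in place
my @removed = splice @queue, 1, 1;

print "first=$first last=$last removed=@removed rest=@queue\n";
```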

Hashes

Hashes (also known as maps, dictionaries or associative arrays in other languages) are identified by the % sigil. They're an unordered mapping of strings to scalars. Their values can be accessed using the {} postfix operator, and they support the expected operations for a mapping: keys, values, delete, exists, etc. Like arrays they work more or less the same as they do in other dynamic languages.
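A small sketch of the hash operations named above, with illustrative names and data:

```perl
use strict;
use warnings;

my %age = (alice => 31, bob => 27);

$age{carol} = 45;                          # store through the postfix {} operator
my $had_bob = exists $age{bob} ? 1 : 0;    # is the key present?
delete $age{bob};                          # remove a key and its value

# keys come back in no particular order, so sort them for stable output
my @names = sort keys %age;

print "had_bob=$had_bob names=@names\n";
```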

Subs

Also known as functions. They have a sigil (&), but it is omitted when defining them (sub foo {}) and usually also when calling them (foo(1) versus &foo(1)). You can't really manipulate a sub as a value directly; the only thing you can do with it is call it. Instead you will usually encounter these values as a reference to a sub.
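To make the calling conventions concrete, here is a minimal sketch; the sub name double is invented for the example.

```perl
use strict;
use warnings;

sub double { return 2 * $_[0] }

my $plain   = double(21);     # the usual call, sigil omitted
my $sigiled = &double(21);    # the same call written with the & sigil
my $ref     = \&double;       # a reference to the sub
my $via_ref = $ref->(21);     # calling through the reference

print "$plain $sigiled $via_ref\n";
```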

Globs

The last type to actually have a sigil (*) is the glob. This is probably the most confusing type as it's unique to Perl. A glob contains its own name, as well as one of each of the other sigiled types (scalar, array, hash, sub) and two opaque values: IO and format. The symbol table is essentially a hash of globs, so a package variable @Foo::bar can also be accessed as @{ *Foo::bar{ARRAY} }. Much of this is vestigial behavior from perl version 4 and older, and unless you're hacking the symbol table you should have no need to mess with them.
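As a rough illustration of the glob slots, the sketch below fills three slots of *Foo::bar and reads each one back through the glob. The package Foo and its contents are invented for the example.

```perl
use strict;
use warnings;

package Foo;
our @bar = (1, 2, 3);
our $bar = "scalar slot";
sub bar { return "sub slot" }

package main;

# Each slot of the glob yields a reference to the value of that type
my @array  = @{ *Foo::bar{ARRAY} };
my $scalar = ${ *Foo::bar{SCALAR} };
my $code   = *Foo::bar{CODE};
my $result = $code->();

# The symbol table of package Foo is itself a hash, named %Foo::
my $in_symbol_table = exists $Foo::{bar} ? 1 : 0;

print "@array | $scalar | $result | $in_symbol_table\n";
```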

IOs

These contain file handles. They have no sigil and can only really be used opaquely through their glob (or via a reference, but no one does that). All operations on them are typically done through that glob (or a reference to such a glob). They support various operations that you'd expect from a file handle (e.g. readline, printf, …).
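A sketch of how an IO usually stays hidden behind its glob. It assumes an in-memory file handle so that it does not touch the file system; reftype comes from the core module Scalar::Util.

```perl
use strict;
use warnings;
use Scalar::Util 'reftype';

# A lexical file handle is a reference to an anonymous glob
open my $fh, '<', \"first line\nsecond line\n" or die "open failed: $!";

my $handle_type = reftype $fh;       # the handle itself is a glob reference
my $io          = *{$fh}{IO};        # the IO value stored in that glob
my $io_type     = reftype $io;

my $line = readline $fh;             # normal operations go through the glob
close $fh;

print "$handle_type $io_type $line";
```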

Formats

Formats are another unique type in Perl. They're a special kind of sub that is attached to an IO via the glob, used via the format keyword. You're probably better off knowing as little as possible about them.

Common operations

All sigiled types can be used with the referencing operator \ (e.g. my $foo = \&function). All accessible types can be used in both scalar and list context (e.g. scalar(@array) returns the number of entries in the array, and @array = %hash assigns a list of alternating keys and values); some would consider that type weakness. All accessible types can be used as arguments to functions (though that usually means list context). Other than that there are no operations common across all types.
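The common operations can be sketched like this; count_args is a hypothetical helper that only exists for the example.

```perl
use strict;
use warnings;

my $scalar = "x";
my @array  = ('a', 'b', 'c');
my %hash   = (one => 1);

my $count     = scalar(@array);    # array in scalar context: its length
my @flattened = %hash;             # hash in list context: keys and values

sub count_args { return scalar @_ }
my $passed = count_args(@array, %hash);   # both flatten into one argument list

# The referencing operator works on every sigiled type
my @kinds = map { ref } (\$scalar, \@array, \%hash, \&count_args, \*STDOUT);

print "$count | @flattened | $passed | @kinds\n";
```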

The dynamic type system

Of course, none of the above is what you were thinking of when I first said type system. What you were probably thinking about was types of scalars. Once again, this actually works differently from other systems.

Fundamentally though, there are only 7 operations you can do on scalars, all the others can be composed of them.

Primitives

Firstly there are the three primitive conversions:

  • Get an integer value.
  • Get a floating point value.
  • Get a string value. This value might be internally latin1 or utf8, this should not matter to the language; when it does that is known as The Unicode Bug.

All scalars are usable as integers, floats and strings; this is a fundamental trait of the scalar type. If any scalar is lacking any of these, the missing primitive will be generated from one of the others (e.g. float from int, int from string). Perl might give a warning because it's unhappy about those conversions (e.g. «Argument "a" isn't numeric in addition»), but it will continue nonetheless.
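A minimal sketch of these conversions. The warning is captured with a local $SIG{__WARN__} handler purely so the example can show that it fires; that handler is an addition for illustration.

```perl
use strict;
use warnings;

my $text = "42";
my $sum  = $text + 1;          # string used as an integer
my $half = $text / 5;          # string used as a float
my $msg  = $sum . " items";    # number used as a string

# A non-numeric string still numifies (to 0), but it warns about it
my $junk = "a";
my @warnings;
my $odd = do {
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    $junk + 1;
};

print "$sum $half $msg $odd\n";
```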

This is why perl has two sets of comparison operators (e.g. == versus eq), and a concatenation operator (.) separate from addition (+). Perl puts that polymorphism in the operators instead of the values. When people complain that perl is weakly typed they usually complain about this triple valuedness. I'd argue it's not weakly typed, just unusually typed.
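To illustrate how the operator picks the interpretation, the sketch below applies numeric and string operators to the same values:

```perl
use strict;
use warnings;

my ($x, $y) = ("1.0", "1");

my $numeric_equal = ($x == $y) ? 1 : 0;    # compared as numbers
my $string_equal  = ($x eq $y) ? 1 : 0;    # compared as strings

my $added  = "2" + "3";                    # addition numifies both operands
my $joined = "2" . "3";                    # concatenation stringifies them

my $numeric_less = ("10" <  "9") ? 1 : 0;  # 10 is not less than 9
my $string_less  = ("10" lt "9") ? 1 : 0;  # but "10" sorts before "9"

print "$numeric_equal $string_equal $added $joined $numeric_less $string_less\n";
```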

Assignment

Any scalar can be assigned to. This replaces the value entirely. In reality of course there are various mutating operations for efficiency reasons, but conceptually the model doesn't actually require mutators; creating a new value and assigning it to the old should always be equivalent to mutating the old.
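As a small sketch of that equivalence, each mutating operator below is paired with the assignment it stands for. The last lines show that assignment replaces the value entirely, whatever the old value was.

```perl
use strict;
use warnings;

my $mutated  = "abc";
my $assigned = "abc";
$mutated .= "def";                  # mutate in place
$assigned = $assigned . "def";      # build a new value and assign it

my $incremented = 1;
my $reassigned  = 1;
$incremented++;
$reassigned = $reassigned + 1;

# Assignment does not care what the scalar held before
my $anything = 'Hi';
$anything = [ qw/how are you/ ];

print "$mutated $assigned $incremented $reassigned\n";
```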

Definedness

Any scalar can be either defined or undefined. An undefined value is one that doesn't have any primitive value. As such, you can do nearly nothing with it without triggering warnings, other than checking definedness.
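A sketch of what that looks like in practice. As before, the local $SIG{__WARN__} handler is only there to make the warnings observable.

```perl
use strict;
use warnings;

my $nothing;
my $is_defined = defined $nothing ? 1 : 0;    # checking definedness never warns

my @warnings;
my ($as_number, $as_string);
{
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    $as_number = $nothing + 0;     # behaves like 0, with a warning
    $as_string = $nothing . "";    # behaves like "", with a warning
}

print "defined=$is_defined number=$as_number string='$as_string'\n";
```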

Package lookup

Package lookup is what's used by method invocation (the -> operator, and the isa operator that was added as experimental in perl 5.32 and became stable in 5.36) to determine what class should be used for method calls on a scalar.
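A minimal sketch of package lookup, using two invented classes. It uses the universal isa method rather than the isa operator so that it also runs on older perls.

```perl
use strict;
use warnings;

package Animal;
sub new   { my ($class, $name) = @_; return bless { name => $name }, $class }
sub speak { return "..." }

package Dog;
our @ISA = ('Animal');
sub speak { return "Woof" }

package main;

my $dog       = Dog->new('Rex');          # new is found in Animal through @ISA
my $sound     = $dog->speak;              # lookup starts at the package Dog
my $is_animal = $dog->isa('Animal') ? 1 : 0;

# A plain string is looked up as a package name
my $class   = "Dog";
my $by_name = $class->speak;

print "$sound $is_animal $by_name\n";
```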

Dereferencing

Dereferencing follows a scalar to the value it refers to. There is a dereferencing operator for each of the 5 sigils (e.g. @{ ... }); values may support any subset of them (including all or none).
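The five dereferencing operators, one per sigil, can be sketched as follows. Each reference here supports exactly one of them.

```perl
use strict;
use warnings;

my $sref = \"text";
my $aref = [1, 2, 3];
my $href = { key => 'value' };
my $cref = sub { return "called with @_" };
my $gref = \*STDOUT;

my $scalar = ${ $sref };
my @array  = @{ $aref };
my %hash   = %{ $href };
my $result = &{ $cref }(1, 2);
my $glob   = *{ $gref };

print "$scalar | @array | $hash{key} | $result | $glob\n";
```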

That's it

Literally all scalar operations in Perl can be composed out of combinations of these. E.g. numeric operations like + will numify their operands and create a new numeric scalar. An operation like print will take the string value and print that to the handle.

Subtypes

As I said before, Perl is polymorphic on operators not on values; it will generally use the appropriate value for that operation. That said, some values consistently behave differently from other values, and can be thought of as a subtype for that reason. Any value is strictly one of these three subtypes, and it can't change its subtype without an assigning/mutating operation to that variable.

References

References are indirections to a value of one of the 7 basic types of the language, or in recent perls potentially other opaque types (e.g. regexes or the new object representation in feature class). They will numify to an integer representation of their address (e.g. 0x57b6255ae580) and stringify to something derived from that (e.g. "ARRAY(0x57b6255ae580)"). They are always defined, and how they invoke and dereference is dependent on the value they refer to.

Only the dereferencing operator matching their referent's type is allowed, which ensures type-safety; trying to use any of the other 4 will throw an exception. Dereferencing isn't allowed for types that don't have a sigil (IO, format and opaque types).
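A short sketch of both points: how a plain (unblessed) reference numifies and stringifies, and what happens when the wrong dereferencing operator is applied.

```perl
use strict;
use warnings;

my $aref = [1, 2, 3];

my $address  = $aref + 0;                           # numifies to its address
my $string   = "$aref";                             # e.g. "ARRAY(0x...)"
my $expected = sprintf "ARRAY(0x%x)", $address;     # same address, formatted

# Using the hash dereferencing operator on an array reference throws
my $error = '';
eval { my %wrong = %{ $aref }; 1 } or $error = $@;

print "$string\n$error";
```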

Method invocation on references is only allowed when its referent is blessed (a lot of documentation suggests it's a trait of the reference itself; this is incorrect). Any type of value can be blessed (even formats, though some would call that cursed instead). A blessing is literally nothing more than tagging a value as belonging to a specific class. On method invocation, the method is looked up using that class as its starting point.
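One way to see that the blessing belongs to the referent is to take two references to the same array and bless through only one of them. The class name Stack and its size method are invented for this sketch.

```perl
use strict;
use warnings;
use Scalar::Util 'blessed';

sub Stack::size { my ($self) = @_; return scalar @{ $self } }

my @data   = (1, 2, 3);
my $first  = \@data;
my $second = \@data;

bless $first, 'Stack';                # tags @data itself, not just $first

my $class_of_second = blessed $second;    # the other reference sees it too
my $size            = $second->size;

print "$class_of_second $size\n";
```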

Blessed references can support operator overloading for both primitive operations and dereferencing.
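A minimal sketch of overloading with the core overload pragma, covering two primitive conversions and one dereferencing operator. The Celsius class is hypothetical.

```perl
use strict;
use warnings;

package Celsius;
use overload
    '""'     => sub { return $_[0]{degrees} . " C" },    # string value
    '0+'     => sub { return $_[0]{degrees} },           # numeric value
    '@{}'    => sub { return [ $_[0]{degrees} ] },       # array dereference
    fallback => 1;

sub new { my ($class, $degrees) = @_; return bless { degrees => $degrees }, $class }

package main;

my $temp   = Celsius->new(21.5);
my $string = "$temp";        # uses the '""' overload
my $sum    = $temp + 1;      # falls back to the '0+' overload
my @list   = @{ $temp };     # uses the '@{}' overload

print "$string | $sum | @list\n";
```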

Fake globs

Fake globs are essentially a vestigial feature from perl4; I don't think they're even documented. They work sort of like references to globs, except they will accept all dereferencing operators (and use the appropriate glob slot for them). They stringify to the glob's full sigiled name (e.g. "*Foo::bar") and are always defined. I think you can work with Perl for decades without ever encountering or needing any of these, but they exist in weird corners.
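A sketch of the behaviour described above, assuming that a fake glob is what you get when a glob is assigned directly to a scalar (rather than taking a reference to it). The variable names are illustrative.

```perl
use strict;
use warnings;

our @list = (1, 2);
our %list = (key => 'value');

my $glob = *main::list;         # a glob value in a scalar, not a reference

my $is_ref = ref($glob) ? 1 : 0;    # it is not a reference
my $name   = "$glob";               # stringifies to the sigiled name

# Several dereferencing operators work, each using its own glob slot,
# even with strict refs enabled
my @from_array_slot = @{ $glob };
my %from_hash_slot  = %{ $glob };

print "$name | @from_array_slot | $from_hash_slot{key}\n";
```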

Other

Anything else is other. Other values may contain any combination of the three slots: integer, float and string (e.g. (1.0, "1")). Usually the values logically match with each other, but that does not necessarily have to be the case (most famously in $! they do not). An undefined value is any value that has none of these three set; it's equivalent to (0, "") except it will warn when used as either of them. These are the only values for which the defined operator returns false.
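To illustrate slots that don't logically match, the sketch below builds such a value with dualvar from the core module Scalar::Util and then looks at $!. The error message text depends on the platform and locale, so it is not shown.

```perl
use strict;
use warnings;
use Scalar::Util 'dualvar';
use Errno 'ENOENT';

# A value whose numeric and string slots disagree
my $both      = dualvar(5, "five");
my $as_number = $both + 1;
my $as_string = "$both";

# $! works the same way: an error number and an error message
$! = ENOENT;
my $errno   = $! + 0;
my $message = "$!";

print "$as_number $as_string | $errno $message\n";
```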

If use strict (or actually use strict 'refs') is not enabled, these values can be used with any of the dereferencing operators for symbolic dereferencing (e.g. @{ "Foo::Bar" }) using their string value. When strict is enabled (it should be in 99.9% of all code) symbolic lookup is forbidden.
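A sketch of symbolic dereferencing with and without strict refs, using the package variable from the example in the text:

```perl
use strict;
use warnings;

@Foo::Bar = (1, 2, 3);
my $name = "Foo::Bar";

# Without strict refs the string is used as a variable name
my @copy;
{
    no strict 'refs';
    @copy = @{ $name };
}

# With strict refs the same expression throws an exception
my $error = '';
eval { my @forbidden = @{ $name }; 1 } or $error = $@;

print "@copy\n$error";
```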

The dynamic type system, level two

Of course, none of the above is what you were thinking of when I first said type system. What you actually want to think about is objects.

One of the issues with calling this a type system is that it's actually mutable, which suggests it's a weak type system at best, but you can just as easily argue it's part of the value instead of the type. We treat it as if it's a strong type because people don't actually rebless objects; the only exceptions I know of are cases where an object is reblessed into a subtype (e.g. upgrading an IO::Socket to IO::Socket::SSL), which is just sensible enough to not break people's Liskov expectations.

Fundamentally Perl's OO-by-blessing is one of the most minimal OO systems that has ever been invented. With just blessing and the method call operator it gives us useful types, but it doesn't really give us much of a system.
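Reblessing into a subclass can be sketched as follows. The two classes are hypothetical stand-ins and do not reflect how IO::Socket::SSL actually performs its upgrade.

```perl
use strict;
use warnings;

package My::Socket;
sub new      { my ($class) = @_; return bless { open => 1 }, $class }
sub describe { return "plain" }

package My::Socket::TLS;
our @ISA = ('My::Socket');
sub describe { return "encrypted" }

package main;

my $sock   = My::Socket->new;
my $before = $sock->describe;

bless $sock, 'My::Socket::TLS';      # rebless the same object into the subclass

my $after        = $sock->describe;
my $still_socket = $sock->isa('My::Socket') ? 1 : 0;

print "$before -> $after (still a My::Socket: $still_socket)\n";
```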

3 Comments

Thanks, Leon. You've done a great service with this - for Perl practitioners as well as skeptics and newcomers.

I found one sentence I had to read a few times before understanding, under the "Assignment" heading:

"Any scalar can be assigned to. This replaces the value entirely. In reality of course there are various mutating operations for efficiency reasons, but conceptually the model doesn't actually require them creating a new value and assigning it to the old should always be equivalent to the old."

It took me a bit to get that when the second sentence says "should always be equivalent to the old" you're referring to the type not having to be equivalent.

E.g.: $foo = 'Hi'; $foo = [ qw/how are you/ ]; # No prob

Did I get to the correct understanding?

I was amused by your mention of formats. One of my first paid Perl projects (Perl 4) was an online listing for a used machinery dealer, and it used Perl formats to create columns for hundreds of machines and parts. Once I had the format designed (where you're suggesting people might not want to go), the output was included with <pre> tags and updated so quickly the dealer would get confused as whether they were looking at it locally or online.

And very amusing: I still use a module based on blessing the whole glob/symbol table. It's a very powerful tool for me - typically the hash handles transient data like successive records from an iterator, the array is a buffer for capturing things for use later, the IO is for quick output or caching, and the sub dispatches that object's template parser (using the data in the hash). Not published - you're looking at fear of being laughed out of CPAN ;-)

Anyway, many thanks for your clear explication!


About Leon Timmermans
