The Coro situation

Since my recent participation at the QA Hackathon I have become aware that rather more people than I expected do not know the specifics of this situation. Fewer than I expected have heard of it at all, even, although there appears to be some general awareness at the “something happened with that” level at least.

However, the situation is being used to characterise Marc Lehmann whenever his name comes up (and it comes up rather more often than I would expect or consider necessary).

To give a clear picture of the facts and to avoid repeating that exercise every time I have a related conversation, here is an outline of where we are and how we got here.

(Thanks to Andreas König, Graham Knop, and Peter Rabbitson for proofreading drafts of this article and verifying the stated facts.)

Objectively

Here is a timeline of relevant events:

  • In late 2007, Coro gets a new feature: per-coroutine die/warn handlers.

  • Various bugs with the feature are reported and fixed, e.g. not working on perls earlier than 5.8.8. Tests slowly accumulate.

  • One of these bugs is that the Perl-level $SIG{__WARN__} and XS-level PL_warnhook can diverge (and likewise for __DIE__). The workaround requires hooking into a very low-level point in the perl internals.

  • In Perl 5.21.7, as part of fixing the semantics of constant inlined subs, the incidental lvalue-ness of a core macro gets broken in patch 9b7476d7a269.

  • In Perl 5.21.8, in a pursuit for memory savings, the ability to hook into these internals is removed with patch c910fead7893. This results in a savings of 4 kilobytes total per perl process.

  • Coro is found to be broken. An issue is raised before the release of 5.22.0, but discussion fizzles out.

  • 5.22.0 is released with both patches that break Coro.

  • perl #125439 is filed, about the imperfect synchronisation between $SIG{__FOO__} and PL_foohook, but aside from Corion’s valiant but not yet successful attempt, little energy is put into fixing it.

  • Coro is now in a place where the options are as follows:

    1. Make the tests and workaround conditional on perl version and accept being buggy with no recourse for some evidently non-hypothetical users, which implies headaches for testing and surprises for users moving from version to version.

    2. Refuse to build on perls that suffer from both perl #125439 and c910fead7893.

  • The author of Coro picks option 2. He also provides an unofficial release of perl 5.22 called “stableperl” as a courtesy for users with no way out nor the skill and inclination to figure out and apply the patches necessary to unbreak Coro under 5.22 and following.

  • At opening of the next Perl release cycle, in 5.23.0, 9b7476d7a269 is corrected by patch 73949fca082f to restore the lvalue-ness of the macro. This leaves only one change still breaking Coro, so during most of the 5.23 cycle, it can be unbroken by applying the following patch, reproduced here in its entirety:

    --- i/perl.h
    +++ w/perl.h
    @@ -5586 +5586 @@
    -#define EXT_MGVTBL EXTCONST MGVTBL
    +#define EXT_MGVTBL EXT MGVTBL
    
  • It is decided that the patch shown above cannot be applied in 5.22.1, because that would violate the strict binary compatibility promise for point releases of perl within a stable version series. Regrettable/unfortunate though this may be, it is true, and not a simple technicality.

  • However, no move is ever made to apply it during the 5.23 development series.

  • Reini Urban publishes an unofficial release of Coro in which he takes option 1, i.e. it works on perl 5.22 at the cost of being buggy for some users. [See Ctrl-F PERL_VERSION at the linked diff.]

  • In 5.24.0-RC3, the ability to hook those internals is still absent, so the situation for Coro remains unchanged, implying that it will remain broken for at least another release cycle.

  • A call is made to discuss whether the author of Coro has forfeited his ownership of the namespace.

  • Effects of the decision to keep that change are still cascading out, e.g. miyagawa/Task-Plack#3.

Interlude

The still-unreverted patch was also reported (in perl #123687) to affect mod_perl 2, where it was fixed as documented in RT #101962, resulting in patch 82827132efd3. The solution was to increase the complexity of the hooking code, so that it can inject the same workaround as before but into a different point of the internals where it has a more limited scope. Contrast the amount of volunteer effort with the 4kb per process savings.

In all this, mod_perl 1 went entirely unaccounted for. It is unknown whether any users have tried to use it with perl 5.22, which can be expected not to work. Note also that c910fead7893 removes a comment which mentions that some SWIG bindings require the ability to hook those internals which just got removed.

Subjectively

People have been speaking of Marc Lehmann “refusing” to support 5.22 in Coro. It has been suggested that he could try to find a workaround.

But for one, if it is fixable, it is highly non-obvious how. The way mod_perl 2 was fixed does not appear applicable to Coro due to difference in the specifics of the two situations. This is implicitly acknowledged by Corion’s comment in perl #125439:

I think Perl should take the code from Coro to make $SIG{__WARN__} always write+read PL_warnhook

… and further evidenced by the fact that the unofficial release of Coro does not have a workaround despite the expertise of Reini Urban.

And for another, even if a workaround were possible, consider that Coro was working around a bug to begin with. That workaround got broken. It is not clear that a different workaround would be future-proof, outside of p5p blessing one as such. Is it on Coro’s author to keep working around broken workarounds for the same bug whenever p5p breaks another one? Or is it on p5p to either fix the underlying bug or else avoid breaking its workarounds?

To be fair, the monkeypatching performed by Coro and mod_perl hooking into that table is very global: it affects every perl interpreter within a process (of which there can be several when perl is embedded in another program). Locking down this hook point is a sensible architectural decision in the abstract. This is rumoured to have been the reason the patch never got reverted. Indeed, for most use cases, it is already possible to hook into other points with more limited scope. This is how mod_perl was saved. But it’s not clear that every existing use case already has cleaner alternatives, and even the ones available are not currently well understood – not even by the developers who locked down the hook point. Consequently there was no attempt to help downstream developers who had previously employed monkeypatches: no attempt was made to survey existing use cases, no documentation was added about clean and future-proof alternatives to existing monkeypatches, no helper functions to make those approaches more approachable… no aid of any kind. Such efforts would have been part of a carefully executed plan to promote cleaner coöperation among low-level extensions. Instead the commit message demonstrates the motivation for the patch prior to ex post facto justifications of its benefit.

Note this alternative sequence in which events could have played out:

  • When the 4kb-saving patch is found to break Coro and mod_perl, it gets reverted for the time being

  • A report about the underlying imperfect hook synchronisation bug is filed and a fix is written

  • A new major version of perl with the fix is released

  • The author of Coro is approached to say “your workaround has been made unnecessary in perl 5.$recently and later”

  • When the decision is made that this patch shall be re-applied, an effort is made to prepare the ecosystem (the surveying and documentation as mentioned above), during which the author of Coro is approached to say “the workaround should now be conditionalised because it will break in 5.$soon”

Nothing in that sequence is infeasible. It would have delayed the 4k-saving patch for a few releases, and in exchange it would have prevented any and all suffering inflicted on at least the maintainers and users of software built on Coro.

I close with a recent quote from the pumpking of Perl 5:

perlpolicy […] demands that we keep backward compatibility as much as possible. This is a huge deal for Perl 5, for a lot of reasons.

For one, it means that we don't knowingly break things unless we believe there's a good reason – and we’ve been trying to tighten up “good reason” a bit over time. This means that it's almost always safe to upgrade your perl without unpleasant surprises. The “incompatible changes” section of the changelog file should be short and fairly exhaustive.

Emphasis mine.

3 Comments

Of course, the main objective for p5p has to be keeping perl (the interpreter) stable and usable while advancing its features.

But the next most important objective simply has to be enabling the community to make use of perl to the greatest extent possible in any way that does not violate the prime directive.

The main intention of the original patch apparently was some - imho rather negliable - memory usage reduction and not fixing some serious problem (the code perhaps needed some improvement but it wasn´t actually "broken"). Now when it turned out the change broke some valuable part of the ecosystem, instead of saying "whoops, sorry, let´s revert that and try again later with the insight we just gained" they´re seriously still discussing if they should revert that patch?!

See the somewhat fascinating discussion at http://www.nntp.perl.org/group/perl.perl5.porters/2016/05/msg236174.html

There´s some argumentation that "Coro is still broken when reverting the patch" - but at the same time apparently nobody is really sure how that could be fixed. Perhaps inside Coro alone? Perhaps in perl but without breaking the compatibility within the 5.24.x release cycle? Why the heck should that keep the patch from being reverted?!

Either postpone the release of 5.24.0 and investigate further or revert the patch to not intentionally slam the door in the face of Coro for the whole 5.24.x release cycle.

For the record: I agree that the benefit of the original 5.21-series consting change has been massively outweighed by the cost to date, if only because of how much heat this matter has generated.

However, I'm pleased to note that making Coro work with released versions of Perl 5.22 and 5.24 is apparently relatively straightforward regardless, at least assuming that this Debian patch works as advertised:

https://anonscm.debian.org/cgit/pkg-perl/packages/libcoro-perl.git/tree/debian/patches/coro-5.22.patch?id=dcd10ba9611b38e0ef19ef597b42b635fa23d942

Leave a comment

About Aristotle

user-pic Waxing philosophical