575 Pull Requests in Three Weeks: What Happens When AI Meets CPAN Maintenance
On March 17th, I installed a bot called koan on my personal Claude account. It's designed to monitor your four-hour usage limits, maintain a queue of "missions," and efficiently use your credits — even while you're sleeping.
Three weeks later, I had reviewed and merged 575 pull requests across 20 CPAN repositories, cut 58 releases across 17 of them, and caught memory leaks, security holes, and long-standing bugs that nobody had gotten around to fixing in years.
Some people think that means I let an AI commit unchecked code to critical Perl infrastructure. I want to explain what actually happened.
The Numbers
Here's what those three weeks produced:
| Repository | PRs Merged | Releases |
|---|---|---|
| XML-Parser | 95 | 8 |
| IPC-Run | 68 | 3 |
| YAML-Syck | 64 | 7 |
| Net-Jabber-Bot | 38 | 3 |
| Net-ACME2 | 38 | 1 |
| Net-Ident | 33 | 5 |
| Crypt-RIPEMD160 | 33 | 4 |
| IO-Stty | 30 | 4 |
| Razor2-Client-Agent | 27 | 2 |
| Tree-MultiNode | 24 | 2 |
| IO-Tty | 22 | 7 |
| Business-UPS | 20 | 2 |
| Test-MockFile | 17 | 2 |
| Safe-Hole | 17 | 3 |
| Tie-DBI | 15 | 1 |
| Regexp-Parser | 11 | 3 |
| Sys-Mmap | 9 | 1 |
| Template2 | 6 | — |
| Crypt-OpenSSL-RSA | 4 | — |
| CDB_File | 4 | — |
| Total | 575 | 58 |
Those are the kind of numbers that make people nervous, and I understand why. Let me explain why they shouldn't be.
Every PR Was Reviewed by a Human
I need to say this plainly: every single pull request was reviewed and merged by me. Not rubber-stamped. Reviewed.
The koan bot submitted fixes, improved Kwalitee across the repositories, and worked to ensure CI covered as much surface area as possible so that little could merge without being tested. But it didn't have merge access. I was the bottleneck by design.
There was real pushback during review. When fixes looked wrong, I said so. When explanations were missing, I asked for them. When an approach was the wrong one, I rejected it. This was not a case of an AI firehosing code into production. It was a case of a maintainer using an AI to generate candidate fixes at a pace that would have been impossible for one person to write — but not impossible for one person to review.
That's the distinction people keep missing.
The Failures Told Us What We Didn't Know
Here's the part that actually gets interesting. The CPAN Testers matrix tracks test results across Perl versions, operating systems, and configurations. When we shipped releases, some of them failed. Look at the data:
| Dist | Before | After Fix |
|---|---|---|
| IPC-Run | 103 FAIL (20260322.0) | 0 FAIL (20260327.0) |
| IO-Stty | 20 FAIL (0.05) | 0 FAIL (0.08) |
| Net-Ident | 25 FAIL (1.26) | 0 FAIL (1.27) |
| YAML-Syck | 52 FAIL (1.38) | 0 FAIL (1.40) |
| IO-Tty | 18 FAIL + 31 UNK (1.20–1.21) | 1 FAIL (1.27) |
| XML-Parser | 11 FAIL (2.49) | 0 FAIL (2.57) |
Were there regressions? Yes. IPC-Run 20260322.0 shipped with 103 failures. That's because the AI-generated changes exposed CI gaps we didn't know existed — configurations we weren't testing, platforms we hadn't considered. Five days later it was at zero. IO-Stty went from 20 failures down to zero across four releases. YAML-Syck spiked at 1.38, was fixed by 1.40, spiked again at 1.42 with 86 failures on a different issue, and was clean again by 1.43.
The failures weren't the problem. The failures were the signal. They showed us where our CI was incomplete, and the rapid release cadence meant we could respond in days instead of months.
What We Actually Fixed
This wasn't just reformatting code and updating boilerplate. Across these repositories, we found and addressed:
- Memory leaks that had been lurking for years
- Security vulnerabilities that no one had audited for
- Long-standing bugs that users had reported but no one had time to fix
- Complete implementations of features users had requested for years that nobody had built
- CI blind spots — entire platforms and Perl configurations that were never being tested
Many of these modules are deep infrastructure. XML-Parser, IPC-Run, YAML-Syck, IO-Tty — these aren't hobby projects. They're load-bearing walls in the Perl ecosystem. The work that got done in three weeks would have taken a solo maintainer the better part of a year, assuming they had the time at all.
The Reaction
The volume of activity got attention, and not all of it was positive. Some people looked at the PR count and concluded it must be AI slop — untested, unreviewed code flooding CPAN. Gentoo's packagers nearly banned my modules on the assumption that I was blindly shipping AI-generated code.
I'd encourage anyone with that concern to look at the actual diffs, the CI results, and the review comments. They're all public. If a specific change is wrong, let's talk about it — that's how open source is supposed to work.
What's worth noting is the double standard. The Perl community routinely accepts drive-by patches from complete strangers. Nobody demands that a first-time human contributor prove their code wasn't generated by autocomplete or copied from Stack Overflow. But attach the label "AI" and suddenly the code quality of the entire module is in question.
"It was generated by AI" is not a technical objection. The code either works or it doesn't.
AI_POLICY.md
In response to the concerns, we now ship an AI_POLICY.md in our repositories. You can read the full document, but it comes down to one line:
AI assists. Humans decide.
The document lays out exactly how AI is used: analyzing issues, generating draft PRs, surfacing context from the codebase. And it makes explicit what should already be obvious — every pull request, whether AI-drafted or human-authored, is reviewed by a human maintainer before merge. AI drafts are treated the same way you'd treat a junior contributor's first attempt: useful raw material that still needs experienced eyes.
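For projects that want to adopt something similar, a minimal sketch of such a policy file might look like the following (the section headings here are illustrative, not the actual AI_POLICY.md):

```markdown
# AI Policy

## Principle
AI assists. Humans decide.

## How AI is used
- Analyzing open issues and test failures
- Generating draft pull requests
- Surfacing relevant context from the codebase

## What AI never does
- Merge to any branch
- Cut or upload a release

## Review
Every pull request, whether AI-drafted or human-authored, is reviewed
by a human maintainer before merge. AI drafts are treated like a junior
contributor's first attempt: useful raw material needing experienced eyes.
```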
We wrote this policy not because we had to, but because transparency matters. If AI is going to be part of open source maintenance — and it already is, whether projects acknowledge it or not — then the community deserves to know how it's being used.
The Real Question
The question isn't whether AI should be involved in open source maintenance. It already is. The question is whether maintainers are going to be honest about it and put guardrails in place, or whether it's going to happen quietly with no review process at all.
I chose the transparent path. I reviewed every PR. I shipped an AI policy. I responded to regressions within days. I'm accountable for every line that shipped, the same way I've been accountable for these modules for years.
575 pull requests. 58 releases. Memory leaks found. Security holes closed. CI gaps filled. Bugs fixed. Features completed.
The code speaks for itself.
Nicely done! This is a great example of AI assistance and disclosure!
I'm curious to know the dollar amount spent on AI to achieve these results. Will you be doing this again on a set of different libraries in the future?
Nice. I've been using AI to clean up some of my code, and I learned some things. It's not an existential threat if you don't let it be. It can help you get better and unclutter technical debt that weighs on you so long you forget about it.
This is very interesting, and I hope it leads somewhere positive.
What are your thoughts on addressing the problem of "trained ignorance" in LLMs?
False negatives when looking for security issues can be pretty nasty, I think.
Maybe if one uses multiple independently trained models and/or a second pair of human eyes to review the code, this risk can be minimized?
The $100/month Claude account was enough. The $20 plan could have worked, but it would have taken longer.
> What are your thoughts on addressing the problem of "trained ignorance" in LLMs?
Not sure what you mean here. Making sure the code is well documented and POD-ed certainly makes Claude more knowledgeable. In one case I caused a regression in "false" behavior with XML::Parser which I had to revert. This caused downstream outages but ended with tests and a more documented API.
> False negatives when looking for security issues can be pretty nasty, I think.
Claude only highlights the possibility of security issues, but it does so clearly in the PR. When it does, it is instructed to ALSO provide a unit test that clearly shows a way to reproduce the issue, so that any researcher or reviewer can review it and judge for themselves. Just like with a human submitting a security issue, it's up to me and security teams to decide whether it's worthy of concern.
> Will you be doing this again on a set of different libraries in the future ?
If people are open to it. I'm currently looking at offering to help with LibXML but I consider this to be a decision of the current maintainer if they want the help.
When you review at breakneck speeds like 190 PRs per week, what you approve is stuff like this:
https://github.com/cpan-authors/XML-Parser/pull/118#issuecomment-4165348882
… and this:
https://github.com/cpan-authors/XML-Parser/pull/140#issuecomment-4165195217
So with all that, *I need to say this plainly: each of these pull requests was reviewed and merged by you. Not rubber-stamped. Reviewed.*
[Silence]
Yes. So would I.
Ship first. Talk later.
> So with all that, I need to say this plainly: each of these pull requests was reviewed and merged by you. Not rubber-stamped. Reviewed.
Correct! I also remember the decision I made on both of these and I determined they were warranted. One of them broke downstream and I was able to quickly fix it. I now have documentation in the code base explaining why that behavior exists.
It's when the LLM is (purposefully) trained to accept a pattern of text as good, when it is in fact bad.
e.g. Let's say you're training a special model for detecting security issues, but you make sure that during the training that two-arg open() calls are NOT flagged as a potential security issue.
Anyone using this model later to find issues in their own code will NOT be made aware of the problem, and if they are themselves ignorant of it, and just a little too trusting or sloppy in their review, the issue will remain in the code. This is a "false negative": the LLM says "there's nothing wrong here" (a negative claim) when the contrary is in fact true (a false statement).
Unsafe code that is assumed to be safe is scary.
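The thread's own example, two-arg `open()`, is worth seeing concretely. This is a small sketch of the pitfall (filenames are hypothetical; run it in a scratch directory): the two-arg form parses its second argument for mode characters and pipes, so untrusted input can do far more than name a file to read.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $untrusted = '>clobbered.txt';            # imagine user-supplied input

# Two-arg open(): the leading '>' is interpreted as "open for write",
# so this silently creates a file instead of failing.
open(my $fh, $untrusted) or die "open: $!";
close $fh;
my $two_arg_created = -e 'clobbered.txt';    # true: a file was created
unlink 'clobbered.txt';

# Three-arg open() treats the string strictly as a literal filename,
# so there is nothing for an attacker to smuggle in.
my $three_arg_opened = open(my $safe, '<', $untrusted) ? 1 : 0;
close $safe if $three_arg_opened;

print $two_arg_created  ? "two-arg open created a file\n" : "no file\n";
print $three_arg_opened ? "three-arg opened it\n"
                        : "three-arg open refused: $!\n";
```

A model trained to never flag the two-arg form would pass right over the first `open` above, which is exactly the false-negative scenario being described.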
Yes, but what happens if the human purposefully chooses to not make you aware of an issue, while at the same time being thorough in reporting all the other issues it finds?
Now there is one issue left, that they know of and you don't. (Remember, you're only reviewing what is submitted, and they didn't submit anything for this remaining issue).
False negatives are _really_ scary when one allows an unknown third party in - especially if this third party is untrusted.
In the next iteration of this, may I suggest doing development releases? Let CPAN Testers chew on the code for a bit before you put it into production. That way production code doesn't take a hit if there is an unexpected problem.
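For CPAN specifically, that suggestion is a one-line convention rather than new infrastructure: an underscore in `$VERSION` (or a `-TRIAL` suffix on the tarball, e.g. via `dzil release --trial`) keeps the PAUSE indexer from pointing regular installs at the release, while CPAN Testers still smoke it. A minimal sketch with a hypothetical module:

```perl
package Foo::Bar;   # hypothetical module name
use strict;
use warnings;

# The underscore marks this as a developer release: PAUSE will not index
# it, so `cpanm Foo::Bar` keeps resolving to the last stable version,
# while CPAN Testers smokers still download and test the tarball.
our $VERSION = '1.27_01';

1;
```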
I noticed your bot sending you pull requests and was meaning to ask some questions about it, but you've saved me the effort.
This is very impressive and exactly what is needed. I noticed one such pull request which got rid of a module's outdated indirect object syntax. My reaction was whatever a metaphorical fist pump looks like when you're just sitting at your computer and don't actually move.
Absolutely, make AI do all that boring stuff! I have been having it take a crack at various outstanding requests on modules and merging good results. I have also simply pointed it at my modules and asked it for a deep code review.
Good quality coding agents (claude etc) are effectively combine harvesters. I am certain farmers resisted them once upon a time, but that is long gone. We can do much more now and much quicker. Our imagination limits us, and like someone trying to rewind a DVD, we need to shake off the habits of old technology.
Folks may recall that in "The Mythical Man-Month: Essays on Software Engineering" circa 1975, "automatic programming" was described as coming - and here it is.
Anyway this is fabulous. It would be great to match unused credit via something like the late "pull request club".
Tangentially - "these aren't hobby projects. They're load-bearing walls in the Perl ecosystem" - sounds like a cliche that AI would generate. But anyway, I'm all for AI-generated code like this and AI-generated blog posts that are well written - just need to adjust your AGENTS.md to avoid annoying cliches (at least you didn't say "learnings" or "concrete")
> In the next iteration of this, may I suggest doing development releases?
> Let CPAN Testers chew on the code for a bit before you put it into production.
The issue so far has not been my own modules' code; it's been downstream deps. CPAN Testers doesn't cover this, and definitely doesn't point out incompatibilities of devel versions against downstream modules. So the question I keep asking when people bring this up is:
How would this help?
> just need to adjust your AGENTS.md to avoid annoying cliches (at least you didn't say "learnings" or "concrete")
It's what I get for not being more pedantic with a French person's English pluralisms. :)
Trust me I actually do correct these several times a week!
I'm not sure where you got this idea. CPAN Testers covers whatever individual testers decide to cover, which certainly includes downstream deps of trial releases. But more importantly: how would it hurt?
Having some mechanism to determine downstream breakage would be a net win with or without coding agents.
It would be even more helpful with an automated mechanism to communicate changes to downstream authors.
As far as I can tell, other languages are now pinning everything and using bots to move the pins when tests pass. This has the major upside of each project itself opting-in to tracking upstream changes - not just being blasted with automated break notifications from some system they haven't decided to care about (or even know about).
Pros and cons - but a problem nonetheless.
And fwiw I think the perl community has overvalued compatibility to our peril. Rather than aversion to breaking changes, I know we can come up with mechanisms to make changes painless and manage the risks (rather than avoiding the risks).
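In Perl terms, the pin-and-bump pattern mentioned above would be an exact-version cpanfile that a bot edits via pull request once the project's own test suite passes against the newer upstream (the versions here are illustrative):

```perl
# cpanfile: exact pins; an automated PR moves a pin only after this
# project's tests pass against the newer upstream release.
requires 'XML::Parser', '== 2.57';
requires 'IPC::Run',    '== 20260327.0';
requires 'YAML::Syck',  '== 1.43';
```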
I don't know how development releases would help you. I do know that just because one of my modules passes all tests on my box does not mean it will pass everywhere. If the problem is truly only downstream failures, maybe development releases wouldn't help.
But maybe what this means is that we need more testing infrastructure -- something analogous to "Blead Breaks CPAN," but for CPAN itself, not perl.
I have a lot of concerns about using LLMs.
The sheer volume of code changes they can submit seems overwhelming. That's a lot to review, and it seems that bugs can slip through. I've seen some daft changes show up in codebases due to AI.
There has also been some research on poisoning LLMs so that they insert security holes in code, not to mention years of badly-written/insecure code posted online that they have been trained on.
There are also some serious legal and ethical concerns about using them:
Do the PRs contain code snippets from other code with incompatible licenses?
Or worse, will YOUR code end up as PRs for other projects? Will it end up in non-open source/commercial projects?
These tools also require a lot of behind-the-scenes exploitive human labour in countries like India or Kenya.
They also require data centres that consume large amounts of water and electricity, which affects the communities that these data centres are in.
And there's the political agenda of a lot of AI boosters who seem to have read dystopian sci-fi stories and thought these were cool worlds they wanted to build. Using LLMs is seen by many as contributing to a world where creative and skilled work is replaced by automated slop.
> I do know that just because one of my modules passes all tests on my box does not mean it will pass everywhere.
One of the critical pieces we learned about rapid development was assuring a complex CI workflow. If you review one of the actions for XML-Parser at https://github.com/cpan-authors/XML-Parser/actions/runs/24453321541, you'll see we test all versions of perl, with and without LWP, 3 downstream packages, on Fedora, macOS, Ubuntu, and 3 flavors of BSD. There are minor things we cannot get without CPAN Testers, but even if those fail there, they don't install, so I don't see a high risk as long as we fix them quickly.
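A trimmed sketch of what such a matrix can look like in GitHub Actions (the versions, downstream-testing step, and job names here are illustrative, not the actual XML-Parser workflow):

```yaml
jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest]
        perl: ['5.10', '5.36', '5.40']
        lwp: [with, without]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: shogo82148/actions-setup-perl@v1
        with:
          perl-version: ${{ matrix.perl }}
      - run: cpanm --installdeps .
      - run: prove -l t
      # A follow-up job (not shown) installs the built dist and runs the
      # test suites of key downstream packages against it.
```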
> ... something analogous to "Blead Breaks CPAN," but for CPAN itself, not perl.
Correct. I think this is what people expected me to somehow have checked prior to release. It simply was never part of our release process, and the good thing about rapid release was that it was easy to fix (and add to CI) once I realized there was a problem!