AI Contributions to CPAN: The Copyright Question

This is a developer's perspective, not legal advice. I'm not a lawyer. What follows is my personal reading of publicly available licenses, policy documents, and one court decision. If you're making decisions about your module's legal exposure, talk to a lawyer.

The open source world is actively debating whether to accept AI-assisted contributions. QEMU, NetBSD, and Gentoo have said no. A lot of CPAN maintainers haven't written down a policy but have a reflex — if a PR looks AI-assisted, close it without a real review.

Two very different concerns sit underneath that reflex, and they usually get mashed together:

  • Quality. AI can produce plausible-looking code that doesn't actually work, hallucinates APIs, or subtly misunderstands the problem. This wastes reviewer time.
  • Copyright. AI might reproduce memorized training material the contributor has no right to relicense. The output itself may not even be copyrightable. The contributor may not actually own what they're submitting.

This article is about the second one. The quality concern is real, but you already know how to handle it — you read the PR, you run the tests, you ask questions. It's the same concern you have about any contributor. The copyright concern is the structurally novel one, and it's the question where CPAN's particular history matters.

I maintain CPAN modules, and for a while I've been running AI-assisted modernization tools (koan is one example of the category) against the modules I care for. Across the cpan-authors GitHub organization — a collection of repositories where many CPAN modules are collaboratively maintained — hundreds of those PRs have been merged, every one with a human in the loop: a maintainer reading the diff, running the tests, and taking full responsibility. My argument isn't that nothing has ever been wrong. It's that the copyright concerns people raise about AI-assisted contributions are not a new kind of concern for CPAN. They are the same concern CPAN has always carried, under a different name.

The reflex to reject AI-assisted PRs on copyright grounds quietly smuggles in an assumption: that human-authored CPAN contributions came with a clean bill of provenance, and AI is the thing threatening to pollute the water. That assumption does not survive five minutes of honest thought about how CPAN actually works.

What CPAN actually verifies

CPAN's rule for accepting new code is, in full: the author uploads a tarball, declares a license, and we trust them. Many modules pick "same terms as Perl," but plenty use Artistic 2.0, MIT, Apache 2.0, or something else — and a surprising number ship with an ambiguous or missing license line.

There's no DCO. There's no CLA. PAUSE confirms that you are who you say you are. It does not confirm that you own what you just uploaded. When the license is "same terms as Perl" — still the most common declaration on CPAN — it points to a moving target: Artistic 1.0 plus GPL 1-or-later. That pair has been tested in U.S. court basically once, in Jacobsen v. Katzer (Fed. Cir. 2008), and never that I know of outside the U.S. The part of Artistic that reliably works is the warranty disclaimer — and every common CPAN license (Artistic 2.0, MIT, Apache 2.0, GPL) has a disclaimer of its own. Those disclaimers appear, to me, to be doing most of the real legal work for CPAN today.

If you've maintained a module for a while, you already know the quiet risks sitting in the index:

  • Code an employer actually owns, uploaded by an employee who never got permission. Some of this has been in widely-used distributions for decades.
  • Code copied from a book, a paper, or a reference implementation, with the original copyright never mentioned.
  • Modules ported from another language without telling — or asking — the original author.
  • Abandoned modules whose authors can't be reached and whose ownership chain is, in practice, lost.
  • License tags like "same terms as Perl 5.6.1" that point at a license text nobody has read in twenty years, and whose present-day legal force I personally wouldn't want to bet on.

None of this has blown up into a real copyright fight. That's not because CPAN is clean. It's because the licenses CPAN actually uses — Artistic/GPL, Artistic 2.0, MIT, Apache — are all permissive enough that, as I read them, a plaintiff would have a hard time showing they lost anything. It's because every one of those licenses comes with a warranty disclaimer that blocks most ways of suing in the first place. And it's because the culture of the ecosystem does not go looking for that kind of fight.

This is not a knock on CPAN. It's the cost of a thirty-year-old volunteer archive that chose breadth over lawyering. The trade has held up fine.

But it's the baseline you need to admit before you reject an AI-assisted PR on sight, on copyright grounds alone.

AI is not a new kind of copyright risk

Once you look honestly at how CPAN handles contributions, the AI copyright debate stops being about whether to let something dirty into a clean system. It becomes about whether to add one more flavor of provenance uncertainty to a system that has always run on several.

The copyright concerns people raise about AI are real:

  • The model might reproduce copyrighted code it was trained on, without attribution.
  • Parts of the output may not be copyrightable at all, which weakens the contributor's ability to license it to you.
  • The contributor may not have a clear legal claim to what they're granting.

Now compare those to the copyright concerns that have always been there for CPAN:

  • The contributor's employer might own the code under a work-for-hire arrangement and never agreed to release it.
  • Part of the module might be a near-copy of a GPL-incompatible reference implementation from a textbook or another project.
  • Ported-from-another-language code may have the original author's copyright still attached to it in ways the contributor never addressed.
  • Abandoned modules may have copyright holders who are unreachable and whose heirs could, in theory, assert rights tomorrow.

It's the same shape. We're not sure the human who signed for this code had the legal authority to grant the license they claimed, and if they didn't, we have no efficient way to find out. The source of that uncertainty has changed — it used to be a textbook or an employer, now it might also be a training set — but the failure mode is the same, and the way CPAN handles it is the same: the human takes responsibility, the warranty disclaimer catches what falls through, and the permissive license protects downstream users.

That's why the pragmatic camp in open source — Linux kernel, Red Hat, Apache, GitLab, OpenInfra — has landed where it has: allow AI-assisted contributions with commit-trailer disclosure (GitHub-native projects commonly use Co-authored-by:; the kernel uses Assisted-by:) and full human responsibility. The strict-ban camp — QEMU, NetBSD, Gentoo — quietly assumes the pre-AI baseline was clean. For projects with DCOs and CLAs, that assumption is at least defensible. For CPAN, it isn't.
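As a concrete sketch of what that disclosure convention looks like (the subject line, names, and addresses here are invented for illustration), a disclosed commit might carry trailers like:

```text
Normalize line endings in test fixtures

Drafted with AI assistance; reviewed, tested, and edited by the
committer, who takes responsibility for the change.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Jane Maintainer <jane@example.org>
```

Git treats trailing `Key: Value` lines as trailers, so tooling such as `git interpret-trailers` can read them back out of the log later.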

What actually changes: volume

There is one thing AI genuinely changes about the copyright picture, and it's the thing every CPAN maintainer should care about: volume.

CPAN's background rate of problematic code — borrowed from a book, copied out of work-for-hire, lifted from an incompatible-license reference implementation, and now potentially reproduced from a model's training set — has never been zero, and systematic provenance review has never been the norm on CPAN. What AI changes is that one contributor with a decent model and a decent tool can produce PRs at a rate no individual ever could before. Ship more, roll the dice more.

That's the change worth naming. Not a new kind of risk. Just more throws of the same dice — which justifies some proportional vigilance, not a categorical rejection.

How to actually evaluate an AI-assisted PR

When an AI-assisted PR lands on your module, the quality checks are the ones you already know how to run — does the change make sense, do the tests pass, does the style match the repo, does the contributor engage when you ask questions. None of that is AI-specific.

The copyright-specific checks are small and cheap:

  • Look for the obvious copy tells. Anything with copyright headers, SPDX tags, or author-name strings in it should bounce — those are signs of memorized training material. A quick GitHub search for any distinctive identifier or string literal will catch the worst cases. Scale this effort with the volume you're receiving.
  • Expect disclosure. Ask contributors to mark AI-assisted commits with a Co-authored-by: trailer — the standard GitHub convention, and one Red Hat names alongside Assisted-by: and Generated-by: as an acceptable option in its Nov 2025 analysis. The point is transparency: a reader of the log should be able to see which commits had AI help. Only humans should appear in Signed-off-by: lines; an AI has no standing to certify the DCO.
  • Hold the human accountable, not the tool — and be clear about which human. Who is asserting the license grant depends on the workflow. When an external contributor opens a PR, they are the one claiming the right to license their submission; the merge accepts that claim. When you are running an AI tool yourself against modules you maintain, the assertion and the acceptance collapse into a single merge click, and you are the accountable human on both sides. Third-party AI-assisted PRs sit in the normal contributor-plus-maintainer model; self-operated tools compress it. Either way, a human with legal standing is accountable somewhere. If you can't explain what you're merging — because you didn't review it, or because the contributor can't explain it either — don't merge it.
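The first check above can be partly mechanized. Here is a minimal sketch — the patterns are illustrative, not a vetted tell list, and the sample patch is invented; in practice you would point the grep at a real patch file:

```shell
# Sketch: flag added lines in a unified diff that carry obvious copy
# tells (copyright headers, SPDX tags, author-name strings).
# The sample patch below is invented for demonstration.
cat > changes.patch <<'EOF'
+# Copyright 2003 Example Corp. All rights reserved.
+# SPDX-License-Identifier: GPL-2.0-only
+my $total = 0;
EOF

# grep -n prints matching lines with line numbers; here it flags the
# first two added lines and ignores the plain code line.
grep -nE '^\+.*(Copyright|SPDX-License-Identifier|Author:)' changes.patch
```

A nonzero grep exit status means nothing matched, so a clean patch "fails" the grep — wire that logic accordingly if you script it into CI.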

The difference between a PR you merge and one you reject, on copyright grounds, isn't whether a model was involved. It's whether a human with legal standing to grant the license took responsibility for doing so — at submission, at merge, or both. That test is the same test CPAN has been running, implicitly, for thirty years.

One argument to retire

"CPAN was in the training data, so AI output is CPAN-compatible." Let it go. We don't know what's in Claude's training data, or GPT's, or anyone else's. We do know these models have read Perl from all over GitHub (including incompatible licenses), from Stack Overflow (CC BY-SA), from books, blogs, and anywhere else crawlable. "Trained on CPAN" is a comforting story, but we can't confirm it, and resting an argument on it makes the argument weaker, not stronger.

The honest claim is smaller and more useful. The contributor has reviewed the output. The contributor is accountable. The warranty disclaimer catches the residual risk the same way it has for three decades. That is what the kernel, Red Hat, Apache, and OpenInfra policies all quietly say. It's the claim you can defend — and it holds whether the tool in the contributor's hand is Claude, GPT, a generous colleague, or a textbook.

Closing

The copyright concern with AI-assisted contributions is real, but it isn't new. CPAN has been running on the same legal structure for thirty years — permissive licenses, warranty disclaimers, contributor accountability — and that structure handles AI-assisted contributions the same way it handles every other kind of provenance uncertainty CPAN has quietly carried since the beginning. The risk profile is the same. Only the volume is new.

Do the copyright-specific checks when the volume warrants them. Look for the obvious copy tells. Ask for Co-authored-by: disclosure. Hold the contributor accountable for the license grant they're making. Trust the warranty disclaimer that's been doing the heavy lifting the whole time.

And update your own internal picture of CPAN while you're at it: it was never a clean room on provenance, and pretending otherwise was always a story we told ourselves.


3 Comments

I've been goofing around with another angle. What do I do when I commit code that I asked something to generate and then merely directed toward improvements, rather than typing out the code myself?

For example, I was evaluating Claude Code, so I asked it to write a program to find all CLAUDE.md files on CPAN. I think I asked Claude for two improvements, like "no, I want every result" and "include the repo in the path". I made one code suggestion: use the printf arguments to specify the field widths.

If the other side was a human but I was responsible for the commit, what would I do? Well, I'd set the GIT_AUTHOR_* variables and commit it. This is more common than you think since I get contributions as patch files or code in a GitHub comment. It's also common that I don't have good author info, and sometimes trawl contributions or other repos to find other commits to see what info they used so the new commit is associated with the rest of the participation. I might record myself as a Co-Author if I think my contributions were meaningful, but otherwise, I'm just the committer.
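A minimal runnable sketch of that workflow (the names and the change are invented; the point is who ends up as author versus committer):

```shell
# Demo: commit a change with the contributor recorded as author and
# yourself as committer, using the GIT_AUTHOR_* environment variables.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name  "Maintainer"
git config user.email "maintainer@example.org"

echo "patched" > lib.txt
git add lib.txt

# Author identity comes from the environment; committer identity
# comes from the repo config set above.
GIT_AUTHOR_NAME="Some Contributor" \
GIT_AUTHOR_EMAIL="contributor@example.org" \
git commit -qm "Apply contributed fix"

git log -1 --format='%an (author) / %cn (committer)'
# prints: Some Contributor (author) / Maintainer (committer)
```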

Ignoring everything else, and aware that I'm doing this for the giggles, I made Claude Code (noreply@anthropic.com) the author, and hey, there's a GitHub user already!

And, like any other contribution, I do no work to check if there is any sort of copyright violation.

Part 2: New York State

A long time ago, the Perl standard library ran into some problems because the employer of one of the New York State authors asserted ownership of parts of Carp and Encode, and it was completely legal and above board. Ben Tilly explains it all on Perlmonks, and there's a Slashdot story.

I have not delved into how this changes the situation, but in 2023 New York State passed a series of laws that loosen employer rights and strengthen employee rights in inventions.

Most CPAN authors probably have no idea whether they actually legally own the code they wrote, even if they think they morally own it. And the plurality of jurisdictions involved makes this intractable: the code may violate copyright law in one jurisdiction but not another. CPAN does nothing to respond to jurisdictions. It's the same everywhere.

There is likely a lot of code on CPAN that legally "shouldn't" be there, and that goes unchallenged only because the true legal copyright holder does not know about it, does not care about it, or allows it despite the distribution carrying no notice of anyone but the human author.

I know I have some code on CPAN that sprang from works-for-hire. I always ask first, but I've never gone further than an oral agreement which I probably wouldn't be able to prove in court (mostly because I couldn't recall when or where or who).

However, I also do a fair amount of government work in unclassified contexts (so no a priori disclosure restrictions), which often means my output isn't copyrightable (see Learn about copyright and federal government materials for example). That might mean, for example, that I can't impose any license restrictions on that output.

SQLite is famously public domain because D. Richard Hipp made it that way. But, in some jurisdictions, I think the lack of an owner causes different problems (given enough jurisdictions, there will always be one that is the oddball).

"memorized training material" is the tell you do not really understand how machine - or human - learning works.

AI does music. Who pays all these authors the AI has learned from?

Yeah. Are they paid by other (human) musicians who explicitly state that they were influenced by them (and one can sometimes hear it)?

Not in the general case. Of course, if there are "verbatim passages of N notes", that may already be a different story — one that has kept and will keep the courts busy.

What I find really funny is the section "Look for the obvious copy tells". One such "tell" is — (as compared to -) and the article is full of it. Clearly AI written, yet I am missing the "Co-Authored by" disclaimer. (missing it != requiring it)

By the way, I may be wrong. I started putting — in my handwritten texts myself when it feels right, because it actually is the typographically correct thing to do, and people going all apeshit over it is just a bonus.

This is a good article and it is well written.

About licensing.

I have on this occasion and this occasion encouraged people to relicense their code under well-known permissive licenses.

Some open source licenses have significant compliance burdens on the user - AGPL being a good example.

Bespoke and obscure licenses make your code potentially unusable by anyone employed by a company with a well-developed FOSS license policy (i.e., they usually have an approved list, a forbidden list, and the rest requires legal to care enough to review it).

So AI to the side, the CPAN ecosystem could be better at helping people navigate licenses. Right now it doesn't at all.


About Todd Rinaldo

I blog about Perl.