AI Code is Hollowing Out Open Source, and Maintainers are Looking the Other Way

yoasif@fedia.io · 2 months ago

AI Code is Hollowing Out Open Source, and Maintainers are Looking the Other Way

slacktoid@lemmy.ml · 2 months ago

The way I see it is, and not saying this isn’t a valid concern, it that it still doesn’t help with code maintenance. Just cause you can create it doesn’t mean you can maintain it. Many companies moved to open source (not free software) cause of the financial incentives of security and long term maintainability of the codebase. Think of how much better say tensorflow and pytorch got because it was opensource. The engineers of Google and meta could make it what were their reasons for open sourcing it? I doubt these reasons have changed with ai. Cause nothing beats free Q&A testing.

Lime Buzz (fae/she)@beehaw.org · 2 months ago

I know this is about copyright/copyleft and stuff. However, I’m tired of seeing previously good and working projects start degrading in performance.

We recently switched from one media player to two because the former contained LLM regurgitated code and the replacement didn’t. Guess which one worked better? If you guessed the two replacements you’d be correct.

I actually asked some projects to move when Microsoft showed their hand and lack of ethics and complicity with both LLM (copilot specifically) gobbling up data and genocide.

The maintainers seemed unbothered by that and more concerned with their ability to be found, which to me says a lot about what priorities they have.

I deleted my github account and won’t be back, and in the meantime hope that open source projects realise that moving to a non AI infested coding host/forge like codeberg or self-hosted forgejo will be better in the long run. I don’t hold much hope but we’ll see.

All in all this has been a mess, both for users of open source software and coders.

Users because the regurgitated code does create errors and degrades performance. The degradation of performance could be said to be true of the maintainers/coders too as the more they rely on LLMs/ML the less belief they will have in their own work, and they will lose those skills even if they don’t lose that belief as has been well documented now.

It’s a sad state of affairs and I hope the ‘AI’ bubble bursts and we can go back to having decently coded programs etc again, in the mean time I’ll keep seeking out those that have explicitly said no to regurgitated code and promoting them over those who sycophantically support or rely on the bullshit machines.

Buelldozer@lemmy.today · 2 months ago

This is a fast path to open source irrelevancy, since the US copyright office has deemed LLM outputs to be uncopyrightable.

This is a misunderstanding of US Copyright. Here’s a link to the compendium so you can verify for yourself.

Section 313 says “Although uncopyrightable material, by definition, is not eligible for copyright protection, the Office may register a work that contains uncopyrightable material, provided that the work as a whole contains other material that qualifies as an original work of authorship…”

This means that LLM created code that’s embedded in a larger work may be registered.

Section 313.2 says “Similarly, the Office will not register works produced by a machine or mere mechanical process that operates randomly or automatically without any creative input or intervention from a human author.”

Meaning that LLM created code CAN be registered as long as an author has some creative input or intervention in the process. I’d posit that herding an LLM system to create the code definitely qualifies as “creative input or intervention”. If someone feels it isn’t then all they need to do is change something, literally anything, and suddenly it becomes a derivative work of an uncopyrighted source and the derivative can then be registered (to a human) and be subject to copyright.

In short, it’s fine. Take a breath.

LukeZaz@beehaw.org · 2 months ago

In short, it’s fine. Take a breath.

Ehhhhh, that depends on how you take it. Personally, no, I’m not very worried about the legal aspect. But,

It’s still LLMs. FOSS communities have been better than average, but that bar is a low one considering coders generally have been using LLMs most of all. And LLM usage is reckless, not to mention presently harmful in numerous ways. (And yes, this means the latest models too. “Looks good” doesn’t mean it is good.) I’d just as soon FOSS not use the tech at all.

chicken@lemmy.dbzer0.com · 2 months ago

The only portions of the work that can be copyrighted are the actual creative work the person has put into the work.

Ok, but it’s not like everyone is documenting exactly which parts are generated, curated, or human written.

Maintainers cannot prevent the LLM code from being incorporated into closed source projects without reciprocity

Say someone incorporates GPL code without attribution, and gets sued for doing so. They try to make the argument in court that the source material they used is not copyrighted, because of AI. Won’t they have to prove that the parts they used were actually AI output for this defense to work? It isn’t like people are going around ignoring the copyright on things in general if they look like they were probably generated with AI, that isn’t enough to be safe from prosecution, because you usually can’t know the exact breakdown. It seems like preventing this loophole from being used would be as simple as keeping it ambiguous and not allowing submissions that positively affirm being entirely AI generated.

yoasif@fedia.io · 2 months ago

I don’t really think we need to go down the copyfraud path to see that AI code damages copyleft projects no matter what - we know that some projects are already accepting AI generated code, and they don’t ask you to hide it - it is all in the open.

chicken@lemmy.dbzer0.com · 2 months ago

AI code damages copyleft projects no matter what - we know that some projects are already accepting AI generated code, and they don’t ask you to hide it - it is all in the open.

I don’t see how that follows or contradicts what I’m saying though. They could hide it, easily. Even if they don’t hide it, how useful would it really ever be to only use the portions of the codebase that have been labelled as having been AI generated? Can one even rely on those labels? Making use of the non-copyrightability of AI output to copy code in otherwise unauthorized ways does not seem like a straightforward or legally safe thing to do. That’s especially the case because high profile proprietary software projects also make heavy use of AI, it doesn’t seem likely the courts will support a legal precedent that strips those projects of copyright and allow anyone to use them for whatever. So basically I’m not at all convinced about the idea that AI code damages copyleft projects, it seems unlikely to be a problem in practice.

yoasif@fedia.io · 2 months ago

Making use of the non-copyrightability of AI output to copy code in otherwise unauthorized ways does not seem like a straightforward or legally safe thing to do. That’s especially the case because high profile proprietary software projects also make heavy use of AI, it doesn’t seem likely the courts will support a legal precedent that strips those projects of copyright and allow anyone to use them for whatever.

I think what may happen in practice could be worse - basically if we can’t tell whether some code is the work of a human, but the project accepts AI code, if there we forego the analysis of whether something was produced by a human, the entire project may be deemed public domain – perhaps after a certain date (when LLM contributions were welcomed).

Beyond that, by integrating LLM code into those projects, the projects are signifying assent to their works to be consumed by LLMs for infringement of the whole work - not just the LLM produced portions - it is hard to be doctrinaire about adherence to the open source license when the maintainers themselves are violating it.

We may see a future where copyrights for works become more like trademarks - if you don’t make any attempt to protect your work from piracy, you may simply lose the right to contest its theft.

Obviously, it is as you say - today the courts may smile upon a GPL project where a commercial vendor copied and released as their own without sharing alike - but if they instead say that they copied the work into their LLM and produced a copy without protections (as chardet has done), the courts might be less willing to afford the project copyright protections if the project itself was making use of the same copyright stripping technology to strip others’ work to claim protections over copied work.

Besides which, “authored by Claude” seems like a pretty easy way to find public domain code, and as Malus presents, the only code that may ultimately be protected is closed source code - you can’t copy it if you don’t have the source.

The diversion of “people may try to pass of LLM code as their own” is a nice diversion, but ancillary to the existing situation where projects are incorporating public domain code as licensed. We can start there before we start worrying about fraud.

chicken@lemmy.dbzer0.com · 2 months ago

but if they instead say that they copied the work into their LLM and produced a copy without protections (as chardet has done), the courts might be less willing to afford the project copyright protections if the project itself was making use of the same copyright stripping technology to strip others’ work to claim protections over copied work.

ianal but does it even work like that? Is there any specific reason to think it does? I don’t believe you really get credit for purity and fairness vibes in the legal system. Same goes for the idea that code where it is ambiguous whether it is AI output could be considered public domain, seems kind of implausible, is there actually any reason to think the law works that way? If it did, then any copyrighted work not accompanied by proof of human authorship would be at risk, uncharacteristic for a system focused on giving big copyright holders what they want without trouble.

the only code that may ultimately be protected is closed source code - you can’t copy it if you don’t have the source.

There is no way, leaks happen, big tech companies have massive influence, a situation where their code falls into the public domain as soon as the public gets their hands on it just isn’t realistic. I feel suspicious that many of these concerns are coming from a place of not wanting LLM code in open source projects for other reasons, rather than the existence of a strong legal case that it represents a real and serious threat to copyleft licensing.

yoasif@fedia.io · 2 months ago

ianal but does it even work like that? Is there any specific reason to think it does? I don’t believe you really get credit for purity and fairness vibes in the legal system. Same goes for the idea that code where it is ambiguous whether it is AI output could be considered public domain, seems kind of implausible, is there actually any reason to think the law works that way? If it did, then any copyrighted work not accompanied by proof of human authorship would be at risk, uncharacteristic for a system focused on giving big copyright holders what they want without trouble.

I’m mostly just playing along with your thought experiment. As I said, we know that projects are already accepting LLM code into projects that are nominally copyleft.

There is no way, leaks happen, big tech companies have massive influence, a situation where their code falls into the public domain as soon as the public gets their hands on it just isn’t realistic.

If that is the case, is chardet 7.0.0 a derivative work of chardet, or is it a public domain LLM work? The whole LLM project is fraught with questions like these, but it seems that the vendors at least are counting on not copying leaked software and instead copying open source code that is publicly hosted.

Why is it okay to strip copyright from open source works but not from leaked closed source works?

We know that Disney is suing to protect its works - if it is true that LLM outputs are transformative, they should lose, as should any vendor whose leaked code was “transformed” by an LLM.

chicken@lemmy.dbzer0.com · 2 months ago

If that is the case, is chardet 7.0.0 a derivative work of chardet, or is it a public domain LLM work? The whole LLM project is fraught with questions like these

I think the reimplementation stuff is a separate question because the argument for it working looks a lot stronger, and because it doesn’t have anything to do with the source material having LLM output in it. Also if this method holds as legally valid, it’s going to be easier to just do that than justify copying code directly (which would probably have to only be copies of the explicitly generated parts of the code, requiring figuring out how to replace the rest), which means it won’t matter whether some portion of it was generated. I don’t see much reason to think that a purist approach to accepting LLM code will offer any meaningful protection.

I’m mostly just playing along with your thought experiment. As I said, we know that projects are already accepting LLM code into projects that are nominally copyleft.

So what though? If they aren’t entirely generated, you can’t make a full fork, and why would a partial fork be useful? If it isn’t disclosed what parts are AI, you can’t even do that without risking breaking the law.

yoasif@fedia.io · 2 months ago

I think the reimplementation stuff is a separate question because the argument for it working looks a lot stronger, and because it doesn’t have anything to do with the source material having LLM output in it. Also if this method holds as legally valid, it’s going to be easier to just do that than justify copying code directly (which would probably have to only be copies of the explicitly generated parts of the code, requiring figuring out how to replace the rest), which means it won’t matter whether some portion of it was generated.

Is it a separate question, though?

Both works are copyrighted, one is just copyrighted as “all rights reserved” (our leaked commercial code) and the rest is licensed as LGPL. We’re putting both pieces of code inside the LLM and then asking the LLM to make a new version.

What makes the action of leaking different from the act of putting it on the web? Rights are reserved in either case.

If they aren’t entirely generated, you can’t make a full fork, and why would a partial fork be useful?

Well, people are contributing to copyleft codebases expecting that when people build on their work, that work (the derivative works) are also licensed in the same way. You don’t need to fork for the value to be lost. People expected virality to be part of their contribution, and clearly the new derivative works are partially non-copyleft.

Beyond that, as more of the codebase is LLM produced, the less of it is protected by the copyleft license, until we have a ship of Theseus situation where the codebase is available, but no longer copyleft. That is clearly not what was intended by e.g. the GPL. Just look at the Stallman quote in post.

t3rmit3@beehaw.org · edit-2 2 months ago

This is a fast path to open source irrelevancy, since the US copyright office has deemed LLM outputs to be uncopyrightable.

Open source != copyrighted. Public domain source code is also open source.

I hate this trend I see of the FOSS movement retreating from the foundational principle that it started on: Free Sharing of Software.

Not shareware, not ‘libre but not gratis’, not ‘buy me a coffee to get access to the code on my patreon’, not ‘free to look at but not to use as source code’: free period. Libre and gratis.

These non-lawyers traipsing in to make claims about the effect of AI on open source licensing are giving me big “I release my code but only if I can 1) get paid for it and 2) control who and how it’s used” vibes. That’s what’s ‘hollowing-out’ open source.

value leaks out of the project

What value? Value to whom? The value of source code is what it does, i.e. the program it compiles or is interpreted into. That doesn’t change by someone else using it differently than you. Google taking Linux and spinning off Android doesn’t “hurt” Linux. It doesn’t decrease the ‘value’. There’s no universal counter out there that says, “this GPLv2 attribution appears more than someone else’s, so therefore this project is more valuable”, that is being eroded if a company goes and uses it without reprinting the license notice as well. OSS licenses have never prevented that.

I said it before the last time FOSS came up, and I’ll say it again:

FOSS is about propagating software to as many people as possible, to help as many people as possible. It’s not about creating legal barriers to diminish the power of corporations; making tools available to people that are better and cheaper will do that naturally (and you were never going to beat the corpo lawyers anyways trying to enforce licenses).

If your zeal to prevent corporations from ever misusing FOSS leads you to remove some aspect of it (free, open, or source), then you’ve cut off your nose to spite your face.

yoasif@fedia.io · 2 months ago

I said I was focusing on copyleft, cool that you ignored the entire post though. 😑

t3rmit3@beehaw.org · 2 months ago

Perhaps you should have titled the post “AI Code Hollowing Out Copyleft Ecosystem”, then, unless you’re intentionally trying to conflate Open Source with Copyleft (you are, based on your other blog posts). But I remember seeing your post about the “social contract” of OSS last December, and you are in fact exactly who my comment is about:

Copyleft is a reactionary movement from people who turned into the beast they hated in trying to fight it. “Permissive” licenses are FOSS. Copyleft is certainly maybe OSS, but it’s not “Free” (as in either “libre” or “gratis”) if some other person can mandate both that you do something, and what you do. If usage of something is contingent on payment (including payment via feel-good attribution), it’s not free.

I’ll add here: FOSS is also not about some one-sided “covenant” where a creator believes the users of said freely-given software owe them something (money, gratitude, or even just ‘reciprocity’ and attribution). If you’re in OSS for the fuzzy feeling you get when someone forks your repo, or the conviction that OSS contribs are intrinsically good in some nebulous way, it’s no wonder you’re hung up on seeing a transactional return on your labor instead of just knowing it’s out there maybe helping someone, somewhere.

yoasif@fedia.io · 2 months ago

I know what copyleft licenses are about, that was covered in the post - if you read it. If you are saying that you are making long comments without reading the post, great I guess, but not super interesting (to me).

I’m not really interested in getting into an argument around license choice because I wasn’t advocating for any particular license (like you seem to be).

t3rmit3@beehaw.org · 2 months ago

If you are saying that you are making long comments without reading the post

I didn’t say that. I read the post, but did not click through to the other post from December to realize it was part of a pattern.

I’m not really interested in getting into an argument around license choice

That’s fine, you are not beholden to respond to me.