AI Code is Hollowing Out Open Source, and Maintainers are Looking the Other Way

yoasif@fedia.io · 2 months ago

chicken@lemmy.dbzer0.com · 2 months ago

but if they instead say that they copied the work into their LLM and produced a copy without protections (as chardet has done), the courts might be less willing to afford the project copyright protections if the project itself was making use of the same copyright stripping technology to strip others’ work to claim protections over copied work.

IANAL, but does it even work like that? Is there any specific reason to think it does? I don’t believe you really get credit for purity and fairness vibes in the legal system. The same goes for the idea that code could be considered public domain when it is ambiguous whether it is AI output. That seems kind of implausible; is there actually any reason to think the law works that way? If it did, then any copyrighted work not accompanied by proof of human authorship would be at risk, which would be uncharacteristic for a system focused on giving big copyright holders what they want without trouble.

the only code that may ultimately be protected is closed source code - you can’t copy it if you don’t have the source.

There is no way. Leaks happen and big tech companies have massive influence, so a situation where their code falls into the public domain as soon as the public gets their hands on it just isn’t realistic. I suspect that many of these concerns come from not wanting LLM code in open source projects for other reasons, rather than from a strong legal case that it represents a real and serious threat to copyleft licensing.

yoasif@fedia.io · 2 months ago

IANAL, but does it even work like that? Is there any specific reason to think it does? I don’t believe you really get credit for purity and fairness vibes in the legal system. The same goes for the idea that code could be considered public domain when it is ambiguous whether it is AI output. That seems kind of implausible; is there actually any reason to think the law works that way? If it did, then any copyrighted work not accompanied by proof of human authorship would be at risk, which would be uncharacteristic for a system focused on giving big copyright holders what they want without trouble.

I’m mostly just playing along with your thought experiment. As I said, we know that projects are already accepting LLM code into projects that are nominally copyleft.

There is no way. Leaks happen and big tech companies have massive influence, so a situation where their code falls into the public domain as soon as the public gets their hands on it just isn’t realistic.

If that is the case, is chardet 7.0.0 a derivative work of chardet, or is it a public domain LLM work? The whole LLM project is fraught with questions like these, but it seems that the vendors at least are counting on not copying leaked software and instead copying open source code that is publicly hosted.

Why is it okay to strip copyright from open source works but not from leaked closed source works?

We know that Disney is suing to protect its works - if it is true that LLM outputs are transformative, they should lose, as should any vendor whose leaked code was “transformed” by an LLM.

chicken@lemmy.dbzer0.com · 2 months ago

If that is the case, is chardet 7.0.0 a derivative work of chardet, or is it a public domain LLM work? The whole LLM project is fraught with questions like these

I think the reimplementation stuff is a separate question because the argument for it working looks a lot stronger, and because it doesn’t have anything to do with the source material having LLM output in it. Also if this method holds as legally valid, it’s going to be easier to just do that than justify copying code directly (which would probably have to only be copies of the explicitly generated parts of the code, requiring figuring out how to replace the rest), which means it won’t matter whether some portion of it was generated. I don’t see much reason to think that a purist approach to accepting LLM code will offer any meaningful protection.

I’m mostly just playing along with your thought experiment. As I said, we know that projects are already accepting LLM code into projects that are nominally copyleft.

So what though? If they aren’t entirely generated, you can’t make a full fork, and why would a partial fork be useful? If it isn’t disclosed what parts are AI, you can’t even do that without risking breaking the law.

yoasif@fedia.io · 2 months ago

I think the reimplementation stuff is a separate question because the argument for it working looks a lot stronger, and because it doesn’t have anything to do with the source material having LLM output in it. Also if this method holds as legally valid, it’s going to be easier to just do that than justify copying code directly (which would probably have to only be copies of the explicitly generated parts of the code, requiring figuring out how to replace the rest), which means it won’t matter whether some portion of it was generated.

Is it a separate question, though?

Both works are copyrighted: one is “all rights reserved” (our leaked commercial code) and the other is licensed under the LGPL. We’re putting both pieces of code inside the LLM and then asking the LLM to make a new version.

What makes the action of leaking different from the act of putting it on the web? Rights are reserved in either case.

If they aren’t entirely generated, you can’t make a full fork, and why would a partial fork be useful?

Well, people are contributing to copyleft codebases expecting that when others build on their work, the derivative works are licensed in the same way. You don’t need to fork for the value to be lost. People expected virality to be part of their contribution, and clearly the new derivative works are partially non-copyleft.

Beyond that, the more of the codebase is LLM-produced, the less of it is protected by the copyleft license, until we have a ship of Theseus situation where the codebase is available, but no longer copyleft. That is clearly not what was intended by e.g. the GPL. Just look at the Stallman quote in the post.