I am disappointed in Software Heritage.
They made this statement on using their archive as an AI training dataset: https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/?ref=openml.fyi
These seem like good principles. But they are not actually sufficient to respect our work. And the third is too weak, and appears to be providing a figleaf for extractive behavior.
Software Heritage Statement on Large Language Models for Code
Our mission at Software Heritage is to collect, preserve, and make publicly available the entire body of software, in the preferred form for making modifications to it. (www.softwareheritage.org)
see shy jo
in reply to see shy jo
By the way, I'd love for someone to tell me I've gotten some or all of this wrong! I really want to not lose my respect for SWH.
(No interest in debating LLM-as-copyright laundering here or ever tho. Or with any apologists for any corporations.)
James Just James
in reply to see shy jo
Stefano Zacchiroli
in reply to James Just James
Thanks James. Hello @joeyh .
I'd love to hear how you think the principles can be made stronger. (Disclosure: I've contributed inputs to those principles, but I'm not the decision maker.)
For context, my general take is that, given code LLMs exist anyway, we (= free software activists) need them to be free/open (in all their various parts) to create more free software.
see shy jo
in reply to Stefano Zacchiroli
That assumes 1) they will be useful, 2) they will be necessary, and 3) that software generated by any recombination of free software is itself free software.
Perhaps the first two are debatable, but the third is not under current law. So are you actually saying that you think that the copyright law foundation of free software will be upset by AI to the point that it will be feasible to accept a patch consisting of LLM generated code?
Stefano Zacchiroli
in reply to see shy jo
we don't know yet about (1) and (2) (there is preliminary science about it, but results are inconclusive on pros/cons). What we know is that code LLMs are now a tool that developers use, together with IDEs, compilers, etc.
We don't want free software to be at a disadvantage wrt proprietary software by not having access to them.
The interesting question is how do we build a FOSS-friendly code LLM. (Bonus point: how do we make it *disadvantage* proprietary software.) →
Gergely Nagy 🐁
in reply to Stefano Zacchiroli
One major advantage I think an open source LLM could have over proprietary ones is to be opt-in, rather than opt-out.
That would put a lot of minds at ease, and unlike the proprietary ones, would be considerably more FLOSS-friendly.
I'd like to think that not turning FLOSS developers against the LLM would be beneficial for everyone involved.
Stefano Zacchiroli
in reply to Gergely Nagy 🐁
I completely agree with the final goal you state. I'm trying to explore (unrelated to SWH) what the requirements are for a FOSS-friendly code LLM.
But note that part of the problem is that "FOSS" contains a very diverse group of people and goals. For instance, it seems to me that once again the lax/permissive vs copyleft split plays an important role here.
mirabilos
in reply to Stefano Zacchiroli
as someone very strongly in the BSD camp when it comes to licencing, I have a very strict ML/LLM policy.
The output is a function of the input, therefore its licence must be honoured. The licence gifts the work to the public under the small, not onerous, attribution requirement, so this attribution requirement, for which the authors give up so much of their rights, has a very high importance and must be honoured very strictly, more so than any individual requirement in a more complex licence (like Apache 2, Creative Commons, GPL family, EUPL, etc).
Stefano Zacchiroli
in reply to mirabilos
I'm very interested in your BSD camp take! In what form would you like to have attribution in the generated output?
One (silly) way to achieve that would be to create a huge file with *all* attributions from the training set, and *always* emit it for any output no matter what. It would be impractical (which might be a feature, if we are anti code LLM no matter what!). But if we ignore practicality for a moment, would you consider it acceptable?
mirabilos
in reply to Stefano Zacchiroli
that’s what other build systems for huge things do, e.g. for Android-based images (just the other day I saw my stuff apparently shows up in the Amazon Dot thing), and that would be fine.
As to what form, the licence says to reproduce it.
Stefano Zacchiroli
in reply to mirabilos
mirabilos
in reply to Stefano Zacchiroli
which part of that, the result? The concatenator (I’m hesitant to call it builder)? They basically have the equivalent of d/copyright (except less well done and nowhere near machine-readable) in every module and concatenate those on image build, and it shows up in the about box of Android, of in-car entertainment systems, etc. or on websites with docs for the images.
Only those modules that went into the build, ofc; as Shamar said, the others are not relevant.
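The concatenation step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how Android or any real build system does it: the `Module` type, the `build_notice_file` helper, and the module names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """Hypothetical module record: a name plus the notice text it ships with."""
    name: str
    notice: str


def build_notice_file(all_modules, built_names):
    """Concatenate the notices of only those modules that went into the build."""
    sections = []
    for module in sorted(all_modules, key=lambda m: m.name):
        if module.name in built_names:
            sections.append(f"== {module.name} ==\n{module.notice.strip()}\n")
    return "\n".join(sections)


# Hypothetical example data; none of these are real projects or notices.
modules = [
    Module("libfoo", "Copyright (c) Example Author A. Licence text goes here."),
    Module("libbar", "Copyright (c) Example Author B. Licence text goes here."),
    Module("libunused", "Copyright (c) Example Author C. Licence text goes here."),
]
notice_text = build_notice_file(modules, built_names={"libfoo", "libbar"})
print(notice_text)
```

The point of filtering on `built_names` is the one made in the post: modules that did not go into the image contribute no notice.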
Shamar
in reply to Stefano Zacchiroli
@zacchiro
It would be a huge misattribution, as most mentioned authors would have no impact on the specific code produced.
In the notorious case of GitHub Copilot distributing the code of Quake 3 Arena, it should have simply attributed it exactly to the author.
The same should happen for any transformation of such code (reordered lines, symbol renames and so on).
If the output mixes code from multiple authors, each and all of those authors (and only those authors) should be mentioned.
As for the license, such hypothetical software should only mix code under compatible licenses and distribute / output it with the proper copyright statements declared in each of the original sources.
In other words, a programmer using software to create a derivative work of one or more projects should obey the exact same rules followed by any other programmer directly doing the same.
Yet your proposal is reasonable for the LLM itself: it's a derived work of all the sources used to statistically program it, so it should be attributed to all the original authors and should strictly respect each of the source licenses, as any other derivative work must.
This would not be anti-LLM, just good sense: expensive automatic proxies should never put those who control them above the law.
@mirabilos
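To make the rule above concrete (attribute exactly the authors whose code contributed, and refuse to mix incompatible licenses), here is a minimal sketch. It assumes something current code LLMs do not provide: a reliable list of the sources that contributed to a given output. The `Source` type, the `attribute_output` helper, the example authors, and the tiny compatibility table are all hypothetical and deliberately incomplete; the table is an illustration, not legal advice, and anything not listed is rejected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical record of one source that contributed to an output."""
    author: str
    licence: str  # SPDX-style identifier
    notice: str   # copyright statement to reproduce


# Assumption for illustration only: pairs treated as combinable.
ASSUMED_COMPATIBLE_PAIRS = {frozenset({"MIT", "BSD-2-Clause"})}


def licences_compatible(a, b):
    """Same licence, or a pair listed in the assumed table."""
    return a == b or frozenset({a, b}) in ASSUMED_COMPATIBLE_PAIRS


def attribute_output(contributing_sources):
    """Return the notices for exactly the contributing sources, or raise."""
    licences = sorted({s.licence for s in contributing_sources})
    for i, first in enumerate(licences):
        for second in licences[i + 1:]:
            if not licences_compatible(first, second):
                raise ValueError(f"incompatible licences: {first} and {second}")
    notices = []
    for source in contributing_sources:
        if source.notice not in notices:  # each notice once, in input order
            notices.append(source.notice)
    return "\n".join(notices)


alice = Source("Alice Example", "MIT", "Copyright (c) Alice Example (hypothetical)")
bob = Source("Bob Example", "BSD-2-Clause", "Copyright (c) Bob Example (hypothetical)")
print(attribute_output([alice, bob]))
```

Authors whose code did not contribute are simply never passed in, so they are never mentioned, which is the misattribution concern raised in the post.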