

I am disappointed in Software Heritage.

They made this statement on using their archive as an AI training dataset: https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/?ref=openml.fyi

These seem like good principles. But they are not actually sufficient to respect our work. And the third is too weak, and appears to be providing a figleaf for extractive behavior.

in reply to see shy jo

By the way, I'd love for someone to tell me I've gotten some or all of this wrong! I really want to not lose my respect for SWH.

(No interest in debating LLM-as-copyright laundering here or ever tho. Or with any apologists for any corporations.)

This entry was edited (8 months ago)
in reply to James Just James

Thanks James. Hello @joeyh .

I'd love to hear how you think the principles can be made stronger. (Disclosure: I've contributed inputs to those principles, but I'm not the decision maker.)

For context, my general take is that: given code LLMs exist anyway, we (= free software activists) need them to be free/open (in its various parts) to create more free software.

in reply to Stefano Zacchiroli

That assumes 1) they will be useful 2) they will be necessary and 3) that software generated by any recombinations of free software is itself free software.

Perhaps the first two are debatable, but the third is not under current law. So are you actually saying that you think that the copyright law foundation of free software will be upset by AI to the point that it will be feasible to accept a patch consisting of LLM generated code?

in reply to see shy jo

we don't know yet about (1) and (2) (there is preliminary science about it, but results are inconclusive on pros/cons). What we know is that code LLMs are now a tool that developers use, together with IDEs, compilers, etc.

We don't want free software to be at a disadvantage wrt proprietary software by not having access to them.

The interesting question is how do we build a FOSS-friendly code LLM. (Bonus point: how do we make it *disadvantage* proprietary software.) →

in reply to Stefano Zacchiroli

One major advantage I think an open source LLM could have over proprietary ones is to be opt-in, rather than opt-out.

That would put a lot of minds at ease, and unlike the proprietary ones, would be considerably more FLOSS-friendly.

I'd like to think that not turning FLOSS developers against the LLM would be beneficial for everyone involved.

in reply to Gergely Nagy 🐁

I completely agree with the final goal you state. I'm trying to explore (unrelated to SWH) what the requirements are for a FOSS-friendly code LLM.

But note that part of the problem is that "FOSS" contains a very diverse group of people and goals. For instance, it seems to me that once again the lax/permissive vs copyleft split plays an important role here.

in reply to Stefano Zacchiroli

as someone very strongly in the BSD camp when it comes to licencing, I have a very strict ML/LLM policy.

The output is a function of the input, therefore its licence must be honoured. The licence gifts the work to the public under the small, not onerous, attribution requirement, so this attribution requirement, for which the authors give up so much of their rights, has a very high importance and must be honoured very strictly, more so than any individual requirement in a more complex licence (like Apache 2, Creative Commons, GPL family, EUPL, etc).

in reply to mirabilos

I'm very interested in your BSD camp take! In what form would you like to have attribution in the generated output?

One (silly) way to achieve that would be to create a huge file with *all* attributions from the training set, and *always* emit it for any output no matter what. It would be impractical (which might be a feature, if we are anti code LLM no matter what). But if we ignore practicality for a moment, would you consider it acceptable?

in reply to Stefano Zacchiroli

that’s what other build systems for huge things do, e.g. for Android-based images (just the other day I saw my stuff apparently shows up in the Amazon Dot thing), and that would be fine.

As to what form, the licence says to reproduce it.

in reply to mirabilos

I'd love a pointer to the Android example you mention, if you still have it.

in reply to Stefano Zacchiroli

which part of that, the result? The concatenator (I’m hesitant to call it builder)? They basically have the equivalent of d/copyright (except less well done and nowhere near machine-readable) in every module and concatenate those on image build, and it shows up on the about box of Android, of in-car entertainment systems, etc. or on websites with docs for the images.

Only those modules that went into the build, ofc; as Shamar said, the others are not relevant.
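
To make the concatenation idea concrete, here is a minimal sketch of how such a step might work. It is not the actual Android tooling: the `Module` type, the `build_notice` function, and the module names are all hypothetical, and it only assumes that each module ships its own notice text and that the build knows which modules it included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    name: str    # hypothetical module identifier
    notice: str  # licence and attribution text shipped with the module


def build_notice(all_modules, built_names):
    """Concatenate the notices of only those modules that went into the build."""
    sections = []
    # Sort by name so the resulting file is stable between builds.
    for mod in sorted(all_modules, key=lambda m: m.name):
        if mod.name not in built_names:
            continue  # modules that were not built are not relevant
        sections.append(f"== {mod.name} ==\n{mod.notice.strip()}\n")
    return "\n".join(sections)


if __name__ == "__main__":
    modules = [
        Module("libfoo", "Copyright (c) Foo Authors\nPermission is granted ..."),
        Module("libbar", "Copyright (c) Bar Authors\nRedistribution ..."),
        Module("unused", "Copyright (c) Nobody"),
    ]
    print(build_notice(modules, built_names={"libfoo", "libbar"}))
```

The "emit everything from the training set" variant discussed above would amount to passing every module name in `built_names`.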

