Friendica in Luxembourg

see shy jo

8 months ago • •

I am disappointed in Software Heritage.

They made this statement on using their archive as an AI training dataset: https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/?ref=openml.fyi

These seem like good principles. But they are not actually sufficient to respect our work. And the third is too weak, and appears to be providing a fig leaf for extractive behavior.

Software Heritage Statement on Large Language Models for Code

Our mission at Software Heritage is to collect, preserve, and make publicly available the entire body of software, in the preferred form for making modifications to it.

www.softwareheritage.org

see shy jo

in reply to see shy jo • 8 months ago • •

By the way, I'd love for someone to tell me I've gotten some or all of this wrong! I really want to not lose my respect for SWH.

(No interest in debating LLM-as-copyright laundering here or ever tho. Or with any apologists for any corporations.)

This entry was edited (8 months ago)

James Just James

in reply to see shy jo • 8 months ago • •

@zacchiro might be a knowledgeable person to discuss with.

@Stefano Zacchiroli

Stefano Zacchiroli

in reply to James Just James • 8 months ago • •

Thanks James. Hello @joeyh .

I'd love to hear how you think the principles can be made stronger. (Disclosure: I've contributed inputs to those principles, but I'm not the decision maker.)

For context, my general take is that: given code LLMs exist anyway, we (= free software activists) need them to be free/open (in their various parts) to create more free software.

@see shy jo

see shy jo

in reply to Stefano Zacchiroli • 8 months ago • •

That assumes 1) they will be useful 2) they will be necessary and 3) that software generated by any recombinations of free software is itself free software.

Perhaps the first two are debatable, but the third is not under current law. So are you actually saying that you think that the copyright law foundation of free software will be upset by AI to the point that it will be feasible to accept a patch consisting of LLM generated code?

Stefano Zacchiroli

in reply to see shy jo • 8 months ago • •

We don't know yet about (1) and (2) (there is preliminary science about it, but results are inconclusive on pros/cons). What we know is that code LLMs are now a tool that developers use, together with IDEs, compilers, etc.

We don't want free software to be at a disadvantage wrt proprietary software in not having access to them.

The interesting question is how do we build a FOSS-friendly code LLM. (Bonus point: how do we make it *disadvantage* proprietary software.) →

Gergely Nagy 🐁

in reply to Stefano Zacchiroli • 8 months ago • •

One major advantage I think an open source LLM could have over proprietary ones is to be opt-in, rather than opt-out.

That would put a lot of minds at ease, and unlike the proprietary ones, would be considerably more FLOSS-friendly.

I'd like to think that not turning FLOSS developers against the LLM would be beneficial for everyone involved.

Stefano Zacchiroli

in reply to Gergely Nagy 🐁 • 8 months ago • •

I completely agree with the final goal you state. I'm trying to explore (unrelated to SWH) what the requirements are for a FOSS-friendly code LLM.

But note that part of the problem is that "FOSS" contains a very diverse group of people and goals. For instance, it seems to me that once again the lax/permissive vs copyleft split plays an important role here.

mirabilos

in reply to Stefano Zacchiroli • 8 months ago • •

As someone very strongly in the BSD camp when it comes to licencing, I have a very strict ML/LLM policy.

The output is a function of the input, therefore its licence must be honoured. The licence gifts the work to the public under the small, not onerous, attribution requirement, so this attribution requirement, for which the authors give up so much of their rights, has a very high importance and must be honoured very strictly, more so than any individual requirement in a more complex licence (like Apache 2, Creative Commons, GPL family, EUPL, etc).

Stefano Zacchiroli

in reply to mirabilos • 8 months ago • •

I'm very interested in your BSD camp take! In what form would you like to have attribution in the generated output?

One (silly) way to achieve that would be to create a huge file with *all* attributions from the training set, and *always* emit it for any output no matter what. It would be impractical (which might be a feature, if we are anti code LLM no matter what). But if we ignore practicality for a moment, would you consider it acceptable?

mirabilos

in reply to Stefano Zacchiroli • 8 months ago • •

That’s what other build systems for huge things do, e.g. for Android-based images (just the other day I saw my stuff apparently shows up in the Amazon Dot thing), and that would be fine.

As to what form, the licence says to reproduce it.

Stefano Zacchiroli

in reply to mirabilos • 8 months ago • •

I'd love a pointer to the Android example you mention, if you still have it.

mirabilos

in reply to Stefano Zacchiroli • 8 months ago • •

Which part of that, the result? The concatenator (I’m hesitant to call it builder)? They basically have the equivalent of d/copyright (except less well done and nowhere near machine-readable) in every module and concatenate those on image build, and it shows up on the about box of Android, of in-car entertainment systems, etc. or on websites with docs for the images.

Only those modules that went into the build, ofc; as Shamar said, the others are not relevant.
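
To make the concatenation step concrete, here is a minimal sketch. It is not Android's actual build tooling; the module names, the notice texts, and the `build_notice` helper are all made up for illustration. It only shows the idea of emitting the notices of exactly those modules that went into a given build.

```python
# Hypothetical per-module notices, standing in for the d/copyright-like
# file each module would carry. Names and texts are invented.
NOTICES = {
    "libfoo": "Copyright (c) Example Author A\nLicence text A.",
    "libbar": "Copyright (c) Example Author B\nLicence text B.",
    "libbaz": "Copyright (c) Example Author C\nLicence text C.",
}


def build_notice(modules_in_build, notices):
    """Concatenate the notices of the modules that went into the build.

    Modules that are known but not part of this build are left out.
    A module without a recorded notice is treated as an error, since
    silently skipping it would drop an attribution.
    """
    sections = []
    for name in modules_in_build:
        if name not in notices:
            raise KeyError(f"no notice recorded for module {name!r}")
        sections.append(f"== {name} ==\n{notices[name].strip()}")
    return "\n\n".join(sections) + "\n"


# Only libfoo and libbar are in this (hypothetical) image, so the
# notice for libbaz is not emitted.
out = build_notice(["libfoo", "libbar"], NOTICES)
print(out)
```

Failing on a missing notice rather than skipping it is a design choice of this sketch: the attribution is the one thing the licence asks for, so the build should not succeed without it.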

Shamar

in reply to Stefano Zacchiroli • 8 months ago • •

@zacchiro

It would be a huge misattribution, as most mentioned authors would have no impact on the specific code produced.

In the notorious case of GitHub Copylot distributing the code of Quake 3 Arena, it should have simply attributed it exactly to the author.

The same should happen for any transformation of such code (reordered lines, symbol renames and so on).

If the output mixes code from multiple authors, each and all of those authors (and only those authors) should be mentioned.

As for license, such hypothetical software should only mix code from compatible licenses and distribute / output it with the proper copyright statements declared in each of the original sources.

In other words, a programmer using software to create a derivative work of one or more projects should obey the exact same rules followed by any other programmer directly doing the same.

Yet your proposal is reasonable for the LLM itself: it's a derived work of all the s

@Stefano Zacchiroli @mirabilos
