The BigCode project (supported by Hugging Face) created an "AI" dataset with 67 TB of code, a lot of it from GitHub users who did not agree to this. Some even claim that private repositories are included. 91 of my repositories are in it, many without an open-source license, but no private ones. They provide an opt-out link, but only for "future versions", and it simply creates an issue in a GitHub repo. 99.8 % of these issues are still in the "open" state, dating back as far as March 2023.

https://huggingface.co/spaces/bigcode/in-the-stack

in reply to scy

Additional links:

Open opt-out requests:
https://github.com/bigcode-project/opt-out-v2/issues?page=20&q=is%3Aissue+%22opt-out+request%22
(yes, they're all publicly accessible)

The Stack dataset:
https://huggingface.co/datasets/bigcode/the-stack-v2

Claims about private repos being included:
https://post.lurk.org/@emenel/112111014479288871
(I can neither confirm nor deny this)

in reply to scy

So, this dataset contains a shitload of copyrighted code that does not allow redistribution, let alone creating derivative works from it, and the authors seem to have no intention of rectifying this.

They treat the existing datasets as immutable, and appear to ignore opt-out requests.

If you have a Hugging Face account, you can report the Stack v2 dataset via the three-dots menu on the top right at https://huggingface.co/datasets/bigcode/the-stack-v2

in reply to scy

Also note that while The Stack v1 contained code "from permissive licenses", v2 has extended this to "with permissive licenses or no license".

Yes, back when I was 16, I also thought that "no license" meant "no restrictions on what to do with it", but just to be clear: no, it means "you have no permission to do anything whatsoever".

Somebody please sue these guys into the ground?

in reply to scy

According to the BigCode project, they are "a community project jointly led by Hugging Face and ServiceNow. Both organizations committed research, engineering, ethics, governance, and legal resources".

https://www.bigcode-project.org/docs/about/organization/

So, maybe say hello to these companies' legal departments too …

in reply to scy

Wait, it's even worse. The dataset is based on @swheritage's archive, containing way more than just GitHub (e.g. @Codeberg is archived, too).

I assumed they were somewhat neutral, but they're praising the LLM usage of this unlicensed code:

https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/

Also, they're refusing to remove deadnames, even outright ignoring GDPR demands for it:

https://cohost.org/arborelia/post/5169338-the-software-heritag

I can only conclude that they're a bad actor and should be considered harmful by the #OpenSource community.

in reply to scy

"Well, changing names in archived repos breaks the hashes, they can't do that!"

Yes, that's a technical limitation in Git.

And sure they can do that. Yes, even if it breaks stuff. Protecting the personality rights of innocent people, especially if they're marginalized, is always more important than having a pristine archive.

And I'm saying this as someone who keeps domains and URLs and repos online even though the projects ended decades ago.

But _people_ are always more important.

in reply to scy

Removing deadnames in Git repos is a major PITA. Hashes break. Forks exist, containing the name too. And the name appears not only in metadata, but also in file contents (readme, copyright notices etc.)

Still: Your job as an archivist, or a software dev working on archival tools, is to _make it work_.

Don't complain about the cost of it. Not hurting people is worth any cost.

And don't place the burden on trans people. They struggle enough already. Go out of your way to help them.

in reply to scy

The best thing a person can do from the start is to use screen names, ones that have nothing to do with their name or identity, for everything. This should be taught from a young age.

in reply to scy

Well, I'd argue that they represent #OpenSource pretty well, which was designed precisely to marginalize #FreeSoftware (a political movement of hackers) and to serve corporate interests: https://thebaffler.com/salvos/the-meme-hustler

What would you expect from an organization ethics-washing #Google, #Microsoft and so on?
https://www.softwareheritage.org/support/sponsors/

@swheritage @Codeberg
