The BigCode project (supported by Hugging Face) created an "AI" dataset with 67 TB of code, a lot of it from GitHub users who did not agree to this. Some even claim that private repositories are included. 91 of my repositories are in it, many without an open-source license, but no private ones. They provide an opt-out link, but only for "future versions", and it simply creates an issue in a GitHub repo. 99.8 % of them are still in "open" state, dating back to March 2023.
https://huggingface.co/spaces/bigcode/in-the-stack
scy
in reply to scy
Additional links:
Open opt-out requests:
https://github.com/bigcode-project/opt-out-v2/issues?page=20&q=is%3Aissue+%22opt-out+request%22
(yes, they're all publicly accessible)
The Stack dataset:
https://huggingface.co/datasets/bigcode/the-stack-v2
Claims about private repos being included:
https://post.lurk.org/@emenel/112111014479288871
(I can neither confirm nor deny this)
scy
in reply to scy
So, this dataset contains a shitload of copyrighted code that does not allow redistribution, let alone creating derivative works from it, and the authors seem to have no intention of rectifying this.
They treat the existing datasets as immutable, and appear to ignore opt-out requests.
If you have a Hugging Face account, you can report the Stack v2 dataset via the three-dots menu on the top right at https://huggingface.co/datasets/bigcode/the-stack-v2
scy
in reply to scy
Also note that while The Stack v1 contained code "from permissive licenses", v2 has extended this to "with permissive licenses or no license".
Yes, back when I was 16, I also thought that "no license" meant "no restrictions on what to do with it", but just to be clear: no, it means "you have no permission to do anything whatsoever".
Somebody please sue these guys into the ground?
scy
in reply to scy
According to the BigCode project, they are "a community project jointly led by Hugging Face and ServiceNow. Both organizations committed research, engineering, ethics, governance, and legal resources".
https://www.bigcode-project.org/docs/about/organization/
So, maybe say hello to these companies' legal departments too …
scy
in reply to scy
Wait, it's even worse. The dataset is based on @swheritage's archive, which contains way more than just GitHub (e.g. @Codeberg is archived, too).
I assumed Software Heritage was somewhat neutral, but they're praising the LLM usage of this unlicensed code:
https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/
Also, they're refusing to remove deadnames, even outright ignoring GDPR demands for it:
https://cohost.org/arborelia/post/5169338-the-software-heritag
I can only conclude that they're a bad actor and should be considered harmful by the #OpenSource community.
scy
in reply to scy
"Well, changing names in archived repos breaks the hashes, they can't do that!"
Yes, that's a technical limitation in Git.
And sure they can do that. Yes, even if it breaks stuff. Protecting the personality rights of innocent people, especially if they're marginalized, is always more important than having a pristine archive.
And I'm saying this as someone who keeps domains and URLs and repos online even though the projects ended decades ago.
But _people_ are always more important.
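To make the "breaks the hashes" point concrete, here is a minimal sketch, assuming Git's classic SHA-1 object format (repositories using SHA-256 work the same way with a different hash function). It hashes a commit the way Git does and shows that changing only the author name changes the commit ID, and with it the ID of every commit built on top. The helper names, the timestamp, and the names and email addresses are hypothetical and purely for illustration.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash an object the way Git does: SHA-1 over '<type> <size>\\0' + body."""
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree: str, author: str, parents=(), message="Initial commit\n") -> bytes:
    """Build the raw text of a commit object (hypothetical fixed timestamp)."""
    stamp = "1700000000 +0000"
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author} {stamp}")
    lines.append(f"committer {author} {stamp}")
    return ("\n".join(lines) + "\n\n" + message).encode()


# The ID of the empty tree, so the example needs no real files.
EMPTY_TREE = git_object_id("tree", b"")

# Same commit, only the name and email differ.
root_old = git_object_id("commit", commit_body(EMPTY_TREE, "Old Name <old@example.org>"))
root_new = git_object_id("commit", commit_body(EMPTY_TREE, "New Name <new@example.org>"))

# A follow-up commit by someone else. It never mentions the renamed author,
# but it records its parent's ID, so its own ID changes as well.
other = "Someone Else <else@example.org>"
child_old = git_object_id(
    "commit", commit_body(EMPTY_TREE, other, parents=[root_old], message="Second commit\n")
)
child_new = git_object_id(
    "commit", commit_body(EMPTY_TREE, other, parents=[root_new], message="Second commit\n")
)

print("root :", root_old, "->", root_new)
print("child:", child_old, "->", child_new)
```

Because each commit stores its parent's ID, a rename in one early commit changes the ID of every descendant, which is why forks, tags, and external references to the old IDs no longer match afterwards.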
scy
in reply to scy
Removing deadnames in Git repos is a major PITA. Hashes break. Forks exist, containing the name too. And the name appears not only in metadata, but also in file contents (readme, copyright notices, etc.).
Still: Your job as an archivist, or a software dev working on archival tools, is to _make it work_.
Don't complain about the cost of it. Not hurting people is worth any cost.
And don't place the burden on trans people. They struggle enough already. Go out of your way to help them.
Anchal
in reply to scy
Shamar
in reply to scy
Well, I'd argue that they represent #OpenSource pretty well, which was designed precisely to marginalize #FreeSoftware (a political movement of hackers) and to serve corporate interests: https://thebaffler.com/salvos/the-meme-hustler
What would you expect from an organization ethics-washing #Google, #Microsoft and so on?
https://www.softwareheritage.org/support/sponsors/
@swheritage @Codeberg