Content Warning

Friends, for something to be open source, we need to see

1. The data it was trained and evaluated on

2. The code

3. The model architecture

4. The model weights.

DeepSeek only gives #3 and #4. And I'll see the day that anyone gives us #1 without being forced to do so, because all of them are stealing data.

@timnitGebru not universal, but a lot of models, including major ones, give #1 on Hugging Face, FYI; it's in the model card info. (Not always easy to find though; often you have to go back to the base model.)

Like 99% of models though, you'll find it's just Common Crawl, Wikipedia, and WebText2, with some specialized datasets used on top of that for more specialized models (like GitHub for the code models)

@timnitGebru
Just came across this paper, published in *Nature*, no less, which is probably a good background read on the topic:

https://www.nature.com/articles/s41586-024-08141-1

Full disclosure: I'm not deep enough into the topic to properly get it. The abstract speaks volumes already, though: "open" AI is usually not actually open, and even then it can still be a big problem.

@timnitGebru thank you for sharing.

Quick question: have you seen this initiative, the "European Open-Source AI Index"?

https://www.osai-index.eu

By @dingemansemark & @andreasliesenfeld from @Radboud_uni

Looks good to me, to help people determine how open an AI model actually is. Are you aware of other initiatives like this?

I'd like to gather these initiatives and share them with @publicspaces so more people learn what to look for to determine if an AI is truly open source.

#AI #OpenSource

@timnitGebru Most people in various tech communities on the internet don't even know that they can't view the source code of 'open source' DeepSeek. Makes me wonder if anyone is even reviewing the code of actual open source projects nowadays to verify their claims.

@timnitGebru based

@festal

i have never understood what people mean when they talk about the "stolen data" of AI. every search engine crawls the net, the internet data base backs versions of webpages up, the idea of open source is based on re-using and modifying "data" of others, artists make collages from other images, the whole concept of "knowledge" is based on the use of other knowledge... so, what exactly does AI do differently?