22 points

Anyone know why most have a 2021 internet data cutoff?

3 points

Hey, did you know your profile is set to appear as a bot and as a result many may be filtering your posts and comments? You can change this in your Lemmy settings.

Unless you are a bot… In which case where did you get your data?

4 points

The data wasn’t stolen, I can at least assure you of that

1 point

You paid Hoffman?

7 points

I think it’s just that most are based on ChatGPT, which cuts off at 2021.

19 points

Training from scratch and retraining are expensive. Also, they want to avoid training on ML outputs; they want primarily human-made works as samples, and since the initial public release of LLMs it has become harder to create large datasets without ML-generated stuff in them.

4 points

I recall spotting a few things about image generators having their training data contaminated with generated images, and the output becoming significantly worse. So yeah, I guess LLMs and IGAs need natural sources, or it gets more inbred than the Habsburgs.

0 points

I think it’s telling that they acknowledge that the stuff their bots churn out is often such garbage that training their bots on it would ruin them.

13 points

There was a good paper that came out recently saying that training on ML-generated data will result in a collapse of cohesion. It’s going to be real interesting. I don’t know if they’ll be able to train as easily ever again.

1 point

Where do you get that from? At least ChatGPT isn’t limited to data from 2021. I haven’t looked into other models.

4 points

Yeah, GPT-3.5 and some other FOSS models also say 2021.

5 points
8 points

GPT-3.5 is limited to 2021. GPT-4, 4.5, and the imaginary upcoming GPT-5 models are not, but that does not mean they aren’t limited in their own ways.

4 points

Are you sure those aren’t trained until 2021, frozen, and then fine-tuned on later data?


Memes

!memes@lemmy.ml
