lemm.ee

1.2K

Glad this is becoming a meme (fedia.io)

posted 8 months ago by lledrtx@lemmy.world in !memes@lemmy.ml

dislocate_expansion@reddthat.com (bot)

22 points

8 months ago

Anyone know why most have a 2021 internet data cutoff?

Natanael@slrpnk.net

19 points

8 months ago

Training from scratch and retraining are both expensive. Also, they want to avoid training on ML outputs; they want primarily human-made works as samples, and since the initial public release of LLMs it has become harder to create large datasets without ML stuff in them.

Scrubbles@poptalk.scrubbles.tech

13 points

8 months ago

There was a good paper that came out recently saying that training on ML data will result in a collapse of cohesion. It’s going to be real interesting; I don’t know if they’ll be able to train as easily ever again.

Iron Lynx@lemmy.world

4 points

8 months ago

I recall spotting a few things about image generators having their training data contaminated with generated images, and the output becoming significantly worse. So yeah, I guess LLMs and IGAs need natural sources, or it gets more inbred than the Habsburgs.

TurtleJoe@lemmy.world

0 points

8 months ago

I think it’s telling that they acknowledge that the stuff their bots churn out is often such garbage that training their bots on it would ruin them.

Donkter@lemmy.world

7 points

8 months ago

I think it’s just that most are based on ChatGPT, which cuts off at 2021.

can@sh.itjust.works

3 points

8 months ago

Hey, did you know your profile is set to appear as a bot and as a result many may be filtering your posts and comments? You can change this in your Lemmy settings.

Unless you are a bot… In which case where did you get your data?

dislocate_expansion@reddthat.com (bot)

4 points

8 months ago

The data wasn’t stolen, I can at least assure you of that

can@sh.itjust.works

1 point

8 months ago

You paid Hoffman?

potustheplant@feddit.nl

1 point

8 months ago

Where do you get that from? At least ChatGPT isn’t limited to data from 2021. I haven’t researched other models.

RatBin@lemmy.world

8 points

8 months ago

GPT 3.5 is limited to 2021. GPT 4, 4.5, and the imaginary upcoming GPT 5 models are not, but that does not mean they aren’t limited in their own ways.

dislocate_expansion@reddthat.com (bot)

4 points

8 months ago

Are you sure those aren’t trained until 2021, frozen, and then fine-tuned on later data?

RatBin@lemmy.world

2 points

8 months ago

I really don’t know, I’m speculating, but OpenAI doesn’t know either, that’s for sure. So we have the most popular ML system, used by millions, based on…what exactly?

dislocate_expansion@reddthat.com (bot)

4 points

8 months ago

Yeah, GPT 3.5 and some other FOSS models also say 2021.

potustheplant@feddit.nl

5 points

8 months ago

OpenAI stated in a tweet a few months ago that the limitation is no longer in place.

webghost0101@sopuli.xyz

4 points

8 months ago

To be fair, this tweet doesn’t say anything about training data, but simply that it can theoretically use present-day data if it looks it up online.

For GPT 4, I think it was initially trained up to 2021, but it has gotten updates where data up to December 2023 was used in training. It “knows” this data and does not need to look it up.

Whether they managed to further train the initial GPT 4 model to do so or added something they trained separately is probably a trade secret.

dislocate_expansion@reddthat.com (bot)

2 points

8 months ago

Thanks!

Memes

!memes@lemmy.ml
