0 points

Are we not flawed too? Does that not make AI…human?

23 points

How dare you imply that humans just make shit up when they don’t know the truth

5 points

Did I misremember something, or is my memory easily influenced by external stimuli? No, the Mandela Effect must be real!

/s

-4 points

Real headline: Apple research presents possible improvements in benchmarking LLMs.

21 points

Not even close. The paper is questioning LLMs’ ability to reason. The article talks about fundamental flaws of LLMs and how we might need different approaches to achieve reasoning. The benchmark is only used to prove the point; it is definitely not the headline.

-4 points

You say “Not even close.” in response to the suggestion that Apple’s research can be used to improve benchmarks for AI performance, but then later say the article talks about how we might need different approaches to achieve reasoning.

Now, mind you - achieving reasoning can only happen if the model is accurate and works well. And to have a good model, you must have good benchmarks.

Not to belabor the point, but here’s what the article and study say:

The article talks at length about the reliance on a standardized set of questions, GSM8K, and how the questions themselves may have made their way into the training data. It notes that modifying the questions dynamically leads to decreases in the performance of the tested models, even if the complexity of the problem to be solved has not gone up.

The third sentence of the paper (Abstract section) says this: “While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics.” The rest of the abstract goes on to discuss (paraphrased in layman’s terms) that LLMs are ‘studying for the test’ and not generally achieving real reasoning capabilities.

By presenting their methodology - dynamically changing the evaluation criteria to reduce data pollution and require models be capable of eliminating red herrings - the Apple researchers are offering a possible way benchmarking can be improved.
Which is what the person you replied to stated.

The commenter is fairly close, it seems.
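To make the methodology concrete, here’s a toy sketch of that kind of templating. This is not the paper’s actual code; the template, names, and numbers are all made up for illustration. The idea is just that the surface form is re-sampled per variant (and an optional irrelevant clause can be tacked on), while the ground-truth answer is computed from the sampled numbers:

```python
import random

# Hypothetical sketch of GSM-Symbolic-style templating: the names and
# numbers in a GSM8K-style question become symbols that are re-sampled
# for each variant, so memorising the original wording gives no help,
# while the underlying arithmetic stays the same.
TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "How many apples does {name} have now?")

def make_variant(rng, with_red_herring=False):
    name = rng.choice(["Sophie", "Liam", "Mei", "Omar"])
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    question = TEMPLATE.format(name=name, a=a, b=b)
    if with_red_herring:
        # A plausible-sounding but irrelevant clause: a genuine reasoner
        # should ignore it rather than subtract anything from the total.
        question += " Five of the apples are slightly smaller than average."
    return question, a + b  # the ground-truth answer moves with the numbers
```

A model that merely pattern-matched the canonical wording would be expected to degrade across such variants, which is the drop the study reports.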

4 points

Adding the benchmark back into the training process doesn’t mean you get an LLM that can weed out irrelevant data. What you get is an LLM that can pass the new metric, and you have to design a new metric with different semantic patterns to actually know whether it’s “eliminating red herrings”.
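A toy illustration of that failure mode (not from the paper, just a sketch): a “model” that has memorised the answer key is perfect on the fixed questions but collapses on a trivially reworded variant of the same task.

```python
# A "model" that memorised the benchmark's answer key.
fixed_benchmark = {"2 + 3": 5, "7 + 4": 11, "9 + 6": 15}

def memorising_model(question):
    return fixed_benchmark.get(question)  # pure lookup, zero reasoning

# 100% on the set it "trained" on...
assert all(memorising_model(q) == a for q, a in fixed_benchmark.items())

# ...but the same task with a different surface form gets no answer.
assert memorising_model("3 + 2") is None
```

Real LLMs fail more gracefully than a lookup table, of course, but the incentive problem is the same: once the metric is in the training data, passing it no longer demonstrates the capability it was meant to measure.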

-1 points

Once there’s a benchmark, LLMs can optimise for it. This is just another piece of news where people call “game over” but the money poured into R&D isn’t stopping anytime soon. Wasn’t synthetic data supposed to be game over for LLMs? Its limitations have been identified and it’s still being leveraged.

9 points

Someone needs to pull the plug on all of that stuff.

74 points

The results of this new GSM-Symbolic paper aren’t completely new in the world of AI research. Other recent papers have similarly suggested that LLMs don’t actually perform formal reasoning and instead mimic it with probabilistic pattern-matching of the closest similar data seen in their vast training sets.

WTF kind of reporting is this, though? None of this is recent or new at all, like in the slightest. I am shit at math, but have a high-level understanding of statistical modeling concepts mostly as of a decade ago, and even I knew this. I recall a stats PhD describing models as “stochastic parrots”; nothing more than probabilistic mimicry. It was obviously no different the instant LLMs came on the scene. If only tech journalists bothered to do a superficial amount of research, instead of being spoon fed spin from tech bros with a profit motive…
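The “stochastic parrot” idea is easy to demonstrate in miniature. Here’s a hypothetical bigram model (vastly simpler than an LLM, but the same in spirit): it continues text purely by sampling the next word from frequencies seen in a tiny training corpus, with no understanding anywhere in the loop.

```python
import random
from collections import defaultdict

# Toy "stochastic parrot": a bigram model that continues text by sampling
# the next word from the successors observed in training data. There is
# no reasoning, just probabilistic mimicry of seen patterns.
corpus = "the cat sat on the mat the cat ate the fish".split()

follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

def parrot(start, length, rng):
    words = [start]
    for _ in range(length):
        options = follows.get(words[-1])
        if not options:  # word never seen mid-corpus: the parrot is stuck
            break
        words.append(rng.choice(options))
    return " ".join(words)
```

Scale the corpus up by a few trillion tokens and the context window from one word to thousands, and the mimicry gets impressively fluent, but the mechanism is still next-token prediction.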

7 points

Clearly this sort of reporting is not prevalent enough, given how many people think we have actually come up with something new these last few years and aren’t just throwing shitloads of graphics cards and data at statistical models.

12 points

describing models as “stochastic parrots”

That is SUCH a good description.

39 points

It’s written as if they literally expected AI to reason on its own, and not just be a mirror of the bullshit that is put into it.

33 points

Probably because that’s the common expectation due to calling it “AI”. We’re well past the point of putting the lid back on that can of worms, but we really should have saved that label for… y’know… intelligence, that’s artificial. People think we’ve made an early version of Halo’s Cortana or Star Trek’s Data, and not just a spellchecker on steroids.

The day we make actual AI is going to be a really confusing one for humanity.

6 points

If only tech journalists bothered to do a superficial amount of research, instead of being spoon fed spin from tech bros with a profit motive…

This is outrageous! I mean the pure gall of suggesting journalists should be something other than part of a human centipede!

10 points

*starts sweating

Look at that subtle pixel count, the tasteful colouring… oh my god, it’s even transparent…


Are the uncensored models more capable tho?

11 points

Given the use cases they were benchmarking I would be very surprised if they were any better.
