Excellent! Now I won’t get reddit results and then have to filter them out!
Kagi has a “search lens” specifically for the Threadiverse. They track lemmy/kbin/etc. sites, so you can have them show up in their own results section, and you can also use “!threadiverse” (or whatever bang you pick) to search just those sites.
They do the same for Usenet.
I suppose, given this new robots.txt Reddit development, that they’ll probably never have a Reddit lens, though.
These two shitty companies deserve each other.
test site:reddit.com
works fine from DDG for me.
Older results will still show up, but these search engines are no longer able to “crawl” Reddit, meaning that Google is the only search engine that will turn up results from Reddit going forward.
Robots.txt lets you ask specific user-agents not to crawl the site. My guess is that that’s how they restricted it. I don’t know how those changes are reflected in pages that are already indexed – don’t know if there’s any standard there – but it’ll stop crawlers from downloading new pages.
Try searching for new posts, see how DDG/Bing compares to Google.
EDIT: Yeah. They’ve got a sitewide ban for all crawlers. That’d normally block Google’s bot too, but I bet that they have some offline agreement to have it ignore the thing, operate out-of-spec.
https://www.reddit.com/robots.txt
User-agent: *
Disallow: /
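For what it’s worth, you can sanity-check what those two lines mean for any spec-compliant crawler with Python’s standard-library `urllib.robotparser` (the bot names below are just examples; everything falls under the `*` group here):

```python
from urllib.robotparser import RobotFileParser

# Parse the two directives currently served at reddit.com/robots.txt
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# Any compliant crawler that matches the "*" group is denied everything.
for bot in ("DuckDuckBot", "bingbot", "Googlebot"):
    print(bot, rp.can_fetch(bot, "https://www.reddit.com/r/linux/"))
    # -> False for all of them
```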
Here’s a snapshot on archive.org’s Wayback Machine from April 30 of this year. Very different:
https://web.archive.org/web/20240430000731/http://www.reddit.com/robots.txt
iirc, isn’t robots.txt more of a gentlemen’s agreement? I vaguely recall that bots can crawl a site regardless; it’s just that most devs respect robots.txt and don’t. Could be wrong though, happy to be corrected.
Sure, you can write software that violates the spec. But I mean, that’d be true for anything Reddit can do on their end. Even if they block responses to what they think are bots, software can always try hard to impersonate users and scrape the site. You could go through a VPN and pretend to be a browser following a link to a page.
But major search engines will follow the spec with their crawlers.
EDIT: RFC 9309, Robots Exclusion Protocol
https://datatracker.ietf.org/doc/html/rfc9309
If no matching group exists, crawlers MUST obey the group with a user-agent line with the “*” value, if present.
To evaluate if access to a URI is allowed, a crawler MUST match the paths in “allow” and “disallow” rules against the URI.
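If it helps, here’s a rough sketch of that evaluation in Python: collect every allow/disallow rule whose path matches the request path, take the longest match, and prefer allow on a tie. This ignores the spec’s `*` and `$` wildcard handling for brevity, and the rule sets below are made-up examples, so treat it as an illustration rather than a full implementation:

```python
def allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules is a list of ("allow" | "disallow", rule_path) pairs from the
    matched group. Longest matching rule wins, ties go to "allow", and a
    path that matches no rule is allowed."""
    matches = [(len(p), kind) for kind, p in rules if path.startswith(p)]
    if not matches:
        return True
    _, kind = max(matches, key=lambda m: (m[0], m[1] == "allow"))
    return kind == "allow"

# With Reddit's current group for "*" (Disallow: /), everything is blocked:
print(allowed("/r/linux/", [("disallow", "/")]))                             # False
# A hypothetical group where a more specific allow overrides a disallow:
print(allowed("/r/linux/", [("disallow", "/r/"), ("allow", "/r/linux/")]))   # True
```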
EDIT2: Even if, amusingly, Google apparently isn’t following it for this particular case with GoogleBot, given the way that they’re signing agreements. They’ll honor it for sites that they haven’t signed agreements with, though.
EDIT3: Actually, on second thought, GoogleBot may be honoring it too. GoogleBot may not be crawling Reddit anymore. They may have some “direct pipe” that passes comments to Google that bypasses Google’s scraper. Less load on both their systems, and lets Google get real-time index updates without having to hammer the hell out of Reddit’s backend to see if things have changed. Like, think of how Twitter’s search engine is especially useful because it has full-text search through comments and immediately updates the index when someone comments.
Parts of the Internet only searchable on specific sites now? What next - charging a monthly subscription to use Google?
This needs to be regulated before the Internet becomes like streaming TV.
Robots.txt has been around for a long time, and all the major search engines will honor it. Not having a full index of the Web is the norm.
That isn’t to say that the practice of signing agreements isn’t potentially a concern. Not sure that I like the idea of search engines paying sites money to degrade search results of competitors.
Alright then. The 3rd party app drama already pushed me here. I really won’t go back for anything if I’m not allowed to search for Reddit anymore.