Thank You, AI (kraxel.org)
data-ottawa 19 minutes ago [-]
Does anyone know what's the deal with these scrapers, or why they're attributed to AI?

I would assume any halfway competent LLM-driven scraper would see a mass of 404s and stop. If they're just collecting data to train LLMs, these look like exceptionally poorly written and abusive scrapers of the ordinary kind, just run by more bad actors.
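
(For illustration, a minimal Python sketch of the kind of back-off a halfway competent scraper could implement; the threshold and names here are made up, not taken from any real crawler.)

    # Illustrative only: stop crawling a host once 404s dominate the responses.
    import requests  # third-party HTTP client

    def crawl(urls, min_samples=50, max_404_rate=0.5):
        seen = not_found = 0
        for url in urls:
            resp = requests.get(url, timeout=10)
            seen += 1
            if resp.status_code == 404:
                not_found += 1
            # After enough samples, give up if most responses are 404s.
            if seen >= min_samples and not_found / seen > max_404_rate:
                return f"stopping: {not_found}/{seen} requests were 404s"
            # ...otherwise process resp.text...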

Are we seeing these scrapers using LLMs to bypass auth or run more sophisticated flows? I haven't worked on bot detection in the last few years, but it was very common for residential-proxy-based scrapers to hammer sites for years, so I'm wondering what's different.

hsuduebc2 5 minutes ago [-]
I’m guessing, but I think a big portion of AI requests now come from agents pulling data specifically to answer a user’s question. I don’t think that data is collected mainly for training anymore; it’s mostly retrieved and fed into LLMs so they can generate the response. Hence so many repeated requests.
themafia 7 minutes ago [-]
There's value to be had in ripping the copyright off your stuff so someone else can pass it off as their own. LLMs aren't improving technically, so all their makers can do is throw more and more stolen data at them and hope that, somehow, they cross a nebulous "threshold" where it suddenly becomes actually profitable to use and sell them.

It's a race to the bottom. What's different is we're much closer to the bottom now.

CuriouslyC 32 minutes ago [-]
Does this author have a big pre-established audience or something? Struggling to understand why this is front-page worthy.
jaunt7632 21 minutes ago [-]
A healthy front page shouldn’t be a “famous people only” section. If only big names can show up there, it’s not discovery anymore, it’s just a popularity scoreboard.
fouc 24 minutes ago [-]
Because he's unable to self-host git anymore; AI bots are hammering it to submit PRs.

Self-hosting was originally a "right" we had upon gaining access to the internet in the '90s; it was the main point of the Hypertext Transfer Protocol.

geerlingguy 7 minutes ago [-]
Also converting the blog from something dynamic to a static site generator. I made the same switch partly for ease of maintenance, but a side benefit is it's more resilient to this horrible modern era of scrapers far outnumbering legitimate traffic.

It's painful to have your site offline because a scraper has channeled itself 17,000 layers deep through tag links (which are set to nofollow and disallowed in robots.txt, but the scraper doesn't care). And it's especially annoying when that happens on a daily basis.
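
(For contrast, a short Python sketch of what honoring robots.txt plus a depth cap would look like; example.com and the limits are placeholders, not details from the post.)

    # Illustrative only: check robots.txt and cap crawl depth before fetching.
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    robots.set_url("https://example.com/robots.txt")  # placeholder site
    robots.read()

    MAX_DEPTH = 5  # arbitrary cap; prevents descending 17,000 layers into tag links

    def may_fetch(url, depth, user_agent="ExampleBot"):
        return depth <= MAX_DEPTH and robots.can_fetch(user_agent, url)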

Not everyone wants to put their site behind Cloudflare.

ares623 20 minutes ago [-]
Well, the fact that this supposed nobody is being overwhelmed by AI scrapers should say a lot about the issue, no?
bibimsz 25 minutes ago [-]
the era of mourning has begun
oceanplexian 10 minutes ago [-]
It's not that hard to serve some static files at 10k RPS from something running on modest, 10-year-old hardware.

My advice to the OP: if you're not experienced enough, maybe stop taking subtle digs at AI, fire up Claude Code, and ask it how to set up a LAMP stack or a simple Varnish cache. You might find it's a lot easier than writing a blog post.

simonw 4 minutes ago [-]
A Varnish cache won't help you if you're running something like a code forge where every commit has its own page. Often more than one page: there's the page for the commit, then the page for "history from this commit", and a page for every one of the files that existed in the repo at the time of that commit...

Then a poorly written crawler shows up and requests 10,000s of pages that haven't been requested recently enough to be in your cache.

I had to add a Cloudflare CAPTCHA to the /search/ page of my blog because of my faceted search engine, which produces many thousands of unique URLs when you consider tags and dates and pagination and sort-by settings.

And that's despite me serving every page on my site through a 15-minute Cloudflare cache!

Static only works fine for sites that have a limited number of pages. It doesn't work for sites that truly take advantage of the dynamic nature of the web.
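
(A back-of-the-envelope Python sketch of why facets defeat a cache; the counts below are invented placeholders, not the real numbers for simonw's search page.)

    # Each independent facet multiplies the URL space a crawler can wander into.
    tags = 300     # distinct tags
    years = 20     # date filters
    sorts = 2      # sort-by options
    pages = 10     # pagination depth per result set
    print(tags * years * sorts * pages)  # 120000 distinct /search/ URLs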

QuiDortDine 5 minutes ago [-]
Not sure why you're talking like OP pissed in your Cheerios. They're a victim of a broken system; it shouldn't be on them to spend more effort protecting their stuff from careless-to-malicious actors.
aguacaterojo 3 minutes ago [-]
How would a LAMP stack help his git server?
Jaxkr 36 minutes ago [-]
The author of this post could solve their problem with Cloudflare or any of its numerous competitors.

Cloudflare will even do it for free.

denkmoon 23 minutes ago [-]
Cool, I can take all my self-hosted stuff and stick it behind centralised enterprise tech to solve a problem caused by enterprise tech. Why even bother?
FeteCommuniste 24 seconds ago [-]
"Cause a problem and then sell the solution" proves a winning business strategy once more.
rubiquity 32 minutes ago [-]
The scrapers should use some discretion. There are some rather obvious optimizations: content that isn't changing now is less likely to change in the future.
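
(One such optimization is a plain HTTP conditional request; a minimal Python sketch, assuming the server returns an ETag header.)

    # Illustrative only: re-fetch a page only if it changed since the last visit.
    import requests  # third-party HTTP client

    def fetch_if_changed(url, etag=None):
        headers = {"If-None-Match": etag} if etag else {}
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 304:   # Not Modified: nothing to re-download
            return None, etag
        return resp.text, resp.headers.get("ETag")
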
the_fall 21 minutes ago [-]
They don't. I'm using Cloudflare, and 90%+ of the traffic I'm getting is still broken scrapers, a lot of them coming through residential proxies. I don't know what Cloudflare blocks, but it's not very good at it. Or, to be more fair: I think the scrapers have gotten really good at what they do because there's real money to be made.
overgard 18 minutes ago [-]
I'm pretty sure scrapers aren't supposed to act as low-key DoS attacks.
isodev 31 minutes ago [-]
I think the point of the post was how something useless (AI) and its poorly implemented scrapers are wreaking havoc in a way that’s turning the internet into a digital desert.

That Cloudflare is trying to monetise “protection from AI” is just another grift in the sense that they can’t help themselves as a corp.

fouc 22 minutes ago [-]
You don't understand what self-hosting means. Self-hosting means the site is still up when AWS and Cloudflare go down.
Lerc 19 minutes ago [-]
I presume people have logs that indicate the source for them to place blame on AI scrapers. Is anybody making these available for analysis so we can see exactly who is doing this?
sdf2erf 26 minutes ago [-]
[dead]