forked from zesterer/babble
Junk food for your local LLM
Ed Summers 3ec4cd8595
Deny all robots
Currently the robots.txt is set up to allow complete access by robots.
This means that well-meaning bots that actually respect a site's wishes
with regard to crawling will be invited into the maze.

I think it makes more sense to tell all robots to go away; if a robot
blindly ignores this, it will get lost in the babble tarpit.

Given enough babble instances, bot creators will, over time, write LLM
scraping bots that respect robots.txt so that they don't incur the cost
to their compute, their bandwidth, and ultimately the quality of their
model.

```
# To exclude all robots from the entire server:
User-agent: *
Disallow: /

# To allow all robots complete access:
User-agent: *
Disallow:
```

via https://www.robotstxt.org/robotstxt.html
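The effect of the deny-all policy on a compliant crawler can be sketched with Python's standard-library parser (the user-agent string and URL below are placeholders, not anything babble-specific):

```python
# A well-behaved crawler consults robots.txt before fetching anything.
# With the proposed deny-all policy, every path is off limits.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# A compliant bot checks each URL and is turned away:
print(rp.can_fetch("SomeCrawler/1.0", "https://example.com/any/page"))  # False
```

A bot that skips this check gets the tarpit instead.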
2025-05-21 11:45:46 -04:00

Babble

Standalone LLM crawler tarpit binary. Generates an endless stream of deterministic bollocks to be ingested by bots, with plenty of links.

Why?

  • Divert and slow down LLM crawler traffic, protecting your main site
  • Potentially poison LLM training data (likely not very effective)
  • Collective defence; the more time a scraper spends swallowing babble, the less time it'll spend bullying someone else's site
  • Do your bit to protect the public commons from those who would readily see it destroyed for the sake of an investment round

Usage

--cert <path>     Path of `cert.pem` (for TLS)
--key <path>      Path of `key.pem` (for TLS)
--sock <address>  Bind to the given socket. Defaults to 0.0.0.0:3000.
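For local testing, a throwaway self-signed certificate can be generated with standard OpenSSL; the babble invocation shown in the comment is assumed from the flags above and may differ from the real binary:

```shell
# Create a throwaway self-signed certificate and key for local testing.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout key.pem -out cert.pem -days 30 -subj "/CN=localhost"

# Then launch babble against it (hypothetical invocation, per the flags above):
# ./babble --cert cert.pem --key key.pem --sock 0.0.0.0:3000
```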

Babble will search for a robots.txt file in the working directory to use. If it does not find one, it will use a default one that denies everything.

Babble will periodically emit statistics into stats.txt, showing information about the worst-offending requesting IPs.

Warning

Deploy it in a Docker environment. It's probably safe, but there's no reason to take chances.

Usage terms

There are none, other than those implied by dependencies. Use it whenever and wherever you want, and in any way.

Attribution

Fuck you, Sam Altman.