You are here: kevingimbel.de
pxlnv.com
After Robb Knight found, and Wired confirmed, that Perplexity summarizes websites that have followed its opt-out instructions, I noticed a number of people making a similar claim: this is nothing but a big misunderstanding of the function of controls like robots.txt. A Hacker News comment thread contains several versions of these two arguments: [...]
www.andrlik.org
It is now clear that at least some AI companies are ignoring robots.txt rules that forbid them from scraping a site. Robb Knight wrote up a great guide for explicitly blocking those scraping bots via your Nginx config. However, this site is currently served by AWS CloudFront, which means that the content gets served without the request touching the source server. I was sure there had to be a way to do something similar with a CloudFront function, so I set out to try.
tsak.dev
With the recent news of OpenAI's web crawler respecting robots.txt, and the ensuing scramble by seemingly everybody to ensure their robots.txt blocks GPTBot, I wondered whether there was a better solution to help our future AI overlords make sense of the world. Since I host all my sites on a tiny NUC using nginx and had previously played with its return directive, I decided to reuse the same trick for visits from GPTBot.
www.engadget.com
Sarah Silverman and two other authors allege OpenAI and Meta trained their large language models on copyrighted materials, including works they published, without obtaining consent.