Let's say I want to host my own local search engine, and I have the application ready.
Now I want to activate my crawlers to scrape and index the web.
Would I be in hot water for doing this? Are there any implementation-level rules I can check other than robots.txt?
Any thoughts or input on the subject, other than it being a huge waste of time and resources? :D

  • 6
    There isn't an established netiquette, as far as I know.

    In all the companies I have worked for, we usually banned crawlers that ...

    1) ignored robots.txt / sitemap.xml, especially if they tried to call "randomized" queries / routes

    2) crawled too aggressively: either the number of requests per second exceeded a certain limit, or the number of requests to the same site exceeded a certain limit (trying to fetch the same site every minute isn't nice either)

    3) behaved "weird", e.g. TLS downgrades, HTTP request smuggling, randomization of the User-Agent header, ...
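
    As a rough illustration of points 1) and 2), here is a minimal Python sketch of a "politeness gate" a crawler could consult before every request. It only uses the standard library's urllib.robotparser. The class name PolitenessGate, the bot name and info URL in USER_AGENT, and the 5-second default delay are all made up for this example, not taken from any standard. Crawl-delay is a non-standard robots.txt directive that only some sites set. Fetching robots.txt itself is left out so the sketch stays offline.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity: one stable, honest User-Agent (never randomized).
USER_AGENT = "MyLocalSearchBot/0.1 (+https://example.org/bot-info)"


class PolitenessGate:
    """Tracks robots.txt rules and a minimum delay between requests per host."""

    def __init__(self, user_agent=USER_AGENT, default_delay=5.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.default_delay = default_delay  # assumed floor, in seconds
        self.clock = clock                  # injectable so it can be tested
        self.robots = {}                    # host -> RobotFileParser
        self.next_allowed = {}              # host -> earliest next request time

    def load_robots(self, host, robots_txt):
        """Register the robots.txt body that was downloaded for `host`."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, url):
        """True only if robots.txt for the host is loaded and permits the URL."""
        parser = self.robots.get(urlparse(url).netloc)
        if parser is None:
            return False  # conservative: no rules loaded yet, so do not fetch
        return parser.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Seconds to wait before the host may be contacted again."""
        host = urlparse(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def record_fetch(self, url):
        """Call after each request; schedules the next allowed request time."""
        host = urlparse(url).netloc
        parser = self.robots.get(host)
        site_delay = parser.crawl_delay(self.user_agent) if parser else None
        delay = max(self.default_delay, float(site_delay or 0))
        self.next_allowed[host] = self.clock() + delay
```

    In a crawl loop you would skip any URL where allowed(url) is false, sleep for wait_time(url), do the request with the same User-Agent header, then call record_fetch(url). Per-host limits like this address the "too aggressive" complaint; they do not replace whatever limits a site states in its terms.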
  • 0
    Check fair use law and see whether you satisfy its four factors. That will cover you if your engine is going to be used by the public.