OSS-bot – Michael Nielsen

OSS-bot is a crawler I (Michael Nielsen) built for educational purposes — I run occasional informal meetups where programmers in Toronto get together to talk about machine learning, information retrieval, and similar topics. The crawler:

(1) Is designed to be polite: it obeys robots.txt and follows various other best practices. If you wish to exclude OSS-bot, please add the appropriate exclusion rules to your robots.txt file, under a User-agent pattern that matches the string “OSS-bot”. Note that the crawler caches robots.txt files, so it may take up to a day or so before it stops accessing your site. Contact me (mn@michaelnielsen.org) if this is a concern.

(2) Waits about 3 minutes, on average, between requests to any given domain. The crawler will never take less than 70 seconds between requests for any given domain.

(3) Typically crawls for 1-3 hours at a time. Occasional crawls may last longer, up to 48 hours. The crawler is run infrequently, and in most months it runs for between 0 and 10 hours.

(4) Crawls only HTML content (not images, JavaScript, etc.), so the total bandwidth consumed is typically a few hundred kilobytes per hour per domain.
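
For site owners who want to check an exclusion rule before deploying it, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt contents, the example.com URLs, the "OtherBot" name, and the is_allowed helper are all hypothetical illustrations; this is not OSS-bot's own code, and OSS-bot's exact matching behaviour may differ in its details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block OSS-bot everywhere, and keep every
# other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: OSS-bot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain, used only for illustration.
    print(is_allowed(ROBOTS_TXT, "OSS-bot", "https://example.com/index.html"))
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/index.html"))
```

With these rules the first check reports that OSS-bot is blocked, while the second shows that other crawlers can still reach pages outside /private/.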

OSS-bot crawls in this way so that the burden imposed by the crawler is (briefly) comparable to that imposed by a moderately intense user. However, I’m still learning best practices for crawling, and don’t want to impose an undue burden on sites. If you have concerns or suggestions or would simply like OSS-bot to stop crawling your site, please contact me (mn@michaelnielsen.org).
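
To make the request spacing in item (2) concrete, here is a rough sketch of per-domain pacing with a 70-second floor and a mean gap of about 3 minutes. Only those two figures come from the description above. The DomainScheduler class, the exponential jitter, and the constant names are assumptions chosen for illustration, not a description of how OSS-bot is actually implemented.

```python
import random

MIN_DELAY = 70.0    # seconds; the stated minimum gap per domain
MEAN_DELAY = 180.0  # seconds; the stated typical mean gap (about 3 minutes)


class DomainScheduler:
    """Tracks, per domain, the earliest time the next request may be sent."""

    def __init__(self, rng=None):
        self._rng = rng if rng is not None else random.Random()
        self._next_allowed = {}  # domain -> earliest allowed request time

    def next_delay(self) -> float:
        # Assumed distribution: the fixed floor plus exponential jitter,
        # so the delay is never below MIN_DELAY and averages MEAN_DELAY.
        jitter_mean = MEAN_DELAY - MIN_DELAY
        return MIN_DELAY + self._rng.expovariate(1.0 / jitter_mean)

    def ready(self, domain: str, now: float) -> bool:
        # A domain never seen before is ready immediately.
        return now >= self._next_allowed.get(domain, float("-inf"))

    def record_request(self, domain: str, now: float) -> None:
        # Call right after a request; schedules the next permitted time.
        self._next_allowed[domain] = now + self.next_delay()
```

At a mean gap of 3 minutes this works out to roughly 20 requests per hour per domain, which is the rate behind the bandwidth estimate in item (4).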
