In this post I describe a simple, single-machine web crawler that I’ve written, and do some simple profiling and benchmarking. In the next post I intend to benchmark it against two popular open source crawlers, scrapy and Nutch.
I’m doing this as part of an attempt to answer a big, broad question: if you were trying to build a web-scale crawler, does it make more sense to start from scratch (which gives you a lot of flexibility), or to build on an existing project, like Nutch?
Of course, there are many aspects to answering this question, but one important aspect is speed: how fast can we download pages? I’m especially interested in understanding where the bottlenecks are in my code. Is it the fact that I’ve used Python? Is it download speed over the network? Is it access to the database server? Is it parsing content? Are we CPU-bound, network-bound, or disk-bound? The answers to these questions will help inform the decision about whether to keep improving this crawler, or instead to start from an existing crawler.
The code for my test crawler is at GitHub. The crawler uses as seed urls a list of some of the top blogs from Technorati, and only urls from within the corresponding domains are crawled. I won’t explicitly show the code for getting the seed urls, but it’s in the same GitHub repository (link). Here’s the code for the crawler:
"""crawler.py crawls the web. It uses a domain whitelist generated from Technorati's list of top blogs. USEAGE python crawler.py & The crawler ingests input from external sources that aren't under centralized control, and so needs to deal with many potential errors. By design, there are two broad classes of error, which we'll call anticipated errors and unanticipated errors. Anticipated errors are things like a page failing to download, or timing out, or a robots.txt file disallowing crawling of a page. When anticipated errors arise, the crawler writes the error to info.log, and continues in an error-appropriate manner. Unanticipated errors are, not surprisingly, errors which haven't been anticipated and designed for. Rather than the crawler falling over, we log the error and continue. At the same time, we also keep track of how many unanticipated errors have occurred in close succession. If many unanticipated errors occur rapidly in succession it usually indicates that some key piece of infrastructure has failed --- maybe the network connection is down, or something like that. In that case we shut down the crawler entirely.""" import cPickle import json import logging import logging.handlers import os from Queue import Queue import re import robotparser from StringIO import StringIO import sys import threading import time import traceback import urllib import urllib2 from urlparse import urlparse # Third party libraries from lxml import etree import MySQLdb from redis import Redis # Configuration parameters # NUM_THREADS is the number of crawler threads: setting this parameter # requires some experimentation. A rule of thumb is that the speed of # the crawler should scale roughly proportionally to NUM_THREADS, up # to a point at which performance starts to saturate. That's the # point at which to stop. NUM_THREADS = 15 # MAX_LENGTH is the largest number of bytes to download from any # given url. MAX_LENGTH = 100000 # NUM_PAGES is the number of pages to crawl before halting the crawl. # Note that the exact number of pages crawled will be slightly higher, # since each crawler thread finishes its current downloads before # exitting. At the moment, NUM_PAGES is set quite modestly. To # increase it dramatically --- say up to 10 million --- would require # moving away from Redis to maintain the URL queue. NUM_PAGES = 5000 # Global variables to keep track of the number of unanticipated # errors, and a configuration parameter --- the maximum number of # close unanticipated errors in a row that we'll tolerate before # shutting down. count_of_close_unanticipated_errors = 0 time_of_last_unanticipated_error = time.time() MAX_CLOSE_UNANTICIPATED_ERRORS = 5 # Counter to keep track of the number of pages crawled. r = Redis() r.set("count",0) # total_length: the total length of all downloaded files. # Interpreting it depends on total_length = 0 def main(): create_logs() establish_url_queues() get_seed_urls() get_domain_whitelist() establish_mysql_database() start_time = time.time() for j in range(NUM_THREADS): crawler = Crawler() crawler.setName("thread-%s" % j) print "Launching crawler %s" % (j,) crawler.start() crawler.join() end_time = time.time() elapsed_time = end_time-start_time r = Redis() num_pages = int(r.get("count")) print "%s pages downloaded in %s seconds." % (num_pages,elapsed_time) print "That's %s pages per second." % (num_pages/elapsed_time) print "\nTotal length of download is %s." 
% total_length print "Assuming UTF-8 encoding (as for most English pages) that's the # of bytes downloaded." print "Bytes per second: %s" % (total_length/elapsed_time) def create_logs(): """Set up two logs: (1) info_logger logs routine events, including both pages which have been crawled, and also anticipated errors; and (2) critical_logger records unanticipated errors.""" global info_logger global critical_logger info_logger = logging.getLogger('InfoLogger') info_logger.setLevel(logging.INFO) info_handler = logging.handlers.RotatingFileHandler( 'info.log', maxBytes=1000000, backupCount=5) info_logger.addHandler(info_handler) critical_logger = logging.getLogger('CriticalLogger') critical_logger.setLevel(logging.CRITICAL) critical_handler = logging.handlers.RotatingFileHandler( 'critical.log',maxBytes=100000,backupCount=5) critical_logger.addHandler(critical_handler) def establish_url_queues(): """Checks whether the Redis database has been set up. If not, set it up.""" r = Redis() # Strictly, we should check that the lists for all threads are empty. # But this works in practice. if r.llen("thread-0") == 0: get_seed_urls() def get_seed_urls(): """Puts the seed urls generated by get_seed_urls_and_domains.py into the crawl queues.""" f = open("seed_urls.json") urls = json.load(f) f.close() append_urls(urls) def append_urls(urls): """Appends the contents of urls to the crawl queues. These are implemented as Redis lists, with names "thread-0", "thread-1", and so on, corresponding to the different threads.""" r = Redis() for url in urls: thread = hash(domain(url)) % NUM_THREADS r.rpush("thread-%s" % str(thread),url) def domain(url): """A convenience method to return the domain associated to a url.""" return urlparse(url).netloc def get_domain_whitelist(): global domain_whitelist f = open("domain_whitelist.json") domain_whitelist = set(json.load(f)) f.close() def establish_mysql_database(): """Checks whether the tables in the MySQL database "crawl" have been set up. If not, then set them up. Note that this routine assumes that the "crawl" database has already been created. """ conn = MySQLdb.connect("localhost","root","","crawl") cur = conn.cursor() if int(cur.execute("show tables")) == 0: create_tables(conn) conn.close() def create_tables(conn): """Creates the MySQL tables and indices for the crawl. 
We create two tables: (1) robot_parser, which is used to store the (parsed) robots.txt file for each domain; and (2) pages, which stores urls and the corresponding title and content text.""" cur = conn.cursor() cur.execute( 'create table robot_parser (domain text, robot_file_parser text)') cur.execute('create table pages (url text, title text, content mediumtext)') cur.execute('create index domain_idx on robot_parser(domain(255))') cur.execute('create index url_pages_idx on pages(url(255))') class Crawler(threading.Thread): def run(self): global total_length r = Redis() conn = MySQLdb.connect("localhost","root","","crawl") parser = etree.HTMLParser() while int(r.get("count")) < NUM_PAGES: try: urls = self.pop_urls() new_urls = [] except: self.error_handler() continue for url in urls: try: if not self.is_crawling_allowed(conn,url): continue try: request = urllib2.urlopen(url,timeout=5) except urllib2.URLError: info_logger.info("%s: could not open" % url) continue headers = request.headers length = get_length(url,headers) try: content = request.read(length) except urllib2.URLError: info_logger.info("%s: could not download" % url) continue try: tree = etree.parse(StringIO(content),parser) except: info_logger.info("%s: lxml could not parse" % url) continue self.add_to_index(conn,url,content,tree) r.incr("count") total_length += len(content) info_logger.info("Crawled %s" % url) for url in tree.xpath("//a/@href"): if not(domain(url) in domain_whitelist): pass else: new_urls.append(url) except: self.error_handler() continue try: append_urls(new_urls) conn.commit() except: self.error_handler() continue def pop_urls(self): """Returns 10 urls from the current thread's url queue.""" urls = [] r = Redis() for j in range(10): urls.append(r.lpop(self.name)) return urls def error_handler(self): """Logs unanticipated errors to critical_logger. Also checks whether or not the error occurred close to (within 3 seconds) another unanticipated error, and if that occurs too many times in a row, shuts down the crawler.""" global time_of_last_unanticipated_error global count_of_close_unanticipated_errors critical_logger.critical( "Error getting URLs at %s:\n\n%s" % (time.asctime(), traceback.format_exc())) # Check whether the error is close (within 3 seconds) to the # last unanticipated error if (time.time() < time_of_last_unanticipated_error + 3.0): critical_logger.critical( "\nThis error occurred close to another.\n") # Not threadsafe, but shouldn't cause major problems count_of_close_unanticipated_errors += 1 else: count_of_close_unanticipated_errors = 0 # Shut down if we have too many close unanticipated errors if (count_of_close_unanticipated_errors >= MAX_CLOSE_UNANTICIPATED_ERRORS): critical_logger.critical( "\nExit: too many close unanticipated errors.") sys.exit(1) time_of_last_unanticipated_error = time.time() def is_crawling_allowed(self,conn,url): """Checks that url can be crawled. At present, this merely checks that robots.txt allows crawling, and that url signifies a http request. 
It could easily be extended to do other checks --- on the language of the page, for instance.""" return (self.is_url_a_http_request(url) and self.does_robots_txt_allow_crawling(conn,url)) def is_url_a_http_request(self,url): """Checks that url is for a http request, and not (say) ftp.""" u = urlparse(url) if u.scheme == "http": return True else: info_logger.info("%s: not a http request" % url) return False def does_robots_txt_allow_crawling(self,conn,url): """Check that robots.txt allows url to be crawled.""" cur = conn.cursor() cur.execute( "select robot_file_parser from robot_parser where domain='%s'" \ % domain(url)) rfp_db = cur.fetchone() if rfp_db: # we've crawled this domain before rfp = cPickle.loads(str(rfp_db[0])) else: # we've never crawled this domain rfp = robotparser.RobotFileParser() try: rfp.set_url("http://"+domain(url)+"/robots.txt") rfp.read() except: info_logger.info("%s: couldn't read robots.txt" % url) return False rfp_pickle = cPickle.dumps(rfp) cur = conn.cursor() cur.execute( "insert into robot_parser(domain,robot_file_parser) values(%s,%s)", (domain,rfp_pickle)) conn.commit() if rfp.can_fetch("*",url): return True else: info_logger.info("%s: robots.txt disallows fetching" % url) return False def add_to_index(self,conn,url,content,tree): title = get_title(tree) cur = conn.cursor() try: cur.execute("insert into pages (url,title,content) values(%s,%s,%s)", (url,title,content)) except UnicodeEncodeError: info_logger.info("%s: Couldn't store title %s" % (url,title)) conn.commit() def get_length(url,headers): """Attempts to find the length of a page, based on the "content-length" header. If the information is not available, then sets the length to MAX_LENGTH. Otherwise, sets it to the minimum of the "content-length" header and MAX_LENGTH.""" try: length = int(headers["content-length"]) except (KeyError,ValueError): info_logger.info( "%s: could not retrieve content-length" % url) length = MAX_LENGTH if length > MAX_LENGTH: info_logger.info( "%s: had length %s, truncating to %s" % (url,length,MAX_LENGTH)) length = MAX_LENGTH return length def get_title(tree): """Takes an lxml etree for a HTML document, and returns the text from the first (if any) occurrence of the title tag.""" x = tree.xpath("//title") if len(x) > 0 and x[0].text: return x[0].text else: return "" if __name__ == "__main__": main()
(Feedback on the code is welcome. I’m not an experienced Python programmer, and I know I could learn a lot from people with more experience. Aspects of this code that I suspect could be substantially improved with advice from the right person include the way I do error-handling and logging.)
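(One direction I’ve toyed with for the error handling – sketched below, and not part of the crawler above – is to pull the log-and-continue logic into a decorator, so individual crawl steps don’t each need their own try/except block. The decorator name and the fallback behaviour here are purely illustrative.)

import functools
import logging
import traceback

critical_logger = logging.getLogger('CriticalLogger')

def log_and_continue(func):
    """Run func, logging any unanticipated exception to
    critical_logger instead of letting the calling thread die."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            critical_logger.critical(
                "Error in %s:\n%s" % (func.__name__, traceback.format_exc()))
            return None
    return wrapper

@log_and_continue
def fetch_page(url):
    # download and parse url here, as in Crawler.run
    pass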
I used this code on a Slicehost 512 slice – a very lightweight Xen virtual private server, definitely not heavy machinery! Download speed maxed out at around 15 crawler threads, and so that’s the number of threads I used. The bottleneck appeared to be CPU (which was routinely in the range 50-80 percent), although, as we’ll see below, we were also using a substantial fraction of network capacity. Neither memory nor disk speed seemed to be an issue.
The crawler downloaded 5043 pages in 229 seconds, which is 22 pages per second. Put another way, that’s about 2 million pages per day. The total length of the downloaded pages was 386 million characters – around 370 megabytes, assuming UTF-8 encoding – which gives a sustained download rate of about 1.7 megabytes per second. Using wget alone I’m ordinarily able to get several times that (3-6 megabytes per second), and sometimes peak speeds of over 10 Mb/sec. This suggests that the bottleneck is not network speed, although, as we’ll see below, a substantial fraction of the program’s time is spent downloading.
I profiled the crawler using yappi, a multi-threaded Python profiler. (The standard Python profiler only copes with a single thread.) Here are some things I learnt, with a rough sketch of how yappi can be hooked up appearing after the list (note that all numbers are approximate):
- Networking – both downloading data and making connections – consumes 40 percent of time.
- A surprising fraction of that time (6-7 percent) seems to be spent on just opening the connection, using urllib2.urlopen. I’m not sure what’s taking the time – DNS lookup maybe?
- Another surprising aspect of networking is that dealing with urllib2 errors takes 5 percent of the time.
- Redis consumes about 20 percent of time. Well over half of that is spent in client.py._execute_command.
- The parser for robots.txt is surprisingly CPU intensive, consuming about 5 percent of time.
- urlparse consumes about 4 percent of the time.
- For reasons I don’t understand, lxml and etree don’t show up in the profiling results at all.
- append_urls consumes only 3-4 percent of the time.
- MySQL and logging each accounted for only about a quarter of a percent of the time. I must admit to being suspicious about the first of these, when compared with the Redis results. It’s true that the program makes far more calls to Redis than to MySQL – half a million or so calls to Redis, versus about 16,000 to MySQL – but this still seems wrong. Other possibilities that deserve further consideration: (1) I’ve got Redis poorly configured; or (2) Redis lists are slower than I thought.
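For reference, here is roughly how yappi gets hooked up around the crawler. This is a minimal sketch only: yappi.start() and yappi.stop() are the core calls, but the reporting functions have changed across yappi versions, so treat the final line as indicative rather than definitive.

import yappi

import crawler  # the crawler module listed above

yappi.start()   # profiles all threads, not just the main thread
crawler.main()  # run the crawl as usual
yappi.stop()

# Print per-function statistics.  Recent yappi releases use
# get_func_stats(); some older releases expose print_stats() instead.
yappi.get_func_stats().print_all()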
Many actions suggest themselves. Most significantly, I’ll probably eliminate Redis, and store the queue of urls to be crawled (the url frontier) using a combination of MySQL (for persistence) and in-memory Python queues (for speed). This is the right thing to do regardless of performance: Redis keeps the url frontier in memory, which severely limits how large the frontier can grow unless you spend a lot of money on memory, so moving to a disk-based store is the obvious step.
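To give a flavour of what I have in mind, here’s a minimal sketch of a disk-backed url frontier. It assumes a hypothetical MySQL table called frontier with a single url column – neither the table nor the class below is part of the current crawler – and, as with the per-thread Redis lists above, each crawler thread would get its own frontier instance.

from Queue import Queue, Empty

import MySQLdb

class DiskBackedFrontier(object):
    """Sketch of a url frontier: urls are persisted in MySQL, with a
    small in-memory buffer for speed."""

    def __init__(self, buffer_size=1000):
        self.conn = MySQLdb.connect("localhost", "root", "", "crawl")
        self.buffer = Queue(maxsize=buffer_size)

    def push(self, url):
        """Persist a url to the frontier table."""
        cur = self.conn.cursor()
        cur.execute("insert into frontier (url) values (%s)", (url,))
        self.conn.commit()

    def pop(self):
        """Return the next url, refilling the in-memory buffer from
        MySQL when it runs dry.  Returns None if the frontier is empty."""
        try:
            return self.buffer.get_nowait()
        except Empty:
            self._refill()
        try:
            return self.buffer.get_nowait()
        except Empty:
            return None

    def _refill(self):
        cur = self.conn.cursor()
        cur.execute("select url from frontier limit 1000")
        rows = cur.fetchall()
        for (url,) in rows:
            self.buffer.put(url)
            # Remove buffered urls so they aren't handed out twice.
            cur.execute("delete from frontier where url = %s", (url,))
        self.conn.commit()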
With some alterations it should be easy to make this crawler scalable, i.e., to make it run on a cluster, rather than a single machine. In particular, because the crawler is based on whitelisting of domains, all that’s needed is to start the crawl off by allocating different parts of the whitelist to different machines. Those sites can then be crawled in parallel, with no necessity for inter-machine communication.
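Concretely, the allocation could reuse the hashing trick that append_urls already uses to assign urls to threads, just one level up. A sketch, in which NUM_MACHINES and the helper function are hypothetical rather than part of the current code:

import json

NUM_MACHINES = 4  # hypothetical cluster size

def whitelist_for_machine(machine_id):
    """Return the subset of the domain whitelist that a given machine
    should crawl, by hashing domains across the cluster."""
    f = open("domain_whitelist.json")
    whitelist = json.load(f)
    f.close()
    return [d for d in whitelist if hash(d) % NUM_MACHINES == machine_id]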
Many natural follow-on questions suggest themselves. Are any of these top blogs serving spam, or otherwise compromised in some way? What fraction ban crawling by unknown crawlers (like this one)? What fraction of pages from the crawl are duplicates, or near-duplicates? Do any of the webpages contain honeypots, i.e., url patterns designed to trap or otherwise mislead crawlers? I plan to investigate these questions in future posts.
Interested in more? Please follow me on Twitter. My new book about open science, Reinventing Discovery will be published in October, and can be pre-ordered at Amazon.
Just one comment: threads in Python are different from threads in other programming languages. Python has the so-called GIL – the global interpreter lock – which prevents threads from running in parallel, so your program only ever runs one thread at a time. If you have access to a computer with more than one CPU, I would recommend using separate processes instead of threads.
Thanks for the comment.
Slicehost gives users just a single core (on a 512 Slice), so this is not an issue at present. But I will move this to a multi-core system at some point, and should probably rewrite it to use a mixture of threads and processes.
The problem with just using multiple processes is that (as I understand it) there is a significant memory overhead associated with using a large number of processes. And so it seems to make most sense to use threads when I can.
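Roughly, I imagine something like the sketch below: one process per core, each process running several crawler threads. The crawl_worker stub stands in for the Crawler class above; none of this is in the current code.

import multiprocessing
import threading

THREADS_PER_PROCESS = 15

def crawl_worker(process_id, thread_id):
    pass  # download and parse pages, as the Crawler class does

def run_crawler_threads(process_id):
    """Launch a handful of crawler threads inside one process."""
    threads = []
    for j in range(THREADS_PER_PROCESS):
        t = threading.Thread(target=crawl_worker, args=(process_id, j))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

if __name__ == "__main__":
    processes = []
    for i in range(multiprocessing.cpu_count()):
        p = multiprocessing.Process(target=run_crawler_threads, args=(i,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()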
What about running it on GPU cores instead of CPU cores? I know it’s still in early development, but you can use PyCuda to access the NVIDIA CUDA API. The latest NVIDIA cards have 512 GPU cores.
http://mathema.tician.de/software/pycuda
PyCuda looks cool – hadn’t seen it before. But I’m not sure how easy it would be to run Python itself (and all the libraries) on the GPU!
Well I was thinking that if your bottleneck is the CPU, you might be able to designate a GPU core per thread/process for the calculations. I’m an advocate of the GPU, if you can’t tell.
Have you seen Marc Seeger’s “Building blocks of a scalable webcrawler” (thesis)? – Marc’s blog entry about it: http://blog.marc-seeger.de/2010/12/09/my-thesis-building-blocks-of-a-scalable-webcrawler/
Thought it might be helpful.
Great blog btw.
Patrick
No, I hadn’t – thanks for the pointer!
I couldn’t find your test crawler code at GitHub – can you send me a copy? Thanks!