How to write a crawler?

You’ll be reinventing the wheel, to be sure. But here are the basics:

- A list of unvisited URLs – seed this with one or more starting pages
- A list of visited URLs – so you don’t go around in circles
- A set of rules for URLs you’re not interested in – so you don’t index the …
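
To make that loop concrete, here is a minimal sketch in Python. It is an illustration under my own assumptions, not the post's code: the `crawl`/`EXCLUDE_PATTERNS` names and the naive regex link extraction are mine, and a production crawler would also honour robots.txt, rate limits, and a proper HTML parser.

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

# Rules for URLs we're not interested in (assumption: skip common binary types).
EXCLUDE_PATTERNS = [re.compile(r"\.(jpe?g|png|gif|pdf|zip)$", re.I)]

# Naive link extraction; good enough for a sketch.
LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.I)

def crawl(seeds, max_pages=100):
    unvisited = deque(seeds)  # list of unvisited URLs, seeded with starting pages
    visited = set()           # visited URLs, so we don't go around in circles
    while unvisited and len(visited) < max_pages:
        url = unvisited.popleft()
        if url in visited or any(p.search(url) for p in EXCLUDE_PATTERNS):
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                # Only follow links inside HTML pages.
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # unreachable host or malformed URL: skip it
        for link in LINK_RE.findall(html):
            absolute = urljoin(url, link)
            if absolute not in visited:
                unvisited.append(absolute)
    return visited

if __name__ == "__main__":
    for page in sorted(crawl(["https://example.com/"], max_pages=10)):
        print(page)
```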

Crawler vs scraper

A crawler fetches web pages: given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, which file types to ignore), it downloads whatever is linked from the starting point(s). A scraper takes pages that have been downloaded or, in a more general sense, …
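
As a hypothetical illustration of that division of labour, the sketch below scrapes pages a crawler has already downloaded. `TitleScraper` and `extract_titles` are names of my own invention, and pulling out headings is just a stand-in for whatever data you actually want.

```python
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Collects the text of every <h1>/<h2> in an already-downloaded page."""
    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2"):
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading and data.strip():
            self.titles.append(data.strip())

def extract_titles(html):
    scraper = TitleScraper()
    scraper.feed(html)
    return scraper.titles

if __name__ == "__main__":
    page = "<html><body><h1>Hello</h1><p>body text</p></body></html>"
    print(extract_titles(page))  # ['Hello']
```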

Detecting ‘stealth’ web-crawlers

A while back, I worked with a smallish hosting company to help them implement a solution to this. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of IP addresses/ranges based on http://www.iplists.com/, which were then updated automatically …
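
Here is a minimal sketch of that kind of log scan, under assumptions of my own: combined-format Apache logs, a flat per-file request threshold, and one hard-coded whitelist range standing in for the iplists.com data the post describes.

```python
import ipaddress
import re
from collections import Counter

THRESHOLD = 500  # requests per log window; tune to your traffic (assumption)
# Whitelisted crawler ranges; one example network standing in for iplists.com data.
WHITELIST = [ipaddress.ip_network(n) for n in ("66.249.64.0/19",)]

IP_RE = re.compile(r"^(\S+)")  # first field of a combined-format log line

def whitelisted(ip):
    # Assumes the first log field is an IP address, not a resolved hostname.
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in WHITELIST)

def offenders(log_path):
    hits = Counter()
    with open(log_path) as log:
        for line in log:
            m = IP_RE.match(line)
            if m:
                hits[m.group(1)] += 1
    return [ip for ip, n in hits.items() if n > THRESHOLD and not whitelisted(ip)]

if __name__ == "__main__":
    # Print (rather than run) a firewall rule per offender for review.
    for ip in offenders("/var/log/apache2/access.log"):
        print(f"iptables -A INPUT -s {ip} -j DROP")
```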