web-crawler – Page 4

How do sites detect bots behind proxies or company networks

November 17, 2022 by Tarik

No, they’ll ban the public IP and everyone who is NAT’d to that IP will also be banned. Although at least At stack if we think we are going to ban a college or something like that we’ll reach out to their abuse contact to get them to track the offender down and stop the … Read more

Detecting ‘stealth’ web-crawlers

November 13, 2022 by Tarik

A while back, I worked with a smallish hosting company to help them implement a solution to this. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of IP addresses/ranges based on http://www.iplists.com/, which were then updated automatically … Read more

Get a list of URLs from a site [closed]

November 9, 2022 by Tarik

I didn’t mean to answer my own question but I just thought about running a sitemap generator. First one I found http://www.xml-sitemaps.com has a nice text output. Perfect for my needs.

How to find all links / pages on a website

November 5, 2022 by Tarik

Check out linkchecker—it will crawl the site (while obeying robots.txt) and generate a report. From there, you can script up a solution for creating the directory tree.

How to pass a user defined argument in scrapy spider

November 4, 2022 by Tarik

Spider arguments are passed in the crawl command using the -a option. For example: scrapy crawl myspider -a category=electronics -a domain=system Spiders can access arguments as attributes: class MySpider(scrapy.Spider): name=”myspider” def __init__(self, category=”, **kwargs): self.start_urls = [f’http://www.example.com/{category}’] # py36 super().__init__(**kwargs) # python3 def parse(self, response) self.log(self.domain) # system Taken from the Scrapy doc: http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments Update … Read more

how to detect search engine bots with php?

October 24, 2022 by Tarik

I use the following code which seems to be working fine: function _bot_detected() { return ( isset($_SERVER[‘HTTP_USER_AGENT’]) && preg_match(‘/bot|crawl|slurp|spider|mediapartners/i’, $_SERVER[‘HTTP_USER_AGENT’]) ); } update 16-06-2017 https://support.google.com/webmasters/answer/1061943?hl=en added mediapartners

Difference between BeautifulSoup and Scrapy crawler?

October 19, 2022 by Tarik

Scrapy is a Web-spider or web scraper framework, You give Scrapy a root URL to start crawling, then you can specify constraints on how many (number of) URLs you want to crawl and fetch,etc. It is a complete framework for web-scraping or crawling. While BeautifulSoup is a parsing library which also does a pretty good … Read more

TypeError: can’t use a string pattern on a bytes-like object in re.findall()

October 17, 2022 by Tarik

You want to convert html (a byte-like object) into a string using .decode, e.g. html = response.read().decode(‘utf-8’). See Convert bytes to a Python String

keep rsync from removing unfinished source files

October 14, 2022 by Tarik

It seems to me the problem is transferring a file before it’s complete, not that you’re deleting it. If this is Linux, it’s possible for a file to be open by process A and process B can unlink the file. There’s no error, but of course A is wasting its time. Therefore, the fact that … Read more

Finding the layers and layer sizes for each Docker image

October 8, 2022 by Tarik

Check out dive written in golang. Awesome tool!