web-crawler – Page 2

Python: Disable images in Selenium Google ChromeDriver

May 26, 2023 by Tarik

Here is another way to disable images:

    from selenium import webdriver

    chrome_options = webdriver.ChromeOptions()
    # In Chrome's content settings, 2 means "block"
    prefs = {"profile.managed_default_content_settings.images": 2}
    chrome_options.add_experimental_option("prefs", prefs)
    # Newer Selenium releases expect options=; chrome_options= is the deprecated spelling
    driver = webdriver.Chrome(options=chrome_options)

I found it here: http://nullege.com/codes/show/src@o@s@osintstalker-HEAD@fbstalker1.py/56/selenium.webdriver.ChromeOptions.add_experimental_option

Change IP address dynamically?

May 22, 2023 by Tarik

An approach using Scrapy makes use of two components, RandomProxy and RotateUserAgentMiddleware. Insert the new components into DOWNLOADER_MIDDLEWARES in settings.py as follows:

    # Scrapy 1.0+ module paths (older releases used scrapy.contrib.downloadermiddleware.*)
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
        'tutorial.randomproxy.RandomProxy': 100,
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'tutorial.spiders.rotate_useragent.RotateUserAgentMiddleware': 400,
    }

Random Proxy

You can use scrapy-proxies. This component will process Scrapy requests …
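As a rough illustration of what a user-agent rotating middleware might look like, here is a minimal sketch. It is not the code from the original answer: the user-agent strings are made up, and FakeRequest is a hypothetical stand-in so the example runs without Scrapy installed. The only Scrapy convention it relies on is the process_request(request, spider) hook of downloader middlewares.

```python
import random


class RotateUserAgentMiddleware:
    """Downloader-middleware sketch: set a random User-Agent on each request."""

    # Hypothetical pool; replace with the user agents you actually want to send.
    USER_AGENTS = [
        "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/3.0",
    ]

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Returning None lets the request continue through the other middlewares.
        return None


class FakeRequest:
    """Stand-in for a Scrapy request so the sketch runs on its own (assumption)."""

    def __init__(self):
        self.headers = {}


request = FakeRequest()
RotateUserAgentMiddleware().process_request(request, spider=None)
print(request.headers["User-Agent"])
```

Disabling the built-in UserAgentMiddleware (the None entry in the settings) keeps it from overwriting the header chosen here.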

Detect Search Crawlers via JavaScript

May 21, 2023 by Tarik

This is the regex that the Ruby user-agent library agent_orange uses to test whether a userAgent looks like a bot. You can narrow it down for specific bots by referencing the bot userAgent list here:

    /bot|crawler|spider|crawling/i

For example, if you have some object, util.browser, you can store what type of device a user is on: util.browser …
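The same pattern can be tried out in Python. This is only a sketch of the matching logic, not the agent_orange implementation; the function name looks_like_bot is made up for illustration, and the user-agent strings in the usage lines are examples.

```python
import re

# Same alternation as the regex above; re.IGNORECASE plays the role of the /i flag.
BOT_PATTERN = re.compile(r"bot|crawler|spider|crawling", re.IGNORECASE)


def looks_like_bot(user_agent):
    """Return True if the user-agent string contains a bot-like keyword."""
    return BOT_PATTERN.search(user_agent or "") is not None


googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
firefox = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
print(looks_like_bot(googlebot), looks_like_bot(firefox))
```

Because this is a plain substring match, any user agent that happens to contain "bot" will also be flagged, so treat the result as a hint rather than proof.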

How do I make a simple crawler in PHP? [closed]

May 18, 2023 by Tarik

Meh. Don’t parse HTML with regexes.

Here’s a DOM version inspired by Tatu’s:

    <?php
    function crawl_page($url, $depth = 5)
    {
        static $seen = array();
        if (isset($seen[$url]) || $depth === 0) {
            return;
        }

        $seen[$url] = true;

        $dom = new DOMDocument('1.0');
        @$dom->loadHTMLFile($url);

        $anchors = $dom->getElementsByTagName('a');
        foreach ($anchors as $element) {
            $href = $element->getAttribute('href');
            if (0 !== …

How to write a crawler?

May 17, 2023 by Tarik

You’ll be reinventing the wheel, to be sure. But here are the basics:

- A list of unvisited URLs – seed this with one or more starting pages
- A list of visited URLs – so you don’t go around in circles
- A set of rules for URLs you’re not interested in – so you don’t index the …
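The three pieces above can be sketched in a few lines of Python. To keep the example runnable offline, it crawls a hypothetical in-memory link graph (FAKE_WEB) instead of fetching real pages; fetch_links and is_interesting are illustrative names, and the rules shown (stay on one host, skip images) are just one possible choice.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": page URL -> links found on that page.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/", "http://example.com/logo.png"],
    "http://example.com/b": ["http://other.example.org/", "http://example.com/a"],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])


def is_interesting(url, allowed_host="example.com"):
    """Rule set: stay on one host and skip obvious image files."""
    parsed = urlparse(url)
    is_image = parsed.path.endswith((".png", ".jpg", ".gif"))
    return parsed.netloc == allowed_host and not is_image


def crawl(seeds):
    unvisited = deque(seeds)  # URLs still to fetch
    visited = set()           # URLs already fetched
    order = []                # visit order, kept only for inspection
    while unvisited:
        url = unvisited.popleft()
        if url in visited or not is_interesting(url):
            continue
        visited.add(url)
        order.append(url)
        unvisited.extend(fetch_links(url))
    return order


pages = crawl(["http://example.com/"])
print(pages)
```

Using a queue gives a breadth-first crawl; swapping popleft() for pop() would make it depth-first.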

How to do HTTP-request/call with JSON payload from command-line?

May 15, 2023 by Tarik

You could use wget as well:

    wget -O- --post-data='{"some data to post..."}' \
      --header='Content-Type:application/json' \
      'http://www.example.com:9000/json'

Calling wget with the option -O and giving it - as its value (the space in between is ignored, so it can also be written as -O -) causes wget to output the HTTP response directly to …
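If wget is not available, roughly the same request can be put together with Python's standard library. This is a sketch, not part of the original answer: build_json_post is a made-up helper name, the payload is a placeholder, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_json_post(url, payload):
    """Build (but do not send) a POST request carrying a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_json_post("http://www.example.com:9000/json", {"some": "data"})
# To actually send it and read the response body:
#     urllib.request.urlopen(request).read()
print(request.get_method(), request.full_url)
```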

Python: maximum recursion depth exceeded while calling a Python object

May 12, 2023 by Tarik

Python doesn’t have great support for recursion because of its lack of TRE (Tail Recursion Elimination). This means that each call to your recursive function creates a new stack frame, and because there is a limit on stack depth (1000 by default), which you can check with sys.getrecursionlimit (of course you …
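A small sketch makes the limit visible. The two count_down functions are made up for illustration: the recursive one goes past whatever sys.getrecursionlimit() reports and fails, while the loop version does the same work with constant stack depth.

```python
import sys


def count_down_recursive(n):
    """Recursive version: uses one stack frame per call."""
    if n == 0:
        return 0
    return count_down_recursive(n - 1)


def count_down_iterative(n):
    """Same result with a loop, so the stack depth stays constant."""
    while n > 0:
        n -= 1
    return n


limit = sys.getrecursionlimit()

try:
    count_down_recursive(limit + 100)
    hit_limit = False
except RecursionError:
    # Raised once the call depth exceeds the interpreter's limit.
    hit_limit = True

result = count_down_iterative(limit + 100)
print(limit, hit_limit, result)
```

Rewriting the recursion as a loop is usually a safer fix than raising the limit with sys.setrecursionlimit.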

Anyone know of a good Python based web crawler that I could use?

May 11, 2023 by Tarik

Mechanize is my favorite; great high-level browsing capabilities (super-simple form filling and submission).

Twill is a simple scripting language built on top of Mechanize.

BeautifulSoup + urllib2 also works quite nicely.

Scrapy looks like an extremely promising project; it’s new.

Click a Button in Scrapy

March 24, 2023 by Tarik

Scrapy cannot interpret JavaScript. If you absolutely must interact with the JavaScript on the page, you want to be using Selenium. If using Scrapy, the solution to the problem depends on what the button is doing. If it’s just showing content that was previously hidden, you can scrape the data without a problem; it doesn’t …

getting Forbidden by robots.txt: scrapy

March 23, 2023 by Tarik

In the new version (Scrapy 1.1), launched 2016-05-11, the crawler downloads robots.txt before crawling. To change this behavior, set ROBOTSTXT_OBEY in your settings.py:

    ROBOTSTXT_OBEY = False

This change is described in the Scrapy 1.1 release notes.
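To see why a request ends up forbidden, it can help to check a URL against robots.txt rules by hand. The sketch below uses Python's standard urllib.robotparser, not Scrapy's own robots.txt handling; the rules, the "mybot" user agent, and the URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory rather than fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(useragent, url) reports whether the rules permit the request.
allowed = parser.can_fetch("mybot", "http://example.com/index.html")
blocked = parser.can_fetch("mybot", "http://example.com/private/data.html")
print(allowed, blocked)
```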