screen-scraping – Row Coding

HTML Scraping in PHP [duplicate]

November 27, 2023 by Tarik

Beautiful Soup cannot find a CSS class if the object has other classes, too

September 22, 2023 by Tarik

Unfortunately, BeautifulSoup treats this as a single class with a space in it, 'class1 class2', rather than two classes, ['class1', 'class2']. A workaround is to search for the class with a regular expression instead of a string. This works: soup.findAll(True, {'class': re.compile(r'\bclass1\b')})
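To see why the word-boundary pattern helps, here is a minimal sketch that applies the same regular expression to plain class attribute strings, without BeautifulSoup. The helper names class_pattern and has_class_exact are hypothetical and only for illustration.

```python
import re


def class_pattern(name):
    """Build the \\bNAME\\b pattern used in the workaround (hypothetical helper)."""
    return re.compile(r"\b" + re.escape(name) + r"\b")


def has_class_exact(attr_value, name):
    """Stricter alternative: compare against whitespace-separated tokens."""
    return name in attr_value.split()


pattern = class_pattern("class1")

# Matches when class1 is one of several classes.
multi = pattern.search("class1 class2") is not None

# Does not match a longer name that merely starts with class1.
longer = pattern.search("class10 class2") is not None

# Caveat: \b sees a hyphen as a boundary, so class1-wide also matches.
hyphenated = pattern.search("class1-wide") is not None
```

Because of the hyphen caveat, splitting the attribute on whitespace, as has_class_exact does, may be the safer test when class names contain hyphens.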

How to run multiple Tor processes at once with different exit IPs?

September 6, 2023 by Tarik

Create four torrc files, say /etc/tor/torrc.1 to /etc/tor/torrc.4. In each file, edit the lines

    SocksPort 9050
    ControlPort 9051
    DataDirectory /var/lib/tor

to use different resources for each torrc file, e.g. for torrc.1:

    SocksPort 9060
    ControlPort 9061
    DataDirectory /var/lib/tor1

for torrc.2:

    SocksPort 9062
    ControlPort 9063
    DataDirectory /var/lib/tor2

and so on. A configuration file containing only the … Read more
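The numbering scheme above can be generated rather than typed by hand. The following is a rough sketch, assuming the pattern continues past torrc.2 in the same way (two ports per instance, starting at 9060); the function names are hypothetical, and the files are written to a directory of your choosing rather than directly to /etc/tor.

```python
from pathlib import Path


def torrc_text(n):
    """Return the three per-instance lines for torrc.<n> (n starts at 1).

    Assumption: ports keep increasing by two per instance, as in the
    torrc.1 and torrc.2 examples.
    """
    socks_port = 9060 + 2 * (n - 1)
    control_port = socks_port + 1
    return (
        f"SocksPort {socks_port}\n"
        f"ControlPort {control_port}\n"
        f"DataDirectory /var/lib/tor{n}\n"
    )


def write_torrcs(directory, count=4):
    """Write torrc.1 .. torrc.<count> into directory and return their paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for n in range(1, count + 1):
        path = directory / f"torrc.{n}"
        path.write_text(torrc_text(n))
        paths.append(path)
    return paths
```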

Scrapy Python Set up User Agent

August 31, 2023 by Tarik

Move your USER_AGENT line to the settings.py file rather than your scrapy.cfg file. If you used the scrapy startproject command, settings.py should be at the same level as items.py; in your case it should be something like myproject/settings.py
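For illustration, the relevant line in myproject/settings.py might look like the sketch below. USER_AGENT is the Scrapy setting named above; the value shown is a made-up placeholder, not a recommended string.

```python
# myproject/settings.py (fragment)
# Placeholder value for illustration only; substitute your own string.
USER_AGENT = "myproject (+http://www.example.com)"
```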

How can I scrape an HTML table to CSV?

August 24, 2023 by Tarik

1. Select the HTML table in your tool’s UI and copy it to the clipboard (if that’s possible).
2. Paste it into Excel.
3. Save as a CSV file.

However, this is a manual solution, not an automated one.
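For an automated route, one possible sketch uses only the Python standard library: html.parser to collect cell text and csv to write it out. This is not part of the original answer. TableParser and table_to_csv are hypothetical names, and the sketch assumes a single, non-nested table with no rowspan or colspan.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _close_cell(self):
        # Finish the current cell, if one is open, and add it to the row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the table rows found in html as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

The csv module takes care of quoting, so a cell that contains a comma is wrapped in double quotes in the output.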

Simple Screen Scraping using jQuery

August 13, 2023 by Tarik

Use $.ajax to load the other page into a variable, then create a temporary element and use .html() to set the contents to the value returned. Loop through the element’s children of nodeType 1 and keep their first children’s nodeValues. If the external page is not on your web server you will need to proxy … Read more

Options for web scraping – C++ version only

August 13, 2023 by Tarik

Use libcurl to download the HTML file, libtidy to convert it to valid XML, and libxml to parse/navigate the XML.

Headless, scriptable Firefox/Webkit on linux? [closed]

August 13, 2023 by Tarik

What about PhantomJS?

How to scroll down with Phantomjs to load dynamic content

August 9, 2023 by Tarik

Found a way to do it and tried to adapt to your situation. I didn’t test the best way of finding the bottom of the page because I had a different context, but check the solution below. The thing here is that you have to wait a little for the page to load and javascript … Read more

Download image file from the HTML page source

July 27, 2023 by Tarik

Here is some code to download all the images from the supplied URL, and save them in the specified output folder. You can modify it to your own needs.

    """
    dumpimages.py
    Downloads all the images on the supplied URL, and saves them to the
    specified output file ("/test/" by default)

    Usage:
        python dumpimages.py http://example.com/ [output]
    … Read more
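Since the script above is cut off, here is a separate, minimal sketch of the same idea using only the standard library. It is not the original dumpimages.py: the names ImgSrcParser, find_image_urls and download_images are hypothetical, and the sketch only looks at the src attribute of img tags.

```python
import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_image_urls(html, base_url):
    """Return absolute image URLs found in html, resolved against base_url."""
    parser = ImgSrcParser()
    parser.feed(html)
    return [urljoin(base_url, src) for src in parser.sources]


def download_images(urls, out_dir):
    """Fetch each URL and save it in out_dir under its basename."""
    os.makedirs(out_dir, exist_ok=True)
    for url in urls:
        # Fall back to a generic name when the URL path has no basename.
        name = os.path.basename(urlsplit(url).path) or "image"
        urllib.request.urlretrieve(url, os.path.join(out_dir, name))
```

Note that two images with the same basename would overwrite each other in this sketch; a real script would probably want to de-duplicate file names.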