screen-scraping
Beautiful Soup cannot find a CSS class if the object has other classes, too
Unfortunately, BeautifulSoup treats this as a single class with a space in it, 'class1 class2', rather than two classes, ['class1', 'class2']. A workaround is to search for the class with a regular expression instead of a string. This works: soup.findAll(True, {'class': re.compile(r'\bclass1\b')})
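To see why the regular expression succeeds where the plain string fails, here is a minimal sketch that uses only the standard `re` module and does not call Beautiful Soup itself. The helper name `has_class` is hypothetical and exists only for illustration.

```python
import re

# Same pattern as in the workaround above: "class1" delimited by word boundaries.
class_pattern = re.compile(r"\bclass1\b")


def has_class(class_attr, pattern=class_pattern):
    """Return True if the pattern occurs anywhere in the raw class attribute string."""
    return pattern.search(class_attr) is not None


# An exact string comparison fails when the element carries more than one class...
print("class1 class2" == "class1")   # False
# ...while the regex finds class1 regardless of its position in the attribute.
print(has_class("class1 class2"))    # True
print(has_class("class2 class1"))    # True
# \b keeps longer names such as class10 from matching.
print(has_class("class10"))          # False
```

One caveat: `\b` treats a hyphen as a boundary, so a class such as `class1-wide` would also match this pattern.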
How to run multiple Tor processes at once with different exit IPs?
Create four torrc files, say /etc/tor/torrc.1 to /etc/tor/torrc.4. In each file, edit the lines:
SocksPort 9050
ControlPort 9051
DataDirectory /var/lib/tor
to use different resources for each torrc file, e.g. for torrc.1:
SocksPort 9060
ControlPort 9061
DataDirectory /var/lib/tor1
for torrc.2:
SocksPort 9062
ControlPort 9063
DataDirectory /var/lib/tor2
and so on. A configuration file containing only the …
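The numbering scheme above can be generated rather than typed by hand. The following is a sketch only: the function names `render_torrc` and `write_torrcs` are hypothetical, the port arithmetic is inferred from the two examples given (9060/9061, 9062/9063), and a real torrc may need further lines that the excerpt does not show.

```python
import os

BASE_SOCKS_PORT = 9060  # SocksPort used for torrc.1 in the example above


def render_torrc(n):
    """Return the three per-instance lines for torrc.<n> (n starts at 1)."""
    socks_port = BASE_SOCKS_PORT + 2 * (n - 1)
    return (
        f"SocksPort {socks_port}\n"
        f"ControlPort {socks_port + 1}\n"
        f"DataDirectory /var/lib/tor{n}\n"
    )


def write_torrcs(directory, count=4):
    """Write torrc.1 .. torrc.<count> into directory and return their paths."""
    paths = []
    for n in range(1, count + 1):
        path = os.path.join(directory, f"torrc.{n}")
        with open(path, "w") as fh:
            fh.write(render_torrc(n))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Preview the generated content without touching /etc/tor.
    print(render_torrc(1))
```

Each instance could then be started with its own file, for example `tor -f /etc/tor/torrc.1`.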
Scrapy Python Set up User Agent
Put your USER_AGENT line in the settings.py file, not in your scrapy.cfg file. If you used the scrapy startproject command, settings.py should be at the same level as items.py; in your case it should be something like myproject/settings.py.
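As a rough illustration, the relevant line in settings.py might look like the fragment below. The user agent string itself is a made-up placeholder, not a recommended value.

```python
# myproject/settings.py ("myproject" is the placeholder project name used above)

# Scrapy reads USER_AGENT from the project settings module, not from scrapy.cfg.
# The value below is a hypothetical example; substitute your own string.
USER_AGENT = "myproject-bot/0.1 (+http://example.com/contact)"
```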
How can I scrape an HTML table to CSV?
Select the HTML table in your tool’s UI and copy it to the clipboard (if that’s possible). Paste it into Excel. Save it as a CSV file. However, this is a manual solution, not an automated one.
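For an automated route, one possible sketch uses only Python's standard library (`html.parser` and `csv`). This is not the approach from the answer above; the names `TableParser` and `table_to_csv` are hypothetical, and the parser assumes a simple table without nested tables or rowspan/colspan handling.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the table rows found in html as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


sample = "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>x, y</td></tr></table>"
print(table_to_csv(sample))
```

The `csv` writer quotes any cell that contains a comma, so cell text survives the round trip into a spreadsheet.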
Simple Screen Scraping using jQuery
Use $.ajax to load the other page into a variable, then create a temporary element and use .html() to set its contents to the returned value. Loop through the element’s children of nodeType 1 and keep their first children’s nodeValues. If the external page is not on your web server you will need to proxy …
Options for web scraping – C++ version only
libcurl to download the HTML file
libtidy to convert it to valid XML
libxml to parse/navigate the XML
Headless, scriptable Firefox/Webkit on linux? [closed]
What about phantomjs?
How to scroll down with Phantomjs to load dynamic content
Found a way to do it and tried to adapt it to your situation. I didn’t test the best way of finding the bottom of the page because I had a different context, but check the solution below. The thing here is that you have to wait a little for the page to load and JavaScript …
Download image file from the HTML page source
Here is some code to download all the images from the supplied URL and save them in the specified output folder. You can modify it to your own needs.
"""
dumpimages.py
Downloads all the images on the supplied URL, and saves them to the specified output file ("/test/" by default)
Usage: python dumpimages.py http://example.com/ [output] …
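The script above is cut off, so the following is a separate minimal sketch of the same idea using only the standard library, not a reconstruction of dumpimages.py. The names `ImageSrcParser`, `image_urls` and `download_images` are hypothetical, and the download step assumes the page is UTF-8 and that each image URL ends in a usable file name.

```python
import os
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class ImageSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(html, base_url):
    """Return absolute image URLs found in html, resolved against base_url."""
    parser = ImageSrcParser()
    parser.feed(html)
    parser.close()
    return [urllib.parse.urljoin(base_url, src) for src in parser.sources]


def download_images(page_url, out_dir):
    """Fetch page_url and save every image it references into out_dir."""
    with urllib.request.urlopen(page_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    os.makedirs(out_dir, exist_ok=True)
    for url in image_urls(html, page_url):
        # Fall back to a generic name when the URL path has no file component.
        name = os.path.basename(urllib.parse.urlsplit(url).path) or "image"
        urllib.request.urlretrieve(url, os.path.join(out_dir, name))
```

Calling `download_images("http://example.com/", "images")` would attempt the full fetch; `image_urls` can be tried on its own against any HTML string without network access.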