Search in html source with GOOGLE? [closed]

I’ve come across the following resources on my travels (some already mentioned above):

HTML Mark-up-focused search engines

I’d also like to throw in the following:

Huge, website crawl data archives

How can we analyze this crawl data?

For an idea of how to begin analyzing some of this massive data, take a look at Big Data/MapReduce-type framework(s).
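To make the map/reduce idea a bit more concrete, here is a minimal single-machine sketch in plain Python (standard library only). It is not how any particular framework does it; the names `TagCounter`, `map_page`, `reduce_counts` and `count_tags` are my own, and counting HTML tags is just an example task I picked. A real framework would run the map step on many machines and merge the partial results, but the shape of the computation is the same.

```python
from collections import Counter
from functools import reduce
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts the opening tags seen in one HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        self.counts[tag] += 1


def map_page(html):
    """Map step: one page in, one Counter of tag names out."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_tags(pages):
    """Apply the map step to every page, then fold the results together."""
    return reduce(reduce_counts, map(map_page, pages), Counter())
```

Calling `count_tags` on an iterable of HTML strings returns one `Counter` for the whole collection. Because `reduce_counts` is associative, the pages could be split into chunks, counted separately, and merged afterwards, which is the property these frameworks rely on.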

Google lists some ideas on using Apache’s Spark project to analyze Common Crawl’s dump(s). To understand the file format(s) used by Common Crawl, refer to the following:

The article, Accessing-Common-Crawl-Dataset-on-S3, outlines how to access Common Crawl’s 250TB+ dump(s) at low cost, without transferring that data outside of Amazon’s AWS/S3 network. Of course, that assumes you are going to use some combination of AWS/EC2/S3 etc. to analyze the crawl data.
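Since the original question was about searching HTML source, here is a rough sketch of what that could look like once you have a crawl file in hand. Common Crawl publishes its raw page captures as WARC files (gzip-compressed). The reader below is a deliberately simplified one written with the standard library only; it assumes well-formed records with a `Content-Length` header and is not a full implementation of the WARC specification. The names `iter_warc_records`, `find_pages_with_markup`, `open_warc` and `make_record` are my own, and the `example.com`/`example.org` URLs are made up for the demo.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in a decompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def find_pages_with_markup(stream, needle):
    """Return the target URIs of response records whose raw bytes contain needle.

    The record body holds the HTTP response headers followed by the page
    source, so the search covers both.
    """
    hits = []
    for headers, body in iter_warc_records(stream):
        if headers.get("warc-type") == "response" and needle in body:
            hits.append(headers.get("warc-target-uri"))
    return hits


def open_warc(path):
    """Open a local .warc.gz file; gzip reads all concatenated members."""
    return gzip.open(path, "rb")


def make_record(record_type, uri, payload):
    """Hypothetical helper: build one tiny WARC record for the demo below."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


if __name__ == "__main__":
    # In-memory sample instead of a real crawl file.
    sample = make_record(
        "response",
        "http://example.com/",
        b'HTTP/1.1 200 OK\r\n\r\n<div class="hero">hi</div>',
    ) + make_record(
        "response",
        "http://example.org/",
        b"HTTP/1.1 200 OK\r\n\r\n<p>plain</p>",
    )
    print(find_pages_with_markup(io.BytesIO(sample), b'class="hero"'))
```

With a real file you would pass `open_warc("some-file.warc.gz")` (a placeholder name) as the stream instead of the in-memory sample. For anything beyond experimenting, a dedicated WARC library is probably the safer choice, since real crawl data has edge cases this sketch ignores.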

Finally, Patrick Durusau maintains some interesting Common-Crawl-usage-related blog pages.

Personally, I find this subject intriguing. I suggest we get this crawl data while it’s HOT! 😉
