Free large datasets to experiment with Hadoop
A few points about your question regarding crawling and Wikipedia. You have linked to the Wikipedia data dumps, and you can use the Cloud9 project from UMD to work with this data in Hadoop. They have a page on this: Working with Wikipedia. Another data source to add to the list is ClueWeb09 – 1 billion webpages …