Building a web search engine [closed]

There are several parts to a search engine. Broadly speaking, in a hopelessly general manner (folks, feel free to edit if you feel you can add better descriptions, links, etc): The crawler. This is the part that goes through the web, grabs the pages, and stores information about them into some central data store. In … Read more
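To make the crawler description concrete, here is a minimal sketch of that loop: fetch a page, extract its links, and record something about it in a store. The names `LinkParser`, `crawl`, and `FAKE_WEB` are illustrative assumptions, not anything from the answer, and a plain dictionary stands in for the "central data store". The `fetch` callable is injected so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    store = {}              # stand-in for the central data store
    queue = deque([seed])   # pages waiting to be fetched
    queued = {seed}         # URLs already scheduled, to avoid repeats
    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        # Resolve relative links against the page they were found on.
        links = [urljoin(url, href) for href in parser.links]
        store[url] = {"length": len(html), "links": links}
        for link in links:
            if link not in queued:
                queued.add(link)
                queue.append(link)
    return store


# Hypothetical in-memory "web" so the sketch runs offline.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a>',
    "http://example.test/b": "no links here",
}

pages = crawl("http://example.test/", FAKE_WEB.get)
```

A real crawler would also need politeness delays, robots.txt handling, and persistent storage, none of which are shown here.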

ElasticSearch – Searching For Human Names

First, I recreated your current configuration in Play: https://www.found.no/play/gist/867785a709b4869c5543 If you go there, switch to the “Analysis” tab to see how the text is transformed. Note, for example, that Heaney ends up tokenized as [hn, heanei] with the search_analyzer and as [HN, heanei] with the index_analyzer. Note the case difference for the metaphone term. Thus, that one is … Read more
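The effect of that case difference can be illustrated without Elasticsearch at all. The sketch below is plain Python, not Elasticsearch code: it assumes only that a search term matches an indexed term when the two strings are exactly equal, and it reuses the token lists quoted above. The helper `matching_terms` is a hypothetical name.

```python
# Tokens quoted in the answer for the name "Heaney".
index_tokens = ["HN", "heanei"]    # produced by the index_analyzer
search_tokens = ["hn", "heanei"]   # produced by the search_analyzer


def matching_terms(indexed, searched):
    """Return the search terms that exactly equal some indexed term."""
    indexed_set = set(indexed)
    return [term for term in searched if term in indexed_set]


# As configured: the metaphone term differs in case, so only the stem matches.
as_configured = matching_terms(index_tokens, search_tokens)

# If both sides were lowercased consistently, both terms would match.
lowercased = matching_terms([t.lower() for t in index_tokens],
                            [t.lower() for t in search_tokens])

print(as_configured)  # ['heanei']
print(lowercased)     # ['hn', 'heanei']
```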

Are search engines going to see my dynamically created content in Bootstrap tabs?

No, we (Google) won’t see the content behind tabs iff the content under the tab is dynamically generated (i.e. not just hidden). You can also see what we “see” using Fetch as Google in Search Console (formerly Webmaster Tools); read more about the feature in our post titled Rendering pages with Fetch as Google.

How reliable is ElasticSearch as a primary datastore against factors like write loss, data availability

Short answer: it depends on your use case, but you probably don’t want to use it as a primary store. Longer answer: You should really understand all of the possible issues that can come up around resiliency and data loss. Elastic has some great documentation of these issues which you should really understand before using … Read more

Designing a web crawler

If you want a detailed answer, take a look at section 3.8 of this paper, which describes the URL-seen test of a modern scraper: In the course of extracting links, any Web crawler will encounter multiple links to the same document. To avoid downloading and processing a document multiple times, a URL-seen test must … Read more
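As a rough illustration of the idea, a URL-seen test can be sketched as a set of fingerprints of canonicalized URLs. This is a minimal in-memory sketch, not the design from the paper: the class name `UrlSeen`, the canonicalization rules, and the choice of a truncated SHA-1 digest are all assumptions made for the example.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


class UrlSeen:
    """Remembers which URLs have already been encountered."""

    def __init__(self):
        self._fingerprints = set()

    @staticmethod
    def canonical(url):
        """Normalize a URL so trivially different spellings compare equal."""
        parts = urlsplit(url)
        path = parts.path or "/"
        # Lowercase scheme and host, drop the fragment. Assumes the URL
        # carries no case-sensitive userinfo in the netloc.
        return urlunsplit(
            (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
        )

    def check_and_add(self, url):
        """Return True if the URL was seen before; otherwise record it."""
        data = self.canonical(url).encode("utf-8")
        fingerprint = hashlib.sha1(data).digest()[:8]  # compact 64-bit key
        if fingerprint in self._fingerprints:
            return True
        self._fingerprints.add(fingerprint)
        return False


seen = UrlSeen()
first = seen.check_and_add("HTTP://Example.com/page#top")   # new URL
second = seen.check_and_add("http://example.com/page")      # same document
third = seen.check_and_add("http://example.com/other")      # different path
```

Storing fixed-size fingerprints rather than full URL strings keeps memory use predictable, at the cost of a small chance of collisions; a crawler at web scale would also need to spill this set to disk.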

Search in html source with GOOGLE? [closed]

I’ve come across the following resources on my travels (some already mentioned above). HTML mark-up-focused search engines: Nerdydata. I’d also like to throw in the following. Huge website crawl data archives: Common Crawl – ‘years of free web page data to help change the world’ (250TB+). How can we analyze this crawl data? For … Read more