search-engine – Row Coding

Building a web search engine [closed]

November 25, 2023 by Tarik

There are several parts to a search engine. Broadly speaking, in a hopelessly general manner (folks, feel free to edit if you feel you can add better descriptions, links, etc): The crawler. This is the part that goes through the web, grabs the pages, and stores information about them into some central data store. In … Read more
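The crawler role described above can be sketched in a few lines. This is a minimal illustration, not a production design: the `crawl` function, the `LinkExtractor` class, and the in-memory `FAKE_WEB` dictionary are all hypothetical names introduced here, and a plain dict stands in for the "central data store".

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: html}, standing in for the data store."""
    store = {}
    queue = deque([start_url])
    while queue and len(store) < max_pages:
        url = queue.popleft()
        if url in store:
            continue  # already grabbed this page
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched
        store[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        queue.extend(link for link in parser.links if link not in store)
    return store


# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://a.example/">a</a>',
    "http://c.example/": "no links here",
}

pages = crawl("http://a.example/", FAKE_WEB.get)
```

Passing `fetch` as a parameter keeps the traversal logic separate from the network layer, so a real HTTP client could presumably be swapped in without touching the loop.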

ElasticSearch – Searching For Human Names

September 9, 2023 by Tarik

First, I recreated your current configuration in Play: https://www.found.no/play/gist/867785a709b4869c5543 If you go there, switch to the “Analysis” tab to see how the text is transformed. Note, for example, that Heaney ends up tokenized as [hn, heanei] with the search_analyzer and as [HN, heanei] with the index_analyzer. Note the case difference for the metaphone term. Thus, that one is … Read more

Search engine Lucene vs Database search

September 6, 2023 by Tarik

I suggest you read Full Text Search Engines vs. DBMS. A one-liner would be: If the bulk of your use case is full text search, use Lucene. If the bulk of your use case is joins and other relational operations, use a database. You may use a hybrid solution for a more complicated use case.

Is there a search engine that support regular expression search? [closed]

June 13, 2023 by Tarik

Let me repeat here an answer from the superuser.com question, since I agree completely with its author. Quoting from Ask Metafilter: The only possible way to make keyword searching efficient over hundreds of terabytes (or whatever their index is up to these days) is to precompute an index of words. In fact a … Read more
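To make the "precompute an index of words" idea concrete, here is a toy inverted index. It is only a sketch under simple assumptions: the function names (`tokenize`, `build_index`, `search`) and the sample documents are invented for illustration, and real engines use far more elaborate structures. The point it shows is that a keyword lookup is a dictionary access, whereas an arbitrary regular expression has no single key to look up.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample corpus.
docs = {1: "The quick brown fox", 2: "The lazy dog", 3: "Quick dog"}
index = build_index(docs)
hits = search(index, "quick")
```

The index is built once, up front; each query then costs roughly one lookup per query word plus a set intersection, independent of how much raw text was indexed.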

Are search engines going to see my dynamically created content in Bootstrap tabs?

June 8, 2023 by Tarik

No, we (Google) won’t see the content behind tabs if and only if the content under the tab is dynamically generated (i.e. not just hidden). You can also see what we “see” using Fetch as Google in Search Console (formerly Webmaster Tools); read more about the feature in our post titled Rendering pages with Fetch as Google.

Is there a good indexing / search engine for Node.js? [closed]

May 14, 2023 by Tarik

Just an update to my earlier answer – since there was so much discussion I didn’t want this update to get lost. You can download it here:

How reliable is ElasticSearch as a primary datastore against factors like write loss, data availability

March 3, 2023 by Tarik

Short answer: it depends on your use case, but you probably don’t want to use it as a primary store. Longer answer: You should really understand all of the possible issues that can come up around resiliency and data loss. Elastic has some great documentation of these issues which you should really understand before using … Read more

Designing a web crawler

February 27, 2023 by Tarik

If you want a detailed answer, take a look at section 3.8 of this paper, which describes the URL-seen test of a modern scraper: In the course of extracting links, any Web crawler will encounter multiple links to the same document. To avoid downloading and processing a document multiple times, a URL-seen test must … Read more
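A URL-seen test of the kind mentioned above can be approximated as follows. This is not the paper's method; it is a minimal sketch that assumes a simple normalization (lowercased host, empty path treated as "/", fragment dropped) and stores a truncated SHA-1 fingerprint of each URL in an in-memory set. The names `normalize` and `UrlSeenTest` are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = parts.path or "/"
    # The fragment is dropped: it never reaches the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class UrlSeenTest:
    """Remembers fixed-size fingerprints of URLs instead of the full strings."""

    def __init__(self):
        self._fingerprints = set()

    def check_and_add(self, url):
        """Return True if url was seen before; otherwise record it and return False."""
        digest = hashlib.sha1(normalize(url).encode("utf-8")).digest()
        fingerprint = digest[:8]  # assumption: 8 bytes is enough for this sketch
        if fingerprint in self._fingerprints:
            return True
        self._fingerprints.add(fingerprint)
        return False


seen = UrlSeenTest()
first = seen.check_and_add("http://Example.com")
second = seen.check_and_add("http://example.com/#top")
```

Here `first` is `False` (the URL is new) and `second` is `True`, because both spellings normalize to the same string. Truncated fingerprints save memory at the cost of a small chance of collisions, which would cause a page to be skipped.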

Search in html source with GOOGLE? [closed]

February 21, 2023 by Tarik

I’ve come across the following resources on my travels (some already mentioned above). HTML mark-up-focused search engines: Nerdydata. I’d also like to throw in the following. Huge website crawl data archives: Common Crawl – ‘years of free web page data to help change the world’ (over 250TB+). How can we analyze this crawl data? For … Read more

What does percolator mean/do in elasticsearch?

February 11, 2023 by Tarik

What you usually do is index documents and get them back by querying. What the percolator allows you to do in a nutshell is index your queries and percolate documents against the indexed queries to know which queries they match. It’s also called reversed search, as what you do is the opposite to what you … Read more
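The "index your queries, then percolate documents against them" idea can be illustrated with a toy model. This is not the Elasticsearch API; `ToyPercolator` and its methods are hypothetical, and a query here is simply a set of words that must all appear in the document.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyPercolator:
    """Stores queries up front and reports which ones an incoming document matches."""

    def __init__(self):
        self._queries = {}

    def register(self, query_id, query_text):
        """Index a query: remember the set of terms it requires."""
        self._queries[query_id] = set(tokenize(query_text))

    def percolate(self, document):
        """Return the ids of all registered queries that the document satisfies."""
        doc_terms = set(tokenize(document))
        return sorted(
            query_id
            for query_id, terms in self._queries.items()
            if terms and terms <= doc_terms
        )


percolator = ToyPercolator()
percolator.register("es-alerts", "elasticsearch outage")
percolator.register("lucene-news", "lucene release")
matches = percolator.percolate("Elasticsearch outage reported after Lucene upgrade")
```

In this run only `"es-alerts"` matches, since the document contains "lucene" but not "release". The roles are reversed compared with ordinary search: the queries are the stored data and the document is the input.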