How to use a Lucene Analyzer to tokenize a String?

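Based on the answer above, this is slightly modified to work with Lucene 4.0.

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class LuceneUtil {

    private LuceneUtil() {}

    public static List<String> tokenizeString(Analyzer analyzer, String string) {
        List<String> result = new ArrayList<String>();
        try {
            TokenStream stream = analyzer.tokenStream(null, new StringReader(string));
            // reset() must be called before the first incrementToken()
            stream.reset();
            while (stream.incrementToken()) {
                result.add(stream.getAttribute(CharTermAttribute.class).toString());
            }
        } catch (IOException e) { …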

SQL Server 2008 Full Text Search (FTS) versus Lucene.NET

SQL Server FTS is going to be easier to manage for a small deployment. Since FTS is integrated with the database, the RDBMS handles updating the index automatically. The downside is that you don’t have an obvious scaling solution short of replicating databases. So if you don’t need to scale, SQL Server FTS is …

Difference between BooleanClause.Occur.MUST and BooleanClause.Occur.SHOULD in Lucene

BooleanClause.Occur.SHOULD means that the clause is optional, whereas BooleanClause.Occur.MUST means that the clause is compulsory. However, if a boolean query has only optional clauses, at least one clause must match for a document to appear in the results. For better control over which documents match a BooleanQuery, there is also a minimumShouldMatch parameter which lets …
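The matching rule described above can be sketched in plain Java. This is a toy model and not Lucene code: the class BooleanMatchSketch, its Clause type, and the helpers must() and should() are hypothetical names made up for illustration, and a document is reduced to a set of terms. It assumes that a minimumShouldMatch of 0 means "not set".

```java
import java.util.List;
import java.util.Set;

/** Toy model of BooleanQuery matching semantics (hypothetical, not the Lucene API). */
final class BooleanMatchSketch {

    enum Occur { MUST, SHOULD }

    static final class Clause {
        final String term;
        final Occur occur;

        Clause(String term, Occur occur) {
            this.term = term;
            this.occur = occur;
        }
    }

    static Clause must(String term) { return new Clause(term, Occur.MUST); }

    static Clause should(String term) { return new Clause(term, Occur.SHOULD); }

    /** Returns true if a document containing docTerms would match the clauses. */
    static boolean matches(Set<String> docTerms, List<Clause> clauses, int minimumShouldMatch) {
        if (clauses.isEmpty()) {
            return false; // assumption: a query with no clauses matches nothing
        }
        boolean hasMust = false;
        int shouldTotal = 0;
        int shouldHits = 0;
        for (Clause clause : clauses) {
            boolean hit = docTerms.contains(clause.term);
            if (clause.occur == Occur.MUST) {
                hasMust = true;
                if (!hit) {
                    return false; // a compulsory clause failed
                }
            } else {
                shouldTotal++;
                if (hit) {
                    shouldHits++;
                }
            }
        }
        // With only optional clauses, at least one of them has to match.
        int required = minimumShouldMatch;
        if (required == 0 && !hasMust && shouldTotal > 0) {
            required = 1;
        }
        return shouldHits >= required;
    }
}
```

In this model a SHOULD clause next to a MUST clause does not affect whether a document matches unless minimumShouldMatch is raised; in Lucene itself such a clause would still contribute to the score.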

ElasticSearch – Searching For Human Names

First, I recreated your current configuration in Play: https://www.found.no/play/gist/867785a709b4869c5543 If you go there, switch to the “Analysis” tab to see how the text is transformed. Note, for example, that Heaney ends up tokenized as [hn, heanei] with the search_analyzer and as [HN, heanei] with the index_analyzer. Note the difference in case for the metaphone term. Thus, that one is …

What is best and most active open source .Net search technology?

While there have been no ‘full blown’ releases (i.e. full documentation, web site updates) of Lucene.Net for quite some time, there are still fresh commits to its SVN repository. The latest release (2.3.2), for example, was tagged on 07/24/09 (see here). Since development is still active, I would use it for new full-text-search projects.

Lucene Score results

The scoring includes the Inverse Document Frequency (IDF). Suppose the term “John Smith” occurs 100 times in partition 0 and once in partition 1. The score for a search for John Smith would then be higher in partition 1, as the term is scarcer there. To get round this you would either have to have your …
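To make the effect concrete, here is a minimal sketch of the IDF term alone, using the formula from Lucene's classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)). The class name IdfSketch is hypothetical, and the partition size of 1000 documents is an assumption for illustration; the original answer does not say how large the partitions are.

```java
/** Sketch of the IDF factor only (hypothetical helper, not a full Lucene score). */
final class IdfSketch {

    /** Classic TF-IDF inverse document frequency: rarer terms get a larger value. */
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long numDocs = 1000; // assumed size of each partition

        double idfPartition0 = idf(100, numDocs); // term found in 100 documents
        double idfPartition1 = idf(1, numDocs);   // term found in 1 document

        System.out.println("partition 0 idf: " + idfPartition0);
        System.out.println("partition 1 idf: " + idfPartition1);
        System.out.println("partition 1 scores the term higher: " + (idfPartition1 > idfPartition0));
    }
}
```

Because each partition computes document frequencies from its own index, the same document can receive a different score depending on which partition holds it.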