Lucene.net: the main concepts
Published: 2019-06-27

In the previous post you learnt how to get a copy of Lucene.net and where to go in order to look for more information. As you noticed, the documentation is far from complete and not easy to read. So in this post I’ll write about the main concepts behind Lucene.net and the main steps in developing a solution based on it.

Some of the main concepts

Before looking at the development phases, it’s important to have a look at the main actors of Lucene.net.

Directory

The directory is where Lucene indexes are stored: it can be a physical folder on the filesystem (FSDirectory) or an area in memory where files are stored (RAMDirectory). The index structure is compatible with all ports of Lucene, so you could also have the indexing done with .NET and searched with Java, or the other way around (obviously, using the filesystem directory).

IndexWriter

This component is responsible for the management of the indexes. It creates indexes, adds documents to them and optimizes them.

Analyzer

This is where the complexity of the indexing resides. In a few words, the analyzer contains the policy for extracting index terms from the text. There are several analyzers available, both in the core library and in the contrib project, and the Java version has even more analyzers that have not been ported to .NET yet.

Probably the analyzer you’ll use the most is the StandardAnalyzer, which tokenizes the text based on European-language grammars, sets everything to lowercase and removes English stopwords.
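To make the three steps concrete (tokenize, lowercase, drop stopwords), here is a minimal Python sketch. It is not Lucene.net code and it is far simpler than the real StandardAnalyzer: the names `analyze` and `STOPWORDS`, the regular expression and the short stopword list are all assumptions made up for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption); real lists are longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def analyze(text):
    """Split text into lowercase alphanumeric tokens and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]

print(analyze("The Main Concepts of Lucene.net"))
# prints ['main', 'concepts', 'lucene', 'net']
```

Note that this toy tokenizer splits "Lucene.net" on the dot; a grammar-based tokenizer can make smarter decisions about such tokens.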

Another interesting analyzer is the SnowballAnalyzer, which works exactly like the standard one, but adds one more step at the end: the stemming phase, using the Snowball stemmers. Stemming is the process of reducing inflected words to their root. For example, if you are looking for “developing”, you are probably also interested in the words “developed”, “develop” or “developer”. During the indexing phase, the stemming process normalizes all these inflected words to their root “develop”, and it does the same when querying the index (if you search for “development” it will search for “develop”). Obviously this is tied to the language of the text, so the snowball analyzer comes with many different “grammars” for that.
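The effect of stemming can be sketched in a few lines of Python. This is a deliberately naive suffix stripper, not the Snowball algorithm: the function name `naive_stem`, the suffix list and the minimum root length of three characters are assumptions chosen only so that the example words above collapse to the same root.

```python
# Suffixes are checked in this order; longer ones come first.
SUFFIXES = ("ment", "ing", "ed", "er", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping a root of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["develop", "developing", "developed", "developer", "development"]:
    print(word, "->", naive_stem(word))
# every line ends with "-> develop"
```

Because the same function is applied at index time and at query time, a search for “development” ends up looking for the same term that was stored for “developing”. A real stemmer handles many cases this sketch gets wrong.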

Document and Fields

A document is a single entity that is put into the index. It contains many fields which are, like in a database, the individual pieces of information that make up a document. Different fields can be indexed using different algorithms and analyzers. For example, you might just want to store the document id, without being able to search on it. But you want to be able to search by tags as single keywords and, finally, you want to index the body of the blog post for full text search (thus using the Analyzer and the tokenizers).

Since this is an important topic, I’ll write a more in depth post in the future.
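Until then, the three cases just described (stored only, indexed as a single keyword, analyzed for full text search) can be modelled with a small Python sketch. The `Field` dataclass and the `index_terms` helper are hypothetical names invented for this illustration; in Lucene.net the equivalent choices are expressed through the Field.Store and Field.Index options.

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    value: str
    stored: bool = True    # value can be read back from a search result
    indexed: bool = True   # value can be searched
    analyzed: bool = True  # value is tokenized before being indexed

def index_terms(field):
    """Return the terms that would end up in the index for this field."""
    if not field.indexed:
        return []
    if not field.analyzed:
        return [field.value]           # the whole value is one keyword
    return field.value.lower().split()  # stand-in for a real analyzer

post = [
    Field("id", "42", indexed=False, analyzed=False),
    Field("tags", "lucene.net", stored=False, analyzed=False),
    Field("body", "Full text search with Lucene", stored=False),
]
for field in post:
    print(field.name, index_terms(field))
```

The id produces no terms at all, the tag is kept as one exact keyword, and the body is broken into individual lowercase terms.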

Searcher and IndexReader

The searcher is the component that, with the help of the IndexReader, scans the index files and returns results based on the query supplied.

QueryParser

The query parser is responsible for parsing a string of text to create a query object. It evaluates the Lucene query syntax and uses an analyzer (which should be the same one you used to index the text) to tokenize the individual statements.
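As a rough illustration of what “parsing a string into a query” means, the Python sketch below turns a search string into a list of (field, term) clauses. It only understands a `field:value` prefix and a default field, which is a tiny subset of what the real query syntax supports; `parse_query`, `default_field` and the stand-in analyzer are assumptions made for this example.

```python
def parse_query(text, default_field="body", analyze=lambda s: s.lower().split()):
    """Turn a search string into a list of (field, term) clauses."""
    clauses = []
    for chunk in text.split():
        field, separator, value = chunk.partition(":")
        if not separator:
            # No "field:" prefix, so the chunk targets the default field.
            field, value = default_field, chunk
        # Run the value through the analyzer, as was done at index time.
        clauses.extend((field, term) for term in analyze(value))
    return clauses

print(parse_query("tags:Lucene Full Text"))
# prints [('tags', 'lucene'), ('body', 'full'), ('body', 'text')]
```

The sketch also shows why the analyzer should match the one used for indexing: if the index stores lowercase terms, the query terms must be lowercased the same way or they will never match.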

The main development steps

And now let’s have a brief overview of the logical steps involved in integrating Lucene.net into your applications:

1 – Initialize Directory and IndexWriter

The first step is initializing the Directory and the IndexWriter. In a web application, like Subtext, this is most likely done at application startup, with the instance then stored in a global variable somewhere (or accessed through a Singleton), since only one IndexWriter can write to a Directory at a time.

And when you create the IndexWriter you can supply the analyzer that will be used by default to index all the text.
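The “one writer at a time” rule is the reason for keeping a single shared instance. The Python sketch below models the idea with a simple flag; Lucene itself enforces it with a write lock on the index, and the class names `ToyDirectory` and `ToyIndexWriter` are hypothetical stand-ins, not Lucene.net types.

```python
class ToyDirectory:
    """Hypothetical stand-in for a Directory; it only tracks the write lock."""
    def __init__(self):
        self.write_locked = False

class ToyIndexWriter:
    def __init__(self, directory, analyzer):
        if directory.write_locked:
            raise RuntimeError("index is already locked by another writer")
        directory.write_locked = True
        self.directory = directory
        self.analyzer = analyzer  # default analyzer for all indexed text

    def close(self):
        # Releasing the lock lets a new writer be opened later.
        self.directory.write_locked = False

directory = ToyDirectory()
writer = ToyIndexWriter(directory, analyzer=str.lower)
try:
    ToyIndexWriter(directory, analyzer=str.lower)
except RuntimeError as error:
    print("second writer rejected:", error)
writer.close()
```

Creating the writer once at startup and sharing it avoids ever hitting the rejected case in normal operation.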

2 – Add Documents to the Index

Each document is made up of various Fields. You have to create a Document with all the Fields that must be indexed and also the ones you need in order to link the result to the real document that is being indexed (for example the id of the post).

Once you have created the Document, you have to add it to the index with the IndexWriter.

At this point, you could either add more documents or close the IndexWriter. The index will be saved to the Directory and can be re-opened later to add more Documents or to perform queries on it.
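What “adding a document to the index” does conceptually can be sketched as an inverted index: a map from each term to the documents that contain it. This Python toy is an assumption-laden model, not how Lucene stores data on disk; `ToyIndex`, `add_document` and the whitespace tokenization are invented for the example.

```python
from collections import defaultdict

class ToyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> ids of matching docs
        self.stored = {}                  # doc id -> original field values

    def add_document(self, doc_id, fields):
        """Store the fields and register every term under its field name."""
        self.stored[doc_id] = fields
        for name, value in fields.items():
            for term in value.lower().split():
                self.postings[(name, term)].add(doc_id)

index = ToyIndex()
index.add_document(1, {"title": "Lucene.net main concepts"})
index.add_document(2, {"title": "Main development steps"})
print(sorted(index.postings[("title", "main")]))
# prints [1, 2]
```

Keeping the original values next to the postings is what lets a search result be linked back to the real document, for example through the id of the post.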

3 – Create the Query

Once you have all your documents in the index, it’s time to do some queries.

You can create the query either via the QueryParser or by creating a Query object directly via the API.

4 – Pass the Query to the IndexSearcher

Once you have the Query object, you have to pass it to the Search method of an IndexSearcher.

One caveat is that the IndexSearcher sees the index only as it was at the time the searcher was opened. So in order to search over the most recent set of documents you have to re-open the IndexSearcher. But re-opening takes time and resources, so in a web application you might want to cache it somehow and re-open it periodically.
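One possible shape for such a cache is sketched below in Python: keep the current searcher and replace it only once it is older than a chosen age. `CachedSearcher`, `open_searcher` and `max_age_seconds` are hypothetical names; the factory function stands in for whatever code opens a real IndexSearcher, and the injectable clock exists only to make the behaviour easy to check.

```python
import time

class CachedSearcher:
    def __init__(self, open_searcher, max_age_seconds, clock=time.monotonic):
        self._open = open_searcher       # factory that opens a fresh searcher
        self._max_age = max_age_seconds
        self._clock = clock
        self._searcher = None
        self._opened_at = None

    def get(self):
        """Return the cached searcher, re-opening it when it is too old."""
        now = self._clock()
        if self._searcher is None or now - self._opened_at >= self._max_age:
            self._searcher = self._open()
            self._opened_at = now
        return self._searcher

# Usage with a fake clock and a factory that just counts how often it ran.
fake_now = [0.0]
open_count = [0]

def open_searcher():
    open_count[0] += 1
    return "searcher #%d" % open_count[0]

cache = CachedSearcher(open_searcher, max_age_seconds=60, clock=lambda: fake_now[0])
print(cache.get())   # opens the first searcher
fake_now[0] = 30.0
print(cache.get())   # still the first one, it is only 30 seconds old
fake_now[0] = 61.0
print(cache.get())   # too old, so a second searcher is opened
```

A real version would also need to close the old searcher once no request is using it, and to guard `get` with a lock if several threads call it.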

5 – Iterate over the results

The Search method returns the results inside a Hits object, which contains all the documents that match the query, ordered by Score, a number that should tell you how closely each document found is related to your query. For more information refer to the Lucene website.
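The idea of results ordered by score can be illustrated with a toy ranking in Python. The score used here is just the number of times the query terms occur in the document, which is an assumption chosen for simplicity; Lucene's actual scoring formula is considerably more elaborate. The `search` function and the sample `docs` are invented for the example.

```python
def search(docs, query_terms):
    """Return (doc_id, score) pairs for matching documents, best match first."""
    hits = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        # Toy score: how many times the query terms occur in the document.
        score = sum(tokens.count(term) for term in query_terms)
        if score > 0:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

docs = {
    1: "lucene index search",
    2: "search search engine",
    3: "cooking recipes",
}
for doc_id, score in search(docs, ["search"]):
    print(doc_id, score)
# prints "2 2" then "1 1"; document 3 does not match and is left out
```

Iterating over the returned list is the toy equivalent of walking through the hits and reading back each document's stored fields.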

6 – Close everything

And once you are done with everything, close the IndexWriter, the IndexSearcher and the Directory object. In a web application this is typically performed in the application shutdown event.

Next

You just read about the main concepts behind Lucene.net. In a future post I’ll write an example that puts together all the concepts discussed here.


posted on Monday, August 31, 2009 12:11 PM


Reposted from: https://my.oschina.net/u/138995/blog/178778
