How to Read Lucene Index Data - lewutian@126's blog - NetEase Blog

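This article describes how to use IndexReader to retrieve information from a Lucene index. Why read the index at all? Because I need to implement the following:

(1) count the document frequency (DF) of a term across the whole collection;

(2) count the total number of occurrences of a term across the whole collection (term frequency in the whole collection);

(3) count the frequency of a term within a given document (term frequency, TF);

(4) list the positions at which a term occurs in a given document;

(5) count the number of documents in the whole collection.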

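Why are these numbers needed? They are the essential raw material for text retrieval (TR), and they are already processed. Before indexing there is only raw text (raw data); after the indexer has processed it, the raw text has become a sequence of terms (or tokens), and the indexer has recorded where each one occurs and how many times. With these data and a suitable model, you can implement what a search engine does: text retrieval.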
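To make the five statistics concrete, here is a minimal sketch of a toy in-memory positional index. The class ToyIndex and all of its members are hypothetical names invented for this illustration; it splits text on spaces and is not how Lucene stores or reads its index.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy positional index: term -> (document number -> positions).
// Illustration only; the names are made up and this is not Lucene's format.
public sealed class ToyIndex
{
    private readonly SortedDictionary<string, SortedDictionary<int, List<int>>> postings =
        new SortedDictionary<string, SortedDictionary<int, List<int>>>(StringComparer.Ordinal);
    private int docCount;

    // Adds a document (tokens separated by spaces) and returns its document number.
    public int Add(string text)
    {
        int docId = docCount++;
        string[] tokens = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        for (int pos = 0; pos < tokens.Length; pos++)
        {
            if (!postings.TryGetValue(tokens[pos], out var docs))
                postings[tokens[pos]] = docs = new SortedDictionary<int, List<int>>();
            if (!docs.TryGetValue(docId, out var positions))
                docs[docId] = positions = new List<int>();
            positions.Add(pos); // positions are counted from 0 here
        }
        return docId;
    }

    // (5) number of documents in the collection
    public int NumDocs => docCount;

    // (1) number of documents that contain the term
    public int DocFreq(string term) =>
        postings.TryGetValue(term, out var docs) ? docs.Count : 0;

    // (2) total occurrences of the term in the whole collection
    public int CollectionFreq(string term) =>
        postings.TryGetValue(term, out var docs) ? docs.Values.Sum(p => p.Count) : 0;

    // (3) occurrences of the term in one document
    public int TermFreq(string term, int docId) => Positions(term, docId).Count;

    // (4) positions of the term in one document (empty if it does not occur)
    public IReadOnlyList<int> Positions(string term, int docId) =>
        postings.TryGetValue(term, out var docs) && docs.TryGetValue(docId, out var p)
            ? p : new List<int>();
}

public static class ToyIndexDemo
{
    public static void Main()
    {
        var index = new ToyIndex();
        index.Add("to be or not to be");
        index.Add("to do is to be");
        Console.WriteLine("NumDocs = " + index.NumDocs);                      // 2
        Console.WriteLine("DF(be) = " + index.DocFreq("be"));                 // 2
        Console.WriteLine("CF(be) = " + index.CollectionFreq("be"));          // 3
        Console.WriteLine("TF(to, doc 0) = " + index.TermFreq("to", 0));      // 2
        Console.WriteLine("positions(to, doc 0) = " +
            string.Join(",", index.Positions("to", 0)));                      // 0,4
    }
}
```

Everything here lives in memory, which is exactly what stops scaling for a large collection.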

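The sharp-eyed reader may say that this looks easy: it is just counting. True, it is counting, or statistics. But a seemingly simple process becomes less simple once space (memory capacity) is constrained. Suppose each document has 100 terms and each term needs 10 bytes of information; storing 1,000,000 documents then takes 10 x 100 x 10^6 = 10^9 bytes, which is roughly 2^30 bytes, or about 1 GB. 1 GB of memory is not much these days, but you cannot keep 1 GB of data in memory all the time. So put it on disk, and whenever the data is needed, move the 1 GB from disk into memory. OK, you can go and make a cup of coffee, then come back and carry on. And that is for 1,000,000 documents; with more documents, an approach that has no auxiliary data structure becomes very inefficient.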
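The estimate can be checked with a few lines of code. This is only a back-of-the-envelope sketch; IndexSizeEstimate and RawBytes are hypothetical names, and the inputs are the assumed figures from the paragraph above (100 terms per document, 10 bytes per term).

```csharp
using System;
using System.Globalization;

public static class IndexSizeEstimate
{
    // Raw size in bytes if every term occurrence costs a fixed number of bytes.
    public static long RawBytes(long documents, long termsPerDocument, long bytesPerTerm) =>
        documents * termsPerDocument * bytesPerTerm;

    public static void Main()
    {
        long bytes = RawBytes(1000000, 100, 10);
        double gib = bytes / (double)(1L << 30); // 2^30 bytes per GiB
        Console.WriteLine(bytes + " bytes");     // 1000000000 bytes
        Console.WriteLine(gib.ToString("F2", CultureInfo.InvariantCulture) + " GiB"); // 0.93 GiB
    }
}
```

So 10^9 bytes is slightly less than 2^30 bytes, which is why the figure is best read as "about 1 GB".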

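Lucene's index splits the data into segments and reads them only when needed; the rest of the time the data stays quietly on disk. Lucene itself is an excellent indexing engine that provides efficient indexing and retrieval. The purpose of this article is to show how to use Lucene's API to read the information you need from an index that has already been built. How to use Lucene in general is something I will cover gradually in later articles.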

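Let us go step by step. Assume an index has already been built and is stored in the index directory. To read the index, we first need an index reader (an instance of Lucene's IndexReader). So we write the following (the code is C#; this article uses DotLucene).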

IndexReader reader;

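Here is the first problem: IndexReader is an abstract class and cannot be instantiated. Fine, let us try its derived classes. IndexReader has two children, SegmentReader and MultiReader. Which one should we use? Either of them needs a pile of constructor arguments (it took me a good deal of effort to work out what they are for; I explain below), so using Lucene's index data does not look that easy. After tracing the code and reading the documentation, I finally found the key to using IndexReader: it exposes a static factory method, IndexReader.Open. Its overloads are declared as follows: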

#0001 public static IndexReader Open(System.String path)
#0002 public static IndexReader Open(System.IO.FileInfo path)
#0003 public static IndexReader Open(Directory directory)
#0004 private static IndexReader Open(Directory directory, bool closeDirectory)

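Three of these are public and can be called. Opening an index, where index holds the path of the index directory, is as simple as this: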

#0001 IndexReader reader = IndexReader.Open(index);

Internally, opening the index goes through the following steps:

#0001 SegmentInfos infos = new SegmentInfos();
#0002 Directory directory = FSDirectory.GetDirectory(index, false);
#0003 infos.Read(directory);
#0004 bool closeDirectory = false;
#0005 if (infos.Count == 1)
#0006 {
#0007     // index is optimized
#0008     return new SegmentReader(infos, infos.Info(0), closeDirectory);
#0009 }
#0010 else
#0011 {
#0012     IndexReader[] readers = new IndexReader[infos.Count];
#0013     for (int i = 0; i < infos.Count; i++)
#0014         readers[i] = new SegmentReader(infos.Info(i));
#0015     return new MultiReader(directory, infos, closeDirectory, readers);
#0016 }

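First the segment information is read (#0001~#0003). Then the number of segments is checked: if there is only one, the index has probably been optimized and that single segment is read directly (#0008); otherwise each segment is read in turn (#0013~#0014) and the resulting readers are combined into a MultiReader (#0015). That is the whole process of opening an index.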
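The shape of that factory can be sketched without Lucene at all. The classes below (ToyReader, ToySegmentReader, ToyMultiReader) are hypothetical stand-ins written for this illustration, with each "segment" reduced to an array of strings. The sketch assumes that a composite reader numbers documents consecutively across its segments, so a global document number is mapped to a segment by subtracting that segment's starting offset.

```csharp
using System;
using System.Collections.Generic;

// Toy stand-ins for the reader hierarchy; the behaviour is heavily simplified.
public abstract class ToyReader
{
    public abstract int MaxDoc { get; }
    public abstract string Document(int n);

    // Factory: one segment -> segment reader, otherwise -> composite reader.
    public static ToyReader Open(IList<string[]> segments)
    {
        if (segments.Count == 1)
            return new ToySegmentReader(segments[0]);
        var readers = new ToyReader[segments.Count];
        for (int i = 0; i < segments.Count; i++)
            readers[i] = new ToySegmentReader(segments[i]);
        return new ToyMultiReader(readers);
    }
}

public sealed class ToySegmentReader : ToyReader
{
    private readonly string[] docs;
    public ToySegmentReader(string[] docs) { this.docs = docs; }
    public override int MaxDoc => docs.Length;
    public override string Document(int n) => docs[n];
}

public sealed class ToyMultiReader : ToyReader
{
    private readonly ToyReader[] readers;
    private readonly int[] starts; // starts[i] = first global document number of reader i

    public ToyMultiReader(ToyReader[] readers)
    {
        this.readers = readers;
        starts = new int[readers.Length + 1];
        for (int i = 0; i < readers.Length; i++)
            starts[i + 1] = starts[i] + readers[i].MaxDoc;
    }

    public override int MaxDoc => starts[readers.Length];

    public override string Document(int n)
    {
        if (n < 0 || n >= MaxDoc)
            throw new ArgumentOutOfRangeException(nameof(n));
        int i = 0;
        while (n >= starts[i + 1]) i++;            // find the segment that owns document n
        return readers[i].Document(n - starts[i]); // translate to a segment-local number
    }
}

public static class ToyReaderDemo
{
    public static void Main()
    {
        var segments = new List<string[]> { new[] { "a", "b" }, new[] { "c" } };
        ToyReader reader = ToyReader.Open(segments);
        Console.WriteLine(reader.GetType().Name + ", MaxDoc = " + reader.MaxDoc); // ToyMultiReader, MaxDoc = 3
        Console.WriteLine(reader.Document(2));                                    // c
    }
}
```

The caller only ever sees the abstract type, which is the point of going through a static Open method instead of the constructors.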

Next, let us look at how to read information from it. The following code illustrates this.

#0001 public static void PrintIndex(IndexReader reader)
#0002 {
#0003     // Show how many documents there are
#0004     System.Console.WriteLine(reader + "\tNumDocs = " + reader.NumDocs());
#0005     for (int i = 0; i < reader.MaxDoc(); i++) // document numbers run up to MaxDoc()
#0006     {
#0007         if (!reader.IsDeleted(i)) System.Console.WriteLine(reader.Document(i)); // skip deleted documents
#0008     }
#0009
#0010     // Enumerate the terms to get <document, term freq, position*> information
#0011     TermEnum termEnum = reader.Terms();
#0012     while (termEnum.Next())
#0013     {
#0014         System.Console.Write(termEnum.Term());
#0015         System.Console.WriteLine("\tDocFreq=" + termEnum.DocFreq());
#0016
#0017         TermPositions termPositions = reader.TermPositions(termEnum.Term());
#0018         int i = 0;
#0019         int j = 0;
#0020         while (termPositions.Next())
#0021         {
#0022             System.Console.WriteLine((i++) + "->" + " DocNo:" + termPositions.Doc() + ", Freq:" + termPositions.Freq());
#0023             for (j = 0; j < termPositions.Freq(); j++)
#0024                 System.Console.Write("[" + termPositions.NextPosition() + "]");
#0025             System.Console.WriteLine();
#0026         }
#0027
#0028         // Get the <term freq, document> information directly
#0029         TermDocs termDocs = reader.TermDocs(termEnum.Term());
#0030         while (termDocs.Next())
#0031         {
#0032             System.Console.WriteLine((i++) + "->" + " DocNo:" + termDocs.Doc() + ", Freq:" + termDocs.Freq());
#0033         }
#0034     }
#0035
#0036     // FieldInfos fieldInfos = reader.fieldInfos;
#0037     // FieldInfo pathFieldInfo = fieldInfos.FieldInfo("path");
#0038
#0039     // Show the term frequency vector
#0040     for (int i = 0; i < reader.MaxDoc(); i++)
#0041     {
#0042         // The terms produced by tokenizing the "contents" field are stored in a TermFreqVector
#0043         TermFreqVector termFreqVector = reader.GetTermFreqVector(i, "contents");
#0044
#0045         if (termFreqVector == null)
#0046         {
#0047             System.Console.WriteLine("termFreqVector is null.");
#0048             continue;
#0049         }
#0050
#0051         String fieldName = termFreqVector.GetField();
#0052         String[] terms = termFreqVector.GetTerms();
#0053         int[] frequencies = termFreqVector.GetTermFrequencies();
#0054
#0055         System.Console.Write("FieldName:" + fieldName);
#0056         for (int j = 0; j < terms.Length; j++)
#0057         {
#0058             System.Console.Write("[" + terms[j] + ":" + frequencies[j] + "]");
#0059         }
#0060         System.Console.WriteLine();
#0061     }
#0062     System.Console.WriteLine();
#0063 }

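#0004 prints the number of documents.

#0012~#0034 enumerate every term in the collection.

Within that loop, #0017~#0026 enumerate every position of each term in each document where it occurs (the index of the token within the field; Lucene counts positions from 0); #0029~#0033 list the documents each term occurs in together with its frequency in each (that is, the data behind DF and TF).

#0036~#0037 only work when reader is a SegmentReader.

#0040~#0061 quickly read the terms that occur in a given document and their frequencies. This part only works if storeTermVector was set to true when the index was built. For example: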

doc.Add(Field.Text("contents", reader, true));

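The third argument is that flag; it defaults to false. (In this call, reader is the text reader that supplies the field's content, not the IndexReader used above.)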
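To show what a stored term vector buys you, here is a minimal sketch of the same idea for a single document: two parallel arrays, one of terms and one of frequencies, so the per-document counts can be read without walking the whole term dictionary. ToyTermVector and Build are hypothetical names; the sketch splits on spaces and sorts the terms for readable output, and it is not Lucene's TermFreqVector.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ToyTermVector
{
    // Builds parallel arrays: terms[j] occurs frequencies[j] times in the text.
    public static void Build(string text, out string[] terms, out int[] frequencies)
    {
        var counts = new SortedDictionary<string, int>(StringComparer.Ordinal);
        foreach (string token in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            counts.TryGetValue(token, out int seen); // seen stays 0 for a new term
            counts[token] = seen + 1;
        }
        terms = counts.Keys.ToArray();
        frequencies = counts.Values.ToArray();
    }

    public static void Main()
    {
        Build("to be or not to be", out string[] terms, out int[] frequencies);
        for (int j = 0; j < terms.Length; j++)
            Console.Write("[" + terms[j] + ":" + frequencies[j] + "]");
        Console.WriteLine(); // prints [be:2][not:1][or:1][to:2]
    }
}
```

The output format deliberately matches the one printed by #0058 in the listing.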

With these data I can compute the statistics I need. In later articles I will cover how to build an index and how to apply Lucene.

from:http://lqgao.spaces.live.com/?_c11_BlogPart_BlogPart=blogview&_c=BlogPart&partqs=cat%3dInside%2520Lucene

http://hi.baidu.com/lewutian
