Learning the Xapian full-text search engine (Part 1): Indexing

2022-04-27 00:00:00

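Xapian's documentation is not extensive, but it is sufficient. In particular Omega, Xapian's companion project, is a treasure trove for anyone using and learning Xapian.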

First, two important concepts: the term list and the posting list.

A term list holds the terms that index one document; every document has its own term list.

A posting list holds the ids of the documents that one term indexes; every term has its own posting list.
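To make the two structures concrete, here is a minimal sketch in plain C++. It is only a toy model built on std::map and splits text on whitespace; ToyIndex and its members are names made up for illustration and say nothing about how Xapian stores its data on disk.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Toy model only: maps are used for clarity, not Xapian's on-disk format.
struct ToyIndex {
    // docid -> terms that index that document (the document's term list)
    std::map<unsigned, std::set<std::string>> term_lists;
    // term -> ids of the documents it indexes (the term's posting list)
    std::map<std::string, std::set<unsigned>> posting_lists;
    unsigned last_docid = 0;

    // Split text on whitespace and record every word in both structures.
    unsigned add_document(const std::string& text) {
        unsigned did = ++last_docid;
        std::istringstream in(text);
        std::string word;
        while (in >> word) {
            term_lists[did].insert(word);
            posting_lists[word].insert(did);
        }
        return did;
    }
};
```

The term list answers "which terms does this document contain", while the posting list answers "which documents contain this term", which is the lookup a search needs.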
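To use Xapian on Windows, I suggest downloading the MSVC makefiles from the official site and placing them in the VC build directory; after fixing a few errors, the build goes through.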

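However, the official site does not provide a Windows makefile for Omega. I tried to compile it under VC without success. It did build on Ubuntu, provided xapian-core and the libraries it depends on are installed first.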

Porting Omega would not be worth much, so I decided to study Omega's code instead and see how Xapian is really meant to be used.

Omega provides two main tools, omindex and query. The corresponding source files are omindex.cc and query.cc.

omindex supports a wide range of formats, including HTML, PDF, XML, Excel, and CSV.

The core indexing work in omindex breaks down roughly into the following steps:

1. Store the document data:

// Put the data in the document
Xapian::Document newdocument;
string record = "url=";
record += url;
record += "\nsample=";
record += sample;
if (!title.empty()) {
    record += "\ncaption=";
    record += generate_sample(title, TITLE_SIZE);
}
if (!author.empty()) {
    record += "\nauthor=";
    record += author;
}
record += "\ntype=";
record += mimetype;
if (last_mod != (time_t)-1) {
    record += "\nmodtime=";
    record += str(last_mod);
}
record += "\nsize=";
record += str(d.get_size());
newdocument.set_data(record);
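The data field holds a lot of information: the type, size, URL, and so on are all packed into a single string.
Note that data is not suited to frequent access, because each read or write is relatively expensive. For data that has to be accessed frequently, Xapian recommends using values instead.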
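Since the record is just one "key=value" pair per line, reading a field back means parsing the string again. The sketch below shows one way to do that; parse_record is a hypothetical helper written for this post, not a function from Xapian or Omega, and it assumes values never contain a newline.

```cpp
#include <map>
#include <sstream>
#include <string>

// Parse a record with one "key=value" pair per line, like the string
// passed to set_data(). Hypothetical helper, not part of Xapian or Omega.
std::map<std::string, std::string> parse_record(const std::string& record) {
    std::map<std::string, std::string> fields;
    std::istringstream in(record);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type eq = line.find('=');
        if (eq == std::string::npos) continue;  // skip malformed lines
        // Split at the first '=' only, so values such as URLs may contain '='.
        fields[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return fields;
}
```

Having to parse the whole string to get at one field is part of why data is a poor fit for anything that is needed on every match, such as a sort key.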
2. Next, index the title and the body text:

// Index the title, document text, and keywords.
indexer.set_document(newdocument);
if (!title.empty()) {
    indexer.index_text(title, 5);
    indexer.increase_termpos(100);
}
if (!dump.empty()) {
    indexer.index_text(dump);
}
if (!keywords.empty()) {
    indexer.increase_termpos(100);
    indexer.index_text(keywords);
}
// Index the leafname of the file.
{
    indexer.increase_termpos(100);
    string leaf = d.leafname();
    string::size_type dot = leaf.find_last_of('.');
    if (dot != string::npos)
        leaf.resize(dot);
    indexer.index_text(leaf);
}
if (!author.empty()) {
    indexer.increase_termpos(100);
    indexer.index_text(author, 1, "A");
}
// mimeType:
newdocument.add_boolean_term("T" + mimetype);
indexer is a Xapian::TermGenerator. You can add terms to a document without a TermGenerator, but using one is clearly more convenient, so I recommend it.

TermGenerator only adds probabilistic terms. If you need boolean terms, you have to add them on the document itself.
indexer.index_text(title, 5);
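In the statement above, title is the text to index and the trailing 5 is the wdf increment. The wdf (within-document frequency) is normally the number of times a term occurs in the document; here every occurrence of a term in the title adds 5 to that term's wdf instead of the default 1.
Giving terms a larger weight like this is useful, because it affects how search results are ranked.
Note that title must be UTF-8 encoded, otherwise it will not be recognised correctly.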

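title may contain several terms, but they have to be separated by spaces (or other non-word characters); otherwise the whole title is stored in the document as a single term.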
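As a rough picture of what index_text(title, 5) does to the wdf values, here is a toy stand-in in plain C++. toy_index_text is a made-up name; it only understands ASCII, splits on anything that is not a letter or digit, and lowercases each word. The real TermGenerator handles Unicode and optional stemming, so treat this as a sketch of the bookkeeping only.

```cpp
#include <cctype>
#include <map>
#include <string>

// Toy stand-in for index_text(text, wdf_inc): split on non-alphanumeric
// ASCII characters, lowercase each word, and add wdf_inc to its wdf.
void toy_index_text(std::map<std::string, unsigned>& wdf,
                    const std::string& text, unsigned wdf_inc = 1) {
    std::string word;
    // Run one step past the end so the final word is flushed too.
    for (std::string::size_type i = 0; i <= text.size(); ++i) {
        unsigned char ch = i < text.size() ? text[i] : ' ';
        if (std::isalnum(ch)) {
            word += static_cast<char>(std::tolower(ch));
        } else if (!word.empty()) {
            wdf[word] += wdf_inc;
            word.clear();
        }
    }
}
```

Indexing a title "Xapian Indexing" with an increment of 5 and then a body that mentions each word once leaves both terms with a wdf of 6, which is why a title hit counts for more than a body hit.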

One more thing to note: index_text records the position of every term it adds. If you do not want positions recorded, use index_text_without_positions instead, which makes the index files smaller.

indexer.increase_termpos(100);
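This call advances the term position by 100. If the title contains two terms, at positions 1 and 2, then the terms of the body text indexed next start at position 103,
which keeps phrase and NEAR searches from wrongly combining a word from the title with a word from the body.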
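The effect of the gap is easy to check with a small model. The sketch below is plain C++ with made-up names (ToyPositions, is_phrase); it numbers whitespace-separated words the same way as described above and treats a two-word phrase as "second word exactly one position after the first".

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Toy model of term positions; names are illustrative, not Xapian's API.
struct ToyPositions {
    unsigned termpos = 0;  // position of the most recently indexed term
    std::map<std::string, std::vector<unsigned>> positions;

    void index_text(const std::string& text) {
        std::istringstream in(text);
        std::string word;
        while (in >> word) positions[word].push_back(++termpos);
    }

    void increase_termpos(unsigned delta = 100) { termpos += delta; }

    // True if `second` occurs exactly one position after `first`,
    // which is what a two-word phrase match requires.
    bool is_phrase(const std::string& first, const std::string& second) const {
        auto a = positions.find(first), b = positions.find(second);
        if (a == positions.end() || b == positions.end()) return false;
        for (unsigned p : a->second)
            for (unsigned q : b->second)
                if (q == p + 1) return true;
        return false;
    }
};
```

With a title "search engine" and a body "room for rent", the phrase "engine room" matches when no gap is inserted and stops matching once increase_termpos(100) is called between the two.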

indexer.index_text(keywords);
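This indexes the keywords. Many word-segmentation algorithms can extract keywords, and keywords are very useful for clustering articles and finding similar content.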
indexer.index_text(leaf);
This indexes the file name (with the directory path and the extension removed).
indexer.index_text(author, 1, "A");
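This indexes the author. The extra argument "A" is a prefix: it is prepended to each term so that author terms are kept apart from the terms of the body text. Prefixes come up often in Xapian and play an important role.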
newdocument.add_boolean_term("T" + mimetype);
This adds a boolean term, which is equivalent to adding a term with a wdf of 0.
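What "a wdf of 0" means in practice can be shown with a tiny model. ToyDocument below is a made-up struct, not Xapian::Document; it assumes the document length is the sum of the wdf values of all terms.

```cpp
#include <map>
#include <string>

// Toy document: term -> wdf. Illustrative only, not Xapian::Document.
struct ToyDocument {
    std::map<std::string, unsigned> terms;

    void add_term(const std::string& term, unsigned wdf_inc = 1) {
        terms[term] += wdf_inc;
    }

    // A boolean term is present in the document but contributes no wdf.
    void add_boolean_term(const std::string& term) { terms[term] += 0; }

    // Document length taken as the sum of wdf over all terms.
    unsigned length() const {
        unsigned total = 0;
        for (const auto& t : terms) total += t.second;
        return total;
    }
};
```

The boolean term can still be matched as a filter (for example "Ttext/html"), but because it adds nothing to the wdf total it leaves the document length, and so the weighting of the other terms, unchanged.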

// Add last_mod as a value to allow "sort by date".
newdocument.add_value(VALUE_LASTMOD, int_to_binary_string((uint32_t)last_mod));
This adds a value that stores the document's last-modified time. The value can be used to sort search results by date.
// Add MD5 as a value to allow duplicate documents to be collapsed together.
newdocument.add_value(VALUE_MD5, md5);
This adds another value that stores the document's MD5 checksum, which can be used to remove duplicates.
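Collapsing on a value boils down to keeping one hit per distinct key. The sketch below shows that idea in plain C++; Hit and collapse_by_md5 are names invented for this example, and the input is assumed to be ranked best-first already. In Xapian itself this is requested on the Enquire object rather than done by hand.

```cpp
#include <set>
#include <string>
#include <vector>

struct Hit {
    unsigned docid;
    std::string md5;  // value stored in the collapse slot
};

// Keep only the first (best-ranked) hit for each distinct collapse key.
// `ranked` is assumed to be sorted best-first already.
std::vector<Hit> collapse_by_md5(const std::vector<Hit>& ranked) {
    std::set<std::string> seen;
    std::vector<Hit> out;
    for (const Hit& h : ranked)
        if (seen.insert(h.md5).second) out.push_back(h);
    return out;
}
```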
// Add the file size as a value to allow "sort by size" and size ranges.
newdocument.add_value(VALUE_SIZE, Xapian::sortable_serialise(d.get_size()));
This adds another value that stores the document's size, which can be used to sort by size or to restrict results to a size range.
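Values are stored as strings and compared as strings, so a number has to be encoded in a way whose byte order agrees with its numeric order. The sketch below illustrates the idea for a 32-bit integer with a fixed-width, most-significant-byte-first encoding. to_big_endian is a made-up helper; I am assuming Omega's int_to_binary_string does something equivalent, and Xapian::sortable_serialise serves the same purpose for floating-point numbers.

```cpp
#include <cstdint>
#include <string>

// Encode n as 4 bytes, most significant byte first. A sketch of the idea;
// Omega's int_to_binary_string() is assumed to do something equivalent.
std::string to_big_endian(std::uint32_t n) {
    std::string s(4, '\0');
    for (int i = 3; i >= 0; --i) {
        s[i] = static_cast<char>(n & 0xff);
        n >>= 8;
    }
    return s;
}

// std::string compares bytes as unsigned values, so this is a plain
// lexicographic byte comparison.
bool bytes_less(const std::string& a, const std::string& b) {
    return a < b;
}
```

With this encoding 9 sorts before 10, whereas the decimal strings "9" and "10" sort the other way round, which is why the raw number is not simply stored as text.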
bool inc_tag_added = false;
if (d.is_other_readable()) {
    inc_tag_added = true;
    newdocument.add_boolean_term("I*");
} else if (d.is_group_readable()) {
    const char * group = d.get_group();
    if (group) {
        newdocument.add_boolean_term(string("I#") + group);
    }
}
const char * owner = d.get_owner();
if (owner) {
    newdocument.add_boolean_term(string("O") + owner);
    if (!inc_tag_added && d.is_owner_readable())
        newdocument.add_boolean_term(string("I@") + owner);
}
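This adds access control. If the file is readable by everyone, the term "I*" is added; otherwise, if it is readable by its group, "I#" plus the group name is added. The owner is always recorded as "O" plus the owner name, and if the file is readable by its owner but not by everyone, "I@" plus the owner name is added as well.
At search time, these terms determine which documents the current user is allowed to retrieve.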
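The sketch below restates that branching as a standalone function and adds a possible search-side check. access_terms mirrors the excerpt; may_see is purely hypothetical and only shows how the three kinds of "I" terms could be used to decide visibility, not how Omega actually builds its filter.

```cpp
#include <set>
#include <string>
#include <vector>

// Terms a document receives, following the branching in the excerpt above.
std::vector<std::string> access_terms(bool other_r, bool group_r, bool owner_r,
                                      const std::string& owner,
                                      const std::string& group) {
    std::vector<std::string> terms;
    bool inc_tag_added = false;
    if (other_r) {
        inc_tag_added = true;
        terms.push_back("I*");
    } else if (group_r && !group.empty()) {
        terms.push_back("I#" + group);
    }
    if (!owner.empty()) {
        terms.push_back("O" + owner);
        if (!inc_tag_added && owner_r) terms.push_back("I@" + owner);
    }
    return terms;
}

// Hypothetical search-side check: a user may see a document if it carries
// "I*", "I@<user>", or "I#<g>" for one of the user's groups.
bool may_see(const std::vector<std::string>& doc_terms, const std::string& user,
             const std::set<std::string>& groups) {
    for (const std::string& t : doc_terms) {
        if (t == "I*" || t == "I@" + user) return true;
        if (t.compare(0, 2, "I#") == 0 && groups.count(t.substr(2))) return true;
    }
    return false;
}
```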

string ext_term("E");
for (string::iterator i = ext.begin(); i != ext.end(); ++i) {
    char ch = *i;
    if (ch >= 'A' && ch <= 'Z')
        ch |= 32;
    ext_term += ch;
}
newdocument.add_boolean_term(ext_term);
This adds a term for the file extension: the prefix "E" followed by the extension converted to lowercase.
if (!skip_duplicates) {
    // If this document has already been indexed, update the existing
    // entry.
    if (did) {
        // We already found out the document id above.
        db.replace_document(did, newdocument);
    } else if (last_mod <= last_mod_max) {
        // We checked for the UID term and didn't find it.
        did = db.add_document(newdocument);
    } else {
        did = db.replace_document(urlterm, newdocument);
    }
    if (did < updated.size()) {
        if (usual(!updated[did])) {
            updated[did] = true;
            --old_docs_not_seen;
        }
    }
    if (verbose) {
        if (did <= old_lastdocid) {
            cout << "updated" << endl;
        } else {
            cout << "added" << endl;
        }
    }
} else {
    // If this were a duplicate, we'd have skipped it above.
    db.add_document(newdocument);
    if (verbose)
        cout << "added" << endl;
}
This stores the document in the database. A duplicate document can either be skipped or used to replace and update the existing one.
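The interesting case is replace_document(urlterm, newdocument), which behaves like an "update or insert" keyed on a unique term. Here is a minimal model of that behaviour in plain C++. ToyDb is a made-up struct, the "U" plus URL form of the unique term is an assumption on my part, and the model ignores what happens when several documents carry the same term.

```cpp
#include <map>
#include <set>
#include <string>

// Toy database keyed by docid; each document is just its set of terms.
struct ToyDb {
    std::map<unsigned, std::set<std::string>> docs;
    unsigned last_docid = 0;

    unsigned add_document(const std::set<std::string>& terms) {
        docs[++last_docid] = terms;
        return last_docid;
    }

    // Replace the document carrying unique_term, or add a new one if no
    // document has it. Returns the docid used either way.
    unsigned replace_document(const std::string& unique_term,
                              const std::set<std::string>& terms) {
        for (auto& entry : docs) {
            if (entry.second.count(unique_term)) {
                entry.second = terms;
                return entry.first;
            }
        }
        return add_document(terms);
    }
};
```

Re-indexing the same URL therefore reuses the existing document id instead of creating a second copy, which matches the "updated" versus "added" distinction in the excerpt.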

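That covers the main part of the index_file function. Documents in other formats first go through a dump step that extracts their text content, and that text is then indexed.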
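----------------
Copyright notice: this is an original article by CSDN blogger "sirdan", released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/sirdan/article/details/23679685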
