Lucene crawler (needs to build a Lucene index)
I am looking for an Apache Lucene web crawler, written in Java if possible, or in any other language. The crawler must use Lucene and create a valid Lucene index and document files, which is why Nutch is ruled out, for example...
Does anybody know whether such a web crawler exists, and if so, where I can find it? Thanks...
Recommended answer
What you're asking for is two components:
- A web crawler
- A Lucene-based automated indexer
First, a word of encouragement: been there, done that. I'll tackle both components individually from the point of view of making your own, since I don't believe you could use Lucene to do what you're asking without really understanding what's going on underneath.
So you have a web site/directory you want to "crawl" through to collect specific resources. Assuming it's a common web server that lists directory contents, making a web crawler is easy: just point it at the root of the directory and define rules for collecting the actual files, such as "ends with .txt". Very simple stuff, really.
The actual implementation could be something like this: use HttpClient to fetch the actual web pages/directory listings, then parse them in whatever way you find most efficient, such as using XPath to select all the links from the fetched document, or just parsing it with regular expressions using Java's readily available Pattern and Matcher classes. If you decide to go the XPath route, consider using JDOM for DOM handling and Jaxen for the actual XPath work.
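As a minimal sketch of that fetch-and-parse step, assuming Apache HttpClient 4.x and the regex route (the class name DirectoryListingCrawler and the ".txt" rule are just placeholders for your own collection rules), something like this would do:

```java
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DirectoryListingCrawler {

    // Matches href="something.txt" in a plain directory listing page.
    private static final Pattern TXT_LINK =
            Pattern.compile("href=\"([^\"]+\\.txt)\"", Pattern.CASE_INSENSITIVE);

    /** Fetches one directory listing and returns the .txt links found in it. */
    public static List<String> fetchTxtLinks(String url) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            String html = EntityUtils.toString(
                    client.execute(new HttpGet(url)).getEntity());
            List<String> links = new ArrayList<>();
            Matcher m = TXT_LINK.matcher(html);
            while (m.find()) {
                links.add(m.group(1));
            }
            return links;
        }
    }
}
```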
Once you get the actual resources you want, such as a bunch of text files, you need to identify the type of data so you know what to index and what you can safely ignore. For simplicity's sake I'm assuming these are plaintext files with no fields or anything, and I won't go deeper into that, but if you have multiple fields to store, I suggest you make your crawler produce 1..n specialized beans with accessors and mutators (bonus points: make the bean immutable, don't allow accessors to mutate the internal state of the bean, and create a copy constructor for the bean) to be used in the other component.
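For illustration, such a bean might look like the sketch below (YourBean and its fields are placeholder names, not anything Lucene requires):

```java
/** Immutable value object produced by the crawler and consumed by the indexer. */
public final class YourBean {

    private final String title;
    private final String content;

    public YourBean(String title, String content) {
        this.title = title;
        this.content = content;
    }

    /** Copy constructor: creates an independent copy of another bean. */
    public YourBean(YourBean other) {
        this(other.title, other.content);
    }

    // Accessors only return state; since the fields are final Strings,
    // nothing handed out can mutate the bean's internal state.
    public String getTitle() {
        return title;
    }

    public String getContent() {
        return content;
    }
}
```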
In terms of API calls, you should have something like HttpCrawler#getDocuments(String url), which returns a List<YourBean> to use in conjunction with the actual indexer.
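Expressed as an interface, that contract could be as small as this (again a sketch with placeholder names):

```java
import java.util.List;

/** Contract between the crawling component and the indexing component. */
public interface HttpCrawler {

    /** Crawls the given root URL and returns one bean per collected resource. */
    List<YourBean> getDocuments(String url) throws Exception;
}
```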
Beyond the obvious stuff with Lucene, such as setting up a directory and understanding its threading model (only one write operation is allowed at any time; multiple reads can exist even while the index is being updated), you of course want to feed your beans to the index. The five minute tutorial I already linked to basically does exactly that: look into the example addDoc(..) method and just replace the String with YourBean.
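A sketch of that addDoc(..) replacement, assuming the older Lucene 3.x API (the later mention of IndexWriter#optimize() implies a pre-4.0 version; the field names are placeholders):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BeanIndexer {

    /** Turns one crawled bean into a Lucene document and adds it to the index. */
    public static void addDoc(IndexWriter writer, YourBean bean) throws Exception {
        Document doc = new Document();
        // Store and analyze both fields so they can be searched and displayed.
        doc.add(new Field("title", bean.getTitle(), Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("content", bean.getContent(), Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }
}
```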
Note that Lucene's IndexWriter does have some cleanup methods which are handy to execute in a controlled manner. For example, calling IndexWriter#commit() only after a bunch of documents have been added to the index is good for performance, and then calling IndexWriter#optimize() to make sure the index isn't getting hugely bloated over time is a good idea too. Always remember to close the index to avoid unnecessary LockObtainFailedExceptions being thrown; as with all IO in Java, such an operation should of course be done in the finally block.
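Putting those lifecycle calls together, a sketch under the same Lucene 3.x assumption might look like this (on Lucene 4+ optimize() is gone and forceMerge(int) plays a similar role):

```java
import java.io.File;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexLifecycle {

    /** Indexes a whole batch of beans, committing once and always closing the writer. */
    public static void indexAll(List<YourBean> beans, File indexDir) throws Exception {
        Directory dir = FSDirectory.open(indexDir);
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_36),
                IndexWriter.MaxFieldLength.UNLIMITED);
        try {
            for (YourBean bean : beans) {
                BeanIndexer.addDoc(writer, bean);
            }
            writer.commit();    // one commit for the whole batch, not per document
            writer.optimize();  // occasional merge to keep the index from bloating
        } finally {
            writer.close();     // releases the write lock; avoids LockObtainFailedException
        }
    }
}
```

A few more things to keep in mind: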
- You need to remember to expire your Lucene index's contents every now and then too, otherwise you'll never remove anything; it will get bloated and eventually just die because of its own internal complexity.
- Because of the threading model, you most likely need to create a separate read/write abstraction layer for the index itself, to ensure that only one instance can write to the index at any given time.
- Since the source data acquisition is done over HTTP, you need to consider validation of the data and possible error situations, such as the server not being available, to avoid any kind of malformed indexing and client hangups.
- You need to know what you want to search from the index to be able to decide what you are going to put into it. Note that indexing by date must be done so that you split the date into, say, year, month, day, hour, minute, and second instead of a millisecond value, because when doing range queries against a Lucene index, [0 to 5] actually gets transformed into +0 +1 +2 +3 +4 +5, which means the range query dies out very quickly because there's a maximum number of query sub-parts (see the sketch after this list).
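One common way to get that coarser date granularity in Lucene is DateTools, which truncates the value to a chosen resolution instead of storing raw milliseconds; it is a stand-in for the manual year/month/day split described above (a sketch under the same Lucene 3.x assumption, with "created" as an example field name):

```java
import java.util.Date;

import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DateFieldExample {

    /** Adds a date to the document at second granularity rather than as raw milliseconds. */
    public static void addCreatedDate(Document doc, Date created) {
        // Produces "yyyyMMddHHmmss" - coarse enough that range queries stay small.
        String value = DateTools.dateToString(created, DateTools.Resolution.SECOND);
        doc.add(new Field("created", value, Field.Store.YES, Field.Index.NOT_ANALYZED));
    }
}
```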
With this information, I do believe you could make your own special Lucene indexer in less than a day, or three if you want to test it rigorously.