Indexing a MySQL database with Apache Lucene, and keeping them synchronized

2022-01-15 00:00:00 indexing synchronization lucene mysql java
  1. When a new item is added in MySQL, it must also be indexed by Lucene.
  2. When an existing item is removed from MySQL, it must also be removed from Lucene's index.

The idea is to write a script that will be called every x minutes via a scheduler (e.g. a cron job). This is one way to keep MySQL and Lucene synchronized. What I have managed so far:

  1. For each newly added item in MySQL, Lucene indexes it too.
  2. For each item that was already added in MySQL, Lucene does not reindex it (no duplicate items).

This is the point I'm asking for help with:

  1. For each item that was previously added and then removed from MySQL, Lucene should remove it from the index too.

Here is the code I used, which tries to index a MySQL table tag (id [PK] | name):

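public static void main(String[] args) throws Exception {

    // Load the MySQL JDBC driver and open a connection.
    Class.forName("com.mysql.jdbc.Driver").newInstance();
    Connection connection = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "root", "");

    // Open (or create) the index; INDEX_DIR is a java.io.File defined elsewhere.
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), config);

    String query = "SELECT id, name FROM tag";
    Statement statement = connection.createStatement();
    ResultSet result = statement.executeQuery(query);

    while (result.next()) {
        Document document = new Document();
        // The id is stored and indexed as a single token so it can be used as a key.
        document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.add(new Field("name", result.getString("name"), Field.Store.NO, Field.Index.ANALYZED));
        // updateDocument() first deletes any document with the same id, so re-runs add no duplicates.
        // Rows that were deleted from MySQL never show up here, so their documents stay in the index.
        writer.updateDocument(new Term("id", result.getString("id")), document);
    }

    // Commit and release the index, then release the JDBC resources.
    writer.close();
    result.close();
    statement.close();
    connection.close();

}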

PS: this code is for testing purposes only, no need to tell me how awful it is :)

One solution could be to delete every previously added document and reindex the whole database:

writer.deleteAll();
while (result.next()) {
    Document document = new Document();
    document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
    document.add(new Field("name", result.getString("name"), Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(document);
}

I'm not sure this is the most efficient solution, is it?
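A possible alternative to rebuilding everything, shown only as a sketch under assumptions: collect the ids currently in the index (for example from the stored id field) and the ids currently in the tag table, then delete just the ones that exist in the index but not in the database. The class `StaleIdFinder` and its method are hypothetical names introduced for illustration; they are not part of Lucene or JDBC.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene or JDBC.
final class StaleIdFinder {

    private StaleIdFinder() {
    }

    /**
     * Returns the ids that are present in the index but no longer in the database.
     * Neither input collection is modified.
     */
    static Set<String> findStaleIds(Collection<String> indexedIds, Collection<String> databaseIds) {
        Set<String> stale = new HashSet<>(indexedIds);
        // Set difference: indexed ids minus database ids.
        stale.removeAll(new HashSet<>(databaseIds));
        return stale;
    }
}
```

Each returned id could then be removed with `writer.deleteDocuments(new Term("id", id))`, leaving the rest of the index untouched. Whether this beats a full rebuild depends on the table size and on how many rows change between runs.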

Recommended answer

As long as you let the indexing/reindexing run separately from your application, you will have synchronization problems. Depending on your field of work, this might not be a problem, but for many applications with concurrent users it is.

We had the same problems when we had a job system running asynchronous indexing every few minutes. Users would find a product using the search engine, and even after an administrator had removed the product from the valid product stack, they still found it in the frontend until the next reindexing job ran. This led to very confusing and rarely reproducible errors being reported to first-level support.

We saw two possibilities: either connect the business logic tightly to updates of the search index, or implement a tighter asynchronous update task. We did the latter.

In the background, there's a class running in a dedicated thread inside the Tomcat application that takes updates and runs them in parallel. The waiting time for backoffice updates to reach the frontend is down to 0.5-2 seconds, which greatly reduces the problems for first-level support. It is also as loosely coupled as can be: we could even plug in a different indexing engine.
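The setup described above could be sketched roughly as follows. This is not the original implementation: `IndexBackend`, `BackgroundIndexer` and all method names are hypothetical, and the sketch uses a single worker thread that applies updates in submission order, which is simpler than whatever parallelism the real system used. The application submits an update whenever it changes a row, and the worker applies it to the index shortly afterwards.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical abstraction over the search engine; a Lucene-backed
// implementation would wrap an IndexWriter.
interface IndexBackend {
    void upsert(String id, String name);
    void delete(String id);
}

// Hypothetical background updater: one dedicated thread drains a queue of updates.
final class BackgroundIndexer implements Runnable {

    // An update is a deferred action against the backend.
    interface Update {
        void applyTo(IndexBackend backend);
    }

    // Sentinel that tells the worker to stop once earlier updates are done.
    private static final Update STOP = new Update() {
        @Override
        public void applyTo(IndexBackend backend) {
        }
    };

    private final BlockingQueue<Update> queue = new LinkedBlockingQueue<>();
    private final IndexBackend backend;
    private final Thread worker;

    BackgroundIndexer(IndexBackend backend) {
        this.backend = backend;
        this.worker = new Thread(this, "index-updater");
        this.worker.setDaemon(true);
    }

    void start() {
        worker.start();
    }

    // Called by the business logic right after the database change.
    void submitUpsert(String id, String name) {
        queue.add(b -> b.upsert(id, name));
    }

    void submitDelete(String id) {
        queue.add(b -> b.delete(id));
    }

    // Waits until every update submitted so far has been applied.
    void shutdown() {
        queue.add(STOP);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    @Override
    public void run() {
        try {
            while (true) {
                Update update = queue.take(); // blocks until work arrives
                if (update == STOP) {
                    return;
                }
                try {
                    update.applyTo(backend);
                } catch (RuntimeException e) {
                    // Keep the worker alive; a real system would log and maybe retry.
                    System.err.println("index update failed: " + e);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With Lucene 3.6, an `IndexBackend` implementation could map `upsert` to `writer.updateDocument(new Term("id", id), document)` and `delete` to `writer.deleteDocuments(new Term("id", id))`. Because callers only see the interface, swapping in another indexing engine would not touch the business logic.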
