ElasticSearch River JDBC MySQL not deleting records

I'm using the JDBC river plugin for ElasticSearch to index data from my MySQL database. It picks up new and changed records, but does not remove records that have been deleted from MySQL; they remain in the index.

This is the code I use to create the river:

curl -XPUT 'localhost:9200/_river/account_river/_meta' -d '{
    "type" : "jdbc",
    "jdbc" : {
        "driver" : "com.mysql.jdbc.Driver",
        "url" : "jdbc:mysql://localhost:3306/test",
        "user" : "test_user",
        "password" : "test_pass",
        "sql" : "SELECT `account`.`id` as `_id`, `account`.`id`, `account`.`reference`, `account`.`company_name`, `account`.`also_known_as` from `account` WHERE NOT `account`.`deleted`",
        "strategy" : "simple",
        "poll" : "5s",
        "versioning" : true,
        "digesting" : false,
        "autocommit" : true,
        "index" : "headphones",
        "type" : "Account"
    }
}'

I installed ElasticSearch via Homebrew on OS X Mountain Lion with no errors or problems, and everything responds as expected. Permissions are OK and there are no errors in the logs.

I have removed and included (and set to both true and false) every combination of autocommit, versioning and digesting that I could think of. It's a dev database, so I'm sure that records are deleted fully, not cached and not soft-deleted. If I delete all the records (i.e. leave the river intact and just delete what was indexed on ES), the next time the river updates it does not re-add them, which leads me to believe I have missed something regarding versioning and deleting.

Note that I've also tried various ways of specifying the _id column, and I checked via a JSON call that it had a value.

Cheers.

Solution

Since this question was asked, the parameters have changed considerably: versioning and digesting have been deprecated, and poll has been replaced by schedule, which takes a cron expression specifying how often to rerun the river (the example below is scheduled to run every 5 minutes):

    curl -XPUT 'localhost:9200/_river/account_river/_meta' -d '{
        "type" : "jdbc",
        "jdbc" : {
            "driver" : "com.mysql.jdbc.Driver",
            "url" : "jdbc:mysql://localhost:3306/test",
            "user" : "test_user",
            "password" : "test_pass",
            "sql" : "SELECT `account`.`id` as `_id`, `account`.`id`, `account`.`reference`, `account`.`company_name`, `account`.`also_known_as` from `account` WHERE NOT `account`.`deleted`",
            "strategy" : "simple",
            "schedule" : "0 0/5 * * * ?",
            "autocommit" : true,
            "index" : "headphones",
            "type" : "Account"
        }
    }'

As for the main question, this is the answer I got from the developer (https://github.com/jprante/elasticsearch-river-jdbc/issues/213):

> Deletion of rows is no longer detected.
>
> I tried housekeeping with versioning, but this did not work well together with incremental updates and adding rows.
>
> A good method would be windowed indexing. Each timeframe (maybe once per day or per week) a new index is created for the river, and added to an alias. Old indices are to be dropped after a while. This maintenance is similar to logstash indexing, but it is outside the scope of a river.
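
The windowed approach described by the developer could look roughly like the sketch below. It is not taken from the river's documentation. The alias name headphones_search, the headphones_YYYYMMDD naming scheme and both helper functions are assumptions made for illustration. The script only builds the request body for Elasticsearch's _aliases endpoint and prints it; the curl calls that would apply it are left commented out.

```shell
#!/bin/sh
# Hypothetical names: alias "headphones_search", indices "headphones_YYYYMMDD".

# Build the name of the index for a given date stamp (e.g. 20140131).
index_for() {
    printf 'headphones_%s' "$1"
}

# Build the body for POST /_aliases that removes the alias from the old
# index and adds it to the new one in a single request.
alias_swap_body() {
    old_index=$1
    new_index=$2
    alias_name=$3
    printf '{"actions":[{"remove":{"index":"%s","alias":"%s"}},{"add":{"index":"%s","alias":"%s"}}]}' \
        "$old_index" "$alias_name" "$new_index" "$alias_name"
}

# Example: point the alias at the newer index instead of the older one.
# In practice the date stamps would come from something like `date +%Y%m%d`.
body=$(alias_swap_body "$(index_for 20140130)" "$(index_for 20140131)" headphones_search)
echo "$body"

# To apply it against a local node (not executed here):
# curl -XPOST 'localhost:9200/_aliases' -d "$body"
# Once nothing reads the old index any more, it can be dropped:
# curl -XDELETE "localhost:9200/$(index_for 20140130)"
```

With a layout like this, the river's "index" setting would name the dated index, while searches go through the alias, so rows deleted from MySQL disappear from search results once the alias moves to a freshly built index.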

While I research aliasing, the method I am currently using is to recreate the index and the river nightly, and to schedule the river to run every few hours. This ensures that new data is indexed the same day it is added, and that deletions are reflected within 24 hours.
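
A minimal sketch of such a nightly rebuild is shown below, assuming the river definition from the curl example above has been saved to a file. The file name river_meta.json, the DRY_RUN switch and the function names are hypothetical and not part of the river plugin. By default the script only prints the curl commands it would run, so it can be inspected without a running node.

```shell
#!/bin/sh
# Sketch of a nightly rebuild: drop the river and its index, then register
# the river again so it re-imports only the rows that still exist in MySQL.
# RIVER and INDEX mirror the example above; DRY_RUN is a hypothetical switch
# added here so the commands can be reviewed before anything is deleted.
ES_URL=${ES_URL:-localhost:9200}
RIVER=${RIVER:-account_river}
INDEX=${INDEX:-headphones}
DRY_RUN=${DRY_RUN:-1}

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 is a file holding the river definition JSON (hypothetical file name).
rebuild() {
    meta_file=$1
    run curl -XDELETE "$ES_URL/_river/$RIVER/"
    run curl -XDELETE "$ES_URL/$INDEX"
    run curl -XPUT "$ES_URL/_river/$RIVER/_meta" -d "@$meta_file"
}

rebuild river_meta.json
```

Running it with DRY_RUN=0 from a nightly cron job would perform the actual rebuild. Searches return nothing between the index being dropped and the river finishing its first run, which is the gap the alias approach avoids.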
