How to sync a MySQL database to an external data source
I have a MySQL database table called search that I need to keep up to date with an Elasticsearch index. I have already exported the table to the ES index, but now I need to keep the data in sync, or else the search results will go stale quite quickly.
The only way I can think of is exporting the table every x minutes and comparing it with what was last imported. This isn't feasible: the table has about 10M rows, and I don't want to run a full table export every five minutes all day long. What would be a good solution for this? Note that I only have read access to the database.
Recommended answer
I would use Logstash with a jdbc input plugin and an elasticsearch output plugin. There's a blog article showing a full example of this solution.
After installing Logstash, you can create a configuration file that uses the two plugins mentioned above, like this:
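input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "user"
    jdbc_password => "1234"
    jdbc_validate_connection => true
    jdbc_driver_library => "mysql-connector-java-5.1.36-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    # The schedule uses cron syntax; this runs the query every 5 minutes
    schedule => "*/5 * * * *"
    # :sql_last_value is filled in by the plugin with the time of the previous run
    statement => "SELECT * FROM search WHERE timestamp > :sql_last_value"
  }
}
output {
  elasticsearch {
    # Option names differ between Logstash versions; newer releases
    # take hosts => ["ES_NODE_HOST"] in place of protocol/host
    protocol => "http"
    index => "searches"
    document_type => "search"
    # Reusing the primary key as the document ID makes updates overwrite
    document_id => "%{uid}"
    host => "ES_NODE_HOST"
  }
}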
You will need to change a few values to match your environment, but this should cover what you need to do.
Every 5 minutes the query will run and fetch all search records whose timestamp (change that name to match your data) is more recent than the last time the query ran. The selected records will be indexed into the searches index on your Elasticsearch server at ES_NODE_HOST. Make sure to change the index and type names accordingly, as well as the name of the primary key field (i.e. uid), to match your data.
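To make the incremental idea concrete, here is a minimal Python sketch of one polling cycle. It is not how Logstash is implemented, only an illustration under stated assumptions: an in-memory sqlite3 table stands in for the MySQL search table, the title column and the helper names (fetch_changed, to_bulk_body, sync_once) are hypothetical, and the cursor is the largest timestamp seen so far rather than the time of the previous run. The payload follows the Elasticsearch _bulk format (one action line, then one source line per document), with uid reused as the document ID.

```python
import json
import sqlite3


def fetch_changed(conn, last_value):
    """Return rows whose timestamp is newer than last_value, oldest first."""
    cur = conn.execute(
        "SELECT uid, title, timestamp FROM search "
        "WHERE timestamp > ? ORDER BY timestamp",
        (last_value,),
    )
    cols = [c[0] for c in cur.description]
    return [dict(zip(cols, row)) for row in cur]


def to_bulk_body(rows, index="searches"):
    """Build a newline-delimited _bulk payload; _id = uid so re-sends overwrite."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index, "_id": row["uid"]}}))
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n" if lines else ""


def sync_once(conn, last_value):
    """Run one polling cycle; return the payload and the new cursor value."""
    rows = fetch_changed(conn, last_value)
    new_last = rows[-1]["timestamp"] if rows else last_value
    return to_bulk_body(rows), new_last


# Demo data (hypothetical): sqlite3 stands in for the read-only MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE search (uid INTEGER PRIMARY KEY, title TEXT, timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO search VALUES (?, ?, ?)",
    [
        (1, "first", "2024-01-01 10:00:00"),
        (2, "second", "2024-01-01 10:04:00"),
    ],
)

# First cycle picks up both rows.
body, last = sync_once(conn, "1970-01-01 00:00:00")

# A later update bumps the timestamp, so only that row is picked up next time.
conn.execute(
    "UPDATE search SET title = 'second (edited)', "
    "timestamp = '2024-01-01 10:07:00' WHERE uid = 2"
)
body2, last2 = sync_once(conn, last)
```

Because each document is keyed by uid, sending an updated row again replaces the existing document instead of creating a duplicate. Note that a timestamp-based query like this only sees inserts and updates; rows deleted from the table are not detected.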