Elasticsearch - do I need the JDBC driver?
Aim
To synchronize my Elasticsearch server with new and expired data in my SQL database.
Problem
There are two very different ways I can achieve this, and I don't know which is better. I can pull information into Elasticsearch with a direct connection to the SQL database using the JDBC river plugin. Alternatively, I can push data to Elasticsearch using the PHP client, using the code shown below as an example:
// The Id of the document
$id = 1;
// Create a document
$tweet = array(
    'id' => $id,
    'user' => array(
        'name' => 'mewantcookie',
        'fullName' => 'Cookie Monster'
    ),
    'msg' => 'Me wish there were expression for cookies like there is for apples. "A cookie a day make the doctor diagnose you with diabetes" not catchy.',
    'tstamp' => '1238081389',
    'location' => '41.12,-71.34',
    '_boost' => 1.0
);
// First parameter is the id of document.
$tweetDocument = new \Elastica\Document($id, $tweet);
// Add tweet to type
$elasticaType->addDocument($tweetDocument);
// Refresh Index
$elasticaType->getIndex()->refresh();
I was going to have a cron job run every thirty minutes to check for items in my database that have an "active" flag but do not yet have an "indexed" flag; those are the items I need to add to the index.
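A rough sketch of what one such polling pass could look like. The in-memory `$rows` array stands in for the SQL result set, the `active` and `indexed` keys mirror the flags mentioned above, and `syncPending` and the `$index` callback are hypothetical names standing in for the Elastica call shown earlier:

```php
<?php
// Hypothetical rows as they might come back from the SQL database.
$rows = [
    ['id' => 1, 'msg' => 'first',  'active' => 1, 'indexed' => 0],
    ['id' => 2, 'msg' => 'second', 'active' => 1, 'indexed' => 1],
    ['id' => 3, 'msg' => 'third',  'active' => 0, 'indexed' => 0],
];

/**
 * Push every active row that has not been indexed yet and return their ids.
 * $index is a stand-in for the real push to Elasticsearch.
 */
function syncPending(array &$rows, callable $index): array
{
    $synced = [];
    foreach ($rows as &$row) {
        if ($row['active'] && !$row['indexed']) {
            $index($row);
            $row['indexed'] = 1; // against SQL this would be an UPDATE
            $synced[] = $row['id'];
        }
    }
    unset($row); // drop the reference left behind by foreach
    return $synced;
}

// Record which ids were pushed, in place of a real Elasticsearch call.
$pushed = [];
$synced = syncPending($rows, function (array $row) use (&$pushed) {
    $pushed[] = $row['id'];
});
```

Running the pass a second time finds nothing to do, because the first pass already set the "indexed" flag on the only pending row.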
Question
Given that I have two ways to synchronize data between Elasticsearch and MySQL, what are the advantages and disadvantages of each option? Is there a specific use case that favors one over the other?
Recommended answer
If you set aside for a moment that you need to import the initial data into Elasticsearch, I would use an event system to push data to Elasticsearch. This is more efficient in the long run.
Your application knows exactly when something needs to be indexed by Elasticsearch. To take your tweet example: at some point a new tweet will enter your application (a user writes one, for example). This would trigger a newTweet event. You have a listener in place that listens for that event and stores the tweet in Elasticsearch whenever such an event is dispatched.
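As a minimal sketch of that idea: the `EventDispatcher` class below is a hypothetical, in-process stand-in for whatever event system your framework provides, and the `$searchIndex` array stands in for Elasticsearch. Only the `newTweet` event name comes from the description above.

```php
<?php
// Tiny in-process event dispatcher (hypothetical; a framework would supply one).
final class EventDispatcher
{
    /** @var array<string, callable[]> */
    private array $listeners = [];

    // Register a listener for a named event.
    public function listen(string $event, callable $listener): void
    {
        $this->listeners[$event][] = $listener;
    }

    // Call every listener registered for the event, in registration order.
    public function dispatch(string $event, array $payload): void
    {
        foreach ($this->listeners[$event] ?? [] as $listener) {
            $listener($payload);
        }
    }
}

// Stand-in for the search index: an array keyed by document id.
$searchIndex = [];

$dispatcher = new EventDispatcher();
$dispatcher->listen('newTweet', function (array $tweet) use (&$searchIndex) {
    // A real listener would call the Elasticsearch client here.
    $searchIndex[$tweet['id']] = $tweet;
});

// The application fires the event at the moment a tweet is created.
$dispatcher->dispatch('newTweet', ['id' => 1, 'msg' => 'Hello']);
```

The point of the indirection is that the code creating the tweet does not need to know about Elasticsearch at all; it only announces that something happened.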
If you don't want to spend resources/time in the web request on this (and you definitely don't), the listener could add a job to a queue (Gearman or Beanstalkd, for example). You would then need a worker that picks that job up and stores the tweet in Elasticsearch.
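The split between listener and worker might look roughly like this. It is a sketch under stated assumptions: an in-memory `SplQueue` stands in for Gearman or Beanstalkd, `$searchIndex` stands in for Elasticsearch, and the job shape, `$onNewTweet`, and `runWorker` are made up for illustration.

```php
<?php
// In-memory FIFO standing in for Gearman or Beanstalkd.
$queue = new SplQueue();

// Stand-in for the search index: an array keyed by document id.
$searchIndex = [];

// Listener side: runs inside the web request and only enqueues a job.
// A real queue client would serialize this payload before sending it.
$onNewTweet = function (array $tweet) use ($queue) {
    $queue->enqueue(['type' => 'indexTweet', 'tweet' => $tweet]);
};

// Worker side: in a real deployment this runs in a separate process.
// It drains the queue and returns the number of jobs it handled.
function runWorker(SplQueue $queue, array &$searchIndex): int
{
    $processed = 0;
    while (!$queue->isEmpty()) {
        $job = $queue->dequeue();
        if ($job['type'] === 'indexTweet') {
            // A real worker would call the Elasticsearch client here.
            $searchIndex[$job['tweet']['id']] = $job['tweet'];
            $processed++;
        }
    }
    return $processed;
}

$onNewTweet(['id' => 1, 'msg' => 'Hello']);
$onNewTweet(['id' => 2, 'msg' => 'World']);
$processed = runWorker($queue, $searchIndex);
```

The web request only pays for the enqueue; the slower indexing work happens later, wherever the worker runs.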
The main advantage is that Elasticsearch is kept up to date in closer to real time. You won't need a cron job that introduces a delay. You'll (mostly) handle a single document at a time. You won't need to query the SQL database to find out what needs to be (re)indexed.
Another advantage is that you can easily scale when the volume of events/data gets out of hand. When Elasticsearch itself needs more power, add servers to the cluster. When the workers can't handle the load, simply add more of them (and place them on dedicated machines). Plus, your web server(s) and SQL database(s) won't feel a thing.