Switching from MySQL to Cassandra - Pros/Cons?
For a bit of background - this question deals with a project that runs on a single small EC2 instance and is about to migrate to a medium one. The main components are Django, MySQL and a large number of custom analysis tools written in Python and Java, which do the heavy lifting. The same machine is running Apache as well.
The data model looks like the following - a large amount of real-time data streams in from various networked sensors, and ideally, I'd like to establish a long-poll approach rather than the current poll-every-15-minutes approach (a limitation of computing stats and writing into the database itself). Once the data comes in, I store the raw version in MySQL, let the analysis tools loose on this data, and store statistics in another few tables. All of this is rendered using Django.
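Relational features I require -
- Order by [SliceRange in Cassandra's API seems to satisfy this]
- Group by
- Many-to-many relations between multiple tables [Cassandra SuperColumns seem to do well for one-to-many]
- Sphinx gives me a nice full-text engine on top of this, so that's a necessity too. [On Cassandra, the Lucandra project seems to satisfy this need]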
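To make the first two requirements concrete, here is a minimal in-memory sketch of how they tend to look on a column-oriented store: "order by" falls out of keeping a row's columns sorted by timestamp and reading a contiguous range, while "group by" has to be done by the client (or a batch job) over the sliced columns. This is a toy model, not Cassandra's API; `SensorRow`, `slice` and `group_average` are hypothetical names used only for illustration.

```python
import bisect
from collections import defaultdict


class SensorRow:
    """Toy model of one wide row whose columns stay sorted by timestamp.

    Hypothetical stand-in for a row in a column-oriented store; it is not
    a client for any real database.
    """

    def __init__(self):
        self._timestamps = []  # sorted column names (integer timestamps)
        self._values = []      # column values, parallel to _timestamps

    def insert(self, ts, value):
        """Insert or overwrite the column named by timestamp `ts`."""
        i = bisect.bisect_left(self._timestamps, ts)
        if i < len(self._timestamps) and self._timestamps[i] == ts:
            self._values[i] = value
        else:
            self._timestamps.insert(i, ts)
            self._values.insert(i, value)

    def slice(self, start, finish, count=100):
        """Return up to `count` (ts, value) pairs with start <= ts <= finish,
        in timestamp order. Ordering is free because storage is sorted."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = min(bisect.bisect_right(self._timestamps, finish), lo + count)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))


def group_average(columns, bucket_seconds):
    """Client-side 'group by': average the values per time bucket."""
    buckets = defaultdict(list)
    for ts, value in columns:
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


row = SensorRow()
for ts, value in [(30, 3.0), (10, 1.0), (20, 2.0), (910, 5.0)]:
    row.insert(ts, value)

recent = row.slice(0, 100)                              # ordered range read
per_bucket = group_average(row.slice(0, 1000), 900)     # 15-minute buckets
```

The point of the sketch is the division of labour: the range read is cheap because of how the data is laid out, whereas the grouping is extra application code that a SQL engine would otherwise have done.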
My major problem is that data reads are extremely slow (and writes aren't that hot either). I don't want to throw a lot of money and hardware at it right now, and I'd prefer something that can scale easily with time. Vertically scaling MySQL is not trivial in that sense (or cheap).
So essentially, after having read a lot about NoSQL and experimented with things like MongoDB, Cassandra and Voldemort, my questions are:
On a medium EC2 instance, would I gain any benefits in reads/writes by shifting to something like Cassandra? This article (pdf) definitely seems to suggest that. Currently, I'd say a few hundred writes per minute would be the norm. For reads - since the data changes every 5 minutes or so, cache invalidation has to happen pretty quickly. At some point, it should be able to handle a large number of concurrent users as well. The app performance currently gets killed on MySQL doing some joins on large tables even if indexes are created - something on the order of 32k rows takes more than a minute to render. (This may be an artifact of EC2 virtualized I/O as well.) Size of tables is around 4-5 million rows, and there are about 5 such tables.
Everyone talks about using Cassandra on multiple nodes, given the CAP theorem and eventual consistency. But, for a project that is just beginning to grow, does it make sense to deploy a one-node Cassandra server? Are there any caveats? For instance, can it replace MySQL as a backend for Django? [Is this recommended?]
If I do shift, I'm guessing I'll have to rewrite parts of the app to do a lot more "administrivia" since I'd have to do multiple lookups to fetch rows.
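Would it make sense to just use MySQL as a key-value store rather than a relational engine, and go with that? That way I could use the large number of stable APIs available, as well as a stable engine (and go relational as needed). (Bret Taylor's post on how FriendFeed does this - http://bret.appspot.com/entry/how-friendfeed-uses-mysql)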
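As a rough illustration of what "a SQL engine as a key-value store" can look like, the sketch below keeps each record as an opaque JSON blob keyed by id and maintains a separate index table for the one attribute that needs to be queried. It uses Python's built-in `sqlite3` purely so the example runs anywhere; the table names, the `sensor_id` field and the helper functions are assumptions made for this example, not taken from the linked post, and the upsert statement would need MySQL's own syntax (for example `REPLACE INTO`) on a real MySQL server.

```python
import json
import sqlite3
import uuid


def open_store(path=":memory:"):
    """Create the blob table and one index table (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entities ("
        " id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS index_sensor ("
        " sensor_id TEXT NOT NULL, entity_id TEXT NOT NULL,"
        " PRIMARY KEY (sensor_id, entity_id))"
    )
    return conn


def put(conn, entity):
    """Store the entity as a JSON blob and update the index in one transaction."""
    entity_id = entity.setdefault("id", uuid.uuid4().hex)
    with conn:  # commits on success, rolls back on exception
        # INSERT OR REPLACE is SQLite syntax, used here as a portable stand-in.
        conn.execute(
            "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
            (entity_id, json.dumps(entity)),
        )
        if "sensor_id" in entity:
            conn.execute(
                "INSERT OR REPLACE INTO index_sensor (sensor_id, entity_id)"
                " VALUES (?, ?)",
                (entity["sensor_id"], entity_id),
            )
    return entity_id


def get(conn, entity_id):
    """Primary-key lookup: the only access path the blob table itself offers."""
    row = conn.execute(
        "SELECT body FROM entities WHERE id = ?", (entity_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def find_by_sensor(conn, sensor_id):
    """Look up ids through the index table, then fetch each entity by key."""
    ids = [
        r[0]
        for r in conn.execute(
            "SELECT entity_id FROM index_sensor WHERE sensor_id = ?", (sensor_id,)
        )
    ]
    entities = (get(conn, entity_id) for entity_id in ids)
    # Re-check the attribute so a stale index row cannot return a wrong entity.
    return [e for e in entities if e is not None and e.get("sensor_id") == sensor_id]


conn = open_store()
first_id = put(conn, {"sensor_id": "s1", "value": 1.5})
put(conn, {"sensor_id": "s2", "value": 9})
s1_values = [e["value"] for e in find_by_sensor(conn, "s1")]
```

The trade-off this makes visible is that every new query pattern needs its own index table and its own lookup code, which is roughly the "administrivia" a move to a non-relational store would also require.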
Any insights from people who've done a shift would be greatly appreciated!
Thanks.
Recommended answer
Cassandra and the other distributed databases available today do not provide the kind of ad-hoc query support you are used to from SQL. This is because you can't distribute queries with joins performantly, so the emphasis is on denormalization instead.
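A minimal sketch of what denormalization means in practice, using plain Python dictionaries in place of a real store: instead of joining a sensors table against a readings table at read time, the writer copies the sensor name into a per-sensor summary record and keeps running totals there, so rendering a dashboard row is a single key lookup. The class, the field names and the assumption that each (sensor, timestamp) pair is written exactly once are all illustrative.

```python
from collections import defaultdict


class DenormalizedStore:
    """Toy store: raw readings plus a precomputed, join-free summary per sensor."""

    def __init__(self):
        self.raw = defaultdict(dict)  # sensor_id -> {timestamp: value}
        self.summary = {}             # sensor_id -> denormalized summary record

    def record(self, sensor_id, sensor_name, ts, value):
        """Write path does the extra work so the read path needs no join.

        Assumes each (sensor_id, ts) pair is written once; rewriting the same
        timestamp would double-count in the running totals.
        """
        self.raw[sensor_id][ts] = value
        s = self.summary.setdefault(
            sensor_id, {"name": sensor_name, "count": 0, "total": 0.0, "last_ts": None}
        )
        s["name"] = sensor_name  # duplicated from the 'sensors' entity on purpose
        s["count"] += 1
        s["total"] += value
        s["last_ts"] = ts if s["last_ts"] is None else max(s["last_ts"], ts)

    def dashboard_row(self, sensor_id):
        """Read path: one lookup by key, no scan over raw readings."""
        s = self.summary[sensor_id]
        return {"name": s["name"], "mean": s["total"] / s["count"], "last_ts": s["last_ts"]}


store = DenormalizedStore()
store.record("s1", "Boiler", 10, 2.0)
store.record("s1", "Boiler", 20, 4.0)
row = store.dashboard_row("s1")
```

The cost is duplicated data and more logic on the write path; the benefit is that reads no longer depend on a join the database would have to execute across nodes.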
However, Cassandra 0.6 (beta officially out tomorrow, but you can build from the 0.6 branch yourself if you're impatient) supports Hadoop map/reduce for analytics, which actually sounds like a good fit for you.
Cassandra provides excellent support for adding new nodes painlessly, even to an initial group of one.
That said, at a few hundred writes/minute you're going to be fine on MySQL for a long, long time. Cassandra is much better at being a key/value store (even better, key/columnfamily) but MySQL is much better at being a relational database. :)
There is no Django support for Cassandra (or any other NoSQL database) yet. They are talking about doing something for the next version after 1.2, but based on talking to Django devs at PyCon, nobody is really sure what that will look like yet.