MySQL 的扩展解决方案(复制、集群)
在启动时,我们正在考虑为我们的数据库扩展解决方案.MySQL 的事情变得有些混乱(至少对我而言),它有 MySQL 集群、复制和MySQL 集群复制(从 5.1.6 版开始),这是 MySQL 集群的异步版本.MySQL 手册解释了其 集群常见问题,但很难从中确定何时使用其中之一.
At the startup I'm working at we are now considering scaling solutions for our database. Things get somewhat confusing (for me at least) with MySQL, which has the MySQL cluster, replication and MySQL cluster replication (from ver. 5.1.6), which is an asynchronous version of the MySQL cluster. The MySQL manual explains some of the differences in its cluster FAQ, but it is hard to ascertain from it when to use one or the other.
如果您熟悉这些解决方案之间的差异、优缺点,以及您建议何时使用每种解决方案,我将不胜感激.
I would appreciate any advice from people who are familiar with the differences between those solutions and what are the pros and cons, and when do you recommend to use each.
推荐答案
我一直在阅读有关可用选项的大量内容.我还接触了 High Performance MySQL 2nd edition,强烈推荐.
I've been doing A LOT of reading on the available options. I also got my hands on High Performance MySQL 2nd edition, which I highly recommend.
这是我设法拼凑起来的:
聚类
This is what I've managed to piece together:
Clustering
一般意义上的集群是在许多服务器之间分配负载,这些服务器对于外部应用程序来说是一台服务器.
Clustering in the general sense is distributing load across many servers that appear to an outside application as one server.
MySQL NDB Cluster 是一个分布式的、内存中的、无共享的存储引擎,具有同步复制和自动数据分区(对不起,我从高性能书中借用了字面意思,但他们把它放在那里非常好).对于某些应用程序来说,它可以是一种高性能的解决方案,但 Web 应用程序通常不能很好地在它上面工作.
MySQL NDB Cluster is a distributed, in-memory, shared-nothing storage engine with synchronous replication and automatic data partioning (excuse me I borrow literally from the High Performance book, but they put it very nicely there). It can be a high performance solution for some applications, but web application generally do not work well on it.
主要问题是,除了非常简单的查询(仅涉及一张表)之外,集群通常还必须在多个节点上搜索数据,从而导致网络延迟蔓延并显着减慢查询的完成时间.由于应用程序将集群视为一台计算机,因此无法告诉它从哪个节点获取数据.
The major problem is that beyond very simple queries (that touch only one table), the cluster will generally have to search for data on several nodes, allowing network latency to creep in and significantly slow down completion time for queries. Since the application treats the cluster as one computer, it can't tell it which node to fetch the data from.
此外,内存要求对于许多大型数据库来说是不可行的.
In addition, the in-memory requirement is not workable for many large databases.
这是 MySQL 的另一个集群解决方案,充当 MySQL 服务器之上的中间件.它提供同步复制、负载平衡和故障转移.它还确保请求始终从最新副本中获取数据,自动选择具有最新数据的节点.
This is another clustering solution for MySQL, that acts as a middleware on top of the MySQL server. It offers synchronous replication, load balancing and failover. It also ensures that requests always get the data from the latest copy, automatically choosing a node that has the fresh data.
我读过一些关于它的好东西,总的来说,这听起来很有希望.
I've read some good things on it, and overall it sounds pretty promising.
Federation 类似于聚类,所以我也把它拉到了这里.MySQL 通过联合存储引擎提供联合.与 NDB 集群解决方案类似,它仅适用于简单查询 - 但对于复杂查询的集群更糟(因为网络延迟要高得多).
Federation is similar to clustering, so I tugged it here as well. MySQL offers federation via the federated storage engine. Similar to the NDB cluster solution, it works well with simple queries only - but even worse the the cluster for complicated ones (since network latency is much higher).
MySQL 具有在不同服务器上创建数据库复制的内置能力.这可以用于很多事情 - 在服务器之间分配负载、热备份、创建测试服务器和故障转移.
MySQL has the built in capacity to create replications of a database on different servers. This can be used for many things - splitting the load between servers, hot backups, creating test servers and failover.
复制的基本设置包括一台主服务器处理主要写入和一个或多个从服务器只处理读取.一个更高级的变体是 master-master 配置,它允许将写入扩展为好几个服务器同时写入.
The basic setup of replication involves one master server handling mostly writes and one or more slaves handling reads only. A more advanced variation is that of the master-master configuration, which allows to scale writes as well by having several servers writing at the same time.
每种配置都有其优点和缺点,但它们共有的一个问题是复制延迟 - 由于 MySQL 复制是异步的,并非所有节点始终拥有最新数据.这要求应用程序知道复制并合并复制感知查询以按预期工作.对于某些应用程序,这可能不是问题,但如果您总是需要最新的数据,事情就会变得有些复杂.
Each configuration has its pros and cons, but one problem they all share is replication lag - since MySQL replication is asynchronous, not all nodes have the freshest data at all time. This requires the application to be aware of the replication and incorporate replication-aware queries to work as expected. For some applications this might not be a problem, but if you always need the freshest data things get somewhat complicated.
复制需要一些负载平衡来在节点之间分配负载.这可以像对应用程序代码进行一些修改一样简单,或者使用专用的软件和硬件解决方案.
Replication requires some load balancing to split the load between the nodes. This can be as simple as some modifications to the application code, or using dedicated software and hardware solutions.
分片是扩展数据库解决方案的常用方法.您将数据拆分为更小的碎片,并将它们分布在不同的服务器节点上.这需要应用程序知道对数据存储的修改才能有效地工作,因为它需要知道在哪里可以找到它需要的信息.
Sharding is commonly used approach to scale database solutions. You split the data into smaller shards and spread them around different server nodes. This requires the application to be aware of the modification to the data storage to work efficiently, as it needs to know where to find the information it needs.
有一些抽象框架可用于帮助处理数据分片,例如 Hibernate Shards,一个Hibernate ORM 的扩展(不幸的是在 Java 中.我使用的是 PHP).HiveDB 是另一种同样支持分片重新平衡的解决方案.
There are abstraction frameworks available to help deal with data sharding, such as Hibernate Shards, an extension to the Hibernate ORM (which unfortunately is in Java. I'm using PHP). HiveDB is another such solution which also supports shard rebalancing.
Sphinx 是一个全文搜索引擎,其用途远不止测试搜索.对于许多查询,它比 MySQL 快得多(特别是在分组和排序方面),并且可以并行查询远程系统并聚合结果 - 这使得它在与分片一起使用时非常有用.
Sphinx is a full-text search engine, that can be used for far more than test searches. For many queries it is much faster than MySQL (especially for grouping and sorting), and can query remote systems in parallel and aggregate the results - which make it very useful in use with sharding.
一般来说,sphinx 应该与其他扩展解决方案一起使用,以获得更多可用的硬件和基础设施.缺点是您再次需要应用程序代码来了解 sphinx 才能明智地使用它.
In general sphinx should be used with other scaling solutions to get more of the available hardware and infrastructure. The downside is that again you need the application code to be aware of sphinx to use it wisely.
扩展解决方案因需要它的应用程序的需求而异.对于我们和大多数 Web 应用程序,我相信复制(可能是多主)是负载均衡器分配负载的方式.特定问题区域(大表)的分片也是能够横向扩展的必要条件.
Scaling solutions differ depending on the needs of the application that needs it. For us and for most web-applications, I believe that replication (probably multi-master) is the way to go with a load balancer distributing the load. Sharding of specific problem areas (huge tables) is also a must for being able to scale horizontally.
我还将试一试Continent Sequoia,看看它是否真的能做到它所承诺的,因为它将对应用程序代码进行最少的更改.
I'm also going to give a shot to Continuent Sequoia and see if it can really do what it promises to since it will involve the least amount of changes to application code.
相关文章