MySQL 分片方法?
分片 MySQL 表的最佳方法是什么.我能想到的方法是:
What is the best approach for Sharding MySQL tables. The approaches I can think of are :
- 应用级分片?
- 在 MySQL 代理层进行分片?
- 用于分片的中央查找服务器?
你知道这方面有什么有趣的项目或工具吗?
Do you know of any interesting projects or tools in this area?
推荐答案
分片 MySQL 表的最佳方法是不要这样做,除非完全不可避免.
The best approach for sharding MySQL tables to not do it unless it is totally unavoidable to do it.
在编写应用程序时,您通常希望以最大限度提高开发速度和开发人员速度的方式进行编写.仅在必要时优化延迟(答案准备好之前的时间)或吞吐量(每个时间单位的答案数量).
When you are writing an application, you usually want to do so in a way that maximizes velocity, developer speed. You optimize for latency (time until the answer is ready) or throughput (number of answers per time unit) only when necessary.
只有当所有这些分区的总和不再适合单个数据库服务器实例时,您才进行分区,然后将分区分配给不同的主机(= 分片) - 原因是写入或读取.
You partition and then assign partitions to different hosts (= shard) only when the sum of all these partitions does no longer fit onto a single database server instance - the reason for that being either writes or reads.
写入情况是 a) 写入频率使该服务器磁盘永久过载或 b) 写入过多,因此复制在此复制层次结构中永久滞后.
The write case is either a) the frequency of writes is overloading this servers disks permanently or b) there are too many writes going on so that replication permanently lags in this replication hierarchy.
分片的读取情况是当数据的大小如此之大以至于它的工作集不再适合内存并且数据读取开始访问磁盘而不是大部分时间从内存中提供服务时.
The read case for sharding is when the size of the data is so large that the working set of it no longer fits into memory and data reads start hitting the disk instead of being served from memory most of the time.
只有当您必须进行分片时,您才这样做.
Only when you have to shard you do it.
分片的那一刻,您正在以多种方式为此付出代价:
The moment you shard, you are paying for that in multiple ways:
您的大部分 SQL 不再是声明性的.
Much of your SQL is no longer declarative.
通常,在 SQL 中,您会告诉数据库您想要什么数据,然后让优化器将其转换为数据访问程序.这是一件好事,因为它很灵活,而且编写这些数据访问程序是一件很无聊的工作,会损害速度.
Normally, in SQL you are telling the database what data you want and leave it to the optimizer to turn that specification into a data access program. That is a good thing, because it is flexible, and because writing these data access programs is boring work that harms velocity.
在分片环境中,您可能将节点 A 上的表与节点 B 上的数据连接起来,或者您在节点 A 和 B 上有一个比节点大的表,并且正在将数据与节点 B 上的数据连接起来和 C. 你开始手动编写应用程序端基于哈希的连接解析来解决这个问题(或者你正在重新发明 MySQL 集群),这意味着你最终会得到很多不再声明的 SQL,而是在一种程序方式(例如,您在循环中使用 SELECT 语句).
With a sharded environment you are probably joining a table on node A against data on node B, or you have a table larger than a node, on nodes A and B and are joining data from it against data that is on node B and C. You are starting to write application side hash-based join resolutions manually in order to resolve that (or you are reinventing MySQL cluster), meaning you end up with a lot of SQL that no longer declarative, but is expressing SQL functionality in a procedural way (e.g. you are using SELECT statements in loops).
您的网络延迟很大.
通常,SQL 查询可以在本地解析,优化器知道与本地磁盘访问相关的成本,并以最小化成本的方式解析查询.
Normally, an SQL query can be resolved locally and the optimizer knows about the costs associated with local disk accesses and resolves the query in a way that minimizes the costs for that.
在分片环境中,通过在网络上对多个节点运行键值访问来解决查询(希望使用批量键访问而不是每次往返的单独键查找)或通过推送部分 WHERE
子句到可以应用它们的节点(称为条件下推"),或两者兼而有之.
In a sharded environment, queries are resolved by either running key-value accesses across a network to multiple nodes (hopefully with batched key accesses and not individual key lookups per round trip) or by pushing parts of the WHERE
clause onward to the nodes where they can be applied (that is called 'condition pushdown'), or both.
但即使在最好的情况下,这也比本地情况涉及更多的网络往返,而且更复杂.特别是因为 MySQL 优化器对网络延迟一无所知(好吧,MySQL 集群在这方面正在慢慢变得更好,但对于集群外的 vanilla MySQL 仍然如此).
But even in the best of cases this involves many more network round trips that a local situation, and it is more complicated. Especially since the MySQL optimizer knows nothing about network latency at all (Ok, MySQL cluster is slowly getting better at that, but for vanilla MySQL outside of cluster that is still true).
您正在失去 SQL 的大量表达能力.
You are losing a lot of expressive power of SQL.
好吧,这可能不太重要,但是外键约束和其他用于数据完整性的 SQL 机制无法跨越多个分片.
Ok, that is probably less important, but foreign key constraints and other SQL mechanisms for data integrity are incapable of spanning multiple shards.
MySQL 没有允许异步查询的 API.
MySQL has no API which allows asynchronous queries that is in working order.
当同一类型的数据驻留在多个节点上时(例如节点 A、B 和 C 上的用户数据),通常需要针对所有这些节点解决横向查询(查找所有未登录的用户帐户"90 天或更长时间").数据访问时间随节点数量线性增长,除非可以并行询问多个节点并在结果进入时聚合(Map-Reduce").
When data of the same type resides on multiple nodes (e.g. user data on nodes A, B and C), horizontal queries often need to be resolved against all of these nodes ("Find all user accounts that have not been logged in for 90 days or more"). Data access time grows linearly with the number of nodes, unless multiple nodes can be asked in parallel and the results aggregated as they come in ("Map-Reduce").
这样做的前提是一个异步通信 API,它不存在于 MySQL 的良好工作状态.另一种选择是在子进程中进行大量分叉和连接,这是在季票上访问吮吸世界.
The precondition for that is an asynchronous communication API, which does not exist for MySQL in a good working shape. The alternative is a lot of forking and connections in the child processes, which is visiting the world of suck on a season pass.
一旦您开始分片,数据结构和网络拓扑就会变得可见,因为性能指向您的应用程序.为了合理地执行,您的应用程序需要了解这些事情,这意味着实际上只有应用程序级分片才有意义.
Once you start sharding, data structure and network topology become visible as performance points to your application. In order to perform reasonably well, your application needs to be aware of these things, and that means that really only application level sharding makes sense.
如果您想自动分片(例如通过散列主键来确定哪一行进入哪个节点),或者如果您想以手动方式进行功能拆分(与 xyz 用户故事相关的表转到这个主,而 abc 和 def 相关的表转到那个主").
The question is more if you want to auto-shard (determining which row goes into which node by hashing primary keys for example) or if you want to split functionally in a manual way ("The tables related to the xyz user story go to this master, while abc and def related tables go to that master").
功能分片的优点是,如果做得好,大多数开发人员大部分时间都看不到它,因为与他们的用户故事相关的所有表都将在本地可用.这使他们仍然可以尽可能长时间地从声明式 SQL 中受益,并且由于跨网络传输的数量保持最少,因此也将产生更少的网络延迟.
Functional sharding has the advantage that, if done right, it is invisible to most developers most of the time, because all tables related to their user story will be available locally. That allows them to still benefit from declarative SQL as long as possible, and will also incur less network latency because the number of cross-network transfers is kept minimal.
函数式分片的缺点是不允许单个表大于一个实例,需要设计者手动关注.
Functional sharding has the disadvantage that it does not allow for any single table to be larger than one instance, and it requires manual attention of a designer.
功能分片的优势在于,它相对容易地对现有代码库进行,但更改数量不会太大.http://Booking.com 在过去的几年里已经做过多次,而且效果很好.
Functional sharding has the advantage that it is relatively easily done to an existing codebase with a number of changes that is not overly large. http://Booking.com has done it multiple times in the past years and it worked well for them.
说了这么多,看着你的问题,我确实相信你问错了问题,或者我完全误解了你的问题陈述.
Having said all that, looking at your question, I do believe that you are asking the wrong questions, or I am completely misunderstanding your problem statement.
相关文章