MySQL 和 NoSQL:帮我选一个合适的

2022-01-13 00:00:00 nosql php mysql cassandra

有一个大数据库,有 1,000,000,000 行,称为线程(这些线程确实存在,我并不是因为我喜欢它而使事情变得更难).Threads 里面只有一些东西,让事情变得更快:(int id, string hash, int replycount, int dateline (timestamp), int forumid, string title)

There is a big database, 1,000,000,000 rows, called threads (these threads actually exist, I'm not making things harder just because of I enjoy it). Threads has only a few stuff in it, to make things faster: (int id, string hash, int replycount, int dateline (timestamp), int forumid, string title)

查询:

select * from thread where forumid = 100 and replycount >1 个订单,按 dateline desc 限制 10000, 100

因为有 1G 的记录,所以查询速度很慢.所以我想,让我们将这 1G 的记录拆分到我拥有的论坛(类别)中尽可能多的表中!这几乎是完美的.有很多表,我搜索的记录较少,而且速度真的更快.查询现在变为:

Since that there are 1G of records it's quite a slow query. So I thought, let's split this 1G of records in as many tables as many forums(category) I have! That is almost perfect. Having many tables I have less record to search around and it's really faster. The query now becomes:

select * from thread_{forum_id} where replycount >1 个订单,按 dateline desc 限制 10000, 100

这对于 99% 的论坛(类别)来说确实更快,因为其中大多数只有少数主题 (100k-1M).但是,因为有些记录有大约 10M 的记录,所以一些查询仍然很慢(0.1/.2 秒,对我的应用程序来说太慢了!我已经在使用索引了!强>).

This is really faster with 99% of the forums (category) since that most of those have only a few of topics (100k-1M). However because there are some with about 10M of records, some query are still to slow (0.1/.2 seconds, to much for my app!, I'm already using indexes!).

我不知道如何使用 MySQL 来改进这一点.有什么办法吗?

I don't know how to improve this using MySQL. Is there a way?

对于这个项目,我将使用 10 台服务器(12GB 内存,软件 raid 10 上的 4x7200rpm 硬盘,四核)

For this project I will use 10 Servers (12GB ram, 4x7200rpm hard disk on software raid 10, quad core)

这个想法是在服务器之间简单地拆分数据库,但是上面解释的问题仍然不够.

The idea was to simply split the databases among the servers, but with the problem explained above that is still not enought.

如果我在这 10 台服务器上安装 cassandra(假设我有时间让它按预期工作),我是否应该假设性能提升?

If I install cassandra on these 10 servers (by supposing I find the time to make it works as it is supposed to) should I be suppose to have a performance boost?

我该怎么办?继续在多台机器上使用带有分布式数据库的 MySQL 或构建 cassandra 集群?

我被要求发布索引是什么,它们是:

I was asked to post what are the indexes, here they are:

mysql> show index in thread;
PRIMARY id
forumid
dateline
replycount

选择解释:

mysql> explain SELECT * FROM thread WHERE forumid = 655 AND visible = 1 AND open <> 10 ORDER BY dateline ASC LIMIT 268000, 250;
+----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+
| id | select_type | table  | type | possible_keys | key     | key_len | ref         | rows   | Extra                       |
+----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+
|  1 | SIMPLE      | thread | ref  | forumid       | forumid | 4       | const,const | 221575 | Using where; Using filesort | 
+----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+

推荐答案

您应该阅读以下内容并了解一些关于精心设计的 innodb 表的优势以及如何最好地使用聚集索引 - 仅适用于 innodb!

You should read the following and learn a little bit about the advantages of a well designed innodb table and how best to use clustered indexes - only available with innodb !

http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/

然后按照以下简化示例设计您的系统:

then design your system something along the lines of the following simplified example:

重要的特点是表使用innodb引擎,thread表的主键不再是单个auto_incrementing键,而是基于forum_id和thread_id组合的复合clustered键.例如

The important features are that the tables use the innodb engine and the primary key for the threads table is no longer a single auto_incrementing key but a composite clustered key based on a combination of forum_id and thread_id. e.g.

threads - primary key (forum_id, thread_id)

forum_id    thread_id
========    =========
1                   1
1                   2
1                   3
1                 ...
1             2058300  
2                   1
2                   2
2                   3
2                  ...
2              2352141
...

每个论坛行都包含一个名为 next_thread_id (unsigned int) 的计数器,该计数器由触发器维护,并在每次将线程添加到给定论坛时递增.这也意味着如果对 thread_id 使用单个 auto_increment 主键,我们每个论坛可以存储 40 亿个线程,而不是总共 40 亿个线程.

Each forum row includes a counter called next_thread_id (unsigned int) which is maintained by a trigger and increments every time a thread is added to a given forum. This also means we can store 4 billion threads per forum rather than 4 billion threads in total if using a single auto_increment primary key for thread_id.

forum_id    title   next_thread_id
========    =====   ==============
1          forum 1        2058300
2          forum 2        2352141
3          forum 3        2482805
4          forum 4        3740957
...
64        forum 64       3243097
65        forum 65      15000000 -- ooh a big one
66        forum 66       5038900
67        forum 67       4449764
...
247      forum 247            0 -- still loading data for half the forums !
248      forum 248            0
249      forum 249            0
250      forum 250            0

使用复合键的缺点是不能再通过单个键值来选择线程,如下:

The disadvantage of using a composite key is that you can no longer just select a thread by a single key value as follows:

select * from threads where thread_id = y;

你必须做的:

select * from threads where forum_id = x and thread_id = y;

但是,您的应用程序代码应该知道用户正在浏览哪个论坛,因此实现起来并不困难 - 将当前查看的 forum_id 存储在会话变量或隐藏表单字段等中......

However, your application code should be aware of which forum a user is browsing so it's not exactly difficult to implement - store the currently viewed forum_id in a session variable or hidden form field etc...

这是简化的架构:

drop table if exists forums;
create table forums
(
forum_id smallint unsigned not null auto_increment primary key,
title varchar(255) unique not null,
next_thread_id int unsigned not null default 0 -- count of threads in each forum
)engine=innodb;


drop table if exists threads;
create table threads
(
forum_id smallint unsigned not null,
thread_id int unsigned not null default 0,
reply_count int unsigned not null default 0,
hash char(32) not null,
created_date datetime not null,
primary key (forum_id, thread_id, reply_count) -- composite clustered index
)engine=innodb;

delimiter #

create trigger threads_before_ins_trig before insert on threads
for each row
begin
declare v_id int unsigned default 0;

  select next_thread_id + 1 into v_id from forums where forum_id = new.forum_id;
  set new.thread_id = v_id;
  update forums set next_thread_id = v_id where forum_id = new.forum_id;
end#

delimiter ;

您可能已经注意到我将 reply_count 作为主键的一部分包含在内,这有点奇怪,因为 (forum_id, thread_id) 组合本身是唯一的.这只是一个索引优化,它在执行使用 reply_count 的查询时节省了一些 I/O.有关更多信息,请参阅上面的 2 个链接.

You may have noticed I've included reply_count as part of the primary key which is a bit strange as (forum_id, thread_id) composite is unique in itself. This is just an index optimisation which saves some I/O when queries that use reply_count are executed. Please refer to the 2 links above for further info on this.

我仍在将数据加载到我的示例表中,到目前为止,我已经加载了大约.5 亿行(是您系统的一半).加载过程完成后,我应该期望有大约:

I'm still loading data into my example tables and so far I have a loaded approx. 500 million rows (half as many as your system). When the load process is complete I should expect to have approx:

250 forums * 5 million threads = 1250 000 000 (1.2 billion rows)

我故意让一些论坛的帖子数超过 500 万,例如,论坛 65 有 1500 万个帖子:

I've deliberately made some of the forums contain more than 5 million threads for example, forum 65 has 15 million threads:

forum_id    title   next_thread_id
========    =====   ==============
65        forum 65      15000000 -- ooh a big one

查询运行时

select sum(next_thread_id) from forums;

sum(next_thread_id)
===================
539,155,433 (500 million threads so far and still growing...)

在 innodb 下对 next_thread_ids 求和以给出总线程数比平时快得多:

under innodb summing the next_thread_ids to give a total thread count is much faster than the usual:

select count(*) from threads;

论坛65有多少个话题:

How many threads does forum 65 have:

select next_thread_id from forums where forum_id = 65

next_thread_id
==============
15,000,000 (15 million)

这再次比平常更快:

select count(*) from threads where forum_id = 65

好的,现在我们知道到目前为止我们有大约 5 亿个线程,论坛 65 有 1500 万个线程 - 让我们看看架构如何执行 :)

Ok now we know we have about 500 million threads so far and forum 65 has 15 million threads - let's see how the schema performs :)

select forum_id, thread_id from threads where forum_id = 65 and reply_count > 64 order by thread_id desc limit 32;

runtime = 0.022 secs

select forum_id, thread_id from threads where forum_id = 65 and reply_count > 1 order by thread_id desc limit 10000, 100;

runtime = 0.027 secs

对我来说看起来非常高效 - 所以这是一个包含 500+ 百万行(并且还在增长)的单个表,其查询在 0.02 秒内覆盖了 1500 万行(在负载下!)

Looks pretty performant to me - so that's a single table with 500+ million rows (and growing) with a query that covers 15 million rows in 0.02 seconds (while under load !)

这些将包括:

  • 按范围分区

  • partitioning by range

分片

投入金钱和硬件

等等……

希望这个答案对您有所帮助:)

hope you find this answer helpful :)

相关文章