MySQL 的 ORDER BY RAND() 如何工作?

2021-11-20 00:00:00 random select mysql

我一直在研究和测试如何在 MySQL 中进行快速随机选择.在这个过程中，我遇到了一些意想不到的结果，现在我不完全确定我知道 ORDER BY RAND() 是如何工作的.

I've been doing some research and testing on how to do fast random selection in MySQL. In the process I've faced some unexpected results and now I am not fully sure I know how ORDER BY RAND() really works.

我一直认为当你对表执行 ORDER BY RAND() 时，MySQL 会向表中添加一个新列，该列填充有随机值，然后按该列对数据进行排序，然后例如您采用随机到达的上述值.我做了很多谷歌搜索和测试，最后发现查询 Jay在他的博客中提供确实是最快的解决方案:

I always thought that when you do ORDER BY RAND() on the table, MySQL adds a new column to the table which is filled with random values, then it sorts data by that column and then e.g. you take the above value which got there randomly. I've done lots of googling and testing and finally found that the query Jay offers in his blog is indeed the fastest solution:

SELECT * FROM Table T JOIN (SELECT CEIL(MAX(ID)*RAND()) AS ID FROM Table) AS x ON T.ID >= x.ID LIMIT 1;

虽然普通 ORDER BY RAND() 在我的测试表上需要 30-40 秒，但他的查询在 0.1 秒内完成工作.他在博客中解释了它是如何运作的，所以我会跳过这个，最后转向奇怪的事情.

While common ORDER BY RAND() takes 30-40 seconds on my test table, his query does the work in 0.1 seconds. He explains how this functions in the blog so I'll just skip this and finally move to the odd thing.

我的表是一个普通表，有一个 PRIMARY KEY id 和其他非索引的东西，比如 username、age 等.这是我正在努力解释的事情

My table is a common table with a PRIMARY KEY id and other non-indexed stuff like username, age, etc. Here's the thing I am struggling to explain

SELECT * FROM table ORDER BY RAND() LIMIT 1; /*30-40 seconds*/ SELECT id FROM table ORDER BY RAND() LIMIT 1; /*0.25 seconds*/ SELECT id, username FROM table ORDER BY RAND() LIMIT 1; /*90 seconds*/

我有点期望看到所有三个查询的时间大致相同，因为我总是对单个列进行排序.但由于某种原因，这并没有发生.如果您对此有任何想法，请告诉我.我有一个项目，我需要按 RAND() 进行快速 ORDER，我个人更喜欢使用

I was sort of expecting to see approximately the same time for all three queries since I am always sorting on a single column. But for some reason this didn't happen. Please let me know if you any ideas about this. I have a project where I need to do fast ORDER BY RAND() and personally I would prefer to use

SELECT id FROM table ORDER BY RAND() LIMIT 1; SELECT * FROM table WHERE id=ID_FROM_PREVIOUS_QUERY LIMIT 1;

是的，这比 Jay 的方法慢，但它更小且更容易理解.我的查询相当大，有几个 JOIN 和 WHERE 子句，虽然 Jay 的方法仍然有效，但查询变得非常大和复杂，因为我需要使用 JOINed(在他的查询中称为 x)子请求中的所有 JOIN 和 WHERE.

which, yes, is slower than Jay's method, however it is smaller and easier to understand. My queries are rather big ones with several JOINs and with WHERE clause and while Jay's method still works, the query grows really big and complex because I need to use all the JOINs and WHERE in the JOINed (called x in his query) sub request.

感谢您的时间！

推荐答案

虽然没有通过 rand() 快速订购"这样的东西，但对于您的特定任务，有一个解决方法.

While there's no such thing as a "fast order by rand()", there is a workaround for your specific task.

要获取任意单个随机行，您可以像这位德国博主所做的那样:http://web.archive.org/web/20200211210404/http://www.roberthartung.de/mysql-order-by-rand-a-case-study-of-alternatives/(我看不到热链接网址.如果有人看到，请随时编辑链接.)

For getting any single random row, you can do like this german blogger does: http://web.archive.org/web/20200211210404/http://www.roberthartung.de/mysql-order-by-rand-a-case-study-of-alternatives/ (I couldn't see a hotlink url. If anyone sees one, feel free to edit the link.)

文本是德文，但 SQL 代码在页面下方并在大白框中，因此不难看到.

The text is in german, but the SQL code is a bit down the page and in big white boxes, so it's not hard to see.

基本上他所做的是制作一个程序来完成获取有效行的工作.这会生成一个介于 0 和 max_id 之间的随机数，尝试获取一行，如果它不存在，则继续操作，直到找到一个.他允许通过将 x 数量的随机行存储在临时表中来获取它们，因此您可能可以重写该过程以更快地仅获取一行.

Basically what he does is make a procedure that does the job of getting a valid row. That generates a random number between 0 and max_id, try fetching a row, and if it doesn't exist, keep going until you hit one that does. He allows for fetching x number of random rows by storing them in a temp table, so you can probably rewrite the procedure to be a bit faster fetching only one row.

这样做的缺点是，如果你删除了很多行，并且有很大的间隙，它很可能会错过很多次，使其无效.

The downside of this is that if you delete A LOT of rows, and there are huge gaps, the chances are big that it will miss tons of times, making it ineffective.

更新:不同的执行时间

SELECT * FROM table ORDER BY RAND() LIMIT 1;/30-40 秒/

SELECT * FROM table ORDER BY RAND() LIMIT 1; /30-40 seconds/

SELECT id FROM table ORDER BY RAND() LIMIT 1;/0.25 秒/

SELECT id FROM table ORDER BY RAND() LIMIT 1; /0.25 seconds/

SELECT id, username FROM table ORDER BY RAND() LIMIT 1;/90 秒/

SELECT id, username FROM table ORDER BY RAND() LIMIT 1; /90 seconds/

我有点期望看到所有三个查询的时间大致相同，因为我总是对单个列进行排序.但由于某种原因，这并没有发生.如果您对此有任何想法，请告诉我.

I was sort of expecting to see approximately the same time for all three queries since I am always sorting on a single column. But for some reason this didn't happen. Please let me know if you any ideas about this.

这可能与索引有关.id 被索引并且可以快速访问，而将 username 添加到结果中，意味着它需要从每一行读取它并将其放入内存表中.使用 * 它还必须将所有内容读入内存，但不需要在数据文件中跳转，这意味着不会浪费时间寻找.

It may have to do with indexing. id is indexed and quick to access, whereas adding username to the result, means it needs to read that from each row and put it in the memory table. With the * it also has to read everything into memory, but it doesn't need to jump around the data file, meaning there's no time lost seeking.

这仅在存在可变长度列 (varchar/text) 时才有所不同，这意味着它必须检查长度，然后跳过该长度，而不是在每行之间跳过设定的长度(或 0).

This makes a difference only if there are variable length columns (varchar/text), which means it has to check the length, then skip that length, as opposed to just skipping a set length (or 0) between each row.

相关文章