来自 Sql 数据库的简单随机样本
如何在 SQL 中获取有效的简单随机样本?有问题的数据库正在运行 MySQL;我的表至少有 200,000 行,我想要一个大约 10,000 的简单随机样本.
How do I take an efficient simple random sample in SQL? The database in question is running MySQL; my table is at least 200,000 rows, and I want a simple random sample of about 10,000.
显而易见"答案是:
SELECT * FROM table ORDER BY RAND() LIMIT 10000
对于大表,这太慢了:它为每一行调用 RAND()
(已经把它放在 O(n) 上),并对它们进行排序,使其成为 O(n lg n)最好.有没有比 O(n) 更快的方法?
For large tables, that's too slow: it calls RAND()
for every row (which already puts it at O(n)), and sorts them, making it O(n lg n) at best. Is there a way to do this faster than O(n)?
注意:正如 Andrew Mao 在评论中指出的,如果你在 SQL Server 上使用这种方法,你应该使用 T-SQL 函数 NEWID()
,因为 RAND() 可能会为所有行返回相同的值.
Note: As Andrew Mao points out in the comments, If you're using this approach on SQL Server, you should use the T-SQL function NEWID()
, because RAND() may return the same value for all rows.
5 年后
我用更大的表再次遇到了这个问题,最终使用了@ignorant 的解决方案版本,有两个调整:
I ran into this problem again with a bigger table, and ended up using a version of @ignorant's solution, with two tweaks:
- 将行采样到我想要的样本大小的 2-5 倍,以低廉的价格
ORDER BY RAND()
- 在每次插入/更新时将
RAND()
的结果保存到索引列.(如果您的数据集不是很需要更新,您可能需要找到另一种方法来保持此列的新鲜度.)
- Sample the rows to 2-5x my desired sample size, to cheaply
ORDER BY RAND()
- Save the result of
RAND()
to an indexed column on every insert/update. (If your data set isn't very update-heavy, you may need to find another way to keep this column fresh.)
为了获取一个表的 1000 项样本,我对行进行计数,并将结果采样到平均 10,000 行,其中包含frozen_rand 列:
To take a 1000-item sample of a table, I count the rows and sample the result down to, on average, 10,000 rows with the the frozen_rand column:
SELECT COUNT(*) FROM table; -- Use this to determine rand_low and rand_high
SELECT *
FROM table
WHERE frozen_rand BETWEEN %(rand_low)s AND %(rand_high)s
ORDER BY RAND() LIMIT 1000
(我的实际实现涉及更多工作以确保我不会欠采样,并手动将 rand_high 包裹起来,但基本思想是随机将 N 减少到几千个.")
(My actual implementation involves more work to make sure I don't undersample, and to manually wrap rand_high around, but the basic idea is "randomly cut your N down to a few thousand.")
虽然这做出了一些牺牲,但它允许我使用索引扫描对数据库进行采样,直到它小到足以再次ORDER BY RAND()
.
While this makes some sacrifices, it allows me to sample the database down using an index scan, until it's small enough to ORDER BY RAND()
again.
推荐答案
这里有一个关于此类问题的非常有趣的讨论:http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/
There's a very interesting discussion of this type of issue here: http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/
我认为绝对没有关于您的 O(n lg n) 解决方案是最好的表格的假设.尽管实际上使用一个好的优化器或稍微不同的技术,您列出的查询可能会好一点,O(m*n) 其中 m 是所需的随机行数,因为它不必对整个大数组进行排序,它可以只搜索最小的 m 次.但是对于您发布的那种数字,无论如何,m 都比 lg n 大.
I think with absolutely no assumptions about the table that your O(n lg n) solution is the best. Though actually with a good optimizer or a slightly different technique the query you list may be a bit better, O(m*n) where m is the number of random rows desired, as it wouldn't necesssarily have to sort the whole large array, it could just search for the smallest m times. But for the sort of numbers you posted, m is bigger than lg n anyway.
我们可能会尝试的三个假设:
Three asumptions we might try out:
表中有一个唯一的索引主键
there is a unique, indexed, primary key in the table
你要选择的随机行数(m)远小于表中的行数(n)
the number of random rows you want to select (m) is much smaller than the number of rows in the table (n)
唯一主键是一个整数,范围从 1 到 n,没有间隙
the unique primary key is an integer that ranges from 1 to n with no gaps
只有假设 1 和 2,我认为这可以在 O(n) 中完成,尽管您需要将整个索引写入表以匹配假设 3,因此它不一定是快速的 O(n).如果我们可以额外假设表格的其他优点,我们可以在 O(m log m) 中完成任务.假设 3 将是一个容易使用的不错的附加属性.有了一个很好的随机数生成器,保证在连续生成 m 个数字时没有重复,O(m) 解决方案将成为可能.
With only assumptions 1 and 2 I think this can be done in O(n), though you'll need to write a whole index to the table to match assumption 3, so it's not necesarily a fast O(n). If we can ADDITIONALLY assume something else nice about the table, we can do the task in O(m log m). Assumption 3 would be an easy nice additional property to work with. With a nice random number generator that guaranteed no duplicates when generating m numbers in a row, an O(m) solution would be possible.
给定三个假设,基本思想是在 1 到 n 之间生成 m 个唯一的随机数,然后从表中选择具有这些键的行.我现在没有 mysql 或任何东西在我面前,所以在稍微伪代码中,这看起来像:
Given the three assumptions, the basic idea is to generate m unique random numbers between 1 and n, and then select the rows with those keys from the table. I don't have mysql or anything in front of me right now, so in slightly pseudocode this would look something like:
create table RandomKeys (RandomKey int)
create table RandomKeysAttempt (RandomKey int)
-- generate m random keys between 1 and n
for i = 1 to m
insert RandomKeysAttempt select rand()*n + 1
-- eliminate duplicates
insert RandomKeys select distinct RandomKey from RandomKeysAttempt
-- as long as we don't have enough, keep generating new keys,
-- with luck (and m much less than n), this won't be necessary
while count(RandomKeys) < m
NextAttempt = rand()*n + 1
if not exists (select * from RandomKeys where RandomKey = NextAttempt)
insert RandomKeys select NextAttempt
-- get our random rows
select *
from RandomKeys r
join table t ON r.RandomKey = t.UniqueKey
如果你真的很关心效率,你可能会考虑用某种程序语言生成随机密钥并将结果插入数据库,因为除了 SQL 之外的几乎任何东西都可能在循环和随机方面做得更好需要生成号码.
If you were really concerned about efficiency, you might consider doing the random key generation in some sort of procedural language and inserting the results in the database, as almost anything other than SQL would probably be better at the sort of looping and random number generation required.
相关文章