SQL中的有偏随机?

2022-01-07 00:00:00 statistics random sql mysql

我的数据库中有一些条目,在我的例子中是带有评级和受欢迎程度以及其他因素的视频.在所有这些因素中,我计算了一个似然因子,或者更多的说一个增强因子.

I have some entries in my database, in my case Videos with a rating and popularity and other factors. Of all these factors I calculate a likelihood factor or more to say a boost factor.

所以我基本上有字段 ID 和 BOOST.这个提升的计算方式是一个整数,它表示相比之下这个条目应该被命中的频率的百分比.

So I essentially have the fields ID and BOOST.The boost is calculated in a way that it turns out as an integer that represents the percentage of how often this entry should be hit in in comparison.

ID  Boost
1   1
2   2
3   7

因此,如果我无限期地运行我的随机函数,我最终应该会在 ID 1 上得到 X 次命中,在 ID 2 上的命中次数是原来的两倍,在 ID 3 上的次数是原来的 7 倍.

So if I run my random function indefinitely I should end up with X hits on ID 1, twice as much on ID 2 and 7 times as much on ID 3.

所以每次命中都应该是随机的,但概率为(boost/sum of boosts).所以在这个例子中 ID 3 的概率应该是 0.7(因为总和是 10.为了简单起见,我选择了这些值).

So every hit should be random but with a probability of (boost / sum of boosts). So the probability for ID 3 in this example should be 0.7 (because the sum is 10. I choose those values for simplicity).

我想过类似下面的查询:

I thought about something like the following query:

SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();

很遗憾,在考虑表中的以下条目后,这不起作用:

Unfortunately that doesn't work, after considering the following entries in the table:

ID  Boost
1   1
2   2

它将有 50/50 的几率随机选择第二个或两个元素.

It will, with a 50/50 chance, have only the 2nd or both elements to choose from randomly.

所以 0.5 命中到第二个元素并且 0.5 命中是随机选择的(第二个和第一个)元素,因此每个元素为 0.25.所以我们最终得到 0.25/0.75 的比率,但它应该是 0.33/0.66

So 0.5 hit goes to the second element And 0.5 hit goes to the (second and first) element which is chosen from randomly so so 0.25 each. So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.66

我需要一些修改或新方法才能以良好的性能做到这一点.

I need some modification or new a method to do this with good performance.

我还考虑过累积存储 boost 字段,所以我只是从 (0-sum()) 进行范围查询,但是如果出现以下情况,我将不得不重新索引一个项目之后的所有内容我改变它或开发一些交换算法或其他东西......但这真的很不优雅.

I also thought about storing the boost field cumulatively so I just do a range query from (0-sum()), but then I would have to re-index everything coming after one item if I change it or develop some swapping algorithm or something... but that's really not elegant and stuff.

插入/更新和选择都应该很快!

Both inserting/updating and selecting should be fast!

你有解决这个问题的方法吗?

Do you have any solutions to this problem?

要考虑的最佳用例可能是广告投放.请选择一个具有给定概率的随机广告"......但是我需要它用于另一个目的,只是为了给你最后一张图片它应该做什么.

The best use case to think of is probably advertisement delivery. "Please choose a random ad with given probability"... however i need it for another purpose but just to give you a last picture what it should do.

感谢 kens 的回答,我想到了以下方法:

Thanks to kens answer i thought about the following approach:

  1. 从 0-sum(distinct boost) 计算一个随机值

  1. calculate a random value from 0-sum(distinct boost)

SET @randval = (select ceil(rand() * sum(DISTINCT boost)) from test);

SET @randval = (select ceil(rand() * sum(DISTINCT boost)) from test);

从所有不同的增强因子中选择加起来超过随机值的增强因子

select the boost factor from all distinct boost factors which added up surpasses the random value

然后在我们的第一个示例中,1 的概率为 0.1,2 的概率为 0.2,而 7 的概率为 0.7.

then we have in our 1st example 1 with a 0.1, 2 with a 0.2 and 7 with a 0.7 probability.

  1. 现在从所有具有此提升因子的条目中随机选择一个条目

问题:因为具有一次提升的条目数总是不同的.例如,如果只有 1 个提升的条目,我在 10 个调用中的 1 个中得到它,但是如果有 100 万个和 7 个,则几乎每个都不会返回......所以这行不通:(试图改进它.

PROBLEM: because the count of entries having one boost is always different. For example if there is only 1-boosted entry i get it in 1 of 10 calls, but if there are 1 million with 7, each of them is hardly ever returned... so this doesnt work out :( trying to refine it.

我必须以某种方式包含具有此提升因子的条目数......但我不知何故坚持......

I have to somehow include the count of entries with this boost factor ... but i am somehow stuck on that...

推荐答案

你需要每行生成一个随机数并加权.

You need to generate a random number per row and weight it.

在这种情况下,RAND(CHECKSUM(NEWID())) 绕过了 RAND 的每个查询"评估.然后简单地乘以 boost 和 ORDER BY 结果 DESC.SUM..OVER 给你总提升

In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost

DECLARE @sample TABLE (id int, boost int)

INSERT @sample VALUES (1, 1), (2, 2), (3, 7)

SELECT
    RAND(CHECKSUM(NEWID())) * boost  AS weighted,
    SUM(boost) OVER () AS boostcount,
    id
FROM
    @sample
GROUP BY
    id, boost
ORDER BY
    weighted DESC

如果您有截然不同的提升值(我认为您提到过),我还会考虑使用 LOG(以 e 为底)来平滑分布.

If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.

最后,ORDER BY NEWID() 是一种不考虑提升的随机性.对 RAND 进行播种是有用的,但对它本身无效.

Finally, ORDER BY NEWID() is a randomness that would take no account of boost. It's useful to seed RAND but not by itself.

这个示例是放在 SQL Server 2008 上的,顺便说一句

This sample was put together on SQL Server 2008, BTW

相关文章