Generate N random numbers within a range with a constant sum
I want to generate N random numbers drawn from a specific distribution (e.g. uniform random) within [a,b] which sum to a constant C. I have tried a couple of solutions I could think of myself, and some proposed on similar threads, but most of them either work only for a limited form of the problem or I can't prove that the outcome still follows the desired distribution.
What I have tried: Generate N random numbers, divide all of them by their sum and multiply by the desired constant. This seems to work, but the result does not follow the rule that the numbers should be within [a:b].
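A minimal sketch of this first attempt, assuming double-valued inputs with a nonzero sum (the helper name rescale_to_sum is hypothetical and does not come from the original post):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: rescale xs so that its elements sum to target.
// Assumes the current sum of xs is nonzero.
std::vector<double> rescale_to_sum(std::vector<double> xs, double target) {
    const double s = std::accumulate(xs.begin(), xs.end(), 0.0);
    assert(s != 0.0);
    for (double &x : xs) x *= target / s;
    return xs;
}
```

For instance, {1, 2, 7} lies within [1:10], but rescaling it to sum to 50 gives {5, 10, 35}, and 35 falls outside the range.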
Generate N-1 random numbers, add 0 and the desired constant C, and sort them. Then calculate the difference between each two consecutive numbers; the differences are the result. This again sums to C but has the same problem as the previous method (the range can be bigger than [a:b]).
I also tried to generate the random numbers one at a time, always keeping track of the min and max so that the desired sum and range are preserved, and came up with this code:
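#include <functional>
#include <vector>

// Draws len integers one at a time, narrowing the allowed interval at each
// step so that the remaining numbers can still reach the requested sum.
bool generate(std::function<int(int,int)> randomGenerator, int min, int max, int len, int sum, std::vector<int> &output) {
    // Not possible to produce such a sequence
    if (min * len > sum)
        return false;
    if (max * len < sum)
        return false;

    int curSum = 0;
    int left = sum - curSum;      // amount still to be distributed
    int leftIndexes = len - 1;    // numbers remaining after the current one
    int curMax = left - leftIndexes * min;
    int curMin = left - leftIndexes * max;

    for (int i = 0; i < len; i++) {
        // Clamp the feasible interval to [min, max] before drawing
        int num = randomGenerator((curMin < min) ? min : curMin, (curMax > max) ? max : curMax);
        output.push_back(num);
        curSum += num;
        left = sum - curSum;
        leftIndexes--;
        curMax = left - leftIndexes * min;
        curMin = left - leftIndexes * max;
    }

    return true;
}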
This seems to work, but the results are sometimes very skewed and I don't think they follow the original distribution (e.g. uniform). For example:
//10 numbers within [1:10] which sum to 50:
generate(uniform,1,10,10,50,output);
//result:
2,7,2,5,2,10,5,8,4,5 => sum=50
//This looks reasonable for uniform, but let's change to
//10 numbers within [1:25] which sum to 50:
generate(uniform,1,25,10,50,output);
//result:
24,12,6,2,1,1,1,1,1,1 => sum= 50
Notice how many ones there are in the output. This might sound reasonable because the range is larger, but the numbers really don't look uniformly distributed. I am not even sure it is possible to achieve what I want; maybe the constraints make the problem unsolvable.
Recommended answer
If you want the sample to follow a uniform distribution, the problem reduces to generating N random numbers with sum = 1. This, in turn, is a special case of the Dirichlet distribution, but it can also be computed more easily using the Exponential distribution. Here is how:
- Take a uniform sample v1 … vN with all vi between 0 and 1.
- For all i, 1<=i<=N, define ui := -ln vi (notice that ui > 0).
- Normalize the ui as pi := ui/s where s is the sum u1+...+uN.
The p1..pN are uniformly distributed (in the simplex of dim N-1) and their sum is 1.
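The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the function name sample_simplex is hypothetical, std::mt19937 is an arbitrary choice of generator, and the code takes the logarithm of 1 - v rather than v so that the argument is never 0.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: returns n non-negative numbers that sum to 1,
// obtained by normalizing u_i = -ln(v_i) with v_i uniform.
std::vector<double> sample_simplex(std::size_t n, std::mt19937 &rng) {
    assert(n > 0);
    std::uniform_real_distribution<double> unif(0.0, 1.0);  // values in [0, 1)
    std::vector<double> p(n);
    double s = 0.0;
    for (double &u : p) {
        // Assumption: 1 - v lies in (0, 1], so the logarithm is always finite.
        u = -std::log(1.0 - unif(rng));
        s += u;
    }
    for (double &u : p) u /= s;  // normalize: p_i = u_i / s
    return p;
}
```

Calling sample_simplex(5, rng) with a seeded std::mt19937 rng returns five values in [0, 1] whose sum is 1 up to floating-point rounding.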
You can now multiply these pi by the constant C you want and translate them by adding some other constant A, like this:
qi := A + pi*C.
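As a short worked check (not part of the original answer), the sum of the transformed values follows directly from the fact that the pi sum to 1:

```latex
\sum_{i=1}^{N} q_i \;=\; \sum_{i=1}^{N} \left( A + C\,p_i \right) \;=\; N A + C \sum_{i=1}^{N} p_i \;=\; N A + C .
```

So with A = 0 the qi sum to exactly C, and any nonzero shift A changes the total by N*A.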
Edit 3
In order to address some issues raised in the comments, let me add the following:
- To ensure that the final random sequence falls in the interval [a,b], choose the constants A and C above as A := a and C := b-a, i.e., take qi = a + pi*(b-a). Since pi is in the range (0,1), all qi will be in the range [a,b]. Note that with this choice the qi sum to N*a + (b-a).
- One cannot take the (negative) logarithm -ln(vi) if vi happens to be 0, because ln() is not defined at 0. The probability of such an event is extremely low. However, to ensure that no error is signaled, the generation of v1 ... vN in item 1 above must treat any occurrence of 0 in a special way: consider -ln(0) as +infinity (remember: ln(x) -> -infinity when x->0). Thus the sum s = +infinity, which means that pi = 1 and all other pj = 0. Without this convention the sequence (0...1...0) would never be generated (many thanks to @Severin Pappadeux for this interesting remark).
- As explained in the 4th comment attached to the question by @Neil Slater, it is logically impossible to fulfill all the requirements of the original framing. Therefore any solution must relax the constraints to a proper subset of the original ones. Other comments by @Behrooz seem to confirm that this would suffice in this case.
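The convention for vi = 0 described above could be sketched like this. The function name normalize_neg_log is hypothetical, and the choice to give all the mass to the first zero when several vi are 0 is an assumption of this sketch, not something the answer specifies.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: maps v_1..v_N in [0, 1] to p_i = u_i / s with
// u_i = -ln(v_i), treating -ln(0) as +infinity.
std::vector<double> normalize_neg_log(const std::vector<double> &v) {
    assert(!v.empty());
    std::vector<double> p(v.size(), 0.0);
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] == 0.0) {
            // u_i = +infinity, so p_i = 1 and every other p_j = 0.
            // Assumption: with several zeros, the first one wins.
            p[i] = 1.0;
            return p;
        }
    }
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        p[i] = -std::log(v[i]);
        s += p[i];
    }
    assert(s > 0.0);  // s == 0 only if every v_i is exactly 1
    for (double &x : p) x /= s;
    return p;
}
```

With this helper, the input {0.5, 0.0, 0.25} yields {0, 1, 0}, and the final values can then be obtained as qi = a + pi*(b-a).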
Edit 2
One more issue has been raised in the comments:
Why is rescaling a uniform sample not enough?
In other words, why should I bother taking negative logarithms?
The reason is that if we just rescale then the resulting sample won't distribute uniformly across the segment (0,1) (or [a,b] for the final sample.)
To visualize this let's think 2D, i.e., let's consider the case N=2. A uniform sample (v1,v2) corresponds to a random point in the square with origin (0,0) and corner (1,1). Now, when we normalize such a point dividing it by the sum s=v1+v2 what we are doing is projecting the point onto the diagonal as shown in the picture (keep in mind that the diagonal is the line x + y = 1):
But given that the green lines, which are closer to the principal diagonal from (0,0) to (1,1), are longer than the orange ones, which are closer to the x and y axes, the projections tend to accumulate around the center of the projection line (in blue), where the scaled sample lives. This shows that simple scaling won't produce a uniform sample on the depicted diagonal. On the other hand, it can be proven mathematically that the negative logarithms do produce the desired uniformity. So, instead of copy-pasting a mathematical proof, I would invite everyone to implement both algorithms and check that the resulting plots behave as this answer describes.
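One way to run that comparison for N=2 without plotting is to count how often the first normalized coordinate lands in the central band [0.4, 0.6]. The sketch below is an illustration only: the function name central_fraction, the band, and the use of std::mt19937 are all assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Hypothetical helper: fraction of trials in which p1 = x / (x + y) falls in
// [0.4, 0.6]. With use_neg_log the pair is (-ln v1, -ln v2); otherwise the
// uniform values themselves are normalized.
double central_fraction(bool use_neg_log, int trials, unsigned seed) {
    assert(trials > 0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    int hits = 0;
    for (int t = 0; t < trials; ++t) {
        double x = 1.0 - unif(rng);  // in (0, 1], so the logarithm is finite
        double y = 1.0 - unif(rng);
        if (use_neg_log) {
            x = -std::log(x);
            y = -std::log(y);
        }
        const double p1 = x / (x + y);
        if (p1 >= 0.4 && p1 <= 0.6) ++hits;
    }
    return static_cast<double>(hits) / trials;
}
```

A uniformly distributed p1 puts probability 0.2 on that band, whereas for plain rescaling the probability works out to 1/3, so the two estimates should differ visibly once the number of trials is large.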
(Note: here is a blog post on this interesting subject with an application to the Oil & Gas industry)