Generate N random numbers within a range with a constant sum

2022-01-09 00:00:00 algorithm random sum range c++

I want to generate N random numbers drawn from a specific distribution (e.g. uniform) on [a,b] that sum to a constant C. I have tried a couple of solutions I could think of myself, and some proposed on similar threads, but most of them either work only for a limited form of the problem or I can't prove that the outcome still follows the desired distribution.

What I have tried: Generate N random numbers, divide all of them by their sum, and multiply by the desired constant. This seems to work, but the result does not respect the rule that the numbers should be within [a,b].
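A minimal sketch of this first attempt, assuming real-valued draws, a > 0, and a hypothetical helper name `scaled_to_sum` (none of these are fixed by the question):

```cpp
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: draw n values uniformly from [a, b), then rescale
// them so that they sum to c. Assumes a > 0 so the raw sum is never zero.
std::vector<double> scaled_to_sum(std::mt19937& rng, int n,
                                  double a, double b, double c) {
    std::uniform_real_distribution<double> dist(a, b);
    std::vector<double> x(n);
    for (double& v : x) v = dist(rng);
    const double s = std::accumulate(x.begin(), x.end(), 0.0);
    // After this step the values sum to c, but nothing keeps them in [a, b].
    for (double& v : x) v *= c / s;
    return x;
}
```

Each draw x_i ends up at x_i*C/s, where s is the raw sum, so a value leaves [a,b] whenever the factor C/s pushes it past either bound.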

Generate N-1 random numbers, add 0 and the desired constant C, and sort them. Then calculate the difference between each two consecutive numbers; the differences are the result. This again sums to C but has the same problem as the previous method (the range can be bigger than [a,b]).
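This second attempt might look roughly like the sketch below. It assumes the N-1 cut points are drawn uniformly from [0, C), which the question does not state explicitly, and the name `spacings_to_sum` is made up for illustration:

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Hypothetical helper: split the interval [0, c] at n-1 random cut points
// and return the n gaps between consecutive cuts. Assumes n >= 1 and c > 0.
std::vector<double> spacings_to_sum(std::mt19937& rng, int n, double c) {
    std::uniform_real_distribution<double> dist(0.0, c);
    std::vector<double> cuts;
    cuts.reserve(n + 1);
    cuts.push_back(0.0);
    for (int i = 0; i < n - 1; ++i) cuts.push_back(dist(rng));
    cuts.push_back(c);
    std::sort(cuts.begin(), cuts.end());

    // The gaps telescope to c - 0 = c, but a single gap can be anywhere
    // between 0 and c, regardless of the desired range [a, b].
    std::vector<double> parts(n);
    for (int i = 0; i < n; ++i) parts[i] = cuts[i + 1] - cuts[i];
    return parts;
}
```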

I also tried generating the numbers one at a time while tracking the minimum and maximum that each draw may take so that the desired sum and range remain reachable, and came up with this code:

```cpp
#include <functional>
#include <vector>

bool generate(std::function<int(int, int)> randomGenerator, int min, int max,
              int len, int sum, std::vector<int>& output) {
    // Not possible to produce such a sequence
    if (min * len > sum) return false;
    if (max * len < sum) return false;

    int curSum = 0;
    int left = sum - curSum;
    int leftIndexes = len - 1;
    // Bounds on the next number so the remaining ones can still reach the sum
    int curMax = left - leftIndexes * min;
    int curMin = left - leftIndexes * max;

    for (int i = 0; i < len; i++) {
        int num = randomGenerator((curMin < min) ? min : curMin,
                                  (curMax > max) ? max : curMax);
        output.push_back(num);
        curSum += num;
        left = sum - curSum;
        leftIndexes--;
        curMax = left - leftIndexes * min;
        curMin = left - leftIndexes * max;
    }
    return true;
}
```

This seems to work, but the results are sometimes very skewed and I don't think they follow the original distribution (e.g. uniform). For example:

```cpp
// 10 numbers within [1:10] which sum to 50:
generate(uniform, 1, 10, 10, 50, output);
// result: 2,7,2,5,2,10,5,8,4,5 => sum=50

// This looks reasonable for uniform, but let's change to
// 10 numbers within [1:25] which sum to 50:
generate(uniform, 1, 25, 10, 50, output);
// result: 24,12,6,2,1,1,1,1,1,1 => sum=50
```

Notice how many ones there are in the output. This might sound reasonable because the range is larger, but the numbers really don't look uniformly distributed. I am not even sure that what I want is achievable; maybe the constraints make the problem unsolvable.

Recommended answer

If you want the sample to follow a uniform distribution, the problem reduces to generating N random numbers with sum = 1. This, in turn, is a special case of the Dirichlet distribution, but it can also be computed more easily using the exponential distribution. Here is how:

1. Take a uniform sample v₁ … v_N with all v_i between 0 and 1.
2. For all i, 1<=i<=N, define u_i := -ln v_i (notice that u_i > 0).
3. Normalize the u_i as p_i := u_i/s, where s is the sum u₁+...+u_N.
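The three steps above can be sketched in C++ as follows. The function name `uniform_simplex` is hypothetical, and redrawing when a sample is exactly 0 is an assumption of this sketch to keep the logarithm finite; it is not part of the steps as stated.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Hypothetical helper: returns n values p_1..p_n that sum to 1, obtained by
// normalizing negative logarithms of uniform samples.
std::vector<double> uniform_simplex(std::mt19937& rng, int n) {
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> p(n);
    double s = 0.0;
    for (double& u : p) {
        double v = dist(rng);                 // step 1: v in [0, 1)
        while (v == 0.0) v = dist(rng);       // assumption: redraw to avoid ln(0)
        u = -std::log(v);                     // step 2: u > 0
        s += u;
    }
    for (double& u : p) u /= s;               // step 3: normalize by the sum
    return p;
}
```

A call such as `uniform_simplex(rng, 10)` with a seeded `std::mt19937 rng` would return ten non-negative values summing to 1 up to floating-point rounding.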

The p₁..p_N are uniformly distributed (on the simplex of dimension N-1) and their sum is 1.

You can now multiply these p_i by the constant C you want and translate them by adding some other constant A, like this:

q_i := A + p_i*C.


Edit 3

In order to address some issues raised in the comments, let me add the following:

1. To ensure that the final random sequence falls in the interval [a,b], choose the constants A and C above as A := a and C := b-a, i.e., take q_i = a + p_i*(b-a). Since each p_i lies between 0 and 1, all q_i will be in the range [a,b]. Note that the q_i then sum to N*a + (b-a), a constant determined by a, b, and N.
2. One cannot take the (negative) logarithm -ln(v_i) if v_i happens to be 0, because ln() is not defined at 0. The probability of such an event is extremely low. However, to ensure that no error is signaled, the generation of v₁ ... v_N in step 1 of the procedure above must treat any occurrence of 0 in a special way: consider -ln(0) as +infinity (remember: ln(x) -> -infinity when x->0). Thus the sum s = +infinity, which means that p_i = 1 and all other p_j = 0. Without this convention the sequence (0...1...0) would never be generated (many thanks to @Severin Pappadeux for this interesting remark).
3. As explained in the 4th comment attached to the question by @Neil Slater, it is logically impossible to fulfill all the requirements of the original framing. Therefore any solution must relax the constraints to a proper subset of the original ones. Other comments by @Behrooz seem to confirm that this would suffice in this case.
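The convention described above for v_i = 0 could be coded along these lines. The helper name `normalize_neg_logs` is hypothetical, and giving all the mass to the first zero when several v_i are 0 is an assumption of this sketch, since the answer does not cover that case.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: maps samples v_i in [0, 1) to p_i = -ln(v_i) / s.
// A sample equal to 0 is treated as -ln(0) = +infinity, so that index gets
// p_i = 1 and every other index gets 0.
std::vector<double> normalize_neg_logs(const std::vector<double>& v) {
    std::vector<double> p(v.size(), 0.0);
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] == 0.0) {   // assumption: the first zero found takes all the mass
            p[i] = 1.0;
            return p;
        }
    }
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        p[i] = -std::log(v[i]);
        s += p[i];
    }
    for (double& x : p) x /= s;   // assumes at least one v_i < 1, so s > 0
    return p;
}
```

With this convention, the input (0.5, 0, 0.25) maps to (0, 1, 0) instead of raising a floating-point error or producing NaN values.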

Edit 2

One more issue has been raised in the comments:

Why is rescaling a uniform sample not enough?

In other words, why should I bother taking negative logarithms?

The reason is that if we just rescale, the resulting sample won't be distributed uniformly across the segment (0,1) (or [a,b] for the final sample).

To visualize this, let's think in 2D, i.e., consider the case N=2. A uniform sample (v₁,v₂) corresponds to a random point in the square with origin (0,0) and opposite corner (1,1). Now, when we normalize such a point by dividing it by the sum s=v₁+v₂, we are projecting the point along the ray through the origin onto the diagonal as shown in the picture (keep in mind that the diagonal is the line x + y = 1):

But the green lines, which are closer to the principal diagonal from (0,0) to (1,1), are longer than the orange ones, which are closer to the x and y axes, so the projections tend to accumulate around the center of the projection line (in blue), where the scaled sample lives. This shows that a simple scaling won't produce a uniform sample on the depicted diagonal. On the other hand, it can be proven mathematically that the negative logarithms do produce the desired uniformity. So, instead of copy-pasting a mathematical proof, I would invite everyone to implement both algorithms and check that the resulting plots behave as this answer describes.
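One way to run that check without plotting, for N=2, is to count how often the first normalized coordinate p₁ lands in the middle half [0.25, 0.75] of the segment. The function `middle_fraction` below is a hypothetical sketch, and redrawing exact zeros is an assumption made to keep both variants well defined. A uniform p₁ should land there about half of the time, whereas for plain rescaling the probability works out to 2/3.

```cpp
#include <cmath>
#include <random>

// Hypothetical experiment for N = 2: fraction of trials in which
// p1 = u1 / (u1 + u2) falls in [0.25, 0.75].
// use_neg_log = false: u_i = v_i (plain rescaling of uniform samples).
// use_neg_log = true:  u_i = -ln(v_i) (the method from this answer).
double middle_fraction(std::mt19937& rng, int trials, bool use_neg_log) {
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    int hits = 0;
    for (int t = 0; t < trials; ++t) {
        double v1 = dist(rng);
        double v2 = dist(rng);
        while (v1 == 0.0) v1 = dist(rng);   // assumption: redraw exact zeros
        while (v2 == 0.0) v2 = dist(rng);
        const double u1 = use_neg_log ? -std::log(v1) : v1;
        const double u2 = use_neg_log ? -std::log(v2) : v2;
        const double p1 = u1 / (u1 + u2);
        if (p1 >= 0.25 && p1 <= 0.75) ++hits;
    }
    return static_cast<double>(hits) / trials;
}
```

Running both variants with a few hundred thousand trials should show the rescaled sample clustering near the center, in line with the geometric argument above.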

(Note: here is a blog post on this interesting subject with an application to the Oil & Gas industry)
