Implementing the softmax activation function of a neural network

2021-12-31 00:00:00 math neural-network c++ softmax

I am using a softmax activation function in the last layer of a neural network, but I have problems with a numerically safe implementation of this function.

A naive implementation would be this one:

Vector y = mlp(x); // output of the neural network without softmax activation function
for(int f = 0; f < y.rows(); f++)
  y(f) = exp(y(f));
y /= y.sum();

This does not work very well for more than about 100 hidden nodes, because y will be NaN in many cases (if y(f) > 709, exp(y(f)) returns inf, and the normalization then divides inf by inf). I came up with this version:

Vector y = mlp(x); // output of the neural network without softmax activation function
for(int f = 0; f < y.rows(); f++)
  y(f) = safeExp(y(f), y.rows());
y /= y.sum();

where safeExp is defined as

double safeExp(double x, int div)
{
  // clamp x so that exp(x) never exceeds the div-th root of the largest
  // representable double; the subsequent sum over the outputs stays finite
  static const double maxX = std::log(std::numeric_limits<double>::max());
  const double max = maxX / (double) div;
  if(x > max)
    x = max;
  return std::exp(x);
}

This function limits the input of exp. In most cases this works, but not in all of them, and I did not really manage to find out in which cases it fails. When I have 800 hidden neurons in the previous layer, it does not work at all.
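To see what the clamp does, here is a small standalone check (the choice of ten output units is only for illustration; safeExp is copied from above so the snippet compiles on its own):

#include <cmath>
#include <cstdio>
#include <limits>

// safeExp copied from above so the example is self-contained
double safeExp(double x, int div)
{
  static const double maxX = std::log(std::numeric_limits<double>::max());
  const double max = maxX / (double) div;
  if(x > max)
    x = max;
  return std::exp(x);
}

int main()
{
  // with div = 10 the cap is log(DBL_MAX) / 10, roughly 70.98, so inputs of
  // 80 and 200 are both clamped to the cap and become indistinguishable
  std::printf("safeExp(80, 10)  = %g\n", safeExp(80.0, 10));
  std::printf("safeExp(200, 10) = %g\n", safeExp(200.0, 10));
  return 0;
}

Both calls print the same number, so after normalization the softmax can no longer distinguish these two logits.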

However, even if this worked, I would somehow "distort" the result of the ANN. Can you think of any other way to calculate the correct solution? Are there any C++ libraries or tricks that I can use to calculate the exact output of this ANN?

edit: The solution provided by Itamar Katz is:

Vector y = mlp(x); // output of the neural network without softmax activation function
double ymax = y.maxCoeff(); // maximal component of y (assuming an Eigen-style Vector)
for(int f = 0; f < y.rows(); f++)
  y(f) = exp(y(f) - ymax);
y /= y.sum();

And it really is mathematically the same. In practice, however, some small values become 0 because of floating-point precision. I wonder why nobody ever writes these implementation details down in textbooks.
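For reference, the equivalence holds because the common factor cancels (in LaTeX notation, with m denoting the maximal component of y):

$$\frac{e^{y_f - m}}{\sum_g e^{y_g - m}} = \frac{e^{-m}\, e^{y_f}}{e^{-m} \sum_g e^{y_g}} = \frac{e^{y_f}}{\sum_g e^{y_g}}$$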

Recommended Answer

First go to log scale, i.e. calculate log(y) instead of y. The log of the numerator is trivial. To calculate the log of the denominator, you can use the following 'trick': http://lingpipe-blog.com/2009/06/25/log-sum-of-exponentials/
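A minimal sketch of that trick in C++, assuming an Eigen-style VectorXd as a stand-in for the Vector type used above (logSumExp and logSoftmax are illustrative names, not library functions):

#include <cmath>
#include <Eigen/Dense>

using Vector = Eigen::VectorXd; // stand-in for the Vector type used above

// log(sum_f exp(y(f))) computed stably: shifting by the maximum makes every
// exponent <= 0, so exp() cannot overflow
double logSumExp(const Vector& y)
{
  const double ymax = y.maxCoeff();
  double sum = 0.0;
  for(int f = 0; f < y.rows(); f++)
    sum += std::exp(y(f) - ymax);
  return ymax + std::log(sum);
}

// log-softmax: the log of the numerator is simply y(f), the log of the
// denominator is logSumExp(y)
Vector logSoftmax(const Vector& y)
{
  const double logZ = logSumExp(y);
  Vector logp(y.rows());
  for(int f = 0; f < y.rows(); f++)
    logp(f) = y(f) - logZ;
  return logp;
}

If the probabilities themselves are needed, exponentiate the result of logSoftmax only at the very end; losses such as cross-entropy can be evaluated directly from the log probabilities.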
