Why is it faster to perform float by float matrix multiplication compared to int by int?

2021-12-19 00:00:00 numpy matrix c++ eigen avx

I have two int matrices A and B, each with more than 1000 rows and 10K columns, and I often need to convert them to float matrices to gain a speedup (4x or more).

I'm wondering why this is the case. I realize that a lot of optimization and vectorization, such as AVX, goes into float matrix multiplication. Yet there are vector instructions for integers as well, such as AVX2 (if I'm not mistaken). And can't one make use of SSE and AVX for integers?

Why isn't there a heuristic underneath matrix algebra libraries such as NumPy or Eigen that catches this and performs integer matrix multiplication as fast as float?
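To make the question concrete, here is one hypothetical form such a heuristic could take, written in plain C++ without any library. It is only a sketch under stated assumptions: `Matrix`, `multiply`, `float_path_is_exact` and `multiply_auto` are names invented for this illustration, not part of NumPy or Eigen, and the naive triple loop stands in for whatever optimized kernel a real library would call. The check relies on the fact that a 32-bit float represents every integer of magnitude up to 2^24 exactly, so the float path is taken only when no product or partial sum can exceed that bound.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical row-major dense matrix stored in a flat vector.
template <typename T>
struct Matrix {
    std::size_t rows, cols;
    std::vector<T> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, T{}) {}
    T& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    const T& operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Naive product C = A * B; a stand-in for an optimized kernel.
template <typename T>
Matrix<T> multiply(const Matrix<T>& A, const Matrix<T>& B) {
    assert(A.cols == B.rows);
    Matrix<T> C(A.rows, B.cols);
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t k = 0; k < A.cols; ++k) {
            const T a = A(i, k);
            for (std::size_t j = 0; j < B.cols; ++j) C(i, j) += a * B(k, j);
        }
    return C;
}

// Element-wise conversion between element types.
template <typename To, typename From>
Matrix<To> convert(const Matrix<From>& M) {
    Matrix<To> R(M.rows, M.cols);
    for (std::size_t i = 0; i < M.data.size(); ++i) R.data[i] = static_cast<To>(M.data[i]);
    return R;
}

// Largest absolute value in an int matrix (0 for an empty matrix).
long long max_abs(const Matrix<int>& M) {
    long long m = 0;
    for (int v : M.data) {
        const long long a = std::llabs(static_cast<long long>(v));
        if (a > m) m = a;
    }
    return m;
}

// True if inner_dim * max|A| * max|B| <= 2^24, so that every product and
// every partial sum is an integer that float represents exactly.
bool float_path_is_exact(const Matrix<int>& A, const Matrix<int>& B) {
    const long long bound = 1LL << 24;
    const long long ma = max_abs(A), mb = max_abs(B);
    const long long k = static_cast<long long>(A.cols);
    if (ma == 0 || mb == 0 || k == 0) return true;  // result is all zeros
    if (ma > bound / mb) return false;              // ma * mb alone exceeds 2^24
    return ma * mb <= bound / k;
}

// Assumed heuristic: use the float kernel when it is provably exact,
// otherwise fall back to the integer kernel.
Matrix<int> multiply_auto(const Matrix<int>& A, const Matrix<int>& B) {
    if (!float_path_is_exact(A, B)) return multiply(A, B);
    return convert<int>(multiply(convert<float>(A), convert<float>(B)));
}
```

Calling `multiply_auto(A, B)` on two `Matrix<int>` values returns the same result as `multiply(A, B)` whenever the integer product itself does not overflow. Note what the sketch costs: an extra pass over both inputs to find the maximum magnitudes, plus two conversions in and one out. With an inner dimension of 10K the bound is also strict, since the product of the two maximum magnitudes must stay below roughly 1677, which hints at why a general-purpose library may not want to apply such a switch silently.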

About the accepted answer: While @sascha's answer is very informative and relevant, @chatz's answer gives the actual reason why int by int multiplication is slow, irrespective of whether BLAS integer matrix operations exist.

Recommended answer

If you compile these two simple functions, which essentially just calculate a product (using the Eigen library),

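    #include <Eigen/Core>

    // Integer product: returns one element so the multiplication is not optimized away.
    int mult_int(const Eigen::MatrixXi& A, Eigen::MatrixXi& B)
    {
        Eigen::MatrixXi C = A * B;
        return C(0, 0);
    }

    // Same product with single-precision floats.
    int mult_float(const Eigen::MatrixXf& A, Eigen::MatrixXf& B)
    {
        Eigen::MatrixXf C = A * B;
        return C(0, 0);
    }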

using the flags -mavx2 -S -O3, you will see very similar assembly code for the integer and the float versions. The main difference, however, is that vpmulld has 2-3 times the latency and just 1/2 or 1/4 the throughput of vmulps (on recent Intel architectures).

Reference: Intel Intrinsics Guide. "Throughput" there means the reciprocal throughput, i.e., how many clock cycles are used per operation if no latency happens (somewhat simplified).
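As a rough illustration of how reciprocal throughput translates into time, the sketch below computes a lower bound on the cycles needed just to issue the vector multiplies. The function name `min_issue_cycles` is invented for this example, and any reciprocal-throughput values passed to it (such as 0.5 for the float multiply and 1.0 or 2.0 for the integer multiply) are placeholders chosen to match the "1/2 or 1/4" ratio above, not measured figures; the real numbers for a given microarchitecture should be looked up in the Intrinsics Guide. The lane count of 8 follows from a 256-bit AVX2 register holding eight 32-bit elements.

```cpp
#include <cassert>
#include <cstdint>

// Lower bound on cycles spent issuing the vector multiplies for
// `scalar_mults` independent 32-bit multiplications.
//   lanes            : elements per vector (8 for 256-bit registers, 32-bit elements)
//   recip_throughput : cycles per vector instruction when nothing waits on latency
// This ignores loads, stores, additions and cache effects entirely.
double min_issue_cycles(std::uint64_t scalar_mults, unsigned lanes, double recip_throughput) {
    assert(lanes > 0);
    // Round up: a partially filled vector still costs one instruction.
    const std::uint64_t vector_ops = (scalar_mults + lanes - 1) / lanes;
    return static_cast<double>(vector_ops) * recip_throughput;
}
```

For example, `min_issue_cycles(16, 8, 0.5)` gives 1.0 cycle while `min_issue_cycles(16, 8, 2.0)` gives 4.0 cycles: with everything else equal, the assumed 4x gap in reciprocal throughput shows up directly as a 4x gap in this bound, which is in line with the speedup reported in the question.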
