Parallelisation in Armadillo

2021-12-30 · tags: parallel-processing, blas, c++, armadillo


The Armadillo C++ linear algebra library documentation states that one of the reasons for developing the library in C++ is the "ease of parallelisation via OpenMP present in modern C++ compilers", yet the Armadillo code does not use OpenMP. How can I gain the benefits of parallelisation with Armadillo? Is this achieved by using one of the high-speed LAPACK and BLAS replacements? My platform is Linux with an Intel processor, but I suspect there is a generic answer to this question.

Accepted answer


Okay, so it appears that parallelisation is indeed achieved by using the high-speed LAPACK and BLAS replacements. On Ubuntu 12.04 I installed OpenBLAS using the package manager and built the Armadillo library from source. The examples in the examples folder built and ran, and I can control the number of cores using the OPENBLAS_NUM_THREADS environment variable.


I created a small project, openblas-benchmark, which measures the performance increase of Armadillo when computing a matrix product C=AxB for matrices of various sizes, but so far I have only been able to test it on a 2-core machine.


The performance plot shows a nearly 50% reduction in execution time for matrices larger than 512x512. Note that both axes are logarithmic; each grid line on the y-axis represents a doubling of execution time.
