C++ tips for code optimization on ARM devices

2022-01-17 00:00:00 optimization arm c++

I have been developing C++ code for augmented reality on ARM devices, and optimizing the code is very important for keeping a good frame rate. To push efficiency as far as possible, I think it is important to gather general tips that make the compiler's job easier and reduce the number of cycles the program takes. Any suggestions are welcome.

1- Avoid high-cost instructions: division, square root, sin, cos

Use logical shifts to multiply or divide by 2 (or any other power of 2).
Multiply by the reciprocal whenever possible.
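As a rough sketch of tip 1: the function names below (`half_u32`, `twice_u32`, `scale_by_inverse`) are made up for illustration. The shift versions assume unsigned values, since a right shift of a negative signed integer rounds differently from a division. Compilers usually turn a division by a constant power of 2 into a shift on their own, so the reciprocal trick for a divisor known only at run time is probably the more useful part.

```cpp
#include <cassert>
#include <cstdint>

// Halve and double an unsigned value with shifts instead of / and *.
inline std::uint32_t half_u32(std::uint32_t x)  { return x >> 1; }
inline std::uint32_t twice_u32(std::uint32_t x) { return x << 1; }

// Divide every element by the same run-time divisor.
// The single division happens once, outside the loop; the loop only multiplies.
void scale_by_inverse(float* data, int n, float divisor) {
    assert(divisor != 0.0f);
    const float inv = 1.0f / divisor;
    for (int i = 0; i < n; ++i) {
        data[i] *= inv;
    }
}
```

Note that multiplying by a rounded reciprocal can differ from a true division in the last bit, which is normally acceptable for graphics work but may not be elsewhere.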

2- Optimize inner "for" loops: they are a bottleneck, so we should avoid doing many calculations inside them, especially divisions and square roots.
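One way to read tip 2 is to hoist anything that does not change between iterations out of the loop. The sketch below uses two hypothetical functions, `normalize_naive` and `normalize_hoisted`, which scale a vector by the square root of a precomputed sum of squares. Without relaxed floating-point flags the compiler is generally not allowed to replace the division with a reciprocal multiply itself, so this rewrite has to be done by hand.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive version: a square root and a division for every element.
void normalize_naive(std::vector<float>& v, float sum_sq) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        v[i] = v[i] / std::sqrt(sum_sq);
    }
}

// Hoisted version: one square root and one division in total,
// then a single multiply per element inside the loop.
void normalize_hoisted(std::vector<float>& v, float sum_sq) {
    assert(sum_sq > 0.0f);
    const float inv_norm = 1.0f / std::sqrt(sum_sq);
    float* p = v.data();
    const std::size_t n = v.size();
    for (std::size_t i = 0; i < n; ++i) {
        p[i] *= inv_norm;
    }
}
```

Both functions give the same result up to rounding; whether the second is actually faster on a given device is worth confirming by measuring or by reading the generated assembly.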

3- Use look-up tables for some mathematical functions (sin, cos, ...)
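A minimal sketch of what tip 3 could look like for sine and cosine. The class name `SinTable`, the table size of 1024 entries and the nearest-entry lookup are all assumptions chosen for illustration; with this size the error is at most about 0.003, and the table takes 4 KB, which should normally fit in L1 cache. A larger table or linear interpolation would trade memory traffic for accuracy.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kTableSize = 1024;  // power of two, so wrapping is a bit mask
constexpr float kTwoPi = 6.28318530717958647692f;

class SinTable {
public:
    // Fill the table once with one full period of sine.
    SinTable() {
        for (std::size_t i = 0; i < kTableSize; ++i) {
            table_[i] = std::sin(kTwoPi * static_cast<float>(i) /
                                 static_cast<float>(kTableSize));
        }
    }

    // Nearest-entry lookup. Negative and large angles wrap around
    // because the index is masked to the table size.
    float sin(float radians) const {
        const float scaled = radians * (static_cast<float>(kTableSize) / kTwoPi);
        const long idx = std::lround(scaled);
        return table_[static_cast<std::size_t>(idx) & (kTableSize - 1)];
    }

    // Cosine is sine shifted by a quarter period.
    float cos(float radians) const { return sin(radians + kTwoPi / 4.0f); }

private:
    float table_[kTableSize];
};
```

The table would be built once at start-up (for example as a long-lived object) and then queried inside the hot loops in place of `std::sin` and `std::cos`.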

Useful tools

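objdump: gets the assembly code of a compiled program. This makes it possible to compare two versions of a function and check whether one is really better optimized.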

Recommended answer

To answer your question about general rules for optimizing C++ code for ARM, here are a few suggestions:

1) As you mentioned, avoid division. Older ARM cores (the Cortex-A8 and Cortex-A9, for example) have no integer divide instruction, and on cores that do have one it is still slow compared with a shift or a multiply. Use logical shifts or multiply by the inverse when possible.
2) Memory is much slower than CPU execution; use logical operations to avoid small lookup tables.
3) Try to write 32 bits at a time to make the best use of the write buffer. Writing shorts or chars will slow the code down considerably. In other words, it's faster to combine the smaller values with shifts and bitwise OR and write them out as 32-bit words (DWORDs).
4) Be aware of your L1/L2 cache sizes. As a general rule, ARM chips have much smaller caches than Intel chips.
5) Use SIMD (NEON) when possible. NEON instructions are quite powerful and, for "vectorizable" code, can be quite fast. NEON intrinsics are available in most C++ environments and can be nearly as fast as hand-tuned ASM code.
6) Use the cache prefetch hint (PLD) to speed up looping reads. Many ARM cores, especially older ones, don't have smart prefetch logic the way modern Intel chips do.
7) Don't trust the compiler to generate good code. Look at the ASM output and rewrite hotspots in ASM. For bit/byte manipulation, the C language can't specify things as efficiently as they can be done in ASM. ARM has powerful 3-operand instructions, multi-load/store and "free" shifts that can outperform what the compiler is capable of generating.
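To illustrate suggestion 3, here is a hedged sketch that converts 8-bit grayscale pixels to 32-bit RGBA using one word-sized store per pixel instead of four byte-sized stores. The names `pack_rgba` and `gray_to_rgba` are hypothetical, and the layout (red in the lowest byte) is an assumption that matches R, G, B, A byte order in memory only on a little-endian target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Combine four 8-bit channels into one 32-bit word with shifts and bitwise OR.
inline std::uint32_t pack_rgba(std::uint8_t r, std::uint8_t g,
                               std::uint8_t b, std::uint8_t a) {
    return static_cast<std::uint32_t>(r) |
           (static_cast<std::uint32_t>(g) << 8) |
           (static_cast<std::uint32_t>(b) << 16) |
           (static_cast<std::uint32_t>(a) << 24);
}

// Expand grayscale to opaque RGBA: one byte read and one 32-bit write per pixel.
void gray_to_rgba(const std::uint8_t* gray, std::uint32_t* dst, std::size_t count) {
    assert((gray != nullptr && dst != nullptr) || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint8_t g = gray[i];
        dst[i] = pack_rgba(g, g, g, 0xFF);
    }
}
```

A modern compiler may merge adjacent byte stores by itself, so, in the spirit of suggestion 7, it is worth checking the disassembly before and after a change like this.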
