快速可除性测试(按 2,3,4,5,.., 16)?
最快的可分性测试是什么?比如说,给定一个小端架构和一个 32 位有符号整数:如何非常快速地计算出一个数可以被 2、3、4、5、...整除 16?
What are the fastest divisibility tests? Say, given a little-endian architecture and a 32-bit signed integer: how to calculate very fast that a number is divisible by 2,3,4,5,... up to 16?
警告:给定的代码只是示例.每条线都是独立的!在许多没有 DIV 硬件(如许多 ARM)的处理器上,使用模运算的明显解决方案很慢.一些编译器也无法进行这样的优化(例如,如果除数是函数的参数或依赖于某些东西).
WARNING: given code is EXAMPLE only. Every line is independent! Just obvious solution using modulo operation is slow on many processors, which don't have DIV hardware (like many ARMs). Some compilers are also cannot make such optimizations (say, if divisor is a function's argument or is dependent on something).
Divisible_by_1 = do();
Divisible_by_2 = if (!(number & 1)) do();
Divisible_by_3 = ?
Divisible_by_4 = ?
Divisible_by_5 = ?
Divisible_by_6 = ?
Divisible_by_7 = ?
Divisible_by_8 = ?
Divisible_by_9 = ?
Divisible_by_10 = ?
Divisible_by_11 = ?
Divisible_by_12 = ?
Divisible_by_13 = ?
Divisible_by_14 = ?
Divisible_by_15 = ?
Divisible_by_16 = if(!number & 0x0000000F) do();
和特殊情况:
Divisible_by_2k = if(number & (tk-1)) do(); //tk=2**k=(2*2*2*...) k times
推荐答案
找出除法指令(包括 x86/x64 上的取模)的替代方案并不是一个坏主意,因为它们非常慢.比大多数人意识到的要慢(甚至慢得多).那些建议使用 "% n" 其中 n 是变量的人给出了愚蠢的建议,因为它总是会导致使用除法指令.另一方面,% c"(其中 c 是常数)将允许编译器确定其曲目中可用的最佳算法.有时是除法指令,但很多时候不是.
It is not a bad idea AT ALL to figure out alternatives to division instructions (which includes modulo on x86/x64) because they are very slow. Slower (or even much slower) than most people realize. Those suggesting "% n" where n is a variable are giving foolish advice because it will invariably lead to the use of the division instruction. On the other hand "% c" (where c is a constant) will allow the compiler to determine the best algorithm available in its repertoire. Sometimes it will be the division instruction but a lot of the time it won't.
在本文档中,Torbj?rn Granlund 显示了无符号 32 位复数所需的时钟周期比:div 在 Sandybridge 上为 4:26 (6.5x),在 K10 上为 3:45 (15x).对于 64 位,相应的比率为 4:92 (23x) 和 5:77 (14.4x).
In this document Torbj?rn Granlund shows that the ratio of clock cycles required for unsigned 32-bit mults:divs is 4:26 (6.5x) on Sandybridge and 3:45 (15x) on K10. for 64-bit the respective ratios are 4:92 (23x) and 5:77 (14.4x).
L"列表示延迟.T"列表示吞吐量.这与处理器并行处理多条指令的能力有关.Sandybridge 可以每隔一个周期发出一次 32 位乘法,或者每个周期发出一次 64 位乘法.对于 K10,相应的吞吐量是相反的.对于分区,K10 需要在开始另一个序列之前完成整个序列.我怀疑 Sandybridge 也是如此.
The "L" columns denote latency. "T" columns denote throughput. This has to do with the processor's ability to handle multiple instructions in parallell. Sandybridge can issue one 32-bit multiplication every other cycle or one 64-bit every cycle. For K10 the corresponding throughput is reversed. For divisions the K10 needs to complete the entire sequence before it may begin another. I suspect it is the same for Sandybridge.
以 K10 为例,这意味着在 32 位除法 (45) 所需的周期内,可以发出相同数量 (45) 的乘法,并且其中的倒数第二和最后一个将完成除法完成后的一个和两个时钟周期.很多工作可以用 45 次乘法来完成.
Using the K10 as an example it means that during the cycles required for a 32-bit division (45) the same number (45) of multiplications can be issued and the next-to-last and last one of these will complete one and two clock cycles after the division has completed. A LOT of work can be performed in 45 multiplications.
有趣的是,随着从 K8-K9 到 K10 的演变,div 的效率变得越来越低:32 位和 64 位的时钟周期从 39 到 45,从 71 到 77 个时钟周期.
It is also interesting to note that divs have become less efficient with the evolution from K8-K9 to K10: from 39 to 45 and 71 to 77 clock cycles for 32- and 64-bit.
Granlund 在 gmplib.org 和 页面/~tege/">斯德哥尔摩皇家理工学院包含更多好东西,其中一些已被纳入gcc编译器.
Granlund's page at gmplib.org and at the Royal Institute of Technology in Stockholm contain more goodies, some of which have been incorporated into the gcc compiler.
相关文章