按位旋转是否比当前 Intel CPU 上的移位慢?
我很好奇 java.lang.Integer.rotateLeft
是否通过使用旋转指令进行了优化并为其编写了基准测试.结果尚无定论:它比两班倒快得多,但比单班倒慢一点.所以我用 C++ 重写了它并得到了大致相同的结果.通过 g++ -S -Wall -O3
编译时,我可以在 生成的汇编器.我的 CPU 是 Intel Core i5.
I was curious if java.lang.Integer.rotateLeft
gets optimized by using a rotation instruction and wrote a benchmark for it. The results were inconclusive: It was much faster than two shifts but a bit slower than a single one. So I rewrote it in C++ and got about the same results. When compiling via g++ -S -Wall -O3
I can see the instruction in the generated assembler. My CPU is Intel Core i5.
基准相当长,肯定不会最好的一段代码,但我不认为它被破坏了.或者是吗?根据文档,轮换需要一个周期,就像轮班一样.谁能解释一下结果?
The benchmark is quite long and surely not the nicest piece of code, but I don't think it's broken. Or is it? According to the documentation the rotations take one cycle, just like shifts. Can anybody explain the results?
rotations: 6860
shift: 5100
<小时>
前两个答案是错误的.gcc 和 java 的 JIT 都知道旋转指令并使用它们.关于 gcc 参见上面的链接,关于 java 参见我的 java benchmark 及其结果
benchmark ns linear runtime
Rotate 3.48 ====================
NonRotate 5.05 ==============================
Shift 2.16 ============
推荐答案
我不知道 gcc 和 java jit 能够识别一系列 SHIFT 和 OR 运算符可以简化为 ROTATE 指令,非常有趣.
I did not know that gcc and the java jit were capable of recognizing that a sequence of SHIFT and OR operators can be reduced to a ROTATE instruction, very interesting.
g++ 编译器展开您的循环并使用 SHIFT immediate
和 ROTATE immediate
指令(因为您按常量值移位和旋转).
The g++ compiler unrolls your loops and uses SHIFT immediate
and ROTATE immediate
instructions (since you shift and rotate by constant values).
这是在 TimeShift 循环展开案例中重复的六个指令序列:
Here's the six instruction sequence that is repeated in the TimeShift loop unroll case:
movq %rax, %rbx
salq $13, %rbx
leaq (%rbp,%rbx), %rbx
movq %rdi, %rbp
sarq $27, %rbp
xorq %rbx, %rdx
这是在 TimeRotate 循环展开案例中重复的六个指令序列:
Here's the six instruction sequence that is repeated in the TimeRotate loop unroll case:
movq %rdx, %rbx
rorq $45, %rbx
leaq (%rbp,%rbx), %rbx
movq %r8, %rbp
rorq $49, %rbp
xorq %rbx, %r9
它们的区别主要在于 SHIFT
使用 salq/sarq 和 ROTATE
使用 rorq 所以你想知道为什么时间不同是正确的.
They differ mainly in the use of salq/sarq for SHIFT
and rorq for ROTATE
so you are correct in wondering why the timing differs.
答案深藏在 Sandy Bridge(您的 Core i5 处理器)的微架构中,可在 INTEL® 64 和 IA-32 处理器架构优化参考手册最新的是Order Number: 248966-026 April 2012
The answer lies deep in the micro-architecture of Sandy Bridge (your Core i5 processor) and is found in INTEL® 64 and IA-32 Processor Architectures Optimization Reference Manual
The latest is Order Number: 248966-026 April 2012
无论您使用 by 1
操作码还是 by immediate
,SHIFT
指令都有 1 个周期延迟.它可以从 Port 0
或 Port 1
分派,因此具有 0.5 个周期的吞吐量 - 处理器可以分派和退出两个 SHIFT immediate
每个周期的指令.如果需要条件标志的结果(它们不在 gcc 生成的代码中),ROTATE
指令需要三个微操作,如果不需要,则需要两个微操作(因此在你的情况).然而,ROTATE
指令只能从 Port 1
分派,因此具有 1 个周期的吞吐量 - 处理器只能分派和退出一个 ROTATE immediate
代码> 每个周期.
The SHIFT
instruction has 1 cycle latency whether you use the by 1
opcode or by immediate
. It can dispatch from either Port 0
or Port 1
and for this reason has a 0.5 cycle throughput - the processor can dispatch and retire two SHIFT immediate
instructions per cycle. The ROTATE
instruction needs three micro-ops if the results of the condition flags are needed (they aren't in the code generated by gcc) and two micro-ops if not (so two micro-ops in your case). The ROTATE
instruction, however, can only be dispatched from Port 1
and therefore has a 1 cycle throughput - the processor can dispatch and retire only one ROTATE immediate
per cycle.
我已经复制了下面的相关图片和部分.
I've copied the relevant image and section below.
3.5.1.5 位旋转
按位旋转可以选择使用 CL 寄存器中指定的计数旋转,或者立即常数和 1 位.一般来说,rotate by immediate 和 rotate by寄存器指令比循环旋转 1 位慢.旋转 1 指令有与班次相同的延迟.汇编/编译器编码规则 35.(ML 影响,L 通用性)避免 ROTATE通过寄存器或通过立即指令旋转.如果可能,请替换为旋转 1 条指令.在 Intel 微架构代号 Sandy Bridge 中,ROL/ROR by immediate 有 1-周期吞吐量,SHLD/SHRD 使用相同的寄存器作为源和目标立即常数具有 1 个周期的延迟和 0.5 个周期的吞吐量.ROL/RORreg, imm8" 指令有两个微操作,旋转延迟为 1 个周期为标志注册结果和 2 个周期(如果使用).在 Intel 微架构代号 Ivy Bridge 中,立即数大于 1 的ROL/ROR reg, imm8"指令是一个具有一个周期延迟的微操作,当使用溢出标志结果.当立即数为一时,依赖溢出后续指令的 ROL/ROR 标志结果将看到 ROL/ROR 指令具有两个周期的延迟.
Bitwise rotation can choose between rotate with count specified in the CL register, an immediate constant and by 1 bit. Generally, The rotate by immediate and rotate by register instructions are slower than rotate by 1 bit. The rotate by 1 instruction has the same latency as a shift. Assembly/Compiler Coding Rule 35. (ML impact, L generality) Avoid ROTATE by register or ROTATE by immediate instructions. If possible, replace with a ROTATE by 1 instruction. In Intel microarchitecture code name Sandy Bridge, ROL/ROR by immediate has 1- cycle throughput, SHLD/SHRD using the same register as source and destination by an immediate constant has 1-cycle latency with 0.5 cycle throughput. The "ROL/ROR reg, imm8" instruction has two micro-ops with the latency of 1-cycle for the rotate register result and 2-cycles for the flags, if used. In Intel microarchitecture code name Ivy Bridge, The "ROL/ROR reg, imm8" instruction with immediate greater than 1, is one micro-op with one-cycle latency when the overflow flag result is used. When the immediate is one, dependency on the overflow flag result of ROL/ROR by a subsequent instruction will see the ROL/ROR instruction with two-cycle latency.
2.4.4.2 执行单元和发布端口
在每个周期,内核可能会向四个发布端口中的一个或多个发送微操作.在微架构层面,store操作进一步分为两部分:store数据和存储地址操作.调度μops的四个端口到执行单元以及加载和存储操作如图 2-6 所示.一些端口每个时钟可以调度两个微操作.那些执行单元被标记为 Double速度.
At each cycle, the core may dispatch µops to one or more of four issue ports. At the microarchitecture level, store operations are further divided into two parts: store data and store address operations. The four ports through which μops are dispatched to execution units and to load and store operations are shown in Figure 2-6. Some ports can dispatch two µops per clock. Those execution units are marked Double Speed.
端口 0. 在周期的前半部分,端口 0 可以调度任一浮点move µop(浮点堆栈移动、浮点交换或浮点存储数据)或一个算术逻辑单元 (ALU) µop(算术、逻辑、分支或存储数据).在周期的后半段,它可以调度一个类似的 ALU µop.
Port 0. In the first half of the cycle, port 0 can dispatch either one floating-point move µop (a floating-point stack move, floating-point exchange or floating-point store data) or one arithmetic logical unit (ALU) µop (arithmetic, logic, branch or store data). In the second half of the cycle, it can dispatch one similar ALU µop.
端口 1. 在周期的前半部分,端口 1 可以调度任一浮点执行(除移动之外的所有浮点运算,所有 SIMD 运算) µop 或1 个正常速度整数(乘法、移位和旋转) µop 或 1 个 ALU(算术)微操作.在周期的后半段,它可以调度一个类似的 ALU µop.
Port 1. In the first half of the cycle, port 1 can dispatch either one floating-point execution (all floating-point operations except moves, all SIMD operations) µop or one normal-speed integer (multiply, shift and rotate) µop or one ALU (arithmetic) µop. In the second half of the cycle, it can dispatch one similar ALU µop.
端口 2.此端口支持每个周期调度一个加载操作.
Port 2. This port supports the dispatch of one load operation per cycle.
端口 3. 此端口支持每个周期调度一个存储地址操作.
Port 3. This port supports the dispatch of one store address operation per cycle.
每个周期的总问题带宽范围为 0 到 6 µops.每个管道包含多个执行单元.µops 被分派到对应于正确操作类型的管道.例如,整数算术逻辑单元浮点执行单元(加法器、乘法器和除法器)可以共享一个管道.
The total issue bandwidth can range from zero to six µops per cycle. Each pipeline contains several execution units. The µops are dispatched to the pipeline that corresponds to the correct type of operation. For example, an integer arithmetic logic unit and the floating-point execution units (adder, multiplier, and divider) can share a pipeline.
相关文章