Merge bit sequences a and b according to a mask
According to the Bit Twiddling Hacks website, the operation
unsigned int a; // value to merge in non-masked bits
unsigned int b; // value to merge in masked bits
unsigned int mask; // 1 where bits from b should be selected; 0 where from a.
unsigned int r; // result of (a & ~mask) | (b & mask) goes here
r = a ^ ((a ^ b) & mask);
allows merging two bit sequences a and b according to a mask. I was wondering:
- Does this operation have a specific/usual name?
- Does a specific assembly instruction for this operation exist on some instruction set?
Recommended answer
I'd call this a bit-blend, using the masked-xor method. Related: this Q&A explains in detail how/why those boolean operations accomplish this.
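As a quick way to see that the masked-xor form and the plain AND/OR form from the question's comment compute the same thing, here is a minimal C sketch. The function names blend_xor and blend_andor are made up for illustration and are not from the Bit Twiddling Hacks page.

```c
#include <stdint.h>

/* Masked-xor form: start from a, then flip exactly those bits where
   a and b differ AND the mask is 1. Flipping a differing bit of a
   turns it into the corresponding bit of b. */
static uint32_t blend_xor(uint32_t a, uint32_t b, uint32_t mask)
{
    return a ^ ((a ^ b) & mask);
}

/* Straightforward form: keep a where mask is 0, take b where mask is 1. */
static uint32_t blend_andor(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}
```

For example, blend_xor(0xAAAA5555, 0x12345678, 0x0000FFFF) keeps the high half of a and takes the low half of b, giving 0xAAAA5678, and blend_andor returns the same value for the same inputs.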
In SSE/AVX programming, selective copying from one vector to another based on a mask is called a blend. SSE4.1 added instructions like PBLENDVB xmm1, xmm2/m128, <XMM0>, where the implicit operand XMM0 controls which bytes of the src overwrite corresponding bytes in the dst. (Without SSE4.1, you'd usually AND and ANDNOT the mask onto the two vectors, and OR the results together; the masked-xor trick has less instruction-level parallelism, and probably requires at least as many MOV instructions to copy registers as the OR method.)
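To make the difference between an element-wise blend and the AND/ANDNOT/OR fallback concrete, here is a scalar C model. It is only a sketch: it is shrunk to 8 byte lanes packed in a uint64_t (the real instruction works on 16 bytes in an XMM register), and the names byte_blend_model and and_andnot_or are hypothetical. It relies on the fact that PBLENDVB looks only at the top bit of each mask byte.

```c
#include <stdint.h>

/* Scalar model of a PBLENDVB-style byte blend, shrunk to 8 lanes.
   For each lane only the top bit of the mask byte matters:
   1 takes the whole byte from src, 0 keeps the whole byte from dst. */
static uint64_t byte_blend_model(uint64_t dst, uint64_t src, uint64_t mask)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t lane = (uint64_t)0xFF << (8 * i);  /* all bits of lane i */
        uint64_t top  = (uint64_t)0x80 << (8 * i);  /* top bit of lane i  */
        r |= ((mask & top) ? src : dst) & lane;
    }
    return r;
}

/* Scalar form of the pre-SSE4.1 fallback: AND the mask onto src,
   ANDNOT it onto dst, and OR the two halves together. This is a
   bitwise blend, so every mask bit counts. */
static uint64_t and_andnot_or(uint64_t dst, uint64_t src, uint64_t mask)
{
    return (dst & ~mask) | (src & mask);
}
```

The two agree whenever every mask byte is 0x00 or 0xFF, which is what vector compare instructions produce. They differ for a mask byte such as 0x80: the byte blend takes the whole src byte, while the AND/ANDNOT/OR version takes only its top bit.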
There's also an immediate blend instruction, pblendw, where the mask is an 8-bit immediate instead of a register. And there are 32-bit and 64-bit immediate blends (blendps, blendpd, vpblendd) and variable blends (blendvps, blendvpd).
IDK if other SIMD instruction sets (NEON, AltiVec, whatever MIPS calls theirs, etc.) also call them "blends" or not.
SSE/AVX (or x86 integer instructions) don't provide anything better than the usual bitwise XOR/AND for doing bitwise (instead of element-wise) blends until AVX512F.
AVX512F can do the bitwise version of this (or any other bitwise ternary function) with a single vpternlogd or vpternlogq instruction. (The only difference between d and q element sizes is if you use a mask register for merge-masking or zero-masking the destination, but that didn't stop Intel from making separate intrinsics even for the no-mask case:
__m512i _mm512_ternarylogic_epi32 (__m512i a, __m512i b, __m512i c, int imm8)
and the equivalent ..._epi64 version.)
The imm8 immediate byte is a truth table. Every bit of the destination is determined independently, from the corresponding bits of a, b and c, by using them as a 3-bit index into the truth table, i.e. as imm8[a:b:c].
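The truth-table lookup can be sketched in plain C without any AVX512 hardware. The following is a scalar model on a single 32-bit element, not the real intrinsic. The helper names blend_bit, make_imm8 and ternlog_model are invented for illustration, and the operand roles (a = value for non-masked bits, b = value for masked bits, c = mask) are an assumption carried over from the question's variable names.

```c
#include <stdint.h>

/* Example 3-input bit function: the bit-blend from the question,
   with c playing the role of the mask. Inputs are single bits. */
static unsigned blend_bit(unsigned a, unsigned b, unsigned c)
{
    return (a ^ ((a ^ b) & c)) & 1u;
}

/* Build the 8-entry truth table for any 3-input bit function.
   The entry index is the 3-bit number a:b:c, a being the high bit. */
static uint8_t make_imm8(unsigned (*f)(unsigned, unsigned, unsigned))
{
    uint8_t imm8 = 0;
    for (unsigned idx = 0; idx < 8; idx++) {
        unsigned a = (idx >> 2) & 1u;
        unsigned b = (idx >> 1) & 1u;
        unsigned c = idx & 1u;
        if (f(a, b, c))
            imm8 |= (uint8_t)(1u << idx);
    }
    return imm8;
}

/* Scalar model of the lookup: each result bit is imm8[a:b:c],
   computed independently for every bit position. */
static uint32_t ternlog_model(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8)
{
    uint32_t r = 0;
    for (int bit = 0; bit < 32; bit++) {
        unsigned idx = (((a >> bit) & 1u) << 2)
                     | (((b >> bit) & 1u) << 1)
                     |  ((c >> bit) & 1u);
        r |= (uint32_t)((imm8 >> idx) & 1u) << bit;
    }
    return r;
}
```

With this operand order, make_imm8(blend_bit) yields 0xD8, so ternlog_model(a, b, mask, 0xD8) reproduces a ^ ((a ^ b) & mask). A common shortcut gives the same constant: evaluate the function on the bytes 0xF0, 0xCC and 0xAA in place of a, b and c.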
AVX512 will be fun to play with when it eventually appears in mainstream desktop/laptop CPUs, but that's probably a couple years away still.