布尔值在编译器中为 8 位.对它们的操作效率低下吗?

2021-12-20 00:00:00 optimization c boolean c++ x86

我正在阅读 Agner Fog 的在 C++ 中优化软件"(特定于 x86 处理器，用于英特尔、AMD 和威盛)，并在第 34 页说明

I'm reading Agner Fog's "Optimizing software in C++" (specific to x86 processors for Intel, AMD and VIA) and it states on page 34

布尔变量存储为 8 位整数，值为 0 表示假，值为 1 表示真.布尔变量是超定的，因为所有具有布尔值的运算符变量作为输入检查输入是否具有除 0 或 1 以外的任何其他值，但操作符将布尔值作为输出不能产生除 0 或 1 之外的其他值.这使得操作使用布尔变量作为输入的效率低于必要.

Boolean variables are stored as 8-bit integers with the value 0 for false and 1 for true. Boolean variables are overdetermined in the sense that all operators that have Boolean variables as input check if the inputs have any other value than 0 or 1, but operators that have Booleans as output can produce no other value than 0 or 1. This makes operations with Boolean variables as input less efficient than necessary.

今天仍然如此吗?在什么编译器上?你能举个例子吗?作者表示

Is this still true today and on what compilers? Can you please give an example? The author states

布尔运算可以变得更有效率，如果它可以肯定的是，操作数除了 0 和 1 之外没有其他值.原因为什么编译器不做这样的假设是变量可能有其他值，如果它们未初始化或来自未知来源.

The Boolean operations can be made much more efficient if it is known with certainty that the operands have no other values than 0 and 1. The reason why the compiler doesn't make such an assumption is that the variables might have other values if they are uninitialized or come from unknown sources.

这是否意味着如果我以函数指针 bool(*)() 为例并调用它，那么对它的操作会产生低效的代码?或者当我通过取消引用指针或从引用中读取来访问布尔值然后对其进行操作时是这样吗?

Does this mean that if I take a function pointer bool(*)() for example and call it, then operations on it produce inefficient code? Or is it the case when I access a boolean by dereferencing a pointer or reading from a reference and then operate on it?

推荐答案

TL:DR:当前的编译器在执行诸如
<代码>(a&&b) ?x : y.但是，不是他们不假设 0/1 的原因，他们只是在做这个.

TL:DR: current compilers still have bool missed-optimizations when doing stuff like
(a&&b) ? x : y. But the reason why is not that they don't assume 0/1, they just suck at this.

bool 的许多用途是用于局部变量或内联函数，因此布尔化为 0/1 可以优化离开和分支(或cmov 或其他)在原始条件下.只需要担心优化 bool 输入/输出，因为它确实必须通过非内联或真正存储在内存中的东西传递/返回.

Many uses of bool are for locals, or inline functions, so booleanizing to a 0 / 1 can optimize away and branch (or cmov or whatever) on the original condition. Only worry about optimizing bool inputs / outputs when it does have to get passed/returned across something that doesn't inline, or really stored in memory.

可能的优化指南:将来自外部源(函数参数/内存)的 bool 与按位运算符(如 a&b)结合起来.MSVC 和 ICC 在这方面做得更好.IDK 如果本地 bool 更糟.请注意，对于 bool，a&b 仅等效于 a&&b，而不是整数类型.<代码>2 &&1 为真，但 2 &1 是 0 是假的.按位 OR 没有这个问题.

Possible optimization guideline: combine bools from external sources (function args / memory) with bitwise operators, like a&b. MSVC and ICC do better with this. IDK if it's ever worse for local bools. Beware that a&b is only equivalent to a&&b for bool, not integer types. 2 && 1 is true, but 2 & 1 is 0 which is false. Bitwise OR doesn't have this problem.

IDK，如果此指南会对通过函数内的比较(或内联的东西)设置的局部变量造成伤害.例如.它可能会导致编译器实际生成整数布尔值，而不是在可能的情况下直接使用比较结果.另请注意，它似乎对当前的 gcc 和 clang 没有帮助.

IDK if this guideline will ever hurt for locals that were set from a comparison within the function (or in something that inlined). E.g. it might lead the compiler to actually make integer booleans instead of just using comparison results directly when possible. Also note that it doesn't seem to help with current gcc and clang.

是的，x86 上的 C++ 实现将 bool 存储在一个始终为 0 或 1 的字节中(至少在编译器必须遵守 ABI/调用约定的函数调用边界中需要这样做.)

Yes, C++ implementations on x86 store bool in a byte that's always 0 or 1 (at least across function-call boundaries where the compiler has to respect the ABI / calling convention which requires this.)

编译器有时会利用这一点，例如对于 bool->int 转换，即使是 gcc 4.4 也只是零扩展到 32 位(movzx eax, dil).Clang 和 MSVC 也这样做.C 和 C++ 规则要求这种转换产生 0 或 1，因此这种行为只有在总是安全地假设 bool 函数 arg 或全局变量具有 0 时才是安全的或 1 个值.

Compilers do sometimes take advantage of this, e.g. for bool->int conversion even gcc 4.4 simply zero-extends to 32-bit (movzx eax, dil). Clang and MSVC do this, too. C and C++ rules require this conversion to produce 0 or 1, so this behaviour is only safe if it's always safe to assume that a bool function arg or global variable has a 0 or 1 value.

即使是旧的编译器通常也会在 bool->int 中利用它，但在其他情况下不会.因此，Agner 说的原因是错误的:

Even old compilers typically did take advantage of it for bool->int, but not in other cases. Thus, Agner is wrong about the reason when he says:

编译器之所以不做这样的假设是因为如果变量未初始化或来自未知来源，它们可能具有其他值.

The reason why the compiler doesn't make such an assumption is that the variables might have other values if they are uninitialized or come from unknown sources.

MSVC CL19 生成的代码假定 bool 函数参数为 0 或 1，因此 Windows x86-64 ABI 必须保证这一点.

MSVC CL19 does make code that assumes bool function args are 0 or 1, so the Windows x86-64 ABI must guarantee this.

在 x86-64 System V ABI(用于除 Windows 之外的所有内容)，修订版 0.98 的更改日志说指定 _Bool(又名 bool)在调用者处被布尔化."我认为甚至在这种变化之前，编译器就已经假设了它，但这只是记录了编译器已经依赖的东西.x86-64 SysV ABI 中的当前语言是:

In the x86-64 System V ABI (used by everything other than Windows), the changelog for revision 0.98 says "Specify that _Bool (aka bool) is booleanized at the caller." I think even before that change, compilers were assuming it, but this just documents what compilers were already relying on. The current language in the x86-64 SysV ABI is:

3.1.2 数据表示

布尔值，当存储在内存对象中时，存储为单字节对象，其值始终为 0(假)或 1(真).当存储在整数寄存器中时(除了作为参数传递)，寄存器的所有 8 个字节都是有效的；任何非零值都被认为是真的.

Booleans, when stored in a memory object, are stored as single byte objects the value of which is always 0 (false) or 1 (true). When stored in integer registers (except for passing as arguments), all 8 bytes of the register are significant; any nonzero value is considered true.

第二句是无稽之谈:ABI 没有告诉编译器如何在函数内的寄存器中存储东西，只在不同编译单元(内存/函数参数和返回值)之间的边界处.我不久前在维护它的 github 页面上报告了这个 ABI 缺陷.

The second sentence is nonsense: the ABI has no business telling compilers how to store things in registers inside a function, only at boundaries between different compilation units (memory / function args and return values). I reported this ABI defect a while ago on the github page where it's maintained.

3.2.3 参数传递:

当_Bool 类型的值在寄存器或堆栈中返回或传递时，位 0 包含真值，位 1 到 7 应为零¹⁶.

When a value of type _Bool is returned or passed in a register or on the stack, bit 0 contains the truth value and bits 1 to 7 shall be zero¹⁶.

(脚注 16):其他位未指定，因此这些值的消费者方面可以依赖它在截断为 8 位时是 0 或 1.

(footnote 16): Other bits are left unspecified, hence the consumer side of those values can rely on it being 0 or 1 when truncated to 8 bit.

i386 System V ABI 中的语言是相同的，IIRC.

The language in the i386 System V ABI is the same, IIRC.

任何为一件事假定 0/1(例如转换为 int)但在其他情况下未能利用它的编译器都会错过优化.不幸的是，这种错过的优化仍然存在，尽管它们比 Agner 写那段关于编译器总是重新布尔化的情况少得多.

Any compiler that assumes 0/1 for one thing (e.g. conversion to int) but fails to take advantage of it in other cases has a missed optimization. Unfortunately such missed-optimizations still exist, although they are rarer than when Agner wrote that paragraph about compilers always re-booleanizing.

(在 Godbolt 编译器资源管理器，用于 gcc4.6/4.7 和 cl昂/MSVC.另请参阅 Matt Godbolt 的 CppCon2017 演讲我的编译器最近为我做了什么?打开编译器的盖子)

bool logical_or(bool a, bool b) { return a||b; } # gcc4.6.4 -O3 for the x86-64 System V ABI test dil, dil # test a against itself (for non-zero) mov eax, 1 cmove eax, esi # return a ? 1 : b; ret

因此，即使 gcc4.6 也没有重新布尔化 b，但它确实错过了 gcc4.7 所做的优化:(以及其他答案中显示的 clang 和更高版本的编译器):

So even gcc4.6 didn't re-booleanize b, but it did miss the optimization that gcc4.7 makes: (and clang and later compilers as shown in other answers):

# gcc4.7 -O3 to present: looks ideal to me. mov eax, esi or eax, edi ret

(Clang 的 或 dil, sil/mov eax, edi 很愚蠢:在阅读 时，它肯定会导致 Nehalem 或更早的 Intel 上的部分寄存器停顿edi 在编写 dil 之后，由于需要一个 REX 前缀来使用 edi 的低 8 部分，它的代码大小更糟.更好的选择可能是 或 dil,sil/movzx eax, dil 如果您想避免读取任何 32 位寄存器，以防您的调用者留下一些带有脏"部分寄存器的 arg-passing 寄存器.)

(Clang's or dil, sil / mov eax, edi is silly: it's guaranteed to cause a partial-register stall on Nehalem or earlier Intel when reading edi after writing dil, and it has worse code size from needing a REX prefix to use the low-8 part of edi. A better choice might be or dil,sil / movzx eax, dil if you want to avoid reading any 32-bit registers in case your caller left some arg-passing registers with "dirty" partial registers.)

MSVC 发出这段代码，分别检查 a 和 b，完全没有利用任何东西，甚至使用 xoral,al 而不是 xor eax,eax.因此，它对大多数 CPU 上的 eax 旧值具有错误的依赖性(包括 Haswell/Skylake，它们不会将低 8 部分 regs 与整个寄存器分开重命名，只有 AH/BH/...).这只是愚蠢的.使用 xor al,al 的唯一原因是您明确希望保留高位字节.

MSVC emits this code that checks a then b separately, completely failing to take advantage of anything, and even using xor al,al instead of xor eax,eax. So it has a false dependency on the old value of eax on most CPUs (including Haswell/Skylake, which don't rename low-8 partial regs separately from the whole register, only AH/BH/...). This is just dumb. The only reason to ever use xor al,al is when you explicitly want to preserve the upper bytes.

logical_or PROC ; x86-64 MSVC CL19 test cl, cl ; Windows ABI passes args in ecx, edx jne SHORT $LN3@logical_or test dl, dl jne SHORT $LN3@logical_or xor al, al ; missed peephole: xor eax,eax is strictly better ret 0 $LN3@logical_or: mov al, 1 ret 0 logical_or ENDP

ICC18 也没有利用输入的已知 0/1 性质，它只是使用 or 指令根据两个输入的按位或来设置标志，并且 setcc 产生一个 0/??1.

ICC18 also doesn't take advantage of the known 0/1 nature of the inputs, it just uses an or instruction to set flags according to the bitwise OR of the two inputs, and setcc to produce a 0/1.

logical_or(bool, bool): # ICC18 xor eax, eax #4.42 movzx edi, dil #4.33 movzx esi, sil #4.33 or edi, esi #4.42 setne al #4.42 ret #4.42

ICC 发出相同的代码，即使对于 bool bitwise_or(bool a, bool b) { return a|b;}.它提升为 int(使用 movzx)，并使用 or 根据按位 OR 设置标志.与 或 dil,sil/setne al 相比，这是愚蠢的.

ICC emits the same code even for bool bitwise_or(bool a, bool b) { return a|b; }. It promotes to int (with movzx), and uses or to set flags according to the bitwise OR. This is dumb compared to or dil,sil / setne al.

对于bitwise_or，MSVC 只使用or 指令(在每个输入的movzx 之后)，但无论如何都不会重新布尔化.

For bitwise_or, MSVC does just use an or instruction (after movzx on each input), but anyway doesn't re-booleanize.

只有 ICC/MSVC 用上面的简单函数制作了愚蠢的代码，但是这个函数仍然给 gcc 和 clang 带来麻烦:

Only ICC/MSVC were making dumb code with the simple function above, but this function still gives gcc and clang trouble:

int select(bool a, bool b, int x, int y) { return (a&&b) ? x : y; }

<强> 源+ ASM在Godbolt编译器资源管理器(相同的来源，选择的编译器与上次不同).

Source+asm on the Godbolt compiler explorer (Same source, different compilers selected vs. last time).

看起来很简单；您希望智能编译器可以通过一个 test/cmov 无分支地完成它.x86 的 test 指令根据位与设置标志.这是一个实际上不写入目标的 AND 指令.(就像 cmp 是一个不写入目的地的 sub 一样).

Looks simple enough; you'd hope that a smart compiler would do it branchlessly with one test/cmov. x86's test instruction sets flags according to a bitwise AND. It's an AND instruction that doesn't actually write the destination. (Just like cmp is a sub that doesn't write the destination).

# hand-written implementation that no compilers come close to making select: mov eax, edx # retval = x test edi, esi # ZF = ((a & b) == 0) cmovz eax, ecx # conditional move: return y if ZF is set ret

但即使在 Godbolt 编译器资源管理器上每天构建 gcc 和 clang，也会使代码变得更加复杂，需要分别检查每个布尔值.如果返回 ab，他们知道如何优化 bool ab = a&&b;，但即使这样写(使用单独的布尔变量来保存结果)没有设法帮助他们编写不烂的代码.

But even the daily builds of gcc and clang on the Godbolt compiler explorer make much more complicated code, checking each boolean separately. They know how to optimize bool ab = a&&b; if you return ab, but even writing it that way (with a separate boolean variable to hold the result) doesn't manage to hand-hold them into making code that doesn't suck.

注意 test same,same 完全等同于 cmp reg, 0，并且更小，所以这是编译器使用的.

Note that test same,same is exactly equivalent to cmp reg, 0, and is smaller, so it's what compilers use.

Clang 的版本比我的手写版本差很多.(请注意，它要求调用者将 bool 参数零扩展到 32 位，就像它对窄整数类型所做的那样，作为它和 gcc 实现的 ABI 的非官方部分，但是只有叮当依赖).

Clang's version is strictly worse than my hand-written version. (Note that it requires that the caller zero-extended the bool args to 32-bit, like it does for narrow integer types as an unofficial part of the ABI which it and gcc implement but only clang depends on).

select: # clang 6.0 trunk 317877 nightly build on Godbolt test esi, esi cmove edx, ecx # x = b ? y : x test edi, edi cmove edx, ecx # x = a ? y : x mov eax, edx # return x ret

gcc 8.0.0 20171110 每晚都会为此制作分支代码，类似于旧版 gcc 所做的.

gcc 8.0.0 20171110 nightly makes branchy code for this, similar to what older gcc versions do.

select(bool, bool, int, int): # gcc 8.0.0-pre 20171110 test dil, dil mov eax, edx ; compiling with -mtune=intel or -mtune=haswell would keep test/jcc together for macro-fusion. je .L8 test sil, sil je .L8 rep ret .L8: mov eax, ecx ret

MSVC x86-64 CL19 生成非常相似的分支代码.它针对 Windows 调用约定，其中整数参数在 rcx、rdx、r8、r9 中.

MSVC x86-64 CL19 makes very similar branchy code. It's targeting the Windows calling convention, where integer args are in rcx, rdx, r8, r9.

select PROC test cl, cl ; a je SHORT $LN3@select mov eax, r8d ; retval = x test dl, dl ; b jne SHORT $LN4@select $LN3@select: mov eax, r9d ; retval = y $LN4@select: ret 0 ; 0 means rsp += 0 after popping the return address, not C return 0. ; MSVC doesn't emit the `ret imm16` opcode here, so IDK why they put an explicit 0 as an operand. select ENDP

ICC18 也生成分支代码，但在分支之后有两个 mov 指令.

ICC18 also makes branchy code, but with both mov instructions after the branches.

select(bool, bool, int, int): test dil, dil #8.13 je ..B4.4 # Prob 50% #8.13 test sil, sil #8.16 jne ..B4.5 # Prob 50% #8.16 ..B4.4: # Preds ..B4.2 ..B4.1 mov edx, ecx #8.13 ..B4.5: # Preds ..B4.2 ..B4.4 mov eax, edx #8.13 ret #8.13

尝试通过使用来帮助编译器

int select2(bool a, bool b, int x, int y) { bool ab = a&&b; return (ab) ? x : y; }

导致 MSVC 编写非常糟糕的代码:

;; MSVC CL19 -Ox = full optimization select2 PROC test cl, cl je SHORT $LN3@select2 test dl, dl je SHORT $LN3@select2 mov al, 1 ; ab = 1 test al, al ;; and then test/cmov on an immediate constant!!! cmovne r9d, r8d mov eax, r9d ret 0 $LN3@select2: xor al, al ;; ab = 0 test al, al ;; and then test/cmov on another path with known-constant condition. cmovne r9d, r8d mov eax, r9d ret 0 select2 ENDP

这仅适用于 MSVC(并且 ICC18 在刚刚设置为常量的寄存器上具有相同的 test/cmov 优化遗漏).

This is only with MSVC (and ICC18 has the same missed optimization of test/cmov on a register that was just set to a constant).

gcc 和 clang 像往常一样不会让代码像 MSVC 一样糟糕；他们为 select() 制作了相同的 asm，这仍然不好，但至少尝试帮助他们不会像 MSVC 那样让事情变得更糟.

gcc and clang as usual don't make code as bad as MSVC; they make the same asm they do for select(), which is still not good but at least trying to help them doesn't make it worse like with MSVC.

在我非常有限的测试中，| 和 & 似乎比 || 和 && 用于 MSVC 和 ICC.使用编译器 + 编译选项查看您自己代码的编译器输出，看看会发生什么.

In my very limited testing, | and & seem to work better than || and && for MSVC and ICC. Look at the compiler output for your own code with your compiler + compile options to see what happens.

int select_bitand(bool a, bool b, int x, int y) { return (a&b) ? x : y; }

Gcc 仍然在两个输入的单独 test 上单独分支，代码与 select 的其他版本相同.clang 仍然有两个单独的 test/cmov，与其他源版本相同.

Gcc still branches separately on separate tests of the two inputs, same code as the other versions of select. clang still does two separate test/cmov, same asm as for the other source versions.

MSVC 通过并正确优化，击败了所有其他编译器(至少在独立定义中):

MSVC comes through and optimizes correctly, beating all the other compilers (at least in the stand-alone definition):

select_bitand PROC ;; MSVC test cl, dl ;; ZF = !(a & b) cmovne r9d, r8d mov eax, r9d ;; could have done the mov to eax in parallel with the test, off the critical path, but close enough. ret 0

ICC18 浪费了两条 movzx 指令，将 bools 零扩展到 int，然后生成与 MSVC 相同的代码

ICC18 wastes two movzx instructions zero-extending the bools to int, but then makes the same code as MSVC

select_bitand: ## ICC18 movzx edi, dil #16.49 movzx esi, sil #16.49 test edi, esi #17.15 cmovne ecx, edx #17.15 mov eax, ecx #17.15 ret #17.15

相关文章