有效的无符号到有符号转换避免实现定义的行为
我想定义一个函数,它接受一个 unsigned int
作为参数,并返回一个 int
全等模 UINT_MAX+1 给参数.
I want to define a function that takes an unsigned int
as argument and returns an int
congruent modulo UINT_MAX+1 to the argument.
第一次尝试可能如下所示:
A first attempt might look like this:
int unsigned_to_signed(unsigned n)
{
return static_cast<int>(n);
}
但任何语言律师都知道,对于大于 INT_MAX 的值,从无符号转换为有符号是实现定义的.
But as any language lawyer knows, casting from unsigned to signed for values larger than INT_MAX is implementation-defined.
我想实现这一点,以便 (a) 它仅依赖于规范规定的行为;(b) 它在任何现代机器和优化编译器上编译为无操作.
I want to implement this such that (a) it only relies on behavior mandated by the spec; and (b) it compiles into a no-op on any modern machine and optimizing compiler.
至于奇怪的机器......如果无符号整数没有符号整数全等模UINT_MAX+1,假设我想抛出异常.如果有多个(我不确定这是否可能),假设我想要最大的一个.
As for bizarre machines... If there is no signed int congruent modulo UINT_MAX+1 to the unsigned int, let's say I want to throw an exception. If there is more than one (I am not sure this is possible), let's say I want the largest one.
好的,第二次尝试:
int unsigned_to_signed(unsigned n)
{
int int_n = static_cast<int>(n);
if (n == static_cast<unsigned>(int_n))
return int_n;
// else do something long and complicated
}
当我不在典型的二进制补码系统上时,我不太关心效率,因为以我的拙见,这不太可能.如果我的代码成为 2050 年无所不在的符号幅度系统的瓶颈,那么我敢打赌有人可以解决这个问题并进行优化.
I do not much care about the efficiency when I am not on a typical twos-complement system, since in my humble opinion that is unlikely. And if my code becomes a bottleneck on the omnipresent sign-magnitude systems of 2050, well, I bet someone can figure that out and optimize it then.
现在,第二次尝试与我想要的非常接近.尽管对某些输入的转换为 int
是实现定义的,但标准保证转换回 unsigned
以保留模 UINT_MAX+1 的值.因此,条件确实会检查我想要的内容,并且在我可能遇到的任何系统上它都不会编译成任何内容.
Now, this second attempt is pretty close to what I want. Although the cast to int
is implementation-defined for some inputs, the cast back to unsigned
is guaranteed by the standard to preserve the value modulo UINT_MAX+1. So the conditional does check exactly what I want, and it will compile into nothing on any system I am likely to encounter.
然而......我仍然在转换到 int
而不先检查它是否会调用实现定义的行为.在 2050 年的某个假设系统上,它可以做谁知道呢.所以假设我想避免这种情况.
However... I am still casting to int
without first checking whether it will invoke implementation-defined behavior. On some hypothetical system in 2050 it could do who-knows-what. So let's say I want to avoid that.
问题:我的第三次尝试"应该是什么样的?
Question: What should my "third attempt" look like?
总结一下,我想:
- 从无符号整数转换为有符号整数
- 保留值 mod UINT_MAX+1
- 仅调用标准规定的行为
- 在带有优化编译器的典型二进制补码机器上编译为无操作
[更新]
让我举个例子来说明为什么这不是一个微不足道的问题.
Let me give an example to show why this is not a trivial question.
考虑具有以下属性的假设 C++ 实现:
Consider a hypothetical C++ implementation with the following properties:
sizeof(int)
等于 4sizeof(unsigned)
等于 4INT_MAX
等于 32767INT_MIN
等于 -232 + 32768UINT_MAX
等于 232 - 1int
上的算术是模 232(在INT_MIN
到INT_MAX
范围内)std::numeric_limits
为真::is_modulo - 将无符号
n
转换为 int 会保留 0 <= n <= 32767 的值,否则产生 零
sizeof(int)
equals 4sizeof(unsigned)
equals 4INT_MAX
equals 32767INT_MIN
equals -232 + 32768UINT_MAX
equals 232 - 1- Arithmetic on
int
is modulo 232 (into the rangeINT_MIN
throughINT_MAX
) std::numeric_limits<int>::is_modulo
is true- Casting unsigned
n
to int preserves the value for 0 <= n <= 32767 and yields zero otherwise
在这个假设的实现中,每个 unsigned
值恰好有一个 int
值一致 (mod UINT_MAX+1).所以我的问题会很明确.
On this hypothetical implementation, there is exactly one int
value congruent (mod UINT_MAX+1) to each unsigned
value. So my question would be well-defined.
我声称这个假设的 C++ 实现完全符合 C++98、C++03 和 C++11 规范.我承认我没有记住所有的单词……但我相信我已经仔细阅读了相关部分.因此,如果您希望我接受您的回答,您要么必须 (a) 引用排除此假设实现的规范,要么 (b) 正确处理它.
I claim that this hypothetical C++ implementation fully conforms to the C++98, C++03, and C++11 specifications. I admit I have not memorized every word of all of them... But I believe I have read the relevant sections carefully. So if you want me to accept your answer, you either must (a) cite a spec that rules out this hypothetical implementation or (b) handle it correctly.
确实,正确的答案必须处理每个标准允许的假设实现.根据定义,这就是仅调用标准规定的行为"的含义.
Indeed, a correct answer must handle every hypothetical implementation permitted by the standard. That is what "invoke only standard-mandated behavior" means, by definition.
顺便说一下,请注意 std::numeric_limits
在这里完全没用,原因有很多.一方面,它可以是 true
,即使 unsigned-to-signed 转换不适用于大的无符号值.另一方面,如果算术只是对整个整数范围取模,则它甚至可以在补码或符号大小系统上为 true
.等等.如果您的答案取决于 is_modulo
,那就错了.
Incidentally, note that std::numeric_limits<int>::is_modulo
is utterly useless here for multiple reasons. For one thing, it can be true
even if unsigned-to-signed casts do not work for large unsigned values. For another, it can be true
even on one's-complement or sign-magnitude systems, if arithmetic is simply modulo the entire integer range. And so on. If your answer depends on is_modulo
, it's wrong.
[更新 2]
hvd 的回答教会了我一些东西:我假设的整数 C++ 实现不允许现代 C.C99 和 C11 标准对带符号整数的表示非常具体;实际上,它们只允许二进制补码、二进制补码和符号大小(第 6.2.6.2 节第 (2) 段;).
hvd's answer taught me something: My hypothetical C++ implementation for integers is not permitted by modern C. The C99 and C11 standards are very specific about the representation of signed integers; indeed, they only permit twos-complement, ones-complement, and sign-magnitude (section 6.2.6.2 paragraph (2); ).
但 C++ 不是 C.事实证明,这个事实是我问题的核心.
But C++ is not C. As it turns out, this fact lies at the very heart of my question.
最初的 C++98 标准基于更老的 C89,它说(第 3.1.2.5 节):
The original C++98 standard was based on the much older C89, which says (section 3.1.2.5):
对于每个有符号整数类型,都有一个对应的(但不同的)无符号整数类型(用关键字指定无符号)使用相同数量的存储(包括符号信息)并具有相同的对齐要求.范围有符号整数类型的非负值是对应的无符号整数类型,以及每种类型中的相同值是相同的.
For each of the signed integer types, there is a corresponding (but different) unsigned integer type (designated with the keyword unsigned) that uses the same amount of storage (including sign information) and has the same alignment requirements. The range of nonnegative values of a signed integer type is a subrange of the corresponding unsigned integer type, and the representation of the same value in each type is the same.
C89 没有说明只有一个符号位或只允许二进制补码/一个补码/符号大小.
C89 says nothing about only having one sign bit or only allowing twos-complement/ones-complement/sign-magnitude.
C++98 标准几乎逐字采用了这种语言(第 3.9.1 节第 (3) 段):
The C++98 standard adopted this language nearly verbatim (section 3.9.1 paragraph (3)):
对于每一种有符号整数类型,都存在一个对应的(但不同)无符号整数类型:unsigned char
"、unsignedshort int
"、unsigned int
"和unsigned long int
",每一个占用相同的存储量并具有相同的对齐方式要求(3.9)为对应的有符号整数类型;那也就是说,每个有符号整数类型都具有相同的对象表示它对应的无符号整数类型.非负的范围有符号整数类型的值是相应的子范围无符号整数类型,以及每个的值表示对应的有符号/无符号类型应相同.
For each of the signed integer types, there exists a corresponding (but different) unsigned integer type: "
unsigned char
", "unsigned short int
", "unsigned int
", and "unsigned long int
", each of which occupies the same amount of storage and has the same alignment requirements (3.9) as the corresponding signed integer type ; that is, each signed integer type has the same object representation as its corresponding unsigned integer type. The range of nonnegative values of a signed integer type is a subrange of the corresponding unsigned integer type, and the value representation of each corresponding signed/unsigned type shall be the same.
C++03 标准使用基本相同的语言,C++11 也是如此.
The C++03 standard uses essentially identical language, as does C++11.
据我所知,没有标准的 C++ 规范将其有符号整数表示限制为任何 C 规范.并且没有强制要求单符号位或任何类型的东西.它只是说非负有符号整数必须是相应无符号整数的子范围.
No standard C++ spec constrains its signed integer representations to any C spec, as far as I can tell. And there is nothing mandating a single sign bit or anything of the kind. All it says is that non-negative signed integers must be a subrange of the corresponding unsigned.
所以,我再次声明 INT_MAX=32767 和 INT_MIN=-232+32768 是允许的.如果您的答案另有假设,则它是不正确的,除非您引用 C++ 标准证明我错了.
So, again I claim that INT_MAX=32767 with INT_MIN=-232+32768 is permitted. If your answer assumes otherwise, it is incorrect unless you cite a C++ standard proving me wrong.
推荐答案
扩展 user71404 的回答:
Expanding on user71404's answer:
int f(unsigned x)
{
if (x <= INT_MAX)
return static_cast<int>(x);
if (x >= INT_MIN)
return static_cast<int>(x - INT_MIN) + INT_MIN;
throw x; // Or whatever else you like
}
如果 x >= INT_MIN
(记住促销规则,INT_MIN
被转换为 unsigned
),那么 x- INT_MIN <= INT_MAX
,所以这不会有任何溢出.
If x >= INT_MIN
(keep the promotion rules in mind, INT_MIN
gets converted to unsigned
), then x - INT_MIN <= INT_MAX
, so this won't have any overflow.
如果这不明显,请看一下声明If x >= -4u
, then x + 4 <= 3
.",和请记住,INT_MAX
将至少等于 -INT_MIN - 1 的数学值.
If that is not obvious, take a look at the claim "If x >= -4u
, then x + 4 <= 3
.", and keep in mind that INT_MAX
will be equal to at least the mathematical value of -INT_MIN - 1.
在最常见的系统上,!(x <= INT_MAX)
意味着 x >= INT_MIN
,优化器应该能够(在我的系统上,可以)去掉第二个检查,确定两个return
语句可以编译成相同的代码,也去掉第一个检查.生成的程序集列表:
On the most common systems, where !(x <= INT_MAX)
implies x >= INT_MIN
, the optimizer should be able (and on my system, is able) to remove the second check, determine that the two return
statements can be compiled to the same code, and remove the first check too. Generated assembly listing:
__Z1fj:
LFB6:
.cfi_startproc
movl 4(%esp), %eax
ret
.cfi_endproc
您问题中的假设实现:
- INT_MAX 等于 32767
- INT_MIN 等于 -232 + 32768
- INT_MAX equals 32767
- INT_MIN equals -232 + 32768
是不可能的,所以不需要特别考虑.INT_MIN
将等于 -INT_MAX
或 -INT_MAX - 1
.这遵循 C 对整数类型的表示 (6.2.6.2),它要求 n
位是值位,一位是符号位,并且只允许一个单一的陷阱表示(不包括表示由于填充位而无效),即表示负零/-INT_MAX - 1
的那个.C++ 不允许任何超出 C 允许的整数表示.
is not possible, so does not need special consideration. INT_MIN
will be equal to either -INT_MAX
, or to -INT_MAX - 1
. This follows from C's representation of integer types (6.2.6.2), which requires n
bits to be value bits, one bit to be a sign bit, and only allows one single trap representation (not including representations that are invalid because of padding bits), namely the one that would otherwise represent negative zero / -INT_MAX - 1
. C++ doesn't allow any integer representations beyond what C allows.
更新:微软的编译器显然没有注意到 x >10
和 x >= 11
测试相同的东西.如果 x >= INT_MIN
被替换为 x > ,它只会生成所需的代码.INT_MIN - 1u
,它可以检测为 x <= INT_MAX
的否定(在此平台上).
Update: Microsoft's compiler apparently does not notice that x > 10
and x >= 11
test the same thing. It only generates the desired code if x >= INT_MIN
is replaced with x > INT_MIN - 1u
, which it can detect as the negation of x <= INT_MAX
(on this platform).
[来自提问者 (Nemo) 的更新,详细说明我们在下面的讨论]
[Update from questioner (Nemo), elaborating on our discussion below]
我现在相信这个答案适用于所有情况,但原因很复杂.我很可能会为此解决方案提供赏金,但我想捕获所有血腥细节,以防万一有人关心.
I now believe this answer works in all cases, but for complicated reasons. I am likely to award the bounty to this solution, but I want to capture all the gory details in case anybody cares.
让我们从 C++11 的 18.3.3 节开始:
Let's start with C++11, section 18.3.3:
表 31 描述了标题
.
Table 31 describes the header
<climits>
.
...
内容与标准C库头文件
相同.
The contents are the same as the Standard C library header <limits.h>
.
这里,标准 C"是指 C99,它的规范严格限制了有符号整数的表示.它们就像无符号整数,但有一位专用于符号",零位或多位专用于填充".填充位不影响整数的值,符号位仅作为二进制补码、一补码或符号大小起作用.
Here, "Standard C" means C99, whose specification severely constrains the representation of signed integers. They are just like unsigned integers, but with one bit dedicated to "sign" and zero or more bits dedicated to "padding". The padding bits do not contribute to the value of the integer, and the sign bit contributes only as twos-complement, ones-complement, or sign-magnitude.
由于 C++11 继承了 C99 的
宏,INT_MIN 要么是 -INT_MAX 要么是 -INT_MAX-1,hvd 的代码保证可以工作.(请注意,由于填充,INT_MAX 可能比 UINT_MAX/2 小得多......但由于有符号-> 无符号强制转换的工作方式,这个答案处理得很好.)
Since C++11 inherits the <climits>
macros from C99, INT_MIN is either -INT_MAX or -INT_MAX-1, and hvd's code is guaranteed to work. (Note that, due to the padding, INT_MAX could be much less than UINT_MAX/2... But thanks to the way signed->unsigned casts work, this answer handles that fine.)
C++03/C++98 比较棘手.它使用相同的措辞从标准C"继承
,但现在标准C"意味着C89/C90.
C++03/C++98 is trickier. It uses the same wording to inherit <climits>
from "Standard C", but now "Standard C" means C89/C90.
所有这些――C++98、C++03、C89/C90――都有我在我的问题中给出的措辞,但也包括这个(C++03 第 3.9.1 节第 7 段):
All of these -- C++98, C++03, C89/C90 -- have the wording I give in my question, but also include this (C++03 section 3.9.1 paragraph 7):
整数类型的表示应通过使用纯二进制计数系统.(44) [示例:这个国际标准允许 2 的补码、1 的补码和有符号幅度整数类型的表示.]
The representations of integral types shall define values by use of a pure binary numeration system.(44) [Example: this International Standard permits 2’s complement, 1’s complement and signed magnitude representations for integral types.]
脚注 (44) 定义了纯二进制计数系统":
Footnote (44) defines "pure binary numeration system":
使用二进制数字 0 的整数位置表示和 1,其中连续位表示的值是加法,从 1 开始,乘以连续积分2 的幂,除了位置最高的位.
A positional representation for integers that uses the binary digits 0 and 1, in which the values represented by successive bits are additive, begin with 1, and are multiplied by successive integral power of 2, except perhaps for the bit with the highest position.
这个措辞的有趣之处在于它自相矛盾,因为纯二进制计数系统"的定义不允许符号/大小表示!它确实允许高位具有值 -2n-1(二进制补码)或 -(2n-1-1)(一个补码).但是导致符号/大小的高位没有值.
What is interesting about this wording is that it contradicts itself, because the definition of "pure binary numeration system" does not permit a sign/magnitude representation! It does allow the high bit to have, say, the value -2n-1 (twos complement) or -(2n-1-1) (ones complement). But there is no value for the high bit that results in sign/magnitude.
无论如何,我的假设实现"不符合此定义下的纯二进制",因此排除.
Anyway, my "hypothetical implementation" does not qualify as "pure binary" under this definition, so it is ruled out.
然而,高位是特殊的这一事实意味着我们可以想象它贡献了任何价值:小的正值、巨大的正值、小的负值或巨大的负值.(如果符号位可以贡献 -(2n-1-1),为什么不 -(2n-1-2)?等)
However, the fact that the high bit is special means we can imagine it contributing any value at all: A small positive value, huge positive value, small negative value, or huge negative value. (If the sign bit can contribute -(2n-1-1), why not -(2n-1-2)? etc.)
所以,让我们想象一个有符号整数表示,它为符号"位分配一个古怪的值.
So, let's imagine a signed integer representation that assigns a wacky value to the "sign" bit.
符号位的小正值将导致 int
的正范围(可能与 unsigned
一样大),而 hvd 的代码处理得很好.
A small positive value for the sign bit would result in a positive range for int
(possibly as large as unsigned
), and hvd's code handles that just fine.
符号位的巨大正值将导致 int
的最大值大于 unsigned
,这是被禁止的.
A huge positive value for the sign bit would result in int
having a maximum larger than unsigned
, which is is forbidden.
符号位的巨大负值将导致 int
表示值的非连续范围,并且规范中的其他措辞将其排除在外.
A huge negative value for the sign bit would result in int
representing a non-contiguous range of values, and other wording in the spec rules that out.
最后,贡献一个小的负数的符号位怎么样?我们可以在符号位"中有一个 1 对 int 的值有贡献吗,比如说,-37?那么 INT_MAX 将是(比如说)231-1 而 INT_MIN 将是 -37?
Finally, how about a sign bit that contributes a small negative quantity? Could we have a 1 in the "sign bit" contribute, say, -37 to the value of the int? So then INT_MAX would be (say) 231-1 and INT_MIN would be -37?
这会导致一些数字有两种表示形式……但是ones-complement 将两种表示形式归零,根据示例",这是允许的.规范中没有任何地方说零是唯一 可能有两种表示形式的整数.所以我认为这个新的假设是规范允许的.
This would result in some numbers having two representations... But ones-complement gives two representations to zero, and that is allowed according to the "Example". Nowhere does the spec say that zero is the only integer that might have two representations. So I think this new hypothetical is allowed by the spec.
实际上,从 -1 到 -INT_MAX-1
的任何负值似乎都可以作为符号位"的值,但不能更小(以免范围不连续).换句话说,INT_MIN
可能是从 -INT_MAX-1
到 -1 的任何值.
Indeed, any negative value from -1 down to -INT_MAX-1
appears to be permissible as a value for the "sign bit", but nothing smaller (lest the range be non-contiguous). In other words, INT_MIN
might be anything from -INT_MAX-1
to -1.
现在,你猜怎么着?对于 hvd 代码中的第二个强制转换以避免实现定义的行为,我们只需要 x - (unsigned)INT_MIN
小于或等于 INT_MAX
.我们刚刚展示了 INT_MIN
至少是 -INT_MAX-1
.显然,x
至多是 UINT_MAX
.将负数转换为无符号数与添加 UINT_MAX+1
相同.把它们放在一起:
Now, guess what? For the second cast in hvd's code to avoid implementation-defined behavior, we just need x - (unsigned)INT_MIN
less than or equal to INT_MAX
. We just showed INT_MIN
is at least -INT_MAX-1
. Obviously, x
is at most UINT_MAX
. Casting a negative number to unsigned is the same as adding UINT_MAX+1
. Put it all together:
x - (unsigned)INT_MIN <= INT_MAX
当且仅当
UINT_MAX - (INT_MIN + UINT_MAX + 1) <= INT_MAX
-INT_MIN-1 <= INT_MAX
-INT_MIN <= INT_MAX+1
INT_MIN >= -INT_MAX-1
最后一个是我们刚刚展示的,所以即使在这种反常的情况下,代码也确实有效.
That last is what we just showed, so even in this perverse case, the code actually works.
这用尽了所有的可能性,从而结束了这个极其学术的练习.
That exhausts all of the possibilities, thus ending this extremely academic exercise.
底线:C89/C90 中的有符号整数存在一些严重未指定的行为,这些行为被 C++98/C++03 继承.它在 C99 中得到了修复,C++11 通过合并 C99 的
间接继承了这个修复.但即使是 C++11 也保留了自相矛盾的纯二进制表示"措辞...
Bottom line: There is some seriously under-specified behavior for signed integers in C89/C90 that got inherited by C++98/C++03. It is fixed in C99, and C++11 indirectly inherits the fix by incorporating <limits.h>
from C99. But even C++11 retains the self-contradictory "pure binary representation" wording...
相关文章