高效的无符号到有符号转换,避免实现定义的行为
我想定义一个函数,它接受一个 unsigned int
作为参数并返回一个 int
同余模 UINT_MAX+1 到参数.
I want to define a function that takes an unsigned int
as argument and returns an int
congruent modulo UINT_MAX+1 to the argument.
第一次尝试可能如下所示:
A first attempt might look like this:
int unsigned_to_signed(unsigned n)
{
return static_cast<int>(n);
}
但正如任何语言律师所知,对于大于 INT_MAX 的值,从无符号转换为有符号是实现定义的.
But as any language lawyer knows, casting from unsigned to signed for values larger than INT_MAX is implementation-defined.
我想实现这一点,以便 (a) 它仅依赖于规范规定的行为;(b) 它可以在任何现代机器上编译成无操作并优化编译器.
I want to implement this such that (a) it only relies on behavior mandated by the spec; and (b) it compiles into a no-op on any modern machine and optimizing compiler.
至于奇怪的机器......如果没有有符号的整数与无符号整数模UINT_MAX + 1一致,假设我想抛出一个异常.如果有多个(我不确定这是否可能),假设我想要最大的一个.
As for bizarre machines... If there is no signed int congruent modulo UINT_MAX+1 to the unsigned int, let's say I want to throw an exception. If there is more than one (I am not sure this is possible), let's say I want the largest one.
好的,第二次尝试:
int unsigned_to_signed(unsigned n)
{
int int_n = static_cast<int>(n);
if (n == static_cast<unsigned>(int_n))
return int_n;
// else do something long and complicated
}
当我不在典型的二进制补码系统上时,我不太关心效率,因为在我看来这不太可能.如果我的代码成为 2050 年无处不在的符号-幅度系统的瓶颈,那么我敢打赌,届时有人可以解决这个问题并对其进行优化.
I do not much care about the efficiency when I am not on a typical twos-complement system, since in my humble opinion that is unlikely. And if my code becomes a bottleneck on the omnipresent sign-magnitude systems of 2050, well, I bet someone can figure that out and optimize it then.
现在,第二次尝试非常接近我想要的.尽管转换为 int
是针对某些输入的实现定义的,但标准保证转换回 unsigned
以保留模 UINT_MAX+1 的值.所以条件确实检查了我想要什么,它在我可能遇到的任何系统上都不会编译.
Now, this second attempt is pretty close to what I want. Although the cast to int
is implementation-defined for some inputs, the cast back to unsigned
is guaranteed by the standard to preserve the value modulo UINT_MAX+1. So the conditional does check exactly what I want, and it will compile into nothing on any system I am likely to encounter.
但是...我仍在转换为 int
而不首先检查它是否会调用实现定义的行为.在 2050 年的某个假设系统上,它可以做谁知道什么.所以假设我想避免这种情况.
However... I am still casting to int
without first checking whether it will invoke implementation-defined behavior. On some hypothetical system in 2050 it could do who-knows-what. So let's say I want to avoid that.
问题:我的第三次尝试"应该是什么样子?
Question: What should my "third attempt" look like?
回顾一下,我想:
- 从无符号整数转换为有符号整数
- 保留值 mod UINT_MAX+1
- 仅调用标准强制行为
- 在具有优化编译器的典型二进制补码机器上编译为空操作
[更新]
让我举一个例子来说明为什么这不是一个微不足道的问题.
Let me give an example to show why this is not a trivial question.
考虑具有以下属性的假设 C++ 实现:
Consider a hypothetical C++ implementation with the following properties:
sizeof(int)
等于 4sizeof(unsigned)
等于 4INT_MAX
等于 32767INT_MIN
等于 -232 + 32768UINT_MAX
等于 232 - 1int
上的算术是模 232(在INT_MIN
到INT_MAX
范围内)std::numeric_limits<int>::is_modulo
为真- 将无符号
n
转换为 int 会保留 0 <= n <= 32767 的值,否则会产生 0
sizeof(int)
equals 4sizeof(unsigned)
equals 4INT_MAX
equals 32767INT_MIN
equals -232 + 32768UINT_MAX
equals 232 - 1- Arithmetic on
int
is modulo 232 (into the rangeINT_MIN
throughINT_MAX
) std::numeric_limits<int>::is_modulo
is true- Casting unsigned
n
to int preserves the value for 0 <= n <= 32767 and yields zero otherwise
在这个假设的实现中,每个 unsigned
值都有一个 int
值全等 (mod UINT_MAX+1).所以我的问题会很明确.
On this hypothetical implementation, there is exactly one int
value congruent (mod UINT_MAX+1) to each unsigned
value. So my question would be well-defined.
我声称这个假设的 C++ 实现完全符合 C++98、C++03 和 C++11 规范.我承认我没有记住所有的每一个字……但我相信我已经仔细阅读了相关部分.因此,如果您希望我接受您的回答,您必须 (a) 引用排除此假设实现的规范或 (b) 正确处理它.
I claim that this hypothetical C++ implementation fully conforms to the C++98, C++03, and C++11 specifications. I admit I have not memorized every word of all of them... But I believe I have read the relevant sections carefully. So if you want me to accept your answer, you either must (a) cite a spec that rules out this hypothetical implementation or (b) handle it correctly.
确实,正确答案必须处理标准允许的每一个假设实现.根据定义,这就是仅调用标准强制行为"的含义.
Indeed, a correct answer must handle every hypothetical implementation permitted by the standard. That is what "invoke only standard-mandated behavior" means, by definition.
顺便提一下,由于多种原因,std::numeric_limits<int>::is_modulo
在这里完全没用.一方面,它可以是 true
,即使 unsigned-to-signed 强制转换不适用于大的无符号值.另一方面,如果算术只是对整个整数范围取模,即使在补码或符号幅度系统上也可以是 true
.等等.如果你的答案依赖于is_modulo
,那就错了.
Incidentally, note that std::numeric_limits<int>::is_modulo
is utterly useless here for multiple reasons. For one thing, it can be true
even if unsigned-to-signed casts do not work for large unsigned values. For another, it can be true
even on one's-complement or sign-magnitude systems, if arithmetic is simply modulo the entire integer range. And so on. If your answer depends on is_modulo
, it's wrong.
[更新 2]
hvd 的回答 教会了我一些东西:我假设的整数 C++ 实现不允许现代 C.C99 和 C11 标准对有符号整数的表示非常具体;实际上,它们只允许补码、补码和符号幅度(第 6.2.6.2 节第 (2) 节;).
hvd's answer taught me something: My hypothetical C++ implementation for integers is not permitted by modern C. The C99 and C11 standards are very specific about the representation of signed integers; indeed, they only permit twos-complement, ones-complement, and sign-magnitude (section 6.2.6.2 paragraph (2); ).
但 C++ 不是 C.事实证明,这个事实是我问题的核心.
But C++ is not C. As it turns out, this fact lies at the very heart of my question.
最初的 C++98 标准基于更老的 C89,它说(第 3.1.2.5 节):
The original C++98 standard was based on the much older C89, which says (section 3.1.2.5):
对于每个有符号整数类型,都有一个对应的(但不同的)无符号整数类型(用关键字指定无符号),它使用相同的存储量(包括符号信息)并具有相同的对齐要求.的范围有符号整数类型的非负值是对应的无符号整数类型,以及每种类型的相同值都是相同的.
For each of the signed integer types, there is a corresponding (but different) unsigned integer type (designated with the keyword unsigned) that uses the same amount of storage (including sign information) and has the same alignment requirements. The range of nonnegative values of a signed integer type is a subrange of the corresponding unsigned integer type, and the representation of the same value in each type is the same.
C89 没有说明只有一个符号位或只允许二进制补码/一个补码/符号幅度.
C89 says nothing about only having one sign bit or only allowing twos-complement/ones-complement/sign-magnitude.
C++98 标准几乎一字不差地采用了这种语言(第 3.9.1 节第 (3) 节):
The C++98 standard adopted this language nearly verbatim (section 3.9.1 paragraph (3)):
对于每一种有符号整数类型,都存在一个对应的(但不同)无符号整数类型:unsigned char
"、unsignedshort int
"、unsigned int
"和unsigned long int
",每个它占用相同的存储量并具有相同的对齐方式要求(3.9)作为对应的有符号整数类型;那也就是说,每个 有符号整数 类型都具有与其对应的无符号整数类型.非负范围有符号整数类型的值是相应的子范围无符号整数类型,以及每个的值表示对应的有符号/无符号类型应相同.
For each of the signed integer types, there exists a corresponding (but different) unsigned integer type: "
unsigned char
", "unsigned short int
", "unsigned int
", and "unsigned long int
", each of which occupies the same amount of storage and has the same alignment requirements (3.9) as the corresponding signed integer type ; that is, each signed integer type has the same object representation as its corresponding unsigned integer type. The range of nonnegative values of a signed integer type is a subrange of the corresponding unsigned integer type, and the value representation of each corresponding signed/unsigned type shall be the same.
C++03 标准使用与 C++11 基本相同的语言.
The C++03 standard uses essentially identical language, as does C++11.
据我所知,没有标准 C++ 规范将其有符号整数表示限制为任何 C 规范.并且没有任何东西要求单个符号位或任何类似的东西.它只是说非负有符号整数必须是相应无符号的子范围.
No standard C++ spec constrains its signed integer representations to any C spec, as far as I can tell. And there is nothing mandating a single sign bit or anything of the kind. All it says is that non-negative signed integers must be a subrange of the corresponding unsigned.
所以,我再次声明 INT_MAX=32767 和 INT_MIN=-232+32768 是允许的.如果您的答案假设不是这样,那么除非您引用 C++ 标准证明我错了,否则它是不正确的.
So, again I claim that INT_MAX=32767 with INT_MIN=-232+32768 is permitted. If your answer assumes otherwise, it is incorrect unless you cite a C++ standard proving me wrong.
推荐答案
扩展user71404的回答:
Expanding on user71404's answer:
int f(unsigned x)
{
if (x <= INT_MAX)
return static_cast<int>(x);
if (x >= INT_MIN)
return static_cast<int>(x - INT_MIN) + INT_MIN;
throw x; // Or whatever else you like
}
如果 x >= INT_MIN
(记住促销规则,INT_MIN
会被转换为 unsigned
),那么 x- INT_MIN <= INT_MAX
,所以这不会有任何溢出.
If x >= INT_MIN
(keep the promotion rules in mind, INT_MIN
gets converted to unsigned
), then x - INT_MIN <= INT_MAX
, so this won't have any overflow.
如果这不明显,请查看声明如果 x >= -4u
, then x + 4 <= 3
.",以及请记住,INT_MAX
将至少等于 -INT_MIN - 1 的数学值.
If that is not obvious, take a look at the claim "If x >= -4u
, then x + 4 <= 3
.", and keep in mind that INT_MAX
will be equal to at least the mathematical value of -INT_MIN - 1.
在最常见的系统上,!(x <= INT_MAX)
意味着 x >= INT_MIN
,优化器应该能够(在我的系统上,能够)删除第二个检查,确定两个 return
语句可以编译为相同的代码,并删除第一个检查.生成的汇编列表:
On the most common systems, where !(x <= INT_MAX)
implies x >= INT_MIN
, the optimizer should be able (and on my system, is able) to remove the second check, determine that the two return
statements can be compiled to the same code, and remove the first check too. Generated assembly listing:
__Z1fj:
LFB6:
.cfi_startproc
movl 4(%esp), %eax
ret
.cfi_endproc
您问题中的假设实现:
- INT_MAX 等于 32767
- INT_MIN 等于 -232 + 32768
- INT_MAX equals 32767
- INT_MIN equals -232 + 32768
是不可能的,所以不需要特别考虑.INT_MIN
将等于 -INT_MAX
或 -INT_MAX - 1
.这遵循 C 对整数类型的表示(6.2.6.2),它要求 n
位是值位,一位是符号位,并且只允许一个单一的陷阱表示(不包括由于填充位而无效),即表示负零/-INT_MAX - 1
的那个.C++ 不允许任何整数表示超出 C 允许的范围.
is not possible, so does not need special consideration. INT_MIN
will be equal to either -INT_MAX
, or to -INT_MAX - 1
. This follows from C's representation of integer types (6.2.6.2), which requires n
bits to be value bits, one bit to be a sign bit, and only allows one single trap representation (not including representations that are invalid because of padding bits), namely the one that would otherwise represent negative zero / -INT_MAX - 1
. C++ doesn't allow any integer representations beyond what C allows.
更新:微软的编译器显然没有注意到 x >10
和 x >= 11
测试相同的东西.仅当 x >= INT_MIN
被替换为 x > 时,它才会生成所需的代码.INT_MIN - 1u
,它可以将其检测为 x <= INT_MAX
的否定(在此平台上).
Update: Microsoft's compiler apparently does not notice that x > 10
and x >= 11
test the same thing. It only generates the desired code if x >= INT_MIN
is replaced with x > INT_MIN - 1u
, which it can detect as the negation of x <= INT_MAX
(on this platform).
[来自提问者 (Nemo) 的更新,详细说明我们在下面的讨论]
[Update from questioner (Nemo), elaborating on our discussion below]
我现在相信这个答案适用于所有情况,但原因很复杂.我很可能会奖励这个解决方案,但我想捕捉所有血淋淋的细节以防万一.
I now believe this answer works in all cases, but for complicated reasons. I am likely to award the bounty to this solution, but I want to capture all the gory details in case anybody cares.
让我们从 C++11 第 18.3.3 节开始:
Let's start with C++11, section 18.3.3:
表 31 描述了标题 <climits>
.
Table 31 describes the header
<climits>
.
...
内容与标准C库头文件<limits.h>
相同.
The contents are the same as the Standard C library header <limits.h>
.
这里,标准 C"是指 C99,其规范严格限制了有符号整数的表示.它们就像无符号整数,但有一位专用于符号",零位或多位专用于填充".填充位对整数的值没有贡献,符号位仅作为二进制补码、二进制补码或符号大小起作用.
Here, "Standard C" means C99, whose specification severely constrains the representation of signed integers. They are just like unsigned integers, but with one bit dedicated to "sign" and zero or more bits dedicated to "padding". The padding bits do not contribute to the value of the integer, and the sign bit contributes only as twos-complement, ones-complement, or sign-magnitude.
由于 C++11 继承了 C99 的
宏,因此 INT_MIN 是 -INT_MAX 或 -INT_MAX-1,并且 hvd 的代码保证可以工作.(请注意,由于填充,INT_MAX 可能比 UINT_MAX/2 小得多......但是由于有符号->无符号转换的工作方式,这个答案处理得很好.)
Since C++11 inherits the <climits>
macros from C99, INT_MIN is either -INT_MAX or -INT_MAX-1, and hvd's code is guaranteed to work. (Note that, due to the padding, INT_MAX could be much less than UINT_MAX/2... But thanks to the way signed->unsigned casts work, this answer handles that fine.)
C++03/C++98 比较复杂.它使用相同的措辞从Standard C"继承<climits>
,但现在Standard C"表示C89/C90.
C++03/C++98 is trickier. It uses the same wording to inherit <climits>
from "Standard C", but now "Standard C" means C89/C90.
所有这些――C++98、C++03、C89/C90――都有我在我的问题中给出的措辞,但也包括这个(C++03 第 3.9.1 节第 7 段):
All of these -- C++98, C++03, C89/C90 -- have the wording I give in my question, but also include this (C++03 section 3.9.1 paragraph 7):
整数类型的表示应通过使用纯二进制计数系统.(44) [示例:this International标准允许 2 的补码、1 的补码和有符号幅度整数类型的表示.]
The representations of integral types shall define values by use of a pure binary numeration system.(44) [Example: this International Standard permits 2’s complement, 1’s complement and signed magnitude representations for integral types.]
脚注(44)定义纯二进制记数系统":
Footnote (44) defines "pure binary numeration system":
使用二进制数字 0 的整数的位置表示和 1,其中由连续位表示的值是加法,从 1 开始,乘以连续积分2 的幂次方,可能位置最高的位除外.
A positional representation for integers that uses the binary digits 0 and 1, in which the values represented by successive bits are additive, begin with 1, and are multiplied by successive integral power of 2, except perhaps for the bit with the highest position.
这个措辞的有趣之处在于它自相矛盾,因为纯二进制计数系统"的定义不允许符号/大小表示!它确实允许高位具有值 -2n-1 (二进制补码)或 -(2n-1-1) (二进制补码).但是导致符号/大小的高位没有任何价值.
What is interesting about this wording is that it contradicts itself, because the definition of "pure binary numeration system" does not permit a sign/magnitude representation! It does allow the high bit to have, say, the value -2n-1 (twos complement) or -(2n-1-1) (ones complement). But there is no value for the high bit that results in sign/magnitude.
反正我的假设实现"在这个定义下不符合纯二进制"的条件,所以排除了.
Anyway, my "hypothetical implementation" does not qualify as "pure binary" under this definition, so it is ruled out.
然而,高位是特殊的这一事实意味着我们可以想象它贡献了任何价值:小的正值、巨大的正值、小的负值或巨大的负值.(如果符号位可以贡献 -(2n-1-1),为什么不能 -(2n-1-2)?等等)
However, the fact that the high bit is special means we can imagine it contributing any value at all: A small positive value, huge positive value, small negative value, or huge negative value. (If the sign bit can contribute -(2n-1-1), why not -(2n-1-2)? etc.)
所以,让我们想象一个有符号整数表示,它为符号"位分配一个古怪的值.
So, let's imagine a signed integer representation that assigns a wacky value to the "sign" bit.
符号位的一个小的正值将导致 int
的正范围(可能与 unsigned
一样大),并且 hvd 的代码处理得很好.
A small positive value for the sign bit would result in a positive range for int
(possibly as large as unsigned
), and hvd's code handles that just fine.
符号位的巨大正值会导致 int
的最大值大于 unsigned
,这是被禁止的.
A huge positive value for the sign bit would result in int
having a maximum larger than unsigned
, which is is forbidden.
符号位的巨大负值将导致 int
表示不连续的值范围,而规范中的其他措辞规定了这一点.
A huge negative value for the sign bit would result in int
representing a non-contiguous range of values, and other wording in the spec rules that out.
最后,一个贡献小的负数的符号位怎么样?我们可以在符号位"中有一个 1,例如 -37 对 int 的值有贡献吗?那么 INT_MAX 将是(比如说)231-1 而 INT_MIN 将是 -37?
Finally, how about a sign bit that contributes a small negative quantity? Could we have a 1 in the "sign bit" contribute, say, -37 to the value of the int? So then INT_MAX would be (say) 231-1 and INT_MIN would be -37?
这将导致某些数字具有两种表示形式...但是补码将两种表示形式归零,根据示例",这是允许的.规范中没有任何地方说零是可能有两种表示形式的 only 整数.所以我认为这个新假设是规范允许的.
This would result in some numbers having two representations... But ones-complement gives two representations to zero, and that is allowed according to the "Example". Nowhere does the spec say that zero is the only integer that might have two representations. So I think this new hypothetical is allowed by the spec.
确实,从 -1 到 -INT_MAX-1
的任何负值似乎都可以作为符号位"的值,但不能更小(以免范围不连续).换句话说,INT_MIN
可能是从 -INT_MAX-1
到 -1 的任何值.
Indeed, any negative value from -1 down to -INT_MAX-1
appears to be permissible as a value for the "sign bit", but nothing smaller (lest the range be non-contiguous). In other words, INT_MIN
might be anything from -INT_MAX-1
to -1.
现在,你猜怎么着?对于 hvd 代码中的第二次强制转换以避免实现定义的行为,我们只需要 x - (unsigned)INT_MIN
小于或等于 INT_MAX
.我们刚刚展示了 INT_MIN
至少是 -INT_MAX-1
.显然,x
最多是 UINT_MAX
.将负数转换为无符号与添加 UINT_MAX+1
相同.把它们放在一起:
Now, guess what? For the second cast in hvd's code to avoid implementation-defined behavior, we just need x - (unsigned)INT_MIN
less than or equal to INT_MAX
. We just showed INT_MIN
is at least -INT_MAX-1
. Obviously, x
is at most UINT_MAX
. Casting a negative number to unsigned is the same as adding UINT_MAX+1
. Put it all together:
x - (unsigned)INT_MIN <= INT_MAX
当且仅当
UINT_MAX - (INT_MIN + UINT_MAX + 1) <= INT_MAX
-INT_MIN-1 <= INT_MAX
-INT_MIN <= INT_MAX+1
INT_MIN >= -INT_MAX-1
最后是我们刚刚展示的,所以即使在这种反常的情况下,代码也确实有效.
That last is what we just showed, so even in this perverse case, the code actually works.
这耗尽了所有的可能性,从而结束了这个极其学术的练习.
That exhausts all of the possibilities, thus ending this extremely academic exercise.
底线:在 C++98/C++03 继承的 C89/C90 中有符号整数存在一些严重未指定的行为.它在 C99 中已修复,C++11 通过合并 C99 中的
间接继承了该修复.但即使是 C++11 也保留了自相矛盾的纯二进制表示"措辞……
Bottom line: There is some seriously under-specified behavior for signed integers in C89/C90 that got inherited by C++98/C++03. It is fixed in C99, and C++11 indirectly inherits the fix by incorporating <limits.h>
from C99. But even C++11 retains the self-contradictory "pure binary representation" wording...
相关文章