Why don't the C and C++ standards explicitly define char as signed or unsigned?

2021-12-25 00:00:00 c standards types compiler-construction c++
#include <stdbool.h> /* needed for bool in C before C23; harmless in C++ */

int main()
{
    char c = 0xff;
    bool b = 0xff == c;
    // Under most C/C++ compilers' default options, b is FALSE!!!
}

Neither the C nor the C++ standard specifies whether char is signed or unsigned; it is implementation-defined.

Why don't the C and C++ standards explicitly define char as signed or unsigned, to avoid dangerous misuses like the code above?

Recommended answer

Historical reasons, mostly.

Expressions of type char are promoted to int in most contexts (because a lot of CPUs don't have 8-bit arithmetic operations). On some systems, sign extension is the most efficient way to do this, which argues for making plain char signed.
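As a rough illustration of the promotion described above, the sketch below converts each of the three character types to int. The function names are hypothetical, and the comments assume CHAR_BIT == 8 and a two's-complement machine.

```c
#include <limits.h>

/* Each function returns its argument converted to int, which is the same
   value-preserving conversion the integer promotions apply to a char operand. */

/* signed char: negative values stay negative (sign extension). */
static int widen_signed(signed char c) { return c; }

/* unsigned char: the result is never negative (zero extension). */
static int widen_unsigned(unsigned char c) { return c; }

/* plain char: behaves like one of the two above, chosen by the implementation. */
static int widen_plain(char c) { return c; }

/* CHAR_MIN is negative exactly when plain char is signed. */
static int plain_char_sign_extends(void) { return CHAR_MIN < 0; }
```

On a typical implementation where plain char is signed, widen_plain applied to the bit pattern 0xff should give -1; where plain char is unsigned, it should give 255.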

On the other hand, the EBCDIC character set has basic characters with the high-order bit set (i.e., characters with values of 128 or greater); on EBCDIC platforms, char pretty much has to be unsigned.

The ANSI C Rationale (for the 1989 standard) doesn't have a lot to say on the subject; section 3.1.2.5 says:

Three types of char are specified: signed, plain, and unsigned. A plain char may be represented as either signed or unsigned, depending upon the implementation, as in prior practice. The type signed char was introduced to make available a one-byte signed integer type on those systems which implement plain char as unsigned. For reasons of symmetry, the keyword signed is allowed as part of the type name of other integral types.

Going back even further, an early version of the C Reference Manual from 1975 says:

A char object may be used anywhere an int may be. In all cases the char is converted to an int by propagating its sign through the upper 8 bits of the resultant integer. This is consistent with the two’s complement representation used for both characters and integers. (However, the sign-propagation feature disappears in other implementations.)

This description is more implementation-specific than what we see in later documents, but it does acknowledge that char may be either signed or unsigned. On the "other implementations" in which the sign-propagation feature "disappears", the promotion of a char object to int would have zero-extended the 8-bit representation, essentially treating it as an 8-bit unsigned quantity. (The language did not yet have the signed or unsigned keywords.)

C's immediate predecessor was a language called B. B was a typeless language, so the question of char being signed or unsigned did not apply. For more information about the early history of C, see the late Dennis Ritchie's home page.

As for what's happening in your code (applying modern C rules):

char c = 0xff;
bool b = 0xff == c;

If plain char is unsigned, then the initialization of c sets it to (char)0xff, which compares equal to 0xff in the second line. But if plain char is signed, then 0xff (an expression of type int) is converted to char, and since 0xff exceeds CHAR_MAX (assuming CHAR_BIT==8), the result is implementation-defined. In most implementations, the result is -1. In the comparison 0xff == c, both operands are converted to int, making it equivalent to 0xff == -1, or 255 == -1, which is of course false.
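One way to sidestep the problem is to convert the char to unsigned char before comparing. The sketch below is only an illustration: the function names are hypothetical, and it assumes CHAR_BIT == 8 and an implementation that converts 0xff to -1 when plain char is signed.

```c
#include <limits.h>
#include <stdbool.h>

/* CHAR_MIN is negative exactly when plain char is signed. */
static bool plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* The comparison from the question: c is promoted to int, so a signed
   char holding -1 does not compare equal to 255. */
static bool naive_equals_0xff(char c)
{
    return 0xff == c;
}

/* Converting to unsigned char first yields a value in 0..UCHAR_MAX
   regardless of whether plain char is signed. */
static bool portable_equals_0xff(char c)
{
    return 0xff == (unsigned char)c;
}
```

With c initialized from 0xff, portable_equals_0xff(c) should hold on either kind of implementation, while naive_equals_0xff(c) should hold only where plain char is unsigned.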

Another important thing to note is that unsigned char, signed char, and (plain) char are three distinct types. char has the same representation as either unsigned char or signed char; it's implementation-defined which one it is. (On the other hand, signed int and int are two names for the same type; unsigned int is a distinct type. (Except that, just to add to the frivolity, it's implementation-defined whether a bit field declared as plain int is signed or unsigned.))
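The distinctness of the three types can be observed with a C11 _Generic selection, which requires that no two of its associations name compatible types. This is a minimal sketch; CHAR_KIND and the enum are hypothetical names introduced only for illustration.

```c
/* Hypothetical labels for the result of the selection below. */
enum char_kind { KIND_PLAIN, KIND_SIGNED, KIND_UNSIGNED, KIND_OTHER };

/* This compiles only because char, signed char, and unsigned char are
   three distinct types. Listing both int and signed int the same way
   would be rejected, since they name the same type. */
#define CHAR_KIND(x) _Generic((x),      \
    char:          KIND_PLAIN,          \
    signed char:   KIND_SIGNED,         \
    unsigned char: KIND_UNSIGNED,       \
    default:       KIND_OTHER)
```

Applied to a variable declared as plain char, CHAR_KIND selects KIND_PLAIN even on implementations where char has the same representation as signed char.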

Yes, it's all a bit of a mess, and I'm sure it would have been defined differently if C were being designed from scratch today. But each revision of the C language has had to avoid breaking (too much) existing code, and to a lesser extent existing implementations.
