Which character encoding (Unicode version) set does a char object correspond to?

2022-01-12 00:00:00 unicode char character-encoding c# java

What Unicode character encoding does a char object correspond to in:

  • C#
  • Java
  • JavaScript (I know there is not actually a char type but I am assuming that the String type is still implemented as an array of Unicode characters)

In general, is there a common convention among programming languages to use a specific character encoding?

  1. I have tried to clarify my question. The changes I made are discussed in the comments below.
  2. Re: "What problem are you trying to solve?", I am interested in code generation from language-independent expressions, and the particular encoding of the file is relevant.

Recommended Answer

I'm not sure that I am answering your question, but let me make a few remarks that hopefully shed some light.

At the core, general-purpose programming languages like the ones we are talking about (C, C++, C#, Java, PHP) do not have a notion of "text", merely of "data". Data consists of sequences of integral values (i.e. numbers). There is no inherent meaning behind those numbers.

The process of turning a stream of numbers into text is one of semantics, and it is usually left to the consumer to assign the relevant semantics to a data stream.
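A small illustration of this point, sketched in Java since the question mentions it (the class name and sample bytes are mine, not from the original post): the very same bytes yield different text depending on which semantics the consumer assigns.

```java
import java.nio.charset.StandardCharsets;

public class BytesAreJustData {
    public static void main(String[] args) {
        // Two raw byte values; nothing about them says "text".
        byte[] data = { (byte) 0xC3, (byte) 0xA9 };

        // The consumer assigns meaning by choosing a decoding.
        System.out.println(new String(data, StandardCharsets.UTF_8));      // "é"  (one character)
        System.out.println(new String(data, StandardCharsets.ISO_8859_1)); // "Ã©" (two characters)
    }
}
```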

Warning: I will now use the word "encoding", which unfortunately has multiple inequivalent meanings. The first meaning of "encoding" is the assignment of meaning to a number. The semantic interpretation of a number is also called a "character". For example, in the ASCII encoding, 32 means "space" and 65 means "capital A". ASCII only assigns meanings to 128 numbers, so every ASCII character can be conveniently represented by a single 8-bit byte (with the top bit always 0). There are many encodings which assign characters to 256 numbers, thus using one byte per character. In these fixed-width encodings, a text string has as many characters as it takes bytes to represent. There are also other encodings in which characters take a variable number of bytes to represent.
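To make the first meaning concrete, here is a minimal Java sketch (the class name is illustrative) of ASCII as an assignment of meaning to numbers, and of the fixed-width property:

```java
import java.nio.charset.StandardCharsets;

public class AsciiMeaning {
    public static void main(String[] args) {
        // In ASCII, the number 65 means "capital A" and 32 means "space".
        System.out.println((char) 65); // A
        System.out.println((char) 32); // (a space)

        // In a fixed-width single-byte encoding, a string takes exactly
        // as many bytes as it has characters.
        byte[] bytes = "Hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(bytes.length); // 5
    }
}
```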

Now, Unicode is also an encoding, i.e. an assignment of meaning to numbers. On the first 128 numbers it is the same as ASCII, but it assigns meanings to (theoretically) 2^21 numbers. Because there are lots of meanings which aren't strictly "characters" in the sense of writing (such as zero-width joiners or diacritic modifiers), the term "codepoint" is preferred over "character". Nonetheless, any integral data type that is at least 21 bits wide can represent one codepoint. Typically one picks a 32-bit type, and this encoding, in which every element stands for one codepoint, is called UTF-32 or UCS-4.
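For instance (again a Java sketch; the emoji codepoint is just an example), any codepoint fits comfortably in a 32-bit int, which is exactly the idea behind UTF-32/UCS-4:

```java
public class CodepointsFitInAnInt {
    public static void main(String[] args) {
        // U+1F600 ("grinning face") lies above the 16-bit range,
        // but a 32-bit int easily holds any 21-bit codepoint.
        int codepoint = 0x1F600;

        System.out.println(Character.isValidCodePoint(codepoint));    // true
        System.out.println(new String(Character.toChars(codepoint))); // 😀
    }
}
```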

Now we have a second meaning of "encoding": I can take a string of Unicode codepoints and transform it into a string of 8-bit or 16-bit values, thus further "encoding" the information. In this new, transformed form (called "Unicode transformation format", or "UTF"), we now have strings of 8-bit or 16-bit values (called "code units"), but each individual value does not in general correspond to anything meaningful -- it first has to be decoded into a sequence of Unicode codepoints.
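A short Java sketch of this second meaning (the sample string is mine): one and the same codepoint becomes a different number of code units depending on the transformation format:

```java
import java.nio.charset.StandardCharsets;

public class CodeUnits {
    public static void main(String[] args) {
        String s = "€"; // a single codepoint, U+20AC

        // UTF-8: three 8-bit code units for this one codepoint.
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 3

        // UTF-16: one 16-bit code unit here, though codepoints above
        // U+FFFF would need two (a surrogate pair).
        System.out.println(s.toCharArray().length); // 1
    }
}
```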

Thus, from a programming perspective, if you want to modify text (not bytes), then you should store your text as a sequence of Unicode codepoints. Practically that means that you need a 32-bit data type. The char data type in C and C++ is usually 8 bits wide (though that's only a minimum), while in C# and Java it is always 16 bits wide. An 8-bit char could conceivably be used to store a transformed UTF-8 string, and a 16-bit char could store a transformed UTF-16 string, but in order to get at the raw, meaningful Unicode codepoints (and in particular at the length of the string in codepoints) you will always have to perform decoding.
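In Java, for example, the gap between the 16-bit char and real codepoints shows up directly (a minimal sketch; the sample string is mine):

```java
public class DecodeForLength {
    public static void main(String[] args) {
        // "a" followed by U+1F600, which UTF-16 stores as a surrogate pair.
        String s = "a\uD83D\uDE00";

        // length() counts 16-bit code units, not codepoints.
        System.out.println(s.length());                      // 3
        System.out.println(s.codePointCount(0, s.length())); // 2

        // Getting at the meaningful codepoints requires decoding.
        s.codePoints().forEach(cp ->
                System.out.println("U+" + Integer.toHexString(cp).toUpperCase()));
    }
}
```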

Typically your text processing libraries will be able to do the decoding and encoding for you, so they will happily accept UTF-8 and UTF-16 strings (but at a price), but if you want to spare yourself this extra indirection, store your strings as raw Unicode codepoints in a sufficiently wide type.
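One possible way to follow that advice in Java (a sketch, under the assumption that an int[] of codepoints counts as a "sufficiently wide type" for your purposes):

```java
public class RawCodepoints {
    public static void main(String[] args) {
        String s = "héllo \uD83D\uDE00";

        // Store the text as raw codepoints in a sufficiently wide type.
        int[] codepoints = s.codePoints().toArray();
        System.out.println(codepoints.length); // 7

        // Round-trip back to a UTF-16 Java string when needed.
        String back = new String(codepoints, 0, codepoints.length);
        System.out.println(back.equals(s)); // true
    }
}
```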
