How should I correctly use g++'s -finput-charset compiler option to compile a non-UTF-8 source file?

2022-01-23 00:00:00 gcc character-encoding g++ c++

I'm trying to compile a UTF-16BE C++ source file with g++ using the -finput-charset compiler option, but I always get a bunch of errors. More details follow.

g++: 4.1.2
iconv: 2.5
Linux language (in terminal): LANG="en_US.UTF-8"

main.cpp:

    #include <iostream>

    int main()
    {
        std::cout << "Hello, UTF-16" << std::endl;
        return 0;
    }

My steps:

I read the g++ manual's entry for the -finput-charset option. It says:

-finput-charset=charset
Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there's a conflict. charset can be any encoding supported by the system's "iconv" library routine.

So I typed the following command:

    g++ -finput-charset=UTF-16BE main.cpp

I got these errors:

    In file included from main.cpp:1:
    /usr/lib/gcc/i386-redhat-linux/4.1.2/../../../../include/c++/4.1.2/iostream:1: error: stray ‘342’ in program
    /usr/lib/gcc/i386-redhat-linux/4.1.2/../../../../include/c++/4.1.2/iostream:1: error: stray ‘274’ in program
    ...(repeatedly, A LOT, around 4000+)...
    /usr/lib/gcc/i386-redhat-linux/4.1.2/../../../../include/c++/4.1.2/iostream:1: error: stray ‘257’ in program
    main.cpp: In function ‘int main()’:
    main.cpp:5: error: ‘cout’ is not a member of ‘std’
    main.cpp:5: error: ‘endl’ is not a member of ‘std’

The manual text says the charset can be any encoding supported by the "iconv" routine, so I guessed that the compile errors might be caused by my iconv library. I then tested iconv:

    iconv --from-code=UTF-16BE --to-code=UTF-8 --output=main_utf8.cpp main.cpp

A "main_utf8.cpp" file was generated as expected. I then tried to compile it:

    g++ -finput-charset=UTF-8 main_utf8.cpp

Note that I specified the input charset explicitly to see whether I had done anything wrong, but this time an "a.out" was generated without any errors. When I ran it, it produced the correct output.

I couldn't figure out what I did wrong. I searched the web for examples of this compiler option but couldn't find any.

Please advise! Thanks!

Thanks, guys! Your replies are quick! Some updates:

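When I said "UTF-16", I meant "UTF-16 + BOM". In fact, I am using UTF-16BE. I have updated the text above.
Some answers say the errors are caused by the non-UTF-16 header files. If that is the case, here is what I wonder: when writing a C/C++ project we always include some standard header files, right? For example stdio.h or iostream. If the g++ compiler only handles the encoding of the source files we create ourselves, and not that of the standard library's files, then what is the point of the -finput-charset option?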

Final edit:

In the end, my solution was this:

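At first, I changed the encoding of my source files to GB2312, as "Mr Lister" suggested below. That worked for a while, but I later found it didn't suit my situation: most other parts of the system still use UTF-8 for communication and interfaces, so I had to convert encodings in many places. That meant not only extra work for me but possibly some performance loss in my program.
Later I tried converting all source files to UTF-8 + BOM. That way, Visual Studio on Windows compiles them happily, but GCC on Linux reports errors. So I wrote a shell script to remove the BOM, and I run it before compiling my code with GCC.
Fortunately, I don't have to build the code manually on Linux, because my project uses the continuous integration tool TeamCity to produce builds automatically. I can change the build steps in TeamCity so that this script runs before each daily build starts.
With this UTF-8 + BOM + script approach, I decided not to edit my source code on Linux. If I did, I would have to make sure the code builds before committing, which means running the script to remove the BOM first, which in turn means SVN would report every file as modified (BOM removed), making it easy to commit the wrong files by mistake. To solve this, I wrote another shell script that adds the BOM back to the source files. I still don't edit my code on Linux often, but when I really need to, I don't have to face a very long list of changes in the commit dialog.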
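The two scripts themselves are not shown here. As a rough illustration of what "remove the BOM" and "add the BOM back" amount to, here is a minimal sketch written in C++ rather than shell. The function names are made up for this example, and it works on the file contents already held in a string (read the file in binary mode so the three BOM bytes arrive untouched).

```cpp
#include <string>

// The UTF-8 byte order mark is the three bytes EF BB BF at the very start of a file.
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// Hypothetical helper: true if the text starts with a UTF-8 BOM.
bool has_bom(const std::string& text) {
    return text.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0;
}

// Hypothetical helper: returns the text without a leading BOM (what GCC wants here).
std::string strip_bom(const std::string& text) {
    return has_bom(text) ? text.substr(kUtf8Bom.size()) : text;
}

// Hypothetical helper: returns the text with exactly one leading BOM (the Visual Studio side).
std::string add_bom(const std::string& text) {
    return has_bom(text) ? text : kUtf8Bom + text;
}
```

Both functions are idempotent, so running the "strip" step twice or the "add" step twice leaves the file unchanged, which is convenient when the step runs automatically before every build.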

Recommended answer

Encoding Blues

You cannot use UTF-16 for source code files, because the header you are including, <iostream>, is not UTF-16-encoded. As #include includes files verbatim, you suddenly have a UTF-16-encoded file with a large chunk (approximately 4k, apparently) of invalid data.
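To make the "invalid data" concrete, here is a small sketch of what happens when plain ASCII bytes are read two at a time as UTF-16BE and converted to UTF-8. It assumes the header begins with the two ASCII bytes "//" (a comment line, which I have not checked against the 4.1.2 copy), and the function names are invented for this illustration. It only handles code units outside the surrogate range, which is enough for this case.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Reads `bytes` two at a time as big-endian UTF-16 code units and re-encodes
// each unit as UTF-8. Surrogate pairs are NOT handled in this sketch.
std::string utf16be_to_utf8(const std::string& bytes) {
    std::string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        std::uint32_t cp = (static_cast<unsigned char>(bytes[i]) << 8)
                         | static_cast<unsigned char>(bytes[i + 1]);
        if (cp < 0x80) {            // one-byte UTF-8 sequence
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // two-byte UTF-8 sequence
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // three-byte UTF-8 sequence
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Formats each byte as three octal digits, the way the "stray" errors print them.
std::string to_octal(const std::string& bytes) {
    std::string out;
    char buf[8];
    for (unsigned char b : bytes) {
        std::snprintf(buf, sizeof buf, "%03o", static_cast<unsigned>(b));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Under that assumption, `to_octal(utf16be_to_utf8("//"))` gives `342 274 257`: the two slashes (0x2F 0x2F) are read as the single code unit U+2F2F, whose UTF-8 form is E2 BC AF. Those are the same three stray byte values that appear in the error output of the question.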

There is almost no good reason to ever use UTF-16 for anything, so this is just as well.

Regarding problems with encoding support: the OSes themselves are not responsible for providing encoding support; this comes down to the compilers used.

g++ on Windows supports absolutely all of the same encodings as g++ on Linux, because it's the same program, unless whatever version of g++ you are using on Windows relies on a deeply broken iconv library.

Inspect your toolchain and ensure that all your tools are in working order.

As an alternative, don't use Chinese in the source files. Write them in English, using English-language literals or simple TOKEN_STYLE_PLACEHOLDERs, and use l10n and i18n to replace these in the running executable.

Threedit: -finput-charset is almost certainly a holdover from the days of codepages and other nonsense of the kind. That said, an ISO-8859-n file will almost always be compatible with UTF-8 standard headers; see the reedit below, though.

Reedit: For next time, remember a simple mantra: "N'DUUH!", "Never Don't Use UTF-8!"

A common solution to this kind of problem is to remove the problem entirely by way of, for instance, gettext.

When using gettext, you usually end up with a function loc(char *) that abstracts away most of the translation-tool-specific code. So, instead of

    #include <iostream>

    int main () {
        std::cout << "瓜田李下" << std::endl;
    }

you would have

    #include <iostream>
    #include "translation.h"

    int main () {
        std::cout << loc("DEEPER_MEANING") << std::endl;
    }

and, in zh.po:

    msgid "DEEPER_MEANING"
    msgstr "瓜田李下"

Of course, you could then also have an en.po:

    msgid "DEEPER_MEANING"
    msgstr "Still waters run deep"

This can be expanded upon: the gettext package has tools for expanding strings with variables and such, or you could use printf to account for different grammars.

Instead of having to deal with multiple compilers with different requirements for file encodings, line endings, byte order marks, and other problems of the kind, it is possible to cross-compile using MinGW or similar tools.

This option requires some setup, but may very well reduce future overhead and headaches.
