Visual Studio Character Set 'Not Set' vs 'Multi byte character set'
I'm working with a legacy application and I'm trying to work out the difference between applications compiled with Multi byte character set and Not Set under the Character Set option.
I understand that compiling with Multi byte character set defines _MBCS, which allows multi-byte character set code pages to be used, and that using Not Set doesn't define _MBCS, in which case only single-byte character set code pages are allowed.
In the case that Not Set is used, I'm assuming that we can only use the single-byte character set code pages found on this page: http://msdn.microsoft.com/en-gb/goglobal/bb964654.aspx
Therefore, am I correct in thinking that if Not Set is used, the application won't be able to encode, write or read Far Eastern languages, since they are defined in double-byte character set code pages (and of course Unicode)?
Following on from this, if Multi byte character set is defined, are both single-byte and multi-byte character set code pages available, or only multi-byte character set code pages? I'm guessing it must be both for European languages to be supported.
Thanks,
Andy
Further reading
The answers on these pages didn't answer my question, but helped in my understanding: About the "Character set" option in visual studio 2010
Research
So, just as working research, with my locale set to Japanese:
Effect on hard coded strings
char *foo = "Jap text: テスト";
wchar_t *bar = L"Jap text: テスト";
Compiled with Unicode
*foo = 4a 61 70 20 74 65 78 74 3a 20 83 65 83 58 83 67 == Shift-JIS (code page 932)
*bar = 4a 00 61 00 70 00 20 00 74 00 65 00 78 00 74 00 3a 00 20 00 c6 30 b9 30 c8 30 == UTF-16 or UCS-2
Compiled with Multi byte character set
*foo = 4a 61 70 20 74 65 78 74 3a 20 83 65 83 58 83 67 == Shift-JIS (code page 932)
*bar = 4a 00 61 00 70 00 20 00 74 00 65 00 78 00 74 00 3a 00 20 00 c6 30 b9 30 c8 30 == UTF-16 or UCS-2
Compiled with Not Set
*foo = 4a 61 70 20 74 65 78 74 3a 20 83 65 83 58 83 67 == Shift-JIS (code page 932)
*bar = 4a 00 61 00 70 00 20 00 74 00 65 00 78 00 74 00 3a 00 20 00 c6 30 b9 30 c8 30 == UTF-16 or UCS-2
Conclusion: The Character Set setting doesn't have any effect on hard coded strings. Defining chars as above seems to use the locale-defined code page, while wchar_t seems to use either UCS-2 or UTF-16.
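To reproduce the wide-string half of the dump above outside Visual Studio, here is a minimal sketch in portable C11. It is an assumption-laden stand-in rather than the code used for the research: it uses `char16_t` and `u"..."` literals instead of `wchar_t` (which is only 16 bits wide on Windows), it spells the Japanese text with universal character names so the result does not depend on the source file encoding, and `dump_utf16le` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdio.h>
#include <uchar.h>

/* Hypothetical helper: writes each UTF-16 code unit of s into out as two
   little-endian hex bytes ("c6 30 b9 30 ..."), the byte order seen in the
   dumps above. Returns the number of characters written, stopping early
   if the buffer is too small for the next code unit. */
static size_t dump_utf16le(const char16_t *s, char *out, size_t cap)
{
    size_t used = 0;

    if (cap == 0)
        return 0;
    out[0] = '\0';

    for (; *s != 0; ++s) {
        int n = snprintf(out + used, cap - used, "%s%02x %02x",
                         used ? " " : "",
                         (unsigned)(*s & 0xff),   /* low byte first */
                         (unsigned)(*s >> 8));    /* then high byte */
        if (n < 0 || (size_t)n >= cap - used) {
            out[used] = '\0';                     /* drop the partial unit */
            break;
        }
        used += (size_t)n;
    }
    return used;
}
```

Passing `u"\u30c6\u30b9\u30c8"` (テスト) with a large enough buffer should give the same `c6 30 b9 30 c8 30` tail as `*bar` above, whichever Character Set value the project uses, because the literal prefix alone decides the encoding.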
Using encoded strings in the W/A versions of the Win32 APIs
So, using the following code:
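char *foo = "C:\\Temp\\テスト\\テa.txt";      /* narrow string, active code page */
wchar_t *bar = L"C:\\Temp\\テスト\\テw.txt";  /* wide string, UTF-16 */
/* The A version takes the char string, the W version the wchar_t string. */
CreateFileA(foo, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
CreateFileW(bar, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);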
Compiled with Unicode
Result: Both files are created
Compiled with Multi byte character set
Result: Both files are created
Compiled with Not Set
Result: Both files are created
Conclusion: Both the A and W versions of the API expect the same encoding regardless of the character set chosen. From this, perhaps we can assume that all the Character Set option does is switch between the versions of the API. So the A version always expects strings in the encoding of the current code page and the W version always expects UTF-16 or UCS-2.
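The switching described in this conclusion can be sketched without the Windows headers. The names below (`demo_create_a`, `demo_create_w`, `demo_create`, `DEMO_TEXT`) are hypothetical stand-ins for an A/W pair such as CreateFileA/CreateFileW, the generic name CreateFile, and the TEXT() macro; only the `UNICODE` macro is the real one set by the Unicode option. This is an illustration of the mechanism under those assumptions, not a copy of the SDK headers.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the narrow and wide entry points. Each one
   just reports which variant was called. */
static const char *demo_create_a(const char *path)    { (void)path; return "A"; }
static const char *demo_create_w(const wchar_t *path) { (void)path; return "W"; }

/* The project setting only changes which macros are defined, and the
   generic name then expands to one of the two real functions. */
#ifdef UNICODE
#  define demo_create   demo_create_w
#  define DEMO_TEXT(s)  L##s
#else
#  define demo_create   demo_create_a   /* Multi-Byte and Not Set both land here */
#  define DEMO_TEXT(s)  s
#endif

/* Code written against the generic name compiles either way. */
static const char *selected_variant(void)
{
    return demo_create(DEMO_TEXT("C:\\Temp\\test.txt"));
}
```

Calling `demo_create_a` or `demo_create_w` explicitly, as the test code above does with CreateFileA and CreateFileW, bypasses the macro entirely, which is consistent with the results being identical under all three settings.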
Opening files using the W and A Win32 APIs
So using the following code:
char filea[MAX_PATH] = {0};
OPENFILENAMEA ofna = {0};
ofna.lStructSize = sizeof ( ofna );
ofna.hwndOwner = NULL ;
ofna.lpstrFile = filea ;
ofna.nMaxFile = MAX_PATH;
ofna.lpstrFilter = "All\0*.*\0Text\0*.TXT\0";
ofna.nFilterIndex =1;
ofna.lpstrFileTitle = NULL ;
ofna.nMaxFileTitle = 0 ;
ofna.lpstrInitialDir=NULL ;
ofna.Flags = OFN_PATHMUSTEXIST|OFN_FILEMUSTEXIST ;
wchar_t filew[MAX_PATH] = {0};
OPENFILENAMEW ofnw = {0};
ofnw.lStructSize = sizeof ( ofnw );
ofnw.hwndOwner = NULL ;
ofnw.lpstrFile = filew ;
ofnw.nMaxFile = MAX_PATH;
ofnw.lpstrFilter = L"All\0*.*\0Text\0*.TXT\0";
ofnw.nFilterIndex =1;
ofnw.lpstrFileTitle = NULL;
ofnw.nMaxFileTitle = 0 ;
ofnw.lpstrInitialDir=NULL ;
ofnw.Flags = OFN_PATHMUSTEXIST|OFN_FILEMUSTEXIST ;
GetOpenFileNameA(&ofna);
GetOpenFileNameW(&ofnw);
and selecting either:
- C:\Temp\テスト\テopena.txt
- C:\Temp\テスト\テopenw.txt
Yields:
When compiled with Unicode
*filea = 43 3a 5c 54 65 6d 70 5c 83 65 83 58 83 67 5c 83 65 6f 70 65 6e 61 2e 74 78 74 == Shift-JIS (code page 932)
*filew = 43 00 3a 00 5c 00 54 00 65 00 6d 00 70 00 5c 00 c6 30 b9 30 c8 30 5c 00 c6 30 6f 00 70 00 65 00 6e 00 77 00 2e 00 74 00 78 00 74 00 == UTF-16 or UCS-2
When compiled with Multi byte character set
*filea = 43 3a 5c 54 65 6d 70 5c 83 65 83 58 83 67 5c 83 65 6f 70 65 6e 61 2e 74 78 74 == Shift-JIS (code page 932)
*filew = 43 00 3a 00 5c 00 54 00 65 00 6d 00 70 00 5c 00 c6 30 b9 30 c8 30 5c 00 c6 30 6f 00 70 00 65 00 6e 00 77 00 2e 00 74 00 78 00 74 00 == UTF-16 or UCS-2
When compiled with Not Set
*filea = 43 3a 5c 54 65 6d 70 5c 83 65 83 58 83 67 5c 83 65 6f 70 65 6e 61 2e 74 78 74 == Shift-JIS (code page 932)
*filew = 43 00 3a 00 5c 00 54 00 65 00 6d 00 70 00 5c 00 c6 30 b9 30 c8 30 5c 00 c6 30 6f 00 70 00 65 00 6e 00 77 00 2e 00 74 00 78 00 74 00 == UTF-16 or UCS-2
Conclusion: Again, the Character Set setting doesn't have a bearing on the behaviour of the Win32 API. The A version always seems to return a string in the encoding of the active code page, and the W version always returns UTF-16 or UCS-2. I can actually see this explained a bit in this great answer: https://stackoverflow.com/a/3299860/187100.
Final conclusion
Hans appears to be correct when he says that the define doesn't really have any magic to it, beyond changing the Win32 APIs to use either W or A. Therefore, I can't really see any difference between Not Set and Multi byte character set.
Recommended answer
No, that's not really the way it works. The only thing that happens is that the macro gets defined; it doesn't otherwise have a magic effect on the compiler. It is very rare to actually write code that uses #ifdef _MBCS to test this macro.
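For completeness, this is roughly what such a rare test would look like. The sketch assumes the usual Visual Studio behaviour, where the Unicode option defines `_UNICODE` and `UNICODE`, the Multi-Byte option defines `_MBCS`, and Not Set defines neither; `charset_setting` is a hypothetical function name used only for illustration.

```c
/* Reports which Character Set option this translation unit was built with,
   judged purely by the preprocessor macros the option defines. */
static const char *charset_setting(void)
{
#if defined(_UNICODE) || defined(UNICODE)
    return "Unicode";
#elif defined(_MBCS)
    return "Multi-Byte";
#else
    return "Not Set";
#endif
}
```

The choice is made entirely at preprocessing time, so nothing about the running process, its locale or its code page enters into it.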
You almost always leave it up to a helper function to make the conversion, like WideCharToMultiByte(), OLE2A() or wcstombs(). These are conversion functions that always consider multi-byte encodings, as guided by the code page. _MBCS is a historical accident, relevant only 25+ years ago when multi-byte encodings were not common yet. Much like using a non-Unicode encoding is a historical artifact these days as well.
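As a small illustration of leaving the work to a helper, here is a sketch built on the standard C `wcstombs()`, which converts according to the current C locale rather than any project setting. The wrapper name `wide_to_multibyte` and its error convention are assumptions made for this example. On Windows, WideCharToMultiByte() plays the same role with an explicit code page argument.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper: converts src to the multibyte encoding of the
   current C locale. Returns the number of bytes written, not counting the
   terminator, or -1 if a character cannot be represented or the result
   (including its terminator) does not fit in dst. */
static long wide_to_multibyte(const wchar_t *src, char *dst, size_t cap)
{
    size_t n;

    if (cap == 0)
        return -1;

    n = wcstombs(dst, src, cap);
    if (n == (size_t)-1)
        return -1;          /* character not representable in this locale */
    if (n == cap)
        return -1;          /* buffer filled, no room for the terminator */
    return (long)n;
}
```

In the default "C" locale only plain ASCII text is guaranteed to convert; a program that needs Japanese output would first select a suitable locale with setlocale(), which is the runtime analogue of the active code page discussed above.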