How to correctly determine the character encoding of a text file?
Here is my situation: I need to correctly determine which character encoding is used for a given text file. Hopefully, it can correctly return one of the following types:
enum CHARACTER_ENCODING
{
    ANSI,
    Unicode,
    Unicode_big_endian,
    UTF8_with_BOM,
    UTF8_without_BOM
};
Up to now, I can correctly tell whether a text file is Unicode, Unicode big endian, or UTF-8 with BOM by calling the following function. It can also correctly report ANSI if the given text file is not originally UTF-8 without BOM. The problem is that when the text file is UTF-8 without BOM, the following function mistakenly regards it as an ANSI file.
CHARACTER_ENCODING get_text_file_encoding(const char *filename)
{
    CHARACTER_ENCODING encoding;
    unsigned char uniTxt[] = {0xFF, 0xFE};    // UTF-16 little endian BOM
    unsigned char endianTxt[] = {0xFE, 0xFF}; // UTF-16 big endian BOM
    unsigned char utf8Txt[] = {0xEF, 0xBB};   // first two bytes of the UTF-8 BOM (EF BB BF)
    DWORD dwBytesRead = 0;
    HANDLE hFile = CreateFile(filename, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        throw runtime_error("cannot open file"); // invalid handle: nothing to close
    BYTE lpHeader[2] = {0, 0}; // stack buffer: no leak if an exception is thrown later
    ReadFile(hFile, lpHeader, 2, &dwBytesRead, NULL);
    CloseHandle(hFile);
    if (dwBytesRead < 2) // file too short to hold any BOM
        encoding = CHARACTER_ENCODING::ANSI;
    else if (lpHeader[0] == uniTxt[0] && lpHeader[1] == uniTxt[1]) // UTF-16 LE BOM
        encoding = CHARACTER_ENCODING::Unicode;
    else if (lpHeader[0] == endianTxt[0] && lpHeader[1] == endianTxt[1]) // UTF-16 BE BOM
        encoding = CHARACTER_ENCODING::Unicode_big_endian;
    else if (lpHeader[0] == utf8Txt[0] && lpHeader[1] == utf8Txt[1]) // UTF-8 BOM
        encoding = CHARACTER_ENCODING::UTF8_with_BOM;
    else
        encoding = CHARACTER_ENCODING::ANSI; // no BOM found: assumed ANSI
    return encoding;
}
This problem has blocked me for a long time and I still cannot find a good solution. Any hint will be appreciated.
Recommended Answer
For starters, there's no such physical encoding as "Unicode". What you probably mean by this is UTF-16. Secondly, any file is valid in "ANSI", or any single-byte encoding for that matter. The only thing you can do is guess in the best order which is most likely to throw out invalid matches.
You should check, in this order:
- Is there a UTF-16 BOM at the start? Then it is probably UTF-16. Use the BOM to tell whether it is little endian or big endian, then check that the rest of the file conforms.
- Is there a UTF-8 BOM at the start? Then it is probably UTF-8. Check the rest of the file.
- If the above did not produce a positive match, check whether the entire file is valid UTF-8. If so, it is probably UTF-8.
- If the above did not produce a positive match, it is probably ANSI.
If you expect UTF-16 files without a BOM as well (possible for, for example, XML files that specify the encoding in the XML declaration), then you have to shove that rule in there too, though any of the above may produce a false positive, falsely identifying an ANSI file as UTF-* (however unlikely). You should always have metadata that tells you what encoding a file is in; detecting it after the fact is not possible with 100% accuracy.
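The decision order described above can be sketched in portable C++, operating on a buffer that has already been read into memory rather than on a Windows HANDLE. The `is_valid_utf8` and `guess_encoding` names are illustrative, and the validator is deliberately minimal: it checks lead/continuation byte structure only, not overlong forms or surrogates, which a stricter checker would also reject.

```cpp
#include <cstddef>

enum CHARACTER_ENCODING
{
    ANSI,
    Unicode,            // UTF-16 little endian (BOM FF FE)
    Unicode_big_endian, // UTF-16 big endian   (BOM FE FF)
    UTF8_with_BOM,
    UTF8_without_BOM
};

// Minimal UTF-8 well-formedness check: every lead byte must announce a
// valid sequence length, and every continuation byte must be 10xxxxxx.
bool is_valid_utf8(const unsigned char *buf, std::size_t len)
{
    std::size_t i = 0;
    while (i < len)
    {
        unsigned char b = buf[i];
        std::size_t extra;
        if (b < 0x80)                extra = 0; // ASCII byte
        else if ((b & 0xE0) == 0xC0) extra = 1; // 2-byte sequence
        else if ((b & 0xF0) == 0xE0) extra = 2; // 3-byte sequence
        else if ((b & 0xF8) == 0xF0) extra = 3; // 4-byte sequence
        else return false;                      // invalid lead byte
        if (i + extra >= len && extra > 0)
            return false;                       // sequence truncated at end of buffer
        for (std::size_t j = 1; j <= extra; ++j)
            if ((buf[i + j] & 0xC0) != 0x80)    // continuation must be 10xxxxxx
                return false;
        i += extra + 1;
    }
    return true;
}

// Apply the checks in the order the answer describes: BOMs first,
// then full-file UTF-8 validation, ANSI as the fallback.
CHARACTER_ENCODING guess_encoding(const unsigned char *buf, std::size_t len)
{
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return CHARACTER_ENCODING::Unicode;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return CHARACTER_ENCODING::Unicode_big_endian;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return CHARACTER_ENCODING::UTF8_with_BOM;
    if (is_valid_utf8(buf, len))
        return CHARACTER_ENCODING::UTF8_without_BOM;
    return CHARACTER_ENCODING::ANSI;
}
```

Note one consequence of this order: a pure-ASCII file is valid UTF-8, so it will be reported as UTF8_without_BOM even though it is equally valid ANSI. That is exactly the ambiguity the answer warns about, and why only out-of-band metadata can settle it.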