fgetcsv() drops characters with diacritics (i.e. non-ASCII) - how to fix?

2022-01-07 00:00:00 csv character-encoding php

Similar questions:
Some characters in CSV file are not read during PHP fgetcsv(),
fgetcsv() ignores special characters when they are at the beginning of line

My application has a form where users can upload a CSV file (its 5 internal users have always uploaded a valid file: comma-delimited, quoted, records terminated by LF), and the file is then imported into a database using PHP:

$fhandle = fopen($uploaded_file, 'r');
while (($row = fgetcsv($fhandle, 0, ',', '"', '\\')) !== false) {
    print_r($row);
    // further code not relevant as the data is already corrupt at this point
}

For reasons I cannot change, the users upload the file encoded in the Windows-1250 charset, a single-byte, 8-bit character encoding.

The problem: some (not all!) characters above 127 ("extended ASCII") are dropped by fgetcsv(). Example data:

"15","Ústav"
"420","Špičák"
"7","Tmaň"

becomes

Array
(
    0 => 15
    1 => "stav"
)
Array
(
    0 => 420
    1 => "pičák"
)
Array
(
    0 => 7
    1 => "Tma"
)

(Note that č and á are kept, but Ú, Š and ň are dropped.)

The documentation for fgetcsv says that "since 4.3.5 fgetcsv() is now binary safe", but it doesn't look that way. Am I doing something wrong, or is this function broken, in which case I should look for a different way to parse CSV?

Recommended answer

It turns out that I didn't read the documentation carefully enough - fgetcsv() is only somewhat binary-safe. It is safe for plain ASCII (bytes 0-127), but the documentation also says:

Note:

Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function.

In other words, fgetcsv() tries to be binary-safe, but it actually isn't, because it also interprets the bytes according to a character set at the same time. It will probably mangle the data it reads, and the behavior depends on the environment: this setting is not configured in php.ini, but rather read from $LANG.
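To see which locale a given PHP process is actually running under, something like the following diagnostic might help. It is only a sketch: it reports the current LC_CTYPE setting and the LANG environment variable, and does not by itself prove that fgetcsv() will misread a particular file.

```php
<?php
// Passing "0" asks setlocale() for the current value without changing it.
$ctype = setlocale(LC_CTYPE, '0');

// LANG may be unset, e.g. under some web server or cron setups.
$lang = getenv('LANG');

printf("LC_CTYPE: %s\n", $ctype === false ? '(unknown)' : $ctype);
printf("LANG:     %s\n", $lang === false ? '(not set)' : $lang);
```

If the reported locale is a UTF-8 one while the uploaded file is Windows-1250, that is the mismatch the documentation note warns about.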

I've sidestepped the issue by reading the lines with fgets (which works on bytes, not characters) and using a CSV function from a comment in the docs to parse them into an array:

$fhandle = fopen($uploaded_file, 'r');
while (($raw_row = fgets($fhandle)) !== false) {
    // fgets is actually binary safe
    $row = csvstring_to_array($raw_row, ',', '"', "\n");
    // $row is now read correctly
}
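For readers who do not have that helper at hand, here is a minimal sketch of a byte-oriented line parser. The name parse_csv_line_bytes() is hypothetical and this is not the function from the docs comment. It assumes each record fits on one line (no line breaks inside quoted fields) and that quotes are escaped by doubling them. Because it only compares single bytes, the locale plays no role. The last two lines show one possible follow-up step, converting the fields from Windows-1250 to UTF-8 with iconv().

```php
<?php
// Hypothetical stand-in for a byte-wise CSV line parser (not the docs-comment function).
function parse_csv_line_bytes(string $line, string $sep = ',', string $quote = '"'): array
{
    $line = rtrim($line, "\r\n");   // drop the record terminator left by fgets()
    $fields = [];
    $field = '';
    $inQuotes = false;
    $len = strlen($line);           // length in bytes, not characters

    for ($i = 0; $i < $len; $i++) {
        $ch = $line[$i];            // a single byte
        if ($inQuotes) {
            if ($ch !== $quote) {
                $field .= $ch;
            } elseif ($i + 1 < $len && $line[$i + 1] === $quote) {
                $field .= $quote;   // doubled quote means a literal quote
                $i++;
            } else {
                $inQuotes = false;  // closing quote
            }
        } elseif ($ch === $quote) {
            $inQuotes = true;       // opening quote
        } elseif ($ch === $sep) {
            $fields[] = $field;     // field boundary
            $field = '';
        } else {
            $field .= $ch;
        }
    }
    $fields[] = $field;             // last field has no trailing separator
    return $fields;
}

// "\xDA" is the Windows-1250 byte for "Ú".
$raw = parse_csv_line_bytes("\"15\",\"\xDAstav\"\n");

// Optional: convert each field to UTF-8 once the bytes are safely split.
$fields = array_map(fn ($f) => iconv('WINDOWS-1250', 'UTF-8', $f), $raw);
```

Splitting first and converting afterwards keeps the parsing step independent of any character set; the conversion step assumes the iconv extension is available and knows the WINDOWS-1250 name.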
