Encoding SQL_Latin1_General_CP1_CI_AS into UTF-8

I'm generating an XML file with PHP using DomDocument and I need to handle Asian characters. I'm pulling data from an MSSQL 2008 server using the pdo_mssql driver, and I apply utf8_encode() to the XML attribute values. Everything works fine as long as there are no special characters.

The server is MS SQL Server 2008 SP3.


The database, table and column collations are all SQL_Latin1_General_CP1_CI_AS.

I'm using PHP 5.2.17.

Here's my PDO object:

$pdo = new PDO("mssql:host=MyServer,1433;dbname=MyDatabase", "user123", "password123");

My query is a basic SELECT.

I know storing special characters in SQL_Latin1_General_CP1_CI_AS columns isn't great, but ideally I'd like to make it work without changing the column, because other non-PHP programs already use that column and it works fine for them. In SQL Server Management Studio I can see the Asian characters correctly.


Considering all the details above, how should I process the data?



I found how to solve it, so hopefully this will be helpful to someone.

First, SQL_Latin1_General_CP1_CI_AS seems to be a strange mix of CP-1252 and a 2-byte Unicode encoding. The basic characters are CP-1252, which is why applying utf8_encode() was all I had to do and everything worked. The Asian and other non-Latin characters are encoded on 2 bytes (UCS-2LE, judging by the conversion that ended up working), and the PHP pdo_mssql driver seems to hate varying-length characters, so it appears to do a CAST to varchar (instead of nvarchar), and then all the 2-byte characters become question marks ('?').
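A small aside on the CP-1252 part: utf8_encode() treats its input as ISO-8859-1, which agrees with CP-1252 everywhere except the 0x80-0x9F range, so it only looks correct as long as no characters from that range (the euro sign, curly quotes) show up. The sketch below uses iconv() with an explicit source encoding instead. The helper name cp1252_to_utf8 is made up for illustration, and the 'CP1252' encoding name is assumed to be supported by the local iconv build.

```php
<?php
// Hypothetical helper (not from the original post): convert a CP-1252
// byte string to UTF-8 with an explicit source encoding.
function cp1252_to_utf8($bytes)
{
    // iconv() returns false on failure; real code should check for that.
    return iconv('CP1252', 'UTF-8', $bytes);
}

// 0xE9 is "é" in both ISO-8859-1 and CP-1252.
$cafe = cp1252_to_utf8("caf\xE9");

// 0x80 is the euro sign in CP-1252 only; ISO-8859-1 has a control code there.
$euro = cp1252_to_utf8("\x80");
```

This only covers the single-byte characters; it does nothing for the 2-byte ones that come back as question marks.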

I fixed it by casting the column to binary in the query and then rebuilding the text with PHP.
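The query itself is not shown here, so the following is only a guess at what the cast could look like, wrapped in PHP. MY_COLUMN and MY_TABLE are placeholder names, both helper functions are hypothetical, and it is an assumption that the driver hands back the VARBINARY value as a raw byte string.

```php
<?php
// Hypothetical helper: build a SELECT that returns one column as raw bytes.
// $column and $table must be trusted identifiers, never user input.
function build_binary_select($column, $table)
{
    return "SELECT CAST($column AS VARBINARY(MAX)) AS $column FROM $table";
}

// Hypothetical helper: run the query and collect the raw bytes of each row.
function fetch_binary_column(PDO $pdo, $column, $table)
{
    $rows = array();
    $stmt = $pdo->query(build_binary_select($column, $table));
    if ($stmt === false) {
        return $rows; // query failed; nothing to collect
    }
    while (($row = $stmt->fetch(PDO::FETCH_ASSOC)) !== false) {
        $rows[] = $row[$column];
    }
    return $rows;
}

// Placeholder names, for illustration only.
$sql = build_binary_select('MY_COLUMN', 'MY_TABLE');
```

Casting on the server side keeps the driver from converting the text to varchar, which is where the question marks seemed to come from.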


In PHP:

// $bin holds the value of the column that was cast to binary
// Binary to hexadecimal
$hex = bin2hex($bin);

// And then from hex back to a byte string
$str = "";
$len = strlen($hex);
for ($i = 0; $i < $len - 1; $i += 2) {
    $str .= chr(hexdec($hex[$i] . $hex[$i + 1]));
}

// And then from UCS-2LE/SQL_Latin1_General_CP1_CI_AS (that's the column format in the DB) to UTF-8
$str = iconv('UCS-2LE', 'UTF-8', $str);
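A possible simplification, offered as a sketch rather than a tested replacement: bin2hex() followed by the chr(hexdec()) loop reproduces the original bytes, so if $bin already holds the raw bytes the conversion may reduce to a single iconv() call. The helper name ucs2le_to_utf8 is hypothetical, and the sample bytes are hand-written rather than fetched from a database.

```php
<?php
// Hypothetical helper: convert a UCS-2LE byte string to UTF-8.
function ucs2le_to_utf8($bin)
{
    // iconv() returns false on failure; real code should check for that.
    return iconv('UCS-2LE', 'UTF-8', $bin);
}

// Sample input: U+4E2D and U+6587 as little-endian 2-byte units.
$sample = "\x2D\x4E\x87\x65";

// The hex round trip gives back the same bytes it started with.
$roundTrip = pack('H*', bin2hex($sample));

$utf8 = ucs2le_to_utf8($sample);
```

Characters outside the Basic Multilingual Plane are stored as surrogate pairs, so 'UTF-16LE' might be the safer source encoding if such characters can occur.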
