在 linux 上使用 pyodbc 在 nvarchar mssql 字段中插入 unicode 或 utf-8 字符
我使用的是 Ubuntu 9.04
我安装了以下软件包版本:
I have installed the following package versions:
unixodbc and unixodbc-dev: 2.2.11-16build3
tdsodbc: 0.82-4
libsybdb5: 0.82-4
freetds-common and freetds-dev: 0.82-4
我已经像这样配置了 /etc/unixodbc.ini
:
I have configured /etc/unixodbc.ini
like this:
[FreeTDS]
Description = TDS driver (Sybase/MS SQL)
Driver = /usr/lib/odbc/libtdsodbc.so
Setup = /usr/lib/odbc/libtdsS.so
CPTimeout =
CPReuse =
UsageCount = 2
我已经这样配置 /etc/freetds/freetds.conf
:
[global]
tds version = 8.0
client charset = UTF-8
我从 http://github.com/mkleehammer/pyodbc
获取了 pyodbc 修订版 31e2fae4adbf1b2af1726e5668a3414cf46b454f
并使用python setup.py install
安装了它代码>"
I have grabbed pyodbc revision 31e2fae4adbf1b2af1726e5668a3414cf46b454f
from http://github.com/mkleehammer/pyodbc
and installed it using "python setup.py install
"
我的本地网络上安装了一台装有 Microsoft SQL Server 2000 的 Windows 机器,启动并监听本地 IP 地址 10.32.42.69.我创建了一个名为Common"的空数据库.我有用户sa",密码secret",拥有完全权限.
I have a windows machine with Microsoft SQL Server 2000 installed on my local network, up and listening on the local ip address 10.32.42.69. I have an empty database created with name "Common". I have the user "sa" with password "secret" with full priviledges.
我使用以下 python 代码来设置连接:
I am using the following python code to setup the connection:
import pyodbc
odbcstring = "SERVER=10.32.42.69;UID=sa;PWD=secret;DATABASE=Common;DRIVER=FreeTDS"
con = pyodbc.connect(s)
cur = con.cursor()
cur.execute('''
CREATE TABLE testing (
id INTEGER NOT NULL IDENTITY(1,1),
name NVARCHAR(200) NULL,
PRIMARY KEY (id)
)
''')
con.commit()
到目前为止一切有效.我在服务器上使用了 SQLServer 的企业管理器,新表就在那里.现在我想在表上插入一些数据.
Everything WORKS up to this point. I have used SQLServer's Enterprise Manager on the server and the new table is there. Now I want to insert some data on the table.
cur = con.cursor()
cur.execute('INSERT INTO testing (name) VALUES (?)', (u'something',))
失败了!!这是我得到的错误:
That fails!! Here's the error I get:
pyodbc.Error: ('HY004', '[HY004] [FreeTDS][SQL Server]Invalid data type
(0) (SQLBindParameter)'
由于我的客户端配置为使用 UTF-8,我想我可以通过将数据编码为 UTF-8 来解决.那行得通,但后来我得到了奇怪的数据:
Since my client is configured to use UTF-8 I thought I could solve by encoding data to UTF-8. That works, but then I get back strange data:
cur = con.cursor()
cur.execute('DELETE FROM testing')
cur.execute('INSERT INTO testing (name) VALUES (?)', (u'somé string'.encode('utf-8'),))
con.commit()
# fetching data back
cur = con.cursor()
cur.execute('SELECT name FROM testing')
data = cur.fetchone()
print type(data[0]), data[0]
那没有报错,但是返回的数据和发送的数据不一样!我得到:
That gives no error, but the data returned is not the same data sent! I get:
<type 'unicode'> somé string
也就是说,pyodbc 不会直接接受一个 unicode 对象,但它会将 unicode 对象返回给我!并且编码被混淆了!
That is, pyodbc won't accept an unicode object directly, but it returns unicode objects back to me! And the encoding is being mixed up!
现在问题来了:
我想要代码在 NVARCHAR 和/或 NTEXT 字段中插入 unicode 数据.当我回查询时,我想要回插入的相同数据.
I want code to insert unicode data in a NVARCHAR and/or NTEXT field. When I query back, I want the same data I inserted back.
这可以通过对系统进行不同的配置,或者通过使用能够在插入或检索时正确地将数据转换为/从 unicode 转换的包装函数
That can be by configuring the system differently, or by using a wrapper function able to convert the data correctly to/from unicode when inserting or retrieving
这要求不高吧?
推荐答案
我记得在使用 odbc 驱动程序时遇到过这种愚蠢的问题,即使当时是 java+oracle 组合.
I can remember having this kind of stupid problems using odbc drivers, even if that time it was a java+oracle combination.
核心是 odbc 驱动程序在将查询字符串发送到数据库时显然对其进行了编码.即使该字段是 Unicode,如果您提供 Unicode,在某些情况下似乎也无所谓.
The core thing is that odbc driver apparently encodes the query string when sending it to the DB. Even if the field is Unicode, and if you provide Unicode, in some cases it does not seem to matter.
您需要确保驱动程序发送的内容与您的数据库(不仅是服务器,还有数据库)具有相同的编码.否则,当然你会得到时髦的字符,因为客户端或服务器在编码/或解码时会混淆.您是否知道您的服务器用作解码数据的默认字符集(MS 喜欢说的代码点)?
You need to ensure that what is sent by the driver has the same encoding as your Database (not only server, but also database). Otherwise, of course you get funky characters because either the client or the server is mixing things up when encoding/or decoding. Do you have any idea of the charset (codepoint as MS like to say) that your server is using as a default for decoding data?
例如,请参见该 MS 页面.对于 Unicode 字段,排序规则仅用于定义列中的排序顺序,不指定数据的存储方式.
See that MS page for example. For Unicode fields, collation is used only to define the sort order in the column, not to specify how the data is stored.
如果您将数据存储为 Unicode,则有一种独特的方式来表示它,这就是 Unicode 的目的:无需定义与您将要使用的所有语言兼容的字符集 :)
If you store your data as Unicode, there is an Unique way to represent it, that's the purpose of Unicode: no need to define a charset that is compatible with all the languages that you are going to use :)
这里的问题是当我向服务器提供不是 Unicode 的数据时会发生什么?".例如:
The question here is "what happens when I give data to the server that is not Unicode?". For example:
- 当我向服务器发送一个 UTF-8 字符串时,它是如何理解的?
- 当我向服务器发送一个 UTF-16 字符串时,它是如何理解的?
- 当我向服务器发送一个 Latin1 字符串时,它是如何理解的?
从服务器的角度来看,所有这 3 个字符串都只是一个字节流.服务器无法猜测您对它们进行编码的编码.这意味着如果您的 odbc 客户端最终将 bytestrings(一个编码字符串)发送到服务器而不是发送 unicode 数据,您会遇到麻烦:如果你这样做,服务器将使用预定义的编码(这是我的问题:服务器将使用什么编码?因为它不是猜测,它必须是一个参数值),如果字符串已经使用不同的编码进行编码, dzing,数据会损坏.
From the server perspective, all these 3 strings are only a stream of bytes. The server cannot guess the encoding in which you encoded them. Which means that you will get troubles if your odbc client ends up sending bytestrings (an encoded string) to the server instead of sending unicode data: if you do so, the server will use a predefined encoding (that was my question: what encoding the server will use? Since it is not guessing, it must be a parameter value), and if the string had been encoded using a different encoding, dzing, data will get corrupted.
这与在 Python 中所做的完全相似:
It's exactly similar as doing in Python:
uni = u'Hey my name is André'
in_utf8 = uni.encode('utf-8')
# send the utf-8 data to server
# send(in_utf8)
# on server side
# server receives it. But server is Japanese.
# So the server treats the data with the National charset, shift-jis:
some_string = in_utf8 # some_string = receive()
decoded = some_string.decode('sjis')
试试吧.很有趣.解码后的字符串应该是嘿,我的名字是安德烈",但实际上是嘿,我的名字是安德鲁テま".é 被日文 テる
Just try it. It's fun. The decoded string is supposed to be "Hey my name is André", but is "Hey my name is Andrテゥ". é gets replaced by Japanese テゥ
因此我的建议是:您需要确保 pyodbc 能够以 Unicode 格式直接发送数据.如果pyodbc没有做到这一点,你会得到意想不到的结果.
Hence my suggestion: you need to ensure that pyodbc is able to send directly the data as Unicode. If pyodbc fails to do this, you will get unexpected results.
并且我以Client to Server的方式描述了这个问题.但是当从服务器返回到客户端时,也会出现同样的问题.如果客户端无法理解 Unicode 数据,您很可能会遇到麻烦.
And I described the problem in the Client to Server way. But the same sort of issues can arise when communicating back from the Server to the Client. If the Client cannot understand Unicode data, you'll likely get into troubles.
实际上,FreeTDS 会为您处理一切并将所有数据转换为 UCS2 Unicode.(来源).
Actually, FreeTDS takes care of things for you and translates all the data to UCS2 unicode. (Source).
- 服务器<-->FreeTDS:UCS2 数据
- FreeTDS <-->pyodbc : 编码字符串,以 UTF-8 编码(来自
/etc/freetds/freetds.conf
)
因此,如果您将 UTF-8 数据传递给 pyodbc,我希望您的应用程序能够正常工作.事实上,正如这个 django-pyodbc 票 所述,django-pyodbc 以 UTF-8 与 pyodbc 通信,所以你应该没问题.
So I would expect your application to work correctly if you pass UTF-8 data to pyodbc. In fact, as this django-pyodbc ticket states, django-pyodbc communicates in UTF-8 with pyodbc, so you should be fine.
但是,cramm0 说 FreeTDS0.82 并非完全没有错误,0.82 与官方修补的 0.82 版本之间存在显着差异,可以在这里找到.您应该尝试使用打过补丁的 FreeTDS
However, cramm0 says that FreeTDS 0.82 is not completely bugfree, and that there are significant differences between 0.82 and the official patched 0.82 version that can be found here. You should probably try using the patched FreeTDS
已编辑:删除了与 FreeTDS 无关但仅与 Easysoft 商业 odbc 驱动程序相关的旧数据.对不起.
相关文章