Converting a MySQL table with wrongly encoded data to UTF-8

2022-01-07 · character-encoding, mysql

I've got a big ol' MySQL 5.1 database, and for a variety of stupid reasons, I've been storing, I believe, UTF8 characters encoded as LATIN1 in a UTF8 table. It's... strange. And I'd like to fix it.

The MySQL - Convert latin1 characters on a UTF8 table into UTF8 question seems to work -- a column at a time. But I have 24 tables and dozens of columns to convert. I'm really looking for a solution that'll convert at least a table at once.

For reference, the single-column solution that works for me is:

UPDATE foo SET col1 = CONVERT(CAST(CONVERT(col1 USING latin1) AS binary) USING utf8);
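
For what it's worth, the same expression can be previewed with a SELECT before committing to the UPDATE (a sketch, using the same hypothetical foo/col1 names as above):

SELECT col1,
       CONVERT(CAST(CONVERT(col1 USING latin1) AS binary) USING utf8) AS col1_fixed
FROM foo
LIMIT 20;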

For tables, I can do:

ALTER TABLE foo CONVERT TO CHARACTER SET latin1;
ALTER TABLE foo CONVERT TO CHARACTER SET binary;
ALTER TABLE foo CHARACTER SET utf8  COLLATE utf8_unicode_ci;

which gets me very close -- however, the CONVERT TO CHARACTER SET binary step turns all my VARCHAR columns into VARBINARY and my TEXT columns into BLOBs in one fell swoop. I can go through and change them back and all appears to be well... but then I'm back in the "let's modify all the columns individually" world -- in which case, I may just as well use the single-column UPDATE above.
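
"Changing them back" means one MODIFY per column, restating each original definition by hand -- roughly this sketch, with hypothetical column names and lengths:

ALTER TABLE foo MODIFY col1 VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE foo MODIFY col2 TEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci;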

I've tried about 50 variations on those SQL statements, but I can't find one that both leaves my columns in character data types and encodes the data properly.

Any suggestions?

Update: Deciding to just fix the columns rather than waiting for a database- or table-wise solution, I came up with:

#!/usr/bin/env ruby
require 'rubygems'
require 'mysql2'

CONNECT_OPTS = {} # whatever you want
Mysql2::Client.default_query_options.merge!(:as => :array)
conn = Mysql2::Client.new(CONNECT_OPTS)

tables = conn.query("SHOW TABLES").map {|row| row[0] }

# See http://dev.mysql.com/doc/refman/5.0/en/charset-column.html
# One might want to include enum and set columns; I don't have them
TYPES_TO_CONVERT = %w(char varchar text)
tables.each do |table|
  puts "converting #{table}"
  # Get all the columns and we'll filter for the ones we want
  columns = conn.query("DESCRIBE #{table}")
  columns_to_convert = columns.find_all {|row|
    TYPES_TO_CONVERT.include? row[1].gsub(/\(\d+\)/, '')  # strip the length, e.g. "varchar(255)" -> "varchar"
  }.map {|row| row[0]}
  next if columns_to_convert.empty?

  query = "UPDATE `#{table}` SET "
  query += columns_to_convert.map {|col|
    "`#{col}` = convert(cast(convert(`#{col}` using latin1) as binary) using utf8)"
  }.join ", "
  puts query
  conn.query query
end

... which gets the job done. Amusingly, this runs on my database in 36 seconds, rather than the ALTER TABLE route which took 13 minutes (and had the VARBINARY problem) or the mysqldump solutions which would take upwards of twenty minutes, assuming I could get them to run.

I'll still accept an answer if someone knows an elegant way to do this for a whole database or table in one step.

Accepted Answer

This method below looks really promising & better yet, beautiful in its simplicity. The idea is you mysqldump your entire database as latin1, & then import it re-encoded as utf-8.

Export:

mysqldump -u [user] -p --opt --quote-names --skip-set-charset --default-character-set=latin1 [database] > dump.sql

Import:

mysql -u [user] -p --default-character-set=utf8 [database] < dump.sql
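
After the import, one way to sanity-check the result is to ask information_schema which character set each column ended up with (a sketch; substitute your schema name for [database]):

SELECT table_name, column_name, character_set_name
FROM information_schema.columns
WHERE table_schema = '[database]'
  AND character_set_name IS NOT NULL;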

I take no credit for this solution, it's completely from Gareth Price's blog. It has worked for everyone who has left him a comment so far: "Wow man you just saved my life. I did not spent 2 hours on this, but 2 days" caught my attention.

Update #1: Looks like Gareth wasn't the first to discover this.

Update #2: I just tried this & it worked beautifully for my UTF8-stored-as-latin1 database. Just make sure you switch the default charset on your database to utf8 before importing, or else you'll end up with plain question marks where the special characters were. Of course this might have plenty of other ramifications so test like hell first.

ALTER SCHEMA [database] DEFAULT CHARACTER SET utf8;

And if you have any tables that aren't set to the schema default:

ALTER TABLE [table] CHARACTER SET = DEFAULT;

(same idea if you have any column-specific charset settings: you'll have to do an ALTER TABLE [table] CHANGE COLUMN [settings] without specifying CHARACTER SET so it goes back to the table default)
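
For instance, a hypothetical TEXT column named body could be dropped back to the table default by restating its definition with no CHARACTER SET clause:

ALTER TABLE [table] CHANGE COLUMN body body TEXT;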
