使用 sql/plsql,你如何找出文本使用的字符集?
我有一个 Oracle 数据库,用于存储来自世界各地的不同语言的文档内容.文档存储在带有 BLOB 列的表中,该列存储文档的内容.
I have an Oracle db which stores the content of documents originating from all over the world, with different languages. The documents are stored in a table with a BLOB column which stores the documents' content.
我想通过 Oracle 过程找出每个文档的字符集.我不想使用实用程序 CSSCAN,因为您似乎必须在程序之外的单独会话中使用它.
I want to find out what the char set is for every doc, with an Oracle procedure. I don't want to use the utility CSSCAN since it seems you have to use it in a separate session, outside of your procedure.
感谢您的帮助!
推荐答案
Oracle Globalization Development Kit 可以检测字符集.
GDK 包含在 Oracle 中,但默认情况下未安装在数据库中.要将 .jar 文件加载到数据库中,请在 Oracle 中找到 jlib 目录home 并运行此操作系统命令:
The GDK is included with Oracle but it is not installed in the database by default. To load the .jar files into the database find the jlib directory in the Oracle home and run this operating system command:
loadjava -u USER_NAME@SID orai18n.jar orai18n-collation.jar orai18n-lcsd.jar orai18n-mapping.jar orai18n-net.jar orai18n-servlet.jar orai18n-tools.jar orai18n-translation.jar orai18n-utility.jar
需要一些额外的 Java 权限,即使您的用户具有 DBA.运行此命令,然后重新连接:
Some extra Java privileges are needed, even if your user has DBA. Run this command and then re-connect:
exec dbms_java.grant_permission( 'YOUR_USER_NAME', 'SYS:java.lang.RuntimePermission', 'getClassLoader', '' );
创建一个 Java 类来进行检测.下面是一个非常简单的例子,它返回一个字符串的最佳猜测:
Create a Java class to do the detection. Below is a very simple example that returns the best guess for a string:
create or replace and compile java source named "Character_Set_Detector"
as
import oracle.i18n.lcsd.*;
import java.sql.*;
import java.io.IOException;
public class Character_Set_Detector
{
public static String detect(Blob some_blob) throws SQLException, IOException
{
LCSDetector detector = new LCSDetector();
detector.detect(some_blob.getBinaryStream());
LCSDResultSet detector_results = detector.getResult();
return detector_results.getORACharacterSet();
}
}
/
将 Java 类包装在 PL/SQL 函数中:
Wrap the Java class in a PL/SQL function:
--Wrap the Java class in a PL/SQL function:
create or replace function detect_character_set(some_blob blob)
return varchar2
as language java
name 'Character_Set_Detector.detect(java.sql.Blob) return java.lang.String';
/
我通过将字符串翻译成不同的语言、使用文本编辑器将文本保存为不同的编码、使用十六进制编辑器打开文件并将十六进制转换为 BLOB 来模拟不同的字符集:
I simulated different character sets by translating a string into different languages, saving the text as different encodings with a text editor, opening the file with hex editor, and converting the hex into a BLOB:
--UTF8
--The quick brown fox jumps over the lazy dog
select 1 id, detect_character_set(hextoraw('54686520717569636b2062726f776e20666f78206a756d7073206f76657220746865206c617a7920646f67')) character_set from dual union all
--Western European (ISO-8859-1)
--El zorro marrón rápido salta sobre el perro perezoso
select 2 id, detect_character_set(hextoraw('456c207a6f72726f206d617272f36e2072e17069646f2073616c746120736f62726520656c20706572726f20706572657a6f736f')) from dual union all
--Chinese Simplified (GBK)
--敏捷的棕色狐狸跳过懒狗
select 3 id, detect_character_set(hextoraw('c3f4bdddb5c4d7d8c9abbafcc0eaccf8b9fdc0c1b9b7')) from dual union all
--Western European (Windows-1252)
--Der schnelle braune Fuchs springt über den faulen Hund
select 4 id, detect_character_set(hextoraw('446572207363686e656c6c6520627261756e6520467563687320737072696e677420fc6265722064656e206661756c656e2048756e64')) from dual union all
--Cyrillic (KOI8-R)
--Быстрая коричневая лиса прыгает через ленивую собаку
select 5 id, detect_character_set(hextoraw('e2d9d3d4d2c1d120cbcfd2c9decec5d7c1d120ccc9d3c120d0d2d9c7c1c5d420dec5d2c5da20ccc5cec9d7d5c020d3cfc2c1cbd5')) from dual;
ID CHARACTER_SET
-- -------------
1 US7ASCII
2 WE8ISO8859P1
3 ZHS16CGB231280
4 WE8ISO8859P1
5 CL8KOI8R
这个简单的例子效果很好,但我不知道它对现实世界的文件效果如何.GDK中有很多功能,上面的代码只是一个简单的起点.只需稍作改动,代码也可以检测语言,如我的回答此处所示.
That trivial example works well but I don't know how well it will work with real-world files. There are a lot of features in the GDK, the above code is only a simple starting point. With only minor changes the code can also detect languages as demonstrated in my answer here.
相关文章