How can I get the length of Japanese characters in JavaScript?

2022-01-16 00:00:00 unicode cjk asp-classic javascript shift-jis

I have an ASP Classic page that uses the SHIFT_JIS charset. The meta tag in the page's head section looks like this:

<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">

My page has a text box (txtName) that should allow at most 200 characters. I have a JavaScript function that validates the character length; it is called from the onclick() event of my Submit button.

if (document.frmPage.txtName.value.length > 200) {
    alert("You have exceeded the maximum length of 200.");
    return false;
}

The problem is that JavaScript does not report the length of Japanese characters as encoded in SHIFT_JIS. For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding that JavaScript uses by default. Some characters, such as ケ, take 2 or 3 characters in SHIFT_JIS.

If I rely only on the length that JavaScript reports, long Japanese strings pass the page validation and the page tries to save them to the database, which then fails because the DB column has a maximum length of 200.

The browser I'm using is Internet Explorer. Is there a way to get the SHIFT_JIS length of a Japanese character using JavaScript? Is it possible to convert from Unicode to SHIFT_JIS using JavaScript? How?

Thanks for your help!

Recommended answer

"For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding"

Let's be clear: 测, U+6D4B (Han character meaning 'measure, estimate, conjecture'), is a single character. When you encode it in a particular encoding such as Shift-JIS, it may well become multiple bytes.
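To make the distinction concrete, here is a small sketch of what JavaScript's length property actually counts. It reports UTF-16 code units, not bytes in any page encoding, so 测 counts as 1 regardless of the charset declared in the meta tag. The variable names are illustrative only.

```javascript
var ch = '\u6D4B'; // 测
console.log(ch.length);                      // 1
console.log(ch.charCodeAt(0).toString(16));  // 6d4b

// Characters outside the Basic Multilingual Plane are stored as a
// surrogate pair, so length reports 2 for one visible character.
var astral = '\uD842\uDFB7'; // U+20BB7
console.log(astral.length);                  // 2
```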

In general, JavaScript doesn't make encoding tables available, so you can't find out how many bytes a character will take up. If you really need to, you have to carry around enough data to work it out yourself. For example, if you assume that the input contains only characters that are valid in Shift-JIS, the following function works out how many bytes are needed by keeping a list of all the characters that occupy a single byte and assuming that every other character takes two bytes:

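function getShiftJISByteLength(s) {
    // Every character outside the single-byte set (ASCII range and half-width katakana) counts as two bytes.
    return s.replace(/[^\x00-\x80｡｢｣､･ｦｧｨｩｪｫｬｭｮｯｰｱｲｳｴｵｶｷｸｹｺｻｼｽｾｿﾀﾁﾂﾃﾄﾅﾆﾇﾈﾉﾊﾋﾌﾍﾎﾏﾐﾑﾒﾓﾔﾕﾖﾗﾘﾙﾚﾛﾜﾝﾞﾟ]/g, 'xx').length;
}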
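As a rough sketch of how such a byte count could stand in for the .length check from the question, the snippet below restates the same counting rule with the half-width katakana list written as the equivalent range U+FF61 to U+FF9F. The helper name fitsInShiftJIS is hypothetical, and the count is only meaningful under the stated assumption that the input contains only characters that exist in Shift-JIS.

```javascript
// Same rule as above: single-byte characters count once, all others twice.
// \uFF61-\uFF9F is the half-width katakana block.
function getShiftJISByteLength(s) {
    return s.replace(/[^\x00-\x80\uFF61-\uFF9F]/g, 'xx').length;
}

// Hypothetical helper: true when the value fits in maxBytes Shift-JIS bytes.
function fitsInShiftJIS(value, maxBytes) {
    return getShiftJISByteLength(value) <= maxBytes;
}

console.log(getShiftJISByteLength('abc'));     // 3
console.log(getShiftJISByteLength('\uFF79'));  // 1 (half-width ｹ)
console.log(getShiftJISByteLength('\u30B1'));  // 2 (full-width ケ)
```

In the page's validation, the condition could then become something like !fitsInShiftJIS(document.frmPage.txtName.value, 200), with the server still enforcing the limit itself.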

However, there are no 8-byte sequences in Shift-JIS, and the character 测 is not available in Shift-JIS at all. (It's a Chinese character not used in Japan.)

The reason you might think it constitutes an 8-byte sequence is this: when a browser can't submit a character in a form because it does not exist in the target charset, it replaces it with an HTML character reference, in this case &#27979;. This is a lossy mangling: you can't tell whether the user literally typed &#27979; or typed 测. And if you are displaying the submitted content &#27979; as 测, that means you are forgetting to HTML-encode your output, which probably means your application is highly vulnerable to cross-site scripting.
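The following sketch illustrates where the figure 8 comes from and what HTML-encoding the output would look like. Both helper names are made up for illustration; toCharacterReference only mimics the decimal reference a browser substitutes and is not a browser API, and escapeHTML is a minimal escaper for element content and double-quoted attributes.

```javascript
// Decimal character reference for a single BMP character (illustrative).
function toCharacterReference(ch) {
    return '&#' + ch.charCodeAt(0) + ';';
}

// Minimal HTML escaping; the ampersand must be replaced first.
function escapeHTML(s) {
    return s.replace(/&/g, '&amp;')
            .replace(/</g, '&lt;')
            .replace(/>/g, '&gt;')
            .replace(/"/g, '&quot;');
}

var ref = toCharacterReference('\u6D4B');
console.log(ref.length);       // 8
console.log(escapeHTML(ref));  // the reference with its leading & escaped
```

With escaping in place, the submitted reference is shown to the user as literal text instead of being reinterpreted as 测. On the ASP Classic side, Server.HTMLEncode is the usual place to do this.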

The only sensible answer is to use UTF-8 instead of Shift-JIS. UTF-8 can happily encode 测, or any other character, without having to resort to broken HTML character references. If you need to store content limited by encoded byte length in your database, there is a sneaky hack you can use to get the number of UTF-8 bytes in a string:

function getUTF8ByteLength(s) {
    return unescape(encodeURIComponent(s)).length;
}
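For readers wondering why the hack works: encodeURIComponent emits one %XX escape per UTF-8 byte, and unescape turns each escape back into a single character, so the resulting length equals the byte count. The sketch below shows this next to a hand-written counter, countUTF8Bytes, which is a hypothetical alternative that avoids the deprecated unescape. It assumes well-formed input; an unpaired surrogate is counted as 3 bytes here, whereas encodeURIComponent would throw.

```javascript
function getUTF8ByteLength(s) {
    return unescape(encodeURIComponent(s)).length;
}

// Hypothetical alternative: count UTF-8 bytes from UTF-16 code units.
function countUTF8Bytes(s) {
    var bytes = 0;
    for (var i = 0; i < s.length; i++) {
        var c = s.charCodeAt(i);
        if (c < 0x80) {
            bytes += 1;
        } else if (c < 0x800) {
            bytes += 2;
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length &&
                   s.charCodeAt(i + 1) >= 0xDC00 && s.charCodeAt(i + 1) <= 0xDFFF) {
            bytes += 4; // surrogate pair encodes one 4-byte character
            i++;        // skip the low surrogate
        } else {
            bytes += 3;
        }
    }
    return bytes;
}

console.log(encodeURIComponent('\u6D4B'));  // %E6%B5%8B
console.log(getUTF8ByteLength('\u6D4B'));   // 3
console.log(countUTF8Bytes('\u6D4B'));      // 3
```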

That said, it would probably be better to store native Unicode strings in the database, so that the length limit refers to actual characters rather than bytes in some encoding.
