JavaScript equivalent of Java's String.getBytes("UTF-8")

2022-06-28 00:00:00 byte utf-8 javascript java utf-16

I have this string in Java:

"test.message"

byte[] bytes = plaintext.getBytes("UTF-8");
//result: [116, 101, 115, 116, 46, 109, 101, 115, 115, 97, 103, 101]

If I do the same thing in JavaScript:

    stringToByteArray: function (str) {         
        str = unescape(encodeURIComponent(str));

        var bytes = new Array(str.length);
        for (var i = 0; i < str.length; ++i)
            bytes[i] = str.charCodeAt(i);

        return bytes;
    },

I get:

 [7,163,140,72,178,72,244,241,149,43,67,124]

I was under the impression that unescape(encodeURIComponent()) would correctly convert the string to UTF-8. Is that not the case?

Reference:

http://ecmanaut.blogspot.be/2006/07/encoding-decoding-utf8-in-javascript.html

Solution

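JavaScript has no concept of a character encoding for strings; every string is a sequence of UTF-16 code units. For ASCII characters (U+0000 to U+007F) the UTF-16 code unit has the same value as the single UTF-8 byte, so for such text you can forget that there is any difference.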
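To make the ASCII versus non-ASCII distinction concrete, here is a minimal sketch that compares UTF-16 code units with UTF-8 bytes. The helper names codeUnits and utf8Bytes are illustrative and not part of the original answer; the sketch assumes an environment that provides the standard TextEncoder (modern browsers and current Node.js).

```javascript
// Collect the UTF-16 code units of a string.
function codeUnits(str) {
    var units = [];
    for (var i = 0; i < str.length; ++i)
        units.push(str.charCodeAt(i));
    return units;
}

// Collect the UTF-8 bytes of a string (TextEncoder always encodes to UTF-8).
function utf8Bytes(str) {
    return Array.from(new TextEncoder().encode(str));
}

var asciiUnits = codeUnits("test");     // [116, 101, 115, 116]
var asciiBytes = utf8Bytes("test");     // [116, 101, 115, 116]

var accentUnits = codeUnits("\u00e9");  // "é" is one code unit: [233]
var accentBytes = utf8Bytes("\u00e9");  // but two UTF-8 bytes: [195, 169]
```

For ASCII input the two arrays are identical; once a character falls outside that range they diverge.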

There are better ways to do this, but for ASCII-only input the following works:

function s(x) {return x.charCodeAt(0);}
"test.message".split('').map(s);
// [116, 101, 115, 116, 46, 109, 101, 115, 115, 97, 103, 101]
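One of those better ways, sketched here under the assumption that the standard TextEncoder is available, is to encode the string directly. The function name getBytesUtf8 and its signed flag are hypothetical, introduced only for illustration. Java's byte type is signed, so the flag reinterprets the same bytes as values from -128 to 127 to match what Java would print for well-formed strings.

```javascript
// Hypothetical helper: the UTF-8 bytes of str as a plain array.
// Pass signed = true to get Java-style signed byte values.
function getBytesUtf8(str, signed) {
    var encoded = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
    var view = signed
        ? new Int8Array(encoded.buffer, encoded.byteOffset, encoded.byteLength)
        : encoded;
    return Array.from(view);
}

var messageBytes = getBytesUtf8("test.message");
// [116, 101, 115, 116, 46, 109, 101, 115, 115, 97, 103, 101]

var euroSigned = getBytesUtf8("\u20ac", true);
// "€" is E2 82 AC in UTF-8, i.e. [-30, -126, -84] as signed bytes
```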

So what is unescape(encodeURIComponent(str)) doing? Let's look at each function in turn.

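encodeURIComponent converts every character in str that is illegal or has a special meaning in URI syntax into its URI-escaped form, so that it can safely be used as a key or value in the search component of a URI, e.g. encodeURIComponent('&='); // "%26%3D". Note that this is now a 6-character string. Non-ASCII characters are escaped as the bytes of their UTF-8 encoding, one %XX sequence per byte.
unescape is actually deprecated, but it does a job similar to decodeURI or decodeURIComponent (the reverse of encodeURIComponent). If we look at the ES5 spec, we can see: "11. Let c be the character whose code unit value is the integer represented by the four hexadecimal digits at positions k+2, k+3, k+4, and k+5 within Result(1)."
So 4 hex digits make up 2 bytes, that is, one 16-bit code unit; that step covers the %uXXXX form, while each %XX escape produced by encodeURIComponent carries two hex digits and becomes one code unit in the range 0 to 255. As mentioned, all strings are UTF-16, so the result is not "UTF-8" as such: it is a UTF-16 string in which each code unit is limited to the value of one UTF-8 byte.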
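The two steps above can be traced on a single non-ASCII character. This is an illustrative sketch only; the variable names are not from the original answer, and it relies on the legacy unescape and escape functions still being provided by the runtime.

```javascript
var original = "\u00e9";                     // "é", a single UTF-16 code unit (233)

// Step 1: percent-escape the UTF-8 bytes of the character.
var escaped = encodeURIComponent(original);  // "%C3%A9"

// Step 2: turn each %XX escape into one code unit holding that byte value.
var byteString = unescape(escaped);          // 2 code units: 195 and 169

var bytes = [];
for (var i = 0; i < byteString.length; ++i)
    bytes.push(byteString.charCodeAt(i));
// bytes is [195, 169], the UTF-8 encoding of "é"

// Reversing the two steps recovers the original string.
var roundTrip = decodeURIComponent(escape(byteString));
```

So for a well-formed string the trick does yield the UTF-8 byte values; they are simply stored one per UTF-16 code unit.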
