How to support UTF-8 (Japanese, Arabic, Spanish, ...) URLs in PHP

2022-01-18 00:00:00 utf-8 internationalization php

For a web application, we need to link to user-generated content. A user types in a title, e.g. for a product, and we generate an SEO-friendly URL for that product:

Like this:

title: a nice product

www.user.com/product/a-nice-product

title: أبجد هوز

www.user.com/product/أبجد هوز

The problem is that these foreign-language URLs aren't supported, and the browser refuses to open the links. I've seen WordPress setups that support this kind of URL, so I guess it's possible to do this.

Does anyone know how we should support this in PHP?

Wikipedia handles this well: http://ar.wikipedia.org

Recommended answer

Although a URL itself only allows US-ASCII characters, you can use Unicode characters in the URI path if you encode them as UTF-8 and then convert the resulting bytes to US-ASCII characters using percent-encoding:

"A system that internally provides identifiers in the form of a different character encoding, such as EBCDIC, will generally perform character translation of textual identifiers to UTF-8 [STD63] (or some other superset of the US-ASCII character encoding) at an internal interface, thereby providing more meaningful identifiers than those resulting from simply percent-encoding the original octets."

So you can do something like this (assuming the title is UTF-8):

$title = 'أبجد هوز';
$path = '/product/'.rawurlencode($title);
echo $path;  // "/product/%D8%A3%D8%A8%D8%AC%D8%AF%20%D9%87%D9%88%D8%B2"
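
The question's first example turns "a nice product" into "a-nice-product", so a slug step may be wanted before the encoding. The following is only a sketch under assumptions: slugify_title is a hypothetical helper name, the input is assumed to be UTF-8, and the rule of collapsing everything except Unicode letters, combining marks, and digits into a hyphen is one possible choice, not something the answer prescribes.

```php
<?php
// Hypothetical helper (not part of PHP): builds a percent-encoded
// path segment from a UTF-8 title.
function slugify_title(string $title): string
{
    // Collapse every run of characters that is not a Unicode letter,
    // combining mark, or digit into a single hyphen. The /u modifier
    // makes PCRE treat the subject as UTF-8.
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $title);

    // With /u, preg_replace() returns null when the input is not valid UTF-8.
    if ($slug === null) {
        throw new InvalidArgumentException('Title is not valid UTF-8');
    }

    // Drop hyphens left over at either end, then percent-encode the rest.
    return rawurlencode(trim($slug, '-'));
}

echo '/product/' . slugify_title('a nice product'), "\n";
// "/product/a-nice-product"
echo '/product/' . slugify_title('أبجد هوز'), "\n";
// "/product/%D8%A3%D8%A8%D8%AC%D8%AF-%D9%87%D9%88%D8%B2"
```

Because the space is replaced by a hyphen before encoding, no %20 appears in the result; the Arabic letters are still percent-encoded byte by byte.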

Although the URI path is actually percent-encoded, most modern browsers display the characters that the sequence represents as Unicode when UTF-8 is used.
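
On the receiving side, the application sees the percent-encoded form and has to reverse it to look the product up. A minimal sketch, assuming the request path is available as a string (here a hard-coded stand-in for the path part of $_SERVER['REQUEST_URI']) and using a hypothetical helper named decode_last_segment:

```php
<?php
// Hypothetical stand-in for the path part of $_SERVER['REQUEST_URI'].
$requestPath = '/product/%D8%A3%D8%A8%D8%AC%D8%AF%20%D9%87%D9%88%D8%B2';

// Hypothetical helper: returns the decoded last path segment,
// or null if the decoded bytes are not valid UTF-8.
function decode_last_segment(string $path): ?string
{
    // Split first and decode afterwards, so an encoded slash (%2F)
    // inside a segment is not mistaken for a path separator.
    $segments = explode('/', trim($path, '/'));
    $decoded = rawurldecode(end($segments));

    // An empty pattern with the /u modifier only matches valid UTF-8.
    return preg_match('//u', $decoded) === 1 ? $decoded : null;
}

$title = decode_last_segment($requestPath);
echo $title === null ? 'invalid UTF-8' : $title, "\n";
// "أبجد هوز"
```

Rejecting input that does not decode to valid UTF-8 is a design choice made here for illustration; an application could equally respond with a 404.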
