What exactly do the "u" and "r" string flags do, and what are raw string literals?

2022-01-29 00:00:00 python unicode python-2.x rawstring

Question

While asking this question, I realized I didn't know much about raw strings. For somebody claiming to be a Django trainer, this sucks.

I know what an encoding is, and I know what u'' alone does, since I understand what Unicode is.

  • But what does r'' do exactly? What kind of string does it result in?

And above all, what the heck does ur'' do?

Finally, is there any reliable way to go back from a Unicode string to a simple raw string?

Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?


Answer

There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.

A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal) -- there are no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.
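The difference is easy to check interactively. The following is a minimal sketch (the variable names are only illustrative); it behaves the same way under Python 2 and Python 3, because it only compares lengths and equality.

```python
# A normal literal interprets escape sequences; a raw literal does not.
normal = 'a\tb'      # 3 characters: 'a', TAB, 'b'
raw = r'a\tb'        # 4 characters: 'a', backslash, 't', 'b'
doubled = 'a\\tb'    # the same 4 characters, written with a doubled backslash

print(len(normal), len(raw))        # lengths 3 and 4
print(raw == doubled)               # True: two spellings of one string
print(type(raw) is type(normal))    # True: the r prefix does not change the type

# The "except" clause: a backslash right before a quote keeps that quote
# from ending the literal, but the backslash itself stays in the string.
quoted = r'it\'s'
print(len(quoted))                  # 5: i, t, backslash, quote, s
```

Once the literal has been parsed there is no trace of which spelling was used; `raw` and `doubled` are indistinguishable.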

This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them -- that's all. It also gained some popularity for expressing native Windows file paths (with backslashes instead of the regular slashes used on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).
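Both use cases can be sketched briefly. The pattern, the sample text, and the `C:\temp` path below are made up for illustration; only the standard `re` module is used.

```python
import re

# Without the raw prefix, every regex backslash has to be doubled.
pattern_plain = '\\d+\\.\\d+'
pattern_raw = r'\d+\.\d+'
print(pattern_plain == pattern_raw)   # True: same pattern, two spellings

match = re.search(pattern_raw, 'version 2.6 released')
print(match.group(0))                 # 2.6

# The imperfection with Windows paths: a raw literal cannot end in a single
# backslash, so a path with a trailing separator needs a workaround such as
# appending a normal (doubled) backslash.
folder = r'C:\temp' + '\\'
print(folder)
```

Writing the trailing backslash inside the raw literal itself would make the closing quote part of the string and leave the literal unterminated, which is why the last character is added separately here.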

r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).
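A quick check that the quoting style has no effect on the result. This sketch assumes it may be run under Python 3, where an r'...' literal is a Unicode string rather than a byte string and ur'...' is no longer valid syntax; the br'...' prefix is used to get raw byte-string literals on either version.

```python
# Four quoting styles, one raw-literal syntax: all yield the same string.
raw_text = [r'a\nb', r'''a\nb''', r"a\nb", r"""a\nb"""]
print(all(s == raw_text[0] for s in raw_text))        # True
print(len(set(type(s) for s in raw_text)))            # 1: a single type

# The same holds for raw byte-string literals.
raw_bytes = [br'a\nb', br'''a\nb''', br"a\nb", br"""a\nb"""]
print(all(b == raw_bytes[0] for b in raw_bytes))      # True
print(len(raw_bytes[0]))                              # 4: a, backslash, n, b
```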

Not sure what you mean by "going back": there are no intrinsic backward and forward directions, because there's no raw string type. It's just an alternative syntax for expressing perfectly normal string objects, byte or unicode as they may be.

And yes, in Python 2.*, u'...' is of course always distinct from just '...' -- the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.
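To make the distinction concrete, here is a small sketch that spells out both types explicitly with the u and b prefixes, so it also runs under Python 3 (where an unprefixed '...' is already Unicode). The sample strings are arbitrary.

```python
text = u'ciao'    # Unicode string
data = b'ciao'    # byte string

print(type(text) is type(data))        # False: two different types
print(text.encode('utf-8') == data)    # True: encoding maps text to bytes
print(data.decode('utf-8') == text)    # True: decoding maps bytes to text

# The encoding only matters when converting between the two types.
accented = u'citt\u00e0'               # 5 characters, the last one non-ASCII
print(len(accented))                          # 5
print(len(accented.encode('utf-8')))          # 6 bytes
print(len(accented.encode('latin-1')))        # 5 bytes
```

The same five-character Unicode string yields byte strings of different lengths depending on the encoding chosen, which is the sense in which the encoding is orthogonal to the type of the literal.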

E.g., consider (Python 2.6):

>>> import sys
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34

The Unicode object of course takes more memory space (a very small difference for a very short string, obviously ;-).
