What is the smartest way to replace different newline styles in PHP?

2021-12-25 00:00:00 replace newline php unify

I have a text which might have different newline styles. I want to replace all newlines ('\r\n', '\n', '\r') with the same newline (in this case \r\n).

What's the fastest way to do this? My current solution looks like this, which is pretty ugly:

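    // Protect existing CRLF pairs with a placeholder
    $sNicetext = str_replace("\r\n", '%%%%somthing%%%%', $sNicetext);
    // Convert the remaining lone CR and LF characters
    $sNicetext = str_replace(array("\r", "\n"), array("\r\n", "\r\n"), $sNicetext);
    // Restore the protected pairs
    $sNicetext = str_replace('%%%%somthing%%%%', "\r\n", $sNicetext);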

The problem is that you can't do this with one replace, because the \r\n will be duplicated to \r\n\r\n.
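To see the problem concretely, here is a small sketch of the naive single call. The sample string is made up for illustration. `str_replace` works through an array of search strings one after another, so the `\n` written while replacing `\r` is itself rewritten in the second pass.

```php
<?php
// Hypothetical sample: one Windows-style line break between two letters.
$input = "a\r\nb";

// Naive attempt: replace CR and LF in a single call.
// The search array is processed sequentially, so the "\n" produced
// for "\r" is replaced again when "\n" is searched for.
$naive = str_replace(array("\r", "\n"), "\r\n", $input);

var_dump($naive === $input); // bool(false): the line break was mangled
var_dump(bin2hex($naive));   // hex dump makes the extra CR/LF bytes visible
```

The single CRLF comes out as CR, CRLF, CRLF, which is why the placeholder workaround above is needed when sticking to `str_replace`.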

Thanks for any help!

Recommended answer

    $string = preg_replace('~\R~u', "\r\n", $string);

If you don't want to replace all Unicode newlines but only CRLF-style ones, use:

    $string = preg_replace('~(*BSR_ANYCRLF)\R~', "\r\n", $string);

\R matches these newlines; u is a modifier that treats the pattern and the input string as UTF-8.
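As a minimal sketch, the one-liner can be wrapped in a small helper. The function name `normalize_newlines` and its `$eol` parameter are made up for this example, and `$eol` is assumed to be a plain newline string without `$` or backslash characters, which `preg_replace` would interpret as back-references.

```php
<?php
/**
 * Hypothetical helper: convert every newline sequence in $text to $eol.
 * preg_replace returns null on error (for example when $text is not
 * valid UTF-8 and the u modifier is set); in that case the input is
 * returned unchanged.
 */
function normalize_newlines(string $text, string $eol = "\r\n"): string
{
    // \R matches CRLF as one unit, so existing pairs are not doubled.
    $result = preg_replace('~\R~u', $eol, $text);

    return $result ?? $text;
}

// Usage: mixed LF, CR and CRLF input.
$mixed = "one\ntwo\rthree\r\nfour";
echo json_encode(normalize_newlines($mixed)), PHP_EOL; // escapes stay visible
```

Because the whole replacement happens in one pass of the regex engine, no placeholder is needed.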

From the PCRE docs:

What \R matches

By default, the sequence \R in a pattern matches any Unicode newline sequence, whatever has been selected as the line ending sequence. If you specify

     --enable-bsr-anycrlf

the default is changed so that \R matches only CR, LF, or CRLF. Whatever is selected when PCRE is built can be overridden when the library functions are called.

Newline sequences

Outside a character class, by default, the escape sequence \R matches any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the following:

    (?>\r\n|\n|\x0b|\f|\r|\x85)

This is an example of an "atomic group", details of which are given below. This particular group matches either the two-character sequence CR followed by LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next line, U+0085). The two-character sequence is treated as a single unit that cannot be split.

In UTF-8 mode, two additional characters whose codepoints are greater than 255 are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029). Unicode character property support is not needed for these characters to be recognized.
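The effect of UTF-8 mode on \R can be checked with a short sketch. The sample string is made up, and `(*BSR_UNICODE)` is spelled out so the result does not depend on how the underlying PCRE library was built.

```php
<?php
// U+2028 LINE SEPARATOR between two letters (three bytes in UTF-8).
$text = "a\u{2028}b";

// Same pattern, once in byte mode and once with the u modifier.
$bytes = preg_replace('~(*BSR_UNICODE)\R~',  "\n", $text);
$utf8  = preg_replace('~(*BSR_UNICODE)\R~u', "\n", $text);

var_dump($bytes === $text);  // true: without u, LS is not seen as a newline
var_dump($utf8 === "a\nb");  // true: with u, LS is replaced
```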

It is possible to restrict \R to match only CR, LF, or CRLF (instead of the complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. (BSR is an abbreviation for "backslash R".) This can be made the default when PCRE is built; if this is the case, the other behaviour can be requested via the PCRE_BSR_UNICODE option. It is also possible to specify these settings by starting a pattern string with one of the following sequences:

    (*BSR_ANYCRLF)   CR, LF, or CRLF only
    (*BSR_UNICODE)   any Unicode newline sequence

These override the default and the options given to pcre_compile() or pcre_compile2(), but they can be overridden by options given to pcre_exec() or pcre_dfa_exec(). Note that these special settings, which are not Perl-compatible, are recognized only at the very start of a pattern, and that they must be in upper case. If more than one of them is present, the last one is used. They can be combined with a change of newline convention; for example, a pattern can start with:

    (*ANY)(*BSR_ANYCRLF)

They can also be combined with the (*UTF8) or (*UCP) special sequences. Inside a character class, \R is treated as an unrecognized escape sequence, and so matches the letter "R" by default, but causes an error if PCRE_EXTRA is set.
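The difference between the two BSR settings quoted above can be seen with a short sketch. The sample string is made up for illustration; it contains a vertical tab, a form feed and a CRLF pair.

```php
<?php
// Sample with VT (\x0b), FF (\f) and a CRLF pair between letters.
$text = "a\x0bb\fc\r\nd";

// Restrict \R to CR, LF and CRLF.
$anycrlf = preg_replace('~(*BSR_ANYCRLF)\R~', "\n", $text);

// Let \R match the full newline set (byte mode, so no LS/PS here).
$unicode = preg_replace('~(*BSR_UNICODE)\R~', "\n", $text);

var_dump($anycrlf === "a\x0bb\fc\nd"); // true: VT and FF are left alone
var_dump($unicode === "a\nb\nc\nd");   // true: VT and FF count as newlines
```

If the text may legitimately contain vertical tabs or form feeds, the `(*BSR_ANYCRLF)` form is the safer choice for unifying line endings.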
