RegExp: remove the last period from strings that can contain other periods (dig output)

2022-01-20 00:00:00 python regex find

Problem description

I am trying to parse the output of the Linux dig command and do several things in one shot with regular expressions.

Let's say I dig the host mail.yahoo.com:

/usr/bin/dig +nocomments +noquestion \
    +noauthority +noadditional +nostats +nocmd \
    mail.yahoo.com A

This command outputs:

mail.yahoo.com.                   0  IN  CNAME  login.yahoo.com.
login.yahoo.com.                  0  IN  CNAME  ats.login.lgg1.b.yahoo.com.
ats.login.lgg1.b.yahoo.com.       0  IN  CNAME  ats.member.g02.yahoodns.net.
ats.member.g02.yahoodns.net.      0  IN  CNAME  any-ats.member.a02.yahoodns.net.
any-ats.member.a02.yahoodns.net. 12  IN  A      98.139.21.169

What I'd like to do is find all the <host>, <record_type> and <resolved_name> parts, without the final period, using only one regular expression.

For this particular example with mail.yahoo.com, the result would be:

[
    ('mail.yahoo.com', 'CNAME', 'login.yahoo.com'),
    ('login.yahoo.com', 'CNAME', 'ats.login.lgg1.b.yahoo.com'),
    ('ats.login.lgg1.b.yahoo.com', 'CNAME', 'ats.member.g02.yahoodns.net'),
    ('ats.member.g02.yahoodns.net', 'CNAME', 'any-ats.member.a02.yahoodns.net'),
    ('any-ats.member.a02.yahoodns.net', 'A', '98.139.21.169'),
]

But it turns out that the dig command shows a period at the end of each name:

    mail.yahoo.com. 
        ^     ^   ^
        |     |   |
  Good dot    |   |
              |   |
        Good dot  |
                  |
           (!) Baaaad dot

Writing a regular expression that splits dig's output and returns the names with the final period is fairly straightforward:

regex = re.compile(r"^(\S+).+IN\s+([A-Z]+)\s+(\S+).*\s*$", re.MULTILINE)

But calling .findall with that regex returns the final period in each name, because \S+ matches the last period as well:

[
    ('mail.yahoo.com.', 'CNAME', 'login.yahoo.com.'),
    ('login.yahoo.com.', 'CNAME', 'ats.login.lgg1.b.yahoo.com.'),
    ('ats.login.lgg1.b.yahoo.com.', 'CNAME', 'ats.member.g02.yahoodns.net.'),
    ('ats.member.g02.yahoodns.net.', 'CNAME', 'any-ats.member.a02.yahoodns.net.'),
    ('any-ats.member.a02.yahoodns.net.', 'A', '98.139.21.169'),
]

So I'd need something that matches every non-space character (\S) except a period that is followed by whitespace.

I've made endless attempts, and I haven't been able to come up with a decent solution.

Thank you in advance!

PS: I know I can always use the "easy" regular expression and (on a second pass) remove the last dot from each string found, but I'm curious whether this can be done with a regular expression in one shot.
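For reference, the two-pass fallback mentioned in the PS could look roughly like the sketch below. It assumes the "easy" regex from the question with its backslashes restored; the names SAMPLE, SIMPLE, strip_root_dot and two_pass are illustrative and not part of the original post.

```python
import re

# Two lines of the dig output from the question, kept short for illustration.
SAMPLE = (
    "mail.yahoo.com.                   0  IN  CNAME  login.yahoo.com.\n"
    "any-ats.member.a02.yahoodns.net. 12  IN  A      98.139.21.169\n"
)

# First pass: the "easy" regex, which keeps the trailing period.
SIMPLE = re.compile(r"^(\S+).+IN\s+([A-Z]+)\s+(\S+).*\s*$", re.MULTILINE)


def strip_root_dot(name):
    """Drop one trailing period, if present (hypothetical helper)."""
    return name[:-1] if name.endswith(".") else name


# Second pass: clean up the host and the resolved name of every match.
two_pass = [
    (strip_root_dot(host), rtype, strip_root_dot(target))
    for host, rtype, target in SIMPLE.findall(SAMPLE)
]

for record in two_pass:
    print(record)
```

Only one trailing period is removed per name, so the dots inside an IP address or between labels are left alone.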


Solution

You can use this pattern with the multiline modifier:

^([^ ]+)(?<!\.)\.?[ ]+[0-9]+[ ]+IN[ ]+([^ ]+)[ ]+(.+(?<!\.))\.?$

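The negative lookbehind (?<!\.) forces each name group to end on a character other than a period, and the optional \.? that follows consumes the trailing period outside the group. The three parts are captured in groups $1, $2 and $3.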
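As a rough check in Python (the language used in the question), the pattern can be compiled with re.MULTILINE and run through findall against the sample dig output. The names DIG_OUTPUT, PATTERN and records are illustrative and do not come from the original answer.

```python
import re

# Sample dig output copied from the question.
DIG_OUTPUT = """\
mail.yahoo.com.                   0  IN  CNAME  login.yahoo.com.
login.yahoo.com.                  0  IN  CNAME  ats.login.lgg1.b.yahoo.com.
ats.login.lgg1.b.yahoo.com.       0  IN  CNAME  ats.member.g02.yahoodns.net.
ats.member.g02.yahoodns.net.      0  IN  CNAME  any-ats.member.a02.yahoodns.net.
any-ats.member.a02.yahoodns.net. 12  IN  A      98.139.21.169
"""

# (?<!\.) makes each name group end on a non-period character;
# the \.? right after it swallows the trailing period outside the group.
PATTERN = re.compile(
    r"^([^ ]+)(?<!\.)\.?[ ]+[0-9]+[ ]+IN[ ]+([^ ]+)[ ]+(.+(?<!\.))\.?$",
    re.MULTILINE,
)

# findall returns one (host, record_type, resolved_name) tuple per line.
records = PATTERN.findall(DIG_OUTPUT)
for record in records:
    print(record)
```

With this input, findall should return the five tuples listed in the question, with no trailing periods. The sketch assumes the lines carry no trailing whitespace, since the pattern anchors the optional period directly against the end of the line.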


If the columns may be separated by tabs as well as spaces, try this:

^([^ \t]+)(?<!\.)\.?[ \t]+[0-9]+[ \t]+IN[ \t]+([^ \t]+)[ \t]+(.+(?<!\.))\.?$
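The difference between the two patterns only shows up when the columns are separated by tabs. The sketch below assumes a hypothetical tab-separated rendering of two lines from the question's output; TAB_SEPARATED, SPACES_ONLY and SPACES_OR_TABS are illustrative names.

```python
import re

# Hypothetical tab-separated variant of two lines from the question's output.
TAB_SEPARATED = (
    "mail.yahoo.com.\t0\tIN\tCNAME\tlogin.yahoo.com.\n"
    "any-ats.member.a02.yahoodns.net.\t12\tIN\tA\t98.139.21.169\n"
)

# First pattern: only a literal space counts as a column separator.
SPACES_ONLY = re.compile(
    r"^([^ ]+)(?<!\.)\.?[ ]+[0-9]+[ ]+IN[ ]+([^ ]+)[ ]+(.+(?<!\.))\.?$",
    re.MULTILINE,
)

# Second pattern: spaces and tabs are both accepted between columns.
SPACES_OR_TABS = re.compile(
    r"^([^ \t]+)(?<!\.)\.?[ \t]+[0-9]+[ \t]+IN[ \t]+([^ \t]+)[ \t]+(.+(?<!\.))\.?$",
    re.MULTILINE,
)

space_matches = SPACES_ONLY.findall(TAB_SEPARATED)
tab_matches = SPACES_OR_TABS.findall(TAB_SEPARATED)

print(space_matches)  # nothing matches: the input contains no spaces
print(tab_matches)
```

The space-only pattern needs at least one literal space per line, so it finds nothing here, while the tab-aware pattern still yields the stripped names.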
