Problem with the following snippet using the boundary matcher regex (\b)

2022-01-17 00:00:00 regex set java

My input:

 1. end 
 2. end of the day or end of the week 
 3. endline
 4. something 
 5. "something" end

Based on the above discussions, if I try to replace a single string using this snippet, it successfully removes the appropriate words from the line:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

public class DeleteTest {

    public static void main(String[] args) {
        try {
            File file = new File("C:/Java samples/myfile.txt");
            File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
            String delete = "end";
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
            PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));

            for (String line; (line = reader.readLine()) != null;) {
                // "\\b" in a Java string literal is the regex word boundary \b
                line = line.replaceAll("\\b" + delete + "\\b", "");
                writer.println(line);
            }
            reader.close();
            writer.close();
        } catch (Exception e) {
            System.out.println("Something went Wrong");
        }
    }
}

My output when I use the above snippet (which is also my expected output):

 1.  
 2. of the day or of the week
 3. endline
 4. something
 5. "something"

But when I want to delete more words and use a Set for that purpose, I use the code snippet below:

public static void main(String[] args) {
    try {
        File file = new File("C:/Java samples/myfile.txt");
        File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));

        Set<String> toDelete = new HashSet<>();
        toDelete.add("end");
        toDelete.add("something");

        for (String line; (line = reader.readLine()) != null;) {
            line = line.replaceAll("\\b" + toDelete + "\\b", "");
            writer.println(line);
        }
        reader.close();
        writer.close();
    } catch (Exception e) {
        System.out.println("Something went Wrong");
    }
}

I get this output instead (it just removes the spaces):

 1. end
 2. endofthedayorendoftheweek
 3. endline
 4. something
 5. "something" end 

Can you help me with this?


Recommended answer

You need to create an alternation group out of the set with

String.join("|", toDelete)

and use it as

line = line.replaceAll("\\b(?:" + String.join("|", toDelete) + ")\\b", "");

The pattern will look like

\b(?:end|something)\b

Here, (?:...) is a non-capturing group that groups several alternatives without creating a memory buffer for the capture (you do not need one, since you remove the matches).
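To see why concatenating the Set directly goes wrong, it may help to print the pattern it produces. The following is a minimal sketch, not the asker's actual program: the class name `AlternationDemo` and the helper names `brokenPattern` and `fixedPattern` are hypothetical, and a `LinkedHashSet` is assumed only so that the element order is predictable. Concatenating a Set calls its `toString()`, which yields `[end, something]`, and the regex engine reads that as a character class matching one character.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class AlternationDemo {

    // Concatenating the Set uses its toString(), e.g. "[end, something]",
    // which the regex engine treats as a character class.
    static String brokenPattern(Set<String> words) {
        return "\\b" + words + "\\b";
    }

    // Joining the words with "|" inside a non-capturing group gives an alternation.
    static String fixedPattern(Set<String> words) {
        return "\\b(?:" + String.join("|", words) + ")\\b";
    }

    public static void main(String[] args) {
        // LinkedHashSet keeps insertion order (an assumption made for readable output).
        Set<String> toDelete = new LinkedHashSet<>();
        toDelete.add("end");
        toDelete.add("something");

        String line = "end of the day or end of the week";

        System.out.println(brokenPattern(toDelete));
        System.out.println(line.replaceAll(brokenPattern(toDelete), ""));

        System.out.println(fixedPattern(toDelete));
        System.out.println(line.replaceAll(fixedPattern(toDelete), ""));
    }
}
```

With the character class, the only single characters that have a word boundary on both sides in this line are the spaces between words, which is consistent with the output reported in the question.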

Or, better, compile the regex before entering the loop:

Pattern pat = Pattern.compile("\\b(?:" + String.join("|", toDelete) + ")\\b");
...
    line = pat.matcher(line).replaceAll("");
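As a rough, self-contained illustration of the precompiled variant, the sketch below applies one `Pattern` to the five sample lines from the question, held in memory instead of read from a file. The class name `PrecompiledDemo` and the helper `deleteWords` are hypothetical names introduced here, and the in-memory list stands in for the file loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class PrecompiledDemo {

    // Compiles the alternation once, then reuses it for every line.
    static List<String> deleteWords(List<String> lines, Set<String> words) {
        Pattern pat = Pattern.compile("\\b(?:" + String.join("|", words) + ")\\b");
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            result.add(pat.matcher(line).replaceAll(""));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> toDelete = new LinkedHashSet<>(Arrays.asList("end", "something"));
        List<String> lines = Arrays.asList(
                "end",
                "end of the day or end of the week",
                "endline",
                "something",
                "\"something\" end");

        // Brackets make leftover whitespace visible in the output.
        for (String cleaned : deleteWords(lines, toDelete)) {
            System.out.println("[" + cleaned + "]");
        }
    }
}
```

Note that the surrounding spaces are left in place, so removing a word in the middle of a line leaves two consecutive spaces behind.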

Update:

To match whole "words" that may contain special characters, you need to apply Pattern.quote to those words to escape the special characters, and then use unambiguous word boundaries: the negative lookbehind (?<!\w) instead of the initial \b, to make sure there is no word character before the match, and the negative lookahead (?!\w) instead of the final \b, to make sure there is no word character after it.

In Java 8, you may use this code:

Set<String> nToDel = toDelete.stream()
    .map(Pattern::quote)
    .collect(Collectors.toCollection(HashSet::new));
String pattern = "(?<!\\w)(?:" + String.join("|", nToDel) + ")(?!\\w)";

The regex will look like (?<!\w)(?:\Q+end\E|\Qsomething-\E)(?!\w) (here for the words +end and something-). Note that the symbols between \Q and \E are parsed as literal symbols.
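The quoting and lookaround approach can be tried out with a small sketch like the one below. It assumes the words to delete are `+end` and `something-`, as the pattern above suggests. The class name `QuotedWordsDemo`, the helper `buildPattern`, and the sample line are hypothetical, and `Collectors.joining` is used here as a shortcut in place of collecting into a second set and calling `String.join`.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class QuotedWordsDemo {

    // Quotes every word so regex metacharacters such as "+" are taken literally,
    // then wraps the alternation in lookarounds that forbid adjacent word characters.
    static Pattern buildPattern(Set<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?<!\\w)(?:" + alternation + ")(?!\\w)");
    }

    public static void main(String[] args) {
        // LinkedHashSet keeps insertion order (an assumption made for readable output).
        Set<String> toDelete = new LinkedHashSet<>(Arrays.asList("+end", "something-"));
        Pattern pat = buildPattern(toDelete);

        System.out.println(pat.pattern());

        // "+endline" stays untouched because a word character follows "+end".
        String line = "a +end b something- c +endline";
        System.out.println(pat.matcher(line).replaceAll(""));
    }
}
```

A plain \b would not work for these words: between a space and `+` there is no word boundary, because neither character is a word character, so `\b\+end\b` could never match ` +end `.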
