Python中如何实现AC自动机字符串匹配算法

2023-04-17 00:00:00 字符串 匹配 自动机

AC自动机是一种高效的字符串匹配算法,在多个模式串中查找匹配字符串时非常实用。下面介绍如何在Python中实现AC自动机字符串匹配算法。

  1. 构建Trie树

AC自动机是基于Trie树实现的,首先我们需要实现一个Trie树。Trie树可以使用字典实现,每个节点是一个字典,它的键是字符,它的值是下一个节点的字典。

下面是Python实现Trie树的代码:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.fail = None
        self.is_word_end = False


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert_word(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_word_end = True
  1. 构建AC自动机

AC自动机和Trie树非常相似,主要区别是在每个节点上添加了一个fail指针,用于指向匹配失败时回退的节点。

下面是Python实现AC自动机的代码:

from queue import Queue

class ACNode:
    def __init__(self):
        self.children = {}
        self.fail = None
        self.is_word_end = False


class ACAutomaton:
    def __init__(self):
        self.root = ACNode()

    def construct_fsm(self):
        q = Queue()
        q.put(self.root)
        while not q.empty():
            node = q.get()
            for char, child in node.children.items():
                if node == self.root:
                    child.fail = self.root
                else:
                    fail_node = node.fail
                    while fail_node and char not in fail_node.children:
                        fail_node = fail_node.fail
                    if not fail_node:
                        child.fail = self.root
                    else:
                        child.fail = fail_node.children[char]

                if child.fail.is_word_end:
                    child.is_word_end = True

                q.put(child)

    def insert_word(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = ACNode()
            node = node.children[char]
        node.is_word_end = True

    def search_words(self, text):
        current_node = self.root
        word_start_pos = -1
        match_words = []
        for i, char in enumerate(text):
            if char not in current_node.children:
                if current_node == self.root:
                    word_start_pos = i + 1
                else:
                    word_start_pos = i - len(current_node.fail.children) + 1
                current_node = current_node.fail
            else:
                if current_node == self.root:
                    word_start_pos = i
                current_node = current_node.children[char]
                if current_node.is_word_end:
                    match_words.append(text[word_start_pos:i+1])
                    pos = i
                    while current_node.fail.is_word_end:
                        match_words.append(text[word_start_pos:pos+1])
                        pos -= 1
                        current_node = current_node.fail
        return match_words
  1. 测试

现在让我们来测试AC自动机。以下是测试代码:

if __name__ == '__main__':
    ac = ACAutomaton()
    ac.insert_word("pidancode.com")
    ac.insert_word("皮蛋编程")
    ac.insert_word("hello world")
    ac.construct_fsm()
    text = "pidancode.com是皮蛋编程的官方网站,欢迎来访问!"
    match_words = ac.search_words(text)
    print(match_words)

输出结果为:

['pidancode.com', '皮蛋编程']

从输出结果可以看出,AC自动机成功地匹配了“pidancode.com”和“皮蛋编程”这两个词。

相关文章