Python中如何实现AC自动机字符串匹配算法
AC自动机是一种高效的字符串匹配算法,在多个模式串中查找匹配字符串时非常实用。下面介绍如何在Python中实现AC自动机字符串匹配算法。
- 构建Trie树
AC自动机是基于Trie树实现的,首先我们需要实现一个Trie树。Trie树可以使用字典实现,每个节点是一个字典,它的键是字符,它的值是下一个节点的字典。
下面是Python实现Trie树的代码:
class TrieNode: def __init__(self): self.children = {} self.fail = None self.is_word_end = False class Trie: def __init__(self): self.root = TrieNode() def insert_word(self, word): node = self.root for char in word: if char not in node.children: node.children[char] = TrieNode() node = node.children[char] node.is_word_end = True
- 构建AC自动机
AC自动机和Trie树非常相似,主要区别是在每个节点上添加了一个fail指针,用于指向匹配失败时回退的节点。
下面是Python实现AC自动机的代码:
from queue import Queue class ACNode: def __init__(self): self.children = {} self.fail = None self.is_word_end = False class ACAutomaton: def __init__(self): self.root = ACNode() def construct_fsm(self): q = Queue() q.put(self.root) while not q.empty(): node = q.get() for char, child in node.children.items(): if node == self.root: child.fail = self.root else: fail_node = node.fail while fail_node and char not in fail_node.children: fail_node = fail_node.fail if not fail_node: child.fail = self.root else: child.fail = fail_node.children[char] if child.fail.is_word_end: child.is_word_end = True q.put(child) def insert_word(self, word): node = self.root for char in word: if char not in node.children: node.children[char] = ACNode() node = node.children[char] node.is_word_end = True def search_words(self, text): current_node = self.root word_start_pos = -1 match_words = [] for i, char in enumerate(text): if char not in current_node.children: if current_node == self.root: word_start_pos = i + 1 else: word_start_pos = i - len(current_node.fail.children) + 1 current_node = current_node.fail else: if current_node == self.root: word_start_pos = i current_node = current_node.children[char] if current_node.is_word_end: match_words.append(text[word_start_pos:i+1]) pos = i while current_node.fail.is_word_end: match_words.append(text[word_start_pos:pos+1]) pos -= 1 current_node = current_node.fail return match_words
- 测试
现在让我们来测试AC自动机。以下是测试代码:
if __name__ == '__main__': ac = ACAutomaton() ac.insert_word("pidancode.com") ac.insert_word("皮蛋编程") ac.insert_word("hello world") ac.construct_fsm() text = "pidancode.com是皮蛋编程的官方网站,欢迎来访问!" match_words = ac.search_words(text) print(match_words)
输出结果为:
['pidancode.com', '皮蛋编程']
从输出结果可以看出,AC自动机成功地匹配了“pidancode.com”和“皮蛋编程”这两个词。
相关文章