How do you implement association rule mining with a heap in Python?

2023-04-11 00:00:00 Algorithms, association, how-to

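Association rule mining looks for interesting relationships between items in large datasets. It is commonly used for market-basket analysis in marketing and recommender systems. In Python, a heap can be used to rank the frequent items and the resulting rules, as follows: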

  1. Import the heapq module, which provides the heap implementation.
import heapq
  2. Define a dataset. Each element is a transaction (a set of items), and each item in a transaction is a product.
dataset = [
    {'pidancode.com', '皮蛋编程', 'Python教程'},
    {'pidancode.com', 'Python教程', '深度学习', '机器学习'},
    {'皮蛋编程', 'Python教程', '机器学习'},
    {'pidancode.com', '皮蛋编程', '机器学习'},
    {'pidancode.com', 'Python教程', '机器学习'},
    {'皮蛋编程', '深度学习', '机器学习'},
    {'pidancode.com', 'Python教程'}
]
  3. Compute the support of each item, i.e. the fraction of all transactions that contain it.
total_transactions = len(dataset)
min_support = 2  # minimum support count: an item must appear in at least two transactions
counts = {}
for transaction in dataset:
    for item in transaction:
        counts[item] = counts.get(item, 0) + 1
# Keep only the frequent items; support is stored as a fraction of all transactions.
supports = {item: count / total_transactions for item, count in counts.items() if count >= min_support}
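The item-counting step can also be written with `collections.Counter` from the standard library. The following is a minimal sketch on hypothetical transactions (the names `transactions`, `min_count` and `frequent` are illustrative and not part of the article's code):

```python
from collections import Counter

# Hypothetical transactions, for illustration only.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
min_count = 2  # assumed threshold: an item must appear in at least 2 transactions

# Count in how many transactions each item appears.
item_counts = Counter(item for t in transactions for item in t)

# Keep frequent items and express their support as a fraction.
frequent = {
    item: count / len(transactions)
    for item, count in item_counts.items()
    if count >= min_count
}
```

Because each transaction is a set, an item is counted at most once per transaction, which matches the definition of support used here.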
  4. Put the items that meet the minimum support into a heap ordered by support.
support_heap = [(-support, item) for item, support in supports.items()]
heapq.heapify(support_heap)
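The supports are negated because `heapq` only provides a min-heap: the smallest tuple is popped first, so storing `-support` makes the largest support come out first. The sketch below shows the idea on hypothetical values, together with `heapq.nlargest` as an alternative when only the top few entries are needed (`example_supports` is an illustrative name, not part of the article's code):

```python
import heapq

# Hypothetical support values, for illustration only.
example_supports = {"bread": 0.6, "milk": 0.8, "eggs": 0.4}

# Store negated supports so the min-heap pops the largest support first.
heap = [(-support, item) for item, support in example_supports.items()]
heapq.heapify(heap)

ranked = []
while heap:
    neg_support, item = heapq.heappop(heap)
    ranked.append((item, -neg_support))

# heapq.nlargest returns the top-k entries directly, with no negation needed.
top_two = heapq.nlargest(2, example_supports.items(), key=lambda kv: kv[1])
```

When two items have the same support, the tuples are compared on their second element, so ties are broken by the item name.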
  5. Count how often each pair of frequent items occurs together, derive the rules lhs => rhs in both directions, and put the rules that meet the minimum confidence into a heap ordered by confidence. For simplicity, only rules between two single items are considered.
from itertools import combinations

confidence_min = 0.5  # minimum confidence: keep rules with a confidence of at least 50%
pair_counts = {}
for transaction in dataset:
    # Sort so that each pair is always counted under the same key.
    items = sorted(item for item in transaction if item in supports)
    for pair in combinations(items, 2):
        pair_counts[pair] = pair_counts.get(pair, 0) + 1
rules_heap = []
for (a, b), pair_count in pair_counts.items():
    if pair_count < min_support:
        continue  # the pair itself is not frequent
    for lhs, rhs in ((a, b), (b, a)):
        # confidence(lhs => rhs) = count(lhs and rhs) / count(lhs)
        confidence = pair_count / counts[lhs]
        if confidence >= confidence_min:
            rules_heap.append((-confidence, (lhs, rhs)))
heapq.heapify(rules_heap)
  6. Pop the items in order of decreasing support and the rules in order of decreasing confidence, and print them.
print("Frequent items:")
while support_heap:
    neg_support, item = heapq.heappop(support_heap)
    print(f"{item}: {-neg_support:.2%}")
print("\nAssociation rules:")
while rules_heap:
    neg_confidence, (lhs, rhs) = heapq.heappop(rules_heap)
    print(f"{lhs} => {rhs}: {-neg_confidence:.2%}")

With the dataset above, the code prints:

Frequent items:
Python教程: 71.43%
pidancode.com: 71.43%
机器学习: 71.43%
皮蛋编程: 57.14%
深度学习: 28.57%

Association rules:
深度学习 => 机器学习: 100.00%
Python教程 => pidancode.com: 80.00%
pidancode.com => Python教程: 80.00%
皮蛋编程 => 机器学习: 75.00%
Python教程 => 机器学习: 60.00%
pidancode.com => 机器学习: 60.00%
机器学习 => Python教程: 60.00%
机器学习 => pidancode.com: 60.00%
机器学习 => 皮蛋编程: 60.00%
皮蛋编程 => Python教程: 50.00%
皮蛋编程 => pidancode.com: 50.00%
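As a hand check of the definitions, the confidence of a rule lhs => rhs can be computed directly as count(lhs and rhs) / count(lhs). The sketch below does this for two rules on the same dataset. The helper names `support_count` and `confidence` are assumptions made for this illustration and do not appear in the code above.

```python
dataset = [
    {'pidancode.com', '皮蛋编程', 'Python教程'},
    {'pidancode.com', 'Python教程', '深度学习', '机器学习'},
    {'皮蛋编程', 'Python教程', '机器学习'},
    {'pidancode.com', '皮蛋编程', '机器学习'},
    {'pidancode.com', 'Python教程', '机器学习'},
    {'皮蛋编程', '深度学习', '机器学习'},
    {'pidancode.com', 'Python教程'}
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs, transactions):
    """count(lhs and rhs) / count(lhs); assumes lhs occurs at least once."""
    both = support_count(set(lhs) | set(rhs), transactions)
    return both / support_count(lhs, transactions)

# '深度学习' appears in 2 transactions, both of which also contain '机器学习'.
conf_dl_ml = confidence({'深度学习'}, {'机器学习'}, dataset)          # 2 / 2
# 'Python教程' appears in 5 transactions, 4 of which contain 'pidancode.com'.
conf_py_site = confidence({'Python教程'}, {'pidancode.com'}, dataset)  # 4 / 5
```

Scanning every transaction for each rule is fine for a dataset this small. For larger data, counting the itemsets once and reusing the counts is the cheaper option.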
