Subtraction on a list of sets
Problem description
Given a list of sets:
allsets = [set([1, 2, 4]), set([4, 5, 6]), set([4, 5, 7])]
What is a pythonic way to compute the corresponding list of sets of elements having no overlap with other sets?
only = [set([1, 2]), set([6]), set([7])]
Is there a way to do this with a list comprehension?
Solution
To avoid quadratic runtime, you'd want to make an initial pass to figure out which elements appear in more than one set:
import itertools
import collections
element_counts = collections.Counter(itertools.chain.from_iterable(allsets))
Then you can simply make a list of sets retaining all elements that only appear once:
nondupes = [{elem for elem in original if element_counts[elem] == 1}
for original in allsets]
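As a quick sanity check (a minimal sketch, not part of the original answer), running the two snippets above on the sample allsets reproduces the only list from the question:

import collections
import itertools

allsets = [set([1, 2, 4]), set([4, 5, 6]), set([4, 5, 7])]
element_counts = collections.Counter(itertools.chain.from_iterable(allsets))
nondupes = [{elem for elem in original if element_counts[elem] == 1}
            for original in allsets]
print(nondupes)  # [{1, 2}, {6}, {7}] on Python 3, matching the expected result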
Alternatively, instead of constructing nondupes from element_counts directly, we can make an additional pass to construct a set of all elements that appear in exactly one input. This requires an additional statement, but it allows us to take advantage of the & operator for set intersection to make the list comprehension shorter and more efficient:
element_counts = collections.Counter(itertools.chain.from_iterable(allsets))
all_uniques = {elem for elem, count in element_counts.items() if count == 1}
# ^ viewitems() in Python 2.7
nondupes = [original & all_uniques for original in allsets]
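As before, a quick check on the sample data (not part of the original answer) gives the same result:

# For the sample allsets: all_uniques == {1, 2, 6, 7}
print(nondupes)  # [{1, 2}, {6}, {7}] on Python 3, the same as the expected output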
Timing seems to indicate that using an all_uniques set produces a substantial speedup for the overall duplicate-elimination process: up to about a 3.5x speedup on Python 3 for heavily duplicated input sets, though only about a 30% speedup on Python 2, where more of the runtime is dominated by constructing the Counter. This speedup is fairly substantial, though not nearly as important as avoiding quadratic runtime by using element_counts in the first place. If you're on Python 2 and this code is speed-critical, you'd want to use an ordinary dict or a collections.defaultdict instead of a Counter.
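For example, here is a minimal sketch of that substitution, using a plain defaultdict for the counting pass instead of a Counter (the rest of the code is unchanged):

import collections
import itertools

element_counts = collections.defaultdict(int)
for elem in itertools.chain.from_iterable(allsets):
    element_counts[elem] += 1
# element_counts now maps each element to the number of input sets containing it,
# just like the Counter version, but without Counter's construction overhead on Python 2.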
Another way would be to construct a dupes set from element_counts and use original - dupes instead of original & all_uniques in the list comprehension, as suggested by munk. Whether this performs better or worse than using an all_uniques set and & would depend on the degree of duplication in your input and what Python version you're on, but it doesn't seem to make much of a difference either way.
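A short sketch of that variant, for completeness (it reuses element_counts from above and is not benchmarked here):

dupes = {elem for elem, count in element_counts.items() if count > 1}
nondupes = [original - dupes for original in allsets]
# For the sample allsets: dupes == {4, 5}, so nondupes == [{1, 2}, {6}, {7}]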