Why is converting a list to a set faster than using just the list to compute a list difference?

2022-01-17 00:00:00 python python-2.7 list performance set

Problem description

Say, I wish to compute the difference of two lists C = A - B:

A = [1,2,3,4,5,6,7,8,9] 
B = [1,3,5,8,9]
C = [2,4,6,7]          #Result

A and B are both sorted lists of unique integers (not sure if there is a way to tell Python about this property of the lists). I need to preserve the order of the elements. AFAIK there are two possible ways of doing it:

Method 1: Convert B to a set, then use a list comprehension to generate C:

s = set(B)
C = [x for x in A if x not in s]

Method 2: Use a list comprehension directly:

C = [x for x in A if x not in B]

Why is #1 more efficient than #2? Isn't there an overhead to convert to a set? What am I missing here?

Some performance benchmarks are given in this answer.

UPDATE: I'm aware that a set's average O(1) lookup time beats that of a list's O(n) but if the original list A contains about a million or so integers, wouldn't the set creation actually take longer?


Solution

There is overhead to convert a list to a set, but a set is substantially faster than a list for those membership (in) tests.

You can check almost instantly whether item x is in set y because a hash table is used underneath. No matter how large your set is, the average lookup time is the same (basically instantaneous) - this is known in Big-O notation as O(1). For a list, you have to check the elements one by one to see if item x is in list z. As your list grows, the check will take longer - this is O(n), meaning the time the operation takes grows in proportion to the length of the list.
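
As a rough illustration of the O(1) versus O(n) lookup described here, the following sketch times a single membership test against a list and against a set holding the same integers. The size n, the names data_list and data_set, and the probe value are assumptions chosen for the example; absolute timings depend on the machine and Python version.

```python
import timeit

n = 100000
data_list = list(range(n))
data_set = set(data_list)

# A value that is not present forces the list to compare every element,
# while the set only has to inspect one hash bucket on average.
missing = -1

list_time = timeit.timeit(lambda: missing in data_list, number=100)
set_time = timeit.timeit(lambda: missing in data_set, number=100)

print("list lookup: {:.6f}s for 100 tests".format(list_time))
print("set lookup:  {:.6f}s for 100 tests".format(set_time))
```

Increasing n should make the list figure grow roughly in proportion, while the set figure stays about the same.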

That increased speed can offset the set creation overhead, which is how your set check ends up being faster.
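
To see the conversion overhead being paid back, here is a minimal sketch that times both methods from the question end to end, with the set creation counted inside the timed function. The function names diff_with_set and diff_with_list and the input sizes are made up for this example.

```python
import timeit

A = list(range(5000))
B = A[::2]  # every other element of A

def diff_with_set(a, b):
    s = set(b)  # one-time conversion, O(len(b))
    return [x for x in a if x not in s]

def diff_with_list(a, b):
    # every "x not in b" scans b from the start
    return [x for x in a if x not in b]

set_time = timeit.timeit(lambda: diff_with_set(A, B), number=3)
list_time = timeit.timeit(lambda: diff_with_list(A, B), number=3)

print("with set:  {:.4f}s (conversion included)".format(set_time))
print("list only: {:.4f}s".format(list_time))
```

Both functions return the same list in the same order; only the cost of each membership test differs. For very small inputs the gap may be negligible, since the conversion cost is then comparable to the scans it replaces.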

To answer that other question, Python has no way of determining that your list is sorted - not if you're using a standard list object, anyway. So it can't achieve O(log n) performance with a list comprehension. If you wanted to write your own binary search method which assumes the list is sorted, you can certainly do so, but O(1) beats O(log n) any day.
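
For completeness, a binary-search membership test along those lines could look like the sketch below. It uses bisect_left from the standard library's bisect module; the helper name contains_sorted is hypothetical, and the code assumes the caller guarantees that the list is sorted, since Python does not check this.

```python
from bisect import bisect_left

def contains_sorted(sorted_list, x):
    """Return True if x is in sorted_list, using binary search (O(log n)).

    Assumes sorted_list is sorted in ascending order; the result is
    unreliable otherwise.
    """
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x

A = [1, 2, 3, 4, 5, 6, 7, 8, 9]
B = [1, 3, 5, 8, 9]
C = [x for x in A if not contains_sorted(B, x)]
print(C)  # [2, 4, 6, 7]
```

This avoids building a set, but each test still costs O(log n) rather than the set's average O(1).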

Edit 2:

I'm aware that a set's average O(1) lookup time beats that of a list's O(n) but if the original list A contains about a million or so integers, wouldn't the set creation actually take longer?

No, not at all. Creating a set out of a list is an O(n) operation, as inserting an item into a set is O(1) on average and you're doing that n times. If you have a list with a million integers in it, converting it into a set and then testing against that set involves two O(n) steps, while repeatedly scanning the list is going to be n O(n) steps. As a rough estimate, the set approach is going to be about 250,000 times faster for a list with a million integers (about n^2/2 comparisons on average versus about 2n operations, a ratio of n/4), and the speed difference will grow larger and larger the more items you have in your list.
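
The estimate can be checked by counting idealised steps rather than measuring time. The sketch below assumes, as the answer does, that the list being scanned has the same length n as the list being iterated and that every item is found; the helper list_scan_comparisons and the "n inserts plus n lookups" cost model for the set are simplifications introduced here, not measurements.

```python
def list_scan_comparisons(a, b):
    """Count the element checks made by testing every x of a against list b."""
    total = 0
    for x in a:
        for y in b:
            total += 1
            if y == x:
                break  # a linear scan stops at the first match
    return total

n = 2000
A = list(range(n))
B = list(A)

list_ops = list_scan_comparisons(A, B)  # equals n * (n + 1) / 2
set_ops = 2 * n                         # n inserts plus n lookups, idealised
ratio = float(list_ops) / set_ops       # equals (n + 1) / 4

print("list steps: {}, set steps: {}, ratio: {}".format(list_ops, set_ops, ratio))
```

The ratio works out to (n + 1) / 4, which for n = 1,000,000 is about 250,000. Real timings will differ because a hash insert or lookup costs more than a single comparison.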
