Converting a NumPy array to a set takes too long
I'm trying to execute the following
from numpy import *
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]]) # large nparray
for item in x:
    set(item)
and it takes very long compared to:
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]]) # large nparray
for item in x:
    item.tolist()
Why does it take much longer to convert a NumPy array to a set than to a list? I mean, basically both have O(n) complexity?
TL;DR: The set() function creates a set using Python's iteration protocol. But iterating (on the Python level) over NumPy arrays is so slow that using tolist() to convert the array to a Python list before doing the iteration is (much) faster.
To understand why iterating over NumPy arrays is so slow, it's important to know how Python objects, Python lists, and NumPy arrays are stored in memory.
A Python object needs some bookkeeping properties (like the reference count, a link to its class, ...) plus the value it represents. For example, the integer ten = 10 could look like this:
The blue circle is the "name" you use in the Python interpreter for the variable ten, and the lower object (instance) is what actually represents the integer (since the bookkeeping properties aren't important here, I ignored them in the images).
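You can see the cost of that box directly (a quick sketch; the exact byte counts depend on your platform and Python version):

import sys
import numpy as np

ten = 10
sys.getsizeof(ten)          # 28 bytes on a typical 64-bit CPython: value plus bookkeeping
np.array([10]).itemsize     # 8 bytes: just the raw int64 value, no box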
A Python list is just a collection of Python objects; for example, mylist = [1, 2, 3] would be saved like this:
This time the list references the Python integers 1, 2 and 3, and the name mylist just references the list instance.
But an array myarray = np.array([1, 2, 3]) doesn't store Python objects as elements:
The values 1, 2 and 3 are stored directly in the NumPy array instance.
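You can observe this storage difference via object identity (a small sketch): the list hands back the same stored object on every access, while the array has no Python object to hand back and must create one each time.

import numpy as np

lst = [1000, 2000, 3000]
arr = np.array(lst)

lst[0] is lst[0]   # True  - the list returns the same stored object twice
arr[0] is arr[0]   # False - the array creates a fresh Python object per access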
With this information I can explain why iterating over an array is so much slower than iterating over a list:
Each time you access the next element in a list, the list just returns a stored object. That's very fast because the element already exists as a Python object (it just needs to increment the reference count by one).
On the other hand, when you want an element of an array, it needs to create a new Python "box" for the value, with all the bookkeeping stuff, before it is returned. When you iterate over the array, it needs to create one Python box for each element in your array:
Creating these boxes is slow, and it is the main reason why iterating over NumPy arrays is much slower than iterating over Python collections (lists/tuples/sets/dictionaries), which store the values together with their boxes:
import numpy as np
arr = np.arange(100000)
lst = list(range(100000))
def iterateover(obj):
    for item in obj:
        pass
%timeit iterateover(arr)
# 20.2 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit iterateover(lst)
# 3.96 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The set "constructor" just does an iteration over the object.
One thing I can't answer definitively is why the tolist method is so much faster. In the end, each value in the resulting Python list needs to be in a "Python box", so there's not much work that tolist could avoid. But one thing I know for sure is that list(array) is slower than array.tolist():
arr = np.arange(100000)
%timeit list(arr)
# 20 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr.tolist()
# 10.3 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Each of these has O(n) runtime complexity, but the constant factors are very different.
In your case you compared set() to tolist(), which isn't a particularly good comparison. It would make more sense to compare set(arr) to list(arr), or set(arr.tolist()) to arr.tolist():
arr = np.random.randint(0, 1000, (10000, 3))
def tosets(arr):
    for line in arr:
        set(line)

def tolists(arr):
    for line in arr:
        list(line)

def tolists_method(arr):
    for line in arr:
        line.tolist()

def tosets_intermediatelist(arr):
    for line in arr:
        set(line.tolist())
%timeit tosets(arr)
# 72.2 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists(arr)
# 80.5 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists_method(arr)
# 16.3 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit tosets_intermediatelist(arr)
# 38.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So if you want sets, you are better off using set(arr.tolist()). For bigger arrays it could make sense to use np.unique, but because your rows only contain 3 items that will likely be slower (for thousands of elements it could be much faster!).
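For illustration, a hedged sketch of the np.unique route (note that it returns a sorted array, not a set, so wrap it if you really need sets):

import numpy as np

arr = np.random.randint(0, 1000, (10000, 3))
uniques_per_row = [np.unique(row) for row in arr]   # works on the raw values, no Python boxes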
In the comments you asked about numba, and yes, it's true that numba could speed this up. Numba supports typed sets (only numeric types), but that doesn't mean it will always be faster.
I'm not sure how numba (re-)implements sets, but because they are typed it's likely they also avoid the "Python boxes" and store the values directly inside the set:
Sets are more complicated than lists because they involve hashes and empty slots (Python uses open addressing for sets, so I assume numba will too).
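To illustrate the idea of open addressing, here is a toy lookup with linear probing (CPython's real probe sequence is more elaborate):

def toy_contains(table, value):
    # table is a fixed-size list where None marks an empty slot
    i = hash(value) % len(table)
    while table[i] is not None:
        if table[i] == value:
            return True
        i = (i + 1) % len(table)   # probe the next slot
    return False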
Like the NumPy array, the numba set saves the values directly. So when you convert a NumPy array to a numba set (or vice versa) it won't need to use "Python boxes" at all, and when you create the sets in a numba nopython function it will be much faster even than the set(arr.tolist()) operation:
import numba as nb
@nb.njit
def tosets_numba(arr):
    for lineno in range(arr.shape[0]):
        set(arr[lineno])

tosets_numba(arr)  # warmup
%timeit tosets_numba(arr)
# 6.55 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That's roughly five times faster than the set(arr.tolist()) approach. But it's important to highlight that I did not return the sets from the function. When you return a set from a nopython numba function to Python, Numba creates a Python set, including "creating the boxes" for all values in the set (that's something numba is hiding).
Just FYI: the same boxing/unboxing happens if you pass lists to Numba nopython functions or return lists from these functions. So what's an O(1) operation in Python is an O(n) operation with Numba! That's why it's generally better to pass NumPy arrays to numba nopython functions (which is O(1)).
I assume that if you returned these sets from the function (not really possible right now because numba currently doesn't support lists of sets) it would be slower (because it creates a numba set and later converts it to a Python set) or only marginally faster (if the numba set -> Python set conversion is really, really fast).
Personally, I would use numba for sets only if I don't need to return them from the function, can do all operations on the set inside the function, and only if all of those operations are supported in nopython mode. In any other case I wouldn't use numba here.
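For example, a sketch that follows this advice, assuming the same numba set support the timing above relies on: build the sets inside the nopython function and return only plain numbers (here the count of distinct values per row), so no set ever crosses the boundary:

import numba as nb
import numpy as np

@nb.njit
def count_unique_per_row(arr):
    out = np.empty(arr.shape[0], dtype=np.int64)
    for i in range(arr.shape[0]):
        out[i] = len(set(arr[i]))   # the set lives and dies inside nopython code
    return out

counts = count_unique_per_row(np.random.randint(0, 1000, (10000, 3)))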
Just a note: from numpy import * should be avoided; you hide several Python built-in functions when you do that (sum, min, max, ...) and it puts a lot of stuff into your globals. It's better to use import numpy as np. The np. in front of function calls makes the code clearer and isn't much extra to type.
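A classic demonstration of that shadowing (the second argument is the start value for the built-in sum, but the axis for numpy's sum):

sum(range(5), -1)       # built-in sum: 0+1+2+3+4 + (-1) == 9

from numpy import *
sum(range(5), -1)       # now numpy's sum: -1 is interpreted as the axis == 10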