Removing duplicates from a list of NumPy arrays
Problem description
I have an ordinary Python list that contains (multidimensional) NumPy arrays, all of the same shape and with the same number of values. Some of the arrays in the list are duplicates of earlier ones.
I want to remove all the duplicates, but the fact that the elements are NumPy arrays complicates this a bit:
• I can't use set(), as NumPy arrays are not hashable.
• I can't check for duplicates during insertion, as the arrays are generated in batches by a function and added to the list with .extend().
• NumPy arrays aren't directly comparable without resorting to one of NumPy's own functions, so I can't just use something like "if x in list".
• The contents of the list need to remain NumPy arrays at the end of the process; I could compare copies of the arrays converted to nested lists, but I can't convert the arrays to plain Python lists permanently.
Any suggestions on how I can remove duplicates efficiently here?
Solution
Using the solutions from "Most efficient property to hash for numpy array", we see that if a is a NumPy array, the best thing to hash is its raw bytes, a.tostring() (spelled a.tobytes() in current NumPy, where tostring() is deprecated). So:
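import numpy as np

arraylist = [np.array([1, 2, 3, 4]), np.array([1, 2, 3, 4]), np.array([1, 3, 2, 4])]
# Key each array by its raw bytes; arrays with equal contents collapse to one key.
# tobytes() replaces tostring(), which was deprecated in NumPy 1.19.
L = {array.tobytes(): array for array in arraylist}
list(L.values())  # [array([1, 2, 3, 4]), array([1, 3, 2, 4])] on Python 3.7+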
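The dictionary approach keeps the last array seen for each key and, on its own, compares only raw bytes. If you want to keep the first occurrence of each array explicitly, and also guard against arrays whose bytes match but whose shape or dtype differ, something like the following may work. This is only a sketch: the helper name unique_arrays is made up for illustration, and including the shape and dtype in the key is an extra precaution that goes beyond what the question requires (there, all arrays share one shape).

```python
import numpy as np


def unique_arrays(arrays):
    """Drop duplicate arrays, keeping the first occurrence of each, in order.

    Hypothetical helper for illustration; not part of NumPy.
    """
    seen = set()
    result = []
    for a in arrays:
        # Raw bytes alone ignore shape and dtype, so include both in the key.
        key = (a.shape, a.dtype.str, a.tobytes())
        if key not in seen:
            seen.add(key)
            result.append(a)  # the original array object is kept, not a copy
    return result


arraylist = [np.array([1, 2, 3, 4]), np.array([1, 2, 3, 4]), np.array([1, 3, 2, 4])]
deduped = unique_arrays(arraylist)
```

The result still contains NumPy arrays, and the work is a single pass over the list. Note that the comparison is byte-wise, so for floating-point data 0.0 and -0.0 count as different arrays.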