Iterating efficiently and faster over 36 million items in a list of tuples in Python

2022-01-24 00:00:00 python numpy list performance iteration

Problem description

Firstly, before anyone marks this as a duplicate, please read on. I am unsure whether the delay in the iteration is due to the sheer size of the data or to my logic. I have a use case where I have to iterate over 36 million items in a list of tuples. My main requirements are speed and efficiency. Sample list:

[
    ('how are you', 'I am fine'),
    ('how are you', 'I am not fine'),
    ...36 million items...
]

What I have done so far:

for query_question in combined:
    query = "{}".format(word_tokenize(query_question[0]))
    question = "{}".format(word_tokenize(query_question[1]))
    # the function uses a naive doc2vec extension of GloVe word vectors
    vec1 = np.mean(
        [word_vector_dict[word] for word in literal_eval(query) if word in word_vector_dict],
        axis=0,
    )
    vec2 = np.mean(
        [word_vector_dict[word] for word in literal_eval(question) if word in word_vector_dict],
        axis=0,
    )
    similarity_score = 1 - distance.cosine(vec1, vec2)
    # list.append returns None, so its result must not be assigned back
    store_question_score.append((query_question[1], similarity_score))
    count += 1
    if count == len(data_list):
        # list.sort also returns None; sorted() returns the new list
        store_question_score_descending = sorted(
            store_question_score, key=itemgetter(1), reverse=True
        )
        result_dict[query_question[0]] = store_question_score_descending[:5]
        store_question_score = []
        count = 1

The logic above computes similarity scores between questions as part of a text similarity algorithm. I suspect the delay in the iteration comes from computing vec1 and vec2. If so, how can I do this better? I am looking for ways to speed up the process.

There are plenty of other questions about iterating over huge lists, but I could not find any that solved my problem.

I really appreciate any help you can provide.

Solution

Try caching:

from functools import lru_cache

@lru_cache(maxsize=None)
def compute_vector(s):
    return np.mean(
        [word_vector_dict[word] for word in literal_eval(s) if word in word_vector_dict],
        axis=0,
    )

Then use this instead:

vec1 = compute_vector(query)
vec2 = compute_vector(question)
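To see why the cache helps, here is a minimal, self-contained sketch. It assumes a toy `word_vector_dict` with made-up 3-dimensional vectors and uses plain whitespace splitting instead of `word_tokenize`/`literal_eval`; both are stand-ins for illustration, not the asker's real data or tokenizer. Because the same query string appears in many pairs, its vector is computed only once.

```python
from functools import lru_cache

import numpy as np

# Toy stand-in for the GloVe lookup table (hypothetical 3-dimensional vectors).
word_vector_dict = {
    "how": np.array([1.0, 0.0, 0.0]),
    "are": np.array([0.0, 1.0, 0.0]),
    "you": np.array([0.0, 0.0, 1.0]),
    "fine": np.array([1.0, 1.0, 0.0]),
}


@lru_cache(maxsize=None)
def compute_vector(sentence):
    # Assumption: whitespace splitting replaces word_tokenize/literal_eval here.
    vectors = [word_vector_dict[w] for w in sentence.split() if w in word_vector_dict]
    return np.mean(vectors, axis=0)


pairs = [
    ("how are you", "I am fine"),
    ("how are you", "I am not fine"),
    ("how are you", "fine"),
]

for query, question in pairs:
    vec1 = compute_vector(query)     # computed once, then served from the cache
    vec2 = compute_vector(question)  # each distinct question is computed once

info = compute_vector.cache_info()
print(info.hits, info.misses)
```

Note that `lru_cache` hands back the same array object on every hit, so the returned vectors should be treated as read-only; modifying one in place would silently change the cached value.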

If the size of the vectors is fixed, you can do even better by caching into a NumPy array of shape (num_unique_keys, len(vec1)), where in your case num_unique_keys = 370000 + 100:

class VectorCache:
    def __init__(self, func, num_keys, item_size):
        self.func = func
        # one preallocated row per unique key
        self.cache = np.empty((num_keys, item_size), dtype=float)
        # maps each key to its row index in self.cache
        self.keys = {}

    def __getitem__(self, key):
        if key in self.keys:
            return self.cache[self.keys[key]]
        self.keys[key] = len(self.keys)
        item = self.func(key)
        self.cache[self.keys[key]] = item
        return item

def compute_vector(s):
    return np.mean(
        [word_vector_dict[word] for word in literal_eval(s) if word in word_vector_dict],
        axis=0,
    )

vector_cache = VectorCache(compute_vector, num_keys, item_size)

Then:

vec1 = vector_cache[query]
vec2 = vector_cache[question]
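The array-backed cache can be tried out in isolation with the sketch below. It is self-contained under stated assumptions: the `word_vector_dict` contents, the whitespace tokenization, and the `calls` list used to count real computations are all hypothetical additions for the demonstration, and `num_keys=3`, `item_size=3` are toy sizes rather than the 370000 + 100 rows mentioned above.

```python
import numpy as np


class VectorCache:
    """Caches fixed-size vectors in one preallocated 2-D array."""

    def __init__(self, func, num_keys, item_size):
        self.func = func
        self.cache = np.empty((num_keys, item_size), dtype=float)
        self.keys = {}  # key -> row index in self.cache

    def __getitem__(self, key):
        if key in self.keys:
            return self.cache[self.keys[key]]
        self.keys[key] = len(self.keys)
        item = self.func(key)
        self.cache[self.keys[key]] = item
        return item


# Toy stand-in for the GloVe lookup table (hypothetical values).
word_vector_dict = {
    "how": np.array([1.0, 0.0, 0.0]),
    "are": np.array([0.0, 1.0, 0.0]),
    "you": np.array([0.0, 0.0, 1.0]),
    "fine": np.array([1.0, 1.0, 0.0]),
}

calls = []  # records every real computation, for demonstration only


def compute_vector(sentence):
    calls.append(sentence)
    # Assumption: whitespace splitting replaces word_tokenize/literal_eval here.
    vectors = [word_vector_dict[w] for w in sentence.split() if w in word_vector_dict]
    return np.mean(vectors, axis=0)


vector_cache = VectorCache(compute_vector, num_keys=3, item_size=3)

a = vector_cache["how are you"]  # miss: computed and stored in row 0
b = vector_cache["how are you"]  # hit: read back from row 0
c = vector_cache["fine"]         # miss: computed and stored in row 1

print(len(calls))
```

One thing to keep in mind with this design: `num_keys` must be at least the number of distinct strings, because the array is never resized and a lookup beyond the last row raises an `IndexError`.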

Using a similar technique, you can also cache the cosine distances:

@lru_cache(maxsize=None)
def cosine_distance(query, question):
    return distance.cosine(vector_cache[query], vector_cache[question])
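The following sketch shows how such a distance cache behaves. To stay self-contained it makes two assumptions: a plain dictionary of made-up 2-dimensional vectors stands in for `vector_cache`, and a small NumPy helper named `cosine` (hypothetical) stands in for `scipy.spatial.distance.cosine`, using the same definition, one minus the normalized dot product.

```python
from functools import lru_cache

import numpy as np

# Hypothetical precomputed sentence vectors, standing in for vector_cache.
vectors = {
    "how are you": np.array([1.0, 0.0]),
    "I am fine": np.array([0.0, 1.0]),
    "how do you do": np.array([2.0, 0.0]),
}


def cosine(u, v):
    # Cosine distance: 1 - (u . v) / (|u| * |v|)
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


@lru_cache(maxsize=None)
def cosine_distance(query, question):
    return cosine(vectors[query], vectors[question])


d1 = cosine_distance("how are you", "I am fine")      # orthogonal vectors
d2 = cosine_distance("how are you", "how do you do")  # parallel vectors
d3 = cosine_distance("how are you", "I am fine")      # same ordered pair: cache hit
d4 = cosine_distance("I am fine", "how are you")      # swapped order: new cache entry

info = cosine_distance.cache_info()
print(d1, d2, info.hits, info.misses)
```

The cache is keyed on the ordered pair of strings, so it only pays off when exactly the same (query, question) pair occurs more than once. If every pair in the list is distinct, most of the saving is likely to come from caching the vectors rather than the distances.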
