How do I filter a set of (int, str) tuples to return only the tuples with the minimum value in the first element?

2022-01-20 00:00:00 python min set tuples filter

Question

Suppose I have a set of tuples representing URLs with "scores":

s = {(0.75, 'http://www.foo.com'), (0.33, 'http://www.bar.com'), (0.5, 'http://www.foo.com'), (0.66, 'http://www.bar.com')}

What is a concise way to filter out duplicate URLs, keeping only the lowest score for each URL? That is, from the example set above, I want to get the following set, where each URL appears only once, paired with its lowest score from the original set:

{(0.5, 'http://www.foo.com'),(0.33, 'http://www.bar.com')}

I came up with the following solution:

from collections import defaultdict

seen = defaultdict(lambda:1)
for score, url in s:
    if score < seen[url]:
        seen[url] = score

filtered = {(v,k) for k,v in seen.items()}

... but I feel like there is probably a simpler and more efficient way to do this without using an intermediary dict to keep track of the lowest score per URL and then regenerating the set from it. What is the best way to filter a set of tuples by the min/max of the first element?


Answer

You've already implemented the simplest approach I can think of. The only change I'd make is to the loop: a slightly more concise version uses min.

from collections import defaultdict

seen = defaultdict(lambda: 1)  # use `lambda: float('inf')` if scores can be > 1
for score, url in s:
    seen[url] = min(seen[url], score)

{(v,k) for k,v in seen.items()}
# {(0.33, 'http://www.bar.com'), (0.5, 'http://www.foo.com')}
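
For reuse, the same loop can be wrapped in a function. The following is only a sketch: the name `lowest_score_per_url` is hypothetical, and it swaps `defaultdict` for a plain dict with `dict.get` and a `float("inf")` fallback, so it makes no assumption that scores stay below 1.

```python
def lowest_score_per_url(pairs):
    """Return a set of (score, url) tuples, keeping the lowest score per URL."""
    best = {}
    for score, url in pairs:
        # dict.get falls back to +infinity for a URL not seen yet,
        # so the first score encountered for a URL is always stored.
        best[url] = min(best.get(url, float("inf")), score)
    # Swap back to the original (score, url) layout.
    return {(score, url) for url, score in best.items()}


s = {
    (0.75, 'http://www.foo.com'),
    (0.33, 'http://www.bar.com'),
    (0.5, 'http://www.foo.com'),
    (0.66, 'http://www.bar.com'),
}
result = lowest_score_per_url(s)
print(result)
```

This is a single pass over the input with no sorting, which is the main reason to prefer it over the sort-based variants.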


If you really want a shorter solution, there is a one-liner, although as I said it isn't the simplest approach. Most of the work is swapping the URL and the score so that the URL can serve as the dict key when dropping duplicates. Sorting is a precondition here, since the dict keeps the last value it sees for each key (that's why I don't like this solution as much as the one above).

{(v, k) for k, v in dict(sorted(((v, k) for k, v in s), reverse=True)).items()}
# {(0.33, 'http://www.bar.com'), (0.5, 'http://www.foo.com')}
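
To see why the one-liner works, it may help to split it into steps. This is only an illustrative breakdown of the expression above; the intermediate names (`swapped`, `ordered`, `lowest`, `filtered`) are made up for the example.

```python
s = {
    (0.75, 'http://www.foo.com'),
    (0.33, 'http://www.bar.com'),
    (0.5, 'http://www.foo.com'),
    (0.66, 'http://www.bar.com'),
}

# Step 1: swap each tuple so the URL comes first and can act as a dict key.
swapped = [(url, score) for score, url in s]

# Step 2: sort in descending order, so that for each URL the lowest
# score is the last one in the list.
ordered = sorted(swapped, reverse=True)

# Step 3: dict() keeps the last value seen for a repeated key,
# which after the descending sort is the lowest score.
lowest = dict(ordered)

# Step 4: swap back to the original (score, url) layout.
filtered = {(score, url) for url, score in lowest.items()}
print(filtered)
```

The sort makes this O(n log n), whereas the loop with min is a single O(n) pass.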

This solution becomes much shorter if the tuples in s are already in (URL, score) order, like this:

s2 = {(v,k) for k, v in s}
s2 
# {('http://www.bar.com', 0.33), ('http://www.bar.com', 0.66), ...}

Then you only need to do this:

list(dict(sorted(s2, reverse=True)).items())
# [('http://www.foo.com', 0.5), ('http://www.bar.com', 0.33)]
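
Since the question also asks about filtering by the max, here is a sketch of the mirrored case, assuming the same (URL, score) layout as s2. Dropping `reverse=True` sorts in ascending order, so the last value the dict sees for each URL is the highest score. The name `highest` is only illustrative.

```python
s2 = {
    ('http://www.foo.com', 0.75),
    ('http://www.bar.com', 0.33),
    ('http://www.foo.com', 0.5),
    ('http://www.bar.com', 0.66),
}

# Ascending sort: for each URL, the highest score comes last,
# and dict() keeps the last value seen for a repeated key.
highest = dict(sorted(s2))
print(highest)
```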
