Flatten a dictionary of dictionaries (2 levels deep) of lists

2022-01-13 00:00:00 python dictionary data-structures mapreduce

Problem description

I'm trying to wrap my brain around this but it's not flexible enough.

In my Python script I have a dictionary of dictionaries of lists. (Actually it gets a little deeper but that level is not involved in this question.) I want to flatten all this into one long list, throwing away all the dictionary keys.

So I want to turn

{1: {'a': [1, 2, 3], 'b': [0]}, 2: {'c': [4, 5, 1], 'd': [3, 8]}}

into

[1, 2, 3, 0, 4, 5, 1, 3, 8]

I could probably set up a map-reduce to iterate over items of the outer dictionary to build a sublist from each subdictionary and then concatenate all the sublists together.
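For concreteness, the map-reduce idea described above might look roughly like the sketch below. The helper name `sublist_for` is made up for illustration, and this is only one way to write it. Each `+` allocates a new list, so the intermediate sublists are built and then discarded.

```python
from functools import reduce

data = {1: {'a': [1, 2, 3], 'b': [0]}, 2: {'c': [4, 5, 1], 'd': [3, 8]}}

def sublist_for(subdict):
    """Hypothetical helper: concatenate every list in one inner dictionary."""
    return reduce(lambda acc, lst: acc + lst, subdict.values(), [])

# "Map" each inner dictionary to a sublist, then "reduce" the sublists
# into one long list. Every `+` creates a fresh intermediate list.
sublists = map(sublist_for, data.values())
flat = reduce(lambda acc, sub: acc + sub, sublists, [])

print(flat)
```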

But that seems inefficient for large data sets, because of the intermediate data structures (sublists) that will get thrown away. Is there a way to do it in one pass?

Barring that, I would be happy to accept a two-level implementation that works... my map-reduce is rusty!

Update: For those who are interested, below is the code I ended up using.

Note that although I asked above for a list as output, what I really needed was a sorted list; i.e. the output of the flattening could be any iterable that can be sorted.

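import operator  # used by the sorting code further down

def genSessions(d):
    """Given the ipDict, return an iterator that provides all the sessions,
    one by one, converted to tuples."""
    # itervalues() is Python 2 only; on Python 3 use values() instead.
    for uaDict in d.itervalues():
        for sessions in uaDict.itervalues():
            for session in sessions:
                yield tuple(session)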

...

# Flatten dict of dicts of lists of sessions into a list of sessions.
# Sort that list by start time
sessionsByStartTime = sorted(genSessions(ipDict), key=operator.itemgetter(0))
# Then make another copy sorted by end time.
sessionsByEndTime = sorted(sessionsByStartTime, key=operator.itemgetter(1))
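The code above is written for Python 2. A possible Python 3 version is sketched below. The snake_case names and the sample `ip_dict` are made up for illustration. The layout of IP to user agent to list of `[start, end]` sessions is an assumption inferred from the variable names and comments, not something the post spells out.

```python
import operator

def gen_sessions(d):
    """Yield every session in a dict of dicts of lists, as a tuple."""
    for ua_dict in d.values():            # values() replaces itervalues()
        for sessions in ua_dict.values():
            for session in sessions:
                yield tuple(session)

# Made-up sample data (assumed shape): IP -> user agent -> [start, end] lists.
ip_dict = {
    '10.0.0.1': {'Firefox': [[5, 9], [1, 4]]},
    '10.0.0.2': {'Chrome': [[3, 12]], 'curl': [[2, 3]]},
}

# sorted() accepts any iterable, so the generator never has to be
# turned into an intermediate list first.
sessions_by_start = sorted(gen_sessions(ip_dict), key=operator.itemgetter(0))
sessions_by_end = sorted(sessions_by_start, key=operator.itemgetter(1))

print(sessions_by_start)  # [(1, 4), (2, 3), (3, 12), (5, 9)]
print(sessions_by_end)    # [(2, 3), (1, 4), (5, 9), (3, 12)]
```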

Thanks again to all who helped.

[Update: replaced nthGetter() with operator.itemgetter(), thanks to @intuited.]

Solution

Edit: I re-read the original question and reworked the answer to assume that all non-dictionaries are lists to be flattened.

In cases where you're not sure how far down the dictionaries go, you would want to use a recursive function. @Arrieta has already posted a function that recursively builds a list of non-dictionary values.

This one is a generator that yields successive non-dictionary values in the dictionary tree:

def flatten(d):
    """Recursively flatten dictionary values in `d`.

    >>> hat = {'cat': ['images/cat-in-the-hat.png'],
    ...        'fish': {'colours': {'red': [0xFF0000], 'blue': [0x0000FF]},
    ...                 'numbers': {'one': [1], 'two': [2]}},
    ...        'food': {'eggs': {'green': [0x00FF00]},
    ...                 'ham': ['lean', 'medium', 'fat']}}
    >>> set_of_values = set(flatten(hat))
    >>> sorted(set_of_values)
    [1, 2, 255, 65280, 16711680, 'fat', 'images/cat-in-the-hat.png', 'lean', 'medium']
    """
    try:
        # itervalues() is Python 2 only; on Python 3 use values() instead.
        for v in d.itervalues():
            for nested_v in flatten(v):
                yield nested_v
    except AttributeError:
        # Not a dictionary: treat it as a list and yield its items.
        for list_v in d:
            yield list_v

The doctest passes the resulting iterator to the set function. This is likely to be what you want, since, as Mr. Martelli points out, there's no intrinsic order to the values of a dictionary, and therefore no reason to keep track of the order in which they were found.
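The generator in this answer targets Python 2. One possible Python 3 port is sketched below. Using an explicit `isinstance` check and `yield from` is a choice made for this sketch, not part of the original answer. Python 3 also refuses to compare `int` with `str`, so the mixed values are sorted on a `(type name, value)` key here.

```python
def flatten(d):
    """Recursively yield the items of every list found in nested dict `d`."""
    if isinstance(d, dict):
        for v in d.values():
            yield from flatten(v)
    else:
        # Assumption carried over from the answer: anything that is not
        # a dictionary is a list whose items should be yielded.
        yield from d

hat = {'cat': ['images/cat-in-the-hat.png'],
       'fish': {'colours': {'red': [0xFF0000], 'blue': [0x0000FF]},
                'numbers': {'one': [1], 'two': [2]}},
       'food': {'eggs': {'green': [0x00FF00]},
                'ham': ['lean', 'medium', 'fat']}}

set_of_values = set(flatten(hat))

# Sort ints before strs by keying on the type name first.
print(sorted(set_of_values, key=lambda v: (type(v).__name__, v)))
```

The same generator also handles the two-level dictionary from the question, since it simply recurses until it reaches a list.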

You may want to keep track of the number of occurrences of each value; this information will be lost if you pass the iterator to set. If you want to track that, just pass the result of flatten(hat) to some other function instead of set. Under Python 2.7, that other function could be collections.Counter. For compatibility with less-evolved pythons, you can write your own function or (with some loss of efficiency) combine sorted with itertools.groupby.
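As a rough illustration of the two counting options just mentioned, the sketch below tallies a hard-coded list of values that stands in for the output of `flatten(...)`. The variable names are made up for this example.

```python
from collections import Counter
from itertools import groupby

# Stand-in for the values a flattening generator would produce.
values = [1, 2, 3, 0, 4, 5, 1, 3, 8]

# Python 2.7 and later: Counter takes any iterable and tallies its items
# in a single pass.
counts = Counter(values)

# Without Counter: sort first so that equal values sit next to each other,
# then count the length of each run. The sort is the extra cost.
counts_groupby = dict((k, len(list(g))) for k, g in groupby(sorted(values)))

print(counts[1], counts[8])          # 2 1
print(counts_groupby == dict(counts))  # True
```

The `groupby` variant only works when the values can be sorted against each other, which is one more reason to prefer `Counter` where it is available.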
