有很多字典的列表 VS 有很少列表的字典?

2022-01-21 00:00:00 python pandas dataset

问题描述

我正在用这样的数据集做一些练习:

I am doing some exercises with datasets like so:

包含许多词典的列表

users = [
    {"id": 0, "name": "Ashley"},
    {"id": 1, "name": "Ben"},
    {"id": 2, "name": "Conrad"},
    {"id": 3, "name": "Doug"},
    {"id": 4, "name": "Evin"},
    {"id": 5, "name": "Florian"},
    {"id": 6, "name": "Gerald"}
]

列表很少的字典

users2 = {
    "id": [0, 1, 2, 3, 4, 5, 6],
    "name": ["Ashley", "Ben", "Conrad", "Doug","Evin", "Florian", "Gerald"]
}

Pandas 数据框

import pandas as pd
pd_users = pd.DataFrame(users)
pd_users2 = pd.DataFrame(users2)
print pd_users == pd_users2

问题:

  1. 我应该像 users 还是 users2 那样构建数据集?
  2. 是否存在性能差异?
  3. 一个比另一个更易读吗?
  4. 是否有我应该遵循的标准?
  5. 我通常将这些转换为 pandas 数据帧.当我这样做时,两个版本是相同的......对吗?
  6. 每个元素的输出都是正确的,所以我是否使用 panda df 并不重要?


解决方案

这涉及到 面向列的数据库 与面向行.您的第一个示例是面向行的数据结构,第二个示例是面向列的.在 Python 的特定情况下,第一个可以使用 slots 显着提高效率,这样不需要为每一行复制列字典.

This relates to column oriented databases versus row oriented. Your first example is a row oriented data structure, and the second is column oriented. In the particular case of Python, the first could be made notably more efficient using slots, such that the dictionary of columns doesn't need to be duplicated for every row.

哪种形式效果更好在很大程度上取决于您如何处理数据;例如,如果您只访问任何行的所有内容,那么面向行是很自然的.同时,当您按特定字段搜索时,面向列可以更好地利用缓存等(在 Python 中,这可能会因大量使用引用而减少;类型如 array 可以优化它).传统的面向行的数据库经常使用面向列的排序索引来加快查找速度,了解这些技术后,您可以使用键值存储实现任意组合.

Which form works better depends a lot on what you do with the data; for instance, row oriented is natural if you only ever access all of any row. Column oriented meanwhile makes much better use of caches and such when you're searching by a particular field (in Python, this may be reduced by the heavy use of references; types like array can optimize that). Traditional row oriented databases frequently use column oriented sorted indices to speed up lookups, and knowing these techniques you can implement any combination using a key-value store.

Pandas 确实将您的两个示例转换为相同的格式,但是对于面向行的结构,转换本身的成本更高,因为必须读取每个单独的字典.所有这些成本都可能是微不足道的.

Pandas does convert both your examples to the same format, but the conversion itself is more expensive for the row oriented structure, simply because every individual dictionary must be read. All of these costs may be marginal.

第三个选项在您的示例中不明显:在这种情况下,您只有两列,其中一列是从 0 开始的连续范围内的整数 ID.这可以按条目本身的顺序存储,这意味着整个结构将在您调用的列表中找到 users2['name'];但值得注意的是,没有位置的条目是不完整的.该列表使用 enumerate() 转换为行.数据库通常也有这种特殊情况(例如,sqlite rowid).

There's a third option not evident in your example: In this case, you only have two columns, one of which is an integer ID in a contiguous range from 0. This can be stored in the order of the entries itself, meaning the entire structure would be found in the list you've called users2['name']; but notably, the entries are incomplete without their position. The list translates into rows using enumerate(). It is common for databases to have this special case also (for instance, sqlite rowid).

一般来说,从保持代码合理的数据结构开始,并仅在您了解自己的用例并存在可衡量的性能问题时进行优化.Pandas 之类的工具可能意味着大多数项目无需微调即可正常运行.

In general, start with a data structure that keeps your code sensible, and optimize only when you know your use cases and have a measurable performance issue. Tools like Pandas probably means most projects will function just fine without finetuning.

相关文章