Dask metadata mismatch found in `from_delayed` when reading JSON files

2022-01-21 00:00:00 python dask bigdata dataset

Problem description

I'm just starting my adventure with Dask, and I'm learning on an example dataset in JSON format. I know that this is not the easiest data format in the world for a beginner :)

I have a dataset in JSON format. I loaded the data into a dataframe via dd.read_json and everything went well. The problem occurred with, for example, the compute() or len() function.

I get this error:

ValueError: Metadata mismatch found in `from_delayed`.

Partition type: `DataFrame`
+----------+-------+----------+
| Column   | Found | Expected |
+----------+-------+----------+
| column1  |   -   | object   |
| column2  |   -   | object   |
+----------+-------+----------+

I tried different things, but nothing helped. I don't know how to handle this error.

Please help, I will be very grateful!


Solution

My guess is that your JSON data has different columns in different parts of the data. When Dask DataFrame loads your JSON data, it looks at the first chunk of data to determine the column names and dtypes, and then assumes that all of your data looks like that.

This assumption turns out to be wrong in your case; probably there is some column that only appears later on in the file.

You might consider increasing the size of the sample that Dask reads when determining metadata like column names:

df = dd.read_json(..., sample=2**26)

The default is 1 MB (2**20).
