在python中读取巨大的sas数据集

2022-01-08 00:00:00 python-3.x pandas sas

问题描述

我有一个 50 GB 的 SAS 数据集.我想在熊猫数据框中阅读它.快速读取 sas 数据集的最佳方法是什么.

I have a 50 gb SAS dataset. I want to read it in pandas dataframe. What is the best way to fast read the sas dataset.

我使用下面的代码太慢了:

I used the below code which is way too slow:

import pandas as pd
df = pd.read_sas("xxxx.sas7bdat", chunksize = 10000000)
dfs = []
for chunk in df:
    dfs.append(chunk)
df_final = pd.concat(dfs)

有什么方法可以更快地在 python 中读取大型数据集?可以并行运行这个过程吗?

Is there any way faster way to read large dataset in python? Can run this process parallely?


解决方案

我知道这是一个很晚的回应,但我认为我的回答将对未来的读者有用.几个月前,当我必须读取和处理 SAS 数据时,无论是 SAS7BDAT 还是 xpt 格式的 SAS 数据,我都是寻找可用于读取这些数据集的不同库和包,其中,我将库入围如下:

I know it's a very late response but I think my answer is going to be useful for future readers. Few months back when I had to read and process SAS data either SAS7BDAT or xpt format SAS data, I was looking for different libraries and packages available to read these datasets, among them, I shortlisted the libraries as follows:

  1. pandas(由于社区原因,它在高优先级列表中支持和性能)
  2. SAS7BDAT(能够读取SAS7BDAT 仅限文件,最后一次发布于 2019 年 7 月)
  3. pyreadstat(有希望的性能根据文档以及读取元数据的能力)
  1. pandas (It was on high priority list due to community support and performance)
  2. SAS7BDAT (Is able to read SAS7BDAT files only, and last release July 2019)
  3. pyreadstat (Promising performance as per the documentation plus ability to read meta data)

在拿起任何包之前,我做了一些性能基准测试,虽然在发布此答案时我没有基准测试结果,但我发现 pyreadstatpandas,(似乎它在读取文档中提到的数据时使用了多处理,但我不确定),并且在使用 pyreadstat 时内存消耗和占用空间要小得多对比pandas,加上它可以读取元数据,甚至只允许读取元数据,所以我最终选择了pyreadstat.

Before picking up any package, I did some performance benchmarking, although I don't have benchmark result at the time of posting this answer, I found pyreadstat to be faster than pandas, (seems like it's using multiprocessing while reading the data as mentioned in the documentation but I'm not exactly sure), and also the memory consumption and the footprint was much lesser while using pyreadstat in comparison to pandas, plus it is able to read the metadata, and even allows to read the metadeta only, so I finally ended up picking pyreadstat.

pyreadstat读取的数据也是dataframe的形式,不需要手动转换成pandas dataframe.

The data read using pyreadstat is also in the form of dataframe, so it doesn't need some manual conversion to pandas dataframe.

说到读取大的SAS数据,pyreadstatrow_limitoffset参数可以用来读取在块中,因此内存不会成为瓶颈,此外,在读取块中的 SAS 数据时,您可以将每个块转换为分类并将其附加到结果数据中,然后再读取另一个块;它将压缩数据大小,因此内存消耗极低(取决于数据,数据帧中唯一值的数量越少,内存使用量就越少).以下代码片段可能对愿意阅读大型 SAS 数据的人有用:

Talking about reading large SAS data, pyreadstat has row_limit and offset parameters which can be used to read in chunk, so the Memory is not going to be a bottleneck, furthermore, while reading the SAS data in chunk, you can convert each chunk to categorical and append it to the resulting data before reading another chunk; it will compress the data size so the Memory consumption is extremely low (depends on the data, the lesser the number of unique values in the dataframe, is lesser the memory usage). The following code snippet might be useful for someone who is willing to read large SAS data:

import pandas as pd
import pyreadstat
filename = 'foo.SAS7BDAT'
CHUNKSIZE = 50000
offset = 0
allChunk,_ = getChunk(row['filePath'], row_limit=CHUNKSIZE, row_offset=offset)
allChunk = allChunk.astype('category')

while True:
    offset += CHUNKSIZE
    # for xpt data, use pyreadstat.read_xpt()
    chunk, _ = pyreadstat.read_sas7bdat(filename, row_limit=CHUNKSIZE, row_offset=offset)
    if chunk.empty: break  # if chunk is empty, it means the entire data has been read, so break

    for eachCol in chunk:  #converting each column to categorical 
        colUnion = pd.api.types.union_categoricals([allChunk[eachCol], chunk[eachCol]])
        allChunk[eachCol] = pd.Categorical(allChunk[eachCol], categories=colUnion.categories)
        chunk[eachCol] = pd.Categorical(chunk[eachCol], categories=colUnion.categories)

    allChunk = pd.concat([allChunk, chunk])  #Append each chunk to the resulting dataframe

PS:请注意,生成的数据帧 allChunk 将所有列作为 Categorical 数据

PS: Please be noted that the resulting dataframe allChunk is going to have all column as Categorical data

这是针对 CDISC 的真实数据(原始和标准化)执行的一些基准测试(将文件读取到数据帧的时间),文件大小范围从几 KB 到几 MB,包括 xpt 和 sas7bdat 文件格式:

Here is some benchmark (Time to read the file to a dataframe) performed on real data (Raw and Standardized) for CDISC, the file size ranges from some KB to some MB, and includes both xpt and sas7bdat file formats:

Reading ADAE.xpt 49.06 KB for 100 loops:
    Pandas Average time : 0.02232 seconds
    Pyreadstat Average time : 0.04819 seconds
----------------------------------------------------------------------------
Reading ADIE.xpt 27.73 KB for 100 loops:
    Pandas Average time : 0.01610 seconds
    Pyreadstat Average time : 0.03981 seconds
----------------------------------------------------------------------------
Reading ADVS.xpt 386.95 KB for 100 loops:
    Pandas Average time : 0.03248 seconds
    Pyreadstat Average time : 0.07580 seconds
----------------------------------------------------------------------------
Reading beck.sas7bdat 14.72 MB for 50 loops:
    Pandas Average time : 5.30275 seconds
    Pyreadstat Average time : 0.60373 seconds
----------------------------------------------------------------------------
Reading p0_qs.sas7bdat 42.61 MB for 50 loops:
    Pandas Average time : 15.53942 seconds
    Pyreadstat Average time : 1.69885 seconds
----------------------------------------------------------------------------
Reading ta.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.04017 seconds
    Pyreadstat Average time : 0.00152 seconds
----------------------------------------------------------------------------
Reading te.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.01052 seconds
    Pyreadstat Average time : 0.00109 seconds
----------------------------------------------------------------------------
Reading ti.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.04446 seconds
    Pyreadstat Average time : 0.00179 seconds
----------------------------------------------------------------------------
Reading ts.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.01273 seconds
    Pyreadstat Average time : 0.00129 seconds
----------------------------------------------------------------------------
Reading t_frcow.sas7bdat 14.59 MB for 50 loops:
    Pandas Average time : 7.93266 seconds
    Pyreadstat Average time : 0.92295 seconds

如您所见,对于 xpt 文件,读取文件的时间并不好,但对于 sas7bdat 文件,pyreadstat 的性能优于 pandas.

As you can see, for xpt files, the time to read the files isn't better, but for sas7bdat files, pyreadstat just outperforms pandas.

上述基准测试是在 pyreadstat 1.0.9、pandas 1.2.4 和 Python 3.7.5 上执行的.

相关文章