Any experience using h5py for big-data analysis in Python?
Question
I do a lot of statistical work and use Python as my main language. Some of the data sets I work with though can take 20GB of memory, which makes operating on them using in-memory functions in numpy, scipy, and PyIMSL nearly impossible. The statistical analysis language SAS has a big advantage here in that it can operate on data from hard disk as opposed to strictly in-memory processing. But, I want to avoid having to write a lot of code in SAS (for a variety of reasons) and am therefore trying to determine what options I have with Python (besides buying more hardware and memory).
I should clarify that approaches like map-reduce will not help in much of my work because I need to operate on complete sets of data (e.g. computing quantiles or fitting a logistic regression model).
Recently I started playing with h5py and think it is the best option I have found for allowing Python to act like SAS and operate on data from disk (via hdf5 files), while still being able to leverage numpy/scipy/matplotlib, etc. I would like to hear if anyone has experience using Python and h5py in a similar setting and what they have found. Has anyone been able to use Python in "big data" settings heretofore dominated by SAS?
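To make the "operate on data from disk" idea concrete, here is a minimal sketch (the file name, dataset name, and sizes are invented for the example) that writes a dataset to disk chunk by chunk with h5py and later reads back only a slice, so the full array never has to fit in memory:

```python
import numpy as np
import h5py

N, CHUNK = 1_000_000, 100_000

# Create a chunked dataset on disk; only one chunk is in memory at a time.
with h5py.File("example.h5", "w") as f:
    dset = f.create_dataset("measurements", shape=(N,), dtype="f8",
                            chunks=(CHUNK,))
    for start in range(0, N, CHUNK):
        # In real use this block would come from a file, a DB query, etc.
        block = np.random.default_rng(start).normal(size=CHUNK)
        dset[start:start + CHUNK] = block

# Later, pull in just the slice you need instead of the whole array.
with h5py.File("example.h5", "r") as f:
    window = f["measurements"][250_000:260_000]  # reads only this range
    print(window.mean())
```

The same slicing pattern is what lets numpy/scipy routines run on manageable pieces of an otherwise oversized dataset.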
Buying more hardware/memory certainly can help, but from an IT perspective it is hard for me to sell Python to an organization that needs to analyze huge data sets when Python (or R, or MATLAB, etc.) needs to hold the data in memory. SAS continues to have a strong selling point here because, while disk-based analytics may be slower, you can confidently deal with huge data sets. So, I am hoping that Stackoverflow-ers can help me figure out how to reduce the perceived risk around using Python as a mainstay big-data analytics language.
Solution
We use Python in conjunction with h5py, numpy/scipy and boost::python to do data analysis. Our typical datasets have sizes of up to a few hundred GBs.
Advantages of HDF5:
- Data can be inspected conveniently with the h5view application, h5py/ipython, and the h5* command-line tools
- APIs are available for different platforms and languages
- Structure your data with groups
- Annotate your data with attributes
- Worry-free built-in data compression
- I/O on individual datasets is fast
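The group, attribute, and compression points above can be sketched in a few lines (all names here are invented for the example):

```python
import numpy as np
import h5py

with h5py.File("annotated.h5", "w") as f:
    # Structure data with groups, like directories in a filesystem.
    grp = f.create_group("experiment_01/raw")
    # Built-in gzip compression is enabled per dataset.
    dset = grp.create_dataset("signal",
                              data=np.arange(10_000, dtype="f8"),
                              compression="gzip", compression_opts=4)
    # Annotate with attributes instead of side-car metadata files.
    dset.attrs["units"] = "mV"
    dset.attrs["sample_rate_hz"] = 1000

with h5py.File("annotated.h5", "r") as f:
    d = f["experiment_01/raw/signal"]
    print(d.attrs["units"], d.compression)
```

Because the metadata travels inside the file, any HDF5-aware tool on any platform can read both the data and its annotations.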
Pitfalls of HDF5:
- Performance degrades if an h5 file contains too many datasets/groups (> 1000), because traversing them is very slow; on the other hand, I/O is fast for a few large datasets
- Advanced (SQL-like) data queries are clumsy to implement and slow (consider SQLite in that case)
- HDF5 is not thread-safe in all cases: you have to make sure the library was compiled with the correct options
- Changing h5 datasets (resizing, deleting, etc.) blows up the file size (best case) or is impossible (worst case), and the whole h5 file has to be copied to flatten it again