Exporting from / importing to numpy and scipy in SQLite and HDF5 formats

2021-12-31 00:00:00 python hdf5 numpy scipy sqlite

There seem to be many choices for Python to interface with SQLite (sqlite3, atpy) and HDF5 (h5py, pyTables). I wonder if anyone has experience using these together with numpy arrays or data tables (structured/record arrays), and which of these integrate most seamlessly with the "scientific" modules (numpy, scipy) for each data format (SQLite and HDF5).

Recommended answer

Most of it depends on your use case.

I have a lot more experience dealing with the various HDF5-based methods than with traditional relational databases, so I can't comment too much on SQLite libraries for Python...

At least as far as h5py vs. pyTables goes, they both offer very seamless access via numpy arrays, but they're oriented towards very different use cases.

If you have n-dimensional data and you want quick access to an arbitrary index-based slice of it, then it's much simpler to use h5py. If you have data that's more table-like, and you want to query it, then pyTables is a much better option.

h5py is a relatively "vanilla" wrapper around the HDF5 libraries compared to pyTables. This is a very good thing if you're going to be regularly accessing your HDF file from another language (pyTables adds some extra metadata). h5py can do a lot, but for some use cases (e.g. what pyTables does) you're going to need to spend more time tweaking things.

pyTables has some really nice features. However, if your data doesn't look much like a table, then it's probably not the best option.

To give a more concrete example, I work a lot with fairly large (tens of GB) 3- and 4-dimensional arrays of data. They're homogeneous arrays of floats, ints, uint8s, etc. I usually want to access a small subset of the entire dataset. h5py makes this very simple, and does a fairly good job of auto-guessing a reasonable chunk size. Grabbing an arbitrary chunk or slice from disk is much, much faster than for a simple memmapped file. (Emphasis on arbitrary... Obviously, if you want to grab an entire "X" slice, then a C-ordered memmapped array is impossible to beat, as all the data in an "X" slice are adjacent on disk.)
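As a rough illustration of that workflow, here is a minimal sketch using h5py. The file name, the dataset name `volume`, and the array shape are made up for the example, and the array is tiny compared to the tens of GB mentioned above. Passing `chunks=True` asks h5py to guess a chunk shape itself.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, chosen only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "volume.h5")

# A small stand-in for a large homogeneous 3-D array.
data = np.arange(40 * 50 * 60, dtype=np.float32).reshape(40, 50, 60)

with h5py.File(path, "w") as f:
    # chunks=True lets h5py pick a chunk shape automatically.
    dset = f.create_dataset("volume", data=data, chunks=True)
    chunk_shape = dset.chunks

with h5py.File(path, "r") as f:
    dset = f["volume"]
    # Slicing the dataset returns a numpy array holding just that block;
    # the full array is never loaded into memory.
    block = dset[10:20, 5:15, 30:40]
    # A slice along the last axis, which is strided in a C-ordered layout.
    plane = dset[:, :, 7]

print(chunk_shape, block.shape, plane.shape)
```

The slicing syntax is the same as for an in-memory numpy array, which is most of what "seamless" means here. Whether this beats a memmapped file for a given access pattern depends on the chunk shape and the data, so it is worth timing on real data.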

As a counterexample, my wife collects data from a wide array of sensors that sample at minute to second intervals over several years. She needs to store and run arbitrary queries (and relatively simple calculations) on her data. pyTables makes this use case very easy and fast, and still has some advantages over traditional relational databases (particularly in terms of disk usage and the speed at which a large, index-based chunk of data can be read into memory).
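A query-style workflow of this kind might look roughly like the sketch below. The table layout (`timestamp`, `sensor_id`, `value`), the file name, and the synthetic readings are all assumptions made for illustration; they are not taken from the data described above.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file name for this sketch.
path = os.path.join(tempfile.mkdtemp(), "sensors.h5")

# Hypothetical schema: one row per reading, stored as a structured array.
readings = np.zeros(
    1000,
    dtype=[("timestamp", "f8"), ("sensor_id", "i4"), ("value", "f8")],
)
readings["timestamp"] = np.arange(1000) * 60.0   # one sample per minute
readings["sensor_id"] = np.arange(1000) % 4      # four made-up sensors
readings["value"] = np.linspace(0.0, 99.9, 1000)

with tables.open_file(path, mode="w") as h5:
    # The numpy dtype doubles as the table description.
    table = h5.create_table("/", "readings", description=readings.dtype)
    table.append(readings)
    table.flush()

with tables.open_file(path, mode="r") as h5:
    table = h5.root.readings
    # The condition string is evaluated inside PyTables; the matching rows
    # come back as a numpy structured array.
    hits = table.read_where("(sensor_id == 2) & (value > 50.0)")

print(len(hits), hits.dtype.names)
```

The result is an ordinary structured array, so it can be passed straight to numpy or scipy routines for the "relatively simple calculations" part.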
