Chain datasets from multiple HDF5 files/datasets

2022-01-21 00:00:00 python h5py numpy arrays dataset

Problem description

The benefits and simple mapping that h5py provides (through HDF5) for persisting datasets on disk are exceptional. I run some analysis on a set of files and store the result into a dataset, one for each file. At the end of this step, I have a set of h5py.Dataset objects which contain 2D arrays. The arrays all have the same number of columns but different numbers of rows, i.e. (A, N), (B, N), (C, N), etc.

I would now like to access these multiple 2D arrays as a single 2D array. That is, I would like to read them on demand as an array of shape (A+B+C, N).

For this purpose, the h5py.Link classes do not help, as they work at the level of HDF5 nodes.
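
To illustrate that point (an example of mine, not from the original question, with a hypothetical file name): a soft link simply aliases an existing node under a new name, so it cannot splice row ranges from several datasets into one.

import numpy as np
import h5py

with h5py.File("links_demo.h5", "w") as f:
    f.create_dataset("a", data=np.random.random((100, 50)))
    f["alias"] = h5py.SoftLink("/a")  # the link targets the whole node
    print(f["alias"].shape)           # (100, 50): same dataset, new name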

Here is some pseudo-code:

import numpy as np
import h5py

# Pseudo-code: real h5py datasets are created through a File object,
# e.g. f.create_dataset(...)
a = h5py.Dataset('a', data=np.random.random((100, 50)))
b = h5py.Dataset('b', data=np.random.random((300, 50)))
c = h5py.Dataset('c', data=np.random.random((253, 50)))

# I want to view these arrays as a single array, stacked along the rows
combined = magic_array_linker([a, b, c], axis=0)
assert combined.shape == (100+300+253, 50)

For my purposes, suggestions to copy the arrays into a new file do not work. I'm also open to solving this at the numpy level, but I haven't found any suitable options with numpy.view or numpy.concatenate that would work without copying out the data.
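
As a quick check (mine, not part of the original question), numpy.concatenate always allocates a fresh buffer, so its result is never a view of the inputs:

import numpy as np

a = np.zeros((2, 3))
b = np.ones((4, 3))
combined = np.concatenate([a, b], axis=0)  # allocates new memory
a[0, 0] = 99.0
print(combined[0, 0])         # still 0.0: the data was copied
print(combined.base is None)  # True: combined owns its own buffer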

Does anybody know of a way to view multiple arrays as a stacked set of arrays, without copying and starting from h5py.Dataset objects?


Solution

First up, I don't think there is a way to do this without copying the data in order to return a single array. As far as I can tell, it's not possible to concatenate numpy views into one array - unless, of course, you create your own wrapper.

Here I demonstrate a proof of concept using Object/Region references. The basic premise is that we make a new dataset in the file which is an array of references to the constituent subarrays. By storing references like this, the subarrays can change size dynamically and indexing the wrapper will always index the correct subarrays.

As this is just a proof of concept, I haven't implemented proper slicing, just very simple indexing. There's also no attempt at error checking - this will almost definitely break in production.

class MagicArray(object):
    """Magically index an array of references."""
    def __init__(self, file, references, axis=0):
        self.file = file
        self.references = references
        self.axis = axis

    def __getitem__(self, items):
        # Normalise a single index to a tuple, then make it a mutable
        # list, since we need to modify the index along the stacking axis
        if not isinstance(items, tuple):
            items = (items,)
        items = list(items)

        for item in items:
            if hasattr(item, 'start'):
                # item is a slice object
                raise ValueError('Slices not implemented')

        for ref in self.references:
            size = self.file[ref].shape[self.axis]

            # Check if the requested index falls in this subarray.
            # If not, subtract the subarray's size and move on.
            if items[self.axis] < size:
                item_ref = ref
                break
            else:
                items[self.axis] = items[self.axis] - size

        # Dereference the chosen subarray and index into it
        return self.file[item_ref][tuple(items)]

Here's how you use it:

import numpy as np
import h5py

with h5py.File("test.h5", 'w') as f:
    a = f.create_dataset('a', data=np.random.random((100, 50)))
    b = f.create_dataset('b', data=np.random.random((300, 50)))
    c = f.create_dataset('c', data=np.random.random((253, 50)))

    # A dataset holding one object reference per constituent subarray
    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    ref_dataset = f.create_dataset("refs", (3,), dtype=ref_dtype)

    for i, dset in enumerate([a, b, c]):
        ref_dataset[i] = dset.ref

with h5py.File("test.h5", 'r') as f:
    foo = MagicArray(f, f['refs'], axis=0)
    print(foo[104, 4])   # global row 104 -> row 4 of 'b'
    print(f['b'][4, 4])  # same value, read directly

This should be fairly trivial to extend to fancier indexing (i.e. being able to handle slices), but I can't see how to do so without copying data.
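
As a rough sketch of that extension (my own addition, assuming slicing happens only along the stacking axis with axis=0 and plain start/stop slices), one could read the overlapping portion of each subarray and concatenate the pieces, which, as noted, copies the selected data:

import numpy as np

def read_slice(magic, start, stop):
    """Read rows [start, stop) across the referenced subarrays (axis 0 only)."""
    pieces = []
    offset = 0  # rows consumed by the subarrays seen so far
    for ref in magic.references:
        dset = magic.file[ref]
        size = dset.shape[0]
        # Overlap of [start, stop) with this subarray's row range
        lo = max(start - offset, 0)
        hi = min(stop - offset, size)
        if lo < hi:
            pieces.append(dset[lo:hi])
        offset += size
    return np.concatenate(pieces, axis=0)  # copies the selected rows

For example, read_slice(foo, 98, 105) would stitch the last 2 rows of 'a' to the first 5 rows of 'b', giving shape (7, 50).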

You might be able to subclass from numpy.ndarray and get all the usual methods as well.
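
A lighter-weight alternative to subclassing numpy.ndarray (a sketch of my own, not part of the original answer) is to give the wrapper an __array__ method, so that numpy functions can consume it directly, at the cost of materialising a copy, consistent with the discussion above:

import numpy as np

class MagicArrayExport(MagicArray):  # hypothetical subclass name
    def __array__(self, dtype=None):
        # Materialise the stacked array by reading every subarray (copies!)
        parts = [self.file[ref][...] for ref in self.references]
        out = np.concatenate(parts, axis=self.axis)
        return out if dtype is None else out.astype(dtype)

With this, np.asarray(MagicArrayExport(f, f['refs'])) yields the full (653, 50) array in memory.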
