Writing and appending arrays of float to the only dataset in an HDF5 file in C++

2022-01-22 00:00:00 hdf5 file arrays append c++

I am processing a number of files. Processing each file outputs several thousand arrays of float, and I want to store the data from all files in one huge dataset in a single HDF5 file for further processing.

The thing is, I am currently confused about how to append my data to the HDF5 file (see the comment in the code above). In the two for loops above, as you can see, I want to append one 1-dimensional array of float to the HDF5 file at a time, not the whole thing at once. My data is in the terabytes, so we can only append the data to the file.

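There are a few questions:

  1. How do I append the data in this case? What kind of function must I use?
  2. Right now I have fdim[0] = 928347543. I have tried putting in HDF5's infinity flag (H5S_UNLIMITED), but the runtime execution complains. Is there a way to do this? I don't want to calculate the amount of data I have each time; is there a way to simply keep adding data without caring about the value of fdim?

Or is this not possible?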

I've been following Simon's suggestion, and here is my updated code:

hid_t desFi5;
hid_t fid1;
hid_t propList;
hsize_t fdim[2];

desFi5 = H5Fcreate(saveFilePath, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

fdim[0] = 3;
fdim[1] = 1;//H5S_UNLIMITED;

fid1 = H5Screate_simple(2, fdim, NULL);

cout << "----------------------------------Space done\n";

propList = H5Pcreate( H5P_DATASET_CREATE);

H5Pset_layout( propList, H5D_CHUNKED );

int ndims = 2;
hsize_t chunk_dims[2];
chunk_dims[0] = 3;
chunk_dims[1] = 1;

H5Pset_chunk( propList, ndims, chunk_dims );

cout << "----------------------------------Property done\n";

hid_t dataset1 = H5Dcreate( desFi5, "des", H5T_NATIVE_FLOAT, fid1, H5P_DEFAULT, propList, H5P_DEFAULT);

cout << "----------------------------------Dataset done\n";

bufi = new float*[1];
bufi[0] = new float[3];
bufi[0][0] = 0;
bufi[0][1] = 1;
bufi[0][2] = 2;

//hyperslab
hsize_t start[2] = {0,0};
hsize_t stride[2] = {1,1};
hsize_t count[2] = {1,1};
hsize_t block[2] = {1,3};

H5Sselect_hyperslab( fid1, H5S_SELECT_OR, start, stride, count, block);     
cout << "----------------------------------hyperslab done\n";

H5Dwrite(dataset1, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, *bufi);

fdim[0] = 3;
fdim[1] = H5S_UNLIMITED;    // COMPLAINS HERE
H5Dset_extent( dataset1, fdim );

cout << "----------------------------------extent done\n";

//hyperslab2
hsize_t start2[2] = {1,0};
hsize_t stride2[2] = {1,1};
hsize_t count2[2] = {1,1};
hsize_t block2[2] = {1,3};

H5Sselect_hyperslab( fid1, H5S_SELECT_OR, start2, stride2, count2, block2);     
cout << "----------------------------------hyperslab2 done\n";

H5Dwrite(dataset1, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, *bufi);

cout << "----------------------------------H5Dwrite done\n";
H5Dclose(dataset1);
cout << "----------------------------------dataset closed\n";
H5Pclose( propList );   
cout << "----------------------------------property list closed\n";
H5Sclose(fid1); 
cout << "----------------------------------dataspace fid1 closed\n";
H5Fclose(desFi5);       
cout << "----------------------------------desFi5 closed\n";

My current output is:

bash-3.2$ ./hdf5AppendTest.out
----------------------------------Space done
----------------------------------Property done
----------------------------------Dataset done
----------------------------------hyperslab done
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
  #000: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5D.c line 1103 in H5Dset_extent(): unable to set extend dataset
    major: Dataset
    minor: Unable to initialize object
  #001: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5Dint.c line 2179 in H5D__set_extent(): unable to modify size of data space
    major: Dataset
    minor: Unable to initialize object
  #002: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5S.c line 1874 in H5S_set_extent(): dimension cannot exceed the existing maximal size (new: 18446744073709551615 max: 1)
    major: Dataspace
    minor: Bad value
----------------------------------extent done
----------------------------------hyperslab2 done
----------------------------------H5Dwrite done
----------------------------------dataset closed
----------------------------------property list closed
----------------------------------dataspace fid1 closed
----------------------------------desFi5 closed

Currently, I see that setting a dimension to unlimited with H5Dset_extent still causes a problem at runtime. (The problematic call is marked with //COMPLAINS HERE in the code above.) I already have a chunked dataset, as specified by Simon, so what is the problem here?

On the other hand, without H5Dset_extent I can write a test array of [0, 1, 2] just fine. But how can we make the code above output the test array to the file repeatedly, like this:

[0, 1, 2]
[0, 1, 2]
[0, 1, 2]
[0, 1, 2]
...
...

Recall: this is just a test array. The real data is bigger and I cannot hold the whole thing in RAM, so I must write the data part by part, one piece at a time.

Edit 2:

I've followed more of Simon's suggestions. Here is the critical part:

hsize_t n = 3, p = 1;
float *bufi_data = new float[n * p];
float ** bufi = new float*[n];
for (hsize_t i = 0; i < n; ++i){
    bufi[i] = &bufi_data[i * n];
}

bufi[0][0] = 0.1;
bufi[0][1] = 0.2;
bufi[0][2] = 0.3;

//hyperslab
hsize_t start[2] = {0,0};
hsize_t count[2] = {3,1};

H5Sselect_hyperslab( fid1, H5S_SELECT_SET, start, NULL, count, NULL);
cout << "----------------------------------hyperslab done\n";

H5Dwrite(dataset1, H5T_NATIVE_FLOAT, H5S_ALL, fid1, H5P_DEFAULT, *bufi);

bufi[0][0] = 0.4;
bufi[0][1] = 0.5;
bufi[0][2] = 0.6;

hsize_t fdimNew[2];
fdimNew[0] = 3;
fdimNew[1] = 2;
H5Dset_extent( dataset1, fdimNew );

cout << "----------------------------------extent done\n";

//hyperslab2
hsize_t start2[2] = {0,0}; //PROBLEM
hsize_t count2[2] = {3,1};

H5Sselect_hyperslab( fid1, H5S_SELECT_SET, start2, NULL, count2, NULL);     
cout << "----------------------------------hyperslab2 done\n";

H5Dwrite(dataset1, H5T_NATIVE_FLOAT, H5S_ALL, fid1, H5P_DEFAULT, *bufi);

From the above, I got the following contents in the HDF5 file:

0.4 0.5 0.6
  0   0   0

After further experiments with start2 and count2, I see that these variables only affect the starting index and the increment for bufi. They do not move the write position in my dataset at all.

Recall: the final result must be:

0.1 0.2 0.3
0.4 0.5 0.6

Also, Simon, it must be *bufi instead of bufi for H5Dwrite, because bufi gives me completely random numbers.

Update 3:

For the selection part suggested by Simon:

hsize_t start[2] = {0, 0};
hsize_t count[2] = {1, 3};

hsize_t start[2] = {1, 0};
hsize_t count[2] = {1, 3};

These give the following error:

HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
  #000: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5Dio.c line 245 in H5Dwrite(): file selection+offset not within extent
    major: Dataspace
    minor: Out of range

count[2] should be {3,1} rather than {1,3}, I suppose? And if I don't set start[2] to {0,0}, it always raises the error above.

Are you sure this is correct?

Recommended answer

How to append the data in this case? What kind of function must I use?

You must use hyperslabs; they are what lets you write only part of a dataset. The function that selects one is H5Sselect_hyperslab. Call it on fid1 and pass fid1 as the file dataspace in your H5Dwrite call.

I have tried putting in HDF5's infinity flag, but the runtime execution complains.

You need to create a chunked dataset in order to be able to set its maximum size to unlimited. The maximum dimensions themselves are passed as the third argument of H5Screate_simple when the dataspace is created. Create a dataset creation property list and use H5Pset_layout to make it chunked. Use H5Pset_chunk to set the chunk size. Then create your dataset using this property list.
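H5Pset_chunk takes the chunk dimensions in elements, not bytes. As a rough sketch of how one might derive the number of rows per chunk from a byte budget: the helper name rows_per_chunk and the idea of targeting a fixed number of bytes per chunk are assumptions made for illustration, not something HDF5 prescribes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of HDF5): number of rows of `ncols` floats
// that fit in a chunk of roughly `target_bytes` bytes. Returns at least 1,
// so a single row is used even when one row exceeds the budget.
unsigned long long rows_per_chunk(unsigned long long ncols,
                                  unsigned long long target_bytes)
{
    assert(ncols > 0);
    const unsigned long long row_bytes = ncols * sizeof(float);
    const unsigned long long rows = target_bytes / row_bytes;
    return rows > 0 ? rows : 1;
}
```

The result would be used as the first entry of the chunk dimensions handed to H5Pset_chunk, with the fixed row length as the second entry. A suitable budget depends on the access pattern, so measure before settling on one.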

I don't want to calculate the amount of data I have each time; is there a way to simply keep adding data without caring about the value of fdim?

You can do two things:

  1. Precompute the final size so you can create a dataset that is big enough. It looks like that is what you are doing.
  2. Extend your dataset as you go using H5Dset_extent. For this you need to set the maximum dimensions to unlimited, so you need a chunked dataset (see above).

In both cases, you need to select a hyperslab on the file dataspace and pass that dataspace to your H5Dwrite call (see above).
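The bookkeeping behind an append is small: the new block starts at the current row count, and the dataset must grow by the number of rows being added. A minimal sketch of that arithmetic follows. The names AppendPlan and plan_append are hypothetical, and extent_t is a stand-in for HDF5's hsize_t so the snippet compiles without the library; real code would use hsize_t arrays directly.

```cpp
#include <cassert>

// Stand-in for HDF5's hsize_t (assumption made so this compiles standalone).
using extent_t = unsigned long long;

// Hypothetical helper type: the three arrays needed to append rows
// to a 2D dataset whose second dimension is fixed.
struct AppendPlan {
    extent_t new_dims[2]; // new total size, for H5Dset_extent
    extent_t start[2];    // hyperslab start in the file dataspace
    extent_t count[2];    // hyperslab count in the file dataspace
};

// Plan the append of `new_rows` rows to a dataset that currently
// holds `current_rows` rows of `ncols` values each.
AppendPlan plan_append(extent_t current_rows, extent_t new_rows, extent_t ncols)
{
    assert(ncols > 0);
    AppendPlan plan;
    plan.new_dims[0] = current_rows + new_rows;
    plan.new_dims[1] = ncols;
    plan.start[0] = current_rows; // first free row
    plan.start[1] = 0;
    plan.count[0] = new_rows;
    plan.count[1] = ncols;
    return plan;
}
```

With HDF5, current_rows would typically come from calling H5Sget_simple_extent_dims on the dataspace returned by H5Dget_space, so the caller never has to track the total size itself.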

#include <iostream>
#include <hdf5.h>

// Constants
const char saveFilePath[] = "test.h5";
const hsize_t ndims = 2;
const hsize_t ncols = 3;

int main()
{

First, create an HDF5 file.

    hid_t file = H5Fcreate(saveFilePath, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    std::cout << "- File created" << std::endl;

Then create a 2D dataspace. The size of the first dimension is unlimited. We set it to 0 initially to show how you can extend the dataset at each step. You could also set it to the size of the first buffer you are going to write, for instance. The size of the second dimension is fixed.

    hsize_t dims[ndims] = {0, ncols};
    hsize_t max_dims[ndims] = {H5S_UNLIMITED, ncols};
    hid_t file_space = H5Screate_simple(ndims, dims, max_dims);
    std::cout << "- Dataspace created" << std::endl;

Then create a dataset creation property list. The layout of the dataset has to be chunked when using unlimited dimensions. The choice of chunk size affects performance, both in time and in disk space. If the chunks are very small, you will have a lot of overhead. If they are too large, you might allocate space that you don't need and your files might end up being too large. This is a toy example, so we will choose chunks of two lines.

    hid_t plist = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_layout(plist, H5D_CHUNKED);
    hsize_t chunk_dims[ndims] = {2, ncols};
    H5Pset_chunk(plist, ndims, chunk_dims);
    std::cout << "- Property list created" << std::endl;

Create the dataset.

    hid_t dset = H5Dcreate(file, "dset1", H5T_NATIVE_FLOAT, file_space, H5P_DEFAULT, plist, H5P_DEFAULT);
    std::cout << "- Dataset 'dset1' created" << std::endl;

Close resources. The dataset is now created, so we no longer need the property list. We no longer need the file dataspace either: once the dataset is extended, this handle becomes stale because it still holds the previous extent, so we will have to fetch the updated file dataspace anyway.

    H5Pclose(plist);
    H5Sclose(file_space);

We will now append two buffers to the end of the dataset. The first one will be two lines long. The second one will be three lines long.

We create a 2D buffer (contiguous in memory, row-major order). We allocate enough memory to store 3 lines so that we can reuse the buffer. Let us also create an array of pointers so we can use the b[i][j] notation instead of buffer[i * ncols + j]. This is purely cosmetic.

    hsize_t nlines = 3;
    float *buffer = new float[nlines * ncols];
    float **b = new float*[nlines];
    for (hsize_t i = 0; i < nlines; ++i){
        b[i] = &buffer[i * ncols];
    }

Initial values in the buffer to be written to the dataset:

    b[0][0] = 0.1;
    b[0][1] = 0.2;
    b[0][2] = 0.3;
    b[1][0] = 0.4;
    b[1][1] = 0.5;
    b[1][2] = 0.6;

We create a memory dataspace to indicate the size of our buffer in memory. Remember that the first buffer is only two lines long.

    dims[0] = 2;
    dims[1] = ncols;
    hid_t mem_space = H5Screate_simple(ndims, dims, NULL);
    std::cout << "- Memory dataspace created" << std::endl;

We now need to extend the dataset. We set the initial size of the dataset to 0 x 3, so we need to extend it before writing. Note that we extend the dataset itself, not its dataspace. Remember that the first buffer is only two lines long.

    dims[0] = 2;
    dims[1] = ncols;
    H5Dset_extent(dset, dims);
    std::cout << "- Dataset extended" << std::endl;

Select a hyperslab on the file dataspace of the dataset.

    file_space = H5Dget_space(dset);
    hsize_t start[2] = {0, 0};
    hsize_t count[2] = {2, ncols};
    H5Sselect_hyperslab(file_space, H5S_SELECT_SET, start, NULL, count, NULL);
    std::cout << "- First hyperslab selected" << std::endl;

Write the buffer to the dataset. mem_space and file_space should now have the same number of elements selected. Note that buffer and &b[0][0] are equivalent.

    H5Dwrite(dset, H5T_NATIVE_FLOAT, mem_space, file_space, H5P_DEFAULT, buffer);
    std::cout << "- First buffer written" << std::endl;

We can now close the file dataspace. We could also close the memory dataspace and create a new one for the second buffer, but we will simply update its size.

    H5Sclose(file_space);

Second buffer

New values in the buffer to be appended to the dataset:

    b[0][0] = 1.1;
    b[0][1] = 1.2;
    b[0][2] = 1.3;
    b[1][0] = 1.4;
    b[1][1] = 1.5;
    b[1][2] = 1.6;
    b[2][0] = 1.7;
    b[2][1] = 1.8;
    b[2][2] = 1.9;

Resize the memory dataspace to indicate the new size of our buffer. The second buffer is three lines long.

    dims[0] = 3;
    dims[1] = ncols;
    H5Sset_extent_simple(mem_space, ndims, dims, NULL);
    std::cout << "- Memory dataspace resized" << std::endl;

Extend the dataset. Note that in this simple example we know that 2 + 3 = 5. In general, you could read the current extent from the file dataspace (for example with H5Sget_simple_extent_dims) and add the desired number of lines to it.

    dims[0] = 5;
    dims[1] = ncols;
    H5Dset_extent(dset, dims);
    std::cout << "- Dataset extended" << std::endl;

Select a hyperslab on the file dataspace of the dataset. Again, in this simple example we know that 0 + 2 = 2, so the new block starts at line 2. In general, you could read the current extent from the file dataspace. The second buffer is three lines long.

    file_space = H5Dget_space(dset);
    start[0] = 2;
    start[1] = 0;
    count[0] = 3;
    count[1] = ncols;
    H5Sselect_hyperslab(file_space, H5S_SELECT_SET, start, NULL, count, NULL);
    std::cout << "- Second hyperslab selected" << std::endl;

Append the buffer to the dataset.

    H5Dwrite(dset, H5T_NATIVE_FLOAT, mem_space, file_space, H5P_DEFAULT, buffer);
    std::cout << "- Second buffer written" << std::endl;

The end: let's close all the resources:

    delete[] b;
    delete[] buffer;
    H5Sclose(file_space);
    H5Sclose(mem_space);
    H5Dclose(dset);
    H5Fclose(file);
    std::cout << "- Resources released" << std::endl;
}
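The row-pointer array b in the program above exists only for notation; what H5Dwrite receives is a pointer to the first element of one contiguous row-major block. As an alternative sketch, the same layout can be kept in a std::vector so that no delete[] is needed. The class name RowMajorBuffer and its members are hypothetical and are not part of the answer's code or of HDF5.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper: a 2D float buffer stored contiguously in
// row-major order, zero-initialised on construction.
class RowMajorBuffer {
public:
    RowMajorBuffer(std::size_t nrows, std::size_t ncols)
        : ncols_(ncols), data_(nrows * ncols, 0.0f) {}

    // Element (i, j) lives at offset i * ncols + j.
    float& at(std::size_t i, std::size_t j)
    {
        assert(j < ncols_ && i * ncols_ + j < data_.size());
        return data_[i * ncols_ + j];
    }

    // Pointer to the first element: the kind of pointer a C API
    // such as H5Dwrite expects as its buffer argument.
    float* data() { return data_.data(); }

private:
    std::size_t ncols_;
    std::vector<float> data_;
};
```

Note that the row stride is the number of columns. Indexing rows with any other stride, as the question's Edit 2 does with i * n, reads or writes the wrong elements.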


Note: I removed the previous updates because the answer was getting too long. If you are interested, browse the edit history.
