Feeding .npy (NumPy files) into a TensorFlow data pipeline

2022-01-21 00:00:00 numpy tensorflow dataset data-pipeline

Problem description

TensorFlow seems to lack a reader for ".npy" files. How can I read my data files into the new tensorflow.data.Dataset pipeline? My data doesn't fit in memory.

Each object is saved in a separate ".npy" file. Each file contains 2 different ndarrays as features and a scalar as their label.

Solution

Does your data fit into memory? If so, you can follow the instructions from the Consuming NumPy Arrays section of the docs:

Consuming NumPy arrays

If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().

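import numpy as np
import tensorflow as tf

# Load the training data into two NumPy arrays, for example using `np.load()`.
# Keyed access inside a `with` block needs the NpzFile that np.load() returns
# for an .npz archive, so the file is assumed to be an .npz archive here.
with np.load("/var/data/training_data.npz") as data:
  features = data["features"]
  labels = data["labels"]

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

dataset = tf.data.Dataset.from_tensor_slices((features, labels))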

If the data doesn't fit into memory, the only recommended approach seems to be to convert the npy data to the TFRecord format first and then read it with a TFRecord dataset, which can be streamed without loading everything into memory.
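As a rough illustration of that conversion, the sketch below writes one tf.train.Example per .npy file and reads the result back with tf.data.TFRecordDataset. The file layout is an assumption, not something the question specifies: each file is taken to hold a pickled dict with the hypothetical keys "feat_a" and "feat_b" (arrays) and "label" (an integer scalar). The function names are made up for this example.

```python
import numpy as np
import tensorflow as tf

FEATURE_KEYS = ("feat_a", "feat_b")  # hypothetical key names


def _tensor_feature(array):
    """Serialize an array to bytes and wrap it as a bytes feature."""
    serialized = tf.io.serialize_tensor(array).numpy()
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[serialized]))


def npy_files_to_tfrecord(npy_paths, tfrecord_path):
    """Write one Example per .npy file; only one file is in memory at a time."""
    with tf.io.TFRecordWriter(tfrecord_path) as writer:
        for path in npy_paths:
            # Assumed layout: a dict saved with np.save(). allow_pickle=True
            # should only be used on files you trust.
            record = np.load(path, allow_pickle=True).item()
            feature = {
                key: _tensor_feature(record[key].astype(np.float32))
                for key in FEATURE_KEYS
            }
            feature["label"] = tf.train.Feature(
                int64_list=tf.train.Int64List(value=[int(record["label"])])
            )
            example = tf.train.Example(
                features=tf.train.Features(feature=feature)
            )
            writer.write(example.SerializeToString())


def _parse_example(serialized):
    """Turn one serialized Example back into ((feat_a, feat_b), label)."""
    spec = {key: tf.io.FixedLenFeature([], tf.string) for key in FEATURE_KEYS}
    spec["label"] = tf.io.FixedLenFeature([], tf.int64)
    parsed = tf.io.parse_single_example(serialized, spec)
    features = tuple(
        tf.io.parse_tensor(parsed[key], out_type=tf.float32)
        for key in FEATURE_KEYS
    )
    return features, parsed["label"]


def load_tfrecord_dataset(tfrecord_path):
    """Stream the records from disk instead of loading them all at once."""
    return tf.data.TFRecordDataset(tfrecord_path).map(_parse_example)
```

Tensors restored with tf.io.parse_tensor carry no static shape, so a tf.ensure_shape call may be needed before layers that require fixed-size inputs.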

Here is a post with some instructions.

FWIW, it seems crazy to me that TFRecord cannot be instantiated with a directory name or file name(s) of npy files directly, but it appears to be a limitation of plain Tensorflow.

If you can split the single large npy file into smaller files that each roughly represent one batch for training, then you could write a custom data generator in Keras that would yield only the data needed for the current batch.
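A minimal sketch of such a generator, built on tf.keras.utils.Sequence, might look like the following. It assumes each batch has already been saved as a pair of files, one with features and one with labels. The class name and the pairing scheme are hypothetical.

```python
import numpy as np
import tensorflow as tf


class NpyBatchSequence(tf.keras.utils.Sequence):
    """Serves one (features, labels) batch per pair of .npy files."""

    def __init__(self, feature_paths, label_paths):
        super().__init__()
        if len(feature_paths) != len(label_paths):
            raise ValueError("Need exactly one label file per feature file.")
        self.feature_paths = list(feature_paths)
        self.label_paths = list(label_paths)

    def __len__(self):
        # Number of batches per epoch: one per file pair.
        return len(self.feature_paths)

    def __getitem__(self, idx):
        # Only the files for the requested batch are read from disk.
        features = np.load(self.feature_paths[idx])
        labels = np.load(self.label_paths[idx])
        return features, labels
```

An instance can be passed directly to model.fit(), which then requests batches by index, so memory use stays at roughly one batch at a time.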

In general, if your dataset cannot fit in memory, storing it as one single large npy file makes it very hard to work with. It is better to reformat the data first, either as TFRecord or as multiple npy files, and then use one of the methods above.
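For the "multiple npy files" route, one possible sketch is to memory-map the large file and write it out in slices, so the whole array is never held in memory at once. The function name and the shard naming scheme are hypothetical, and the approach assumes a plain numeric array whose first axis indexes the samples.

```python
import os

import numpy as np


def split_npy(src_path, out_dir, rows_per_shard):
    """Split a large .npy array along axis 0 into smaller .npy files."""
    os.makedirs(out_dir, exist_ok=True)
    # mmap_mode="r" maps the file read-only instead of loading it whole.
    data = np.load(src_path, mmap_mode="r")
    shard_paths = []
    starts = range(0, data.shape[0], rows_per_shard)
    for shard_idx, start in enumerate(starts):
        path = os.path.join(out_dir, "shard_%05d.npy" % shard_idx)
        # Saving a slice touches only the rows that belong to this shard.
        np.save(path, data[start:start + rows_per_shard])
        shard_paths.append(path)
    return shard_paths
```

Choosing rows_per_shard equal to the training batch size gives files that map one-to-one onto batches.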
