In TensorFlow 2.0, how do I see the number of elements in a dataset?

Problem description

When I load a dataset, I wonder if there is any quick way to find the number of samples or batches in that dataset. I know that if I load a dataset with with_info=True, I can see, for example, total_num_examples=6000, but this information is not available if I split the dataset.
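
For reference, here is a minimal sketch of the with_info route mentioned above, using the standard tfds.load API and the CIFAR-10 dataset from the code below (the printed count is for the stock CIFAR-10 training split):

import tensorflow_datasets as tfds

# Loading with with_info=True also returns a DatasetInfo object that
# carries per-split example counts.
dataset, info = tfds.load("cifar10", split="train", with_info=True)
print(info.splits["train"].num_examples)  # 50000 for the standard CIFAR-10 train split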

Currently, I count the number of samples as follows, but I'm wondering if there is a better solution:

import tensorflow_datasets as tfds

# Split the TRAIN split into three equal sub-splits (tfds.Split.subsplit only
# exists in older tensorflow_datasets releases, before 3.0).
train_subsplit_1, train_subsplit_2, train_subsplit_3 = tfds.Split.TRAIN.subsplit(3)

cifar10_trainsub3 = tfds.load("cifar10", split=train_subsplit_3)
cifar10_trainsub3 = cifar10_trainsub3.batch(1000)

# Walk over every batch and accumulate the number of images seen.
n = 0
for i, batch in enumerate(cifar10_trainsub3.take(-1)):
    print(i, n, batch['image'].shape)
    n += len(batch['image'])

print(i, n)


Solution

If it's possible to know the length, then you can use:

tf.data.experimental.cardinality(dataset)

but the problem is that a TF dataset is inherently lazily loaded, so we might not know the size of the dataset up front. Indeed, it's perfectly possible for a dataset to represent an infinite set of data!
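
As a quick illustration of both points, here is a minimal sketch on a toy in-memory dataset (not the CIFAR-10 split above): cardinality returns the length when it is statically known, and the sentinel values UNKNOWN_CARDINALITY or INFINITE_CARDINALITY otherwise:

import tensorflow as tf

finite = tf.data.Dataset.range(42)
print(tf.data.experimental.cardinality(finite))    # 42

# A filter makes the size impossible to know without iterating.
filtered = finite.filter(lambda x: x % 2 == 0)
print(tf.data.experimental.cardinality(filtered))  # -2, i.e. tf.data.experimental.UNKNOWN_CARDINALITY

# Repeating without a count yields an infinite dataset.
infinite = finite.repeat()
print(tf.data.experimental.cardinality(infinite))  # -1, i.e. tf.data.experimental.INFINITE_CARDINALITY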

If it is a small enough dataset, you could also just iterate over it to get the length. I've used the following ugly little construct before, but it depends on the dataset being small enough that we're happy to load it into memory, and it's really not an improvement on your for loop above!

dataset_length = [i for i,_ in enumerate(dataset)][-1] + 1
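
If you prefer to stay inside the tf.data API, counting by reduction is an equivalent sketch with the same caveat: it still iterates the whole dataset, so it only terminates for finite datasets. The toy dataset here is just for illustration; substitute your own.

import tensorflow as tf

# Assumed toy dataset for illustration.
dataset = tf.data.Dataset.range(100)

# Count elements by folding over the dataset, ignoring the element values.
dataset_length = dataset.reduce(tf.constant(0, dtype=tf.int64), lambda count, _: count + 1)
print(int(dataset_length))  # 100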
