Randomly select x files in subdirectories
Problem description
I need to randomly pick exactly 10 files (images) from a dataset, but the dataset is hierarchically structured.
So, for each subdirectory that contains images, I need to keep just 10 of them, chosen at random. Is there an easy way to do that, or should I do it manually?
import os
import random

def getListOfFiles(dirName):
    # Create a list of the file and subdirectory names in the given directory
    listOfFile = os.listdir(dirName)
    allFiles = list()
    # Iterate over all the entries
    for entry in listOfFile:
        # Create the full path
        fullPath = os.path.join(dirName, entry)
        # If the entry is a directory, get the list of files in that directory
        if os.path.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath)
        else:
            # Note: random.sample() receives the path string here, not a list of files
            allFiles.append(random.sample(fullPath, 10))
    return allFiles

dirName = 'C:/Users/bla/bla'
# Get the list of all files in the directory tree at the given path
listOfFiles = getListOfFiles(dirName)
with open("elements.txt", mode='x') as f:
    for elem in listOfFiles:
        f.write(elem + '\n')
Solution
A good way to sample from a directory listing of unknown size is reservoir sampling. With this approach you don't have to list all the files in the directory up front; you read the entries one by one and sample as you go. It also works when you have to sample a fixed number of files across multiple directories.
It is best to use generator-based directory scanning code, which yields one file at a time, so you don't use gobs of memory up front to hold all the file names.
Something along these lines (NB: untested code!):
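import os

import numpy as np

def ResSampleFiles(dirname, N):
    """Pick up to N files from a directory using reservoir sampling."""
    sampled_files = list()
    k = 0  # number of files seen so far
    for item in os.scandir(dirname):
        if item.is_dir():
            continue  # skip subdirectories
        full_path = os.path.join(dirname, item.name)
        if k < N:
            # Fill the reservoir with the first N files
            sampled_files.append(full_path)
        else:
            # Keep the new file with probability N / (k + 1)
            idx = np.random.randint(0, k + 1)  # upper bound is exclusive
            if idx < N:
                sampled_files[idx] = full_path
        k += 1
    return sampled_files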
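
To get 10 files from every subdirectory, as the question asks, the same idea can be applied once per directory while walking the tree. The following is only a sketch under some assumptions: the names `reservoir_sample`, `iter_files` and `sample_per_directory` are made up for illustration, the standard-library `random` module stands in for NumPy, and no filtering by image extension is done.

```python
import os
import random


def reservoir_sample(items, n, rng=random):
    """Return up to n items chosen uniformly at random from any iterable."""
    reservoir = []
    for k, item in enumerate(items):
        if k < n:
            # The first n items fill the reservoir
            reservoir.append(item)
        else:
            # Uniform integer in [0, k]; the new item is kept if it lands in a slot
            j = rng.randrange(k + 1)
            if j < n:
                reservoir[j] = item
    return reservoir


def iter_files(dirname):
    """Yield the full path of each file directly inside dirname, one at a time."""
    with os.scandir(dirname) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.path


def sample_per_directory(root, n):
    """Map every directory under root that holds files to a sample of up to n of them."""
    samples = {}
    for dirpath, _dirnames, _filenames in os.walk(root):
        picked = reservoir_sample(iter_files(dirpath), n)
        if picked:  # directories without files are left out
            samples[dirpath] = picked
    return samples
```

Calling `sample_per_directory('C:/Users/bla/bla', 10)` would then return a dictionary keyed by directory path. A directory with fewer than 10 files simply contributes all of them. Because `reservoir_sample` accepts any iterable, passing it a single generator that chains the files of several directories would sample a fixed number of files across all of them instead.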