Creating a TimeseriesGenerator with multiple inputs
Problem description
I'm trying to train an LSTM model on daily fundamental and price data from ~4000 stocks; due to memory limits, I cannot hold everything in memory after converting the data to sequences for the model.
This leads me to using a generator instead, such as the TimeseriesGenerator from Keras / TensorFlow. The problem is that if I use the generator on all of my data stacked together, it creates sequences that mix stocks: with a sequence length of 5, for example, Sequence 3 would include the last 4 observations of "stock 1" and the first observation of "stock 2".
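To make the boundary problem concrete, here is a minimal sketch (the toy arrays are made up purely for illustration) showing a TimeseriesGenerator window that crosses from one stock into the next:

import numpy as np
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

# Toy data (assumed for illustration): two stocks with 5 daily rows each,
# stacked into a single array as described above
stock_1 = np.zeros((5, 2))  # stand-in features for "stock 1"
stock_2 = np.ones((5, 2))   # stand-in features for "stock 2"
stacked = np.vstack([stock_1, stock_2])
targets = np.zeros(len(stacked))

gen = TimeseriesGenerator(stacked, targets, length=5, batch_size=1)
x, _ = gen[1]
# x contains rows 1-5: four observations from "stock 1" followed by
# the first observation of "stock 2", i.e. a mixed sequence
print(x)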
Instead, what I would want is for every sequence to be drawn from a single stock only.
Slightly similar question: Merge or append multiple Keras TimeseriesGenerator objects into one
I explored the option of combining generators as this SO answer suggests: How do I combine two keras generator functions; however, this is not ideal in the case of ~4000 generators.
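(For reference, the combining approach boils down to a round-robin merge, roughly like the sketch below; this is my paraphrase, not the linked code, and constructing and juggling ~4000 such generators is what makes it impractical here.)

def combine(generators):
    # Round-robin over a collection of generators; with ~4000 per-stock
    # generators this pattern becomes unwieldy
    while True:
        for g in generators:
            yield next(g)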
I hope my question makes sense.
Solution
So what I've ended up doing is to do all the preprocessing manually, saving an .npy file for each stock containing its preprocessed sequences. Then, using a manually created generator, I make batches like this:
import numpy as np
import tensorflow as tf

class seq_generator():
    def __init__(self, list_of_filepaths):
        # Track which sequence indices have already been served per file
        self.usedDict = dict()
        for path in list_of_filepaths:
            self.usedDict[path] = []

    def generate(self):
        while True:
            # Pick a random stock file and load its precomputed sequences
            path = np.random.choice(list(self.usedDict.keys()))
            stock_array = np.load(path)
            # Pick a random sequence index; yield it only if unused so far
            random_sequence = np.random.randint(stock_array.shape[0])
            if random_sequence not in self.usedDict[path]:
                self.usedDict[path].append(random_sequence)
                yield stock_array[random_sequence, :, :]

train_generator = seq_generator(list_of_filepaths)
# from_generator expects a callable; since the generator yields a single
# (n_timesteps, n_features) tensor, a single output type/shape is declared
train_dataset = tf.data.Dataset.from_generator(train_generator.generate,
                                               output_types=tf.float32,
                                               output_shapes=(n_timesteps, n_features))
train_dataset = train_dataset.batch(batch_size)
Where list_of_filepaths is simply a list of paths to the preprocessed .npy data.
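The post doesn't show the preprocessing itself, but a minimal sketch of how such a per-stock file might be produced could look like the following (the sliding-window scheme, function name, and shapes are my assumptions):

import numpy as np

def save_stock_sequences(values, n_timesteps, out_path):
    # values: (n_days, n_features) array for a single stock
    # Stack overlapping windows into (n_sequences, n_timesteps, n_features)
    windows = np.stack([values[i:i + n_timesteps]
                        for i in range(len(values) - n_timesteps + 1)])
    np.save(out_path, windows)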
The generator above will:
- Load a random stock's preprocessed .npy data
- Pick a sequence at random
- Check whether the index of that sequence has already been used in usedDict
- If not:
  - Append the index of that sequence to usedDict, to keep track and avoid feeding the same data to the model twice
  - Yield the sequence
This means that the generator will feed a single unique sequence from a random stock on each "call", enabling me to use the .from_generator() and .batch() methods from TensorFlow's Dataset type.
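As a quick sanity check, the resulting dataset can be iterated directly (a sketch; the actual shape values depend on your n_timesteps, n_features, and batch_size):

for batch in train_dataset.take(1):
    # Each batch has shape (batch_size, n_timesteps, n_features)
    print(batch.shape)

Note that as written the generator yields only input windows; for supervised training, the .npy files would also need to carry targets, with the generator yielding (x, y) tuples and output_types/output_shapes declared as matching pairs.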