Storing time series data efficiently: MySQL or flat files? Many tables (or files), or one table queried with a WHERE condition?

2022-01-11 00:00:00 time-series mysql

What's the best way to store time series data from thousands (possibly soon millions) of real-world hardware sensors? The sensors themselves differ: some capture just one variable, some up to a dozen. I need to store these values every hour, and I don't want to delete data older than x, i.e. the data will just keep growing.

Currently, I use a MySQL database to store these time series (it also serves a web frontend that shows nice time series graphs for every sensor). I have one table per sensor, which right now comes to about 11000 tables in total. Each table has a layout like "timestamp, value1, [value2] ... ".

The database mostly serves selects (every time somebody looks at the graphs) rather than inserts/updates (once an hour). The select query for showing a graph is simply "SELECT * FROM $sensor_id ORDER BY timestamp", so getting the data out with my select statements is pretty simple and efficient.

However, having that many tables already causes problems when backing up the database, because I run into LOCK limits (e.g. "mysqldump: Got error: 23: Out of resources when opening file './database/table_xyz.MYD' (Errcode: 24) when using LOCK TABLES"). I can get around that error, but obviously that got me thinking...

So, the real question, broken down into sub-questions:

  • How bad is my approach of having one table for every sensor? What if, instead of a few thousand tables, I had a few million (I might have to deal with that many sensors in the near future)?
  • Would storing all sensors' data in one combined table, with an extra column that holds the sensor_id, be a better approach, even though it would probably slow down my select statement by a lot (SELECT * FROM all_sensors WHERE sensor_id='$sensor_id')? Keep in mind that different sensors measure different things, so this table would have a few dozen columns, instead of the one to a few columns each table has when every sensor has its own table.
  • I also thought about storing the time series data NOT in MySQL, but in flat (CSV) files instead. The graphing library I use for the frontend (dygraphs) deals fine with CSV files (plus it would give me the option of making these files available for download, which would be a bonus but is not a requirement currently). I would still need the database for other frontend-related things, but it would mean having a few dozen tables instead of 11000 (or even more if we add more sensors).
  • If I create one file for every sensor, I would probably run into filesystem limits eventually (this is an ext3 partition, so there's the ~32k files per directory limit). So the same question as above applies here: should I then store everything in one large file that holds all sensors' data? That would probably slow down my reads even more, as the graphing library would need to read a much, much bigger file into memory every time someone looks at a graph.

What would you do?

Thanks!

Recommended answer

To answer this question, we must first analyse the real issue you're facing.

The real issue is finding the most efficient combination of writing and retrieving data.

Let's review your conclusions:

  • Thousands of tables - well, that violates the purpose of databases and makes them harder to work with. You also gain nothing. There is still disk seeking involved, this time with many file descriptors in use. You also have to know the table names, and there are thousands of them. It's also difficult to extract data across tables, which is what databases are for - structuring the data in such a way that you can easily cross-reference the records. Thousands of tables are not efficient from a performance point of view, and not efficient from a usability point of view. Bad choice.

  • A CSV file - it is probably excellent for fetching the data if you need the entire contents at once. But it's nowhere near good for manipulating or transforming the data. Given that you rely on a specific layout, you have to be extra careful while writing to the CSV. If this grows to thousands of CSV files, you haven't done yourself a favor. You have removed all the overhead of SQL (which isn't that big), but you have done nothing for retrieving parts of the data set. You also have problems fetching historic data or cross-referencing anything. Bad choice.

The ideal scenario is being able to access any part of the data set efficiently and quickly, without any kind of structure change.

And this is exactly the reason why we use relational databases and why we dedicate entire servers with a lot of RAM to those databases.

In your case, you are using MyISAM tables (as the .MYD file extension in your error message shows). It's an old storage format that worked great on the low-end hardware that was used back in the day. But these days, we have excellent and fast computers. That's why we use InnoDB and allow it to use a lot of RAM, so that I/O costs are reduced. The variable that controls this is called innodb_buffer_pool_size - googling that will produce meaningful results.
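
As a rough sketch of how that variable can be inspected and changed: the 6 GiB figure below is only an example value, not a recommendation for this particular server, and the runtime resize assumes MySQL 5.7.5 or later. On older versions the value has to be set in the server configuration file, followed by a restart.

```sql
-- Show the current InnoDB buffer pool size (reported in bytes).
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Example value only: 6 GiB = 6 * 1024 * 1024 * 1024 bytes.
-- Resizing at runtime assumes MySQL 5.7.5 or later.
SET GLOBAL innodb_buffer_pool_size = 6442450944;
```

The buffer pool should be sized so that the frequently read part of the data and its indexes fit in memory alongside everything else running on the machine.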

To answer the question - an efficient, satisfactory solution would be to use one table where you store sensor information (id, title, description) and another table where you store sensor readings. You allocate sufficient RAM or sufficiently fast storage (an SSD). The tables would look like this:

CREATE TABLE sensors ( 
    id int unsigned not null auto_increment,
    sensor_title varchar(255) not null,
    description varchar(255) not null,
    date_created datetime,
    PRIMARY KEY(id)
) ENGINE = InnoDB DEFAULT CHARSET = UTF8;

CREATE TABLE sensor_readings (
    id int unsigned not null auto_increment,
    sensor_id int unsigned not null,
    date_created datetime,
    reading_value varchar(255), -- note: this column's value might vary, I do not know what data type you need to hold value(s)
    PRIMARY KEY(id),
    FOREIGN KEY (sensor_id) REFERENCES sensors (id) ON DELETE CASCADE
) ENGINE = InnoDB DEFAULT CHARSET = UTF8;
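
With this layout, the per-sensor query from the question becomes a lookup with a WHERE condition. The following is a minimal sketch: the index name idx_sensor_time and the sensor id 42 are placeholders, and the composite index is an addition that is not part of the schema above. It is meant to let InnoDB find one sensor's rows and return them in time order without a separate sort.

```sql
-- Hypothetical index name; serves "all readings of one sensor, ordered by time".
ALTER TABLE sensor_readings
    ADD INDEX idx_sensor_time (sensor_id, date_created);

-- Counterpart of the per-table "SELECT * FROM $sensor_id ORDER BY timestamp".
-- 42 is a placeholder id; bind it as a parameter in application code.
SELECT date_created, reading_value
FROM sensor_readings
WHERE sensor_id = 42
ORDER BY date_created;

-- Only part of the series: one week of readings (placeholder dates).
SELECT date_created, reading_value
FROM sensor_readings
WHERE sensor_id = 42
  AND date_created >= '2022-01-01 00:00:00'
  AND date_created <  '2022-01-08 00:00:00'
ORDER BY date_created;
```

Running EXPLAIN on either SELECT shows whether the index is actually being used.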

InnoDB can store the entire database/installation in one shared tablespace file. That was the default in older MySQL versions; since MySQL 5.6.6 the default is one file per table, which is harmless here because this design has only two tables. Either way, this alleviates the problem of exceeding the file descriptor limit of the OS / filesystem. Several million, or even tens of millions, of records should not be a problem if you allocate 5-6 GB of RAM to hold the working data set in memory - that would give you quick access to the data.
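
To put rough numbers on that, assuming one row per sensor per hour: about 11,000 sensors produce 11,000 × 24 × 365 = 96,360,000 rows per year. The sketch below is one way to compare the actual table size against the RAM given to InnoDB. For InnoDB tables the row count reported here is an estimate, not an exact figure.

```sql
-- Approximate row count and on-disk size (data plus indexes) of the two tables
-- in the currently selected database.
SELECT table_name,
       table_rows,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name IN ('sensors', 'sensor_readings');
```

If size_mb grows well beyond the buffer pool, only the recently read portion stays cached and reads of older data go to disk.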

If I were to design such a system, this is the first approach I would take (personally). From there on, it's easy to adjust depending on what you need to do with that information.
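
One possible adjustment, for sensors that report several variables, is to store one row per sensor, timestamp and variable instead of adding a column per variable. This is only a sketch under assumptions: the table name sensor_readings_long, the variable_name column and the numeric reading_value type are hypothetical and would need to match the actual data.

```sql
-- Hypothetical variant: one row per (sensor, time, measured variable).
CREATE TABLE sensor_readings_long (
    sensor_id int unsigned not null,
    date_created datetime not null,
    variable_name varchar(32) not null,  -- e.g. 'temperature' (assumed naming)
    reading_value double,                -- assumes all readings are numeric
    PRIMARY KEY (sensor_id, date_created, variable_name),
    FOREIGN KEY (sensor_id) REFERENCES sensors (id) ON DELETE CASCADE
) ENGINE = InnoDB DEFAULT CHARSET = UTF8;

-- All variables of one sensor in time order (42 is a placeholder id);
-- the application can pivot these rows into one graph column per variable.
SELECT date_created, variable_name, reading_value
FROM sensor_readings_long
WHERE sensor_id = 42
ORDER BY date_created, variable_name;
```

Because the primary key starts with sensor_id and date_created, the rows of one sensor are stored together in time order, which matches the graphing query.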
