Cassandra partition key for time series data
I'm testing Cassandra as a time series database.
I created the data model below:
CREATE KEYSPACE sm WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': 1
};
USE sm;
CREATE TABLE newdata (timestamp timestamp,
deviceid int, tagid int,
decvalue decimal,
alphavalue text,
PRIMARY KEY (deviceid,tagid,timestamp));
In the primary key, I set deviceid as the partition key, which means all data with the same device ID will be written to one node. (Does that mean one machine or one partition? Each partition can have a maximum of 2 billion rows.) Also, if I query data within the same node, the retrieval will be fast, am I correct? I'm new to Cassandra and a bit confused about the partition key and clustering key.
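Most of my queries will be as follows:
- Select the latest timestamp for a known deviceid and tagid
- Select decvalue for a known deviceid, tagid and timestamp
- Select alphavalue for a known deviceid, tagid and timestamp
- Select * for a known deviceid and tagid with a time range
- Select * for a known deviceid with a time range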
I will have around 2000 deviceids, and each deviceid will have 60 tagid/value pairs. I'm not sure if it will be a wide row of deviceid, timestamp, tagid/value, tagid/value, ...
Recommended answer
I'm new to Cassandra and a bit confused about the partition key and clustering key.
It sounds like you understand partition keys, so I'll just add that your partition key helps Cassandra figure out where in the cluster (which token range) to store your data. Each node is responsible for several primary token ranges (assuming vnodes). When your data is written to a partition, it is sorted by your clustering keys. This is also how it is stored on disk, so remember that your clustering keys determine the order in which your data is stored on disk.
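To make the effect of clustering order concrete, here is a minimal Python sketch that imitates (it does not use Cassandra) how rows inside one partition are ordered under the question's PRIMARY KEY (deviceid, tagid, timestamp). The sample readings and the helper names `positions` and `is_contiguous` are made up for illustration. It shows that a time range with a fixed tagid maps to one contiguous slice of the partition, while a time range without tagid does not.

```python
from datetime import datetime, timedelta

# Hypothetical readings for a single deviceid: (tagid, timestamp) pairs.
start = datetime(2016, 3, 1)
rows = [(tagid, start + timedelta(hours=h)) for h in range(4) for tagid in (1, 2, 3)]

# Within a partition, rows are ordered by the clustering columns in the
# order they are declared, so (tagid, timestamp) sorts by tagid first.
partition = sorted(rows)

def positions(predicate):
    """Return the indexes of the rows in the sorted partition that match."""
    return [i for i, row in enumerate(partition) if predicate(row)]

def is_contiguous(indexes):
    """True if the indexes form one unbroken run."""
    return indexes == list(range(indexes[0], indexes[-1] + 1))

lo, hi = start + timedelta(hours=1), start + timedelta(hours=3)

# tagid fixed, timestamp range: one contiguous slice of the partition.
with_tag = positions(lambda r: r[0] == 2 and lo <= r[1] < hi)
# timestamp range only: matches are scattered across every tagid.
without_tag = positions(lambda r: lo <= r[1] < hi)

print(is_contiguous(with_tag))     # True
print(is_contiguous(without_tag))  # False
```

Queries that read one contiguous slice are the ones a clustering order serves well, which is why the order of the clustering columns should follow the query patterns.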
Each partition can have max 2 billion rows
That's not exactly true. Each partition can support up to 2 billion cells. A cell is essentially a column name/value pair, and your clustering keys together count as a single cell. So compute your cells by counting the column values that you store for each CQL row, and add one more if you use clustering columns.
Depending on your wide row structure, your limit will probably be far fewer than 2 billion rows. Additionally, that's just the storage limitation. Even if you managed to store 1 million CQL rows in a single partition, querying that partition would return so much data that it would be unwieldy and would probably time out.
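As a rough worked example of that counting rule, the sketch below estimates how many CQL rows fit under the cell limit for the question's table and how quickly one device's partition would reach it. The sampling rate is an assumption (one reading per tag per second), since the question does not state one, and the estimate assumes both decvalue and alphavalue are populated on every row.

```python
MAX_CELLS_PER_PARTITION = 2_000_000_000  # limit quoted in the answer

# Counting rule from the answer: one cell per stored value column
# (decvalue, alphavalue) plus one for the clustering columns.
VALUE_COLUMNS = 2
CELLS_PER_ROW = VALUE_COLUMNS + 1

TAGS_PER_DEVICE = 60               # from the question
READINGS_PER_TAG_PER_DAY = 86_400  # ASSUMPTION: one reading per tag per second

def days_until_limit(cells_per_row, rows_per_day, limit=MAX_CELLS_PER_PARTITION):
    """Whole days a partition can absorb writes before exceeding the limit."""
    return limit // (cells_per_row * rows_per_day)

max_rows = MAX_CELLS_PER_PARTITION // CELLS_PER_ROW
rows_per_day = TAGS_PER_DEVICE * READINGS_PER_TAG_PER_DAY

print(max_rows)                                       # 666666666
print(days_until_limit(CELLS_PER_ROW, rows_per_day))  # 128
```

Under these assumed numbers a partition keyed only by deviceid would hit the storage limit in about four months, and it would become slow to query well before that. A lower sampling rate stretches the timeline but does not remove the problem.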
if I query data within the same node, the retrieval will be fast, am I correct?
It'll at least be faster than multi-key queries that hit multiple nodes. But whether or not it will be "fast" depends on other things, like how wide your rows are, and how often you do things like deletes and in-place updates.
Most of my queries will be as follows:
- Select the latest timestamp for a known deviceid and tagid
- Select decvalue for a known deviceid, tagid and timestamp
- Select alphavalue for a known deviceid, tagid and timestamp
- Select * for a known deviceid and tagid with a time range
- Select * for a known deviceid with a time range
Your current data model can support all of those queries, except for the last one. Because tagid comes before timestamp in the clustering order, a range query on timestamp that does not also restrict tagid is not supported. To run it, you'll need to duplicate your data into a new table, and build a PRIMARY KEY to support that query pattern. This is called "query-based modeling." I would build a query table like this:
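CREATE TABLE newdata_by_deviceid_and_time (
timestamp timestamp,
deviceid int,
tagid int,
decvalue decimal,
alphavalue text,
-- tagid is the last clustering column so that readings for different
-- tags taken at the same timestamp remain distinct rows.
PRIMARY KEY (deviceid,timestamp,tagid));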
That table can support a range query on timestamp, while partitioning on deviceid.
But the biggest problem I see with either of these models is "unbounded row growth." Basically, as you collect more and more values for your devices, you will approach the limit of 2 billion cells per partition (and again, things will probably get slow well before that). What you need to do is use a modeling technique called "time bucketing."
For this example, say I determined that bucketing by month would keep me well under the 2 billion cell limit and allow the kind of date range flexibility I needed. If so, I would add an additional partition key, monthbucket, and my (new) table would look like this:
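CREATE TABLE newdata_by_deviceid_and_time (
timestamp timestamp,
deviceid int,
tagid int,
decvalue decimal,
alphavalue text,
monthbucket text,
-- (deviceid, monthbucket) is the composite partition key; tagid is the
-- last clustering column so tags sharing a timestamp stay distinct rows.
PRIMARY KEY ((deviceid,monthbucket),timestamp,tagid));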
Now, when I want to query data for a specific device and date range, I also specify the monthbucket:
-- deviceid is declared as int, so it takes an integer literal.
SELECT * FROM newdata_by_deviceid_and_time
WHERE deviceid=23 AND monthbucket='201603'
AND timestamp >= '2016-03-01 00:00:00-0500'
AND timestamp < '2016-03-16 00:00:00-0500';
Remember, monthbucket is just an example. For you, it may make more sense to bucket by quarter or even by year (assuming that you don't store too many values per deviceid in a year).
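Because the bucket is part of the partition key, every query has to name it, so a date range that crosses a month boundary needs one query per bucket. Below is a minimal client-side sketch of how the bucket values could be derived. The helper names `month_bucket` and `buckets_between` are hypothetical, and it assumes timestamps are naive datetimes in one consistent time zone and that buckets use the 'YYYYMM' text format from the example query.

```python
from datetime import datetime

def month_bucket(ts):
    """Format a timestamp as the 'YYYYMM' text used for monthbucket."""
    return ts.strftime("%Y%m")

def buckets_between(start, end):
    """List every monthbucket from start's month through end's month, inclusive."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}{month:02d}")
        # Step to the next calendar month, rolling over the year in December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets

print(month_bucket(datetime(2016, 3, 15)))
# 201603
print(buckets_between(datetime(2015, 11, 20), datetime(2016, 2, 5)))
# ['201511', '201512', '201601', '201602']
```

The application would compute `month_bucket` when writing each row, then run the range query once per value returned by `buckets_between` and merge the results. Larger buckets (quarter, year) mean fewer queries per range at the cost of larger partitions.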