Trending tag implementation based on MySQL JSON

2022-01-11 00:00:00 time-series json analytics mysql

I am trying to identify trending tags (based on maximum hits) over a time series using MySQL's JSON feature. Below is my table:

CREATE TABLE TAG_COUNTER (
    account       varchar(36) NOT NULL,
    time_id       INT NOT NULL,
    counters      JSON,
    PRIMARY KEY   (account, time_id)
);

In every web API request I receive multiple different tags per account, and based on the number of tags I prepare an INSERT ... ON DUPLICATE KEY UPDATE query. The example below shows an insertion with two tags.

INSERT INTO `TAG_COUNTER`
  (`account`, `time_id`, `counters`)
VALUES
  ('google', '2018061023', '{"tag1": 1, "tag2": 1}')
ON DUPLICATE KEY UPDATE `counters` =
  JSON_SET(`counters`,
           '$."tag1"',
           IFNULL(JSON_EXTRACT(`counters`,
                        '$."tag1"'), 0) + 1,
           '$."tag2"',
           IFNULL(JSON_EXTRACT(`counters`,
                        '$."tag2"'), 0) + 1
  );

time_id has the format yyyyMMddhh, so each row is an hourly aggregate.

Now my problem is the retrieval of trending tags. The query below gives me the aggregation for tag1, but we will not know the tag names before running the query.

SELECT
SUBSTRING(time_id, 1, 6) AS month,
SUM(counters->>'$.tag1')
FROM TAG_COUNTER
WHERE counters->>'$.tag1' > 0
GROUP BY month;

So I need a generic GROUP BY query, together with ORDER BY, to get the trending tags per hour/day/month.

The expected output sample is:

Time(hour/day/month)  Tag_name  Tag_count_value(total hits)

When I searched the web, every example used a structure like {"tag_name": "tag1", "tag_count": 1} instead of the direct {"tag1": 1}, and used tag_name in the GROUP BY.

Q1) So is it always mandatory to have a common, known JSON key in order to perform a GROUP BY?

Q2) If I have to go this way, how does my INSERT ... ON DUPLICATE KEY UPDATE query change for this new JSON label/value structure? The counter has to be created when it does not exist and incremented by one when it does.

Q3) Do I have to maintain an array of objects

[
 {"tag_name": "tag1", "tag_count": 2},
 {"tag_name": "tag2", "tag_count": 3}
]

OR an object like the one below?

{
 {"tag_name": "tag1", "tag_count": 2},
 {"tag_name": "tag2", "tag_count": 3}
}

So which of the JSON structures above is better in terms of INSERT and retrieval of trending counts?

Q4) Can I keep the existing {"key" : "value"} format instead of {"key_label" : key, "value_label" : "value"} and still extract trending tags? I think {"key" : "value"} is very straightforward and good performance-wise.

Q5) When retrieving, I use SUBSTRING(time_id, 1, 6) AS month. Will it be able to use an index?

OR do I need to create multiple columns such as time_hour (2018061023), time_day (20180610), and time_month (201806), and query the specific column?

OR can I use MySQL date-time functions? Will those use an index for faster retrieval?

Please help.

Recommended answer

I don't see a good reason to use JSON here. It's also not clear why you believe that a "NoSQL schema" within MySQL would do anything better.

What you probably need is something like this:

CREATE TABLE TAG_COUNTER (
    account       varchar(36) NOT NULL,
    time_id       INT NOT NULL,
    tag_name      varchar(50) NOT NULL,
    counter       INT UNSIGNED NOT NULL,
    PRIMARY KEY   (account, time_id, tag_name)
);

This will simplify your queries. The INSERT statement would look like this:

INSERT INTO TAG_COUNTER
  (account, time_id, tag_name, counter)
VALUES
  ('google', 2018061023, 'tag1', 1),
  ('google', 2018061023, 'tag2', 1)
ON DUPLICATE KEY UPDATE counter = counter + VALUES(counter);
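
As a side note, the `VALUES(counter)` form inside ON DUPLICATE KEY UPDATE is deprecated as of MySQL 8.0.20. On 8.0.19 or later, the same upsert can be sketched with a row alias instead; the alias name `new_row` is just a placeholder chosen for this example:

```sql
-- Same upsert as above, using a row alias (MySQL 8.0.19+) instead of VALUES().
-- `new_row` is an arbitrary alias for the row that would have been inserted.
INSERT INTO TAG_COUNTER
  (account, time_id, tag_name, counter)
VALUES
  ('google', 2018061023, 'tag1', 1),
  ('google', 2018061023, 'tag2', 1)
AS new_row
ON DUPLICATE KEY UPDATE counter = TAG_COUNTER.counter + new_row.counter;
```

On older servers the `VALUES(counter)` form remains the way to do it.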

The SELECT statement might be something like this:

SELECT
    SUBSTRING(time_id, 1, 6) AS month,
    tag_name,
    SUM(counter) AS counter_agg
FROM TAG_COUNTER
GROUP BY month, tag_name
ORDER BY month, counter_agg DESC;
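
Regarding Q5: an expression such as SUBSTRING(time_id, 1, 6) in a WHERE clause generally prevents MySQL from using an ordinary index on time_id for that condition, whereas a plain range on the column can use the primary key. A minimal sketch of a daily roll-up for June 2018, assuming the account is known and that integer division is acceptable for bucketing (the alias `day_id` is made up for this example):

```sql
-- Daily trending tags for one account in June 2018.
-- The range on time_id is applied to the bare column, so the
-- (account, time_id, tag_name) primary key can be used for the lookup.
SELECT
    time_id DIV 100 AS day_id,      -- yyyyMMddhh -> yyyyMMdd
    tag_name,
    SUM(counter) AS counter_agg
FROM TAG_COUNTER
WHERE account = 'google'
  AND time_id BETWEEN 2018060100 AND 2018063023
GROUP BY day_id, tag_name
ORDER BY day_id, counter_agg DESC;
```

Using `time_id DIV 10000` instead gives the monthly bucket (yyyyMM), and grouping by `time_id` itself gives the hourly one, so separate hour/day/month columns are not strictly needed for this kind of query.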

Note that I didn't try to optimize the table/schema for data size and performance. That would be a different question. But you can see that the queries are much simpler now.
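
If the original JSON column has to stay for some reason, the {"tag1": 1} layout can still be grouped without knowing the keys in advance, at least on MySQL 8.0, where JSON_TABLE is available. The following is only a sketch under that assumption: it expands the keys of each row with JSON_KEYS and looks up each value by a path built at run time. It assumes tag names contain no double quotes, and it has to read and unpack every row, so it will not perform like the normalized table.

```sql
-- Monthly totals per tag from the original JSON layout (MySQL 8.0+).
-- JSON_KEYS returns the tag names of a row as a JSON array;
-- JSON_TABLE turns that array into one row per tag name.
SELECT
    t.time_id DIV 10000 AS month,   -- yyyyMMddhh -> yyyyMM
    k.tag_name,
    SUM(CAST(JSON_EXTRACT(t.counters,
                          CONCAT('$."', k.tag_name, '"')) AS UNSIGNED)) AS tag_count
FROM TAG_COUNTER AS t
CROSS JOIN JSON_TABLE(
    JSON_KEYS(t.counters),
    '$[*]' COLUMNS (tag_name VARCHAR(50) PATH '$')
) AS k
GROUP BY month, k.tag_name
ORDER BY month, tag_count DESC;
```

So a common, known key is not strictly mandatory for GROUP BY, but the query is noticeably harder to read than the one against the normalized table.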
