Strange duplicate behavior from GROUP_CONCAT of two LEFT JOINs of GROUP_BYs

2021-11-20 00:00:00 sql left-join group-by mysql group-concat

Here are all my table structures and the query (please focus on the last query, appended below). As you can see in the fiddle, this is the current output:

+---------+-----------+-------+------------+--------------+
| user_id | user_name | score | reputation | top_two_tags |
+---------+-----------+-------+------------+--------------+
| 1       | Jack      | 0     | 18         | css,mysql    |
| 4       | James     | 1     | 5          | html         |
| 2       | Peter     | 0     | 0          | null         |
| 3       | Ali       | 0     | 0          | null         |
+---------+-----------+-------+------------+--------------+

It's correct and all fine.


Now I have one more entity named "category". Each post can have only one category. I also want to get the top two categories for each user. Here is my new query. As you can see in the result, some duplicates appear:

+---------+-----------+-------+------------+--------------+------------------------+
| user_id | user_name | score | reputation | top_two_tags |   top_two_categories   |
+---------+-----------+-------+------------+--------------+------------------------+
| 1       | Jack      | 0     | 18         | css,css      | technology,technology  |
| 4       | James     | 1     | 5          | html         | political              |
| 2       | Peter     | 0     | 0          | null         | null                   |
| 3       | Ali       | 0     | 0          | null         | null                   |
+---------+-----------+-------+------------+--------------+------------------------+

See? css,css and technology,technology. Why are these duplicated? I've just added one more LEFT JOIN for categories, exactly like the one for tags. But it doesn't work as expected, and it even affects the tags.


Anyway, this is the expected result:

+---------+-----------+-------+------------+--------------+------------------------+
| user_id | user_name | score | reputation | top_two_tags |        category        |
+---------+-----------+-------+------------+--------------+------------------------+
| 1       | Jack      | 0     | 18         | css,mysql    | technology,social      |
| 4       | James     | 1     | 5          | html         | political              |
| 2       | Peter     | 0     | 0          | null         | null                   |
| 3       | Ali       | 0     | 0          | null         | null                   |
+---------+-----------+-------+------------+--------------+------------------------+

Does anybody know how I can achieve that?


CREATE TABLE users(id integer PRIMARY KEY, user_name varchar(5));
CREATE TABLE tags(id integer NOT NULL PRIMARY KEY, tag varchar(5));
CREATE TABLE reputations(
    id  integer PRIMARY KEY, 
    post_id  integer /* REFERENCES posts(id) */, 
    user_id integer REFERENCES users(id), 
    score integer, 
    reputation integer, 
    date_time integer);
CREATE TABLE post_tag(
    post_id integer /* REFERENCES posts(id) */, 
    tag_id integer REFERENCES tags(id),
    PRIMARY KEY (post_id, tag_id));
CREATE TABLE categories(id INTEGER NOT NULL PRIMARY KEY, category varchar(10) NOT NULL);
CREATE TABLE post_category(
    post_id INTEGER NOT NULL /* REFERENCES posts(id) */, 
    category_id INTEGER NOT NULL REFERENCES categories(id),
    PRIMARY KEY(post_id, category_id)) ;

SELECT
    q1.user_id, q1.user_name, q1.score, q1.reputation, 
    substring_index(group_concat(q2.tag  ORDER BY q2.tag_reputation DESC SEPARATOR ','), ',', 2) AS top_two_tags,
    substring_index(group_concat(q3.category  ORDER BY q3.category_reputation DESC SEPARATOR ','), ',', 2) AS category
FROM
    (SELECT 
        u.id AS user_Id, 
        u.user_name,
        coalesce(sum(r.score), 0) as score,
        coalesce(sum(r.reputation), 0) as reputation
    FROM 
        users u
        LEFT JOIN reputations r 
            ON    r.user_id = u.id 
              AND r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
    GROUP BY 
        u.id, u.user_name
    ) AS q1
    LEFT JOIN
    (
    SELECT
        r.user_id AS user_id, t.tag, sum(r.reputation) AS tag_reputation
    FROM
        reputations r 
        JOIN post_tag pt ON pt.post_id = r.post_id
        JOIN tags t ON t.id = pt.tag_id
    WHERE
        r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
    GROUP BY
        user_id, t.tag
    ) AS q2
    ON q2.user_id = q1.user_id 
    LEFT JOIN
    (
    SELECT
        r.user_id AS user_id, c.category, sum(r.reputation) AS category_reputation
    FROM
        reputations r 
        JOIN post_category ct ON ct.post_id = r.post_id
        JOIN categories c ON c.id = ct.category_id
    WHERE
        r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
    GROUP BY
        user_id, c.category
    ) AS q3
    ON q3.user_id = q1.user_id 
GROUP BY
    q1.user_id, q1.user_name, q1.score, q1.reputation
ORDER BY
    q1.reputation DESC, q1.score DESC ;

Solution

Your second query is of the form:

q1 -- PK user_id
LEFT JOIN (...
    GROUP BY user_id, t.tag
) AS q2
ON q2.user_id = q1.user_id 
LEFT JOIN (...
    GROUP BY user_id, c.category
) AS q3
ON q3.user_id = q1.user_id
GROUP BY -- group_concats

The inner GROUP BYs result in (user_id, t.tag) & (user_id, c.category) being keys/UNIQUEs. Other than that I won't address those GROUP BYs.

TL;DR When you join (q1 JOIN q2) to q3 it is not on a key/UNIQUE of one of them so for each user_id you get a row for every possible combination of tag & category. So the final GROUP BY inputs duplicates per (user_id, tag) & per (user_id, category) and inappropriately GROUP_CONCATs duplicate tags & categories per user_id. Correct would be (q1 JOIN q2 GROUP BY) JOIN (q1 JOIN q3 GROUP BY) in which all joins are on common key/UNIQUE (user_id) & there is no spurious aggregation. Although sometimes you can undo such spurious aggregation.
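The row multiplication described above can be checked directly. The following is only a diagnostic sketch against the tables from the question: it replaces q1 with the users table (only the key user_id matters here), and the column names joined_rows, tag_count and category_count are made up for illustration.

```sql
-- Count the rows that the final GROUP BY receives for each user
-- when q2 and q3 are both joined on user_id.
SELECT
    u.id                        AS user_id,
    COUNT(*)                    AS joined_rows,     -- rows fed to the final GROUP BY
    COUNT(DISTINCT q2.tag)      AS tag_count,       -- distinct tags for the user
    COUNT(DISTINCT q3.category) AS category_count   -- distinct categories for the user
FROM
    users u
    LEFT JOIN
    (
    SELECT r.user_id, t.tag
    FROM reputations r
        JOIN post_tag pt ON pt.post_id = r.post_id
        JOIN tags t ON t.id = pt.tag_id
    WHERE r.date_time > 1500584821
    GROUP BY r.user_id, t.tag
    ) AS q2
    ON q2.user_id = u.id
    LEFT JOIN
    (
    SELECT r.user_id, c.category
    FROM reputations r
        JOIN post_category ct ON ct.post_id = r.post_id
        JOIN categories c ON c.id = ct.category_id
    WHERE r.date_time > 1500584821
    GROUP BY r.user_id, c.category
    ) AS q3
    ON q3.user_id = u.id
GROUP BY
    u.id ;
```

Whenever a user has at least one tag and at least one category, joined_rows should equal tag_count * category_count, e.g. 4 rows for a user with 2 tags and 2 categories. Those extra rows are what GROUP_CONCAT then concatenates.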

A correct symmetrical INNER JOIN approach: LEFT JOIN q1 & q2--1:many--then GROUP BY & GROUP_CONCAT (which is what your first query did); then separately similarly LEFT JOIN q1 & q3--1:many--then GROUP BY & GROUP_CONCAT; then INNER JOIN the two results ON user_id--1:1.
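A sketch of that symmetrical INNER JOIN approach, assuming MySQL 8.0 or later so that q1, q2 and q3 can be written once as common table expressions (on older versions the derived tables would have to be repeated inline). The names user_tags and user_categories are invented for this example.

```sql
WITH q1 AS (          -- one row per user
    SELECT u.id AS user_id, u.user_name,
           COALESCE(SUM(r.score), 0) AS score,
           COALESCE(SUM(r.reputation), 0) AS reputation
    FROM users u
        LEFT JOIN reputations r
            ON r.user_id = u.id AND r.date_time > 1500584821
    GROUP BY u.id, u.user_name
),
q2 AS (               -- one row per (user_id, tag)
    SELECT r.user_id, t.tag, SUM(r.reputation) AS tag_reputation
    FROM reputations r
        JOIN post_tag pt ON pt.post_id = r.post_id
        JOIN tags t ON t.id = pt.tag_id
    WHERE r.date_time > 1500584821
    GROUP BY r.user_id, t.tag
),
q3 AS (               -- one row per (user_id, category)
    SELECT r.user_id, c.category, SUM(r.reputation) AS category_reputation
    FROM reputations r
        JOIN post_category ct ON ct.post_id = r.post_id
        JOIN categories c ON c.id = ct.category_id
    WHERE r.date_time > 1500584821
    GROUP BY r.user_id, c.category
),
user_tags AS (        -- q1 LEFT JOIN q2 (1:many), aggregated back to one row per user
    SELECT q1.user_id, q1.user_name, q1.score, q1.reputation,
           SUBSTRING_INDEX(GROUP_CONCAT(q2.tag ORDER BY q2.tag_reputation DESC SEPARATOR ','), ',', 2) AS top_two_tags
    FROM q1
        LEFT JOIN q2 ON q2.user_id = q1.user_id
    GROUP BY q1.user_id, q1.user_name, q1.score, q1.reputation
),
user_categories AS (  -- q1 LEFT JOIN q3 (1:many), aggregated back to one row per user
    SELECT q1.user_id,
           SUBSTRING_INDEX(GROUP_CONCAT(q3.category ORDER BY q3.category_reputation DESC SEPARATOR ','), ',', 2) AS category
    FROM q1
        LEFT JOIN q3 ON q3.user_id = q1.user_id
    GROUP BY q1.user_id
)
SELECT
    ut.user_id, ut.user_name, ut.score, ut.reputation, ut.top_two_tags, uc.category
FROM
    user_tags ut
    JOIN user_categories uc ON uc.user_id = ut.user_id   -- 1:1 on the key
ORDER BY
    ut.reputation DESC, ut.score DESC ;
```

Both aggregated sides have exactly one row per user, so the final join cannot multiply rows.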

A correct symmetrical scalar subquery approach: SELECT the GROUP_CONCATs from q1 as scalar subqueries each with a GROUP BY.
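One possible shape for the scalar subquery approach, again only a sketch over the question's tables. Each list is computed by its own correlated subquery, so the tag rows and category rows never meet in a join. Here the GROUP BY sits in the inner derived table and the outer GROUP_CONCAT aggregates over the rows of the current user.

```sql
SELECT
    q1.user_id, q1.user_name, q1.score, q1.reputation,
    -- top two tags for this user, or NULL when the user has none
    (SELECT SUBSTRING_INDEX(GROUP_CONCAT(q2.tag ORDER BY q2.tag_reputation DESC SEPARATOR ','), ',', 2)
     FROM (SELECT r.user_id, t.tag, SUM(r.reputation) AS tag_reputation
           FROM reputations r
               JOIN post_tag pt ON pt.post_id = r.post_id
               JOIN tags t ON t.id = pt.tag_id
           WHERE r.date_time > 1500584821
           GROUP BY r.user_id, t.tag) AS q2
     WHERE q2.user_id = q1.user_id) AS top_two_tags,
    -- top two categories for this user, or NULL when the user has none
    (SELECT SUBSTRING_INDEX(GROUP_CONCAT(q3.category ORDER BY q3.category_reputation DESC SEPARATOR ','), ',', 2)
     FROM (SELECT r.user_id, c.category, SUM(r.reputation) AS category_reputation
           FROM reputations r
               JOIN post_category ct ON ct.post_id = r.post_id
               JOIN categories c ON c.id = ct.category_id
           WHERE r.date_time > 1500584821
           GROUP BY r.user_id, c.category) AS q3
     WHERE q3.user_id = q1.user_id) AS category
FROM
    (SELECT
        u.id AS user_id,
        u.user_name,
        COALESCE(SUM(r.score), 0) AS score,
        COALESCE(SUM(r.reputation), 0) AS reputation
    FROM
        users u
        LEFT JOIN reputations r
            ON r.user_id = u.id AND r.date_time > 1500584821
    GROUP BY
        u.id, u.user_name
    ) AS q1
ORDER BY
    q1.reputation DESC, q1.score DESC ;
```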

A correct cumulative LEFT JOIN approach: LEFT JOIN q1 & q2--1:many--then GROUP BY & GROUP_CONCAT; then LEFT JOIN that & q3--1:many--then GROUP BY & GROUP_CONCAT.

A correct approach like your 2nd query: You first LEFT JOIN q1 & q2--1:many. Then you LEFT JOIN that & q3--many:1:many. It gives a row for every possible combination of a tag & a category that appear with a user_id. Then after you GROUP BY you GROUP_CONCAT--over duplicate (user_id, tag) pairs and duplicate (user_id, category) pairs. That is why you have duplicate list elements. But adding DISTINCT to GROUP_CONCAT gives a correct result. (Per wchiquito's comment.)
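For reference, the DISTINCT variant differs from the question's second query only in the two GROUP_CONCAT calls. This sketch relies on MySQL accepting an ORDER BY column other than the DISTINCT expression inside GROUP_CONCAT, which is what the comment cited above reports; it is worth checking on your own server version.

```sql
SELECT
    q1.user_id, q1.user_name, q1.score, q1.reputation,
    -- DISTINCT collapses the copies created by the many:1:many join
    SUBSTRING_INDEX(GROUP_CONCAT(DISTINCT q2.tag ORDER BY q2.tag_reputation DESC SEPARATOR ','), ',', 2) AS top_two_tags,
    SUBSTRING_INDEX(GROUP_CONCAT(DISTINCT q3.category ORDER BY q3.category_reputation DESC SEPARATOR ','), ',', 2) AS category
FROM
    (SELECT
        u.id AS user_id,
        u.user_name,
        COALESCE(SUM(r.score), 0) AS score,
        COALESCE(SUM(r.reputation), 0) AS reputation
    FROM
        users u
        LEFT JOIN reputations r
            ON r.user_id = u.id AND r.date_time > 1500584821
    GROUP BY
        u.id, u.user_name
    ) AS q1
    LEFT JOIN
    (
    SELECT r.user_id, t.tag, SUM(r.reputation) AS tag_reputation
    FROM reputations r
        JOIN post_tag pt ON pt.post_id = r.post_id
        JOIN tags t ON t.id = pt.tag_id
    WHERE r.date_time > 1500584821
    GROUP BY r.user_id, t.tag
    ) AS q2
    ON q2.user_id = q1.user_id
    LEFT JOIN
    (
    SELECT r.user_id, c.category, SUM(r.reputation) AS category_reputation
    FROM reputations r
        JOIN post_category ct ON ct.post_id = r.post_id
        JOIN categories c ON c.id = ct.category_id
    WHERE r.date_time > 1500584821
    GROUP BY r.user_id, c.category
    ) AS q3
    ON q3.user_id = q1.user_id
GROUP BY
    q1.user_id, q1.user_name, q1.score, q1.reputation
ORDER BY
    q1.reputation DESC, q1.score DESC ;
```

Because (user_id, tag) and (user_id, category) are unique in q2 and q3, every copy of a tag or category carries the same reputation value, so the ordering stays well defined after duplicates are removed.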

Which you prefer is, as usual, an engineering tradeoff to be informed by query plans & timings on actual data, usage & statistics (e.g. the input & the expected amount of duplication). One issue is whether the extra rows of the many:1:many JOIN approach offset its saving of a GROUP BY.

-- cumulative LEFT JOIN approach
SELECT
   q1.user_id, q1.user_name, q1.score, q1.reputation,
    top_two_tags,
    substring_index(group_concat(q3.category  ORDER BY q3.category_reputation DESC SEPARATOR ','), ',', 2) AS category
FROM
    -- your 1st query (less ORDER BY) AS q1
    (SELECT
        q1.user_id, q1.user_name, q1.score, q1.reputation, 
        substring_index(group_concat(q2.tag  ORDER BY q2.tag_reputation DESC SEPARATOR ','), ',', 2) AS top_two_tags
    FROM
        (SELECT 
            u.id AS user_Id, 
            u.user_name,
            coalesce(sum(r.score), 0) as score,
            coalesce(sum(r.reputation), 0) as reputation
        FROM 
            users u
            LEFT JOIN reputations r 
                ON    r.user_id = u.id 
                  AND r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
        GROUP BY 
            u.id, u.user_name
        ) AS q1
        LEFT JOIN
        (
        SELECT
            r.user_id AS user_id, t.tag, sum(r.reputation) AS tag_reputation
        FROM
            reputations r 
            JOIN post_tag pt ON pt.post_id = r.post_id
            JOIN tags t ON t.id = pt.tag_id
        WHERE
            r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
        GROUP BY
            user_id, t.tag
        ) AS q2
        ON q2.user_id = q1.user_id 
        GROUP BY
            q1.user_id, q1.user_name, q1.score, q1.reputation
    ) AS q1
    -- finish like your 2nd query
    LEFT JOIN
    (
    SELECT
        r.user_id AS user_id, c.category, sum(r.reputation) AS category_reputation
    FROM
        reputations r 
        JOIN post_category ct ON ct.post_id = r.post_id
        JOIN categories c ON c.id = ct.category_id
    WHERE
        r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
    GROUP BY
        user_id, c.category
    ) AS q3
    ON q3.user_id = q1.user_id 
GROUP BY
    q1.user_id, q1.user_name, q1.score, q1.reputation
ORDER BY
    q1.reputation DESC, q1.score DESC ;
