COUNT(DISTINCT) 的百分位数与相关 WHERE 仅适用于视图(或没有 DISTINCT)

2021-12-30 00:00:00 where-clause count distinct mysql view

我遇到了一个奇怪的问题,我不知道是我的语法(看起来很简单)还是错误(或者只是不受支持).

I've got a weird one, and I don't know if it's my syntax (which seems straightforward) or a bug (or just unsupported).

这是我的查询有效但不必要的慢:

Here's my query that works but is needlessly slow:

UPDATE table1 
    SET table1column1 = 
        (SELECT COUNT(DISTINCT table2column1) FROM table2view WHERE table2column1 <= (SELECT table2column1 FROM table2 WHERE table2.id = table1.id) )
        / 
        (SELECT COUNT(DISTINCT table2column1) FROM table2) 

       + (SELECT COUNT(DISTINCT table2column2) FROM table2view WHERE table2column2 <= (SELECT table2column2 FROM table2 WHERE table2.id = table1.id) ) 
        / 
        (SELECT COUNT(DISTINCT table2column2) FROM table2) 

       + (SELECT COUNT(DISTINCT table2column3) FROM table2view WHERE table2column3 <= (SELECT table2column3 FROM table2 WHERE table2.id = table1.id) ) 
        / (SELECT COUNT(DISTINCT table2column3) FROM table2);

它只是去除重复项后的三个百分位数(table2column1、table2column2 和 table2column3)的总和.

It's just the sum of three percentiles (of table2column1, table2column2, and table2column3) with duplicates removed.

这就是奇怪的地方.我必须使用一个视图来处理带有 WHERE 的子查询,否则它只会 UPDATE table1 的第一行,并设置其余行的 table1column10.table2viewtable2 完全相同.是的,很奇怪.

Here's where it gets weird. I have to use a view for this to work on the subquery with the WHERE or it will only UPDATE the first row of table1, and set the rest of the rows' table1column1 to 0. That table2view is an exact duplicate of table2. Yeah, weird.

如果我不使用 DISTINCT,我可以在没有视图的情况下完成.那有意义吗?注意:我必须有 DISTINCT 因为我有很多重复项.

If I don't use DISTINCT, I can do it without the view. Does that make sense? Note: I have to have DISTINCT because I have lots of duplicates.

我尝试仅从视图中创建 SELECT ,但这会减慢速度.

I tried making it SELECT only from the view, but that slowed it down worse.

有谁知道问题是什么以及重新处理这个查询的最佳方法,所以它不会花费这么长时间?它在 TRIGGER 中,并且更新的数据非常随需.

Does anyone know what the problem is and the best way to rework this query so it doesn't take so long? It's in a TRIGGER, and the updated data is pretty on demand.

非常感谢!

详情

我正在 phpMyAdmin 的命令行中测试速度.

I'm testing the speed in phpMyAdmin's command line.

我很确定降级来自视图,因为视图越多,我使用的实际表越少,它变得越慢.

I'm pretty sure the degradation is coming from the view since the more of the view and the less of the actual table I use, the slower it gets.

当我在没有 DISTINCT 的情况下做一个时,它的速度非常快.

When I do the one without DISTINCT, it's lightning fast.

仅适用于视图?

好的,所以我只是设置了 table2 的副本.我首先尝试执行原始查询,用副本替换视图.不行.

OK, so I just set up a copy of table2. I tried first to do the original query substituting the view with the copy. No go.

我尝试使用副本而不是视图执行以下查询.不行.

I tried to do the query below with the copy instead of the view. No go.

希望这些常量的引入能更好地展示我正在尝试做的事情.

Hopefully the introduction of these constants will better show what I'm trying to do.

SET @table2column1_distinct_count = (SELECT COUNT(DISTINCT table2column1) FROM table2);
SET @table2column2_distinct_count = (SELECT COUNT(DISTINCT table2column2) FROM table2);
SET @table2column3_distinct_count = (SELECT COUNT(DISTINCT table2column3) FROM table2);
UPDATE table1, table2
    SET table1.table1column1 = (SELECT COUNT(DISTINCT table2column1) FROM  table2view WHERE table2column1 <= table2.table2column1) / @table2column1_distinct_count 
    + (SELECT COUNT(DISTINCT table2column2) FROM  table2view WHERE table2column2 <= table2.table2column2) / @table2column2_distinct_count 
    + (SELECT COUNT(DISTINCT table2column3) FROM  table2view WHERE table2column3 <= table2.table2column3) / @table2column3_distinct_count 
        WHERE table1.id = table2.id;

同样,当我使用 table2 而不是 table2view 时,它只会正确更新第一行并设置所有其他行的 table1.table1column1 = 0.

Again, when I use table2 instead of the table2view, it only updates the first row properly and sets all other rows' table1.table1column1 = 0.

数学

我正在尝试将 table1.table1column1 = 设置为 table2column1table2column2table2column3 的百分位数之和 通过 id.

I'm trying to set table1.table1column1 = to the sum of the percentiles of table2column1, table2column2, and table2column3 by id.

我通过(计算 table2columnX 的不同值 <= 到当前 table2columnX )/(不同 table2columnX 的总计数s).

I do a percentile by (counting the distinct values of a table2columnX <= to the current table2columnX ) / (the total count of distinct table2columnXs).

我使用 DISTINCT 来去除过多的重复.

I use DISTINCT to get rid of the excessive duplicates.

查看

这是视图的 SELECT.这有帮助吗?

Here's the SELECT for the view. Does this help?

CREATE VIEW myTable.table2view AS SELECT
    table2.table2column1 AS table2column1,
    table2.table2column2 AS table2column2,
    table2.table2column2 AS table2column3,
FROM table2
GROUP BY table2.id;

视图的 SELECT 中的 GROUP BY 是否有什么特别之处使这项工作(我没有看到)?

Is there something special about the GROUP BY in the view's SELECT that makes this work (that I'm not seeing)?

推荐答案

我可能会说查询很慢,因为它在触发器触发时重复访问表.

I would probably say that the query is slow because it is repeatedly accessing the table when the trigger fires.

我不是 SQL 专家,但我曾尝试使用临时表组合查询.您可以查看它是否有助于加快查询速度.我在下面的代码示例中使用了不同但发音相似的列名称.

I am no SQL expert but I have tried to put together a query using temporary tables. You can see if it helps speed up the query. I have used different but similar sounding column names in my code sample below.

在我之前的代码中有一个计算错误.现已更新.

EDIT : There was a calculation error in my earlier code. Updated now.

SELECT COUNT(id) INTO @no_of_attempts from tb2;

-- DROP TABLE IF EXISTS S1Percentiles;
-- DROP TABLE IF EXISTS S2Percentiles;
-- DROP TABLE IF EXISTS S3Percentiles;

CREATE TEMPORARY TABLE S1Percentiles (
    s1 FLOAT NOT NULL,
    percentile FLOAT NOT NULL DEFAULT 0.00
);

CREATE TEMPORARY TABLE S2Percentiles (
    s2 FLOAT NOT NULL,
    percentile FLOAT NOT NULL DEFAULT 0.00
);

CREATE TEMPORARY TABLE S3Percentiles (
    s3 FLOAT NOT NULL,
    percentile FLOAT NOT NULL DEFAULT 0.00
);



INSERT INTO S1Percentiles (s1, percentile)
    SELECT A.s1, ((COUNT(B.s1)/@no_of_attempts)*100)
    FROM (SELECT DISTINCT s1 from tb2) A
    INNER JOIN tb2 B
    ON B.s1 <= A.s1
    GROUP BY A.s1;

INSERT INTO S2Percentiles (s2, percentile)
    SELECT A.s2, ((COUNT(B.s2)/@no_of_attempts)*100)
    FROM (SELECT DISTINCT s2 from tb2) A
    INNER JOIN tb2 B
    ON B.s2 <= A.s2
    GROUP BY A.s2;

INSERT INTO S3Percentiles (s3, percentile)
    SELECT A.s3, ((COUNT(B.s3)/@no_of_attempts)*100)
    FROM (SELECT DISTINCT s3 from tb2) A
    INNER JOIN tb2 B
    ON B.s3 <= A.s3
    GROUP BY A.s3;

-- select * from S1Percentiles;
-- select * from S2Percentiles;
-- select * from S3Percentiles;

UPDATE tb1 A
    INNER JOIN
    (
    SELECT B.tb1_id AS id, (C.percentile + D.percentile + E.percentile) AS sum FROM tb2 B
        INNER JOIN S1Percentiles C
        ON B.s1 = C.s1
        INNER JOIN S2Percentiles D
        ON B.s2 = D.s2
        INNER JOIN S3Percentiles E
        ON B.s3 = E.s3
    ) F
    ON A.id = F.id

    SET A.sum = F.sum;

-- SELECT * FROM tb1;

DROP TABLE S1Percentiles;
DROP TABLE S2Percentiles;
DROP TABLE S3Percentiles;

这样做是为了记录每个分数组的百分位数,然后最后只用必要的数据更新 tb1 列,而不是重新计算每个学生行的百分位数.

What this does is that it records the percentile for each score group and then finally just updates the tb1 column with the requisite data instead of recalculating the percentile for each student row.

您还应该索引 s1s2s3 列以优化对这些列的查询.

You should also index columns s1, s2 and s3 for optimizing the queries on these columns.

注意:请根据您的数据库架构更新列名称.另请注意,每个百分位数计算都乘以 100,因为我相信百分位数通常是这样计算的.

Note: Please update the column names according to your db schema. Also note that each percentile calculation has been multiplied by 100 as I believe that percentile is usually calculated that way.

相关文章