Function to calculate median in SQL Server

2021-12-02 00:00:00 tsql aggregate-functions sql-server median

According to MSDN, Median is not available as an aggregate function in Transact-SQL. However, I would like to find out whether it is possible to create this functionality (using CREATE AGGREGATE, a user-defined function, or some other method).

What would be the best way (if possible) to do this, so that a median value (assuming a numeric data type) can be calculated in an aggregate query?

Recommended answer

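2019 UPDATE: In the 10 years since I wrote this answer, more solutions have been uncovered that may yield better results. Also, SQL Server releases since then (especially SQL 2012) have introduced new T-SQL features that can be used to calculate medians. SQL Server releases have also improved the query optimizer, which may affect the performance of various median solutions. Net-net, my original 2009 post is still OK, but there may be better solutions for modern SQL Server apps. Take a look at this article from 2012, which is a great resource: https://sqlperformance.com/2012/08/t-sql-queries/median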

That article found the following pattern to be much, much faster than all other alternatives, at least on the simple schema they tested. This solution was 373x faster (!!!) than the slowest (PERCENTILE_CONT) solution tested. Note that this trick requires two separate queries, which may not be practical in all cases. It also requires SQL 2012 or later.

    DECLARE @c BIGINT = (SELECT COUNT(*) FROM dbo.EvenRows);

    SELECT AVG(1.0 * val)
    FROM (
        SELECT val
        FROM dbo.EvenRows
        ORDER BY val
        OFFSET (@c - 1) / 2 ROWS
        FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
    ) AS x;
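To make the OFFSET/FETCH arithmetic concrete, here is a minimal sketch of the same pattern run against a small table variable. The table variable `@t` and its four sample values are made up for illustration and stand in for dbo.EvenRows; they are not from the article.

```sql
-- Hypothetical sample data standing in for dbo.EvenRows.
DECLARE @t TABLE (val INT NOT NULL);
INSERT INTO @t (val) VALUES (10), (20), (30), (40);

DECLARE @c BIGINT = (SELECT COUNT(*) FROM @t);

-- Even count (@c = 4): skip (4 - 1) / 2 = 1 row, then fetch
-- 1 + (1 - 4 % 2) = 2 rows, i.e. the two middle values 20 and 30.
-- Odd count (e.g. @c = 5): skip (5 - 1) / 2 = 2 rows, then fetch
-- 1 + (1 - 5 % 2) = 1 row, i.e. the single middle value.
SELECT AVG(1.0 * val) AS Median
FROM (
    SELECT val
    FROM @t
    ORDER BY val
    OFFSET (@c - 1) / 2 ROWS
    FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;
```

With these four values the query averages 20 and 30, so the median is 25. Multiplying by 1.0 keeps AVG from doing integer arithmetic on an INT column.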

Of course, one test on one schema in 2012 yielding great results is no guarantee: your mileage may vary, especially if you're on SQL Server 2014 or later. If perf is important for your median calculation, I'd strongly suggest trying and perf-testing several of the options recommended in that article to make sure that you've found the best one for your schema.

I'd also be especially careful using the (new in SQL Server 2012) function PERCENTILE_CONT that's recommended in one of the other answers to this question, because the article linked above found this built-in function to be 373x slower than the fastest solution. It's possible that this disparity has been improved in the 7 years since, but personally I wouldn't use this function on a large table until I verified its performance vs. other solutions.
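For anyone who wants to run that comparison, the PERCENTILE_CONT form looks roughly like the sketch below. The table variable `@t` and its values are hypothetical sample data, not the article's test schema.

```sql
-- Hypothetical sample data for comparison purposes.
DECLARE @t TABLE (val INT NOT NULL);
INSERT INTO @t (val) VALUES (10), (20), (30), (40);

-- In SQL Server, PERCENTILE_CONT is an analytic (window) function, so it
-- needs an OVER clause and returns the same value on every row;
-- DISTINCT collapses those rows to a single result.
SELECT DISTINCT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) OVER () AS Median
FROM @t;
```

On this data it interpolates between 20 and 30 and yields 25, the same median as the OFFSET/FETCH pattern, which makes the two easy to compare side by side on a larger table.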

The original 2009 post follows:

There are lots of ways to do this, with dramatically varying performance. Here's one particularly well-optimized solution, from Medians, ROW_NUMBERs, and performance. It does especially well on the actual I/Os generated during execution: it looks more costly than other solutions, but it is actually much faster.

That page also contains a discussion of other solutions and performance testing details. Note the use of a unique column as a disambiguator in case there are multiple rows with the same value of the median column.

As with all database performance scenarios, always try to test a solution out with real data on real hardware. You never know when a change to SQL Server's optimizer or a peculiarity in your environment will make a normally-speedy solution slower.

    SELECT
        CustomerId,
        AVG(TotalDue)
    FROM (
        SELECT
            CustomerId,
            TotalDue,
            -- SalesOrderId in the ORDER BY is a disambiguator to break ties
            ROW_NUMBER() OVER (
                PARTITION BY CustomerId
                ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
            ROW_NUMBER() OVER (
                PARTITION BY CustomerId
                ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
        FROM Sales.SalesOrderHeader SOH
    ) x
    WHERE RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
    GROUP BY CustomerId
    ORDER BY CustomerId;
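To see why the `RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)` filter picks out the middle rows, here is a small sketch of the same query against made-up data. The table variable `@Orders` and its rows are hypothetical; only the column names mirror Sales.SalesOrderHeader.

```sql
-- Hypothetical sample data: customer 1 has an odd number of orders,
-- customer 2 an even number.
DECLARE @Orders TABLE (
    SalesOrderId INT NOT NULL PRIMARY KEY,
    CustomerId   INT NOT NULL,
    TotalDue     DECIMAL(10, 2) NOT NULL
);
INSERT INTO @Orders (SalesOrderId, CustomerId, TotalDue) VALUES
    (1, 1, 10.00), (2, 1, 20.00), (3, 1, 30.00),
    (4, 2, 10.00), (5, 2, 20.00), (6, 2, 30.00), (7, 2, 100.00);

SELECT
    CustomerId,
    AVG(TotalDue) AS MedianTotalDue
FROM (
    SELECT
        CustomerId,
        TotalDue,
        -- Number the rows from both ends of each customer's sorted orders.
        ROW_NUMBER() OVER (
            PARTITION BY CustomerId
            ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
        ROW_NUMBER() OVER (
            PARTITION BY CustomerId
            ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
    FROM @Orders
) x
-- Odd count: the middle row has RowAsc = RowDesc.
-- Even count: the two middle rows have RowAsc = RowDesc - 1 and RowDesc + 1.
WHERE RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;
```

For customer 1 only the row with TotalDue 20 survives the filter (RowAsc and RowDesc are both 2), so the median is 20. For customer 2 the rows with 20 and 30 survive, and their average, 25, is the median.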
