使用 T-SQL 在时间序列数据中前向填充空值的有效方法

2021-09-10 00:00:00 tsql sql-server

我有一个包含大部分为空值的时间序列数据的表,我想用最后一个已知值填充所有空值.

I have a table with time-series data that's mostly nulls, and I want to fill in all of the nulls with the last known value.

我有一些解决方案,但它们比在 Pandas 中执行等效的 DataFrame.fillna(method='ffill') 操作要慢得多.

I have a few solutions, but they're much slower than doing the equivalent DataFrame.fillna(method='ffill') operation in Pandas.

我正在使用的代码/数据的简化版本:

A simplified version of the code / data that I'm using:

select d.[date], d.[price],
       (select top 1 p.price from price_table p
        where p.price is not null and p.[date] <= p.[date]
        order by p.[date] desc) as ff_price
from price_table d

制作桌子

date       price ff_price
---------- ----- --------
2016-07-11 0.79  0.79
2016-07-12 NULL  0.79
2016-07-13 NULL  0.79
2016-07-14 0.69  0.69
2016-07-15 NULL  0.69
...
2016-09-21 0.88  0.88
...

我有超过 1 亿行,所以这需要很长时间.

I have >100 million rows, so this takes quite a while.

推荐答案

假设你的列是 DATE 并且价格是 DECIMAL(5,2),请测试这个方法:

Assuming that your column is DATE and price is DECIMAL(5,2), please test this approach:

SELECT
    P.[date],
    P.[price],
    ff_price = CONVERT(
        DECIMAL(5,2),       -- Original price datatype
        SUBSTRING(
            MAX(
                CAST(P.[date] AS BINARY(3)) +   -- 3: datalength of P.[date] column
                CAST(P.[price] AS BINARY(5))    -- 5: datalength of P.[price] column
            ) OVER (ORDER BY P.[date] ROWS UNBOUNDED PRECEDING),

            4,  -- Position to start that's not the binary part of the date

            5))-- Characters that compose the binary of the original price datatype
FROM
    price_table  AS P

这是我用类似问题实现的解决方案,您可以找到详尽的解释 此处.这种方法之所以好是因为它不需要显式排序,只要您有日期的索引即可.

This is a solution I implemented with a similar problem and you can find the exaustive explanation here. The reason this approach is good is because it doesn't require a explicit sort, as long as you have an index by date.

它所做的基本上是使用窗口化的 MAX 与组成日期列的 3 个字节的串联(这就是为什么我提到您的列必须是 DATE,否则 DATETIME 将需要 8 个字节,您可以编辑查询以使用它)使用构成您的价格列的字节(也假定为 5 个字节).这是 CAST(P.[date] AS BINARY(3)) + CAST(P.[price] AS BINARY(5)) 部分.

What it does is basically use a windowed MAX with the concatenation of the 3 bytes that composes your date column (this is why I mentioned that you column must be DATE, otherwise DATETIME will need 8 bytes, you can edit the query to work with this) with the bytes that compose your price column (which are 5 bytes, also assumed). This is the CAST(P.[date] AS BINARY(3)) + CAST(P.[price] AS BINARY(5)) part.

当你计算这个和 ORDER BY P.[date] ROWS UNBOUNDED PRECEDING 时,引擎基本上是滚动最大值,其中最重要的字节是你的日期.当日期更改时,最大值结果将始终更新,但考虑到将任何值与 NULL 作为价格连接也会产生 NULL(作为二进制),那么 MAX 将始终忽略此值并保留之前的非空 MAX(按 P.[date] ROWS UNBOUNDED PRECEDING).

When you calculate this and ORDER BY P.[date] ROWS UNBOUNDED PRECEDING, the engine is basically doing rolling max with values which most significant bytes are your dates. The max result will always update when the date changes, but considering that concatenating any value with NULL as price will also yield NULL (as binary), then the MAX will always ignore this value and retain the previous non-null MAX (by P.[date] ROWS UNBOUNDED PRECEDING).

这是窗口化 MAX 的二进制结果(我添加了一个带有 NULL 的前一条记录,所以你看到结果是 NULL 表示 null价格值):

This is the binary result of the windowed MAX (I added a previous record with NULL so you see that result is NULL for null prices values):

date        price   ff_price    WindowedMax
2016-07-10  NULL    NULL        NULL
2016-07-11  0.79    0.79        0x9B3B0B050200014F
2016-07-12  NULL    0.79        0x9B3B0B050200014F
2016-07-13  NULL    0.79        0x9B3B0B050200014F
2016-07-14  0.69    0.69        0x9E3B0B0502000145
2016-07-15  NULL    0.69        0x9E3B0B0502000145
2016-07-21  0.88    0.88        0xA53B0B0502000158
2016-07-22  NULL    0.88        0xA53B0B0502000158

相关文章