Flatten intersecting timespans
I have a lot of data with start and stop times for a given ID, and I need to flatten all intersecting and adjacent timespans into one combined timespan. The sample data posted below is all for the same ID, so I didn't list it.
To make things a bit clearer, take a look at the sample data for 03.06.2009:
The following timespans are overlapping or contiguous and need to be merged into one timespan:
- 05:54:48 - 10:00:13
- 09:26:45 - 09:59:40
The resulting timespan would be from 05:54:48 to 10:00:13. Since there's a gap between 10:00:13 and 10:12:50, we also have the following timespans:
- 10:12:50 - 10:27:25
- 10:13:12 - 11:14:56
- 10:27:25 - 10:27:31
- 10:27:39 - 13:53:38
- 11:14:56 - 11:15:03
- 11:15:30 - 14:02:14
- 13:53:38 - 13:53:43
- 14:02:14 - 14:02:31
which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.
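For illustration, the merge rule described above can be sketched outside SQL as a sort followed by a single sweep. This is only a minimal sketch, not a tested solution for the real data: the names `merge_spans`, `t`, and `spans_03_06` are made up for this example, and it assumes that no span crosses midnight (which does not hold for the 07.06.2009 rows).

```python
from datetime import time


def merge_spans(spans):
    """Merge overlapping or adjacent (start, stop) pairs into combined spans."""
    merged = []
    for start, stop in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it if this one ends later.
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            # Gap before this span: open a new combined span.
            merged.append([start, stop])
    return [tuple(span) for span in merged]


def t(text):
    """Hypothetical helper: parse 'HH:MM:SS' into a datetime.time."""
    return time.fromisoformat(text)


# The 03.06.2009 rows from the sample data.
spans_03_06 = [
    (t("05:54:48"), t("10:00:13")),
    (t("09:26:45"), t("09:59:40")),
    (t("10:12:50"), t("10:27:25")),
    (t("10:13:12"), t("11:14:56")),
    (t("10:27:25"), t("10:27:31")),
    (t("10:27:39"), t("13:53:38")),
    (t("11:14:56"), t("11:15:03")),
    (t("11:15:30"), t("14:02:14")),
    (t("13:53:38"), t("13:53:43")),
    (t("14:02:14"), t("14:02:31")),
]

for start, stop in merge_spans(spans_03_06):
    print(start, stop)
# 05:54:48 10:00:13
# 10:12:50 14:02:31
```

The sort dominates the cost, so the sweep itself needs no join on the time columns.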
Below you will find the sample data and the flattened data as I would need it. The duration column is just informative.
Any solution, be it SQL or not, is appreciated.
EDIT: Since there are lots of different and interesting solutions, I'm refining my original question by adding constraints to see the "best" solution (if there is one) bubble up:
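- I get the data via ODBC from another system. There is no way for me to change the table layout or add indexes
- The data is indexed only by the date column (the time part is not)
- There are about 2.5k rows per day
- The estimated usage pattern of the data is roughly as follows:
  - Most of the time (say 90%) the user will query just one or two days (2.5k - 5k rows)
  - Sometimes (9%) the range will be up to a month (about 75k rows)
  - Rarely (1%) the range will be up to a year (about 900k rows)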
Within these constraints, what would be the best solution? I'm afraid that most of the solutions will be horribly slow, since they join on the combination of date and time, which is not an indexed field in my case.
Would you do all the merging on the client or on the server side? Would you first create an optimized temp table and use one of the proposed solutions with that table? I haven't had time to test the solutions yet, but I will keep you informed about what works best for me.
Sample data:
    Date       | Start    | Stop
    -----------+----------+---------
    02.06.2009 | 05:55:28 | 09:58:27
    02.06.2009 | 10:15:19 | 13:58:24
    02.06.2009 | 13:58:24 | 13:58:43
    03.06.2009 | 05:54:48 | 10:00:13
    03.06.2009 | 09:26:45 | 09:59:40
    03.06.2009 | 10:12:50 | 10:27:25
    03.06.2009 | 10:13:12 | 11:14:56
    03.06.2009 | 10:27:25 | 10:27:31
    03.06.2009 | 10:27:39 | 13:53:38
    03.06.2009 | 11:14:56 | 11:15:03
    03.06.2009 | 11:15:30 | 14:02:14
    03.06.2009 | 13:53:38 | 13:53:43
    03.06.2009 | 14:02:14 | 14:02:31
    04.06.2009 | 05:48:27 | 09:58:59
    04.06.2009 | 06:00:00 | 09:59:07
    04.06.2009 | 10:15:52 | 13:54:52
    04.06.2009 | 10:16:01 | 13:24:20
    04.06.2009 | 13:24:20 | 13:24:24
    04.06.2009 | 13:24:32 | 14:00:39
    04.06.2009 | 13:54:52 | 13:54:58
    04.06.2009 | 14:00:39 | 14:00:49
    05.06.2009 | 05:53:58 | 09:59:12
    05.06.2009 | 10:16:05 | 13:59:08
    05.06.2009 | 13:59:08 | 13:59:16
    06.06.2009 | 06:04:00 | 10:00:00
    06.06.2009 | 10:16:54 | 10:18:40
    06.06.2009 | 10:18:40 | 10:18:45
    06.06.2009 | 10:23:00 | 13:57:00
    06.06.2009 | 10:23:48 | 13:57:54
    06.06.2009 | 13:57:21 | 13:57:38
    06.06.2009 | 13:57:54 | 13:57:58
    07.06.2009 | 21:59:30 | 01:58:49
    07.06.2009 | 22:12:16 | 01:58:39
    07.06.2009 | 22:12:25 | 01:58:28
    08.06.2009 | 02:10:33 | 05:56:11
    08.06.2009 | 02:10:43 | 05:56:23
    08.06.2009 | 02:10:49 | 05:55:59
    08.06.2009 | 05:55:59 | 05:56:01
    08.06.2009 | 05:56:11 | 05:56:14
    08.06.2009 | 05:56:23 | 05:56:27
Flattened result:
    Date       | Start    | Stop     | Duration
    -----------+----------+----------+---------
    02.06.2009 | 05:55:28 | 09:58:27 | 04:02:59
    02.06.2009 | 10:15:19 | 13:58:43 | 03:43:24
    03.06.2009 | 05:54:48 | 10:00:13 | 04:05:25
    03.06.2009 | 10:12:50 | 14:02:31 | 03:49:41
    04.06.2009 | 05:48:27 | 09:59:07 | 04:10:40
    04.06.2009 | 10:15:52 | 14:00:49 | 03:44:58
    05.06.2009 | 05:53:58 | 09:59:12 | 04:05:14
    05.06.2009 | 10:16:05 | 13:59:16 | 03:43:11
    06.06.2009 | 06:04:00 | 10:00:00 | 03:56:00
    06.06.2009 | 10:16:54 | 10:18:45 | 00:01:51
    06.06.2009 | 10:23:00 | 13:57:58 | 03:34:58
    07.06.2009 | 21:59:30 | 01:58:49 | 03:59:19
    08.06.2009 | 02:10:33 | 05:56:27 | 03:45:54
Recommended answer
Here is a SQL-only solution. I used DATETIME for the columns. Storing the time separately from the date is a mistake in my opinion, as you will have problems when the times go past midnight. You can adjust this to handle that situation if you need to. The solution also assumes that the start and end times are NOT NULL. Again, you can adjust as needed if that's not the case.
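To illustrate the midnight problem, here is a small Python sketch of how separately stored date and time values could be combined into full datetimes before merging. The function name `to_datetimes` is made up for this example, and the rule "a stop earlier than its start belongs to the next day" is an assumption about the data, not something the question confirms.

```python
from datetime import datetime, timedelta


def to_datetimes(date_text, start_text, stop_text):
    """Combine 'dd.mm.yyyy' and 'HH:MM:SS' strings into (start, stop) datetimes."""
    fmt = "%d.%m.%Y %H:%M:%S"
    start = datetime.strptime(f"{date_text} {start_text}", fmt)
    stop = datetime.strptime(f"{date_text} {stop_text}", fmt)
    if stop < start:
        # Assumption: a stop earlier than its start means the span crossed midnight.
        stop += timedelta(days=1)
    return start, stop


# First 07.06.2009 row from the sample data, which runs past midnight.
start, stop = to_datetimes("07.06.2009", "21:59:30", "01:58:49")
print(start, stop, stop - start)
# 2009-06-07 21:59:30 2009-06-08 01:58:49 3:59:19
```

The resulting duration matches the 03:59:19 shown for that day in the flattened result.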
The general gist of the solution is to get all of the start times that don't overlap with any other spans, get all of the end times that don't overlap with any spans, then match the two together.
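The same idea can be written out in Python to make the three steps easier to follow. This is a sketch for illustration only: `flatten_by_boundaries` and `t` are hypothetical names, the comparisons are quadratic just like a naive self-join, every span is assumed to have start < stop, and plain times are used here only because the chosen day does not cross midnight.

```python
from datetime import time


def flatten_by_boundaries(spans):
    """Pair uncovered start times with the nearest uncovered end time."""
    # Start times that no span beginning earlier reaches or overlaps.
    starts = sorted({
        s for s, _ in spans
        if not any(s2 < s <= e2 for s2, e2 in spans)
    })
    # End times that no span ending later reaches or overlaps.
    ends = sorted({
        e for _, e in spans
        if not any(s2 <= e < e2 for s2, e2 in spans)
    })
    # Match each start with the closest end that comes after it.
    return [(s, min(e for e in ends if e > s)) for s in starts]


def t(text):
    """Hypothetical helper: parse 'HH:MM:SS' into a datetime.time."""
    return time.fromisoformat(text)


# The 06.06.2009 rows from the sample data.
spans_06_06 = [
    (t("06:04:00"), t("10:00:00")),
    (t("10:16:54"), t("10:18:40")),
    (t("10:18:40"), t("10:18:45")),
    (t("10:23:00"), t("13:57:00")),
    (t("10:23:48"), t("13:57:54")),
    (t("13:57:21"), t("13:57:38")),
    (t("13:57:54"), t("13:57:58")),
]

for start, stop in flatten_by_boundaries(spans_06_06):
    print(start, stop)
# 06:04:00 10:00:00
# 10:16:54 10:18:45
# 10:23:00 13:57:58
```

Unlike the SQL, this sketch collects the boundaries in sets, so spans that share an identical start or end do not produce duplicate rows.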
The results match your expected results except in one case where, checking by hand, it looks like you have a mistake in your expected output. On the 6th there should be a span that ends at 2009-06-06 10:18:45.000.
    SELECT
        ST.start_time,
        ET.end_time
    FROM
    (
        -- Start times that are not overlapped by any span starting earlier
        SELECT T1.start_time
        FROM dbo.Test_Time_Spans T1
        LEFT OUTER JOIN dbo.Test_Time_Spans T2 ON
            T2.start_time < T1.start_time AND
            T2.end_time >= T1.start_time
        WHERE T2.start_time IS NULL
    ) AS ST
    INNER JOIN
    (
        -- End times that are not overlapped by any span ending later
        SELECT T3.end_time
        FROM dbo.Test_Time_Spans T3
        LEFT OUTER JOIN dbo.Test_Time_Spans T4 ON
            T4.end_time > T3.end_time AND
            T4.start_time <= T3.end_time
        WHERE T4.start_time IS NULL
    ) AS ET ON
        ET.end_time > ST.start_time
    LEFT OUTER JOIN
    (
        -- Same set of end times, used to exclude any end that is not the nearest one
        SELECT T5.end_time
        FROM dbo.Test_Time_Spans T5
        LEFT OUTER JOIN dbo.Test_Time_Spans T6 ON
            T6.end_time > T5.end_time AND
            T6.start_time <= T5.end_time
        WHERE T6.start_time IS NULL
    ) AS ET2 ON
        ET2.end_time > ST.start_time AND
        ET2.end_time < ET.end_time
    WHERE
        ET2.end_time IS NULL