随着 xml 文档的增长,交叉应用 xml 查询的性能呈指数级下降
我有一个可变大小的 XML 文档,需要在 MSSQL 2008 R2 上进行解析,如下所示:
I have a variable size XML document that needs to be parsed on MSSQL 2008 R2 that looks like this:
<data item_id_type="1" cfgid="{4F5BBD5E-72ED-4201-B741-F6C8CC89D8EB}" has_data_event="False">
<item name="1">
<field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.506543009706267</field>
<field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.79500402346138</field>
<field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.0152649050024924</field>
</item>
<item name="2">
<field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.366096802804087</field>
<field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.386642801354842</field>
<field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.031671174184115</field>
</item>
</data>
.
我需要将其转换为如下所示的常规表格类型数据集:
I need to transform it into a regular table type dataset that looks like this:
item_name field_id field_type field_value
--------- ------------------------------------ ----------- ---------------
1 EA032B25-19F1-4C1B-BDDE-3113542D13A5 2 0.5065430097062
1 71014ACB-571B-4C72-9C9B-05458B11335F 2 -0.795004023461
1 740C36E9-1988-413E-A1D5-B3E5B4405B45 2 0.0152649050024
2 EA032B25-19F1-4C1B-BDDE-3113542D13A5 2 0.3660968028040
2 71014ACB-571B-4C72-9C9B-05458B11335F 2 -0.386642801354
2 740C36E9-1988-413E-A1D5-B3E5B4405B45 2 0.0316711741841
3 EA032B25-19F1-4C1B-BDDE-3113542D13A5 2 0.8839620369590
3 71014ACB-571B-4C72-9C9B-05458B11335F 2 -0.781459993268
3 740C36E9-1988-413E-A1D5-B3E5B4405B45 2 0.2284423515729
.
此 cross apply
查询创建所需的输出:
This cross apply
query creates the desired output:
create table #temp (x xml)
insert into #temp (x)
values ('
<data item_id_type="1" cfgid="{4F5BBD5E-72ED-4201-B741-F6C8CC89D8EB}" has_data_event="False">
<item name="1">
<field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.506543009706267</field>
<field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.79500402346138</field>
<field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.0152649050024924</field>
</item>
<item name="2">
<field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.366096802804087</field>
<field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.386642801354842</field>
<field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.031671174184115</field>
</item>
<item name="3">
<field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.883962036959074</field>
<field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.781459993268713</field>
<field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.228442351572923</field>
</item>
</data>
')
select c.value('(../@name)','varchar(5)') as item_name
,c.value('(@id)','uniqueidentifier') as field_id
,c.value('(@type)','int') as field_type
,c.value('(.)','nvarchar(15)') as field_value
from #temp cross apply
#temp.x.nodes('/data/item/field') as y(c)
drop table #temp
.
当 XML 中有几百个(或更少)<item>
元素时,查询执行得很好.但是,当有 1,000 个 <item>
元素时,在 SSMS 中完成返回行需要 24 秒.当有 6,500 个 <item>
元素时,运行 cross apply
查询大约需要 20 分钟.我们可以有 10-20,000 个 <item>
元素.
When there are a few hundred (or fewer) <item>
elements in the XML, the query performs just fine. However, when there are 1,000 <item>
elements, it takes 24 seconds to finish returning the rows in SSMS. When there are 6,500 <item>
elements, it takes about 20 minutes to run the cross apply
query. We could have 10-20,000 <item>
elements.
.
是什么让 cross apply
查询在这个简单的 XML 文档上表现如此糟糕,并且随着数据集的增长呈指数级增长?
What makes the cross apply
query perform so poorly on this simple XML document, and perform exponentially slower as the dataset grows?
有没有更有效的方法将 XML 文档转换为表格数据集(在 SQL 中)?
Is there a more efficient way to transform the XML document into the tabular dataset (in SQL)?
推荐答案
是什么让交叉应用查询在这个简单的 XML 上表现如此糟糕文档,并随着数据集的增长而呈指数级变慢?
What makes the cross apply query perform so poorly on this simple XML document, and perform exponentially slower as the dataset grows?
是利用父轴从item节点获取属性ID.
It Is the use of the parent axis to get the attribute ID from the item node.
正是这部分查询计划有问题.
It is this part of the query plan that is problematic.
注意下面的表值函数有 423 行.
Notice the 423 rows coming out of the lower Table-valued function.
再添加一个带有三个字段节点的项目节点就可以了.
Adding just one more item node with three field nodes gives you this.
返回 732 行.
如果我们将第一个查询的节点加倍到总共 6 个项目节点会怎样?
What if we double the nodes from the first query to a total of 6 item nodes?
我们最多返回 1602 行.
We are up to a whopping 1602 row returned.
顶部函数中的图 18 是 XML 中的所有字段节点.我们这里有 6 个项目,每个项目中有 3 个字段.这 18 个节点用于嵌套循环连接另一个函数,因此 18 次执行返回 1602 行,每次迭代返回 89 行.这恰好是整个 XML 中节点的确切数量.好吧,它实际上比所有可见节点多一个.我不知道为什么.您可以使用此查询来检查 XML 中的节点总数.
The figure 18 in the top function is all field nodes in your XML. We have here 6 items with three fields in each item. Those 18 nodes are used in a nested loops join against the other function so 18 executions returning 1602 rows gives that it is returning 89 rows per iteration. That just happens to be the exact number of nodes in the entire XML. Well it is actually one more than all the visible nodes. I don't know why. You can use this query to check the total number of nodes in your XML.
select count(*)
from @XML.nodes('//*, //@*, //*/text()') as T(X)
因此,当您在 values 函数中使用父轴 ..
时,SQL Server 用于获取值的算法是它首先找到您要粉碎的所有节点,最后找到 18 个节点案件.对于这些节点中的每一个,它都会分解并返回整个 XML 文档,并在过滤器运算符中检查您实际需要的节点.在那里你有你的指数增长.您应该使用一个额外的交叉应用,而不是使用父轴.先在物品上切碎,然后在场地上切碎.
So the algorithm used by SQL Server to get the value when you use the parent axis ..
in a values function is that it first finds all the nodes you are shredding on, 18 in the last case. For each of those nodes it shreds and returns the entire XML document and checks in the filter operator for the node you actually want. There you have your exponential growth.
Instead of using the parent axis you should use one extra cross apply. First shred on item and then on field.
select I.X.value('@name', 'varchar(5)') as item_name,
F.X.value('@id', 'uniqueidentifier') as field_id,
F.X.value('@type', 'int') as field_type,
F.X.value('text()[1]', 'nvarchar(15)') as field_value
from #temp as T
cross apply T.x.nodes('/data/item') as I(X)
cross apply I.X.nodes('field') as F(X)
我还更改了访问字段文本值的方式.使用 .
将使 SQL Server 去寻找子节点到 field
并在结果中连接这些值.您没有子值,因此结果是相同的,但最好避免在查询计划中包含该部分(UDX 运算符).
I also changed how you access the text value of the field. Using .
will make SQL Server go look for child nodes to field
and concatenate those values in the result. You have no child values so the result is the same but it is a good thing to avoid to have that part in the query plan (the UDX operator).
如果您使用 XML 索引,则查询计划不存在父轴的问题,但您仍然会从更改获取字段值的方式中受益.
The query plan does not have the issue with the parent axis if you are using an XML index but you will still benefit from changing how you fetch the field value.
相关文章