Parsing an OCR XML document with SimpleXML in PHP
Suppose I have an XML document (actually an XML file containing the OCR results for a document):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml" version="1.0" producer="ABBYY FineReader Engine 11" languages="">
<page width="317" height="387" resolution="96" originalCoords="1">
<block blockType="Text" l="132" t="18" r="302" b="98"><region><rect l="132" t="18" r="302" b="71"/><rect l="195" t="71" r="302" b="98"/></region>
<text>
<par align="Right">
<line baseline="36" l="179" t="20" r="299" b="36"><formatting lang="EnglishUnitedStates">
<charParams l="179" t="20" r="199" b="36">W</charParams>
<charParams l="201" t="20" r="210" b="36">h</charParams>
<charParams l="212" t="24" r="223" b="36">a</charParams>
<charParams l="223" t="21" r="229" b="36">t</charParams>
<charParams l="230" t="20" r="236" b="36"> </charParams>
<charParams l="237" t="20" r="253" b="36">M</charParams>
<charParams l="255" t="24" r="266" b="36">a</charParams>
<charParams l="267" t="20" r="277" b="36">k</charParams>
<charParams l="278" t="24" r="288" b="36">e</charParams>
<charParams l="289" t="24" r="299" b="36">s</charParams></formatting></line></par>
<par align="Justified">
<line baseline="68" l="134" t="42" r="299" b="69"><formatting lang="EnglishUnitedStates">
<charParams l="134" t="42" r="160" b="68">Y</charParams>
<charParams l="155" t="48" r="175" b="69">o</charParams>
<charParams l="177" t="49" r="197" b="69">u</charParams>
<charParams l="200" t="48" r="213" b="68">r</charParams>
<charParams l="214" t="42" r="221" b="68"> </charParams>
<charParams l="222" t="42" r="246" b="68">V</charParams>
<charParams l="244" t="48" r="264" b="69">o</charParams>
<charParams l="264" t="43" r="278" b="68">t</charParams>
<charParams l="278" t="48" r="299" b="69">e</charParams></formatting></line></par>
<par>
<line baseline="95" l="197" t="76" r="299" b="96"><formatting lang="EnglishUnitedStates">
<charParams l="197" t="76" r="218" b="95">M</charParams>
<charParams l="220" t="81" r="234" b="96">a</charParams>
<charParams l="235" t="77" r="245" b="96">t</charParams>
<charParams l="245" t="77" r="256" b="96">t</charParams>
<charParams l="257" t="81" r="272" b="96">e</charParams>
<charParams l="274" t="81" r="284" b="95">r</charParams>
<charParams l="285" t="76" r="299" b="95">?</charParams></formatting></line></par>
</text>
</block>
<block blockType="Text" l="25" t="83" r="155" b="281"><region><rect l="25" t="83" r="155" b="281"/></region>
<text>
<par align="Justified">
<line baseline="276" l="26" t="82" r="154" b="278"><formatting lang="EnglishUnitedStates">
<charParams l="26" t="82" r="154" b="278">9</charParams></formatting></line></par>
</text>
</block>
<block blockType="Text" l="73" t="288" r="107" b="322"><region><rect l="73" t="288" r="107" b="322"/></region>
<text>
<par lineSpacing="-1"></par>
</text>
</block>
<block blockType="Text" l="25" t="327" r="136" b="356"><region><rect l="25" t="327" r="136" b="356"/></region>
<text>
<par lineSpacing="1050">
<line baseline="337" l="27" t="328" r="109" b="337"><formatting lang="EnglishUnitedStates">
<charParams l="27" t="328" r="32" b="337">F</charParams>
<charParams l="33" t="328" r="35" b="337">i</charParams>
<charParams l="36" t="330" r="42" b="337">n</charParams>
<charParams l="42" t="329" r="49" b="337">d</charParams>
<charParams l="50" t="329" r="52" b="337"> </charParams>
<charParams l="53" t="330" r="59" b="337">o</charParams>
<charParams l="60" t="331" r="66" b="337">u</charParams>
<charParams l="66" t="329" r="69" b="337">t</charParams>
<charParams l="70" t="329" r="73" b="337"> </charParams>
<charParams l="74" t="329" r="79" b="337">h</charParams>
<charParams l="80" t="331" r="86" b="337">o</charParams>
<charParams l="87" t="331" r="95" b="337">w</charParams>
<charParams l="96" t="330" r="98" b="337"> </charParams>
<charParams l="99" t="330" r="105" b="337">a</charParams>
<charParams l="106" t="329" r="109" b="337">t</charParams></formatting></line>
<line baseline="351" l="31" t="341" r="133" b="353"><formatting lang="EnglishUnitedStates">
<charParams l="31" t="343" r="42" b="351" suspicious="1">m</charParams>
<charParams l="44" t="343" r="50" b="351">a</charParams>
<charParams l="51" t="341" r="56" b="351">t</charParams>
<charParams l="57" t="341" r="64" b="351">h</charParams>
<charParams l="66" t="343" r="72" b="351">a</charParams>
<charParams l="74" t="344" r="84" b="351">w</charParams>
<charParams l="85" t="343" r="92" b="351">a</charParams>
<charParams l="94" t="344" r="98" b="351">r</charParams>
<charParams l="99" t="343" r="106" b="351">e</charParams>
<charParams l="107" t="348" r="110" b="351">.</charParams>
<charParams l="111" t="343" r="119" b="351">o</charParams>
<charParams l="120" t="344" r="125" b="351">r</charParams>
<charParams l="125" t="343" r="133" b="353">g</charParams></formatting></line></par>
</text>
</block>
<block blockType="Text" l="6" t="364" r="312" b="376"><region><rect l="6" t="364" r="312" b="376"/></region>
<text>
<par align="Justified">
<line baseline="372" l="7" t="365" r="309" b="374"><formatting lang="EnglishUnitedStates">
<charParams l="7" t="365" r="14" b="372">M</charParams>
<charParams l="15" t="367" r="19" b="372">a</charParams>
<charParams l="19" t="366" r="22" b="372">t</charParams>
<charParams l="23" t="366" r="27" b="372">h</charParams>
<charParams l="27" t="367" r="33" b="372">e</charParams>
<charParams l="33" t="367" r="40" b="372">m</charParams>
<charParams l="40" t="367" r="45" b="372">a</charParams>
<charParams l="45" t="366" r="48" b="372">t</charParams>
<charParams l="48" t="365" r="50" b="372">i</charParams>
<charParams l="50" t="367" r="55" b="372">c</charParams>
<charParams l="55" t="367" r="60" b="372">s</charParams>
<charParams l="61" t="365" r="62" b="372"> </charParams>
<charParams l="62" t="365" r="68" b="372">A</charParams>
<charParams l="68" t="367" r="75" b="372">w</charParams>
<charParams l="75" t="367" r="80" b="372">a</charParams>
<charParams l="80" t="367" r="83" b="372">r</charParams>
<charParams l="83" t="367" r="88" b="372">e</charParams>
<charParams l="88" t="367" r="93" b="372">n</charParams>
<charParams l="93" t="367" r="98" b="372">e</charParams>
<charParams l="98" t="367" r="103" b="372">s</charParams>
<charParams l="103" t="367" r="107" b="372">s</charParams>
<charParams l="108" t="365" r="109" b="372"> </charParams>
<charParams l="110" t="365" r="117" b="372">M</charParams>
<charParams l="118" t="367" r="123" b="372">o</charParams>
<charParams l="123" t="367" r="127" b="372">n</charParams>
<charParams l="128" t="366" r="131" b="372">t</charParams>
<charParams l="131" t="365" r="136" b="372">h</charParams>
<charParams l="137" t="365" r="138" b="372"> </charParams>
<charParams l="139" t="367" r="143" b="370">•</charParams>
<charParams l="144" t="365" r="145" b="372"> </charParams>
<charParams l="146" t="365" r="152" b="372">A</charParams>
<charParams l="151" t="367" r="157" b="374">p</charParams>
<charParams l="157" t="367" r="162" b="372">n</charParams>
<charParams l="162" t="366" r="164" b="372" suspicious="1">l</charParams>
<charParams l="165" t="365" r="166" b="372"> </charParams>
<charParams l="167" t="365" r="171" b="372">2</charParams>
<charParams l="172" t="365" r="177" b="372">0</charParams>
<charParams l="177" t="365" r="182" b="372">0</charParams>
<charParams l="182" t="365" r="187" b="372">8</charParams>
<charParams l="188" t="365" r="190" b="372"> </charParams>
<charParams l="191" t="367" r="194" b="370">•</charParams>
<charParams l="195" t="365" r="197" b="372"> </charParams>
<charParams l="198" t="365" r="206" b="372">M</charParams>
<charParams l="206" t="367" r="211" b="372">a</charParams>
<charParams l="212" t="366" r="215" b="372">t</charParams>
<charParams l="216" t="365" r="220" b="372">h</charParams>
<charParams l="221" t="367" r="226" b="372" suspicious="1">e</charParams>
<charParams l="227" t="367" r="235" b="372">m</charParams>
<charParams l="235" t="367" r="240" b="372">a</charParams>
<charParams l="241" t="366" r="244" b="372">t</charParams>
<charParams l="245" t="365" r="246" b="372">i</charParams>
<charParams l="247" t="367" r="252" b="372">c</charParams>
<charParams l="253" t="367" r="257" b="372">s</charParams>
<charParams l="258" t="367" r="260" b="372"> </charParams>
<charParams l="261" t="367" r="265" b="372">a</charParams>
<charParams l="266" t="367" r="271" b="372">n</charParams>
<charParams l="272" t="365" r="277" b="372">d</charParams>
<charParams l="278" t="365" r="279" b="372"> </charParams>
<charParams l="280" t="365" r="286" b="372">V</charParams>
<charParams l="286" t="367" r="291" b="372">o</charParams>
<charParams l="291" t="366" r="295" b="372">t</charParams>
<charParams l="295" t="365" r="297" b="372">i</charParams>
<charParams l="298" t="367" r="303" b="372">n</charParams>
<charParams l="304" t="367" r="309" b="374">g</charParams></formatting></line></par>
</text>
</block>
<block blockType="Separator" l="0" t="379" r="310" b="387"><region><rect l="0" t="379" r="310" b="387"/></region>
<separator type="Black" thickness="8"><start x="0" y="383"/><end x="310" y="383"/>
</separator>
</block>
<block blockType="Picture" l="1" t="362" r="316" b="376"><region><rect l="1" t="362" r="316" b="376"/></region>
</block>
</page>
</document>
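How can I use simplexml_load_string to get the content of every charParams element in this XML document? I want to echo the content of each charParams inside a line, and after each line echo a <br>. Thanks! Oh, would this be done with foreach? Please point me in the right direction.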
For this example, I would like it to output something like the following:
What Makes<br>Your Vote<br>Matter?<br>Find out how at
and so on, just to give you the basic idea.
Solution
You need to loop over the blocks, paragraphs, lines, and characters, like this:
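$string = '';
// 'file' is a placeholder for the path to the OCR XML file.
$xml = simplexml_load_file('file');
foreach ($xml->page->block as $block) {
    // Skip blocks that have no text content (e.g. Separator or Picture blocks).
    if ($block->text->count()) {
        foreach ($block->text->par as $par) {
            // Skip empty paragraphs.
            if ($par->line->count()) {
                foreach ($par->line as $line) {
                    foreach ($line->formatting->charParams as $char) {
                        $string .= $char;
                    }
                    $string .= "\n"; // end of line
                }
            }
            $string .= "\n"; // end of paragraph
        }
    }
    $string .= "\n"; // end of block
}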
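Edit: I added more error checking to avoid warnings in the loops. Currently everything is joined with newlines, so you may need to change those lines to alter the output.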
Here's a working example.
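For the <br>-separated output the question asks for, the same loop can be adapted as in the minimal sketch below. It assumes the XML arrives as a string (so simplexml_load_string is used instead of simplexml_load_file) and that the text will be echoed into HTML, which is why each character is passed through htmlspecialchars. The function name ocrLinesToHtml and the trimmed-down $sampleXml document are hypothetical and only serve the illustration; they are not part of the original answer.

```php
<?php
// Hypothetical, heavily trimmed sample in the same ABBYY schema as the question.
$sampleXml = <<<'XML'
<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml">
<page>
<block blockType="Text"><text>
<par><line><formatting>
<charParams>A</charParams>
<charParams> </charParams>
<charParams>b</charParams></formatting></line></par>
<par><line><formatting>
<charParams>C</charParams>
<charParams>?</charParams></formatting></line></par>
<par></par>
</text></block>
<block blockType="Picture"></block>
</page>
</document>
XML;

// Hypothetical helper: returns the text of every line, each followed by <br>.
function ocrLinesToHtml(string $xmlString): string
{
    $xml = simplexml_load_string($xmlString);
    if ($xml === false) {
        throw new RuntimeException('Could not parse the XML string.');
    }

    $html = '';
    foreach ($xml->page as $page) {
        foreach ($page->block as $block) {
            // Picture and Separator blocks have no <text> child; skip them.
            if (!isset($block->text)) {
                continue;
            }
            foreach ($block->text->par as $par) {
                // An empty <par> has no <line> children, so this loop body never runs for it.
                foreach ($par->line as $line) {
                    foreach ($line->formatting as $formatting) {
                        foreach ($formatting->charParams as $char) {
                            // Casting to string yields the element's text content.
                            $html .= htmlspecialchars((string) $char);
                        }
                    }
                    $html .= '<br>'; // one <br> after each line
                }
            }
        }
    }
    return $html;
}

echo ocrLinesToHtml($sampleXml), "\n";
```

Using isset() in place of the count() checks is a design choice of this sketch, not a correction of the answer above; both aim to avoid iterating over elements that do not exist. If a line break per paragraph or per block is wanted as well, an extra <br> can be appended after the corresponding inner loop.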