使用ElementTree在python中解析xml

2022-01-10 00:00:00 python xml parsing xml-parsing

问题描述

我对 python 很陌生，我需要先解析一些需要清理的脏 xml 文件.

I'm very new to python and I need to parse some dirty xml files which need sanitising first.

我有以下 python 代码:

I have the following python code:

import arff import xml.etree.ElementTree import re totstring="" with open('input.sgm', 'r') as inF: for line in inF: string=re.sub("[^0-9a-zA-Z<>/s=!-""]+","", line) totstring+=string data=xml.etree.ElementTree.fromstring(totstring) print data file.close

解析:

<!DOCTYPE lewis SYSTEM "lewis.dtd"> <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1"> <DATE>26-FEB-1987 15:01:01.79</DATE> <TOPICS><D>cocoa</D></TOPICS> <PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES> <PEOPLE></PEOPLE> <ORGS></ORGS> <EXCHANGES></EXCHANGES> <COMPANIES></COMPANIES> <UNKNOWN> C T f0704reute u f BC-BAHIA-COCOA-REVIEW 02-26 0105</UNKNOWN> <TEXT> <TITLE>BAHIA COCOA REVIEW</TITLE> <DATELINE> SALVADOR, Feb 26 - </DATELINE><BODY>Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, although normal humidity levels have not been restored, Comissaria Smith said in its weekly review. The dry period means the temporao will be late this year. Arrivals for the week ended February 22 were 155,221 bags of 60 kilos making a cumulative total for the season of 5.93 mln against 5.81 at the same stage last year. Again it seems that cocoa delivered earlier on consignment was included in the arrivals figures. Comissaria Smith said there is still some doubt as to how much old crop cocoa is still available as harvesting has practically come to an end. With total Bahia crop estimates around 6.4 mln bags and sales standing at almost 6.2 mln there are a few hundred thousand bags still in the hands of farmers, middlemen, exporters and processors. There are doubts as to how much of this cocoa would be fit for export as shippers are now experiencing dificulties in obtaining +Bahia superior+ certificates. In view of the lower quality over recent weeks farmers have sold a good part of their cocoa held on consignment. Comissaria Smith said spot bean prices rose to 340 to 350 cruzados per arroba of 15 kilos. Bean shippers were reluctant to offer nearby shipment and only limited sales were booked for March shipment at 1,750 to 1,780 dlrs per tonne to ports to be named. New crop sales were also light and all to open ports with June/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs under New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs per tonne FOB. Routine sales of butter were made. March/April sold at 4,340, 4,345 and 4,350 dlrs. April/May butter went at 2.27 times New York May, June/July at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at 2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and 2.27 times New York Dec, Comissaria Smith said. Destinations were the U.S., Covertible currency areas, Uruguay and open ports. Cake sales were registered at 785 to 995 dlrs for March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times New York Dec for Oct/Dec. Buyers were the U.S., Argentina, Uruguay and convertible currency areas. Liquor sales were limited with March/April selling at 2,325 and 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New York July, Aug/Sept at 2,400 dlrs and at 1.25 times New York Sept and Oct/Dec at 1.25 times New York Dec, Comissaria Smith said. Total Bahia sales are currently estimated at 6.13 mln bags against the 1986/87 crop and 1.06 mln bags against the 1987/88 crop. Final figures for the period to February 28 are expected to be published by the Brazilian Cocoa Trade Commission after carnival which ends midday on February 27. Reuter </BODY></TEXT> </REUTERS>

我现在如何才能只从 body 标记中获取文本?

How can I now go about getting just the text from inside the body tag?

我看到的所有教程都依赖于直接从文件中读取 xml，以便 Elementtree.parse 工作.当我试图从一个字符串中解析时，这将不起作用，这会破坏我阅读的很多教程.

All the tutorials i have seen rely on reading the xml directly from a file so that Elementtree.parse works. As I am trying to parse from a string this will not work and this breaks a lot of tutorials I have read.

非常感谢

解决方案

如果您不关心(可能是混乱的)XML 文档的特定结构，而只想快速获取给定标签/元素的内容，您可能想尝试 BeautifulSoup 模块.

If you don't care about the particular structure of a (potentially messy) XML document and just want to quickly get the contents of a given tag/element, you may want to try the BeautifulSoup module.

import BeautifulSoup from BeautifulSoup import BeautifulSoup soup = BeautifulSoup(totstring) body = soup.find("body") bodytext = body.text

相关文章