Unicode Encode Error in Python - Parsing XML, can't encode character (star)

2022-01-10 00:00:00 python unicode xml-parsing

Problem description

I am a beginner in Python and am currently parsing a web-based XML file from the eventful.com API. However, I am getting Unicode errors when retrieving certain elements of the data.

I can retrieve the 5 data elements I want from the XML file without any problems, but then the program terminates and produces the following error in the GAE error console:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2605' in position 0: ordinal not in range(128)

I know that the character tripping up my parser is the "★" character, which I would prefer not to retrieve from the XML file anyway.
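To illustrate what the error message is saying, here is a minimal sketch in Python 3 syntax. The sample title is made up for illustration and is not taken from the eventful.com feed.

```python
# The star is U+2605, which lies outside the 128-character ASCII range.
star = u'\u2605'

try:
    star.encode('ascii')  # strict mode: raises on any non-ASCII character
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# With errors='ignore', characters that cannot be encoded are silently dropped.
sample_title = u'\u2605 Jazz Night'  # hypothetical title, for illustration only
cleaned = sample_title.encode('ascii', 'ignore')
```

Here `error_message` carries the same "ordinal not in range(128)" complaint as the traceback, while `cleaned` keeps only the ASCII part of the title.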

My code is as follows:

# Imports assumed here; the original post did not show them.
import urllib
import webapp2
import xml.dom.minidom as mdom

class XMLParser(webapp2.RequestHandler):
    def get(self):
        base_url = 'my xml file'
        # downloads data from xml file
        response = urllib.urlopen(base_url)
        # converts data to string
        data = response.read()
        # closes file
        response.close()
        # parses xml downloaded
        dom = mdom.parseString(data)
        node = dom.documentElement
        # print out all event names (titles) found in the eventful xml
        event_main = dom.getElementsByTagName('event')
        event_names = []
        for event in event_main:
            eventObj = event.getElementsByTagName("title")[0]
            event_names.append(eventObj)
        for ev in event_names:
            nodes = ev.childNodes
            for node in nodes:
                if node.nodeType == node.TEXT_NODE:
                    print node.data

Is there any way to retrieve the "title" elements and ignore unusual characters like the ★ here? I would really appreciate any help on this. I have already tried solutions that use word.encode('us-ascii', 'ignore'), but this did not fix the issue.
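For reference, one way the encode(..., 'ignore') idea could be applied is to the text of each title node after parsing, rather than to the raw download. The sketch below uses Python 3 syntax; `SAMPLE_XML` and `ascii_titles` are hypothetical names, and the sample document only stands in for the real eventful.com response.

```python
import xml.dom.minidom as mdom

# Hypothetical sample standing in for the eventful.com response (UTF-8 bytes).
SAMPLE_XML = (
    u'<events>'
    u'<event><title>\u2605 Jazz Night</title></event>'
    u'<event><title>Book Fair</title></event>'
    u'</events>'
).encode('utf-8')


def ascii_titles(xml_bytes):
    """Return the text of every <title>, with non-ASCII characters dropped."""
    dom = mdom.parseString(xml_bytes)
    titles = []
    for title in dom.getElementsByTagName('title'):
        # Join the text nodes directly under this <title> element.
        text = u''.join(node.data for node in title.childNodes
                        if node.nodeType == node.TEXT_NODE)
        # Drop anything outside ASCII, then trim leftover whitespace.
        titles.append(text.encode('ascii', 'ignore').decode('ascii').strip())
    return titles


titles = ascii_titles(SAMPLE_XML)
```

The DOM itself still holds the original Unicode text; only the extracted titles are reduced to ASCII.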

-----------I HAVE FOUND THE SOLUTION:

After struggling with this problem and talking to a lecturer about it, I found that all it took was two lines of code to decode and then re-encode the downloaded XML data (after it was read into the program). Hope this helps someone else with the same issue!

unicode_data = data.decode('utf-8')
data = unicode_data.encode('ascii','ignore')
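As a rough end-to-end check of those two lines, the sketch below (Python 3 syntax) runs them on a made-up UTF-8 byte string instead of a real download and then parses the result. It assumes the response is UTF-8 encoded; the sample document is hypothetical.

```python
import xml.dom.minidom as mdom

# Hypothetical stand-in for response.read(): UTF-8 bytes containing a star.
data = u'<events><event><title>\u2605 Jazz Night</title></event></events>'.encode('utf-8')

# The two lines from the post: decode the bytes, then drop non-ASCII characters.
unicode_data = data.decode('utf-8')
data = unicode_data.encode('ascii', 'ignore')

# The cleaned bytes are plain ASCII, so parsing and reading the title is safe.
dom = mdom.parseString(data)
title_text = dom.getElementsByTagName('title')[0].firstChild.data
```

Note that 'ignore' removes every non-ASCII character, so accented letters in titles would be dropped along with the star.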

Solution

Where are you using your decoding methods?

I had this error in the past and had to decode the raw data. In other words, I would try:

data = response.read()
# closes file
response.close()
# decode
data.encode("us-ascii")

That is, if it is in fact ASCII. My point is: make sure you encode/decode the raw results while they are still in string format, before you call parseString on them.
