How can I efficiently store a parsed XML document in a MySQL database using Python?

2022-01-10 00:00:00 python xml xml-parsing dom mysql

The following is the XML file, book.xml:

    <?xml version="1.0" ?>
    <bookstore>
      <book _id="E7854">
        <title> Sample XML Book </title>
        <author>
          <name _id="AU363">
            <first> Benjamin </first>
            <last> Smith </last>
          </name>
          <affiliation> A </affiliation>
        </author>
        <chapter number="1">
          <title> First Chapter </title>
          <para> B <count> 783 </count> . </para>
        </chapter>
        <chapter number="3">
          <title> Third Chapter </title>
          <para> B <count> 59 </count> . </para>
        </chapter>
      </book>
      <book _id="C843">
        <title> XML Master </title>
        <author>
          <name _id="AU245">
            <first> John </first>
            <last> Doe </last>
          </name>
          <affiliation> C </affiliation>
        </author>
        <chapter number="2">
          <title> Second Chapter </title>
          <para> K <count> 54 </count> . </para>
        </chapter>
        <chapter number="3">
          <title> Third Chapter </title>
          <para> K <count> 328 </count> . </para>
        </chapter>
        <chapter number="7">
          <title> Seventh Chapter </title>
          <para> K <count> 265 </count> . </para>
        </chapter>
        <chapter number="9">
          <title> Ninth Chapter </title>
          <para> K <count> 356 </count> . </para>
        </chapter>
      </book>
    </bookstore>

The following is the Python code, book_dom.py:

    from xml.dom import minidom, Node
    import re, textwrap

    class SampleScanner:
        def __init__(self, doc):
            # Walk the document and hand the root <bookstore> element on.
            for child in doc.childNodes:
                if child.nodeType == Node.ELEMENT_NODE and child.tagName == 'bookstore':
                    self.handleBookStore(child)

        def gettext(self, nodelist):
            # Recursively collect text and collapse runs of whitespace.
            retlist = []
            for node in nodelist:
                if node.nodeType == Node.TEXT_NODE:
                    retlist.append(node.wholeText)
                elif node.hasChildNodes():
                    retlist.append(self.gettext(node.childNodes))
            return re.sub(r'\s+', ' ', ''.join(retlist))

        def handleBookStore(self, node):
            for child in node.childNodes:
                if child.nodeType != Node.ELEMENT_NODE:
                    continue
                if child.tagName == 'book':
                    self.handleBook(child)

        def handleBook(self, node):
            # The _id attribute is what appears as "Book ID" in the output.
            print("Book ID :", node.getAttribute('_id'))
            for child in node.childNodes:
                if child.nodeType != Node.ELEMENT_NODE:
                    continue
                if child.tagName == 'title':
                    print("Book title is:", self.gettext(child.childNodes))
                if child.tagName == 'author':
                    self.handleAuthor(child)
                if child.tagName == 'chapter':
                    self.handleChapter(child)

        def handleAuthor(self, node):
            for child in node.childNodes:
                if child.nodeType != Node.ELEMENT_NODE:
                    continue
                if child.tagName == 'name':
                    self.handleAuthorName(child)
                elif child.tagName == 'affiliation':
                    print("Author affiliation:", self.gettext([child]))

        def handleAuthorName(self, node):
            # The _id attribute is what appears as "Name ID" in the output.
            print("Name ID :", node.getAttribute('_id'))
            surname = self.gettext(node.getElementsByTagName("last"))
            givenname = self.gettext(node.getElementsByTagName("first"))
            print("Author Name: %s, %s" % (surname, givenname))

        def handleChapter(self, node):
            print(" *** Start of Chapter %s: %s" % (node.getAttribute('number'),
                  self.gettext(node.getElementsByTagName('title'))))
            for child in node.childNodes:
                if child.nodeType != Node.ELEMENT_NODE:
                    continue
                if child.tagName == 'para':
                    self.handlePara(child)

        def handlePara(self, node):
            # Wrap the paragraph text and follow it with a blank line.
            partext = self.gettext([node])
            partext = textwrap.fill(partext)
            print(partext)
            print()

    doc = minidom.parse('book.xml')
    SampleScanner(doc)

Output:

    ~/$ python book_dom.py
    Book ID : E7854
    Book title is: Sample XML Book
    Name ID : AU363
    Author Name: Smith , Benjamin
    Author affiliation: A
     *** Start of Chapter 1: First Chapter
    B 783 .
     *** Start of Chapter 3: Third Chapter
    B 59 .
    Book ID : C843
    Book title is: XML Master
    Name ID : AU245
    Author Name: Doe , John
    Author affiliation: C
     *** Start of Chapter 2: Second Chapter
    K 54 .
     *** Start of Chapter 3: Third Chapter
    K 328 .
     *** Start of Chapter 7: Seventh Chapter
    K 265 .
     *** Start of Chapter 9: Ninth Chapter
    K 356 .

My aim is to store the books in a "Books" table and the author info in an "Authors" table, preserving the book -> author relationship, in a MySQL database.

**Book table:**

    id    | title
    E7854 | Sample XML Book
    ....

**Chapter table:**

    book_id | chapter_number | title         | para
    E7854   | 1              | First Chapter | B 783 .
    E7854   | 3              | Third Chapter | B 59 .
    ....

**Author table:**

    id    | book_id | name           | affiliation
    AU363 | E7854   | Smith Benjamin | A
    ....

How do I go about storing the data in the database if I have a few thousand books and authors (and chapters)? I am having trouble uniquely identifying the dataset for each book and author. I can use the IDs and pass them to the functions to preserve the relationship, but I am not sure that is the best way to do it. Any pointers are highly appreciated.

P.S.: I am working on the SQL part of the script and will update once I have tested it. Feel free to post your thoughts and code samples. Thanks!

Recommended answer

Based on your comment above, I would simply create a Book class, an Author class, an AuthorList, and a Chapter class. Store each book's chapters as a list of Chapter objects on the Book itself. Maintain the AuthorList as a dict keyed by author ID, with each value pointing to the actual Author object. Keep the author's ID in a data member of the Book object; for convenience, you can provide a method that pulls the author out of the AuthorList dict.
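A minimal sketch of that layout might look like the following. It is only an illustration under a few assumptions: each book has exactly one author (as in the sample file), the helper names `kids`, `text`, `load`, and `store` are made up for this example, and `sqlite3` is swapped in for MySQL purely so the snippet runs without a server. The table and column names follow the layout given in the question.

```python
import sqlite3
from dataclasses import dataclass, field
from xml.dom import minidom

# Trimmed copy of the first book from book.xml, inlined to keep the sketch self-contained.
SAMPLE = """<bookstore><book _id="E7854">
<title> Sample XML Book </title>
<author><name _id="AU363"><first> Benjamin </first><last> Smith </last></name>
<affiliation> A </affiliation></author>
<chapter number="1"><title> First Chapter </title><para> B <count> 783 </count> . </para></chapter>
<chapter number="3"><title> Third Chapter </title><para> B <count> 59 </count> . </para></chapter>
</book></bookstore>"""

@dataclass
class Author:
    id: str
    name: str
    affiliation: str

@dataclass
class Chapter:
    number: int
    title: str
    para: str

@dataclass
class Book:
    id: str
    title: str
    author_id: str                      # key into the authors dict
    chapters: list = field(default_factory=list)

    def author(self, authors):
        """Convenience lookup of this book's Author object."""
        return authors[self.author_id]

def kids(node, tag):
    """Direct child elements of `node` named `tag` (hypothetical helper)."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName == tag]

def text(node):
    """All descendant text of `node`, with whitespace collapsed (hypothetical helper)."""
    parts = [c.data if c.nodeType == c.TEXT_NODE else text(c) for c in node.childNodes]
    return " ".join("".join(parts).split())

def load(doc):
    """Build Book objects plus an {author_id: Author} dict (assumes one author per book)."""
    books, authors = [], {}
    for b in kids(doc.documentElement, "book"):
        a = kids(b, "author")[0]
        name = kids(a, "name")[0]
        author = Author(name.getAttribute("_id"),
                        text(kids(name, "last")[0]) + " " + text(kids(name, "first")[0]),
                        text(kids(a, "affiliation")[0]))
        authors[author.id] = author
        book = Book(b.getAttribute("_id"), text(kids(b, "title")[0]), author.id)
        for c in kids(b, "chapter"):
            book.chapters.append(Chapter(int(c.getAttribute("number")),
                                         text(kids(c, "title")[0]), text(kids(c, "para")[0])))
        books.append(book)
    return books, authors

# Schema mirroring the three tables from the question (SQLite syntax, not MySQL DDL).
SCHEMA = """
CREATE TABLE book    (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE author  (id TEXT, book_id TEXT REFERENCES book(id), name TEXT, affiliation TEXT);
CREATE TABLE chapter (book_id TEXT REFERENCES book(id), chapter_number INTEGER, title TEXT, para TEXT);
"""

def store(conn, books, authors):
    """Bulk-insert the parent rows first, then the rows that carry book_id."""
    cur = conn.cursor()
    cur.executemany("INSERT INTO book VALUES (?, ?)", [(b.id, b.title) for b in books])
    cur.executemany("INSERT INTO author VALUES (?, ?, ?, ?)",
                    [(a.id, b.id, a.name, a.affiliation)
                     for b in books for a in [b.author(authors)]])
    cur.executemany("INSERT INTO chapter VALUES (?, ?, ?, ?)",
                    [(b.id, c.number, c.title, c.para) for b in books for c in b.chapters])
    conn.commit()

books, authors = load(minidom.parseString(SAMPLE))
conn = sqlite3.connect(":memory:")      # stand-in for a MySQL connection
conn.executescript(SCHEMA)
store(conn, books, authors)
rows = conn.execute("SELECT book_id, chapter_number, title, para FROM chapter "
                    "ORDER BY chapter_number").fetchall()
print(rows)
```

Because every row is built from the in-memory objects, the book ID travels with each chapter and author row, so there is no need to thread IDs through the parsing functions. With a MySQL driver such as PyMySQL or mysql-connector-python the same `executemany` pattern should apply, but the placeholder is `%s` rather than `?`, and the schema would need MySQL column types.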
