How can I efficiently store a parsed XML document in a MySQL database using Python?

2022-01-10 00:00:00 python xml xml-parsing dom mysql

The following is the XML file, book.xml:

<?xml version="1.0" ?>
<!--Sample XML Document-->
<bookstore>
    <book _id="E7854">
       <title>
          Sample XML Book
       </title>
       <author>
          <name _id="AU363">
         <first>
            Benjamin
         </first>

         <last>
            Smith
         </last>
          </name>
          <affiliation>
         A
          </affiliation>
       </author>
       <chapter number="1">
          <title>
         First Chapter
          </title>
          <para>
         B
         <count>
            783
         </count>
         .
          </para>
       </chapter>
       <chapter number="3">
          <title>
         Third Chapter
          </title>
          <para>
         B
         <count>
           59
         </count>
         .
          </para>
       </chapter>
    </book>
    <book _id="C843">
       <title>
          XML Master
       </title>
       <author>
          <name _id="AU245">
         <first>
            John
         </first>

         <last>
            Doe
         </last>
          </name>
          <affiliation>
         C
          </affiliation>
       </author>
       <chapter number="2">
          <title>
         Second Chapter
          </title>
          <para>
         K
         <count>
            54
         </count>
         .
          </para>
       </chapter>
       <chapter number="3">
          <title>
         Third Chapter
          </title>
          <para>
         K
         <count>
            328
         </count>
         .
          </para>
       </chapter>
       <chapter number="7">
          <title>
         Seventh Chapter
          </title>
          <para>
         K
         <count>
            265
         </count>
         .
          </para>
       </chapter>
       <chapter number="9">
          <title>
         Ninth Chapter
          </title>
          <para>
         K
         <count>
            356
         </count>
         .
          </para>
       </chapter>
    </book> 
</bookstore>

The following is the Python code, book_dom.py:

from xml.dom import minidom, Node
import re
import textwrap

class SampleScanner:
    def __init__(self, doc):
        for child in doc.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and child.tagName == 'bookstore':
                self.handleBookStore(child)

    def gettext(self, nodelist):
        # Recursively collect the text of every node in nodelist.
        retlist = []
        for node in nodelist:
            if node.nodeType == Node.TEXT_NODE:
                retlist.append(node.wholeText)
            elif node.hasChildNodes():
                retlist.append(self.gettext(node.childNodes))

        # Collapse each run of whitespace into a single space.
        return re.sub(r'\s+', ' ', ''.join(retlist))

    def handleBookStore(self, node):
        for child in node.childNodes:
            if child.nodeType != Node.ELEMENT_NODE:
                continue
            if child.tagName == 'book':
                self.handleBook(child)

    def handleBook(self, node):
        # The _id attribute appears as "Book ID" in the output below.
        print("Book ID : ", node.getAttribute('_id'))
        for child in node.childNodes:
            if child.nodeType != Node.ELEMENT_NODE:
                continue
            if child.tagName == 'title':
                print("Book title is:", self.gettext(child.childNodes))
            if child.tagName == 'author':
                self.handleAuthor(child)
            if child.tagName == 'chapter':
                self.handleChapter(child)

    def handleAuthor(self, node):
        for child in node.childNodes:
            if child.nodeType != Node.ELEMENT_NODE:
                continue
            if child.tagName == 'name':
                self.handleAuthorName(child)
            elif child.tagName == 'affiliation':
                print("Author affiliation:", self.gettext([child]))

    def handleAuthorName(self, node):
        # The _id attribute appears as "Name ID" in the output below.
        print("Name ID : ", node.getAttribute('_id'))
        surname = self.gettext(node.getElementsByTagName("last"))
        givenname = self.gettext(node.getElementsByTagName("first"))
        print("Author Name: %s, %s" % (surname, givenname))

    def handleChapter(self, node):
        print(" *** Start of Chapter %s: %s" % (node.getAttribute('number'),
              self.gettext(node.getElementsByTagName('title'))))
        for child in node.childNodes:
            if child.nodeType != Node.ELEMENT_NODE:
                continue
            if child.tagName == 'para':
                self.handlePara(child)

    def handlePara(self, node):
        partext = self.gettext([node])
        partext = textwrap.fill(partext)
        print(partext)
        print()

doc = minidom.parse('book.xml')
SampleScanner(doc)

Output: ~/$ python book_dom.py

Book ID :  E7854
Book title is:  Sample XML Book 
Name ID :  AU363
Author Name:  Smith ,  Benjamin 
Author affiliation:  A 
 *** Start of Chapter 1:  First Chapter 
 B 783 .

 *** Start of Chapter 3:  Third Chapter 
 B 59 .

Book ID :  C843
Book title is:  XML Master 
Name ID :  AU245
Author Name:  Doe ,  John 
Author affiliation:  C 
 *** Start of Chapter 2:  Second Chapter 
 K 54 .

 *** Start of Chapter 3:  Third Chapter 
 K 328 .

 *** Start of Chapter 7:  Seventh Chapter 
 K 265 .

 *** Start of Chapter 9:  Ninth Chapter 
 K 356 .

My aim is to store the books in a "Books" table and the author info in an "Authors" table (preserving the book -> author relationship) in a MySQL database.

**Book table:**
id    | title
E7854 | Sample XML Book
....

**Chapter table:**
book_id | chapter_number | title         | para
E7854   | 1              | First Chapter | B 783 .
E7854   | 3              | Third Chapter | B 59 .
....

**Author table:**
id    | book_id | name           | affiliation
AU363 | E7854   | Smith Benjamin | A
....
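
For illustration, the three tables above could be declared roughly as follows. This is only a sketch under stated assumptions: the column types, key choices, and the lowercase table names `book`, `author`, and `chapter` are hypothetical, and `sqlite3` from the standard library stands in for a MySQL driver so the example runs without a server. With a MySQL driver such as MySQL Connector/Python or PyMySQL, the placeholders would be `%s` instead of `?`.

```python
import sqlite3  # stand-in for a MySQL driver; both follow the Python DB-API

# Hypothetical schema mirroring the three tables sketched above.
# book_id in author and chapter preserves the book -> author/chapter relation.
SCHEMA = """
CREATE TABLE book (
    id    VARCHAR(16) PRIMARY KEY,
    title VARCHAR(255) NOT NULL
);
CREATE TABLE author (
    id          VARCHAR(16) NOT NULL,
    book_id     VARCHAR(16) NOT NULL,
    name        VARCHAR(255) NOT NULL,
    affiliation VARCHAR(255),
    PRIMARY KEY (id, book_id),
    FOREIGN KEY (book_id) REFERENCES book(id)
);
CREATE TABLE chapter (
    book_id        VARCHAR(16) NOT NULL,
    chapter_number INTEGER NOT NULL,
    title          VARCHAR(255),
    para           TEXT,
    PRIMARY KEY (book_id, chapter_number),
    FOREIGN KEY (book_id) REFERENCES book(id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Parameterized inserts; executemany handles many rows in one call.
conn.execute("INSERT INTO book VALUES (?, ?)", ("E7854", "Sample XML Book"))
conn.execute("INSERT INTO author VALUES (?, ?, ?, ?)",
             ("AU363", "E7854", "Smith Benjamin", "A"))
conn.executemany("INSERT INTO chapter VALUES (?, ?, ?, ?)", [
    ("E7854", 1, "First Chapter", "B 783 ."),
    ("E7854", 3, "Third Chapter", "B 59 ."),
])
conn.commit()

# Join the tables back together through book_id.
rows = conn.execute(
    "SELECT b.title, a.name, COUNT(c.chapter_number) "
    "FROM book b "
    "JOIN author a ON a.book_id = b.id "
    "JOIN chapter c ON c.book_id = b.id "
    "GROUP BY b.id, a.id"
).fetchall()
print(rows)
```

Because every author and chapter row carries the `book_id` read from the `_id` attribute, no extra bookkeeping is needed to identify which book a row belongs to.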

How do I go about storing the data in the database if I have a few thousand books and authors (and chapters)? I am having trouble uniquely identifying the dataset for each book/author. I can use the IDs and pass them to the functions to preserve the relation, but I am not sure that is the best way to do it. Any pointers are highly appreciated.

P.S.: I am working on the SQL part of the script and will update once I have tested it. Feel free to post your thoughts or code samples. Thanks!

Recommended answer

Based on your comment above, I would simply create a Book class, an Author class, an AuthorList, and a Chapter class. Assign the chapters of the book to a list of Chapter objects on the Book itself. Maintain the AuthorList as a dict keyed by author ID, pointing to the actual Author objects. Use a data member of the Book object to hold the author ID; for convenience, you can provide a method that pulls the author out of the AuthorList dict.
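
A minimal sketch of that object model follows, assuming one author per book as in the sample file. The class fields, the helper names `children`, `clean`, and `load`, and the trimmed `SAMPLE` document are all hypothetical; the plain dict `authors` plays the role of the AuthorList.

```python
import re
from dataclasses import dataclass, field
from xml.dom import minidom

@dataclass
class Author:
    id: str
    name: str
    affiliation: str

@dataclass
class Chapter:
    number: int
    title: str
    para: str

@dataclass
class Book:
    id: str
    title: str
    author_id: str                       # key into the shared authors dict
    chapters: list = field(default_factory=list)

    def author(self, authors):
        """Convenience lookup of this book's Author object."""
        return authors[self.author_id]

def children(node, tag):
    """Direct child elements of node with the given tag name."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName == tag]

def text_of(node):
    """Concatenate all text found below node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)

def clean(node):
    """Text below node with whitespace collapsed and trimmed."""
    return re.sub(r"\s+", " ", text_of(node)).strip()

def load(doc):
    """Build Book objects and an id -> Author dict from a parsed document."""
    books, authors = [], {}
    for b in doc.getElementsByTagName("book"):
        author_el = children(b, "author")[0]
        name_el = children(author_el, "name")[0]
        author = Author(
            id=name_el.getAttribute("_id"),
            name="%s %s" % (clean(children(name_el, "last")[0]),
                            clean(children(name_el, "first")[0])),
            affiliation=clean(children(author_el, "affiliation")[0]),
        )
        authors[author.id] = author
        book = Book(b.getAttribute("_id"),
                    clean(children(b, "title")[0]), author.id)
        for ch in children(b, "chapter"):
            book.chapters.append(Chapter(int(ch.getAttribute("number")),
                                         clean(children(ch, "title")[0]),
                                         clean(children(ch, "para")[0])))
        books.append(book)
    return books, authors

# Trimmed-down version of book.xml, used only for this demonstration.
SAMPLE = """<bookstore>
  <book _id="E7854">
    <title> Sample XML Book </title>
    <author>
      <name _id="AU363"><first> Benjamin </first><last> Smith </last></name>
      <affiliation> A </affiliation>
    </author>
    <chapter number="1"><title> First Chapter </title>
      <para> B <count> 783 </count> . </para></chapter>
    <chapter number="3"><title> Third Chapter </title>
      <para> B <count> 59 </count> . </para></chapter>
  </book>
</bookstore>"""

books, authors = load(minidom.parseString(SAMPLE))
book = books[0]
# Flatten into row tuples, ready for a DB-API cursor.executemany() call.
chapter_rows = [(book.id, c.number, c.title, c.para) for c in book.chapters]
print(book.author(authors).name)
print(chapter_rows)
```

Since each object carries its own ID, the rows for a database insert can be produced by iterating over `books` and `authors` once parsing is finished, rather than threading IDs through every handler function.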
