Which XML parser to use here?

2022-01-10 00:00:00 xml xml-parsing java jaxb

I am receiving an XML file as input, whose size can vary from a few KB to a lot more. I am getting this file over a network. I need to extract only a small number of nodes, so most of the document is useless to me. I have no memory constraints; I just need speed.

Considering all this, I concluded:

Not using DOM here (the document may be huge, there is no CRUD requirement, and the source is a network).

No SAX, as I only need a small subset of the data.

StAX could be the way to go, but I am not sure whether it is the fastest way.

JAXB came up as another option, but what sort of parser does it use? I read that it uses Xerces by default (which type is that: push or pull?), although I can configure it for use with StAX or Woodstox as per this link.

I have been reading a lot and am still confused by so many options! Any help would be appreciated.

Thanks!

Edit: I want to add one more question here: what is wrong with using JAXB here?

Recommended answer

By far the fastest solution is a StAX parser, especially since you only need a specific subset of the XML file: with StAX you pull events yourself and can easily skip whatever you don't need, whereas a SAX parser would push every event to you anyway.
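To make the "pull only what you need" point concrete, here is a minimal sketch, not taken from the answer's repository. The class name `StaxSubsetSketch`, the method `firstMatches`, and the sample element names are hypothetical. It assumes the wanted elements contain only text (otherwise `getElementText()` throws), and it stops reading as soon as enough matches are found, so the rest of the stream is never parsed.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not part of the original answer.
final class StaxSubsetSketch {

    /** Collects the text of up to {@code limit} elements named {@code wanted}, then stops reading. */
    static List<String> firstMatches(InputStream in, String wanted, int limit)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Input arrives over a network, so disable DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<String> found = new ArrayList<>();
        try {
            // Pull events one at a time; anything that is not a wanted start tag is skipped.
            while (found.size() < limit && reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && wanted.equals(reader.getLocalName())) {
                    // Assumption: the element holds text only (no child elements).
                    found.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return found;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<feed><meta>ignored</meta><item>a</item>"
                + "<junk><deep>x</deep></junk><item>b</item><item>c</item></feed>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(firstMatches(in, "item", 2)); // prints [a, b]
    }
}
```

In a real setup the `InputStream` would presumably come straight from the network connection, so the document never has to be buffered in full.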

But it's also a little more complicated than using SAX or DOM. The other day I had to write a StAX parser for the following XML:

<?xml version="1.0"?>
<table>
    <row>
        <column>1</column>
        <column>Nome</column>
        <column>Sobrenome</column>
        <column>email@gmail.com</column>
        <column></column>
        <column>2011-06-22 03:02:14.915</column>
        <column>2011-06-22 03:02:25.953</column>
        <column></column>
        <column></column>
    </row>
</table>

Here's what the final parser code looks like:

import java.io.ByteArrayInputStream;
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
// Assumption: these helpers come from Apache Commons IO and Commons Lang;
// the original post does not show its imports.
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringEscapeUtils;
import org.apache.commons.lang.WordUtils;

// Inscrito ("registrant") is the author's own Comparable bean; it is not shown in the post.
public class Parser {

    private String[] files;

    public Parser(String... files) {
        this.files = files;
    }

    // Reads every file and builds one Inscrito per <row> element.
    private List<Inscrito> process() {
        List<Inscrito> inscritos = new ArrayList<Inscrito>();

        for (String file : files) {
            XMLInputFactory factory = XMLInputFactory.newFactory();

            try {
                String content = StringEscapeUtils.unescapeXml(
                        FileUtils.readFileToString(new File(file)));

                // Note: getBytes() uses the platform default charset.
                XMLStreamReader parser = factory.createXMLStreamReader(
                        new ByteArrayInputStream(content.getBytes()));

                String currentTag = null;
                int columnCount = 0;   // index of the <column> being read within the current row
                Inscrito inscrito = null;

                while (parser.hasNext()) {
                    int currentEvent = parser.next();

                    switch (currentEvent) {
                        case XMLStreamReader.START_ELEMENT:
                            currentTag = parser.getLocalName();
                            if ("row".equals(currentTag)) {
                                // A new row starts: reset the column index.
                                columnCount = 0;
                                inscrito = new Inscrito();
                            }
                            break;

                        case XMLStreamReader.END_ELEMENT:
                            currentTag = parser.getLocalName();
                            if ("row".equals(currentTag)) {
                                inscritos.add(inscrito);
                            }
                            if ("column".equals(currentTag)) {
                                columnCount++;
                            }
                            break;

                        case XMLStreamReader.CHARACTERS:
                            if ("column".equals(currentTag)) {
                                String text = parser.getText().trim().replaceAll( " " , " ");
                                // Only the first four columns are used; the rest are ignored.
                                switch (columnCount) {
                                    case 0:
                                        inscrito.setId(Integer.valueOf(text));
                                        break;
                                    case 1:
                                        inscrito.setFirstName(WordUtils.capitalizeFully(text));
                                        break;
                                    case 2:
                                        inscrito.setLastName(WordUtils.capitalizeFully(text));
                                        break;
                                    case 3:
                                        inscrito.setEmail(text);
                                        break;
                                }
                            }
                            break;
                    }
                }

                parser.close();
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }

        Collections.sort(inscritos);
        return inscritos;
    }

    // Groups the parsed entries by their initial, preserving sorted order.
    public Map<String, List<Inscrito>> parse() {
        List<Inscrito> inscritos = this.process();
        Map<String, List<Inscrito>> resultado = new LinkedHashMap<String, List<Inscrito>>();

        for (Inscrito i : inscritos) {
            List<Inscrito> lista = resultado.get(i.getInicial());
            if (lista == null) {
                lista = new ArrayList<Inscrito>();
                resultado.put(i.getInicial(), lista);
            }
            lista.add(i);
        }

        return resultado;
    }
}

The code itself is in Portuguese, but it should be straightforward to understand what it does; here's the repo on GitHub.
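As a possible simplification, and not the author's code, the column-by-column reading can also be sketched with `XMLStreamReader.getElementText()`, which returns an empty string for empty elements and joins character data that arrives in several chunks. The class `TableRowsSketch` and the method `readRows` are hypothetical names. The sketch assumes each `<column>` contains text only, and it returns plain string lists rather than `Inscrito` objects.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not part of the original answer.
final class TableRowsSketch {

    /** Returns one list of trimmed column values per <row> element. */
    static List<List<String>> readRows(Reader source) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(source);
        List<List<String>> rows = new ArrayList<>();
        List<String> current = null; // columns of the row being read, or null outside a row
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("row".equals(name)) {
                        current = new ArrayList<>();
                    } else if ("column".equals(name) && current != null) {
                        // Consumes the element up to its end tag; "" for <column></column>.
                        current.add(reader.getElementText().trim());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "row".equals(reader.getLocalName())) {
                    rows.add(current);
                    current = null;
                }
            }
        } finally {
            reader.close();
        }
        return rows;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<?xml version=\"1.0\"?><table><row>"
                + "<column>1</column><column>Nome</column><column></column>"
                + "</row></table>";
        for (List<String> row : readRows(new StringReader(xml))) {
            System.out.println(row.size() + " columns, id=" + row.get(0));
        }
    }
}
```

Because the column index is simply the position in the list, empty columns keep their place and no separate counter is needed.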
