Jsoup - Extracting text

2022-01-24 00:00:00 iteration text-extraction java jsoup

I need to extract text from a node like this:

<div>
    Some text <b>with tags</b> might go here.
    <p>Also there are paragraphs</p>
    More text can go without paragraphs<br/>
</div>

And I need to build:

Some text <b>with tags</b> might go here.
Also there are paragraphs
More text can go without paragraphs

Element.text() returns all of the div's text, including the text inside its child elements. Element.ownText() returns only the text that is not inside child elements. Neither is what I need. Iterating through children() skips text nodes.
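To make the difference concrete, here is a minimal sketch that calls each accessor on the div from above. The class name TextAccessors and the helper parseDiv are made up for illustration; the jsoup calls used (Jsoup.parse, select, first, text, ownText, children, childNodes) are the ones discussed in this question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class TextAccessors {
    // Hypothetical helper: parse an HTML string and return its first <div>.
    static Element parseDiv(String html) {
        return Jsoup.parse(html).select("div").first();
    }

    public static void main(String[] args) {
        Element div = parseDiv("<div>Some text <b>with tags</b> might go here."
                + "<p>Also there are paragraphs</p>"
                + "More text can go without paragraphs<br/></div>");

        // text(): the div's text plus the text of all descendants.
        System.out.println("text():       " + div.text());
        // ownText(): only the text directly inside the div.
        System.out.println("ownText():    " + div.ownText());
        // children(): child elements only (<b>, <p>, <br>).
        System.out.println("children():   " + div.children().size() + " elements");
        // childNodes(): child elements and text nodes together.
        System.out.println("childNodes(): " + div.childNodes().size() + " nodes");
    }
}
```

The count from children() is smaller than the count from childNodes() because the text nodes between the elements are not reported by children().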

Is there a way to iterate over the contents of an element and receive the text nodes as well? E.g.

  • Text node - Some text
  • Node <b> - with tags
  • Text node - might go here.
  • Node <p> - Also there are paragraphs
  • Text node - More text can go without paragraphs
  • Node <br> - <empty>

Solution

Element.children() returns an Elements object - a list of Element objects. Looking at the parent class, Node, you'll see methods to give you access to arbitrary nodes, not just Elements, such as Node.childNodes().

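import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class ChildNodesDemo {
    public static void main(String[] args) {
        String str = "<div>" +
                "    Some text <b>with tags</b> might go here." +
                "    <p>Also there are paragraphs</p>" +
                "    More text can go without paragraphs<br/>" +
                "</div>";

        // Parse the markup and take the first <div>.
        Document doc = Jsoup.parse(str);
        Element div = doc.select("div").first();
        int i = 0;

        // childNodes() yields text nodes as well as elements, in document order.
        for (Node node : div.childNodes()) {
            i++;
            System.out.println(String.format("%d %s %s",
                    i,
                    node.getClass().getSimpleName(),
                    node.toString()));
        }
    }
}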

Result:

1 TextNode 
 Some text 
2 Element <b>with tags</b>
3 TextNode  might go here. 
4 Element <p>Also there are paragraphs</p>
5 TextNode  More text can go without paragraphs
6 Element <br/>
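
Building on childNodes(), the output asked for in the question could be assembled roughly as follows. This is only a sketch under some assumptions: <p> and <br> are treated as the only line boundaries, every other element (such as <b>) is kept as inline markup, and pretty-printing is switched off so that outerHtml() does not add indentation. The class name DivLines and the methods extractLines and flush are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class DivLines {
    // Hypothetical helper: split the first <div> of the markup into lines.
    static List<String> extractLines(String html) {
        Document doc = Jsoup.parse(html);
        // Assumption: disable pretty-printing so inline markup is emitted as-is.
        doc.outputSettings().prettyPrint(false);

        List<String> lines = new ArrayList<>();
        Element div = doc.select("div").first();
        if (div == null) {
            return lines;
        }

        StringBuilder current = new StringBuilder();
        for (Node node : div.childNodes()) {
            if (node instanceof TextNode) {
                // text() returns the node's text with whitespace normalised.
                current.append(((TextNode) node).text());
            } else if (node instanceof Element) {
                Element el = (Element) node;
                String tag = el.tagName();
                if (tag.equals("p")) {
                    // A paragraph ends the current line and forms its own line.
                    flush(current, lines);
                    current.append(el.text());
                    flush(current, lines);
                } else if (tag.equals("br")) {
                    // A line break only ends the current line.
                    flush(current, lines);
                } else {
                    // Any other element is kept with its markup, e.g. <b>...</b>.
                    current.append(el.outerHtml());
                }
            }
        }
        flush(current, lines);
        return lines;
    }

    // Adds the trimmed buffer to the list if it is not empty, then clears it.
    private static void flush(StringBuilder current, List<String> lines) {
        String line = current.toString().trim();
        if (!line.isEmpty()) {
            lines.add(line);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        String str = "<div>" +
                "    Some text <b>with tags</b> might go here." +
                "    <p>Also there are paragraphs</p>" +
                "    More text can go without paragraphs<br/>" +
                "</div>";
        for (String line : extractLines(str)) {
            System.out.println(line);
        }
    }
}
```

For the div from the question this should print the three lines requested there. Which tags count as line boundaries is a choice made for this sketch; other block-level tags would need their own branch.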
