Jsoup - extracting text
I need to extract text from a node like this:
<div>
Some text <b>with tags</b> might go here.
<p>Also there are paragraphs</p>
More text can go without paragraphs<br/>
</div>
And I need to build:
Some text <b>with tags</b> might go here.
Also there are paragraphs
More text can go without paragraphs
Element.text() returns all of the text content of the div, including the text inside child elements. Element.ownText() returns only the text that is not inside child elements. Neither is what I need. Iterating through children() ignores text nodes.
Is there a way to iterate over the contents of an element so that I receive text nodes as well? E.g.:
- Text node - Some text
- Node <b> - with tags
- Text node - might go here.
- Node <p> - Also there are paragraphs
- Text node - More text can go without paragraphs
- Node <br> - <empty>
Element.children() returns an Elements object - a list of Element objects. Looking at the parent class, Node, you'll see methods to give you access to arbitrary nodes, not just Elements, such as Node.childNodes().
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class ChildNodesDemo {
    public static void main(String[] args) {
        String str = "<div>" +
                " Some text <b>with tags</b> might go here." +
                " <p>Also there are paragraphs</p>" +
                " More text can go without paragraphs<br/>" +
                "</div>";
        Document doc = Jsoup.parse(str);
        Element div = doc.select("div").first();
        int i = 0;
        // childNodes() returns text nodes as well as elements, in document order
        for (Node node : div.childNodes()) {
            i++;
            System.out.println(String.format("%d %s %s",
                    i,
                    node.getClass().getSimpleName(),
                    node.toString()));
        }
    }
}
Result:
1 TextNode Some text
2 Element <b>with tags</b>
3 TextNode might go here.
4 Element <p>Also there are paragraphs</p>
5 TextNode More text can go without paragraphs
6 Element <br/>
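Building on the childNodes() loop, here is a rough sketch of how the three lines asked for in the question might be assembled. The class name ChildNodeText and the helpers extractLines and flush are hypothetical names made up for this illustration. The rule that a block-level element or a <br> ends the current line is an assumption about the desired output, not something jsoup prescribes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class ChildNodeText {

    // Hypothetical helper: walks the direct child nodes and returns one string per output line.
    public static List<String> extractLines(Element parent) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (Node node : parent.childNodes()) {
            if (node instanceof TextNode) {
                // Plain text is appended to the line being built.
                current.append(((TextNode) node).text());
            } else if (node instanceof Element) {
                Element el = (Element) node;
                if (el.isBlock() || el.tagName().equals("br")) {
                    // Assumption: block elements and <br> end the current line.
                    flush(current, lines);
                    if (el.hasText()) {
                        // A block element contributes its own text as a separate line.
                        lines.add(el.text());
                    }
                } else {
                    // Inline elements such as <b> keep their markup.
                    current.append(el.outerHtml());
                }
            }
        }
        flush(current, lines);
        return lines;
    }

    // Hypothetical helper: moves the pending text into the list, skipping blank lines.
    private static void flush(StringBuilder current, List<String> lines) {
        String s = current.toString().trim();
        if (!s.isEmpty()) {
            lines.add(s);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        String str = "<div>" +
                " Some text <b>with tags</b> might go here." +
                " <p>Also there are paragraphs</p>" +
                " More text can go without paragraphs<br/>" +
                "</div>";
        Element div = Jsoup.parse(str).select("div").first();
        for (String line : extractLines(div)) {
            System.out.println(line);
        }
    }
}
```

This only looks at direct children of the div. Markup nested inside a block element such as <p> is flattened by Element.text(), which may or may not be what you want for deeper structures.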