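woody is an HTML parser/extractor for Java. Its usage closely resembles that of webmagic: it is a complete rewrite of webmagic's extraction templates, split out into a separate project so that it can be reused more easily.
Some new features:
Multiple result data types (String, char, byte, short, int, long, double, float, String[], Set, List, Date)
User-defined script processing functions (currently configured as JavaScript functions)
Replaceable CSS and XPath engines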
Filter support
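To make the "multiple result data types" feature more concrete, here is a minimal sketch of how extracted text could be converted to the declared type of a target field. `ResultConverter` and `convert` are hypothetical names chosen for illustration; they are not part of woody, and woody's actual conversion logic may differ.

```java
// Hypothetical helper, NOT woody's API: converts extracted text to a field's declared type.
class ResultConverter {

    /** Converts trimmed text to the requested type; throws for unsupported types. */
    static Object convert(String raw, Class<?> target) {
        String text = raw.trim();
        if (target == String.class) return text;
        if (target == int.class || target == Integer.class) return Integer.parseInt(text);
        if (target == long.class || target == Long.class) return Long.parseLong(text);
        if (target == short.class || target == Short.class) return Short.parseShort(text);
        if (target == byte.class || target == Byte.class) return Byte.parseByte(text);
        if (target == double.class || target == Double.class) return Double.parseDouble(text);
        if (target == float.class || target == Float.class) return Float.parseFloat(text);
        if (target == char.class || target == Character.class) {
            if (text.length() != 1) {
                throw new IllegalArgumentException("Expected a single character: " + text);
            }
            return text.charAt(0);
        }
        throw new IllegalArgumentException("Unsupported target type: " + target.getName());
    }

    public static void main(String[] args) {
        // The text of a node such as span#p_favor_count, after stripping a leading "+".
        System.out.println(convert(" 42 ", int.class));
        System.out.println(convert("3.5", double.class));
    }
}
```

A real implementation would look up the target type from the annotated field via reflection and would also need to handle collections and dates.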
A complete example:
import java.util.Date;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
// woody's own imports (AnnotationExtractor, Model, the annotations, ExprType, OP)
// are omitted because their package names are not given here.

public class OsChinaBlog {

    public static void main(String[] args) throws Exception {
        // Fetch the page with Jsoup, then pass the raw HTML to woody.
        Document doc = Jsoup.connect("http://www.oschina.net/news/43879/webmagic-0-3-0").timeout(60000)
                .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:23.0) Gecko/20100101 Firefox/23.0").get();
        String html = doc.html();
        OsChinaBlogModel model = AnnotationExtractor.me().process(html, OsChinaBlogModel.class);
        System.out.println(model.toJson());
    }

    public static class OsChinaBlogModel extends Model {

        public OsChinaBlogModel() {
            // No-arg constructor, used for reflective instantiation.
        }

        // Try the CSS selector first, then fall back to the XPath expression.
        @Inject
        @ComboExtract(value = { @ExtractBy(value = "h1.OSCTitle", type = ExprType.CSS),
                @ExtractBy(value = "//title/text()", type = ExprType.XPATH) }, op = OP.OR)
        public String title;

        // Backslashes must be doubled inside Java string literals.
        @Inject
        @ExtractBy(value = "p.PubDate a[href~=http://my\\.oschina\\.net/]", type = ExprType.CSS)
        public String author;

        @Inject
        @ExtractBy(value = "发布于.\\s*(\\d+年\\d+月\\d+日)", type = ExprType.REGEX)
        public Date publishDate;

        // Take the outer HTML of the matched node, then apply the regex to it.
        @Inject
        @ComboExtract(value = {
                @ExtractBy(value = "p.PubDate", type = ExprType.CSS, setting = @Setting(outerHtml = true)),
                @ExtractBy(value = "(\\d+)评", type = ExprType.REGEX) }, op = OP.AND)
        public int commentNum;

        // Strip the "+" from the text before converting it to an int.
        @Inject
        @ExtractBy(value = "span#p_favor_count", type = ExprType.CSS, setting = @Setting(function = @Function(value = "replace", args = {
                "+", "" })))
        public int collectNum;

        // multi = true collects every match; the String element type is an assumption.
        @Inject
        @ComboExtract(value = {
                @ExtractBy(value = "p[id=userComments]", type = ExprType.CSS, setting = @Setting(outerHtml = true)),
                @ExtractBy(value = "p.TextContent", type = ExprType.CSS) }, op = OP.AND, multi = true)
        public List<String> commentContents;

        @Inject
        @ExtractBy(value = "p[id=toolbar_wrapper]", setting = @Setting(fliters = { "b", "span" }), type = ExprType.CSS, impl = Document.class)
        public String weibo;
    }
}
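The example combines extractors with `OP.OR` (for `title`) and `OP.AND` (for `commentNum`, where a CSS selector is followed by the regex `(\d+)评`). The sketch below shows one plausible reading of those two operators: OR returns the first extractor that yields a result, and AND feeds each extractor's output into the next. This reading is inferred from the example and is not confirmed woody behavior. `ComboSketch`, `regex`, `or`, and `and` are hypothetical names, and plain regular expressions over ASCII sample text stand in for the CSS and XPath steps to keep the sketch free of dependencies.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of OR/AND extractor combination; not woody's implementation.
class ComboSketch {

    /** Builds an extractor that returns group 1 of the first match, or null if none. */
    static Function<String, String> regex(String pattern) {
        Pattern compiled = Pattern.compile(pattern);
        return input -> {
            Matcher m = compiled.matcher(input);
            return m.find() ? m.group(1) : null;
        };
    }

    /** OR: apply each extractor to the same input and return the first non-null result. */
    static String or(String input, List<Function<String, String>> extractors) {
        for (Function<String, String> extractor : extractors) {
            String result = extractor.apply(input);
            if (result != null) {
                return result;
            }
        }
        return null;
    }

    /** AND: pass each extractor's output to the next; a null at any step ends the chain. */
    static String and(String input, List<Function<String, String>> extractors) {
        String current = input;
        for (Function<String, String> extractor : extractors) {
            if (current == null) {
                return null;
            }
            current = extractor.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        String html = "<title>webmagic 0.3.0</title>"
                + "<p class=\"PubDate\">Posted 2013-09-01 (12 comments)</p>";

        // The <h1> pattern finds nothing, so the <title> pattern supplies the value.
        String title = or(html, Arrays.asList(
                regex("<h1>(.*?)</h1>"), regex("<title>(.*?)</title>")));

        // Narrow to the PubDate paragraph first, then pull the number out of it.
        String comments = and(html, Arrays.asList(
                regex("<p class=\"PubDate\">(.*?)</p>"), regex("(\\d+) comments")));

        System.out.println(title);
        System.out.println(comments);
    }
}
```

Note that the regex patterns are written with doubled backslashes (`"(\\d+) comments"`), as in the annotation values above: a single `\d` inside a Java string literal does not compile.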