Commit 57556ab8 authored by yihua.huang

merge

parent 4f84b5f8
......@@ -4,7 +4,7 @@ webmagic
[![Build Status](https://travis-ci.org/code4craft/webmagic.png?branch=master)](https://travis-ci.org/code4craft/webmagic)
>A scalable crawler framework. It covers the whole lifecycle of crawler: downloading, url management, content extraction and persistent. It can simply the development of a specific crawler.
>A scalable crawler framework. It covers the whole lifecycle of crawler: downloading, url management, content extraction and persistent. It can simplify the development of a specific crawler.
## Features:
......@@ -17,23 +17,17 @@ webmagic
## Install:
Clone the repo and build:
git clone https://github.com/code4craft/webmagic.git
cd webmagic
mvn clean install
Add dependencies to your project:
Add dependencies to your pom.xml:
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
<version>0.2.0</version>
<version>0.3.0</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.2.0</version>
<version>0.3.0</version>
</dependency>
## Get Started:
......@@ -42,6 +36,7 @@ Add dependencies to your project:
Write a class implements PageProcessor:
```java
public class OschinaBlogPageProcesser implements PageProcessor {
private Site site = Site.me().setDomain("my.oschina.net")
......@@ -67,6 +62,7 @@ Write a class implements PageProcessor:
.pipeline(new ConsolePipeline()).run();
}
}
```
* `page.addTargetRequests(links)`
......@@ -74,6 +70,7 @@ Write a class implements PageProcessor:
You can also use annotation way:
```java
@TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
public class OschinaBlog {
......@@ -92,6 +89,7 @@ You can also use annotation way:
new ConsolePageModelPipeline(), OschinaBlog.class).run();
}
}
```
### Docs and samples:
......
......@@ -6,9 +6,13 @@
<version>7</version>
</parent>
<groupId>us.codecraft</groupId>
<version>0.2.1</version>
<version>0.3.1-SNAPSHOT</version>
<modelVersion>4.0.0</modelVersion>
<packaging>pom</packaging>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
</properties>
<artifactId>webmagic-parent</artifactId>
<name>webmagic-parent</name>
<description>
......@@ -32,7 +36,7 @@
<connection>scm:git:git@github.com:code4craft/webmagic.git</connection>
<developerConnection>scm:git:git@github.com:code4craft/webmagic.git</developerConnection>
<url>git@github.com:code4craft/webmagic.git</url>
<tag>webmagic-parent-0.2.1</tag>
<tag>HEAD</tag>
</scm>
<licenses>
<license>
......@@ -44,7 +48,6 @@
<modules>
<module>webmagic-core</module>
<module>webmagic-extension/</module>
<module>webmagic-samples/</module>
</modules>
<dependencyManagement>
......@@ -60,6 +63,11 @@
<artifactId>httpclient</artifactId>
<version>4.2.4</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>xsoup</artifactId>
<version>0.1.0</version>
</dependency>
<dependency>
<groupId>net.sf.saxon</groupId>
<artifactId>Saxon-HE</artifactId>
......
Release Notes
----
*2012-9-4* `version:0.3.0`
* Change default XPath selector from HtmlCleaner to [Xsoup](https://github.com/code4craft/xsoup).
[Xsoup](https://github.com/code4craft/xsoup) is an XPath selector based on Jsoup, written by me. It performs much better than HtmlCleaner.
The time to process a page drops from 7~9ms to 0.4ms.
If Xsoup is not stable for your usage, call `Spider.xsoupOff()` to turn it off, and report an issue to me!
* Add cycle retry times for Site.
When cycle retry times is set, Spider puts a url whose download failed back into the scheduler and retries it after one cycle of the queue.
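
The cycle retry described in this entry can be pictured with a plain FIFO queue. The following is only a minimal sketch under stated assumptions: `CycleRetrySketch`, `crawl`, and the `failuresBeforeSuccess` map are hypothetical names made up for illustration, not part of the webmagic API, and the counting rule (drop a url after `maxCycleRetries` re-queues) is a simplification of whatever the real downloader does.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a scheduler queue; not the webmagic API.
final class CycleRetrySketch {

    /**
     * Processes urls in FIFO order. A url that fails is appended to the tail
     * of the queue, so it is retried only after everything queued before it.
     * A url is dropped once it has been re-queued maxCycleRetries times.
     * Returns the order in which download attempts were made.
     */
    static List<String> crawl(List<String> seeds,
                              Map<String, Integer> failuresBeforeSuccess,
                              int maxCycleRetries) {
        Deque<String> queue = new ArrayDeque<>(seeds);
        Map<String, Integer> remainingFailures = new HashMap<>(failuresBeforeSuccess);
        Map<String, Integer> cycleTried = new HashMap<>();
        List<String> attempts = new ArrayList<>();
        while (!queue.isEmpty()) {
            String url = queue.poll();
            attempts.add(url);
            int left = remainingFailures.getOrDefault(url, 0);
            if (left == 0) {
                continue; // simulated download succeeded
            }
            remainingFailures.put(url, left - 1); // simulated download failed
            int tried = cycleTried.getOrDefault(url, 0);
            if (tried < maxCycleRetries) {
                cycleTried.put(url, tried + 1);
                queue.add(url); // back to the tail: retried after one cycle
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        List<String> seeds = Arrays.asList("a", "b", "c");
        // "a" fails once, so it is attempted again after "b" and "c".
        System.out.println(crawl(seeds, Collections.singletonMap("a", 1), 2));
    }
}
```

The point of re-queueing at the tail rather than retrying in place is that a temporarily unreachable page gets some time to recover while the rest of the queue is processed.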
*2012-8-20* `version:0.2.1`
ComboExtractor support for annotation.
......
......@@ -21,22 +21,17 @@ webmagic User Manual
### Using Maven
webmagic uses Maven to manage dependencies. You can download the webmagic source and build it directly:
git clone https://github.com/code4craft/webmagic.git
mvn clean install
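After installing, add the corresponding dependencies to your project to use webmagic:
webmagic uses Maven to manage dependencies; add the corresponding dependencies to your project to use webmagic: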
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
<version>0.2.0</version>
<version>0.2.1</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.2.0</version>
<version>0.2.1</version>
</dependency>
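#### Project structure
......@@ -51,7 +46,7 @@ webmagic mainly consists of two packages:
The webmagic extension module, which provides tools that make writing crawlers more convenient, including support for annotation-based crawler definitions, JSON, and distributed crawling.
webmagic also includes two usable extension packages. Because both depend on fairly heavyweight tools, they are split out from the main packages:
webmagic also includes two usable extension packages. Because both depend on fairly heavyweight tools, they are split out from the main packages; these packages must be built yourself after downloading the source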
* **webmagic-saxon**
......
<?xml version="1.0" encoding="UTF-8"?>
<project name="module_webmagic-core" default="compile.module.webmagic-core">
<dirname property="module.webmagic-core.basedir" file="${ant.file.module_webmagic-core}"/>
<property name="module.jdk.home.webmagic-core" value="${project.jdk.home}"/>
<property name="module.jdk.bin.webmagic-core" value="${project.jdk.bin}"/>
<property name="module.jdk.classpath.webmagic-core" value="${project.jdk.classpath}"/>
<property name="compiler.args.webmagic-core" value="${compiler.args}"/>
<property name="webmagic-core.output.dir" value="${module.webmagic-core.basedir}/target/classes"/>
<property name="webmagic-core.testoutput.dir" value="${module.webmagic-core.basedir}/target/test-classes"/>
<path id="webmagic-core.module.bootclasspath">
<!-- Paths to be included in compilation bootclasspath -->
</path>
<path id="webmagic-core.module.production.classpath">
<path refid="${module.jdk.classpath.webmagic-core}"/>
<path refid="library.maven:_org.apache.httpcomponents:httpclient:4.2.4.classpath"/>
<path refid="library.maven:_org.apache.httpcomponents:httpcore:4.2.4.classpath"/>
<path refid="library.maven:_commons-logging:commons-logging:1.1.1.classpath"/>
<path refid="library.maven:_commons-codec:commons-codec:1.6.classpath"/>
<path refid="library.maven:_com.google.guava:guava:13.0.1.classpath"/>
<path refid="library.maven:_org.apache.commons:commons-lang3:3.1.classpath"/>
<path refid="library.maven:_log4j:log4j:1.2.17.classpath"/>
<path refid="library.maven:_commons-collections:commons-collections:3.2.1.classpath"/>
<path refid="library.maven:_net.sourceforge.htmlcleaner:htmlcleaner:2.4.classpath"/>
<path refid="library.maven:_org.jdom:jdom2:2.0.4.classpath"/>
<path refid="library.maven:_commons-io:commons-io:1.3.2.classpath"/>
</path>
<path id="webmagic-core.runtime.production.module.classpath">
<pathelement location="${webmagic-core.output.dir}"/>
<path refid="library.maven:_org.apache.httpcomponents:httpclient:4.2.4.classpath"/>
<path refid="library.maven:_org.apache.httpcomponents:httpcore:4.2.4.classpath"/>
<path refid="library.maven:_commons-logging:commons-logging:1.1.1.classpath"/>
<path refid="library.maven:_commons-codec:commons-codec:1.6.classpath"/>
<path refid="library.maven:_com.google.guava:guava:13.0.1.classpath"/>
<path refid="library.maven:_org.apache.commons:commons-lang3:3.1.classpath"/>
<path refid="library.maven:_log4j:log4j:1.2.17.classpath"/>
<path refid="library.maven:_commons-collections:commons-collections:3.2.1.classpath"/>
<path refid="library.maven:_net.sourceforge.htmlcleaner:htmlcleaner:2.4.classpath"/>
<path refid="library.maven:_org.jdom:jdom2:2.0.4.classpath"/>
<path refid="library.maven:_commons-io:commons-io:1.3.2.classpath"/>
</path>
<path id="webmagic-core.module.classpath">
<path refid="${module.jdk.classpath.webmagic-core}"/>
<pathelement location="${webmagic-core.output.dir}"/>
<path refid="library.maven:_org.apache.httpcomponents:httpclient:4.2.4.classpath"/>
<path refid="library.maven:_org.apache.httpcomponents:httpcore:4.2.4.classpath"/>
<path refid="library.maven:_commons-logging:commons-logging:1.1.1.classpath"/>
<path refid="library.maven:_commons-codec:commons-codec:1.6.classpath"/>
<path refid="library.maven:_junit:junit:4.7.classpath"/>
<path refid="library.maven:_com.google.guava:guava:13.0.1.classpath"/>
<path refid="library.maven:_org.apache.commons:commons-lang3:3.1.classpath"/>
<path refid="library.maven:_log4j:log4j:1.2.17.classpath"/>
<path refid="library.maven:_commons-collections:commons-collections:3.2.1.classpath"/>
<path refid="library.maven:_net.sourceforge.htmlcleaner:htmlcleaner:2.4.classpath"/>
<path refid="library.maven:_org.jdom:jdom2:2.0.4.classpath"/>
<path refid="library.maven:_commons-io:commons-io:1.3.2.classpath"/>
</path>
<path id="webmagic-core.runtime.module.classpath">
<pathelement location="${webmagic-core.testoutput.dir}"/>
<pathelement location="${webmagic-core.output.dir}"/>
<path refid="library.maven:_org.apache.httpcomponents:httpclient:4.2.4.classpath"/>
<path refid="library.maven:_org.apache.httpcomponents:httpcore:4.2.4.classpath"/>
<path refid="library.maven:_commons-logging:commons-logging:1.1.1.classpath"/>
<path refid="library.maven:_commons-codec:commons-codec:1.6.classpath"/>
<path refid="library.maven:_junit:junit:4.7.classpath"/>
<path refid="library.maven:_com.google.guava:guava:13.0.1.classpath"/>
<path refid="library.maven:_org.apache.commons:commons-lang3:3.1.classpath"/>
<path refid="library.maven:_log4j:log4j:1.2.17.classpath"/>
<path refid="library.maven:_commons-collections:commons-collections:3.2.1.classpath"/>
<path refid="library.maven:_net.sourceforge.htmlcleaner:htmlcleaner:2.4.classpath"/>
<path refid="library.maven:_org.jdom:jdom2:2.0.4.classpath"/>
<path refid="library.maven:_commons-io:commons-io:1.3.2.classpath"/>
</path>
<patternset id="excluded.from.module.webmagic-core">
<patternset refid="ignored.files"/>
</patternset>
<patternset id="excluded.from.compilation.webmagic-core">
<patternset refid="excluded.from.module.webmagic-core"/>
</patternset>
<path id="webmagic-core.module.sourcepath">
<dirset dir="${module.webmagic-core.basedir}">
<include name="src/main/java"/>
<include name="src/main/resources"/>
</dirset>
</path>
<path id="webmagic-core.module.test.sourcepath">
<dirset dir="${module.webmagic-core.basedir}">
<include name="src/test/java"/>
<include name="src/test/resources"/>
</dirset>
</path>
<target name="compile.module.webmagic-core" depends="compile.module.webmagic-core.production,compile.module.webmagic-core.tests" description="Compile module webmagic-core"/>
<target name="compile.module.webmagic-core.production" depends="register.custom.compilers" description="Compile module webmagic-core; production classes">
<mkdir dir="${webmagic-core.output.dir}"/>
<javac2 destdir="${webmagic-core.output.dir}" debug="${compiler.debug}" nowarn="${compiler.generate.no.warnings}" memorymaximumsize="${compiler.max.memory}" fork="true" executable="${module.jdk.bin.webmagic-core}/javac">
<compilerarg line="${compiler.args.webmagic-core}"/>
<bootclasspath refid="webmagic-core.module.bootclasspath"/>
<classpath refid="webmagic-core.module.production.classpath"/>
<src refid="webmagic-core.module.sourcepath"/>
<patternset refid="excluded.from.compilation.webmagic-core"/>
</javac2>
<copy todir="${webmagic-core.output.dir}">
<fileset dir="${module.webmagic-core.basedir}/src/main/java">
<patternset refid="compiler.resources"/>
<type type="file"/>
</fileset>
<fileset dir="${module.webmagic-core.basedir}/src/main/resources">
<patternset refid="compiler.resources"/>
<type type="file"/>
</fileset>
</copy>
</target>
<target name="compile.module.webmagic-core.tests" depends="register.custom.compilers,compile.module.webmagic-core.production" description="compile module webmagic-core; test classes" unless="skip.tests">
<mkdir dir="${webmagic-core.testoutput.dir}"/>
<javac2 destdir="${webmagic-core.testoutput.dir}" debug="${compiler.debug}" nowarn="${compiler.generate.no.warnings}" memorymaximumsize="${compiler.max.memory}" fork="true" executable="${module.jdk.bin.webmagic-core}/javac">
<compilerarg line="${compiler.args.webmagic-core}"/>
<bootclasspath refid="webmagic-core.module.bootclasspath"/>
<classpath refid="webmagic-core.module.classpath"/>
<src refid="webmagic-core.module.test.sourcepath"/>
<patternset refid="excluded.from.compilation.webmagic-core"/>
</javac2>
<copy todir="${webmagic-core.testoutput.dir}">
<fileset dir="${module.webmagic-core.basedir}/src/test/java">
<patternset refid="compiler.resources"/>
<type type="file"/>
</fileset>
<fileset dir="${module.webmagic-core.basedir}/src/test/resources">
<patternset refid="compiler.resources"/>
<type type="file"/>
</fileset>
</copy>
</target>
<target name="clean.module.webmagic-core" description="cleanup module">
<delete dir="${webmagic-core.output.dir}"/>
<delete dir="${webmagic-core.testoutput.dir}"/>
</target>
</project>
\ No newline at end of file
......@@ -3,7 +3,7 @@
<parent>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-parent</artifactId>
<version>0.2.1</version>
<version>0.3.1-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
......@@ -25,6 +25,11 @@
<artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>xsoup</artifactId>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
......
package us.codecraft.webmagic;
import org.apache.commons.lang3.StringUtils;
import us.codecraft.webmagic.selector.Html;
import us.codecraft.webmagic.selector.Selectable;
import us.codecraft.webmagic.utils.UrlUtils;
......@@ -28,7 +29,7 @@ public class Page {
private ResultItems resultItems = new ResultItems();
private Selectable html;
private Html html;
private Selectable url;
......@@ -58,11 +59,11 @@ public class Page {
*
* @return html
*/
public Selectable getHtml() {
public Html getHtml() {
return html;
}
public void setHtml(Selectable html) {
public void setHtml(Html html) {
this.html = html;
}
......@@ -87,6 +88,23 @@ public class Page {
}
}
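/**
* add urls to fetch with the given priority
*
* @param requests urls to add
* @param priority priority assigned to every added request
*/
public void addTargetRequests(List<String> requests, long priority) {
synchronized (targetRequests) {
for (String s : requests) {
if (StringUtils.isBlank(s) || s.equals("#") || s.startsWith("javascript:")) {
// skip this unusable link but keep processing the remaining ones
continue;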
}
s = UrlUtils.canonicalizeUrl(s, url.toString());
targetRequests.add(new Request(s).setPriority(priority));
}
}
}
/**
* add url to fetch
*
......
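
Note on the link filtering in `addTargetRequests`: a sketch of the same check, written so that one unusable link is skipped without dropping the links that follow it. `LinkFilterSketch`, `isCrawlable`, and `filter` are hypothetical names for illustration; the sketch uses only the JDK and leaves out url canonicalization and priorities.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; mirrors only the blank / "#" / "javascript:" check.
final class LinkFilterSketch {

    /** True when the link is worth turning into a request. */
    static boolean isCrawlable(String link) {
        if (link == null) {
            return false;
        }
        String s = link.trim();
        return !s.isEmpty() && !s.equals("#") && !s.startsWith("javascript:");
    }

    /** Keeps crawlable links in their original order. */
    static List<String> filter(List<String> links) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (!isCrawlable(link)) {
                continue; // skip this link only, keep scanning the rest
            }
            kept.add(link);
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter(Arrays.asList(
                "http://example.com/a", "#", "javascript:void(0)", "http://example.com/b")));
    }
}
```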
......@@ -17,6 +17,8 @@ public class Request implements Serializable {
private static final long serialVersionUID = 2062192774891352043L;
public static final String CYCLE_TRIED_TIMES = "_cycle_tried_times";
private String url;
/**
......
......@@ -30,6 +30,8 @@ public class Site {
private int retryTimes = 0;
private int cycleRetryTimes = 0;
private static final Set<Integer> DEFAULT_STATUS_CODE_SET = new HashSet<Integer>();
private Set<Integer> acceptStatCode = DEFAULT_STATUS_CODE_SET;
......@@ -200,7 +202,7 @@ public class Site {
}
/**
* Get retry times when download fail, 0 by default.<br>
* Get retry times when download fail immediately, 0 by default.<br>
*
* @return retry times when download fail
*/
......@@ -218,6 +220,25 @@ public class Site {
return this;
}
/**
* When cycleRetryTimes is greater than 0, a request whose download failed is added back to the scheduler and downloaded again. <br>
*
* @return cycle retry times when a download fails
*/
public int getCycleRetryTimes() {
return cycleRetryTimes;
}
/**
* Set the cycle retry times used when a download fails, 0 by default. Only works in RedisScheduler. <br>
*
* @param cycleRetryTimes cycle retry times
* @return this
*/
public Site setCycleRetryTimes(int cycleRetryTimes) {
this.cycleRetryTimes = cycleRetryTimes;
return this;
}
@Override
public boolean equals(Object o) {
if (this == o) return true;
......
......@@ -9,6 +9,7 @@ import us.codecraft.webmagic.pipeline.Pipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.QueueScheduler;
import us.codecraft.webmagic.scheduler.Scheduler;
import us.codecraft.webmagic.utils.EnvironmentUtil;
import us.codecraft.webmagic.utils.ThreadUtils;
import java.io.Closeable;
......@@ -309,6 +310,12 @@ public class Spider implements Runnable, Task {
sleep(site.getSleepTime());
return;
}
//for cycle retry
if (page.getHtml()==null){
addRequest(page);
sleep(site.getSleepTime());
return;
}
pageProcessor.process(page);
addRequest(page);
if (!page.getResultItems().isSkip()) {
......@@ -368,6 +375,14 @@ public class Spider implements Runnable, Task {
return this;
}
/**
* Switch off Xsoup; XPath selection then falls back to XpathSelector.
*/
public static void xsoupOff(){
EnvironmentUtil.setUseXsoup(false);
}
@Override
public String getUUID() {
if (uuid != null) {
......
......@@ -46,6 +46,17 @@ public class HttpClientDownloader implements Downloader {
return (Html) page.getHtml();
}
/**
* A simple method to download a url with the given charset.
*
* @param url url to download
* @param charset charset used to decode the response
* @return html
*/
public Html download(String url, String charset) {
Page page = download(new Request(url), Site.me().setCharset(charset).toTask());
return (Html) page.getHtml();
}
@Override
public Page download(Request request, Task task) {
Site site = null;
......@@ -79,6 +90,21 @@ public class HttpClientDownloader implements Downloader {
if (tried > retryTimes) {
logger.warn("download page " + request.getUrl() + " error", e);
if (site.getCycleRetryTimes() > 0) {
Page page = new Page();
Object cycleTriedTimesObject = request.getExtra(Request.CYCLE_TRIED_TIMES);
if (cycleTriedTimesObject == null) {
page.addTargetRequest(request.setPriority(0).putExtra(Request.CYCLE_TRIED_TIMES, 1));
} else {
int cycleTriedTimes = (Integer) cycleTriedTimesObject;
cycleTriedTimes++;
if (cycleTriedTimes >= site.getCycleRetryTimes()) {
return null;
}
page.addTargetRequest(request.setPriority(0).putExtra(Request.CYCLE_TRIED_TIMES, cycleTriedTimes));
}
return page;
}
return null;
}
logger.info("download page " + request.getUrl() + " error, retry the " + tried + " time!");
......@@ -87,13 +113,12 @@ public class HttpClientDownloader implements Downloader {
} while (retry);
int statusCode = httpResponse.getStatusLine().getStatusCode();
if (acceptStatCode.contains(statusCode)) {
handleGzip(httpResponse);
//charset
if (charset == null) {
String value = httpResponse.getEntity().getContentType().getValue();
charset = UrlUtils.getCharset(value);
}
//
handleGzip(httpResponse);
return handleResponse(request, charset, httpResponse, task);
} else {
logger.warn("code error " + statusCode + "\t" + request.getUrl());
......
package us.codecraft.webmagic.selector;
import org.jsoup.Jsoup;
import java.util.List;
/**
* @author code4crafter@gmail.com
* @since 0.3.0
*/
public abstract class BaseElementSelector implements Selector,ElementSelector {
@Override
public String select(String text) {
return select(Jsoup.parse(text));
}
@Override
public List<String> selectList(String text) {
return selectList(Jsoup.parse(text));
}
}
package us.codecraft.webmagic.selector;
import org.apache.commons.collections.CollectionUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
......@@ -15,7 +13,7 @@ import java.util.List;
* @author code4crafter@gmail.com <br>
* @since 0.1.0
*/
public class CssSelector implements Selector {
public class CssSelector extends BaseElementSelector {
private String selectorText;
......@@ -30,16 +28,6 @@ public class CssSelector implements Selector {
this.attrName = attrName;
}
@Override
public String select(String text) {
Document doc = Jsoup.parse(text);
Elements elements = doc.select(selectorText);
if (CollectionUtils.isEmpty(elements)) {
return null;
}
return getValue(elements.get(0));
}
private String getValue(Element element) {
if (attrName == null) {
return element.outerHtml();
......@@ -51,9 +39,17 @@ public class CssSelector implements Selector {
}
@Override
public List<String> selectList(String text) {
public String select(Element element) {
Elements elements = element.select(selectorText);
if (CollectionUtils.isEmpty(elements)) {
return null;
}
return getValue(elements.get(0));
}
@Override
public List<String> selectList(Element doc) {
List<String> strings = new ArrayList<String>();
Document doc = Jsoup.parse(text);
Elements elements = doc.select(selectorText);
if (CollectionUtils.isNotEmpty(elements)) {
for (Element element : elements) {
......
package us.codecraft.webmagic.selector;
import org.jsoup.nodes.Element;
import java.util.List;
/**
* Selector(extractor) for html elements.<br>
*
* @author code4crafter@gmail.com <br>
* @since 0.3.0
*/
public interface ElementSelector {
/**
* Extract a single result from the element.<br>
* If there is more than one result, only the first is chosen.
*
* @param element
* @return result
*/
public String select(Element element);
/**
* Extract all results from the element.<br>
*
* @param element
* @return results
*/
public List<String> selectList(Element element);
}
package us.codecraft.webmagic.selector;
import org.apache.log4j.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import us.codecraft.webmagic.utils.EnvironmentUtil;
import java.util.ArrayList;
import java.util.List;
......@@ -11,12 +16,29 @@ import java.util.List;
*/
public class Html extends PlainText {
private Logger logger = Logger.getLogger(getClass());
/**
* Store the parsed document for better performance when only one text exists.
*/
private Document document;
public Html(List<String> strings) {
super(strings);
}
public Html(String text) {
super(text);
try {
this.document = Jsoup.parse(text);
} catch (Exception e) {
logger.warn("parse document error ", e);
}
}
public Html(Document document) {
super(document.html());
this.document = document;
}
public static Html create(String text) {
......@@ -47,32 +69,77 @@ public class Html extends PlainText {
@Override
public Selectable smartContent() {
SmartContentSelector smartContentSelector = SelectorFactory.getInstatnce().newSmartContentSelector();
SmartContentSelector smartContentSelector = Selectors.smartContent();
return select(smartContentSelector, strings);
}
@Override
public Selectable links() {
XpathSelector xpathSelector = SelectorFactory.getInstatnce().newXpathSelector("//a/@href");
return selectList(xpathSelector, strings);
return xpath("//a/@href");
}
@Override
public Selectable xpath(String xpath) {
XpathSelector xpathSelector = SelectorFactory.getInstatnce().newXpathSelector(xpath);
if (EnvironmentUtil.useXsoup()) {
XsoupSelector xsoupSelector = new XsoupSelector(xpath);
if (document != null) {
return new Html(xsoupSelector.selectList(document));
}
return selectList(xsoupSelector, strings);
} else {
XpathSelector xpathSelector = new XpathSelector(xpath);
return selectList(xpathSelector, strings);
}
}
@Override
public Selectable $(String selector) {
CssSelector cssSelector = new CssSelector(selector);
CssSelector cssSelector = Selectors.$(selector);
if (document != null) {
return new Html(cssSelector.selectList(document));
}
return selectList(cssSelector, strings);
}
@Override
public Selectable $(String selector, String attrName) {
CssSelector cssSelector = new CssSelector(selector, attrName);
CssSelector cssSelector = Selectors.$(selector, attrName);
if (document != null) {
return new Html(cssSelector.selectList(document));
}
return selectList(cssSelector, strings);
}
public Document getDocument() {
return document;
}
public String getText() {
if (strings != null && strings.size() > 0) {
return strings.get(0);
}
return document.html();
}
/**
* @param selector
* @return
*/
public String selectDocument(Selector selector) {
if (selector instanceof ElementSelector) {
ElementSelector elementSelector = (ElementSelector) selector;
return elementSelector.select(getDocument());
} else {
return selector.select(getText());
}
}
public List<String> selectDocumentForList(Selector selector) {
if (selector instanceof ElementSelector) {
ElementSelector elementSelector = (ElementSelector) selector;
return elementSelector.selectList(getDocument());
} else {
return selector.selectList(getText());
}
}
}
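
Note on `selectDocument` / `selectDocumentForList`: the dispatch they perform (use the already parsed form when the selector can work on it, otherwise hand over the raw text) can be sketched without Jsoup. Everything below is a hypothetical stand-in: a word list plays the role of the parsed `Document`, and `TextSelector`, `ParsedSelector`, `Doc`, and `FirstWord` are names invented for this example.

```java
import java.util.Arrays;
import java.util.List;

final class SelectorDispatchSketch {

    /** Works on raw text and has to parse it on every call. */
    interface TextSelector {
        String select(String text);
    }

    /** Can additionally work on an already parsed form (here: a word list). */
    interface ParsedSelector extends TextSelector {
        String select(List<String> words);
    }

    /** Holds the raw text plus a parse result that is computed once. */
    static final class Doc {
        final String text;
        final List<String> words;

        Doc(String text) {
            this.text = text;
            this.words = Arrays.asList(text.trim().split("\\s+"));
        }

        /** Prefers the cached parse when the selector supports it. */
        String select(TextSelector selector) {
            if (selector instanceof ParsedSelector) {
                return ((ParsedSelector) selector).select(words);
            }
            return selector.select(text);
        }
    }

    /** Example selector that counts how often it had to parse text itself. */
    static final class FirstWord implements ParsedSelector {
        int parses = 0;

        @Override
        public String select(String text) {
            parses++;
            return select(Arrays.asList(text.trim().split("\\s+")));
        }

        @Override
        public String select(List<String> words) {
            return words.isEmpty() ? null : words.get(0);
        }
    }

    public static void main(String[] args) {
        Doc doc = new Doc("hello big world");
        FirstWord first = new FirstWord();
        System.out.println(doc.select(first) + " (selector parsed " + first.parses + " times)");
    }
}
```

The saving comes from parsing once per page instead of once per selector call, which matters when a page model has many extracted fields.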
......@@ -57,13 +57,13 @@ public class PlainText implements Selectable {
@Override
public Selectable regex(String regex) {
RegexSelector regexSelector = SelectorFactory.getInstatnce().newRegexSelector(regex);
RegexSelector regexSelector = Selectors.regex(regex);
return selectList(regexSelector, strings);
}
@Override
public Selectable regex(String regex, int group) {
RegexSelector regexSelector = SelectorFactory.getInstatnce().newRegexSelector(regex, group);
RegexSelector regexSelector = Selectors.regex(regex, group);
return selectList(regexSelector, strings);
}
......@@ -89,7 +89,7 @@ public class PlainText implements Selectable {
@Override
public Selectable replace(String regex, String replacement) {
ReplaceSelector replaceSelector = SelectorFactory.getInstatnce().newReplaceSelector(regex, replacement);
ReplaceSelector replaceSelector = new ReplaceSelector(regex,replacement);
return select(replaceSelector, strings);
}
......@@ -106,4 +106,9 @@ public class PlainText implements Selectable {
return null;
}
}
@Override
public boolean match() {
return strings != null && strings.size() > 0;
}
}
......@@ -82,6 +82,13 @@ public interface Selectable {
*/
public String toString();
/**
* whether the selection produced any result
*
* @return true if result exist
*/
public boolean match();
/**
* multi string result
*
......
......@@ -9,11 +9,15 @@ package us.codecraft.webmagic.selector;
public abstract class Selectors {
public static RegexSelector regex(String expr) {
return SelectorFactory.getInstatnce().newRegexSelector(expr);
return new RegexSelector(expr);
}
public static RegexSelector regex(String expr, int group) {
return SelectorFactory.getInstatnce().newRegexSelector(expr, group);
return new RegexSelector(expr,group);
}
public static SmartContentSelector smartContent() {
return new SmartContentSelector();
}
public static CssSelector $(String expr) {
......@@ -25,7 +29,11 @@ public abstract class Selectors {
}
public static XpathSelector xpath(String expr) {
return SelectorFactory.getInstatnce().newXpathSelector(expr);
return new XpathSelector(expr);
}
public static XsoupSelector xsoup(String expr) {
return new XsoupSelector(expr);
}
public static AndSelector and(Selector... selectors) {
......
package us.codecraft.webmagic.selector;
import org.jsoup.nodes.Element;
import us.codecraft.xsoup.XPathEvaluator;
import us.codecraft.xsoup.Xsoup;
import java.util.List;
/**
* XPath selector based on Xsoup.<br>
*
* @author code4crafter@gmail.com <br>
* @since 0.3.0
*/
public class XsoupSelector extends BaseElementSelector {
private XPathEvaluator xPathEvaluator;
public XsoupSelector(String xpathStr) {
this.xPathEvaluator = Xsoup.compile(xpathStr);
}
@Override
public String select(Element element) {
return xPathEvaluator.evaluate(element).get();
}
@Override
public List<String> selectList(Element element) {
return xPathEvaluator.evaluate(element).list();
}
}
package us.codecraft.webmagic.utils;
import org.apache.commons.lang3.BooleanUtils;
import java.util.Properties;
/**
* @author code4crafter@gmail.com
* @since 0.3.0
*/
public abstract class EnvironmentUtil {
private static final String USE_XSOUP = "xsoup";
public static boolean useXsoup() {
Properties properties = System.getProperties();
Object o = properties.get(USE_XSOUP);
if (o == null) {
return true;
}
return BooleanUtils.toBoolean(((String) o).toLowerCase());
}
public static void setUseXsoup(boolean useXsoup) {
Properties properties = System.getProperties();
properties.setProperty(USE_XSOUP, BooleanUtils.toString(useXsoup, "true", "false"));
}
}
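
Note on `EnvironmentUtil`: the switch is a JVM-wide system property named `xsoup`, so it should also be settable from the command line with `-Dxsoup=false`. A JDK-only sketch of the same idea follows. `XsoupSwitchSketch` is a hypothetical name, and `Boolean.parseBoolean` is swapped in for commons-lang `BooleanUtils`, which is an assumption that narrows the accepted values: `BooleanUtils.toBoolean` also accepts strings such as "yes" and "on", while `Boolean.parseBoolean` accepts only "true" (case-insensitive).

```java
// Sketch of a JVM-wide feature switch backed by a system property.
final class XsoupSwitchSketch {

    // Property key as used in the diff.
    private static final String USE_XSOUP = "xsoup";

    /** Unset means "on", matching the default in the diff. */
    static boolean useXsoup() {
        String value = System.getProperty(USE_XSOUP);
        return value == null || Boolean.parseBoolean(value);
    }

    /** Stores the flag as the string "true" or "false". */
    static void setUseXsoup(boolean on) {
        System.setProperty(USE_XSOUP, Boolean.toString(on));
    }

    public static void main(String[] args) {
        System.out.println("before: " + useXsoup());
        setUseXsoup(false);
        System.out.println("after:  " + useXsoup());
    }
}
```

Because the flag lives in global JVM state, flipping it affects every spider in the process, and tests that change it should restore the previous value.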
......@@ -2,6 +2,7 @@ package us.codecraft.webmagic.utils;
import org.apache.commons.lang3.StringUtils;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
......@@ -98,15 +99,17 @@ public class UrlUtils {
return stringBuilder.toString();
}
private static final Pattern patternForCharset = Pattern.compile("charset=([^\\s;]*)");
private static final Pattern patternForCharset = Pattern.compile("charset\\s*=\\s*['\"]*([^\\s;'\"]*)");
public static String getCharset(String contentType) {
Matcher matcher = patternForCharset.matcher(contentType);
if (matcher.find()) {
return matcher.group(1);
} else {
return null;
String charset = matcher.group(1);
if (Charset.isSupported(charset)) {
return charset;
}
}
return null;
}
}
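
Note on the new charset pattern: compared with the old one it tolerates whitespace around `=` and quotes around the value, and the result is only returned when the JVM supports that charset. The sketch below copies the pattern from the diff into a standalone class. `CharsetSketch` is a hypothetical name, and the null check and the `try`/`catch` are additions of this example, since `Charset.isSupported` throws `IllegalCharsetNameException` for illegal names such as an empty string.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetSketch {

    // Same pattern as the new line in the diff.
    private static final Pattern PATTERN_FOR_CHARSET =
            Pattern.compile("charset\\s*=\\s*['\"]*([^\\s;'\"]*)");

    /** Returns the charset named in a Content-Type value, or null. */
    static String getCharset(String contentType) {
        if (contentType == null) {
            return null; // guard added in this sketch
        }
        Matcher matcher = PATTERN_FOR_CHARSET.matcher(contentType);
        if (matcher.find()) {
            String charset = matcher.group(1);
            try {
                if (Charset.isSupported(charset)) {
                    return charset;
                }
            } catch (IllegalArgumentException e) {
                // illegal charset name (for example empty): treat as unknown
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(getCharset("text/html; charset=UTF-8"));
        System.out.println(getCharset("text/html; charset = 'utf-8'"));
        System.out.println(getCharset("text/html"));
    }
}
```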
package us.codecraft.webmagic.utils;
import org.junit.Test;
import static junit.framework.Assert.*;
/**
* @author code4crafter@gmail.com
*/
public class EnvironmentUtilTest {
@Test
public void test() {
assertTrue(EnvironmentUtil.useXsoup());
EnvironmentUtil.setUseXsoup(false);
assertFalse(EnvironmentUtil.useXsoup());
}
}
......@@ -3,7 +3,7 @@
<parent>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-parent</artifactId>
<version>0.2.1</version>
<version>0.3.1-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
......
......@@ -34,7 +34,7 @@ class PageModelExtractor {
private List<FieldExtractor> fieldExtractors;
private Extractor extractor;
private Extractor objectExtractor;
public static PageModelExtractor create(Class clazz) {
PageModelExtractor pageModelExtractor = new PageModelExtractor();
......@@ -169,7 +169,7 @@ class PageModelExtractor {
annotation = clazz.getAnnotation(ExtractBy.class);
if (annotation != null) {
ExtractBy extractBy = (ExtractBy) annotation;
extractor = new Extractor(new XpathSelector(extractBy.value()), Extractor.Source.Html, extractBy.notNull(), extractBy.multi());
objectExtractor = new Extractor(new XpathSelector(extractBy.value()), Extractor.Source.Html, extractBy.notNull(), extractBy.multi());
}
}
......@@ -183,28 +183,28 @@ class PageModelExtractor {
if (!matched) {
return null;
}
if (extractor == null) {
return processSingle(page, page.getHtml().toString());
if (objectExtractor == null) {
return processSingle(page, null, false);
} else {
if (extractor.multi) {
if (objectExtractor.multi) {
List<Object> os = new ArrayList<Object>();
List<String> list = extractor.getSelector().selectList(page.getHtml().toString());
List<String> list = objectExtractor.getSelector().selectList(page.getHtml().toString());
for (String s : list) {
Object o = processSingle(page, s);
Object o = processSingle(page, s, false);
if (o != null) {
os.add(o);
}
}
return os;
} else {
String select = extractor.getSelector().select(page.getHtml().toString());
Object o = processSingle(page, select);
String select = objectExtractor.getSelector().select(page.getHtml().toString());
Object o = processSingle(page, select, false);
return o;
}
}
}
private Object processSingle(Page page, String html) {
private Object processSingle(Page page, String html, boolean isRaw) {
Object o = null;
try {
o = clazz.newInstance();
......@@ -213,10 +213,14 @@ class PageModelExtractor {
List<String> value;
switch (fieldExtractor.getSource()) {
case RawHtml:
value = fieldExtractor.getSelector().selectList(page.getHtml().toString());
value = page.getHtml().selectDocumentForList(fieldExtractor.getSelector());
break;
case Html:
if (isRaw) {
value = page.getHtml().selectDocumentForList(fieldExtractor.getSelector());
} else {
value = fieldExtractor.getSelector().selectList(html);
}
break;
case Url:
value = fieldExtractor.getSelector().selectList(page.getUrl().toString());
......@@ -232,10 +236,14 @@ class PageModelExtractor {
String value;
switch (fieldExtractor.getSource()) {
case RawHtml:
value = fieldExtractor.getSelector().select(page.getHtml().toString());
value = page.getHtml().selectDocument(fieldExtractor.getSelector());
break;
case Html:
if (isRaw) {
value = page.getHtml().selectDocument(fieldExtractor.getSelector());
} else {
value = fieldExtractor.getSelector().select(html);
}
break;
case Url:
value = fieldExtractor.getSelector().select(page.getUrl().toString());
......
package us.codecraft.webmagic.pipeline;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.lang3.builder.ToStringBuilder;
import org.apache.log4j.Logger;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.model.HasKey;
import us.codecraft.webmagic.model.PageModelPipeline;
import us.codecraft.webmagic.utils.FilePersistentBase;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
/**
* Store result objects (page models) to files in plain format.<br>
* Use model.key() as the file name if the model implements HasKey.<br>
* Otherwise use the MD5 hash of the object's string form as the file name.
*
* @author code4crafter@gmail.com <br>
* @since 0.3.0
*/
public class FilePageModelPipeline extends FilePersistentBase implements PageModelPipeline {
private Logger logger = Logger.getLogger(getClass());
/**
* Creates a FilePageModelPipeline with the default path "/data/webmagic/"
*/
public FilePageModelPipeline() {
setPath("/data/webmagic/");
}
public FilePageModelPipeline(String path) {
setPath(path);
}
@Override
public void process(Object o, Task task) {
String path = this.path + "/" + task.getUUID() + "/";
try {
String filename;
if (o instanceof HasKey) {
filename = path + ((HasKey) o).key() + ".html";
} else {
filename = path + DigestUtils.md5Hex(ToStringBuilder.reflectionToString(o)) + ".html";
}
PrintWriter printWriter = new PrintWriter(new FileWriter(getFile(filename)));
printWriter.write(ToStringBuilder.reflectionToString(o));
printWriter.close();
} catch (IOException e) {
logger.warn("write file error", e);
}
}
}
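The naming rule in `process` above can be sketched without webmagic or commons-codec on the classpath. In this sketch, `Keyed` is a hypothetical stand-in for `HasKey`, and `md5Hex` is a hand-rolled assumption that replaces `DigestUtils.md5Hex`; neither name is part of the commit.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for us.codecraft.webmagic.model.HasKey.
interface Keyed {
    String key();
}

final class FileNameSketch {

    // Hex-encoded MD5 of the text, standing in for DigestUtils.md5Hex.
    static String md5Hex(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on the Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Use the model's own key when it has one, otherwise hash its string form.
    static String fileNameFor(Object model, String dir) {
        String base = (model instanceof Keyed)
                ? ((Keyed) model).key()
                : md5Hex(String.valueOf(model));
        return dir + base + ".html";
    }
}
```

Models without a stable key get a content-derived name, so two objects with identical string forms would map to the same file.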
......@@ -36,9 +36,11 @@ public class RedisScheduler implements Scheduler {
public synchronized void push(Request request, Task task) {
Jedis jedis = pool.getResource();
try {
// use a Set to deduplicate URLs
if (!jedis.sismember(SET_PREFIX + task.getUUID(), request.getUrl())) {
// use a List to store the queue
// if cycleRetriedTimes is set, allow duplicated.
Object cycleRetriedTimes = request.getExtra(Request.CYCLE_TRIED_TIMES);
// use set to remove duplicate url
if (cycleRetriedTimes != null || !jedis.sismember(SET_PREFIX + task.getUUID(), request.getUrl())) {
// use list to store queue
jedis.rpush(QUEUE_PREFIX + task.getUUID(), request.getUrl());
jedis.sadd(SET_PREFIX + task.getUUID(), request.getUrl());
if (request.getExtras() != null) {
......
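The `push` change in the `RedisScheduler` hunk above lets a request skip the duplicate check when it carries a cycle-retry marker. A minimal in-memory sketch of that rule follows, assuming a plain `HashSet` and `ArrayDeque` in place of the Redis set and list. `DedupQueueSketch` and its `isRetry` flag are hypothetical names for illustration, not webmagic API.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

final class DedupQueueSketch {
    private final Set<String> seen = new HashSet<>();
    private final Queue<String> queue = new ArrayDeque<>();

    // Enqueue the URL unless it was seen before; a retry is always enqueued.
    synchronized boolean push(String url, boolean isRetry) {
        if (isRetry || !seen.contains(url)) {
            queue.add(url);
            seen.add(url);
            return true;
        }
        return false;
    }

    // Next URL in FIFO order, or null when the queue is empty.
    synchronized String poll() {
        return queue.poll();
    }

    synchronized int size() {
        return queue.size();
    }
}
```

Without the retry bypass, a failed page could never be queued again, because its URL is already in the seen set.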
package us.codecraft.webmagic.utils;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.selector.CssSelector;
import us.codecraft.webmagic.selector.RegexSelector;
import us.codecraft.webmagic.selector.Selector;
import us.codecraft.webmagic.selector.XpathSelector;
import us.codecraft.webmagic.selector.*;
import java.util.ArrayList;
import java.util.List;
/**
* Tools for annotation converting. <br>
*
* @author code4crafter@gmail.com <br>
* @since 0.2.1
*/
......@@ -27,9 +25,19 @@ public class ExtractorUtils {
selector = new RegexSelector(value);
break;
case XPath:
selector = new XpathSelector(value);
selector = getXpathSelector(value);
break;
default:
selector = getXpathSelector(value);
}
return selector;
}
private static Selector getXpathSelector(String value) {
Selector selector;
if (EnvironmentUtil.useXsoup()) {
selector = new XsoupSelector(value);
} else {
selector = new XpathSelector(value);
}
return selector;
......@@ -37,7 +45,7 @@ public class ExtractorUtils {
public static List<Selector> getSelectors(ExtractBy[] extractBies) {
List<Selector> selectors = new ArrayList<Selector>();
if (extractBies==null){
if (extractBies == null) {
return selectors;
}
for (ExtractBy extractBy : extractBies) {
......
......@@ -5,7 +5,7 @@
<parent>
<artifactId>webmagic-parent</artifactId>
<groupId>us.codecraft</groupId>
<version>0.2.1</version>
<version>0.3.1-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
......
package us.codecraft.webmagic.model.samples;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.AfterExtractor;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.TargetUrl;
import java.util.List;
/**
* @author yihua.huang@dianping.com <br>
* Date: 13-8-13 <br>
* Time: 10:13 AM <br>
*/
@TargetUrl("http://*.alpha.dp/*")
public class DianpingFtlDataScanner implements AfterExtractor {
@ExtractBy(value = "(DP\\.data\\(\\{.*\\}\\));", type = ExtractBy.Type.Regex, notNull = true, multi = true)
private List<String> data;
public static void main(String[] args) {
OOSpider.create(Site.me().addStartUrl("http://w.alpha.dp/").setSleepTime(0), DianpingFtlDataScanner.class)
.thread(5).run();
}
@Override
public void afterProcess(Page page) {
if (data.size() > 1) {
System.err.println(page.getUrl());
}
if (data.size() > 0 && data.get(0).length() > 100) {
System.err.println(page.getUrl());
}
}
}
......@@ -2,6 +2,7 @@ package us.codecraft.webmagic.samples;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.PlainText;
......@@ -24,7 +25,7 @@ public class DiaoyuwengProcessor implements PageProcessor {
page.addTargetRequests(requests);
if (page.getUrl().toString().contains("thread")){
page.putField("title", page.getHtml().xpath("//a[@id='thread_subject']"));
page.putField("content", page.getHtml().xpath("//div[@class='pcb']//tbody"));
page.putField("content", page.getHtml().xpath("//div[@class='pcb']//tbody/tidyText()"));
page.putField("date",page.getHtml().regex("发表于 (\\d{4}-\\d+-\\d+ \\d+:\\d+:\\d+)"));
page.putField("id",new PlainText("1000"+page.getUrl().regex("http://www\\.diaoyuweng\\.com/thread-(\\d+)-1-1.html").toString()));
}
......@@ -38,4 +39,8 @@ public class DiaoyuwengProcessor implements PageProcessor {
}
return site;
}
public static void main(String[] args) {
Spider.create(new DiaoyuwengProcessor()).run();
}
}
......@@ -2,7 +2,9 @@ package us.codecraft.webmagic.samples;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.RedisScheduler;
import java.util.List;
......@@ -15,14 +17,18 @@ public class F58PageProcesser implements PageProcessor {
@Override
public void process(Page page) {
List<String> strings = page.getHtml().regex("<a[^<>]*href=[\"']{1}(/yewu/.*?)[\"']{1}").all();
List<String> strings = page.getHtml().links().regex(".*/yewu/.*").all();
page.addTargetRequests(strings);
page.putField("title",page.getHtml().regex("<title>(.*)</title>"));
page.putField("body",page.getHtml().xpath("//dd[@class='w133']"));
page.putField("body",page.getHtml().xpath("//dd"));
}
@Override
public Site getSite() {
return Site.me().setDomain("sh.58.com").addStartUrl("http://sh.58.com/"); //To change body of implemented methods use File | Settings | File Templates.
return Site.me().setDomain("sh.58.com").addStartUrl("http://sh1.51a8.com/").setCycleRetryTimes(2); //To change body of implemented methods use File | Settings | File Templates.
}
public static void main(String[] args) {
Spider.create(new F58PageProcesser()).setScheduler(new RedisScheduler("localhost")).run();
}
}
......@@ -2,6 +2,7 @@ package us.codecraft.webmagic.samples;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import java.util.List;
......@@ -14,10 +15,9 @@ import java.util.List;
public class HuxiuProcessor implements PageProcessor {
@Override
public void process(Page page) {
//http://progressdaily.diandian.com/post/2013-01-24/40046867275
List<String> requests = page.getHtml().regex("<a[^<>\"']*href=[\"']{1}([/]{0,1}article[^<>#\"']*?)[\"']{1}").all();
List<String> requests = page.getHtml().links().regex(".*article.*").all();
page.addTargetRequests(requests);
page.putField("title",page.getHtml().xpath("//div[@class='neirong']//h1[@class='ph xs5']"));
page.putField("title",page.getHtml().xpath("//div[@class='clearfix neirong']//h1/text()"));
page.putField("content",page.getHtml().smartContent());
}
......@@ -26,4 +26,8 @@ public class HuxiuProcessor implements PageProcessor {
return Site.me().setDomain("www.huxiu.com").addStartUrl("http://www.huxiu.com/").
setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");
}
public static void main(String[] args) {
Spider.create(new HuxiuProcessor()).run();
}
}
......@@ -4,9 +4,7 @@ import org.apache.commons.collections.CollectionUtils;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.FilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.RedisScheduler;
import java.util.List;
......@@ -41,8 +39,6 @@ public class InfoQMiniBookProcessor implements PageProcessor {
public static void main(String[] args) {
Spider.create(new InfoQMiniBookProcessor())
.scheduler(new RedisScheduler("localhost"))
.pipeline(new FilePipeline("/data/temp/webmagic/"))
.thread(5)
.run();
}
......
......@@ -3,7 +3,6 @@ package us.codecraft.webmagic.samples;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.FilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;
/**
......@@ -32,6 +31,6 @@ public class IteyeBlogProcessor implements PageProcessor {
}
public static void main(String[] args) {
Spider.create(new IteyeBlogProcessor()).thread(5).pipeline(new FilePipeline("/data/webmagic/")).run();
Spider.create(new IteyeBlogProcessor()).thread(5).run();
}
}
......@@ -2,6 +2,7 @@ package us.codecraft.webmagic.samples;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
/**
......@@ -24,4 +25,8 @@ public class KaichibaProcessor implements PageProcessor {
return Site.me().setDomain("kaichiba.com").addStartUrl("http://kaichiba.com/shop/41725781").setCharset("utf-8").
setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");
}
public static void main(String[] args) {
Spider.create(new KaichibaProcessor()).run();
}
}
......@@ -2,6 +2,7 @@ package us.codecraft.webmagic.samples;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import java.util.List;
......@@ -21,8 +22,8 @@ public class MeicanProcessor implements PageProcessor {
}
page.addTargetRequests(requests);
page.addTargetRequests(page.getHtml().links().regex("(.*/restaurant/[^#]+)").all());
page.putField("items", page.getHtml().xpath("//ul[@class=\"dishes menu_dishes\"]/li/span[@class=\"name\"]"));
page.putField("prices", page.getHtml().xpath("//ul[@class=\"dishes menu_dishes\"]/li/span[@class=\"price_outer\"]/span[@class=\"price\"]"));
page.putField("items", page.getHtml().xpath("//ul[@class=\"dishes menu_dishes\"]/li/span[@class=\"name\"]/text()"));
page.putField("prices", page.getHtml().xpath("//ul[@class=\"dishes menu_dishes\"]/li/span[@class=\"price_outer\"]/span[@class=\"price\"]/text()"));
}
@Override
......@@ -30,4 +31,8 @@ public class MeicanProcessor implements PageProcessor {
return Site.me().setDomain("meican.com").addStartUrl("http://www.meican.com/shanghai/districts").setCharset("utf-8").
setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");
}
public static void main(String[] args) {
Spider.create(new MeicanProcessor()).run();
}
}
package us.codecraft.webmagic.samples;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import java.util.List;
......@@ -21,8 +20,8 @@ public class OschinaBlogPageProcesser implements PageProcessor {
public void process(Page page) {
List<String> links = page.getHtml().links().regex("http://my\\.oschina\\.net/flashsword/blog/\\d+").all();
page.addTargetRequests(links);
page.putField("title", page.getHtml().xpath("//div[@class='BlogEntity']/div[@class='BlogTitle']/h1").toString());
page.putField("content", page.getHtml().$("div.content").toString());
page.putField("title", page.getHtml().xpath("//div[@class='BlogEntity']/div[@class='BlogTitle']/h1/text()").toString());
page.putField("content", page.getHtml().xpath("//div[@class='BlogContent']/tidyText()").toString());
page.putField("tags",page.getHtml().xpath("//div[@class='BlogTags']/a/text()").all());
}
......@@ -33,6 +32,6 @@ public class OschinaBlogPageProcesser implements PageProcessor {
}
public static void main(String[] args) {
Spider.create(new OschinaBlogPageProcesser()).pipeline(new ConsolePipeline()).run();
Spider.create(new OschinaBlogPageProcesser()).run();
}
}
package us.codecraft.webmagic.samples.scheduler;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.scheduler.PriorityScheduler;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;
/**
* @author code4crafter@gmail.com
*/
public class DelayQueueScheduler extends PriorityScheduler {
private DelayQueue<RequestWrapper> queue = new DelayQueue<RequestWrapper>();
private Set<String> urls = new HashSet<String>();
private long time;
private TimeUnit timeUnit;
private class RequestWrapper implements Delayed {
private long startTime = System.currentTimeMillis();
private Request request;
private RequestWrapper(Request request) {
this.request = request;
}
private long getStartTime() {
return startTime;
}
private Request getRequest() {
return request;
}
@Override
public long getDelay(TimeUnit unit) {
long convert = unit.convert(TimeUnit.MILLISECONDS.convert(time, timeUnit) - System.currentTimeMillis() + startTime, TimeUnit.MILLISECONDS);
return convert;
}
@Override
public int compareTo(Delayed o) {
return new Long(getDelay(TimeUnit.MILLISECONDS)).compareTo(o.getDelay(TimeUnit.MILLISECONDS));
}
}
public DelayQueueScheduler(long time, TimeUnit timeUnit) {
this.time = time;
this.timeUnit = timeUnit;
}
@Override
public synchronized void push(Request request, Task task) {
if (urls.add(request.getUrl())) {
queue.add(new RequestWrapper(request));
}
}
@Override
public synchronized Request poll(Task task) {
RequestWrapper take = null;
while (take == null) {
try {
take = queue.take();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
queue.add(new RequestWrapper(take.getRequest()));
return take.getRequest();
}
}
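The core of `DelayQueueScheduler` is a `Delayed` wrapper whose remaining delay decides when `DelayQueue` releases it. A self-contained sketch of that mechanism, using only `java.util.concurrent`, might look like this. `DelayedUrl` and `DelayQueueSketch` are hypothetical names, and unlike the class above the sketch polls without blocking and does not re-queue elements.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

final class DelayedUrl implements Delayed {
    final String url;
    private final long readyAtMillis;

    DelayedUrl(String url, long delay, TimeUnit unit) {
        this.url = url;
        this.readyAtMillis = System.currentTimeMillis() + unit.toMillis(delay);
    }

    // Remaining time until the element may be taken; zero or negative means ready.
    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(readyAtMillis - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    // Elements with less remaining delay sort first.
    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }
}

final class DelayQueueSketch {
    private final DelayQueue<DelayedUrl> queue = new DelayQueue<>();

    void push(String url, long delay, TimeUnit unit) {
        queue.add(new DelayedUrl(url, delay, unit));
    }

    // Non-blocking: returns null when no element's delay has expired yet.
    String pollReady() {
        DelayedUrl head = queue.poll();
        return head == null ? null : head.url;
    }
}
```

`Long.compare` avoids the boxing that `new Long(...).compareTo(...)` incurs, which is one reason to prefer it in a comparator like this.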
package us.codecraft.webmagic.samples.scheduler;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.scheduler.PriorityScheduler;
/**
* @author code4crafter@gmail.com
*/
public class LevelLimitScheduler extends PriorityScheduler {
private int levelLimit = 3;
public LevelLimitScheduler(int levelLimit) {
this.levelLimit = levelLimit;
}
@Override
public synchronized void push(Request request, Task task) {
if (((Integer) request.getExtra("_level")) <= levelLimit) {
super.push(request, task);
}
}
}
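`LevelLimitScheduler` casts the `_level` extra directly, which would throw a `NullPointerException` during unboxing if a request has no level attached. A hedged sketch of a null-tolerant check follows, using a plain `Map` in place of the request extras. `LevelFilterSketch`, and the choice to treat a missing level as depth 0, are assumptions for illustration and not behaviour of the commit.

```java
import java.util.Map;

final class LevelFilterSketch {
    private final int levelLimit;

    LevelFilterSketch(int levelLimit) {
        this.levelLimit = levelLimit;
    }

    // Accept a request when its depth is within the limit.
    // Assumption: a missing or non-integer "_level" counts as depth 0.
    boolean accepts(Map<String, Object> extras) {
        Object level = (extras == null) ? null : extras.get("_level");
        int depth = (level instanceof Integer) ? (Integer) level : 0;
        return depth <= levelLimit;
    }
}
```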
package us.codecraft.webmagic.samples.scheduler;
import org.apache.commons.lang3.StringUtils;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.PriorityScheduler;
import java.util.List;
import static us.codecraft.webmagic.selector.Selectors.regex;
import static us.codecraft.webmagic.selector.Selectors.xpath;
/**
* @author code4crafter@gmail.com
*/
public class ZipCodePageProcessor implements PageProcessor {
private Site site = Site.me().setCharset("gb2312")
.setSleepTime(100).addStartUrl("http://www.ip138.com/post/");
@Override
public void process(Page page) {
if (page.getUrl().toString().equals("http://www.ip138.com/post/")) {
processCountry(page);
} else if (page.getUrl().regex("http://www\\.ip138\\.com/post/\\w+[/]?$").toString() != null) {
processProvince(page);
} else {
processDistrict(page);
}
}
private void processCountry(Page page) {
List<String> provinces = page.getHtml().xpath("//*[@id=\"newAlexa\"]/table/tbody/tr/td").all();
for (String province : provinces) {
String link = xpath("//@href").select(province);
String title = xpath("/text()").select(province);
Request request = new Request(link).setPriority(0).putExtra("province", title);
page.addTargetRequest(request);
}
}
private void processProvince(Page page) {
// XPath alone cannot pinpoint the target here, so a regex is used as a filter; entries that do not match it are dropped
List<String> districts = page.getHtml().xpath("//body/table/tbody/tr/td").regex(".*http://www\\.ip138\\.com/post/\\w+/\\w+.*").all();
for (String district : districts) {
String link = xpath("//@href").select(district);
String title = xpath("/text()").select(district);
Request request = new Request(link).setPriority(1).putExtra("province", page.getRequest().getExtra("province")).putExtra("district", title);
page.addTargetRequest(request);
}
}
private void processDistrict(Page page) {
String province = page.getRequest().getExtra("province").toString();
String district = page.getRequest().getExtra("district").toString();
List<String> counties = page.getHtml().xpath("//body/table/tbody/tr").regex(".*<td>\\d+</td>.*").all();
String regex = "<td[^<>]*>([^<>]+)</td><td[^<>]*>([^<>]+)</td><td[^<>]*>([^<>]+)</td><td[^<>]*>([^<>]+)</td>";
for (String county : counties) {
String county0 = regex(regex, 1).select(county);
String county1 = regex(regex, 2).select(county);
String zipCode = regex(regex, 3).select(county);
page.putField("result", StringUtils.join(new String[]{province, district,
county0, county1, zipCode}, "\t"));
}
List<String> links = page.getHtml().links().regex("http://www\\.ip138\\.com/post/\\w+/\\w+").all();
for (String link : links) {
page.addTargetRequest(new Request(link).setPriority(2).putExtra("province", province).putExtra("district", district));
}
}
@Override
public Site getSite() {
return site;
}
public static void main(String[] args) {
Spider.create(new ZipCodePageProcessor()).scheduler(new PriorityScheduler()).run();
PriorityScheduler scheduler = new PriorityScheduler();
Spider spider = Spider.create(new ZipCodePageProcessor()).scheduler(scheduler);
scheduler.push(new Request("http://www.baidu.com/s?wd=webmagic&f=12&rsp=0&oq=webmagix&tn=baiduhome_pg&ie=utf-8"),spider);
spider.run();
}
}
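The row-parsing step in `processDistrict` applies one regex and reads a different capture group on each call. The same idea can be sketched with `java.util.regex` alone, matching once and reading all groups. `RowRegexSketch` is a hypothetical name, and any sample row fed to it is invented for illustration rather than taken from the crawled site.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RowRegexSketch {
    // Four consecutive <td> cells, each captured as one group.
    private static final Pattern ROW = Pattern.compile(
            "<td[^<>]*>([^<>]+)</td><td[^<>]*>([^<>]+)</td>"
            + "<td[^<>]*>([^<>]+)</td><td[^<>]*>([^<>]+)</td>");

    // Returns the four cell values, or null when no four-cell run is found.
    static String[] cells(String rowHtml) {
        Matcher m = ROW.matcher(rowHtml);
        if (!m.find()) {
            return null;
        }
        return new String[]{m.group(1), m.group(2), m.group(3), m.group(4)};
    }
}
```

Matching once and reading every group avoids re-running the pattern per column, at the cost of departing from the selector-per-group style used in the sample.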
package us.codecraft.webmagic.samples.scheduler;
import org.junit.Ignore;
import org.junit.Test;
import us.codecraft.webmagic.Request;
import java.util.concurrent.TimeUnit;
/**
* @author code4crafter@gmail.com
*/
public class DelayQueueSchedulerTest {
@Ignore("infinite")
@Test
public void test() {
DelayQueueScheduler delayQueueScheduler = new DelayQueueScheduler(1, TimeUnit.SECONDS);
delayQueueScheduler.push(new Request("1"), null);
while (true){
Request poll = delayQueueScheduler.poll(null);
System.out.println(System.currentTimeMillis()+"\t"+poll);
}
}
}
......@@ -17,6 +17,11 @@
<artifactId>webmagic-core</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>xsoup</artifactId>
<version>0.0.1-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>net.sf.saxon</groupId>
<artifactId>Saxon-HE</artifactId>
......
package us.codecraft.webmagic.selector;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Assert;
import org.junit.Ignore;
import org.junit.Test;
import us.codecraft.xsoup.XPathEvaluator;
import us.codecraft.xsoup.Xsoup;
/**
* @author code4crafter@gmail.com <br> Date: 13-4-21 Time: 10:06 AM
......@@ -1353,6 +1360,7 @@ public class XpathSelectorTest {
Html html1 = new Html(html);
Assert.assertEquals("再次吐槽easyui", html1.xpath(".//*[@class='QTitle']/h1/a").toString());
Assert.assertNotNull(html1.$("a[href]").xpath("//@href").all());
Selectors.xpath("/abc/").select("");
}
@Test
......@@ -1379,17 +1387,86 @@ public class XpathSelectorTest {
xpath2Selector.selectList(html);
}
System.out.println(System.currentTimeMillis()-time);
XpathSelector xpathSelector = new XpathSelector("//a");
time =System.currentTimeMillis();
for (int i = 0; i < 1000; i++) {
xpathSelector.selectList(html);
}
System.out.println(System.currentTimeMillis()-time);
time =System.currentTimeMillis();
for (int i = 0; i < 1000; i++) {
xpath2Selector.selectList(html);
}
System.out.println(System.currentTimeMillis() - time);
CssSelector cssSelector = new CssSelector("a");
time =System.currentTimeMillis();
for (int i = 0; i < 1000; i++) {
cssSelector.selectList(html);
}
System.out.println("css "+(System.currentTimeMillis()-time));
}
@Ignore("take long time")
@Test
public void parserPerformanceTest() throws XPatherException {
System.out.println(html.length());
HtmlCleaner htmlCleaner = new HtmlCleaner();
TagNode tagNode = htmlCleaner.clean(html);
Document document = Jsoup.parse(html);
long time =System.currentTimeMillis();
for (int i = 0; i < 2000; i++) {
htmlCleaner.clean(html);
}
System.out.println(System.currentTimeMillis()-time);
time =System.currentTimeMillis();
for (int i = 0; i < 2000; i++) {
tagNode.evaluateXPath("//a");
}
System.out.println(System.currentTimeMillis()-time);
System.out.println("=============");
time =System.currentTimeMillis();
for (int i = 0; i < 2000; i++) {
Jsoup.parse(html);
}
System.out.println(System.currentTimeMillis()-time);
time =System.currentTimeMillis();
for (int i = 0; i < 2000; i++) {
document.select("a");
}
System.out.println(System.currentTimeMillis()-time);
System.out.println("=============");
time =System.currentTimeMillis();
for (int i = 0; i < 2000; i++) {
htmlCleaner.clean(html);
}
System.out.println(System.currentTimeMillis()-time);
time =System.currentTimeMillis();
for (int i = 0; i < 2000; i++) {
tagNode.evaluateXPath("//a");
}
System.out.println(System.currentTimeMillis()-time);
System.out.println("=============");
XPathEvaluator compile = Xsoup.compile("//a");
time =System.currentTimeMillis();
for (int i = 0; i < 2000; i++) {
compile.evaluate(document);
}
System.out.println(System.currentTimeMillis()-time);
}
}
......@@ -29,23 +29,18 @@ Java crawler **Spiderman** [https://gitcafe.com/laiweiwei/Spiderman](https://gitca
### Using Maven
webmagic uses Maven to manage dependencies. You can download the webmagic source and build it directly:
git clone https://github.com/code4craft/webmagic.git
cd webmagic
mvn clean install
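After installing, add the corresponding dependencies to your project to use webmagic:
webmagic uses Maven to manage dependencies. Add the corresponding dependencies to your project to use webmagic: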
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
<version>0.2.0</version>
<version>0.3.0</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.2.0</version>
<version>0.3.0</version>
</dependency>
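#### Project structure
......@@ -60,7 +55,7 @@ webmagic mainly consists of two packages:
The extension module of webmagic, which provides tools that make writing crawlers more convenient, including support for annotation-based crawler definitions, JSON, and distributed crawling.
webmagic also includes two optional extension packages. Because both depend on relatively heavyweight tools, they are separated from the main packages:
webmagic also includes two optional extension packages. Because both depend on relatively heavyweight tools, they are separated from the main packages; you need to download the source and build these packages yourself: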
* **webmagic-saxon**
......