Commit 04ade756 authored by yihua.huang

Merge branch 'stable' of github.com:code4craft/webmagic

Conflicts:
	README.md
	pom.xml
	webmagic-avalon/pom.xml
	webmagic-core/pom.xml
	webmagic-extension/pom.xml
	webmagic-lucene/pom.xml
	webmagic-samples/pom.xml
	webmagic-saxon/pom.xml
	webmagic-scripts/pom.xml
	webmagic-selenium/pom.xml
parents 74fdd1df 42a2676e
......@@ -2,4 +2,5 @@ target
*.iml
out/
.idea
.classpath
.project
![logo](https://raw.github.com/code4craft/webmagic/master/assets/logo.jpg)
![logo](https://raw.github.com/code4craft/webmagic/master/asserts/logo.jpg)
[![Build Status](https://travis-ci.org/code4craft/webmagic.png?branch=master)](https://travis-ci.org/code4craft/webmagic)
......@@ -175,6 +175,7 @@ webmagic is licensed under the [Apache 2.0 license](http://opensource.org/licenses/Apache-2.0)
The following people have contributed code or issues to WebMagic:
* [ccliangbo](https://github.com/ccliangbo)
* [yuany](https://github.com/yuany)
* [yxssfxwzy](https://github.com/yxssfxwzy)
* [linkerlin](https://github.com/linkerlin)
......@@ -188,8 +189,9 @@ webmagic is licensed under the [Apache 2.0 license](http://opensource.org/licenses/Apache-2.0)
* [ywooer](https://github.com/ywooer)
* [yyw258520](https://github.com/yyw258520)
* [perfecking](https://github.com/perfecking)
* [ccliangbo](https://github.com/ccliangbo)
* [lidongyang](http://my.oschina.net/lidongyang)
* [seveniu](https://github.com/seveniu)
* [sebastian1118](https://github.com/sebastian1118)
### Mailing list:
......@@ -201,4 +203,4 @@ QQ:
### QQ group:
330192938
373225642
<mockup version="1.0" skin="sketch" fontFace="Balsamiq Sans" measuredW="1154" measuredH="470" mockupW="709" mockupH="470">
<controls>
<control controlID="0" controlTypeID="com.balsamiq.mockups::BrowserWindow" x="445" y="0" w="709" h="470" measuredW="450" measuredH="400" zOrder="0" locked="false" isInGroup="-1">
<controlProperties>
<text>A%20Web%20Page%0Ahttp%3A//</text>
</controlProperties>
</control>
</controls>
</mockup>
\ No newline at end of file
webmagic
---
![logo](https://raw.github.com/code4craft/webmagic/master/asserts/logo.jpg)
[Readme in Chinese](https://github.com/code4craft/webmagic/tree/master/zh_docs)
[User Manual (Chinese)](https://github.com/code4craft/webmagic/blob/master/user-manual.md)
[![Build Status](https://travis-ci.org/code4craft/webmagic.png?branch=master)](https://travis-ci.org/code4craft/webmagic)
>A scalable crawler framework. It covers the whole lifecycle of crawler: downloading, url management, content extraction and persistent. It can simply the development of a specific crawler.
>A scalable crawler framework. It covers the whole lifecycle of crawler: downloading, url management, content extraction and persistent. It can simplify the development of a specific crawler.
## Features:
......@@ -14,26 +17,19 @@ webmagic
* Multi-thread and Distribution support.
* Easy to be integrated.
## Install:
Clone the repo and build:
git clone https://github.com/code4craft/webmagic.git
cd webmagic
mvn clean install
Add dependencies to your project:
Add dependencies to your pom.xml:
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
<version>0.4.2</version>
<version>0.4.3</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.4.2</version>
<version>0.4.3</version>
</dependency>
## Get Started:
......@@ -42,10 +38,10 @@ Add dependencies to your project:
Write a class implements PageProcessor:
```java
public class OschinaBlogPageProcesser implements PageProcessor {
private Site site = Site.me().setDomain("my.oschina.net")
.addStartUrl("http://my.oschina.net/flashsword/blog");
private Site site = Site.me().setDomain("my.oschina.net");
@Override
public void process(Page page) {
......@@ -63,10 +59,11 @@ Write a class implements PageProcessor:
}
public static void main(String[] args) {
Spider.create(new OschinaBlogPageProcesser())
.pipeline(new ConsolePipeline()).run();
Spider.create(new OschinaBlogPageProcesser()).addUrl("http://my.oschina.net/flashsword/blog")
.addPipeline(new ConsolePipeline()).run();
}
}
```
* `page.addTargetRequests(links)`
......@@ -74,6 +71,7 @@ Write a class implements PageProcessor:
You can also use annotation way:
```java
@TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
public class OschinaBlog {
......@@ -88,10 +86,11 @@ You can also use annotation way:
public static void main(String[] args) {
OOSpider.create(
Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"),
new ConsolePageModelPipeline(), OschinaBlog.class).run();
Site.me(),
new ConsolePageModelPipeline(), OschinaBlog.class).addUrl("http://my.oschina.net/flashsword/blog").run();
}
}
```
### Docs and samples:
......@@ -103,11 +102,30 @@ Javadocs: [http://code4craft.github.io/webmagic/docs/en/](http://code4craft.gith
There are some samples in `webmagic-samples` package.
### License:
Licensed under the [Apache 2.0 license](http://opensource.org/licenses/Apache-2.0)
### Contributors:
Thanks to these people for committing source code, reporting bugs, or suggesting new features:
* [yuany](https://github.com/yuany)
* [yxssfxwzy](https://github.com/yxssfxwzy)
* [linkerlin](https://github.com/linkerlin)
* [d0ngw](https://github.com/d0ngw)
* [xuchaoo](https://github.com/xuchaoo)
* [supermicah](https://github.com/supermicah)
* [SimpleExpress](https://github.com/SimpleExpress)
* [aruanruan](https://github.com/aruanruan)
* [l1z2g9](https://github.com/l1z2g9)
* [zhegexiaohuozi](https://github.com/zhegexiaohuozi)
* [ywooer](https://github.com/ywooer)
* [yyw258520](https://github.com/yyw258520)
* [perfecking](https://github.com/perfecking)
* [lidongyang](http://my.oschina.net/lidongyang)
### Thanks:
To write webmagic, I refered to the projects below :
......@@ -124,3 +142,10 @@ To write webmagic, I refered to the projects below :
[https://gitcafe.com/laiweiwei/Spiderman](https://gitcafe.com/laiweiwei/Spiderman)
### Mail-list:
[https://groups.google.com/forum/#!forum/webmagic-java](https://groups.google.com/forum/#!forum/webmagic-java)
[![Bitdeli Badge](https://d2weczhvl823v0.cloudfront.net/code4craft/webmagic/trend.png)](https://bitdeli.com/free "Bitdeli Badge")
......@@ -6,7 +6,7 @@
<version>7</version>
</parent>
<groupId>us.codecraft</groupId>
<version>0.4.3</version>
<version>0.5.0</version>
<modelVersion>4.0.0</modelVersion>
<packaging>pom</packaging>
<properties>
......@@ -51,11 +51,10 @@
<module>webmagic-core</module>
<module>webmagic-extension/</module>
<module>webmagic-scripts/</module>
<module>webmagic-avalon</module>
<module>webmagic-lucene</module>
<module>webmagic-samples</module>
<module>webmagic-saxon</module>
<module>webmagic-selenium</module>
<module>webmagic-saxon</module>
<module>webmagic-samples</module>
<module>webmagic-avalon</module>
</modules>
<dependencyManagement>
......@@ -63,7 +62,7 @@
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.7</version>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
......@@ -89,12 +88,7 @@
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>xsoup</artifactId>
<version>0.2.0</version>
</dependency>
<dependency>
<groupId>net.sf.saxon</groupId>
<artifactId>Saxon-HE</artifactId>
<version>9.5.1-1</version>
<version>0.2.2</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
......@@ -121,11 +115,6 @@
<artifactId>commons-collections</artifactId>
<version>3.2.1</version>
</dependency>
<dependency>
<groupId>net.sourceforge.htmlcleaner</groupId>
<artifactId>htmlcleaner</artifactId>
<version>2.5</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-io</artifactId>
......@@ -136,6 +125,12 @@
<artifactId>jsoup</artifactId>
<version>1.7.2</version>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-all</artifactId>
<version>1.9.5</version>
<scope>test</scope>
</dependency>
</dependencies>
</dependencyManagement>
......@@ -159,26 +154,26 @@
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<version>2.8</version>
<executions>
<execution>
<id>copy-dependencies</id>
<phase>package</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<outputDirectory>${project.build.directory}/lib</outputDirectory>
<overWriteReleases>false</overWriteReleases>
<overWriteSnapshots>false</overWriteSnapshots>
<overWriteIfNewer>true</overWriteIfNewer>
</configuration>
</execution>
</executions>
</plugin>
<!--<plugin>-->
<!--<groupId>org.apache.maven.plugins</groupId>-->
<!--<artifactId>maven-dependency-plugin</artifactId>-->
<!--<version>2.8</version>-->
<!--<executions>-->
<!--<execution>-->
<!--<id>copy-dependencies</id>-->
<!--<phase>package</phase>-->
<!--<goals>-->
<!--<goal>copy-dependencies</goal>-->
<!--</goals>-->
<!--<configuration>-->
<!--<outputDirectory>${project.build.directory}/lib</outputDirectory>-->
<!--<overWriteReleases>false</overWriteReleases>-->
<!--<overWriteSnapshots>false</overWriteSnapshots>-->
<!--<overWriteIfNewer>true</overWriteIfNewer>-->
<!--</configuration>-->
<!--</execution>-->
<!--</executions>-->
<!--</plugin>-->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-resources-plugin</artifactId>
......@@ -187,6 +182,15 @@
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<excludes>
<exclude>log4j.xml</exclude>
</excludes>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-source-plugin</artifactId>
......
......@@ -65,7 +65,7 @@ webmagic also includes two usable extension packages; because both of them depend on relatively
git clone http://git.oschina.net/flashsword20/webmagic.git
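The **bin/lib** directory contains all the jars the project depends on; just import them in your IDE.
The **lib** directory contains all the jars the project depends on; just import them in your IDE.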
--------
......
WebMagic-Avalon
========
> Spiders Manage Web
see [#issue43](https://github.com/code4craft/webmagic/issues/43)
\ No newline at end of file
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
forger
======
Dynamic Java object generator with template class and configuration.
## Compiler
Use groovy compiler. Compile source code to Java class.
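As a rough illustration of the "source code in, class out" step, the sketch below compiles a source string at runtime and loads the resulting class. It is not forger's implementation: `GroovyForgerCompiler` delegates to Groovy's `GroovyClassLoader.parseClass`, whereas this sketch swaps in the JDK's built-in `javax.tools.JavaCompiler` so that it runs without extra dependencies (a JDK rather than a bare JRE is assumed). The names `SourceCompilerSketch`, `compile`, and `call` are hypothetical.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; forger itself compiles with Groovy instead.
class SourceCompilerSketch {

    // Writes the source to a temp directory, compiles it, and loads the class.
    static Class<?> compile(String className, String sourceCode) {
        try {
            Path dir = Files.createTempDirectory("forger-sketch");
            Path file = dir.resolve(className + ".java");
            Files.write(file, sourceCode.getBytes(StandardCharsets.UTF_8));

            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            if (compiler == null) {
                throw new IllegalStateException("No system compiler found; a JDK is required");
            }
            // A non-zero status means the source did not compile.
            int status = compiler.run(null, null, null, "-d", dir.toString(), file.toString());
            if (status != 0) {
                throw new IllegalArgumentException("Compilation failed for " + className);
            }
            URLClassLoader loader = new URLClassLoader(new URL[]{dir.toUri().toURL()});
            return loader.loadClass(className);
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Instantiates the class and invokes a public no-arg method by name.
    static Object call(Class<?> clazz, String methodName) {
        try {
            Object instance = clazz.getDeclaredConstructor().newInstance();
            return clazz.getMethod(methodName).invoke(instance);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String source = "public class Greeting { public String hello() { return \"hi\"; } }";
        Class<?> clazz = compile("Greeting", source);
        System.out.println(call(clazz, "hello"));
    }
}
```

Groovy's class loader parses the source in memory, so the temp-directory round trip here is purely a consequence of the swapped-in compiler.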
## PropertyLoader
Load properties of object from user input.
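A minimal sketch of what loading properties into an object can look like, assuming annotation-marked fields and string-like input. `SketchInject`, `CrawlerConfig`, and `PropertyLoaderSketch` are hypothetical names used only for illustration; forger's own `AnnotationPropertyLoader` additionally handles lists, maps, and pluggable formatters.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for forger's @Inject annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface SketchInject {
    String value() default "";
}

// Hypothetical target class; not part of forger.
class CrawlerConfig {
    @SketchInject
    String domain;

    @SketchInject("threads")
    int threadNum;

    String untouched = "default";
}

class PropertyLoaderSketch {

    // Copies config values into every @SketchInject field of the target.
    static <T> T load(T target, Map<String, Object> config) {
        for (Field field : target.getClass().getDeclaredFields()) {
            SketchInject inject = field.getAnnotation(SketchInject.class);
            if (inject == null) {
                continue; // fields without the annotation are left alone
            }
            // The annotation value, when present, overrides the field name as key.
            String key = inject.value().isEmpty() ? field.getName() : inject.value();
            Object raw = config.get(key);
            if (raw == null) {
                throw new IllegalArgumentException("Config for property " + key + " is missing!");
            }
            field.setAccessible(true);
            try {
                field.set(target, convert(String.valueOf(raw), field.getType()));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot set field " + field.getName(), e);
            }
        }
        return target;
    }

    // Converts textual input to a few basic field types; anything else stays a String.
    static Object convert(String text, Class<?> type) {
        if (type == int.class || type == Integer.class) {
            return Integer.valueOf(text);
        }
        if (type == boolean.class || type == Boolean.class) {
            return Boolean.valueOf(text);
        }
        return text;
    }

    public static void main(String[] args) {
        Map<String, Object> config = new HashMap<String, Object>();
        config.put("domain", "my.oschina.net");
        config.put("threads", "5");
        CrawlerConfig loaded = load(new CrawlerConfig(), config);
        System.out.println(loaded.domain + " " + loaded.threadNum);
    }
}
```

Failing fast on a missing key mirrors the behavior of `AbstractPropertyLoader.load`, which throws an `IllegalArgumentException` when a property has no config entry.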
## API
```java
@Test
public void testForgerCreateByClassAnnotationCompile() throws Exception {
ForgerFactory forgerFactory = new ForgerFactory(new AnnotationPropertyLoader(), new GroovyForgerCompiler());
Forger<Fooable> forger = forgerFactory.<Fooable>compile(Foo.SOURCE_CODE);
Fooable foo = forger.forge(ImmutableMap.<String, Object>of("fooa", "test"));
Field field = forger.getClazz().getDeclaredField("foo");
field.setAccessible(true);
assertThat(field.get(foo)).isEqualTo("test");
assertThat(foo.foo()).isEqualTo("test");
}
```
\ No newline at end of file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<groupId>org.sonatype.oss</groupId>
<artifactId>oss-parent</artifactId>
<version>7</version>
</parent>
<groupId>us.codecraft</groupId>
<artifactId>forger</artifactId>
<version>0.1.1-SNAPSHOT</version>
<modelVersion>4.0.0</modelVersion>
<packaging>jar</packaging>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
</properties>
<name>forger</name>
<description>
Dynamic Java object generator with template class and configuration.
</description>
<url>https://github.com/code4craft/forger/</url>
<developers>
<developer>
<id>code4craft</id>
<name>Yihua huang</name>
<email>code4crafer@gmail.com</email>
</developer>
</developers>
<scm>
<connection>scm:git:git@github.com:code4craft/forger.git</connection>
<developerConnection>scm:git:git@github.com:code4craft/forger.git</developerConnection>
<url>git@github.com:code4craft/forger.git</url>
<tag>HEAD</tag>
</scm>
<licenses>
<license>
<name>Apache License,Version 2</name>
<url>http://www.apache.org/licenses/LICENSE-2.0</url>
<distribution>repo</distribution>
</license>
</licenses>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.assertj</groupId>
<artifactId>assertj-core</artifactId>
<version>1.5.0</version>
</dependency>
<dependency>
<groupId>org.codehaus.groovy</groupId>
<artifactId>groovy</artifactId>
<version>2.2.2</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.6</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.6</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.1</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>15.0</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.6</source>
<target>1.6</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<version>2.8</version>
<executions>
<execution>
<id>copy-dependencies</id>
<phase>package</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<outputDirectory>${project.build.directory}/lib</outputDirectory>
<overWriteReleases>false</overWriteReleases>
<overWriteSnapshots>false</overWriteSnapshots>
<overWriteIfNewer>true</overWriteIfNewer>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-resources-plugin</artifactId>
<version>2.6</version>
<configuration>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-source-plugin</artifactId>
<version>2.2.1</version>
<executions>
<execution>
<id>attach-sources</id>
<goals>
<goal>jar</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-javadoc-plugin</artifactId>
<version>2.9.1</version>
<configuration>
<encoding>UTF-8</encoding>
</configuration>
<executions>
<execution>
<id>attach-javadocs</id>
<goals>
<goal>jar</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-release-plugin</artifactId>
<version>2.4.1</version>
</plugin>
</plugins>
</build>
<profiles>
<profile>
<id>release-sign-artifacts</id>
<activation>
<property>
<name>performRelease</name>
<value>true</value>
</property>
</activation>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-gpg-plugin</artifactId>
<version>1.1</version>
<executions>
<execution>
<id>sign-artifacts</id>
<phase>verify</phase>
<goals>
<goal>sign</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>
</profiles>
</project>
package us.codecraft.forger;
import us.codecraft.forger.property.Property;
import us.codecraft.forger.property.PropertyLoader;
import java.util.List;
import java.util.Map;
/**
* @author code4crafter@gmail.com
*/
public class Forger<T> {
private final Class<T> clazz;
private final PropertyLoader propertyLoader;
public Forger(Class<T> clazz,PropertyLoader propertyLoader) {
this.clazz = clazz;
this.propertyLoader = propertyLoader;
}
public T forge(Map<String, Object> properties) throws IllegalAccessException, InstantiationException {
T t = clazz.newInstance();
propertyLoader.load(t, properties);
return t;
}
public List<Property> getPropertyNames() {
return propertyLoader.getProperties(clazz);
}
public Class<T> getClazz() {
return clazz;
}
}
package us.codecraft.forger;
import us.codecraft.forger.compiler.ForgerCompiler;
import us.codecraft.forger.property.PropertyLoader;
/**
* @author code4crafter@gmail.com
*/
public class ForgerFactory {
private final PropertyLoader propertyLoader;
private final ForgerCompiler forgerCompiler;
public ForgerFactory(PropertyLoader propertyLoader, ForgerCompiler forgerCompiler) {
this.propertyLoader = propertyLoader;
this.forgerCompiler = forgerCompiler;
}
public <T> Forger<T> compile(String sourceCode) {
Class clazz = forgerCompiler.compile(sourceCode);
return new Forger(clazz, propertyLoader);
}
public <T> Forger<T> create(Class clazz) {
return new Forger(clazz, propertyLoader);
}
}
package us.codecraft.forger.compiler;
/**
* @author code4crafter@gmail.com
*/
public interface ForgerCompiler {
public Class compile(String sourceCode);
}
package us.codecraft.forger.compiler;
import groovy.lang.GroovyClassLoader;
/**
* @author code4crafter@gmail.com
*/
public class GroovyForgerCompiler implements ForgerCompiler{
private GroovyClassLoader groovyClassLoader = new GroovyClassLoader();
@Override
public Class compile(String sourceCode) {
return groovyClassLoader.parseClass(sourceCode);
}
}
package us.codecraft.forger.property;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.forger.property.format.*;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
/**
* @author code4crafter@gmail.com
*/
public abstract class AbstractPropertyLoader implements PropertyLoader {
private TypeFormatterFactory typeFormatterFactory = new TypeFormatterFactory();
protected Logger logger = LoggerFactory.getLogger(getClass());
protected TypeFormatterFactory getTypeFormatterFactory() {
return typeFormatterFactory;
}
@Override
public <T> T load(T object, Map<String, Object> propertyConfigs) {
List<Property> properties = getProperties(object.getClass());
for (Property property : properties) {
Object value = propertyConfigs.get(property.getName());
if (value == null) {
throw new IllegalArgumentException("Config for property " + property.getName() + " is missing!");
}
ObjectFormatter objectFormatter = property.getObjectFormatter();
switch (property.getType()) {
case PropertyString:
Object fieldValue = objectFormatter.format(String.valueOf(value));
try {
property.getField().set(object, fieldValue);
} catch (IllegalAccessException e) {
logger.warn("Set field " + property.getField() + " error!", e);
}
break;
case PropertyList:
if (!List.class.isAssignableFrom(value.getClass())) {
throw new IllegalArgumentException("Config for property " + property.getName() + " should be subclass of List!");
}
List listField = new ArrayList();
List<String> listConfigs = (List) value;
for (String listConfig : listConfigs) {
listField.add(objectFormatter.format(listConfig));
}
try {
property.getField().set(object, listField);
} catch (IllegalAccessException e) {
logger.warn("Set field " + property.getField() + " error!", e);
}
break;
case PropertyMap:
if (!Map.class.isAssignableFrom(value.getClass())) {
throw new IllegalArgumentException("Config for property " + property.getName() + " should be subclass of Map!");
}
Map mapField = new HashMap();
Map<String, String> mapConfigs = (Map<String, String>) value;
for (Map.Entry<String, String> entry : mapConfigs.entrySet()) {
mapField.put(entry.getKey(), objectFormatter.format(entry.getValue()));
}
try {
property.getField().set(object, mapField);
} catch (IllegalAccessException e) {
logger.warn("Set field " + property.getField() + " error!", e);
}
break;
}
}
return object;
}
protected ObjectFormatter prepareTypeFormatterParam(TypeFormatter objectFormatter, String[] params) {
if (params == null) {
return objectFormatter;
}
return new ObjectFormatterWithParams().setTypeFormatter(objectFormatter).setParams(params);
}
protected ObjectFormatter getObjectFormatter(Field field) {
Class type = field.getType();
if (List.class.isAssignableFrom(type) || Map.class.isAssignableFrom(type)) {
type = String.class;
}
if (field.isAnnotationPresent(Formatter.class)) {
Formatter formatter = field.getAnnotation(Formatter.class);
if (!formatter.formatter().equals(TypeFormatter.class)) {
TypeFormatter typeFormatter = typeFormatterFactory.getByFormatterClass(formatter.formatter());
if (typeFormatter != null) {
return prepareTypeFormatterParam(typeFormatter,formatter.value());
}
typeFormatterFactory.put(formatter.formatter());
return prepareTypeFormatterParam(typeFormatterFactory.getByFormatterClass(formatter.formatter()), formatter.value());
} else if (!formatter.subClazz().equals(String.class)) {
type = formatter.subClazz();
TypeFormatter typeFormatter = typeFormatterFactory.get(type);
if (typeFormatter == null) {
throw new IllegalArgumentException("No typeFormatter for class " + type);
}
return prepareTypeFormatterParam(typeFormatter, formatter.value());
}
}
return getTypeFormatterFactory().get(BasicTypeFormatter.detectBasicClass(type));
}
}
package us.codecraft.forger.property;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
/**
* @author code4crafter@gmail.com
*/
public class AnnotationPropertyLoader extends AbstractPropertyLoader {
@Override
public List<Property> getProperties(Class clazz) {
Field[] fields = clazz.getDeclaredFields();
List<Property> properties = new ArrayList<Property>(fields.length);
for (Field field : fields) {
Inject inject = field.getAnnotation(Inject.class);
if (inject != null) {
if (!field.isAccessible()) {
field.setAccessible(true);
}
Property property = Property.fromField(field);
if (inject.value().length() > 0) {
property.setName(inject.value());
}
property.setObjectFormatter(getObjectFormatter(field));
properties.add(property);
}
}
return properties;
}
}
package us.codecraft.webmagic.configurable;
package us.codecraft.forger.property;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.Target;
/**
* @author yihua.huang@dianping.com
* @author code4crafter@gmail.com
*/
@Retention(java.lang.annotation.RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD})
public @interface Inject {
String value() default "";
}
package us.codecraft.forger.property;
import us.codecraft.forger.property.format.ObjectFormatter;
import java.lang.reflect.Field;
/**
* @author code4crafter@gmail.com
*/
public class Property {
private String name;
private PropertyType type;
private Field field;
private ObjectFormatter objectFormatter;
public ObjectFormatter getObjectFormatter() {
return objectFormatter;
}
public Property setObjectFormatter(ObjectFormatter objectFormatter) {
this.objectFormatter = objectFormatter;
return this;
}
public String getName() {
return name;
}
public Property setName(String name) {
this.name = name;
return this;
}
public PropertyType getType() {
return type;
}
public Property setType(PropertyType type) {
this.type = type;
return this;
}
public Field getField() {
return field;
}
public Property setField(Field field) {
this.field = field;
return this;
}
public static Property fromField(Field field) {
return new Property().setName(field.getName()).setType(PropertyType.from(field.getType())).setField(field);
}
}
package us.codecraft.forger.property;
import java.util.List;
import java.util.Map;
/**
* @author code4crafter@gmail.com
*/
public interface PropertyLoader {
public <T> T load(T object, Map<String, Object> propertyConfigs);
public List<Property> getProperties(Class clazz);
}
package us.codecraft.forger.property;
import java.util.List;
import java.util.Map;
/**
* @author code4crafter@gmail.com
*/
public enum PropertyType {
PropertyString,PropertyMap,PropertyList;
public static PropertyType from(Class clazz){
if (Map.class.isAssignableFrom(clazz)){
return PropertyMap;
}
if (List.class.isAssignableFrom(clazz)){
return PropertyList;
}
return PropertyString;
}
}
package us.codecraft.forger.property;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;
/**
* @author code4crafter@gmail.com
*/
public class SimpleFieldPropertyLoader extends AbstractPropertyLoader {
@Override
public List<Property> getProperties(Class clazz) {
Field[] fields = clazz.getDeclaredFields();
List<Property> properties = new ArrayList<Property>(fields.length);
for (Field field : fields) {
if (Modifier.isStatic(field.getModifiers())){
continue;
}
if (!field.isAccessible()){
field.setAccessible(true);
}
properties.add(Property.fromField(field).setObjectFormatter(getObjectFormatter(field)));
}
return properties;
}
}
package us.codecraft.forger.property.format;
import java.util.Arrays;
import java.util.List;
/**
* @author code4crafter@gmail.com
* @since 0.3.2
*/
public abstract class BasicTypeFormatter<T> implements TypeFormatter<T> {
@Override
public T format(String text) {
if (text == null) {
return null;
}
text = text.trim();
return formatTrimmed(text);
}
@Override
public T format(String text, String[] params) {
return format(text);
}
protected abstract T formatTrimmed(String raw);
public static final List<Class<? extends TypeFormatter>> basicTypeFormatters = Arrays.<Class<? extends TypeFormatter>>asList(IntegerFormatter.class,
LongFormatter.class, DoubleFormatter.class, FloatFormatter.class, ShortFormatter.class,
CharactorFormatter.class, ByteFormatter.class, BooleanFormatter.class, DateFormatter.class, StringFormatter.class);
public static Class<?> detectBasicClass(Class<?> type) {
if (type.equals(Integer.TYPE) || type.equals(Integer.class)) {
return Integer.class;
} else if (type.equals(Long.TYPE) || type.equals(Long.class)) {
return Long.class;
} else if (type.equals(Double.TYPE) || type.equals(Double.class)) {
return Double.class;
} else if (type.equals(Float.TYPE) || type.equals(Float.class)) {
return Float.class;
} else if (type.equals(Short.TYPE) || type.equals(Short.class)) {
return Short.class;
} else if (type.equals(Character.TYPE) || type.equals(Character.class)) {
return Character.class;
} else if (type.equals(Byte.TYPE) || type.equals(Byte.class)) {
return Byte.class;
} else if (type.equals(Boolean.TYPE) || type.equals(Boolean.class)) {
return Boolean.class;
}
return type;
}
public static class IntegerFormatter extends BasicTypeFormatter<Integer> {
@Override
public Integer formatTrimmed(String raw) {
return Integer.parseInt(raw);
}
@Override
public Class<Integer> clazz() {
return Integer.class;
}
}
public static class LongFormatter extends BasicTypeFormatter<Long> {
@Override
public Long formatTrimmed(String raw) {
return Long.parseLong(raw);
}
@Override
public Class<Long> clazz() {
return Long.class;
}
}
public static class DoubleFormatter extends BasicTypeFormatter<Double> {
@Override
public Double formatTrimmed(String raw) {
return Double.parseDouble(raw);
}
@Override
public Class<Double> clazz() {
return Double.class;
}
}
public static class FloatFormatter extends BasicTypeFormatter<Float> {
@Override
public Float formatTrimmed(String raw) {
return Float.parseFloat(raw);
}
@Override
public Class<Float> clazz() {
return Float.class;
}
}
public static class ShortFormatter extends BasicTypeFormatter<Short> {
@Override
public Short formatTrimmed(String raw) {
return Short.parseShort(raw);
}
@Override
public Class<Short> clazz() {
return Short.class;
}
}
public static class CharactorFormatter extends BasicTypeFormatter<Character> {
@Override
public Character formatTrimmed(String raw) {
return raw.charAt(0);
}
@Override
public Class<Character> clazz() {
return Character.class;
}
}
public static class ByteFormatter extends BasicTypeFormatter<Byte> {
@Override
public Byte formatTrimmed(String raw) {
return Byte.parseByte(raw, 10);
}
@Override
public Class<Byte> clazz() {
return Byte.class;
}
}
public static class BooleanFormatter extends BasicTypeFormatter<Boolean> {
@Override
public Boolean formatTrimmed(String raw) {
return Boolean.parseBoolean(raw);
}
@Override
public Class<Boolean> clazz() {
return Boolean.class;
}
}
public static class StringFormatter implements TypeFormatter<String> {
@Override
public String format(String text) {
return text;
}
@Override
public String format(String text, String[] params) {
return format(text);
}
@Override
public Class<String> clazz() {
return String.class;
}
}
}
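As a rough illustration of what BasicTypeFormatter does (map a primitive type to its wrapper, trim the text, then parse), here is a table-driven sketch. `BasicFormatSketch` and its methods are hypothetical and cover only a few types; the lookup table is a swapped-in alternative to the if/else chain above, not the code from this commit.

```java
import java.util.HashMap;
import java.util.Map;

class BasicFormatSketch {

    // Primitive types mapped to their wrapper classes (assumed equivalent to the if/else chain).
    private static final Map<Class<?>, Class<?>> WRAPPERS = new HashMap<Class<?>, Class<?>>();

    static {
        WRAPPERS.put(int.class, Integer.class);
        WRAPPERS.put(long.class, Long.class);
        WRAPPERS.put(double.class, Double.class);
        WRAPPERS.put(float.class, Float.class);
        WRAPPERS.put(short.class, Short.class);
        WRAPPERS.put(char.class, Character.class);
        WRAPPERS.put(byte.class, Byte.class);
        WRAPPERS.put(boolean.class, Boolean.class);
    }

    // Returns the wrapper class for a primitive type, or the type itself otherwise.
    static Class<?> detectBasicClass(Class<?> type) {
        Class<?> wrapper = WRAPPERS.get(type);
        return wrapper != null ? wrapper : type;
    }

    // Converts text to the requested type; null input yields null, as in format(String).
    static Object format(Class<?> type, String text) {
        if (text == null) {
            return null;
        }
        Class<?> basic = detectBasicClass(type);
        if (basic == String.class) {
            return text; // strings are returned untrimmed, like StringFormatter
        }
        String raw = text.trim();
        if (basic == Integer.class) {
            return Integer.valueOf(raw);
        }
        if (basic == Long.class) {
            return Long.valueOf(raw);
        }
        if (basic == Double.class) {
            return Double.valueOf(raw);
        }
        if (basic == Boolean.class) {
            return Boolean.valueOf(raw);
        }
        throw new IllegalArgumentException("No formatter for class " + type);
    }

    public static void main(String[] args) {
        System.out.println(format(int.class, " 42 "));
        System.out.println(format(boolean.class, "true"));
    }
}
```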
package us.codecraft.forger.property.format;
import org.apache.commons.lang3.time.DateUtils;
import java.text.ParseException;
import java.util.Date;
/**
* @author code4crafter@gmail.com
* @since 0.3.2
*/
public class DateFormatter implements TypeFormatter<Date> {
public static final String[] DEFAULT_PATTERN = new String[]{"yyyy-MM-dd HH:mm"};
@Override
public Date format(String text) {
return format(text,DEFAULT_PATTERN);
}
@Override
public Date format(String text, String[] params) {
try {
return DateUtils.parseDate(text, params);
} catch (ParseException e) {
throw new IllegalArgumentException(e);
}
}
@Override
public Class<Date> clazz() {
return Date.class;
}
}
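DateFormatter delegates to commons-lang3 `DateUtils.parseDate`, which tries each pattern in turn. The sketch below shows the same idea using only the JDK, so `java.text.SimpleDateFormat` is a swapped-in substitute and the behaviour may differ from commons-lang3 in edge cases. `DateParseSketch` and `parse` are hypothetical names.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;

class DateParseSketch {

    // Same default pattern as DateFormatter.DEFAULT_PATTERN.
    static final String[] DEFAULT_PATTERNS = {"yyyy-MM-dd HH:mm"};

    // Tries each pattern in order and returns the first result that consumes the whole input.
    static Date parse(String text, String... patterns) {
        if (text == null || patterns == null) {
            throw new IllegalArgumentException("text and patterns must not be null");
        }
        for (String pattern : patterns) {
            SimpleDateFormat format = new SimpleDateFormat(pattern);
            ParsePosition position = new ParsePosition(0);
            Date date = format.parse(text, position);
            if (date != null && position.getIndex() == text.length()) {
                return date;
            }
        }
        // Unchecked exception, in the spirit of DateFormatter wrapping ParseException.
        throw new IllegalArgumentException("Unable to parse date: " + text);
    }

    public static void main(String[] args) {
        System.out.println(parse("2014-03-01 12:30", DEFAULT_PATTERNS));
        System.out.println(parse("2014/03/01", "yyyy-MM-dd HH:mm", "yyyy/MM/dd"));
    }
}
```

In the forger code the pattern array would come from the `value()` of a `@Formatter` annotation on a Date field.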
package us.codecraft.forger.property.format;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.Target;
/**
* Defines how the result string is converted to an object for the field.
*
* @author code4crafter@gmail.com <br>
* @since 0.3.2
*/
@Retention(java.lang.annotation.RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD})
public @interface Formatter {
/**
* Set formatter params.
*
* @return formatter params
*/
String[] value();
/**
* Specifies the class of the field, or the class of the elements when the field is a collection. <br/>
* It usually does not need to be set, because the class can be detected from the field's type,
* unless the field is a collection. <br/>
*
* @return the class of field
*/
Class subClazz() default String.class;
/**
* If there is more than one formatter for a class, specify the implementation to use.
* @return the formatter implementation
*/
Class<? extends TypeFormatter> formatter() default TypeFormatter.class;
}
package us.codecraft.forger.property.format;
/**
* @author code4crafter@gmail.com
*/
public interface ObjectFormatter<T> {
T format(String text);
}
package us.codecraft.forger.property.format;
/**
* @author code4crafter@gmail.com
*/
public class ObjectFormatterWithParams<T> implements ObjectFormatter<T> {
private TypeFormatter<T> typeFormatter;
private String[] params;
public TypeFormatter<T> getTypeFormatter() {
return typeFormatter;
}
public ObjectFormatterWithParams<T> setTypeFormatter(TypeFormatter typeFormatter) {
this.typeFormatter = typeFormatter;
return this;
}
public String[] getParams() {
return params;
}
public ObjectFormatterWithParams<T> setParams(String[] params) {
this.params = params;
return this;
}
@Override
public T format(String text) {
return typeFormatter.format(text, params);
}
}
package us.codecraft.forger.property.format;
/**
* @author code4crafter@gmail.com
*/
public interface TypeFormatter<T> extends ObjectFormatter<T> {
T format(String text, String[] params);
Class<T> clazz();
}
package us.codecraft.forger.property.format;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
/**
* @author code4crafter@gmail.com
* @since 0.3.2
*/
public class TypeFormatterFactory {
private Logger logger = LoggerFactory.getLogger(getClass());
private Map<Class, TypeFormatter> objectFormatterMapWithPropertyAsKey = new ConcurrentHashMap<Class, TypeFormatter>();
private Map<Class, TypeFormatter> objectFormatterMapWithClassAsKey = new ConcurrentHashMap<Class, TypeFormatter>();
public TypeFormatterFactory() {
initFormatterMap();
}
private void initFormatterMap() {
for (Class<? extends TypeFormatter> basicTypeFormatter : BasicTypeFormatter.basicTypeFormatters) {
put(basicTypeFormatter);
}
put(DateFormatter.class);
}
public synchronized void put(Class<? extends TypeFormatter> objectFormatterClazz) {
try {
TypeFormatter typeFormatter = objectFormatterClazz.newInstance();
if (typeFormatter.clazz() != null) {
objectFormatterMapWithPropertyAsKey.put(typeFormatter.clazz(), typeFormatter);
}
objectFormatterMapWithClassAsKey.put(objectFormatterClazz, typeFormatter);
} catch (InstantiationException e) {
logger.error("Init objectFormatter error", e);
} catch (IllegalAccessException e) {
logger.error("Init objectFormatter error", e);
}
}
public TypeFormatter get(Class<?> clazz) {
return objectFormatterMapWithPropertyAsKey.get(clazz);
}
public TypeFormatter getByFormatterClass(Class<?> clazz) {
return objectFormatterMapWithClassAsKey.get(clazz);
}
}
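TypeFormatterFactory keeps two indexes over the same formatter instances: one keyed by the class a formatter produces, one keyed by the formatter class itself. The sketch below shows why both are useful when two formatters target the same class. All names (`FormatterRegistrySketch`, `Parser`, `IntParser`, `HexIntParser`) are hypothetical, and unlike the original it throws on instantiation failure instead of logging.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class FormatterRegistrySketch {

    // Minimal stand-in for TypeFormatter: converts text and reports its target class.
    interface Parser<T> {
        T parse(String text);
        Class<T> target();
    }

    static class IntParser implements Parser<Integer> {
        public Integer parse(String text) { return Integer.valueOf(text.trim()); }
        public Class<Integer> target() { return Integer.class; }
    }

    // A second, hypothetical parser for the same target class.
    static class HexIntParser implements Parser<Integer> {
        public Integer parse(String text) { return Integer.valueOf(text.trim(), 16); }
        public Class<Integer> target() { return Integer.class; }
    }

    private final Map<Class<?>, Parser<?>> byTarget = new ConcurrentHashMap<Class<?>, Parser<?>>();
    private final Map<Class<?>, Parser<?>> byImplementation = new ConcurrentHashMap<Class<?>, Parser<?>>();

    // Instantiates the parser once and indexes it under both keys.
    synchronized void register(Class<? extends Parser<?>> parserClass) {
        try {
            Parser<?> parser = parserClass.getDeclaredConstructor().newInstance();
            byTarget.put(parser.target(), parser); // the last registration wins per target class
            byImplementation.put(parserClass, parser);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + parserClass, e);
        }
    }

    // Default parser for a target class, or null if none is registered.
    Parser<?> get(Class<?> target) {
        return byTarget.get(target);
    }

    // Explicitly chosen parser implementation, or null if it was never registered.
    Parser<?> getByImplementation(Class<?> parserClass) {
        return byImplementation.get(parserClass);
    }

    public static void main(String[] args) {
        FormatterRegistrySketch registry = new FormatterRegistrySketch();
        registry.register(IntParser.class);
        registry.register(HexIntParser.class);
        System.out.println(registry.get(Integer.class).getClass().getSimpleName());
        System.out.println(registry.getByImplementation(IntParser.class).parse("10"));
    }
}
```

The second index corresponds to the `formatter()` attribute of `@Formatter`, which selects a specific implementation regardless of which one is the default for the field's type.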
package us.codecraft.forger;
import us.codecraft.forger.property.Inject;
import us.codecraft.forger.property.format.Formatter;
import java.util.List;
import java.util.Map;
/**
* @author code4crafter@gmail.com
*/
public class Bar {
@Inject("bar")
private String bar;
@Inject
private List<String> values;
@Formatter(value = "", subClazz = Integer.class)
@Inject
private Map<String, Integer> idMap;
public String getBar() {
return bar;
}
public void setBar(String bar) {
this.bar = bar;
}
public List<String> getValues() {
return values;
}
public void setValues(List<String> values) {
this.values = values;
}
public Map<String, Integer> getIdMap() {
return idMap;
}
public void setIdMap(Map<String, Integer> idMap) {
this.idMap = idMap;
}
}
package us.codecraft.forger;
import us.codecraft.forger.property.Inject;
import us.codecraft.forger.property.format.Formatter;
/**
* @author code4crafter@gmail.com
*/
public class Foo implements Fooable{
@Formatter("")
@Inject("fooa")
private String foo;
public static final String SOURCE_CODE="import us.codecraft.forger.*;\n" +
"import us.codecraft.forger.property.Inject;\n" +
"import us.codecraft.forger.property.Inject;\n" +
"import us.codecraft.forger.property.format.Formatter;\n" +
"\n" +
"/**\n" +
" * @author code4crafter@gmail.com\n" +
" */\n" +
"public class Foo implements Fooable{\n" +
"\n" +
" @Formatter(\"\")\n" +
" @Inject(\"fooa\")\n" +
" private String foo;\n" +
"\n" +
" public String getFoo() {\n" +
" return foo;\n" +
" }\n" +
"\n" +
" @Override\n" +
" public String foo() {\n" +
" return foo;\n" +
" }\n" +
"}";
public String getFoo() {
return foo;
}
@Override
public String foo() {
return foo;
}
}
package us.codecraft.forger;
/**
* @author code4crafter@gmail.com
*/
public interface Fooable {
public String foo();
}
package us.codecraft.forger;
import com.google.common.collect.ImmutableMap;
import org.junit.Test;
import us.codecraft.forger.compiler.GroovyForgerCompiler;
import us.codecraft.forger.property.AnnotationPropertyLoader;
import us.codecraft.forger.property.SimpleFieldPropertyLoader;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import static org.assertj.core.api.Assertions.*;
/**
* @author code4crafter@gmail.com
*/
public class ForgerFactoryTest {
@Test
public void testForgerCreateByClassProperty() throws Exception {
ForgerFactory forgerFactory = new ForgerFactory(new SimpleFieldPropertyLoader(), null);
Forger<Foo> forger = forgerFactory.<Foo>create(Foo.class);
Foo foo = forger.forge(ImmutableMap.<String, Object>of("foo", "test"));
assertThat(foo.getFoo()).isEqualTo("test");
}
@Test
public void testForgerCreateByClassAnnotation() throws Exception {
ForgerFactory forgerFactory = new ForgerFactory(new AnnotationPropertyLoader(), null);
Forger<Foo> forger = forgerFactory.<Foo>create(Foo.class);
Foo foo = forger.forge(ImmutableMap.<String, Object>of("fooa", "test"));
assertThat(foo.getFoo()).isEqualTo("test");
}
@Test
public void testForgerCreateByClassAnnotationCompile() throws Exception {
ForgerFactory forgerFactory = new ForgerFactory(new AnnotationPropertyLoader(), new GroovyForgerCompiler());
Forger<Fooable> forger = forgerFactory.<Fooable>compile(Foo.SOURCE_CODE);
Fooable foo = forger.forge(ImmutableMap.<String, Object>of("fooa", "test"));
Field field = forger.getClazz().getDeclaredField("foo");
field.setAccessible(true);
assertThat(field.get(foo)).isEqualTo("test");
assertThat(foo.foo()).isEqualTo("test");
}
@Test
public void testForgerCreateByClassAnnotationWithCollections() throws Exception {
ForgerFactory forgerFactory = new ForgerFactory(new AnnotationPropertyLoader(), null);
Forger<Bar> forger = forgerFactory.<Bar>create(Bar.class);
Map<String, Object> map = new HashMap<String, Object>();
map.put("bar", "bar");
Map<String, String> submap = new HashMap<String, String>();
submap.put("1", "1");
submap.put("2", "2");
map.put("idMap", submap);
List<String> sublist = new ArrayList<String>();
sublist.add("test");
map.put("values", sublist);
Bar forge = forger.forge(map);
assertThat(forge.getValues()).isNotEmpty();
assertThat(forge.getIdMap().get("1")).isEqualTo(1);
}
}
package us.codecraft.forger.compiler;
import org.junit.Test;
import us.codecraft.forger.Foo;
import static org.assertj.core.api.Assertions.assertThat;
/**
* @author code4crafter@gmail.com
*/
public class GroovyForgerCompilerTest {
@Test
public void testGroovyClassLoader() throws Exception {
GroovyForgerCompiler groovyForgerCompiler = new GroovyForgerCompiler();
Class compiledClass = groovyForgerCompiler.compile(Foo.SOURCE_CODE);
assertThat(compiledClass.getName()).isEqualTo("Foo");
}
}
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
<appender name="stdout" class="org.apache.log4j.ConsoleAppender">
<layout class="org.apache.log4j.PatternLayout">
<param name="ConversionPattern" value="%d{yy-MM-dd HH:mm:ss,SSS} %-5p %c(%F:%L) ## %m%n" />
</layout>
</appender>
<logger name="org.springframework" additivity="false">
<level value="warn" />
<appender-ref ref="stdout" />
</logger>
<logger name="org.apache" additivity="false">
<level value="warn" />
<appender-ref ref="stdout" />
</logger>
<logger name="net.sf.ehcache" additivity="false">
<level value="warn" />
<appender-ref ref="stdout" />
</logger>
<root>
<level value="info" />
<appender-ref ref="stdout" />
</root>
</log4j:configuration>
......@@ -3,94 +3,130 @@
<parent>
<artifactId>webmagic-parent</artifactId>
<groupId>us.codecraft</groupId>
<version>0.4.3</version>
<version>0.5.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-avalon</artifactId>
<packaging>war</packaging>
<dependencies>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-scripts</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.mybatis</groupId>
<artifactId>mybatis</artifactId>
<version>3.1.1</version>
</dependency>
<dependency>
<groupId>org.mybatis</groupId>
<artifactId>mybatis-spring</artifactId>
<version>1.1.1</version>
</dependency>
<dependency>
<groupId>org.freemarker</groupId>
<artifactId>freemarker</artifactId>
<version>2.3.19</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-test</artifactId>
<version>${spring-version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-aop</artifactId>
<version>${spring-version}</version>
</dependency>
<dependency>
<groupId>org.aspectj</groupId>
<artifactId>aspectjrt</artifactId>
<version>1.7.2</version>
</dependency>
<dependency>
<groupId>org.aspectj</groupId>
<artifactId>aspectjweaver</artifactId>
<version>1.7.2</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-core</artifactId>
<version>${spring-version}</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-webmvc</artifactId>
<version>${spring-version}</version>
</dependency>
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>javax.servlet-api</artifactId>
<version>3.0.1</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context</artifactId>
<version>${spring-version}</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context-support</artifactId>
<version>${spring-version}</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.1.37</version>
</dependency>
</dependencies>
<packaging>pom</packaging>
<modules>
<module>forger</module>
<module>webmagic-admin</module>
<module>webmagic-worker</module>
<module>webmagic-avalon-common</module>
</modules>
<dependencyManagement>
<dependencies>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-scripts</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.mybatis</groupId>
<artifactId>mybatis</artifactId>
<version>3.1.1</version>
</dependency>
<dependency>
<groupId>org.mybatis</groupId>
<artifactId>mybatis-spring</artifactId>
<version>1.1.1</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>forger</artifactId>
<version>0.1.1-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>org.freemarker</groupId>
<artifactId>freemarker</artifactId>
<version>2.3.19</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-test</artifactId>
<version>${spring-version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.assertj</groupId>
<artifactId>assertj-core</artifactId>
<version>1.5.0</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.18</version>
</dependency>
<dependency>
<groupId>commons-dbcp</groupId>
<artifactId>commons-dbcp</artifactId>
<version>1.3</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-aop</artifactId>
<version>${spring-version}</version>
</dependency>
<dependency>
<groupId>org.aspectj</groupId>
<artifactId>aspectjrt</artifactId>
<version>1.7.2</version>
</dependency>
<dependency>
<groupId>org.aspectj</groupId>
<artifactId>aspectjweaver</artifactId>
<version>1.7.2</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-core</artifactId>
<version>${spring-version}</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-webmvc</artifactId>
<version>${spring-version}</version>
</dependency>
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>javax.servlet-api</artifactId>
<version>3.0.1</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context</artifactId>
<version>${spring-version}</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context-support</artifactId>
<version>${spring-version}</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.1.37</version>
</dependency>
</dependencies>
</dependencyManagement>
<build>
<plugins>
......@@ -104,4 +140,4 @@
</build>
</project>
\ No newline at end of file
</project>
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-3.0.xsd">
<!--<bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean">-->
<!--<property name="dataSource" ref="dataSource" />-->
<!--</bean>-->
<!--<bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">-->
<!--<property name="basePackage" value="us.codecraft.blackhole.suite.dao" />-->
<!--</bean>-->
<!--<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource"-->
<!--destroy-method="close">-->
<!--<property name="driverClassName" value="org.sqlite.JDBC" />-->
<!--<property name="url" value="jdbc:sqlite:/usr/local/hostd/zonesfile.db" />-->
<!--</bean>-->
</beans>
\ No newline at end of file
WebMagic-Admin
=====
Admin is the web console for controlling the workers.
\ No newline at end of file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>webmagic-parent</artifactId>
<artifactId>webmagic-avalon</artifactId>
<groupId>us.codecraft</groupId>
<version>0.4.3-SNAPSHOT</version>
<version>0.5.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-panel</artifactId>
<artifactId>webmagic-admin</artifactId>
<packaging>war</packaging>
<dependencies>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-scripts</artifactId>
<artifactId>webmagic-avalon-common</artifactId>
<version>${project.version}</version>
</dependency>
</dependencies>
<build>
......
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
<appender name="stdout" class="org.apache.log4j.ConsoleAppender">
<layout class="org.apache.log4j.PatternLayout">
<param name="ConversionPattern" value="%d{yy-MM-dd HH:mm:ss,SSS} %-5p %c(%F:%L) ## %m%n" />
</layout>
</appender>
<logger name="org.apache" additivity="false">
<level value="warn" />
<appender-ref ref="stdout" />
</logger>
<root>
<level value="info" />
<appender-ref ref="stdout" />
</root>
</log4j:configuration>
<html>
<head>
<script src=""></script>
</head>
<div class="url-box">
<input type="text" id="url-input">
</div>
<div class="content-show">
</div>
</html>
\ No newline at end of file
......@@ -15,8 +15,8 @@
<meta charset="utf-8">
<title>WebMagic Avalon</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Charisma, a fully featured, responsive, HTML5, Bootstrap admin template.">
<meta name="author" content="Muhammad Usman">
<meta name="description" content="WebMagic, a scalable crawler framework.">
<meta name="author" content="code4crafter@gmail.com">
<!-- The styles -->
<link id="bs-css" href="${staticPath}/css/bootstrap-cerulean.css" rel="stylesheet">
......@@ -123,23 +123,10 @@
<div class="well nav-collapse sidebar-nav">
<ul class="nav nav-tabs nav-stacked main-menu">
<li class="nav-header hidden-tablet">Main</li>
<li><a class="ajax-link" href="index.html"><i class="icon-home"></i><span class="hidden-tablet"> Dashboard</span></a></li>
<li><a class="ajax-link" href="ui.html"><i class="icon-eye-open"></i><span class="hidden-tablet">Spider</span></a></li>
<li><a class="ajax-link" href="form.html"><i class="icon-edit"></i><span class="hidden-tablet">Charts</span></a></li>
<li><a class="ajax-link" href="chart.html"><i class="icon-list-alt"></i><span class="hidden-tablet"> Charts</span></a></li>
<li><a class="ajax-link" href="typography.html"><i class="icon-font"></i><span class="hidden-tablet"> Typography</span></a></li>
<li><a class="ajax-link" href="gallery.html"><i class="icon-picture"></i><span class="hidden-tablet"> Gallery</span></a></li>
<li class="nav-header hidden-tablet">Sample Section</li>
<li><a class="ajax-link" href="table.html"><i class="icon-align-justify"></i><span class="hidden-tablet"> Tables</span></a></li>
<li><a class="ajax-link" href="calendar.html"><i class="icon-calendar"></i><span class="hidden-tablet"> Calendar</span></a></li>
<li><a class="ajax-link" href="grid.html"><i class="icon-th"></i><span class="hidden-tablet"> Grid</span></a></li>
<li><a class="ajax-link" href="file-manager.html"><i class="icon-folder-open"></i><span class="hidden-tablet"> File Manager</span></a></li>
<li><a href="tour.html"><i class="icon-globe"></i><span class="hidden-tablet"> Tour</span></a></li>
<li><a class="ajax-link" href="icon.html"><i class="icon-star"></i><span class="hidden-tablet"> Icons</span></a></li>
<li><a href="error.html"><i class="icon-ban-circle"></i><span class="hidden-tablet"> Error Page</span></a></li>
<li><a href="login.html"><i class="icon-lock"></i><span class="hidden-tablet"> Login Page</span></a></li>
<li><a class="ajax-link" href="dashboard"><i class="icon-home"></i><span class="hidden-tablet">Dashboard</span></a></li>
<li><a class="ajax-link" href="spider"><i class="icon-eye-open"></i><span class="hidden-tablet">Spider</span></a></li>
<li><a class="ajax-link" href="charts"><i class="icon-edit"></i><span class="hidden-tablet">Charts</span></a></li>
</ul>
<label id="for-is-ajax" class="hidden-tablet" for="is-ajax"><input id="is-ajax" type="checkbox"> Ajax on menu</label>
</div><!--/.well -->
</div><!--/span-->
<!-- left menu ends -->
......@@ -173,26 +160,6 @@
<span class="notification">6</span>
</a>
<a data-rel="tooltip" title="4 new pro members." class="well span3 top-block" href="#">
<span class="icon32 icon-color icon-star-on"></span>
<div>Pro Members</div>
<div>228</div>
<span class="notification green">4</span>
</a>
<a data-rel="tooltip" title="$34 new sales." class="well span3 top-block" href="#">
<span class="icon32 icon-color icon-cart"></span>
<div>Sales</div>
<div>$13320</div>
<span class="notification yellow">$34</span>
</a>
<a data-rel="tooltip" title="12 new messages." class="well span3 top-block" href="#">
<span class="icon32 icon-color icon-envelope-closed"></span>
<div>Messages</div>
<div>25</div>
<span class="notification red">12</span>
</a>
</div>
<div class="row-fluid">
......
......@@ -7,7 +7,7 @@
<context-param>
<param-name>contextConfigLocation</param-name>
<param-value>
classpath*:spring/applicationContext*.xml,
classpath*:/config/spring/applicationContext*.xml,
</param-value>
</context-param>
......@@ -33,7 +33,7 @@
<servlet-class>org.springframework.web.servlet.DispatcherServlet</servlet-class>
<init-param>
<param-name>contextConfigLocation</param-name>
<param-value>classpath:/spring/applicationContext*.xml</param-value>
<param-value>classpath*:config/spring/applicationContext*.xml</param-value>
</init-param>
<load-on-startup>1</load-on-startup>
</servlet>
......
@import url(https://fonts.googleapis.com/css?family=Karla|Ubuntu);
/*@import url(https://fonts.googleapis.com/css?family=Karla|Ubuntu);*/
/*!
* Bootstrap v2.0.4
*
......
@import url('https://fonts.googleapis.com/css?family=Droid+Sans:400,700');
/*@import url('https://fonts.googleapis.com/css?family=Droid+Sans:400,700');*/
/*!
* Bootstrap v2.0.4
*
......
@import url('https://fonts.googleapis.com/css?family=Open+Sans:400,700');
/*@import url('https://fonts.googleapis.com/css?family=Open+Sans:400,700');*/
/*!
* Bootstrap v2.0.4
*
......
@import url(https://fonts.googleapis.com/css?family=Karla|Ubuntu);
/*@import url(https://fonts.googleapis.com/css?family=Karla|Ubuntu);*/
/*!
* Bootstrap v2.0.4
*
......
@import url(https://fonts.googleapis.com/css?family=Ubuntu);
/*@import url(https://fonts.googleapis.com/css?family=Ubuntu);*/
/*!
* Bootstrap v2.0.4
*
......
@import url(https://fonts.googleapis.com/css?family=Shojumaru);
/*@import url(https://fonts.googleapis.com/css?family=Shojumaru);*/
select{
background-color:#fff;
......
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>webmagic-avalon</artifactId>
<groupId>us.codecraft</groupId>
<version>0.5.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>webmagic-avalon-common</artifactId>
<packaging>jar</packaging>
<dependencies>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.mybatis</groupId>
<artifactId>mybatis</artifactId>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>forger</artifactId>
<version>0.1.1-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>org.mybatis</groupId>
<artifactId>mybatis-spring</artifactId>
</dependency>
<dependency>
<groupId>org.freemarker</groupId>
<artifactId>freemarker</artifactId>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-test</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.assertj</groupId>
<artifactId>assertj-core</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
</dependency>
<dependency>
<groupId>commons-dbcp</groupId>
<artifactId>commons-dbcp</artifactId>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-aop</artifactId>
<version>${spring-version}</version>
</dependency>
<dependency>
<groupId>org.aspectj</groupId>
<artifactId>aspectjrt</artifactId>
</dependency>
<dependency>
<groupId>org.aspectj</groupId>
<artifactId>aspectjweaver</artifactId>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-core</artifactId>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-webmvc</artifactId>
</dependency>
<dependency>
<groupId>com.h2database</groupId>
<artifactId>h2</artifactId>
<version>1.3.175</version>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-all</artifactId>
</dependency>
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>javax.servlet-api</artifactId>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context</artifactId>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context-support</artifactId>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-deploy-plugin</artifactId>
<configuration>
<skip>true</skip>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<version>2.4</version>
<configuration>
<archive>
<manifest>
<addClasspath>true</addClasspath>
<classpathPrefix>./lib/</classpathPrefix>
<mainClass>us.codecraft.webmagic.main.QuickStarter</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
</plugins>
</build>
<repositories>
<repository>
<id>sonatype-nexus-snapshots</id>
<name>Sonatype Nexus Snapshots</name>
<url>https://oss.sonatype.org/content/repositories/snapshots</url>
<releases>
<enabled>false</enabled>
</releases>
<snapshots>
<enabled>true</enabled>
</snapshots>
</repository>
</repositories>
</project>
package us.codecraft.webmagic.dao;
import us.codecraft.webmagic.model.DynamicClass;
/**
* @author code4crafter@gmail.com
*/
public interface DynamicClassDao {
public int add(DynamicClass dynamicClass);
}
package us.codecraft.webmagic.exception;
/**
* @author code4crafter@gmail.com
*/
public class DynamicClassCompileException extends Exception{
public DynamicClassCompileException(String message) {
super(message);
}
public DynamicClassCompileException(String message, Throwable cause) {
super(message, cause);
}
}
package us.codecraft.webmagic.model;
import java.util.Date;
/**
* @author code4crafter@gmail.com
*/
public class DynamicClass {
private String className;
private String sourceCode;
private Date addTime;
private Date updateTime;
public String getClassName() {
return className;
}
public void setClassName(String className) {
this.className = className;
}
public String getSourceCode() {
return sourceCode;
}
public void setSourceCode(String sourceCode) {
this.sourceCode = sourceCode;
}
public Date getAddTime() {
return addTime;
}
public void setAddTime(Date addTime) {
this.addTime = addTime;
}
public Date getUpdateTime() {
return updateTime;
}
public void setUpdateTime(Date updateTime) {
this.updateTime = updateTime;
}
}
package us.codecraft.webmagic.service;
import us.codecraft.webmagic.exception.DynamicClassCompileException;
/**
* @author code4crafter@gmail.com
*/
public interface DynamicClassService {
public Class compileAndSave(String sourceCode) throws DynamicClassCompileException;
}
package us.codecraft.webmagic.service.impl;
import org.codehaus.groovy.control.CompilationFailedException;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import us.codecraft.forger.Forger;
import us.codecraft.forger.ForgerFactory;
import us.codecraft.webmagic.dao.DynamicClassDao;
import us.codecraft.webmagic.exception.DynamicClassCompileException;
import us.codecraft.webmagic.model.DynamicClass;
import us.codecraft.webmagic.service.DynamicClassService;
/**
* @author code4crafter@gmail.com
*/
@Service
public class DynamicClassServiceImpl implements DynamicClassService {
@Autowired
private DynamicClassDao dynamicClassDao;
@Autowired
private ForgerFactory forgerFactory;
@Override
public Class compileAndSave(String sourceCode) throws DynamicClassCompileException {
Forger<Object> forger;
try {
forger = forgerFactory.compile(sourceCode);
} catch (CompilationFailedException e) {
throw new DynamicClassCompileException(e.getMessage(),e);
}
String className = forger.getClazz().getCanonicalName();
DynamicClass dynamicClass = new DynamicClass();
dynamicClass.setClassName(className);
dynamicClass.setSourceCode(sourceCode);
dynamicClassDao.add(dynamicClass);
return forger.getClazz();
}
}
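The service above follows a wrap-and-persist pattern: an unchecked compiler failure is translated into a checked domain exception, and the source is stored only after compilation succeeds. The following is a minimal stdlib-only sketch of that flow, not the commit's implementation. `CompileFailure`, `SketchCompileException`, `SourceCompiler`, `InMemoryStore`, and the toy compiler are all hypothetical stand-ins (for Groovy's `CompilationFailedException`, `DynamicClassCompileException`, `ForgerFactory`, and `DynamicClassDao` respectively).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical unchecked failure, standing in for the compiler's own exception type. */
class CompileFailure extends RuntimeException {
    CompileFailure(String message) { super(message); }
}

/** Checked domain exception, playing the role of DynamicClassCompileException. */
class SketchCompileException extends Exception {
    SketchCompileException(String message, Throwable cause) { super(message, cause); }
}

/** Hypothetical compiler: returns the declared class name, or throws CompileFailure. */
interface SourceCompiler {
    String compile(String sourceCode);
}

/** Hypothetical in-memory store keyed by class name, standing in for the DAO. */
class InMemoryStore {
    final Map<String, String> rows = new LinkedHashMap<>();
    void add(String className, String sourceCode) { rows.put(className, sourceCode); }
}

class CompileAndSaveSketch {
    private final SourceCompiler compiler;
    private final InMemoryStore store;

    CompileAndSaveSketch(SourceCompiler compiler, InMemoryStore store) {
        this.compiler = compiler;
        this.store = store;
    }

    /** Compile first; persist only if compilation succeeded. */
    String compileAndSave(String sourceCode) throws SketchCompileException {
        String className;
        try {
            className = compiler.compile(sourceCode);
        } catch (CompileFailure e) {
            // Translate the unchecked failure into the checked domain exception.
            throw new SketchCompileException(e.getMessage(), e);
        }
        store.add(className, sourceCode);
        return className;
    }

    /** Toy compiler for the demo: only understands sources shaped like "class Name { ... }". */
    static SourceCompiler toyCompiler() {
        return source -> {
            String trimmed = source.trim();
            int brace = trimmed.indexOf('{');
            if (!trimmed.startsWith("class ") || brace < 0) {
                throw new CompileFailure("cannot parse: " + source);
            }
            return trimmed.substring("class ".length(), brace).trim();
        };
    }
}
```

Because the store is touched only after `compile` returns, a failed compilation leaves no partial row behind, which is the property the `testCompileFail` case relies on.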
number_format=#
classic_compatible=true
default_encoding=UTF-8
template_update_delay=0
#########################
template_exception_handler=rethrow
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
<appender name="stdout" class="org.apache.log4j.ConsoleAppender">
<layout class="org.apache.log4j.PatternLayout">
<param name="ConversionPattern" value="%d{yy-MM-dd HH:mm:ss,SSS} %-5p %c(%F:%L) ## %m%n" />
</layout>
</appender>
<logger name="org.apache" additivity="false">
<level value="warn" />
<appender-ref ref="stdout" />
</logger>
<root>
<level value="info" />
<appender-ref ref="stdout" />
</root>
</log4j:configuration>
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright 2010-2013 the original author or authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN"
"http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="us.codecraft.webmagic.dao.DynamicClassDao">
<insert id="add" parameterType="us.codecraft.webmagic.model.DynamicClass" databaseId="mysql">
insert into DynamicClass (`ClassName`,`SourceCode`,`AddTime`,`UpdateTime`)
values (#{className},#{sourceCode},now(),now())
</insert>
<insert id="add" parameterType="us.codecraft.webmagic.model.DynamicClass" databaseId="h2">
insert into DynamicClass (`ClassName`,`SourceCode`,`AddTime`,`UpdateTime`)
values (#{className},#{sourceCode},now(),now())
</insert>
</mapper>
\ No newline at end of file
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:mvc="http://www.springframework.org/schema/mvc" xmlns:tx="http://www.springframework.org/schema/tx"
xsi:schemaLocation="http://www.springframework.org/schema/mvc
http://www.springframework.org/schema/mvc/spring-mvc-4.0.xsd
http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-4.0.xsd
http://www.springframework.org/schema/context
http://www.springframework.org/schema/context/spring-context-4.0.xsd">
<context:annotation-config/>
<bean id="messageSource" class="org.springframework.context.support.ResourceBundleMessageSource">
<property name="basenames">
<list>
<value>web_messages</value>
</list>
</property>
</bean>
<context:component-scan base-package="us.codecraft.webmagic"/>
</beans>
\ No newline at end of file
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:jdbc="http://www.springframework.org/schema/jdbc"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-4.0.xsd http://www.springframework.org/schema/jdbc http://www.springframework.org/schema/jdbc/spring-jdbc.xsd">
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource"
destroy-method="close">
<property name="driverClassName" value="com.mysql.jdbc.Driver"/>
<property name="url" value="jdbc:mysql://127.0.0.1:3306/WebMagic?characterEncoding=UTF-8"/>
<property name="username" value="webmagic"/>
<property name="password" value="webmagic"/>
</bean>
<beans profile="test">
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource"
destroy-method="close">
<property name="driverClassName" value="org.h2.Driver"/>
<property name="url" value="jdbc:h2:mem:WebMagic;DB_CLOSE_DELAY=-1"/>
</bean>
<!--Refer to https://github.com/springside/springside4/wiki/H2-Database -->
<jdbc:initialize-database data-source="dataSource" ignore-failures="ALL">
<jdbc:script location="classpath:sql/h2/schema.sql" />
<!--<jdbc:script location="classpath:data/h2/import-data.sql" encoding="UTF-8"/>-->
</jdbc:initialize-database>
</beans>
<beans profile="standalone">
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource"
destroy-method="close">
<property name="driverClassName" value="org.h2.Driver"/>
<property name="url" value="jdbc:h2:file:~/.h2/WebMagic;AUTO_SERVER=TRUE"/>
</bean>
<!--Refer to https://github.com/springside/springside4/wiki/H2-Database -->
<jdbc:initialize-database data-source="dataSource" ignore-failures="ALL">
<jdbc:script location="classpath:sql/h2/schema.sql" />
<!--<jdbc:script location="classpath:data/h2/import-data.sql" encoding="UTF-8"/>-->
</jdbc:initialize-database>
</beans>
</beans>
\ No newline at end of file
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-3.0.xsd">
<bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">
<property name="basePackage" value="us.codecraft.webmagic.dao" />
</bean>
<bean id="vendorProperties" class="org.springframework.beans.factory.config.PropertiesFactoryBean">
<property name="properties">
<props>
<prop key="SQL Server">sqlserver</prop>
<prop key="DB2">db2</prop>
<prop key="Oracle">oracle</prop>
<prop key="MySQL">mysql</prop>
<prop key="H2">h2</prop>
</props>
</property>
</bean>
<bean id="databaseIdProvider" class="org.apache.ibatis.mapping.VendorDatabaseIdProvider">
<property name="properties" ref="vendorProperties"/>
</bean>
<bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean">
<property name="dataSource" ref="dataSource" />
<property name="databaseIdProvider" ref="databaseIdProvider" />
<property name="mapperLocations" value="classpath*:/config/mapper/**/*.xml" />
</bean>
</beans>
\ No newline at end of file
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:mvc="http://www.springframework.org/schema/mvc" xmlns:tx="http://www.springframework.org/schema/tx"
xsi:schemaLocation="http://www.springframework.org/schema/mvc
http://www.springframework.org/schema/mvc/spring-mvc-4.0.xsd
http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-4.0.xsd
http://www.springframework.org/schema/context
http://www.springframework.org/schema/context/spring-context-4.0.xsd">
<bean id="forgerFactory" class="us.codecraft.forger.ForgerFactory">
<constructor-arg>
<bean class="us.codecraft.forger.property.AnnotationPropertyLoader"></bean>
</constructor-arg>
<constructor-arg>
<bean class="us.codecraft.forger.compiler.GroovyForgerCompiler"></bean>
</constructor-arg>
</bean>
</beans>
\ No newline at end of file
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:mvc="http://www.springframework.org/schema/mvc" xmlns:tx="http://www.springframework.org/schema/tx"
xsi:schemaLocation="http://www.springframework.org/schema/mvc
http://www.springframework.org/schema/mvc/spring-mvc-4.0.xsd
http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-4.0.xsd http://www.springframework.org/schema/tx http://www.springframework.org/schema/tx/spring-tx.xsd">
<bean id="transactionManager" class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
<constructor-arg ref="dataSource">
</constructor-arg>
</bean>
<tx:annotation-driven/>
</beans>
\ No newline at end of file
......@@ -2,29 +2,19 @@
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:mvc="http://www.springframework.org/schema/mvc"
xmlns:mvc="http://www.springframework.org/schema/mvc" xmlns:tx="http://www.springframework.org/schema/tx"
xsi:schemaLocation="http://www.springframework.org/schema/mvc
http://www.springframework.org/schema/mvc/spring-mvc-4.0.xsd
http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-4.0.xsd
http://www.springframework.org/schema/context
http://www.springframework.org/schema/context/spring-context-4.0.xsd">
<context:annotation-config/>
<bean id="messageSource" class="org.springframework.context.support.ResourceBundleMessageSource">
<property name="basenames">
<list>
<value>web_messages</value>
</list>
</property>
</bean>
<context:component-scan base-package="us.codecraft.webmagic.avalon"/>
<bean class="org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter">
<property name="messageConverters">
<list>
<bean id="fastJsonHttpMessageConverter" class="com.alibaba.fastjson.support.spring.FastJsonHttpMessageConverter">
<bean id="fastJsonHttpMessageConverter"
class="com.alibaba.fastjson.support.spring.FastJsonHttpMessageConverter">
<property name="supportedMediaTypes">
<list>
<value>text/html;charset=UTF-8</value>
......@@ -38,10 +28,6 @@
<mvc:resources mapping="/static/**" location="/static/" />
<mvc:annotation-driven>
</mvc:annotation-driven>
<mvc:annotation-driven/>
</beans>
\ No newline at end of file
CREATE TABLE DynamicClass(
Id int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
`ClassName` varchar(200) NOT NULL,
`SourceCode` text NOT NULL,
`AddTime` datetime NOT NULL,
`UpdateTime` datetime NOT NULL,
UNIQUE INDEX `un_class_name` (`ClassName`)
);
\ No newline at end of file
CREATE TABLE `DynamicClass` (
`Id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`ClassName` varchar(200) NOT NULL,
`SourceCode` text NOT NULL,
`AddTime` datetime NOT NULL,
`UpdateTime` datetime NOT NULL,
PRIMARY KEY (`Id`),
UNIQUE KEY `un_class_name` (`ClassName`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `Spider` (
`Id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`PageProcessorId` int(11) unsigned NOT NULL,
`PipelineId` int(11) unsigned NOT NULL,
`SchedulerId` int(11) unsigned NOT NULL,
`Config` text NOT NULL,
`AddTime` datetime NOT NULL,
`UpdateTime` datetime NOT NULL,
PRIMARY KEY (`Id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `PageProcessor` (
`Id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`ClassName` varchar(200) NOT NULL,
`Params` text NOT NULL,
`AddTime` datetime NOT NULL,
`UpdateTime` datetime NOT NULL,
PRIMARY KEY (`Id`),
UNIQUE KEY `un_class_name` (`ClassName`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
\ No newline at end of file
package us.codecraft.webmagic;
import org.junit.runner.RunWith;
import org.springframework.test.context.ActiveProfiles;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import org.springframework.transaction.annotation.Transactional;
/**
* @author code4crafter@gmail.com
*/
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = {"classpath*:/config/spring/applicationContext*.xml"})
@ActiveProfiles("test")
@Transactional
public abstract class AbstractTest {
}
package us.codecraft.webmagic;
import us.codecraft.forger.property.Inject;
import us.codecraft.forger.property.format.Formatter;
/**
* @author code4crafter@gmail.com
*/
public class Foo {
@Formatter("")
@Inject("fooa")
private String foo;
public static final String SOURCE_CODE="package us.codecraft.webmagic;\n" +
"\n" +
"import us.codecraft.forger.property.Inject;\n" +
"import us.codecraft.forger.property.format.Formatter;\n" +
"\n" +
"/**\n" +
" * @author code4crafter@gmail.com\n" +
" */\n" +
"public class Foo {\n" +
"\n" +
" @Formatter(\"\")\n" +
" @Inject(\"fooa\")\n" +
" private String foo;\n" +
"\n" +
" public String getFoo() {\n" +
" return foo;\n" +
" }\n" +
"\n" +
" public String foo() {\n" +
" return foo;\n" +
" }\n" +
"}";
public String getFoo() {
return foo;
}
public String foo() {
return foo;
}
}
package us.codecraft.webmagic.dao;
import org.junit.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.annotation.Rollback;
import org.springframework.transaction.annotation.Transactional;
import us.codecraft.webmagic.AbstractTest;
import us.codecraft.webmagic.model.DynamicClass;
/**
* @author code4crafter@gmail.com
*/
public class DynamicClassDaoTest extends AbstractTest {
@Autowired
private DynamicClassDao dynamicClassDao;
@Test
@Transactional
@Rollback(true)
public void testAdd() throws Exception {
DynamicClass dynamicClass = new DynamicClass();
dynamicClass.setClassName("test");
dynamicClass.setSourceCode("testSource");
dynamicClassDao.add(dynamicClass);
}
}
package us.codecraft.webmagic.service;
import org.junit.Before;
import org.junit.Test;
import org.mockito.InjectMocks;
import org.mockito.Mock;
import org.mockito.MockitoAnnotations;
import org.mockito.Spy;
import org.springframework.beans.factory.annotation.Autowired;
import us.codecraft.forger.ForgerFactory;
import us.codecraft.webmagic.AbstractTest;
import us.codecraft.webmagic.Foo;
import us.codecraft.webmagic.dao.DynamicClassDao;
import us.codecraft.webmagic.exception.DynamicClassCompileException;
import us.codecraft.webmagic.service.impl.DynamicClassServiceImpl;
import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.failBecauseExceptionWasNotThrown;
/**
* @author code4crafter@gmail.com
*/
public class DynamicClassServiceImplTest extends AbstractTest {
@Before
public void setUp() {
MockitoAnnotations.initMocks(this);
}
@Spy
@Autowired
private ForgerFactory forgerFactory;
@InjectMocks
private DynamicClassService dynamicClassService = new DynamicClassServiceImpl();
@Mock
private DynamicClassDao dynamicClassDao;
@Test
public void testCompileAndSave() throws Exception {
Class aClass = dynamicClassService.compileAndSave(Foo.SOURCE_CODE);
assertThat(aClass.getCanonicalName()).isEqualTo("us.codecraft.webmagic.Foo");
}
@Test
public void testCompileFail() {
try {
dynamicClassService.compileAndSave("class s((");
failBecauseExceptionWasNotThrown(DynamicClassCompileException.class);
} catch (DynamicClassCompileException e) {
// expected: invalid source must fail to compile
}
}
}
WebMagic-Worker
=====
Worker is the spider container.
\ No newline at end of file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>webmagic-parent</artifactId>
<artifactId>webmagic-avalon</artifactId>
<groupId>us.codecraft</groupId>
<version>0.4.3</version>
<version>0.5.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>webmagic-lucene</artifactId>
<artifactId>webmagic-worker</artifactId>
<packaging>war</packaging>
<dependencies>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers-common</artifactId>
<version>4.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>4.4.0</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<artifactId>webmagic-avalon-common</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-all</artifactId>
</dependency>
<dependency>
<groupId>org.aspectj</groupId>
<artifactId>aspectjrt</artifactId>
</dependency>
</dependencies>
<build>
......@@ -39,8 +38,21 @@
<skip>true</skip>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<version>2.4</version>
<configuration>
<archive>
<manifest>
<addClasspath>true</addClasspath>
<classpathPrefix>./lib/</classpathPrefix>
<mainClass>us.codecraft.webmagic.main.QuickStarter</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
</plugins>
</build>
</project>
\ No newline at end of file
</project>
package us.codecraft.webmagic.worker;
import us.codecraft.webmagic.Spider;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
/**
* Container of Spiders.
*
* @author code4crafter@gmail.com
*/
public class Worker {
public static final int DEFAULT_POOL_SIZE = 10;
private int poolSize;
private ExecutorService executorService;
private Map<String,Spider> spiderMap;
public Worker(int poolSize) {
this.poolSize = poolSize;
this.executorService = initExecutorService();
this.spiderMap = new ConcurrentHashMap<String, Spider>();
}
public Worker() {
this(DEFAULT_POOL_SIZE);
}
protected ExecutorService initExecutorService() {
return Executors.newFixedThreadPool(poolSize);
}
public void addSpider(Spider spider) {
spider.setExecutorService(executorService);
spiderMap.put(spider.getUUID(), spider);
}
public Spider getSpider(String uuid){
return spiderMap.get(uuid);
}
}
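The `Worker` above combines two ideas: one fixed-size thread pool shared by every registered spider, and a concurrent map used as a registry keyed by UUID. A minimal stdlib-only sketch of that container pattern follows. `SketchTask` and `SketchContainer` are hypothetical names, `SketchTask` only imitates the two `Spider` methods the container touches, and the `shutdown` method is an addition that the committed `Worker` does not have.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical task standing in for Spider: it has an id and borrows an executor. */
class SketchTask {
    private final String id;
    private ExecutorService executor;
    final AtomicInteger runs = new AtomicInteger();

    SketchTask(String id) { this.id = id; }

    String getId() { return id; }

    void setExecutor(ExecutorService executor) { this.executor = executor; }

    /** Runs one unit of work on whatever pool the container handed over. */
    void start() {
        executor.execute(() -> runs.incrementAndGet());
    }
}

/** Container that shares a single pool across all registered tasks. */
class SketchContainer {
    private final ExecutorService pool;
    private final Map<String, SketchTask> tasks = new ConcurrentHashMap<>();

    SketchContainer(int poolSize) {
        this.pool = Executors.newFixedThreadPool(poolSize);
    }

    void add(SketchTask task) {
        task.setExecutor(pool);          // every task shares the same threads
        tasks.put(task.getId(), task);   // registry lookup by id
    }

    SketchTask get(String id) {
        return tasks.get(id);
    }

    /** Not in the original Worker: stops the pool and waits for queued work to finish. */
    boolean shutdown(long timeoutMillis) throws InterruptedException {
        pool.shutdown();
        return pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

Sharing one pool caps the total thread count regardless of how many tasks are registered, at the cost that a busy task can delay the others.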
package us.codecraft.webmagic.worker.controller;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.ResponseBody;
import us.codecraft.webmagic.worker.Worker;
import java.util.HashMap;
import java.util.Map;
/**
* @author code4crafter@gmail.com
*/
@Controller
@RequestMapping("spider")
public class SpiderController {
@Autowired
private Worker worker;
@RequestMapping("create")
@ResponseBody
public Map<String, Object> create(@RequestParam("id") String id) {
HashMap<String, Object> map = new HashMap<String, Object>();
map.put("code", 200);
return map;
}
}
number_format=#
classic_compatible=true
default_encoding=UTF-8
template_update_delay=0
#########################
template_exception_handler=rethrow
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
<appender name="stdout" class="org.apache.log4j.ConsoleAppender">
<layout class="org.apache.log4j.PatternLayout">
<param name="ConversionPattern" value="%d{yy-MM-dd HH:mm:ss,SSS} %-5p %c(%F:%L) ## %m%n" />
</layout>
</appender>
<logger name="org.apache" additivity="false">
<level value="warn" />
<appender-ref ref="stdout" />
</logger>
<root>
<level value="info" />
<appender-ref ref="stdout" />
</root>
</log4j:configuration>
<%@ page language="java" contentType="text/html; charset=utf8"
pageEncoding="utf8"%>
<!DOCTYPE html>
<!--
Hello future GitHubber! I bet you're here to remove those nasty inline styles,
DRY up these templates and make 'em nice and re-usable, right?
Please, don't. https://github.com/styleguide/templates/2.0
-->
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
<title>Page not found &middot; GitLab Pages</title>
<style type="text/css" media="screen">
body {
background-color: #f1f1f1;
margin: 0;
font-family: "Helvetica Neue", Helvetica, Arial, sans-serif;
}
.container { margin: 50px auto 40px auto; width: 600px; text-align: center; }
a { color: #4183c4; text-decoration: none; }
a:hover { text-decoration: underline; }
h1 { width: 800px; position:relative; left: -100px; letter-spacing: -1px; line-height: 60px; font-size: 60px; font-weight: 100; margin: 0px 0 50px 0; text-shadow: 0 1px 0 #fff; }
p { color: rgba(0, 0, 0, 0.5); margin: 20px 0; line-height: 1.6; }
ul { list-style: none; margin: 25px 0; padding: 0; }
li { display: table-cell; font-weight: bold; width: 1%; }
.logo { display: inline-block; margin-top: 35px; }
.logo-img-2x { display: none; }
@media
only screen and (-webkit-min-device-pixel-ratio: 2),
only screen and ( min--moz-device-pixel-ratio: 2),
only screen and ( -o-min-device-pixel-ratio: 2/1),
only screen and ( min-device-pixel-ratio: 2),
only screen and ( min-resolution: 192dpi),
only screen and ( min-resolution: 2dppx) {
.logo-img-1x { display: none; }
.logo-img-2x { display: inline-block; }
}
#suggestions {
margin-top: 35px;
color: #ccc;
}
#suggestions a {
color: #666666;
font-weight: 200;
font-size: 14px;
margin: 0 10px;
}
</style>
</head>
<body>
<div class="container">
<h1>404</h1>
<p><strong>There isn't a GitLab Page here.</strong></p>
<img alt="" src="" />
<p>Forgive my poor design.</p>
<p>You can edit 404.jsp to customize your 404 page.</p>
</div>
</body>
</html>
<%@ page language="java" contentType="text/html; charset=utf8"
pageEncoding="utf8" isErrorPage="true" import="java.io.*"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf8">
<title>500</title>
</head>
<body>
Something went wrong on this page!
<%
StringWriter stringWriter = new StringWriter();
exception.printStackTrace(new PrintWriter(stringWriter));
out.println(stringWriter.toString());
%>
</body>
</html>
\ No newline at end of file
<web-app version="2.5" xmlns="http://java.sun.com/xml/ns/javaee"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd">
<display-name>Archetype Created Web Application</display-name>
<context-param>
<param-name>contextConfigLocation</param-name>
<param-value>
classpath*:/config/spring/applicationContext*.xml,
</param-value>
</context-param>
<context-param>
<param-name>contextClass</param-name>
<param-value>org.springframework.web.context.support.XmlWebApplicationContext</param-value>
</context-param>
<!-- Location of the Log4j configuration file loaded by Spring -->
<context-param>
<param-name>log4jConfigLocation</param-name>
<param-value>classpath:log/log4j.xml</param-value>
</context-param>
<context-param>
<param-name>log4jRefreshInterval</param-name>
<param-value>60000</param-value>
</context-param>
<servlet>
<servlet-name>spring</servlet-name>
<servlet-class>org.springframework.web.servlet.DispatcherServlet</servlet-class>
<init-param>
<param-name>contextConfigLocation</param-name>
<param-value>classpath*:/config/spring/applicationContext*.xml</param-value>
</init-param>
<load-on-startup>1</load-on-startup>
</servlet>
<servlet-mapping>
<servlet-name>spring</servlet-name>
<url-pattern>/</url-pattern>
</servlet-mapping>
<error-page>
<error-code>404</error-code>
<location>/WEB-INF/jsp/404.jsp</location>
</error-page>
<error-page>
<error-code>500</error-code>
<location>/WEB-INF/jsp/500.jsp</location>
</error-page>
</web-app>
package us.codecraft.webmagic.worker;
import org.junit.Test;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import static org.assertj.core.api.Assertions.assertThat;
import static org.mockito.Mockito.*;
/**
* @author code4crafter@gmail.com
*/
public class WorkerTest {
@Test
public void testWorkerAsSpiderContains() throws Exception {
PageProcessor pageProcessor = mock(PageProcessor.class);
Site site = mock(Site.class);
when(pageProcessor.getSite()).thenReturn(site);
when(site.getDomain()).thenReturn("codecraft.us");
Worker worker = new Worker();
Spider spider = Spider.create(pageProcessor);
worker.addSpider(spider);
assertThat(worker.getSpider("codecraft.us")).isEqualTo(spider);
}
}
......@@ -3,7 +3,7 @@
<parent>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-parent</artifactId>
<version>0.4.3</version>
<version>0.5.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
......@@ -50,11 +50,6 @@
<artifactId>commons-collections</artifactId>
</dependency>
<dependency>
<groupId>net.sourceforge.htmlcleaner</groupId>
<artifactId>htmlcleaner</artifactId>
</dependency>
<dependency>
<groupId>org.assertj</groupId>
<artifactId>assertj-core</artifactId>
......@@ -70,6 +65,17 @@
<artifactId>commons-io</artifactId>
</dependency>
<dependency>
<groupId>com.jayway.jsonpath</groupId>
<artifactId>json-path</artifactId>
<version>0.8.1</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
</dependency>
</dependencies>
</project>
......@@ -2,6 +2,7 @@ package us.codecraft.webmagic;
import org.apache.commons.lang3.StringUtils;
import us.codecraft.webmagic.selector.Html;
import us.codecraft.webmagic.selector.Json;
import us.codecraft.webmagic.selector.Selectable;
import us.codecraft.webmagic.utils.UrlUtils;
......@@ -31,6 +32,8 @@ public class Page {
private Html html;
private Json json;
private String rawText;
private Selectable url;
......@@ -72,10 +75,23 @@ public class Page {
return html;
}
/**
* get json content of page
*
* @return json
* @since 0.5.0
*/
public Json getJson() {
if (json == null) {
json = new Json(rawText);
}
return json;
}
/**
* @param html
* @deprecated since 0.4.0
* The html is parsed lazily on the first call to {@link #getHtml()}, so use {@link #setRawText(String)} instead.
* The html is parsed lazily on the first call to {@link #getHtml()}, so use {@link #setRawText(String)} instead.
*/
public void setHtml(Html html) {
this.html = html;
......@@ -94,7 +110,7 @@ public class Page {
synchronized (targetRequests) {
for (String s : requests) {
if (StringUtils.isBlank(s) || s.equals("#") || s.startsWith("javascript:")) {
break;
continue;
}
s = UrlUtils.canonicalizeUrl(s, url.toString());
targetRequests.add(new Request(s));
......@@ -111,7 +127,7 @@ public class Page {
synchronized (targetRequests) {
for (String s : requests) {
if (StringUtils.isBlank(s) || s.equals("#") || s.startsWith("javascript:")) {
break;
continue;
}
s = UrlUtils.canonicalizeUrl(s, url.toString());
targetRequests.add(new Request(s).setPriority(priority));
......
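The two hunks above replace `break` with `continue` in the link loops. With `break`, the first blank, `#`, or `javascript:` link ended the loop and silently dropped every link after it; with `continue`, only the unusable link is skipped. A minimal sketch of the difference, assuming a simplified blank check (`trim().isEmpty()` approximates `StringUtils.isBlank`) and hypothetical example URLs; `LinkFilterSketch` is not part of the commit.

```java
import java.util.ArrayList;
import java.util.List;

class LinkFilterSketch {

    /** Links that should never become requests. */
    static boolean isSkippable(String link) {
        return link == null
                || link.trim().isEmpty()
                || link.equals("#")
                || link.startsWith("javascript:");
    }

    /** Fixed behavior: skip the bad link and keep going. */
    static List<String> filterWithContinue(List<String> links) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (isSkippable(link)) {
                continue;
            }
            kept.add(link);
        }
        return kept;
    }

    /** Old behavior: the first bad link ends the loop, losing all later links. */
    static List<String> filterWithBreak(List<String> links) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (isSkippable(link)) {
                break;
            }
            kept.add(link);
        }
        return kept;
    }
}
```

For an input that interleaves valid links with `#` and `javascript:` entries, `filterWithContinue` keeps every valid link, while `filterWithBreak` keeps only those before the first unusable one.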
......@@ -21,6 +21,8 @@ public class Request implements Serializable {
private String url;
private String method;
/**
* Store additional information in extras.
*/
......@@ -106,10 +108,25 @@ public class Request implements Serializable {
this.url = url;
}
/**
* The HTTP method of the request. GET by default.
* @return httpMethod
* @see us.codecraft.webmagic.utils.HttpConstant.Method
* @since 0.5.0
*/
public String getMethod() {
return method;
}
public void setMethod(String method) {
this.method = method;
}
@Override
public String toString() {
return "Request{" +
"url='" + url + '\'' +
", method='" + method + '\'' +
", extras=" + extras +
", priority=" + priority +
'}';
......
package us.codecraft.webmagic;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
/**
......@@ -14,7 +15,7 @@ import java.util.Map;
*/
public class ResultItems {
private Map<String, Object> fields = new HashMap<String, Object>();
private Map<String, Object> fields = new LinkedHashMap<String, Object>();
private Request request;
......
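The change above swaps `HashMap` for `LinkedHashMap` in `ResultItems`. The commit does not state the motivation; the observable effect is that iterating the fields follows insertion order, whereas `HashMap` gives no ordering guarantee. A minimal sketch of that behavior, with `OrderedFieldsSketch` as a hypothetical stand-in for the real class:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical field holder that mimics only the map inside ResultItems. */
class OrderedFieldsSketch {
    // LinkedHashMap iterates in insertion order.
    private final Map<String, Object> fields = new LinkedHashMap<>();

    OrderedFieldsSketch put(String key, Object value) {
        fields.put(key, value);
        return this;
    }

    /** Keys in the order they were first inserted. */
    List<String> keysInOrder() {
        return new ArrayList<>(fields.keySet());
    }
}
```

A consumer that prints or serializes the fields would then see them in the order the extraction code added them.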
package us.codecraft.webmagic;
import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;
import org.apache.http.HttpHost;
import us.codecraft.webmagic.utils.UrlUtils;
......@@ -18,7 +20,9 @@ public class Site {
private String userAgent;
private Map<String, String> cookies = new LinkedHashMap<String, String>();
private Map<String, String> defaultCookies = new LinkedHashMap<String, String>();
private Table<String, String, String> cookies = HashBasedTable.create();
private String charset;
......@@ -45,6 +49,10 @@ public class Site {
private boolean useGzip = true;
/**
* @see us.codecraft.webmagic.utils.HttpConstant.Header
* @deprecated
*/
public static interface HeaderConst {
public static final String REFERER = "Referer";
......@@ -72,7 +80,20 @@ public class Site {
* @return this
*/
public Site addCookie(String name, String value) {
cookies.put(name, value);
defaultCookies.put(name, value);
return this;
}
/**
* Add a cookie with specific domain.
*
* @param domain
* @param name
* @param value
* @return this
*/
public Site addCookie(String domain, String name, String value) {
cookies.put(domain, name, value);
return this;
}
......@@ -93,7 +114,16 @@ public class Site {
* @return get cookies
*/
public Map<String, String> getCookies() {
return cookies;
return defaultCookies;
}
/**
* get cookies of all domains
*
* @return get cookies
*/
public Map<String,Map<String, String>> getAllCookies() {
return cookies.rowMap();
}
/**
......@@ -203,10 +233,10 @@ public class Site {
* Add a url to start url.<br>
* Because urls are a property of the Spider rather than the Site, use {@link Spider#addUrl(String...)} instead.
*
* @deprecated
* @see Spider#addUrl(String...)
* @param startUrl
* @return this
* @see Spider#addUrl(String...)
* @deprecated
*/
public Site addStartUrl(String startUrl) {
return addStartRequest(new Request(startUrl));
......@@ -216,10 +246,10 @@ public class Site {
* Add a url to start url.<br>
* Because urls are a property of the Spider rather than the Site, use {@link Spider#addRequest(Request...)} instead.
*
* @deprecated
* @see Spider#addRequest(Request...)
* @param startUrl
* @param startRequest
* @return this
* @see Spider#addRequest(Request...)
* @deprecated
*/
public Site addStartRequest(Request startRequest) {
this.startRequests.add(startRequest);
......@@ -312,6 +342,7 @@ public class Site {
/**
* set up httpProxy for this site
*
* @param httpProxy
* @return
*/
......@@ -364,7 +395,8 @@ public class Site {
if (acceptStatCode != null ? !acceptStatCode.equals(site.acceptStatCode) : site.acceptStatCode != null)
return false;
if (charset != null ? !charset.equals(site.charset) : site.charset != null) return false;
if (cookies != null ? !cookies.equals(site.cookies) : site.cookies != null) return false;
if (defaultCookies != null ? !defaultCookies.equals(site.defaultCookies) : site.defaultCookies != null)
return false;
if (domain != null ? !domain.equals(site.domain) : site.domain != null) return false;
if (headers != null ? !headers.equals(site.headers) : site.headers != null) return false;
if (startRequests != null ? !startRequests.equals(site.startRequests) : site.startRequests != null)
......@@ -378,7 +410,7 @@ public class Site {
public int hashCode() {
int result = domain != null ? domain.hashCode() : 0;
result = 31 * result + (userAgent != null ? userAgent.hashCode() : 0);
result = 31 * result + (cookies != null ? cookies.hashCode() : 0);
result = 31 * result + (defaultCookies != null ? defaultCookies.hashCode() : 0);
result = 31 * result + (charset != null ? charset.hashCode() : 0);
result = 31 * result + (startRequests != null ? startRequests.hashCode() : 0);
result = 31 * result + sleepTime;
......@@ -395,7 +427,7 @@ public class Site {
return "Site{" +
"domain='" + domain + '\'' +
", userAgent='" + userAgent + '\'' +
", cookies=" + cookies +
", cookies=" + defaultCookies +
", charset='" + charset + '\'' +
", startRequests=" + startRequests +
", sleepTime=" + sleepTime +
......
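Reading aid for the Site cookie change above: a minimal, stdlib-only sketch of keeping default cookies apart from per-domain cookies. It swaps Guava's `HashBasedTable` for a nested `LinkedHashMap`, so it is not the code from this commit. The class name `CookieJarSketch` is hypothetical and does not exist in WebMagic.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the cookie handling in Site; not WebMagic code.
class CookieJarSketch {
    // Cookies added without a domain (what getCookies() returns in the diff).
    private final Map<String, String> defaultCookies = new LinkedHashMap<String, String>();
    // domain -> (name -> value); plays the role of Table.rowMap() in the diff.
    private final Map<String, Map<String, String>> domainCookies =
            new LinkedHashMap<String, Map<String, String>>();

    CookieJarSketch addCookie(String name, String value) {
        defaultCookies.put(name, value);
        return this;
    }

    CookieJarSketch addCookie(String domain, String name, String value) {
        Map<String, String> row = domainCookies.get(domain);
        if (row == null) {
            row = new LinkedHashMap<String, String>();
            domainCookies.put(domain, row);
        }
        row.put(name, value);
        return this;
    }

    // Read-only view of the cookies that have no explicit domain.
    Map<String, String> getCookies() {
        return Collections.unmodifiableMap(defaultCookies);
    }

    // Read-only view of the per-domain cookies, keyed by domain.
    Map<String, Map<String, String>> getAllCookies() {
        return Collections.unmodifiableMap(domainCookies);
    }
}
```

The two-argument overload only touches the default map and the three-argument overload only touches the per-domain map, which mirrors how the diff separates `defaultCookies` from the `cookies` table.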
......@@ -13,17 +13,14 @@ import us.codecraft.webmagic.pipeline.ResultItemsCollectorPipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.QueueScheduler;
import us.codecraft.webmagic.scheduler.Scheduler;
import us.codecraft.webmagic.utils.EnvironmentUtil;
import us.codecraft.webmagic.utils.ThreadUtils;
import us.codecraft.webmagic.selector.thread.CountableThreadPool;
import us.codecraft.webmagic.utils.UrlUtils;
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.UUID;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.Condition;
......@@ -78,6 +75,8 @@ public class Spider implements Runnable, Task {
protected Logger logger = LoggerFactory.getLogger(getClass());
protected CountableThreadPool threadPool;
protected ExecutorService executorService;
protected int threadNum = 1;
......@@ -100,10 +99,14 @@ public class Spider implements Runnable, Task {
private Condition newUrlCondition = newUrlLock.newCondition();
private final AtomicInteger threadAlive = new AtomicInteger(0);
private List<SpiderListener> spiderListeners;
private final AtomicLong pageCount = new AtomicLong(0);
private Date startTime;
private int emptySleepTime = 30000;
/**
* create a spider with pageProcessor.
*
......@@ -143,7 +146,7 @@ public class Spider implements Runnable, Task {
* Set startUrls of Spider.<br>
* Prior to startUrls of Site.
*
* @param startUrls
* @param startRequests
* @return this
*/
public Spider startRequest(List<Request> startRequests) {
......@@ -186,7 +189,14 @@ public class Spider implements Runnable, Task {
*/
public Spider setScheduler(Scheduler scheduler) {
checkIfRunning();
Scheduler oldScheduler = this.scheduler;
this.scheduler = scheduler;
if (oldScheduler != null) {
Request request;
while ((request = oldScheduler.poll(this)) != null) {
this.scheduler.push(request, this);
}
}
return this;
}
......@@ -219,7 +229,7 @@ public class Spider implements Runnable, Task {
/**
* set pipelines for Spider
*
* @param pipeline
* @param pipelines
* @return this
* @see Pipeline
* @since 0.4.1
......@@ -273,8 +283,12 @@ public class Spider implements Runnable, Task {
pipelines.add(new ConsolePipeline());
}
downloader.setThread(threadNum);
if (executorService == null || executorService.isShutdown()) {
executorService = ThreadUtils.newFixedThreadPool(threadNum);
if (threadPool == null || threadPool.isShutdown()) {
if (executorService != null && !executorService.isShutdown()) {
threadPool = new CountableThreadPool(threadNum, executorService);
} else {
threadPool = new CountableThreadPool(threadNum);
}
}
if (startRequests != null) {
for (Request request : startRequests) {
......@@ -282,7 +296,7 @@ public class Spider implements Runnable, Task {
}
startRequests.clear();
}
threadAlive.set(0);
startTime = new Date();
}
@Override
......@@ -293,23 +307,23 @@ public class Spider implements Runnable, Task {
while (!Thread.currentThread().isInterrupted() && stat.get() == STAT_RUNNING) {
Request request = scheduler.poll(this);
if (request == null) {
if (threadAlive.get() == 0 && exitWhenComplete) {
if (threadPool.getThreadAlive() == 0 && exitWhenComplete) {
break;
}
// wait until new url added
waitNewUrl();
} else {
final Request requestFinal = request;
threadAlive.incrementAndGet();
executorService.execute(new Runnable() {
threadPool.execute(new Runnable() {
@Override
public void run() {
try {
processRequest(requestFinal);
onSuccess(requestFinal);
} catch (Exception e) {
logger.error("download " + requestFinal + " error", e);
onError(requestFinal);
logger.error("process request " + requestFinal + " error", e);
} finally {
threadAlive.decrementAndGet();
pageCount.incrementAndGet();
signalNewUrl();
}
......@@ -324,6 +338,22 @@ public class Spider implements Runnable, Task {
}
}
protected void onError(Request request) {
if (CollectionUtils.isNotEmpty(spiderListeners)) {
for (SpiderListener spiderListener : spiderListeners) {
spiderListener.onError(request);
}
}
}
protected void onSuccess(Request request) {
if (CollectionUtils.isNotEmpty(spiderListeners)) {
for (SpiderListener spiderListener : spiderListeners) {
spiderListener.onSuccess(request);
}
}
}
private void checkRunningStat() {
while (true) {
int statNow = stat.get();
......@@ -342,7 +372,7 @@ public class Spider implements Runnable, Task {
for (Pipeline pipeline : pipelines) {
destroyEach(pipeline);
}
executorService.shutdown();
threadPool.shutdown();
}
private void destroyEach(Object object) {
......@@ -373,6 +403,7 @@ public class Spider implements Runnable, Task {
Page page = downloader.download(request, this);
if (page == null) {
sleep(site.getSleepTime());
onError(request);
return;
}
// for cycle retry
......@@ -478,7 +509,7 @@ public class Spider implements Runnable, Task {
/**
* Add urls with information to crawl.<br/>
*
* @param urls
* @param requests
* @return
*/
public Spider addRequest(Request... requests) {
......@@ -490,16 +521,15 @@ public class Spider implements Runnable, Task {
}
private void waitNewUrl() {
newUrlLock.lock();
try {
newUrlLock.lock();
//double check
if (threadAlive.get() == 0 && exitWhenComplete) {
if (threadPool.getThreadAlive() == 0 && exitWhenComplete) {
return;
}
try {
newUrlCondition.await();
} catch (InterruptedException e) {
}
newUrlCondition.await(emptySleepTime, TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
logger.warn("waitNewUrl - interrupted, error {}", e);
} finally {
newUrlLock.unlock();
}
......@@ -542,12 +572,18 @@ public class Spider implements Runnable, Task {
}
/**
* switch off xsoup
* start with more than one thread
*
* @return
* @param threadNum
* @return this
*/
public static void xsoupOff() {
EnvironmentUtil.setUseXsoup(false);
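public Spider thread(ExecutorService executorService, int threadNum) {
checkIfRunning();
if (threadNum <= 0) {
throw new IllegalArgumentException("threadNum should be more than zero!");
}
this.threadNum = threadNum;
// keep the supplied executor; otherwise the parameter is silently ignored
this.executorService = executorService;
return this;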
}
public boolean isExitWhenComplete() {
......@@ -624,7 +660,10 @@ public class Spider implements Runnable, Task {
* @since 0.4.1
*/
public int getThreadAlive() {
return threadAlive.get();
if (threadPool == null) {
return 0;
}
return threadPool.getThreadAlive();
}
/**
......@@ -653,8 +692,40 @@ public class Spider implements Runnable, Task {
return uuid;
}
public Spider setExecutorService(ExecutorService executorService) {
checkIfRunning();
this.executorService = executorService;
return this;
}
@Override
public Site getSite() {
return site;
}
public List<SpiderListener> getSpiderListeners() {
return spiderListeners;
}
public Spider setSpiderListeners(List<SpiderListener> spiderListeners) {
this.spiderListeners = spiderListeners;
return this;
}
public Date getStartTime() {
return startTime;
}
public Scheduler getScheduler() {
return scheduler;
}
/**
* Set wait time when no url is polled.<br></br>
*
* @param emptySleepTime In MILLISECONDS.
*/
public void setEmptySleepTime(int emptySleepTime) {
this.emptySleepTime = emptySleepTime;
}
}
package us.codecraft.webmagic;
/**
* Listener for Spider page-processing events. Used for monitoring and similar purposes.
*
* @author code4crafer@gmail.com
* @since 0.5.0
*/
public interface SpiderListener {
public void onSuccess(Request request);
public void onError(Request request);
}
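A possible use of the listener callbacks above is counting outcomes for monitoring. The sketch below is stdlib-only, so it uses a hypothetical stand-in interface (`RequestListenerSketch`) that takes a URL string instead of a WebMagic `Request`. Both class names are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in mirroring SpiderListener's two callbacks; the real
// interface receives a us.codecraft.webmagic.Request, reduced here to a URL.
interface RequestListenerSketch {
    void onSuccess(String url);

    void onError(String url);
}

// Counts successes and errors; the atomic counters make it safe to call
// from several worker threads at once.
class CountingListenerSketch implements RequestListenerSketch {
    private final AtomicLong successCount = new AtomicLong();
    private final AtomicLong errorCount = new AtomicLong();

    @Override
    public void onSuccess(String url) {
        successCount.incrementAndGet();
    }

    @Override
    public void onError(String url) {
        errorCount.incrementAndGet();
    }

    long getSuccessCount() {
        return successCount.get();
    }

    long getErrorCount() {
        return errorCount.get();
    }
}
```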
......@@ -34,6 +34,12 @@ public abstract class AbstractDownloader implements Downloader {
return (Html) page.getHtml();
}
protected void onSuccess(Request request) {
}
protected void onError(Request request) {
}
protected Page addToCycleRetry(Request request, Site site) {
Page page = new Page();
Object cycleTriedTimesObject = request.getExtra(Request.CYCLE_TRIED_TIMES);
......
......@@ -3,10 +3,12 @@ package us.codecraft.webmagic.downloader;
import com.google.common.collect.Sets;
import org.apache.commons.io.IOUtils;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.annotation.ThreadSafe;
import org.apache.http.client.config.CookieSpecs;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.client.methods.RequestBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;
......@@ -16,6 +18,7 @@ import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.utils.HttpConstant;
import us.codecraft.webmagic.selector.PlainText;
import us.codecraft.webmagic.utils.UrlUtils;
......@@ -74,33 +77,21 @@ public class HttpClientDownloader extends AbstractDownloader {
} else {
acceptStatCode = Sets.newHashSet(200);
}
logger.info("downloading page " + request.getUrl());
RequestBuilder requestBuilder = RequestBuilder.get().setUri(request.getUrl());
if (headers != null) {
for (Map.Entry<String, String> headerEntry : headers.entrySet()) {
requestBuilder.addHeader(headerEntry.getKey(), headerEntry.getValue());
}
}
RequestConfig.Builder requestConfigBuilder = RequestConfig.custom()
.setConnectionRequestTimeout(site.getTimeOut())
.setSocketTimeout(site.getTimeOut())
.setConnectTimeout(site.getTimeOut())
.setCookieSpec(CookieSpecs.BEST_MATCH);
if (site != null && site.getHttpProxy() != null) {
requestConfigBuilder.setProxy(site.getHttpProxy());
}
requestBuilder.setConfig(requestConfigBuilder.build());
logger.info("downloading page {}", request.getUrl());
CloseableHttpResponse httpResponse = null;
try {
httpResponse = getHttpClient(site).execute(requestBuilder.build());
HttpUriRequest httpUriRequest = getHttpUriRequest(request, site, headers);
httpResponse = getHttpClient(site).execute(httpUriRequest);
int statusCode = httpResponse.getStatusLine().getStatusCode();
if (acceptStatCode.contains(statusCode)) {
if (statusAccept(acceptStatCode, statusCode)) {
//charset
if (charset == null) {
String value = httpResponse.getEntity().getContentType().getValue();
charset = UrlUtils.getCharset(value);
}
return handleResponse(request, charset, httpResponse, task);
Page page = handleResponse(request, charset, httpResponse, task);
onSuccess(request);
return page;
} else {
logger.warn("code error " + statusCode + "\t" + request.getUrl());
return null;
......@@ -110,6 +101,7 @@ public class HttpClientDownloader extends AbstractDownloader {
if (site.getCycleRetryTimes() > 0) {
return addToCycleRetry(request, site);
}
onError(request);
return null;
} finally {
try {
......@@ -123,6 +115,58 @@ public class HttpClientDownloader extends AbstractDownloader {
}
}
@Override
public void setThread(int thread) {
httpClientGenerator.setPoolSize(thread);
}
protected boolean statusAccept(Set<Integer> acceptStatCode, int statusCode) {
return acceptStatCode.contains(statusCode);
}
protected HttpUriRequest getHttpUriRequest(Request request, Site site, Map<String, String> headers) {
RequestBuilder requestBuilder = selectRequestMethod(request).setUri(request.getUrl());
if (headers != null) {
for (Map.Entry<String, String> headerEntry : headers.entrySet()) {
requestBuilder.addHeader(headerEntry.getKey(), headerEntry.getValue());
}
}
RequestConfig.Builder requestConfigBuilder = RequestConfig.custom()
.setConnectionRequestTimeout(site.getTimeOut())
.setSocketTimeout(site.getTimeOut())
.setConnectTimeout(site.getTimeOut())
.setCookieSpec(CookieSpecs.BEST_MATCH);
if (site != null && site.getHttpProxy() != null) {
requestConfigBuilder.setProxy(site.getHttpProxy());
}
requestBuilder.setConfig(requestConfigBuilder.build());
return requestBuilder.build();
}
protected RequestBuilder selectRequestMethod(Request request) {
String method = request.getMethod();
if (method == null || method.equalsIgnoreCase(HttpConstant.Method.GET)) {
//default get
return RequestBuilder.get();
} else if (method.equalsIgnoreCase(HttpConstant.Method.POST)) {
RequestBuilder requestBuilder = RequestBuilder.post();
NameValuePair[] nameValuePair = (NameValuePair[]) request.getExtra("nameValuePair");
if (nameValuePair != null && nameValuePair.length > 0) {
requestBuilder.addParameters(nameValuePair);
}
return requestBuilder;
} else if (method.equalsIgnoreCase(HttpConstant.Method.HEAD)) {
return RequestBuilder.head();
} else if (method.equalsIgnoreCase(HttpConstant.Method.PUT)) {
return RequestBuilder.put();
} else if (method.equalsIgnoreCase(HttpConstant.Method.DELETE)) {
return RequestBuilder.delete();
} else if (method.equalsIgnoreCase(HttpConstant.Method.TRACE)) {
return RequestBuilder.trace();
}
throw new IllegalArgumentException("Illegal HTTP Method " + method);
}
protected Page handleResponse(Request request, String charset, HttpResponse httpResponse, Task task) throws IOException {
String content = IOUtils.toString(httpResponse.getEntity().getContent(), charset);
Page page = new Page();
......@@ -132,9 +176,4 @@ public class HttpClientDownloader extends AbstractDownloader {
page.setStatusCode(httpResponse.getStatusLine().getStatusCode());
return page;
}
@Override
public void setThread(int thread) {
httpClientGenerator.setPoolSize(thread);
}
}
......@@ -36,7 +36,7 @@ public class HttpClientGenerator {
connectionManager.setDefaultMaxPerRoute(100);
}
public HttpClientGenerator setPoolSize(int poolSize){
public HttpClientGenerator setPoolSize(int poolSize) {
connectionManager.setMaxTotal(poolSize);
return this;
}
......@@ -76,10 +76,15 @@ public class HttpClientGenerator {
private void generateCookie(HttpClientBuilder httpClientBuilder, Site site) {
CookieStore cookieStore = new BasicCookieStore();
if (site.getCookies() != null) {
for (Map.Entry<String, String> cookieEntry : site.getCookies().entrySet()) {
for (Map.Entry<String, String> cookieEntry : site.getCookies().entrySet()) {
BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue());
cookie.setDomain(site.getDomain());
cookieStore.addCookie(cookie);
}
for (Map.Entry<String, Map<String, String>> domainEntry : site.getAllCookies().entrySet()) {
for (Map.Entry<String, String> cookieEntry : domainEntry.getValue().entrySet()) {
BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue());
cookie.setDomain(site.getDomain());
cookie.setDomain(domainEntry.getKey());
cookieStore.addCookie(cookie);
}
}
......
......@@ -11,11 +11,12 @@ import us.codecraft.webmagic.processor.PageProcessor;
*/
public class GithubRepoPageProcessor implements PageProcessor {
private Site site = Site.me().setRetryTimes(3).setSleepTime(100);
private Site site = Site.me().setRetryTimes(3).setSleepTime(0);
@Override
public void process(Page page) {
page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+)").all());
page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
if (page.getResultItems().get("name")==null){
......
......@@ -34,6 +34,6 @@ public class OschinaBlogPageProcessor implements PageProcessor {
}
public static void main(String[] args) {
Spider.create(new OschinaBlogPageProcessor()).addUrl("http://my.oschina.net/flashsword/blog").thread(2).run();
Spider.create(new OschinaBlogPageProcessor()).addUrl("http://my.oschina.net/flashsword/blog").run();
}
}
package us.codecraft.webmagic.scheduler;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Task;
/**
* Remove duplicate urls and only push urls which are not duplicate.<br></br>
*
* @author code4crafer@gmail.com
* @since 0.5.0
*/
public abstract class DuplicatedRemoveScheduler implements Scheduler {
protected Logger logger = LoggerFactory.getLogger(getClass());
@Override
public void push(Request request, Task task) {
logger.trace("get a candidate url {}", request.getUrl());
if (isDuplicate(request, task) || shouldReserved(request)) {
logger.debug("push to queue {}", request.getUrl());
pushWhenNoDuplicate(request, task);
}
}
/**
* Reset duplicate check.
*/
public abstract void resetDuplicateCheck(Task task);
/**
* @param request
* @return
*/
protected abstract boolean isDuplicate(Request request, Task task);
protected boolean shouldReserved(Request request) {
return request.getExtra(Request.CYCLE_TRIED_TIMES) != null;
}
protected void pushWhenNoDuplicate(Request request, Task task) {
}
}
package us.codecraft.webmagic.scheduler;
import com.google.common.collect.Sets;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Task;
......@@ -10,24 +8,27 @@ import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
/**
* Base Scheduler with duplicated urls removed locally.
* Base Scheduler with duplicated urls removed by hash set.<br></br>
*
* @author code4crafter@gmail.com
* @since 0.5.0
*/
public abstract class LocalDuplicatedRemovedScheduler implements Scheduler {
protected Logger logger = LoggerFactory.getLogger(getClass());
public abstract class LocalDuplicatedRemoveScheduler extends DuplicatedRemoveScheduler implements MonitorableScheduler {
private Set<String> urls = Sets.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
@Override
public void push(Request request, Task task) {
logger.debug("push to queue " + request.getUrl());
if (request.getExtra(Request.CYCLE_TRIED_TIMES) != null || urls.add(request.getUrl())) {
pushWhenNoDuplicate(request, task);
}
public void resetDuplicateCheck(Task task) {
urls.clear();
}
@Override
protected boolean isDuplicate(Request request, Task task) {
// Set.add returns true only when the url was not seen before; push() relies on that.
return urls.add(request.getUrl());
}
protected abstract void pushWhenNoDuplicate(Request request, Task task);
@Override
public int getTotalRequestsCount(Task task) {
return urls.size();
}
}
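The hash-set duplicate check above depends on `Set.add` returning `true` only when the element was not already present. A minimal sketch of that behaviour, using `Collections.newSetFromMap` from the JDK in place of Guava's `Sets.newSetFromMap`; the class `UrlDedupSketch` and its method names are hypothetical.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper illustrating the hash-set duplicate check; not WebMagic code.
class UrlDedupSketch {
    // Concurrent set backed by a ConcurrentHashMap, as in the scheduler above.
    private final Set<String> seen =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    // Returns true only the first time a given url is offered.
    boolean markIfNew(String url) {
        return seen.add(url);
    }

    // Number of distinct urls offered so far.
    int size() {
        return seen.size();
    }

    // Forgets every url, so previously seen urls count as new again.
    void reset() {
        seen.clear();
    }
}
```

Because the check and the insert are a single `add` call, two threads offering the same url cannot both be told it is new.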
package us.codecraft.webmagic.scheduler;
import us.codecraft.webmagic.Task;
/**
* A scheduler whose requests can be counted for monitoring.
*
* @author code4crafter@gmail.com
* @since 0.5.0
*/
public interface MonitorableScheduler extends Scheduler {
public int getLeftRequestsCount(Task task);
public int getTotalRequestsCount(Task task);
}
\ No newline at end of file
......@@ -17,7 +17,7 @@ import java.util.concurrent.PriorityBlockingQueue;
* @since 0.2.1
*/
@ThreadSafe
public class PriorityScheduler extends LocalDuplicatedRemovedScheduler {
public class PriorityScheduler extends LocalDuplicatedRemoveScheduler {
public static final int INITIAL_CAPACITY = 5;
......@@ -60,4 +60,9 @@ public class PriorityScheduler extends LocalDuplicatedRemovedScheduler {
}
return priorityQueueMinus.poll();
}
@Override
public int getLeftRequestsCount(Task task) {
return noPriorityQueue.size();
}
}
......@@ -16,7 +16,7 @@ import java.util.concurrent.LinkedBlockingQueue;
* @since 0.1.0
*/
@ThreadSafe
public class QueueScheduler extends LocalDuplicatedRemovedScheduler {
public class QueueScheduler extends LocalDuplicatedRemoveScheduler {
private BlockingQueue<Request> queue = new LinkedBlockingQueue<Request>();
......@@ -29,4 +29,9 @@ public class QueueScheduler extends LocalDuplicatedRemovedScheduler {
public synchronized Request poll(Task task) {
return queue.poll();
}
@Override
public int getLeftRequestsCount(Task task) {
return queue.size();
}
}
......@@ -4,7 +4,6 @@ import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.utils.EnvironmentUtil;
import java.util.ArrayList;
import java.util.List;
......@@ -24,7 +23,7 @@ public class Html extends PlainText {
*/
private Document document;
private boolean init = false;
private boolean needInitCache = true;
public Html(List<String> strings) {
super(strings);
......@@ -34,12 +33,22 @@ public class Html extends PlainText {
super(text);
}
public Html(List<String> strings, boolean needInitCache) {
super(strings);
this.needInitCache = needInitCache;
}
public Html(String text, boolean needInitCache) {
super(text);
this.needInitCache = needInitCache;
}
/**
* lazy init
*/
private void initDocument() {
if (this.document == null && !init) {
init = true;
if (this.document == null && needInitCache) {
needInitCache = false;
//just init once whether the parsing succeeds or not
try {
this.document = Jsoup.parse(getText());
......@@ -68,7 +77,7 @@ public class Html extends PlainText {
results.add(result);
}
}
return new Html(results);
return new Html(results, false);
}
@Override
......@@ -79,7 +88,7 @@ public class Html extends PlainText {
List<String> result = selector.selectList(string);
results.addAll(result);
}
return new Html(results);
return new Html(results, false);
}
@Override
......@@ -96,23 +105,18 @@ public class Html extends PlainText {
@Override
public Selectable xpath(String xpath) {
if (EnvironmentUtil.useXsoup()) {
XsoupSelector xsoupSelector = new XsoupSelector(xpath);
if (document != null) {
return new Html(xsoupSelector.selectList(document));
}
return selectList(xsoupSelector, strings);
} else {
XpathSelector xpathSelector = new XpathSelector(xpath);
return selectList(xpathSelector, strings);
XpathSelector xpathSelector = Selectors.xpath(xpath);
if (document != null) {
return new Html(xpathSelector.selectList(document), false);
}
return selectList(xpathSelector, strings);
}
@Override
public Selectable $(String selector) {
CssSelector cssSelector = Selectors.$(selector);
if (document != null) {
return new Html(cssSelector.selectList(document));
return new Html(cssSelector.selectList(document), false);
}
return selectList(cssSelector, strings);
}
......@@ -121,12 +125,13 @@ public class Html extends PlainText {
public Selectable $(String selector, String attrName) {
CssSelector cssSelector = Selectors.$(selector, attrName);
if (document != null) {
return new Html(cssSelector.selectList(document));
return new Html(cssSelector.selectList(document), false);
}
return selectList(cssSelector, strings);
}
public Document getDocument() {
initDocument();
return document;
}
......
package us.codecraft.webmagic.selector;
import com.alibaba.fastjson.JSON;
import org.jsoup.parser.TokenQueue;
import java.util.List;
/**
* parse json
* @author code4crafter@gmail.com
* @since 0.5.0
*/
public class Json extends PlainText {
public Json(List<String> strings) {
super(strings);
}
public Json(String text) {
super(text);
}
/**
* remove padding for JSONP
* @param padding
* @return
*/
public Json removePadding(String padding) {
String text = getText();
TokenQueue tokenQueue = new TokenQueue(text);
tokenQueue.consumeWhitespace();
tokenQueue.consume(padding);
tokenQueue.consumeWhitespace();
String chompBalanced = tokenQueue.chompBalanced('(', ')');
return new Json(chompBalanced);
}
public <T> T toObject(Class<T> clazz) {
if (getText() == null) {
return null;
}
return JSON.parseObject(getText(), clazz);
}
public <T> List<T> toList(Class<T> clazz) {
if (getText() == null) {
return null;
}
return JSON.parseArray(getText(), clazz);
}
public String getText() {
if (strings != null && strings.size() > 0) {
return strings.get(0);
}
return null;
}
@Override
public Selectable jsonPath(String jsonPath) {
JsonPathSelector jsonPathSelector = new JsonPathSelector(jsonPath);
return selectList(jsonPathSelector,strings);
}
}
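To illustrate what `removePadding` is meant to do for JSONP, here is a simplified, stdlib-only sketch. It is an assumption-laden stand-in: it takes everything between the first `(` after the callback name and the last `)` in the text, whereas the code above uses jsoup's `TokenQueue.chompBalanced` for balanced matching. The class name `JsonpSketch` is hypothetical.

```java
// Hypothetical helper; a simplification of the JSONP unwrapping shown above.
class JsonpSketch {
    // Strips "padding( ... )" and returns the inner text. Returns the input
    // unchanged when it does not look like a call to the given callback.
    static String removePadding(String text, String padding) {
        String trimmed = text.trim();
        if (!trimmed.startsWith(padding)) {
            return text;
        }
        int open = trimmed.indexOf('(', padding.length());
        int close = trimmed.lastIndexOf(')');
        if (open < 0 || close < open) {
            return text;
        }
        return trimmed.substring(open + 1, close);
    }
}
```

The inner text can then be handed to any JSON parser; a trailing `;` after the closing parenthesis is dropped along with the padding.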
package us.codecraft.webmagic.selector;
import com.jayway.jsonpath.JsonPath;
import us.codecraft.webmagic.utils.Experimental;
import java.util.ArrayList;
import java.util.List;
......@@ -13,7 +12,6 @@ import java.util.List;
* @author code4crafter@gmail.com <br>
* @since 0.2.1
*/
@Experimental
public class JsonPathSelector implements Selector {
private String jsonPathStr;
......@@ -22,7 +20,7 @@ public class JsonPathSelector implements Selector {
public JsonPathSelector(String jsonPathStr) {
this.jsonPathStr = jsonPathStr;
this.jsonPath = JsonPath.compile(jsonPathStr);
this.jsonPath = JsonPath.compile(this.jsonPathStr);
}
@Override
......
......@@ -109,7 +109,12 @@ public class PlainText implements Selectable {
}
@Override
public String toString() {
public Selectable jsonPath(String jsonPath) {
throw new UnsupportedOperationException();
}
@Override
public String get() {
if (CollectionUtils.isNotEmpty(all())) {
return all().get(0);
} else {
......@@ -117,6 +122,21 @@ public class PlainText implements Selectable {
}
}
@Override
public Selectable select(Selector selector) {
return select(selector, strings);
}
@Override
public Selectable selectList(Selector selector) {
return selectList(selector, strings);
}
@Override
public String toString() {
return get();
}
@Override
public boolean match() {
return strings != null && strings.size() > 0;
......
......@@ -99,6 +99,13 @@ public interface Selectable {
*/
public String toString();
/**
* single string result
*
* @return single string result
*/
public String get();
/**
* if result exist for select
*
......@@ -112,4 +119,28 @@ public interface Selectable {
* @return multi string result
*/
public List<String> all();
/**
* extract by JSON Path expression
*
* @param jsonPath
* @return
*/
public Selectable jsonPath(String jsonPath);
/**
* extract by custom selector
*
* @param selector
* @return
*/
public Selectable select(Selector selector);
/**
* extract a list of results by custom selector
*
* @param selector
* @return
*/
public Selectable selectList(Selector selector);
}
......@@ -32,8 +32,12 @@ public abstract class Selectors {
return new XpathSelector(expr);
}
public static XsoupSelector xsoup(String expr) {
return new XsoupSelector(expr);
/**
* @deprecated use {@link #xpath(String)} instead
* @see #xpath(String)
*/
public static XpathSelector xsoup(String expr) {
return new XpathSelector(expr);
}
public static AndSelector and(Selector... selectors) {
......
package us.codecraft.webmagic.selector;
import org.htmlcleaner.*;
import org.jsoup.nodes.Element;
import us.codecraft.xsoup.XPathEvaluator;
import us.codecraft.xsoup.Xsoup;
import java.util.ArrayList;
import java.util.List;
/**
* XPath selector based on HtmlCleaner.<br>
* XPath selector based on Xsoup.<br>
*
* @author code4crafter@gmail.com <br>
* @since 0.1.0
* @since 0.3.0
*/
public class XpathSelector implements Selector {
public class XpathSelector extends BaseElementSelector {
private String xpathStr;
private XPathEvaluator xPathEvaluator;
public XpathSelector(String xpathStr) {
this.xpathStr = xpathStr;
this.xPathEvaluator = Xsoup.compile(xpathStr);
}
@Override
public String select(String text) {
HtmlCleaner htmlCleaner = new HtmlCleaner();
TagNode tagNode = htmlCleaner.clean(text);
if (tagNode == null) {
return null;
}
try {
Object[] objects = tagNode.evaluateXPath(xpathStr);
if (objects != null && objects.length >= 1) {
if (objects[0] instanceof TagNode) {
TagNode tagNode1 = (TagNode) objects[0];
return htmlCleaner.getInnerHtml(tagNode1);
} else {
return objects[0].toString();
}
}
} catch (XPatherException e) {
e.printStackTrace();
}
return null;
public String select(Element element) {
return xPathEvaluator.evaluate(element).get();
}
@Override
public List<String> selectList(String text) {
HtmlCleaner htmlCleaner = new HtmlCleaner();
TagNode tagNode = htmlCleaner.clean(text);
if (tagNode == null) {
return null;
}
List<String> results = new ArrayList<String>();
try {
Object[] objects = tagNode.evaluateXPath(xpathStr);
if (objects != null && objects.length >= 1) {
for (Object object : objects) {
if (object instanceof TagNode) {
TagNode tagNode1 = (TagNode) object;
results.add(htmlCleaner.getInnerHtml(tagNode1));
} else {
results.add(object.toString());
}
}
}
} catch (XPatherException e) {
e.printStackTrace();
}
return results;
public List<String> selectList(Element element) {
return xPathEvaluator.evaluate(element).list();
}
}
package us.codecraft.webmagic.selector;
import org.jsoup.nodes.Element;
import us.codecraft.xsoup.XPathEvaluator;
import us.codecraft.xsoup.Xsoup;
import java.util.List;
/**
* XPath selector based on Xsoup.<br>
*
* @author code4crafter@gmail.com <br>
* @since 0.3.0
*/
public class XsoupSelector extends BaseElementSelector {
private XPathEvaluator xPathEvaluator;
public XsoupSelector(String xpathStr) {
this.xPathEvaluator = Xsoup.compile(xpathStr);
}
@Override
public String select(Element element) {
return xPathEvaluator.evaluate(element).get();
}
@Override
public List<String> selectList(Element element) {
return xPathEvaluator.evaluate(element).list();
}
}
package us.codecraft.webmagic.selector.thread;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;
/**
* Thread pool for workers.<br></br>
* Uses {@link java.util.concurrent.ExecutorService} as the inner implementation. <br></br>
* New features: <br></br>
* 1. Blocks when the thread pool is full, to avoid polling many urls without processing them. <br></br>
* 2. Counts the threads that are alive, for monitoring.
*
* @author code4crafer@gmail.com
* @since 0.5.0
*/
public class CountableThreadPool {
private int threadNum;
private AtomicInteger threadAlive = new AtomicInteger();
private ReentrantLock reentrantLock = new ReentrantLock();
private Condition condition = reentrantLock.newCondition();
public CountableThreadPool(int threadNum) {
this.threadNum = threadNum;
this.executorService = Executors.newFixedThreadPool(threadNum);
}
public CountableThreadPool(int threadNum, ExecutorService executorService) {
this.threadNum = threadNum;
this.executorService = executorService;
}
public void setExecutorService(ExecutorService executorService) {
this.executorService = executorService;
}
public int getThreadAlive() {
return threadAlive.get();
}
public int getThreadNum() {
return threadNum;
}
private ExecutorService executorService;
public void execute(final Runnable runnable) {
if (threadAlive.get() >= threadNum) {
reentrantLock.lock();
try {
while (threadAlive.get() >= threadNum) {
try {
condition.await();
} catch (InterruptedException e) {
}
}
} finally {
reentrantLock.unlock();
}
}
threadAlive.incrementAndGet();
executorService.execute(new Runnable() {
@Override
public void run() {
try {
runnable.run();
} finally {
reentrantLock.lock();
try {
threadAlive.decrementAndGet();
condition.signal();
} finally {
reentrantLock.unlock();
}
}
}
});
}
public boolean isShutdown() {
return executorService.isShutdown();
}
public void shutdown() {
executorService.shutdown();
}
}
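The back-pressure idea in `execute` (block the submitting thread while every worker is busy, and expose a count of running tasks) can also be expressed with a `Semaphore`. The sketch below is an alternative illustration, not the code in this commit; the class name `BoundedSubmitSketch` and the helper `maxObservedConcurrency` are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a pool whose execute() blocks while all slots are taken.
final class BoundedSubmitSketch {
    private final int capacity;
    private final Semaphore permits;
    private final ExecutorService executor;

    BoundedSubmitSketch(int capacity) {
        this.capacity = capacity;
        this.permits = new Semaphore(capacity);
        this.executor = Executors.newFixedThreadPool(capacity);
    }

    /** Blocks the caller until a slot is free, then hands the task to the pool. */
    void execute(Runnable task) throws InterruptedException {
        permits.acquire();
        try {
            executor.execute(() -> {
                try {
                    task.run();
                } finally {
                    permits.release(); // free the slot even if the task throws
                }
            });
        } catch (RuntimeException e) {
            permits.release(); // submission was rejected, so give the slot back
            throw e;
        }
    }

    /** Number of slots currently taken, comparable to getThreadAlive(). */
    int getThreadAlive() {
        return capacity - permits.availablePermits();
    }

    void shutdownAndWait() throws InterruptedException {
        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
    }

    /** Runs short jobs and returns the highest number seen running at once. */
    static int maxObservedConcurrency(int capacity, int tasks) {
        BoundedSubmitSketch pool = new BoundedSubmitSketch(capacity);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger max = new AtomicInteger();
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    max.accumulateAndGet(running.incrementAndGet(), Math::max);
                    try {
                        Thread.sleep(5);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    running.decrementAndGet();
                });
            }
            pool.shutdownAndWait();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return max.get();
    }
}
```

A semaphore keeps the counting and the waiting in one object, at the cost of not sharing a lock with other state; the lock-plus-condition version in the commit makes the alive counter directly readable as an `AtomicInteger`.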
package us.codecraft.webmagic.utils;
import org.apache.commons.lang3.BooleanUtils;
import java.util.Properties;
/**
* @author code4crafter@gmail.com
* @since 0.3.0
*/
public abstract class EnvironmentUtil {
private static final String USE_XSOUP = "xsoup";
public static boolean useXsoup() {
Properties properties = System.getProperties();
Object o = properties.get(USE_XSOUP);
if (o == null) {
return true;
}
return BooleanUtils.toBoolean(((String) o).toLowerCase());
}
public static void setUseXsoup(boolean useXsoup) {
Properties properties = System.getProperties();
properties.setProperty(USE_XSOUP, BooleanUtils.toString(useXsoup, "true", "false"));
}
}
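`EnvironmentUtil` reads its switch from a JVM system property, so it can be set in code or with `-Dxsoup=false` on the command line. A minimal JDK-only sketch of the same default-to-true behaviour follows; `FeatureFlagSketch` is a hypothetical name, and `Boolean.parseBoolean` is swapped in for `BooleanUtils.toBoolean`, so only the string "true" (case-insensitive) counts as enabled here.

```java
// Hypothetical sketch of a system-property feature flag that defaults to true.
final class FeatureFlagSketch {
    private static final String USE_XSOUP = "xsoup";

    /** Returns true when the property is absent, otherwise parses its value. */
    static boolean useXsoup() {
        String value = System.getProperty(USE_XSOUP);
        return value == null || Boolean.parseBoolean(value);
    }

    /** Stores the flag as "true" or "false" in the system properties. */
    static void setUseXsoup(boolean enabled) {
        System.setProperty(USE_XSOUP, Boolean.toString(enabled));
    }
}
```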
package us.codecraft.webmagic.utils;
/**
* Some constants of the HTTP protocol.
* @author code4crafer@gmail.com
* @since 0.5.0
*/
public abstract class HttpConstant {
public static abstract class Method {
public static final String GET = "GET";
public static final String HEAD = "HEAD";
public static final String POST = "POST";
public static final String PUT = "PUT";
public static final String DELETE = "DELETE";
public static final String TRACE = "TRACE";
public static final String CONNECT = "CONNECT";
}
public static abstract class Header {
public static final String REFERER = "Referer";
public static final String USER_AGENT = "User-Agent";
}
}
package us.codecraft.webmagic.utils;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
/**
* @author code4crafer@gmail.com
* @since 0.1.0
*/
public class ThreadUtils {
public static ExecutorService newFixedThreadPool(int threadSize) {
if (threadSize <= 0) {
throw new IllegalArgumentException("ThreadSize must be greater than 0!");
}
if (threadSize == 1) {
return MoreExecutors.sameThreadExecutor();
}
return new ThreadPoolExecutor(threadSize - 1, threadSize - 1, 0L, TimeUnit.MILLISECONDS,
new SynchronousQueue<Runnable>(), new ThreadPoolExecutor.CallerRunsPolicy());
}
}
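The pool built by `newFixedThreadPool` has `threadSize - 1` workers, a `SynchronousQueue` (which holds no tasks) and `CallerRunsPolicy`, so when every worker is busy the submitting thread runs the task itself. That makes the caller the last of the `threadSize` threads. The sketch below records which threads actually run tasks; `CallerRunsSketch` and `threadsUsed` are hypothetical names, and it requires `threadSize >= 2` because it omits the single-thread special case.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: observe which threads execute tasks under CallerRunsPolicy.
final class CallerRunsSketch {

    /** Submits short tasks and returns the names of the threads that ran them. */
    static Set<String> threadsUsed(int threadSize, int tasks) {
        if (threadSize < 2) {
            throw new IllegalArgumentException("this sketch needs threadSize >= 2");
        }
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threadSize - 1, threadSize - 1, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<Runnable>(),
                new ThreadPoolExecutor.CallerRunsPolicy());
        Set<String> names = ConcurrentHashMap.newKeySet();
        try {
            for (int i = 0; i < tasks; i++) {
                // When no worker is idle, this call runs the task on the caller.
                pool.execute(() -> {
                    names.add(Thread.currentThread().getName());
                    try {
                        Thread.sleep(20);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        } finally {
            pool.shutdown();
        }
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return names;
    }
}
```

Because the caller is blocked while it runs a task, it cannot submit further work in the meantime, which gives the same throttling effect as a bounded queue without buffering any tasks.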
......@@ -43,12 +43,22 @@ public class UrlUtils {
if (url.startsWith("?"))
url = base.getPath() + url;
URL abs = new URL(base, url);
return abs.toExternalForm();
return encodeIllegalCharacterInUrl(abs.toExternalForm());
} catch (MalformedURLException e) {
return "";
}
}
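/**
* Encodes characters that are illegal in a URL. Currently only spaces are encoded (as %20).
*
* @param url the URL to encode
* @return the URL with illegal characters encoded
*/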
public static String encodeIllegalCharacterInUrl(String url) {
// TODO: support more characters
return url.replace(" ", "%20");
}
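The change above makes `canonicalizeUrl` resolve the link against its base and then percent-encode spaces. A minimal standalone sketch of that two-step flow, using only `java.net.URL`, is shown here; `CanonicalizeSketch` is a hypothetical name and the `?`-prefix handling from the hunk is left out. The `URL` constructors are deprecated on recent JDKs (20 and later) but still available.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch: resolve a relative link, then encode spaces.
final class CanonicalizeSketch {

    /** Resolves url against refer; returns "" when the base cannot be parsed. */
    static String canonicalizeUrl(String url, String refer) {
        try {
            URL base = new URL(refer);
            URL abs = new URL(base, url);
            return encodeIllegalCharacterInUrl(abs.toExternalForm());
        } catch (MalformedURLException e) {
            return "";
        }
    }

    /** Only spaces are handled, mirroring the TODO in the commit. */
    static String encodeIllegalCharacterInUrl(String url) {
        return url.replace(" ", "%20");
    }
}
```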
public static String getHost(String url) {
String host = url;
int i = StringUtils.ordinalIndexOf(url, "/", 3);
......@@ -73,18 +83,37 @@ public class UrlUtils {
return domain;
}
private static Pattern patternForHref = Pattern.compile("(<a[^<>]*href=)[\"']{0,1}([^\"'<>\\s]*)[\"']{0,1}", Pattern.CASE_INSENSITIVE);
/**
* Allows whitespace inside a quoted href value.
*/
private static Pattern patternForHrefWithQuote = Pattern.compile("(<a[^<>]*href=)[\"']([^\"'<>]*)[\"']", Pattern.CASE_INSENSITIVE);
/**
* Disallows whitespace in an unquoted href value.
*/
private static Pattern patternForHrefWithoutQuote = Pattern.compile("(<a[^<>]*href=)([^\"'<>\\s]+)", Pattern.CASE_INSENSITIVE);
public static String fixAllRelativeHrefs(String html, String url) {
html = replaceByPattern(html, url, patternForHrefWithQuote);
html = replaceByPattern(html, url, patternForHrefWithoutQuote);
return html;
}
public static String replaceByPattern(String html, String url, Pattern pattern) {
StringBuilder stringBuilder = new StringBuilder();
Matcher matcher = patternForHref.matcher(html);
Matcher matcher = pattern.matcher(html);
int lastEnd = 0;
boolean modified = false;
while (matcher.find()) {
modified = true;
stringBuilder.append(StringUtils.substring(html, lastEnd, matcher.start()));
stringBuilder.append(matcher.group(1));
stringBuilder.append("\"").append(canonicalizeUrl(matcher.group(2), url)).append("\"");
lastEnd = matcher.end();
}
if (!modified) {
return html;
}
stringBuilder.append(StringUtils.substring(html, lastEnd));
return stringBuilder.toString();
}
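Splitting the single href pattern in two lets a quoted value contain spaces while an unquoted value still ends at the first whitespace. The sketch below reuses the two regular expressions from this hunk and applies them in the same order; the class name `HrefRewriteSketch` and the `resolver` parameter (a stand-in for `canonicalizeUrl`) are assumptions made to keep the example self-contained.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: rewrite href values with a quoted pass, then an unquoted pass.
final class HrefRewriteSketch {
    private static final Pattern QUOTED = Pattern.compile(
            "(<a[^<>]*href=)[\"']([^\"'<>]*)[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern UNQUOTED = Pattern.compile(
            "(<a[^<>]*href=)([^\"'<>\\s]+)", Pattern.CASE_INSENSITIVE);

    static String fixAllHrefs(String html, UnaryOperator<String> resolver) {
        // After the quoted pass every rewritten value starts with a double quote,
        // so the unquoted pattern cannot match it a second time.
        return rewrite(rewrite(html, QUOTED, resolver), UNQUOTED, resolver);
    }

    private static String rewrite(String html, Pattern pattern, UnaryOperator<String> resolver) {
        Matcher matcher = pattern.matcher(html);
        StringBuilder out = new StringBuilder();
        int lastEnd = 0;
        while (matcher.find()) {
            out.append(html, lastEnd, matcher.start());   // text before the match
            out.append(matcher.group(1));                 // "<a ... href="
            out.append('"').append(resolver.apply(matcher.group(2))).append('"');
            lastEnd = matcher.end();
        }
        return out.append(html, lastEnd, html.length()).toString();
    }
}
```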
......
package us.codecraft.webmagic;
import org.junit.Assert;
import org.junit.Test;
import us.codecraft.webmagic.selector.Html;
......@@ -14,7 +13,8 @@ public class HtmlTest {
@Test
public void testRegexSelector() {
Html selectable = new Html("aaaaaaab");
Assert.assertEquals("abbabbab", (selectable.regex("(.*)").replace("aa(a)", "$1bb").toString()));
// Assert.assertEquals("abbabbab", (selectable.regex("(.*)").replace("aa(a)", "$1bb").toString()));
System.out.println(selectable.regex("(.*)").replace("aa(a)", "$1bb").toString());
}
......
package us.codecraft.webmagic;
import org.junit.Test;
import static org.assertj.core.api.Assertions.assertThat;
/**
* @author code4crafter@gmail.com
*/
public class ResultItemsTest {
@Test
public void testOrderOfEntries() throws Exception {
ResultItems resultItems = new ResultItems();
resultItems.put("a", "a");
resultItems.put("b", "b");
resultItems.put("c", "c");
assertThat(resultItems.getAll().keySet()).containsExactly("a","b","c");
}
}
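`testOrderOfEntries` expects fields to come back in the order they were put. One common way to get that guarantee is to back the container with a `LinkedHashMap`; whether `ResultItems` does exactly this is an assumption, and `OrderedFieldsSketch` is a hypothetical stand-in.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a field container that preserves insertion order.
final class OrderedFieldsSketch {
    // LinkedHashMap iterates in insertion order; a plain HashMap gives no such guarantee.
    private final Map<String, Object> fields = new LinkedHashMap<>();

    OrderedFieldsSketch put(String key, Object value) {
        fields.put(key, value);
        return this;
    }

    /** Returns the keys in the order they were first inserted. */
    List<String> keys() {
        return new ArrayList<>(fields.keySet());
    }
}
```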
......@@ -37,7 +37,7 @@ public class SpiderTest {
@Test
public void testWaitAndNotify() throws InterruptedException {
for (int i = 0; i < 10000; i++) {
System.out.println("round" + i);
System.out.println("round " + i);
testRound();
}
}
......
......@@ -8,6 +8,8 @@ import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.selector.Html;
import java.io.UnsupportedEncodingException;
import static org.assertj.core.api.Assertions.assertThat;
import static org.junit.Assert.assertTrue;
......@@ -28,10 +30,16 @@ public class HttpClientDownloaderTest {
@Test
public void testDownloader() {
HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
Html html = httpClientDownloader.download("http://www.oschina.net");
Html html = httpClientDownloader.download("https://github.com");
assertTrue(!html.getText().isEmpty());
}
@Test(expected = IllegalArgumentException.class)
public void testDownloaderInIllegalUrl() throws UnsupportedEncodingException {
HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
httpClientDownloader.download("http://www.oschina.net/>");
}
@Test
public void testCycleTriedTimes() {
HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
......
......@@ -29,6 +29,6 @@ public class ExtractorsTest {
Assert.assertEquals("bb", and($("title"), regex("aa(bb)cc")).select(html2));
OrSelector or = or($("div h1 a", "innerHtml"), xpath("//title"));
Assert.assertEquals("aabbcc", or.select(html));
Assert.assertEquals("aabbcc", or.select(html2));
Assert.assertEquals("<title>aabbcc</title>", or.select(html2));
}
}
package us.codecraft.webmagic.selector;
import org.junit.Test;
import static org.assertj.core.api.Assertions.assertThat;
/**
* @author code4crafter@gmail.com
* @since 0.5.0
*/
public class JsonTest {
private String text = "callback({\"name\":\"json\"})";
@Test
public void testRemovePadding() throws Exception {
String name = new Json(text).removePadding("callback").jsonPath("$.name").get();
assertThat(name).isEqualTo("json");
}
}
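`removePadding` strips the JSONP wrapper so the remaining text can be queried as JSON. The sketch below shows the idea with plain string operations; it is not the `Json` class from this commit, and `JsonpSketch` with its handling of a trailing semicolon is an assumption.

```java
// Hypothetical sketch: strip a JSONP wrapper such as callback({...}).
final class JsonpSketch {

    /** Returns the text between "padding(" and the final ")", or the input unchanged. */
    static String removePadding(String text, String padding) {
        String trimmed = text.trim();
        if (trimmed.endsWith(";")) {
            // Tolerate the common "callback(...);" form.
            trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
        }
        String prefix = padding + "(";
        if (trimmed.startsWith(prefix) && trimmed.endsWith(")")) {
            return trimmed.substring(prefix.length(), trimmed.length() - 1);
        }
        return text;
    }
}
```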
package us.codecraft.webmagic.selector;
import org.junit.Test;
import java.util.List;
import static org.assertj.core.api.Assertions.assertThat;
/**
* @author code4crafter@gmail.com
*/
public class SelectorTest {
private String html = "<div><a href='http://whatever.com/aaa'></a></div><div><a href='http://whatever.com/bbb'></a></div>";
@Test
public void testChain() throws Exception {
Html selectable = new Html(html);
List<String> linksWithoutChain = selectable.links().all();
Selectable xpath = selectable.xpath("//div");
List<String> linksWithChainFirstCall = xpath.links().all();
List<String> linksWithChainSecondCall = xpath.links().all();
assertThat(linksWithoutChain).hasSameSizeAs(linksWithChainFirstCall);
assertThat(linksWithChainFirstCall).hasSameSizeAs(linksWithChainSecondCall);
}
}
package us.codecraft.webmagic.utils;
import org.junit.Test;
import static junit.framework.Assert.*;
/**
* @author code4crafter@gmail.com
*/
public class EnvironmentUtilTest {
@Test
public void test() {
assertTrue(EnvironmentUtil.useXsoup());
EnvironmentUtil.setUseXsoup(false);
assertFalse(EnvironmentUtil.useXsoup());
}
}
......@@ -3,6 +3,8 @@ package us.codecraft.webmagic.utils;
import org.junit.Assert;
import org.junit.Test;
import static org.assertj.core.api.Assertions.assertThat;
/**
* @author code4crafter@gmail.com <br>
* Date: 13-4-21
......@@ -12,19 +14,44 @@ public class UrlUtilsTest {
@Test
public void testFixRelativeUrl() {
String fixrelativeurl = UrlUtils.canonicalizeUrl("aa", "http://www.dianping.com/sh/ss/com");
System.out.println("fix: " + fixrelativeurl);
Assert.assertEquals("http://www.dianping.com/sh/ss/aa", fixrelativeurl);
fixrelativeurl = UrlUtils.canonicalizeUrl("../aa", "http://www.dianping.com/sh/ss/com");
Assert.assertEquals("http://www.dianping.com/sh/aa", fixrelativeurl);
fixrelativeurl = UrlUtils.canonicalizeUrl("..aa", "http://www.dianping.com/sh/ss/com");
Assert.assertEquals("http://www.dianping.com/sh/ss/..aa", fixrelativeurl);
fixrelativeurl = UrlUtils.canonicalizeUrl("../../aa", "http://www.dianping.com/sh/ss/com/");
Assert.assertEquals("http://www.dianping.com/sh/aa", fixrelativeurl);
fixrelativeurl = UrlUtils.canonicalizeUrl("../../aa", "http://www.dianping.com/sh/ss/com");
Assert.assertEquals("http://www.dianping.com/aa", fixrelativeurl);
String absoluteUrl = UrlUtils.canonicalizeUrl("aa", "http://www.dianping.com/sh/ss/com");
assertThat(absoluteUrl).isEqualTo("http://www.dianping.com/sh/ss/aa");
absoluteUrl = UrlUtils.canonicalizeUrl("../aa", "http://www.dianping.com/sh/ss/com");
assertThat(absoluteUrl).isEqualTo("http://www.dianping.com/sh/aa");
absoluteUrl = UrlUtils.canonicalizeUrl("..aa", "http://www.dianping.com/sh/ss/com");
assertThat(absoluteUrl).isEqualTo("http://www.dianping.com/sh/ss/..aa");
absoluteUrl = UrlUtils.canonicalizeUrl("../../aa", "http://www.dianping.com/sh/ss/com/");
assertThat(absoluteUrl).isEqualTo("http://www.dianping.com/sh/aa");
absoluteUrl = UrlUtils.canonicalizeUrl("../../aa", "http://www.dianping.com/sh/ss/com");
assertThat(absoluteUrl).isEqualTo("http://www.dianping.com/aa");
}
@Test
public void testFixAllRelativeHrefs() {
String originHtml = "<a href=\"/start\">";
String replacedHtml = UrlUtils.fixAllRelativeHrefs(originHtml, "http://www.dianping.com/");
assertThat(replacedHtml).isEqualTo("<a href=\"http://www.dianping.com/start\">");
originHtml = "<a href=\"/start a\">";
replacedHtml = UrlUtils.fixAllRelativeHrefs(originHtml, "http://www.dianping.com/");
assertThat(replacedHtml).isEqualTo("<a href=\"http://www.dianping.com/start%20a\">");
originHtml = "<a href='/start a'>";
replacedHtml = UrlUtils.fixAllRelativeHrefs(originHtml, "http://www.dianping.com/");
assertThat(replacedHtml).isEqualTo("<a href=\"http://www.dianping.com/start%20a\">");
originHtml = "<a href=/start tag>";
replacedHtml = UrlUtils.fixAllRelativeHrefs(originHtml, "http://www.dianping.com/");
assertThat(replacedHtml).isEqualTo("<a href=\"http://www.dianping.com/start\" tag>");
}
@Test
public void test(){
UrlUtils.canonicalizeUrl("start tag", "http://www.dianping.com/");
}
@Test
......
......@@ -8,21 +8,11 @@
</layout>
</appender>
<logger name="org.springframework" additivity="false">
<level value="warn" />
<appender-ref ref="stdout" />
</logger>
<logger name="org.apache" additivity="false">
<level value="warn" />
<appender-ref ref="stdout" />
</logger>
<logger name="net.sf.ehcache" additivity="false">
<level value="warn" />
<appender-ref ref="stdout" />
</logger>
<root>
<level value="info" />
<appender-ref ref="stdout" />
......
......@@ -3,17 +3,13 @@
<parent>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-parent</artifactId>
<version>0.4.3</version>
<version>0.5.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>webmagic-extension</artifactId>
<dependencies>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
</dependency>
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
......@@ -28,11 +24,6 @@
<groupId>junit</groupId>
<artifactId>junit</artifactId>
</dependency>
<dependency>
<groupId>com.jayway.jsonpath</groupId>
<artifactId>json-path</artifactId>
<version>0.8.1</version>
</dependency>
</dependencies>
</project>
package us.codecraft.webmagic.configurable;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.utils.Experimental;
import java.util.List;
/**
* @author code4crafter@gmail.com <br>
*/
@Experimental
public class ConfigurablePageProcessor implements PageProcessor {
private Site site;
private List<ExtractRule> extractRules;
public ConfigurablePageProcessor(Site site, List<ExtractRule> extractRules) {
this.site = site;
this.extractRules = extractRules;
}
@Override
public void process(Page page) {
for (ExtractRule extractRule : extractRules) {
if (extractRule.isMulti()) {
List<String> results = page.getHtml().selectDocumentForList(extractRule.getSelector());
if (extractRule.isNotNull() && results.size() == 0) {
page.setSkip(true);
} else {
page.getResultItems().put(extractRule.getFieldName(), results);
}
} else {
String result = page.getHtml().selectDocument(extractRule.getSelector());
if (extractRule.isNotNull() && result == null) {
page.setSkip(true);
} else {
page.getResultItems().put(extractRule.getFieldName(), result);
}
}
}
}
@Override
public Site getSite() {
return site;
}
}
package us.codecraft.webmagic.configurable;
/**
* @author code4crafter@gmail.com
* @date 14-4-5
*/
public enum ExpressionType {
XPath, Regex, Css, JsonPath;
}
package us.codecraft.webmagic.configurable;
import us.codecraft.webmagic.selector.JsonPathSelector;
import us.codecraft.webmagic.selector.Selector;
import static us.codecraft.webmagic.selector.Selectors.*;
/**
* @author code4crafter@gmail.com
* @date 14-4-5
*/
public class ExtractRule {
private String fieldName;
private ExpressionType expressionType;
private String expressionValue;
private String[] expressionParams;
private boolean multi = false;
private volatile Selector selector;
private boolean notNull = false;
public String getFieldName() {
return fieldName;
}
public void setFieldName(String fieldName) {
this.fieldName = fieldName;
}
public ExpressionType getExpressionType() {
return expressionType;
}
public void setExpressionType(ExpressionType expressionType) {
this.expressionType = expressionType;
}
public String getExpressionValue() {
return expressionValue;
}
public void setExpressionValue(String expressionValue) {
this.expressionValue = expressionValue;
}
public String[] getExpressionParams() {
return expressionParams;
}
public void setExpressionParams(String[] expressionParams) {
this.expressionParams = expressionParams;
}
public boolean isMulti() {
return multi;
}
public void setMulti(boolean multi) {
this.multi = multi;
}
public Selector getSelector() {
if (selector == null) {
synchronized (this) {
if (selector == null) {
selector = compileSelector();
}
}
}
return selector;
}
private Selector compileSelector() {
switch (expressionType) {
case Css:
if (expressionParams.length >= 1) {
return $(expressionValue, expressionParams[0]);
} else {
return $(expressionValue);
}
case XPath:
return xpath(expressionValue);
case Regex:
if (expressionParams.length >= 1) {
return regex(expressionValue, Integer.parseInt(expressionParams[0]));
} else {
return regex(expressionValue);
}
case JsonPath:
return new JsonPathSelector(expressionValue);
default:
return xpath(expressionValue);
}
}
public void setSelector(Selector selector) {
this.selector = selector;
}
public boolean isNotNull() {
return notNull;
}
public void setNotNull(boolean notNull) {
this.notNull = notNull;
}
}
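`getSelector` compiles the selector lazily with double-checked locking, which is only safe because the field is `volatile`. A self-contained sketch of the same pattern follows, with `java.util.regex.Pattern` standing in for the selector; `LazyRuleSketch` is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: lazy, thread-safe initialisation with double-checked locking.
final class LazyRuleSketch {
    private final String expression;
    // volatile ensures other threads never observe a partially constructed object.
    private volatile Pattern compiled;

    LazyRuleSketch(String expression) {
        this.expression = expression;
    }

    Pattern getCompiled() {
        Pattern local = compiled;          // one volatile read on the fast path
        if (local == null) {
            synchronized (this) {
                local = compiled;          // re-check under the lock
                if (local == null) {
                    local = Pattern.compile(expression);
                    compiled = local;
                }
            }
        }
        return local;
    }
}
```

The local variable is an optional refinement: it avoids a second volatile read when the value is already initialised.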
package us.codecraft.webmagic.configurable;
import us.codecraft.webmagic.processor.PageProcessor;
import java.util.Map;
/**
* Injects properties into an object via the {@link Inject} annotation.
*
* @author yihua.huang@dianping.com
*/
public class PropertyLoader<T> {
public T load(T object, Map<String, String> properties) {
return object;
}
}
package us.codecraft.webmagic.example;
import java.util.List;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.configurable.Inject;
import us.codecraft.webmagic.processor.PageProcessor;
/**
* @author code4crafter@gmail.com <br>
*/
public class ConfigurableBlogPageProcessor implements PageProcessor {
private Site site = Site.me().setDomain("my.oschina.net");
@Inject("linkRegex")
private String linkRegex;
@Inject("titleXpath")
private String titleXpath;
@Inject("contentXpath")
private String contentXpath;
@Inject("tagsXpath")
private String tagsXpath;
@Override
public void process(Page page) {
List<String> links = page.getHtml().links().regex(linkRegex).all();
page.addTargetRequests(links);
page.putField("title", page.getHtml().xpath(titleXpath).toString());
if (page.getResultItems().get("title") == null) {
//skip this page
page.setSkip(true);
}
page.putField("content", page.getHtml().smartContent().toString());
page.putField("tags", page.getHtml().xpath(tagsXpath).all());
}
@Override
public Site getSite() {
return site;
}
public static void main(String[] args) {
Spider.create(new ConfigurableBlogPageProcessor()).addUrl("http://my.oschina.net/flashsword/blog").thread(2).run();
}
}
package us.codecraft.webmagic.example;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.monitor.SpiderMonitor;
import us.codecraft.webmagic.processor.example.GithubRepoPageProcessor;
import us.codecraft.webmagic.processor.example.OschinaBlogPageProcessor;
/**
* @author code4crafer@gmail.com
* @since 0.5.0
*/
public class MonitorExample {
public static void main(String[] args) throws Exception {
Spider oschinaSpider = Spider.create(new OschinaBlogPageProcessor())
.addUrl("http://my.oschina.net/flashsword/blog");
Spider githubSpider = Spider.create(new GithubRepoPageProcessor())
.addUrl("https://github.com/code4craft");
SpiderMonitor.instance().register(oschinaSpider);
SpiderMonitor.instance().register(githubSpider);
oschinaSpider.start();
githubSpider.start();
}
}
......@@ -26,11 +26,11 @@ public class OschinaBlog {
@ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
private List<String> tags;
@Formatter("yyyy-MM-dd HH:mm")
@ExtractBy("//div[@class='BlogStat']/regex('\\d+-\\d+-\\d+\\s+\\d+:\\d+')")
private Date date;
public static void main(String[] args) {
//results will be saved to "/data/webmagic/" in json format
OOSpider.create(Site.me(), new JsonFilePageModelPipeline("/data/webmagic/"), OschinaBlog.class)
.addUrl("http://my.oschina.net/flashsword/blog").run();
}
......
package us.codecraft.webmagic.example;
import org.apache.log4j.Logger;
import us.codecraft.webmagic.*;
import us.codecraft.webmagic.handler.CompositePageProcessor;
import us.codecraft.webmagic.handler.CompositePipeline;
import us.codecraft.webmagic.handler.PatternProcessor;
import us.codecraft.webmagic.handler.RequestMatcher;
/**
* Created with IntelliJ IDEA.
* User: Sebastian MA
* Date: April 04, 2014
* Time: 21:23
*/
public class PatternProcessorExample {
private static Logger log = Logger.getLogger(PatternProcessorExample.class);
public static void main(String... args) {
// define a PatternProcessor that handles only GitHub repository pages ("https://github.com/<user>/<repo>")
PatternProcessor githubRepoProcessor = new PatternProcessor("https://github\\.com/[\\w\\-]+/[\\w\\-]+") {
@Override
public RequestMatcher.MatchOther processPage(Page page) {
page.putField("reponame", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
return RequestMatcher.MatchOther.YES;
}
@Override
public RequestMatcher.MatchOther processResult(ResultItems resultItems, Task task) {
log.info("Extracting from repo " + resultItems.getRequest());
System.out.println("Repo name: "+resultItems.get("reponame"));
return RequestMatcher.MatchOther.YES;
}
};
PatternProcessor githubUserProcessor = new PatternProcessor("https://github\\.com/[\\w\\-]+") {
@Override
public RequestMatcher.MatchOther processPage(Page page) {
log.info("Extracting from " + page.getUrl());
page.addTargetRequests(page.getHtml().links().regex("https://github\\.com/[\\w\\-]+/[\\w\\-]+").all());
page.addTargetRequests(page.getHtml().links().regex("https://github\\.com/[\\w\\-]+").all());
page.putField("username", page.getHtml().xpath("//span[@class='vcard-fullname']/text()").toString());
return RequestMatcher.MatchOther.YES;
}
@Override
public RequestMatcher.MatchOther processResult(ResultItems resultItems, Task task) {
System.out.println("User name: "+resultItems.get("username"));
return RequestMatcher.MatchOther.YES;
}
};
CompositePageProcessor pageProcessor = new CompositePageProcessor(Site.me().setDomain("github.com").setRetryTimes(3));
CompositePipeline pipeline = new CompositePipeline();
pageProcessor.setSubPageProcessors(githubRepoProcessor, githubUserProcessor);
pipeline.setSubPipeline(githubRepoProcessor, githubUserProcessor);
Spider.create(pageProcessor).addUrl("https://github.com/code4craft").thread(5).addPipeline(pipeline).runAsync();
}
}
package us.codecraft.webmagic.handler;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import java.util.ArrayList;
import java.util.List;
/**
* @author code4crafter@gmail.com
* @date 14-4-5
*/
public class CompositePageProcessor implements PageProcessor {
private Site site;
private List<SubPageProcessor> subPageProcessors = new ArrayList<SubPageProcessor>();
public CompositePageProcessor(Site site) {
this.site = site;
}
@Override
public void process(Page page) {
for (SubPageProcessor subPageProcessor : subPageProcessors) {
if (subPageProcessor.match(page.getRequest())) {
SubPageProcessor.MatchOther matchOtherProcessorProcessor = subPageProcessor.processPage(page);
if (matchOtherProcessorProcessor == null || matchOtherProcessorProcessor != SubPageProcessor.MatchOther.YES) {
return;
}
}
}
}
public CompositePageProcessor setSite(Site site) {
this.site = site;
return this;
}
public CompositePageProcessor addSubPageProcessor(SubPageProcessor subPageProcessor) {
this.subPageProcessors.add(subPageProcessor);
return this;
}
public CompositePageProcessor setSubPageProcessors(SubPageProcessor... subPageProcessors) {
this.subPageProcessors = new ArrayList<SubPageProcessor>();
for (SubPageProcessor subPageProcessor : subPageProcessors) {
this.subPageProcessors.add(subPageProcessor);
}
return this;
}
@Override
public Site getSite() {
return site;
}
}
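`process` walks the sub-processors in order, runs each one that matches, and stops as soon as one returns anything other than `MatchOther.YES`. The sketch below reproduces that control flow with plain strings instead of `Page` and `Request`; the `SubHandler` interface, the `handler` factory and `demo` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: ordered dispatch that stops when a handler says NO.
final class CompositeDispatchSketch {
    enum MatchOther { YES, NO }

    interface SubHandler {
        boolean match(String url);
        MatchOther handle(String url, List<String> log);
    }

    /** Builds a handler that records its name and returns a fixed decision. */
    static SubHandler handler(String name, Predicate<String> matcher, MatchOther result) {
        return new SubHandler() {
            @Override
            public boolean match(String url) {
                return matcher.test(url);
            }

            @Override
            public MatchOther handle(String url, List<String> log) {
                log.add(name);
                return result;
            }
        };
    }

    /** Returns the names of the handlers that ran, in order. */
    static List<String> dispatch(String url, List<SubHandler> handlers) {
        List<String> log = new ArrayList<>();
        for (SubHandler h : handlers) {
            if (h.match(url)) {
                if (h.handle(url, log) != MatchOther.YES) {
                    break; // null or NO: later handlers are skipped
                }
            }
        }
        return log;
    }

    static List<String> demo() {
        return dispatch("https://github.com/code4craft", Arrays.asList(
                handler("first", u -> true, MatchOther.YES),
                handler("skipped", u -> false, MatchOther.YES),
                handler("second", u -> true, MatchOther.NO),
                handler("third", u -> true, MatchOther.YES)));
    }
}
```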
package us.codecraft.webmagic.handler;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
import java.util.ArrayList;
import java.util.List;
/**
* @author code4crafer@gmail.com
*/
public class CompositePipeline implements Pipeline {
private List<SubPipeline> subPipelines = new ArrayList<SubPipeline>();
@Override
public void process(ResultItems resultItems, Task task) {
for (SubPipeline subPipeline : subPipelines) {
if (subPipeline.match(resultItems.getRequest())) {
RequestMatcher.MatchOther matchOtherProcessorProcessor = subPipeline.processResult(resultItems, task);
if (matchOtherProcessorProcessor == null || matchOtherProcessorProcessor != RequestMatcher.MatchOther.YES) {
return;
}
}
}
}
public CompositePipeline addSubPipeline(SubPipeline subPipeline) {
this.subPipelines.add(subPipeline);
return this;
}
public CompositePipeline setSubPipeline(SubPipeline... subPipelines) {
this.subPipelines = new ArrayList<SubPipeline>();
for (SubPipeline subPipeline : subPipelines) {
this.subPipelines.add(subPipeline);
}
return this;
}
}
package us.codecraft.webmagic.handler;
/**
* @author code4crafer@gmail.com
*/
public abstract class PatternProcessor extends PatternRequestMatcher implements SubPipeline, SubPageProcessor {
/**
* @param pattern url pattern to handle
*/
public PatternProcessor(String pattern) {
super(pattern);
}
}
package us.codecraft.webmagic.handler;
import us.codecraft.webmagic.Request;
import java.util.regex.Pattern;
/**
* Created with IntelliJ IDEA.
* User: Sebastian MA
* Date: April 03, 2014
* Time: 10:00
* <p>
* Matches requests by URL pattern. Subclasses such as {@link PatternProcessor} are in charge of
* both page extraction and data processing by implementing the two abstract methods.
*/
public abstract class PatternRequestMatcher implements RequestMatcher {
/**
* URL pattern to match. Only pages whose URL matches are handled.
*/
protected String pattern;
private Pattern patternCompiled;
/**
* @param pattern url pattern to handle
*/
public PatternRequestMatcher(String pattern) {
this.pattern = pattern;
this.patternCompiled = Pattern.compile(pattern);
}
@Override
public boolean match(Request request) {
return patternCompiled.matcher(request.getUrl()).matches();
}
}
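`match` uses `Matcher.matches()`, which requires the whole URL to match the pattern, not just a prefix. In the GitHub example this means the user pattern does not also capture repository URLs. The sketch below contrasts `matches()` with `find()`; `FullMatchSketch` is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: whole-string matching versus substring search.
final class FullMatchSketch {

    /** True only when the entire URL matches the regex. */
    static boolean fullMatch(String regex, String url) {
        return Pattern.compile(regex).matcher(url).matches();
    }

    /** True when any part of the URL matches the regex. */
    static boolean partialMatch(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }
}
```

With `find()` the user pattern would also fire on repository pages, so both sub-processors would run for the same request.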
package us.codecraft.webmagic.handler;
import us.codecraft.webmagic.Request;
/**
* @author code4crafer@gmail.com
* @since 0.5.0
*/
public interface RequestMatcher {
/**
* Checks whether the request should be processed.<br>
* Please DO NOT change the page status in this method.
*
* @param page the request to check
*
* @return true if the request should be processed
*/
public boolean match(Request page);
public enum MatchOther {
YES, NO
}
}
package us.codecraft.webmagic.handler;
import us.codecraft.webmagic.Page;
/**
* @author code4crafter@gmail.com
* @date 14-4-5
*/
public interface SubPageProcessor extends RequestMatcher {
/**
* Processes the page: extracts URLs to fetch, then extracts and stores the data.
*
* @param page the page to process
*
* @return whether to continue matching the remaining sub-processors
*/
public MatchOther processPage(Page page);
}
package us.codecraft.webmagic.handler;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
/**
* @author code4crafer@gmail.com
* @since 0.5.0
*/
public interface SubPipeline extends RequestMatcher {
/**
* Processes the extracted results, for example by storing them.
*
* @param resultItems the results extracted from the page
* @param task the task the results belong to
* @return whether to continue matching the remaining sub-pipelines
*/
public MatchOther processResult(ResultItems resultItems, Task task);
}
......@@ -7,9 +7,7 @@ import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selector;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
......@@ -25,8 +23,6 @@ class ModelPageProcessor implements PageProcessor {
private Site site;
private Set<Pattern> targetUrlPatterns = new HashSet<Pattern>();
public static ModelPageProcessor create(Site site, Class... clazzs) {
ModelPageProcessor modelPageProcessor = new ModelPageProcessor(site);
for (Class clazz : clazzs) {
......@@ -38,8 +34,6 @@ class ModelPageProcessor implements PageProcessor {
public ModelPageProcessor addPageModel(Class clazz) {
PageModelExtractor pageModelExtractor = PageModelExtractor.create(clazz);
targetUrlPatterns.addAll(pageModelExtractor.getTargetUrlPatterns());
targetUrlPatterns.addAll(pageModelExtractor.getHelpUrlPatterns());
pageModelExtractorList.add(pageModelExtractor);
return this;
}
......@@ -55,11 +49,14 @@ class ModelPageProcessor implements PageProcessor {
extractLinks(page, pageModelExtractor.getTargetUrlRegionSelector(), pageModelExtractor.getTargetUrlPatterns());
Object process = pageModelExtractor.process(page);
if (process == null || (process instanceof List && ((List) process).size() == 0)) {
page.getResultItems().setSkip(true);
continue;
}
postProcessPageModel(pageModelExtractor.getClazz(), process);
page.putField(pageModelExtractor.getClazz().getCanonicalName(), process);
}
if (page.getResultItems().getAll().size() == 0) {
page.getResultItems().setSkip(true);
}
}
private void extractLinks(Page page, Selector urlRegionSelector, List<Pattern> urlPatterns) {
......@@ -67,7 +64,7 @@ class ModelPageProcessor implements PageProcessor {
if (urlRegionSelector == null) {
links = page.getHtml().links().all();
} else {
links = urlRegionSelector.selectList(page.getHtml().toString());
links = page.getHtml().selectList(urlRegionSelector).links().all();
}
for (String link : links) {
for (Pattern targetUrlPattern : urlPatterns) {
......
......@@ -9,6 +9,7 @@ import us.codecraft.webmagic.model.formatter.BasicTypeFormatter;
import us.codecraft.webmagic.model.formatter.ObjectFormatter;
import us.codecraft.webmagic.model.formatter.ObjectFormatters;
import us.codecraft.webmagic.selector.*;
import us.codecraft.webmagic.utils.ClassUtils;
import us.codecraft.webmagic.utils.ExtractorUtils;
import java.lang.annotation.Annotation;
......@@ -53,7 +54,7 @@ class PageModelExtractor {
this.clazz = clazz;
initClassExtractors();
fieldExtractors = new ArrayList<FieldExtractor>();
for (Field field : clazz.getDeclaredFields()) {
for (Field field : ClassUtils.getFieldsIncludeSuperClass(clazz)) {
field.setAccessible(true);
FieldExtractor fieldExtractor = getAnnotationExtractBy(clazz, field);
FieldExtractor fieldExtractorTmp = getAnnotationExtractCombo(clazz, field);
......@@ -76,9 +77,21 @@ class PageModelExtractor {
}
private void checkFormat(Field field, FieldExtractor fieldExtractor) {
//check custom formatter
Formatter formatter = field.getAnnotation(Formatter.class);
if (formatter != null && !formatter.formatter().equals(ObjectFormatter.class)) {
if (formatter != null) {
if (!formatter.formatter().equals(ObjectFormatter.class)) {
ObjectFormatter objectFormatter = initFormatter(formatter.formatter());
objectFormatter.initParam(formatter.value());
fieldExtractor.setObjectFormatter(objectFormatter);
return;
}
}
}
if (!fieldExtractor.isMulti() && !String.class.isAssignableFrom(field.getType())) {
Class<?> fieldClazz = BasicTypeFormatter.detectBasicClass(field.getType());
ObjectFormatter objectFormatter = getObjectFormatter(field, fieldClazz);
ObjectFormatter objectFormatter = getObjectFormatter(field, fieldClazz, formatter);
if (objectFormatter == null) {
throw new IllegalStateException("Can't find formatter for field " + field.getName() + " of type " + fieldClazz);
} else {
......@@ -88,10 +101,9 @@ class PageModelExtractor {
if (!List.class.isAssignableFrom(field.getType())) {
throw new IllegalStateException("Field " + field.getName() + " must be list");
}
Formatter formatter = field.getAnnotation(Formatter.class);
if (formatter != null) {
if (!formatter.subClazz().equals(Void.class)) {
ObjectFormatter objectFormatter = getObjectFormatter(field, formatter.subClazz());
ObjectFormatter objectFormatter = getObjectFormatter(field, formatter.subClazz(), formatter);
if (objectFormatter == null) {
throw new IllegalStateException("Can't find formatter for field " + field.getName() + " of type " + formatter.subClazz());
} else {
......@@ -102,14 +114,7 @@ class PageModelExtractor {
}
}
private ObjectFormatter getObjectFormatter(Field field, Class<?> fieldClazz) {
Formatter formatter = field.getAnnotation(Formatter.class);
if (formatter != null) {
if (!formatter.formatter().equals(ObjectFormatter.class)) {
ObjectFormatter objectFormatter = initFormatter(formatter.formatter());
objectFormatter.initParam(formatter.value());
}
}
private ObjectFormatter getObjectFormatter(Field field, Class<?> fieldClazz, Formatter formatter) {
return initFormatter(ObjectFormatters.get(fieldClazz));
}
......@@ -340,9 +345,7 @@ class PageModelExtractor {
private Object convert(String value, ObjectFormatter objectFormatter) {
try {
Object format = objectFormatter.format(value);
if (logger.isDebugEnabled()) {
logger.debug("String " + value + " is converted to " + format);
}
logger.debug("String {} is converted to {}", value, format);
return format;
} catch (Exception e) {
logger.error("convert " + value + " to " + objectFormatter.clazz() + " error!", e);
......
......@@ -21,7 +21,7 @@ public @interface Formatter {
*
* @return formatter params
*/
String[] value();
String[] value() default "";
/**
* Specify the class of the field, or the class of the elements when the field is a collection. <br/>
......
......@@ -10,7 +10,8 @@ import java.util.Date;
*/
public class DateFormatter implements ObjectFormatter<Date> {
private String[] datePatterns = new String[]{"yyyy-MM-dd HH:mm"};
public static final String[] DEFAULT_PATTERN = new String[]{"yyyy-MM-dd HH:mm"};
private String[] datePatterns = DEFAULT_PATTERN;
@Override
public Date format(String raw) throws Exception {
......@@ -24,6 +25,8 @@ public class DateFormatter implements ObjectFormatter<Date> {
@Override
public void initParam(String[] extra) {
datePatterns = extra;
if (extra != null && !(extra.length == 1 && extra[0].length() == 0)) {
datePatterns = extra;
}
}
}
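The guard added to `initParam` matters because `@Formatter` now declares `value() default ""`, so an annotation without explicit patterns arrives as a one-element array holding an empty string. A minimal sketch of that fallback follows, assuming a stand-in class (`DatePatternSketch` is hypothetical, and the parsing here uses `SimpleDateFormat` as an assumption; it is not necessarily what `DateFormatter.format` does internally):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical stand-in for DateFormatter; not the WebMagic class itself.
class DatePatternSketch {
    static final String[] DEFAULT_PATTERN = {"yyyy-MM-dd HH:mm"};
    private String[] datePatterns = DEFAULT_PATTERN;

    // Mirrors the guard in the diff: {""} is what an annotation value with `default ""` yields,
    // so it is treated as "no patterns supplied" and the default is kept.
    void initParam(String[] extra) {
        if (extra != null && !(extra.length == 1 && extra[0].length() == 0)) {
            datePatterns = extra;
        }
    }

    String[] patterns() {
        return datePatterns;
    }

    // Assumption: try each configured pattern in order with strict parsing.
    Date format(String raw) {
        for (String pattern : datePatterns) {
            try {
                SimpleDateFormat parser = new SimpleDateFormat(pattern);
                parser.setLenient(false);
                return parser.parse(raw);
            } catch (ParseException ignored) {
                // fall through to the next pattern
            }
        }
        throw new IllegalArgumentException("No pattern matched: " + raw);
    }
}
```

Calling `initParam(new String[]{""})` leaves `DEFAULT_PATTERN` in place, while `initParam(new String[]{"yyyy/MM/dd"})` replaces it.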
package us.codecraft.webmagic.monitor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.SpiderListener;
import us.codecraft.webmagic.utils.Experimental;
import javax.management.*;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
/**
* @author code4crafter@gmail.com
* @since 0.5.0
*/
@Experimental
public class SpiderMonitor {
private static SpiderMonitor INSTANCE = new SpiderMonitor();
private AtomicBoolean started = new AtomicBoolean(false);
private Logger logger = LoggerFactory.getLogger(getClass());
private MBeanServer mbeanServer;
private String jmxServerName;
private List<SpiderStatusMXBean> spiderStatuses = new ArrayList<SpiderStatusMXBean>();
protected SpiderMonitor() {
jmxServerName = "WebMagic";
mbeanServer = ManagementFactory.getPlatformMBeanServer();
}
/**
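* Register spiders for monitoring. Each spider gets a listener and is exposed as a JMX MXBean.
*
* @param spiders the spiders to monitor
* @return this monitor, to allow chained calls
* @throws JMException if an MXBean cannot be registered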
*/
public synchronized SpiderMonitor register(Spider... spiders) throws JMException {
for (Spider spider : spiders) {
MonitorSpiderListener monitorSpiderListener = new MonitorSpiderListener();
if (spider.getSpiderListeners() == null) {
List<SpiderListener> spiderListeners = new ArrayList<SpiderListener>();
spiderListeners.add(monitorSpiderListener);
spider.setSpiderListeners(spiderListeners);
} else {
spider.getSpiderListeners().add(monitorSpiderListener);
}
SpiderStatusMXBean spiderStatusMBean = getSpiderStatusMBean(spider, monitorSpiderListener);
registerMBean(spiderStatusMBean);
spiderStatuses.add(spiderStatusMBean);
}
return this;
}
protected SpiderStatusMXBean getSpiderStatusMBean(Spider spider, MonitorSpiderListener monitorSpiderListener) {
return new SpiderStatus(spider, monitorSpiderListener);
}
public static SpiderMonitor instance() {
return INSTANCE;
}
public class MonitorSpiderListener implements SpiderListener {
private final AtomicInteger successCount = new AtomicInteger(0);
private final AtomicInteger errorCount = new AtomicInteger(0);
private List<String> errorUrls = Collections.synchronizedList(new ArrayList<String>());
@Override
public void onSuccess(Request request) {
successCount.incrementAndGet();
}
@Override
public void onError(Request request) {
errorUrls.add(request.getUrl());
errorCount.incrementAndGet();
}
public AtomicInteger getSuccessCount() {
return successCount;
}
public AtomicInteger getErrorCount() {
return errorCount;
}
public List<String> getErrorUrls() {
return errorUrls;
}
}
protected void registerMBean(SpiderStatusMXBean spiderStatus) throws MalformedObjectNameException, InstanceAlreadyExistsException, MBeanRegistrationException, NotCompliantMBeanException {
ObjectName objName = new ObjectName(jmxServerName + ":name=" + spiderStatus.getName());
mbeanServer.registerMBean(spiderStatus, objName);
}
}
package us.codecraft.webmagic.monitor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.scheduler.MonitorableScheduler;
import java.util.Date;
import java.util.List;
/**
* @author code4crafter@gmail.com
* @since 0.5.0
*/
public class SpiderStatus implements SpiderStatusMXBean {
protected final Spider spider;
protected Logger logger = LoggerFactory.getLogger(getClass());
protected final SpiderMonitor.MonitorSpiderListener monitorSpiderListener;
public SpiderStatus(Spider spider, SpiderMonitor.MonitorSpiderListener monitorSpiderListener) {
this.spider = spider;
this.monitorSpiderListener = monitorSpiderListener;
}
public String getName() {
return spider.getUUID();
}
public int getLeftPageCount() {
if (spider.getScheduler() instanceof MonitorableScheduler) {
return ((MonitorableScheduler) spider.getScheduler()).getLeftRequestsCount(spider);
}
logger.warn("Failed to get leftPageCount; use a Scheduler that implements MonitorableScheduler to monitor counts.");
return -1;
}
public int getTotalPageCount() {
if (spider.getScheduler() instanceof MonitorableScheduler) {
return ((MonitorableScheduler) spider.getScheduler()).getTotalRequestsCount(spider);
}
logger.warn("Failed to get totalPageCount; use a Scheduler that implements MonitorableScheduler to monitor counts.");
return -1;
}
@Override
public int getSuccessPageCount() {
return monitorSpiderListener.getSuccessCount().get();
}
@Override
public int getErrorPageCount() {
return monitorSpiderListener.getErrorCount().get();
}
public List<String> getErrorPages() {
return monitorSpiderListener.getErrorUrls();
}
@Override
public String getStatus() {
return spider.getStatus().name();
}
@Override
public int getThread() {
return spider.getThreadAlive();
}
public void start() {
spider.start();
}
public void stop() {
spider.stop();
}
@Override
public Date getStartTime() {
return spider.getStartTime();
}
@Override
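public int getPagePerSecond() {
// Divide as long before narrowing to int; report 0 until a full second has elapsed to avoid dividing by zero.
long runSeconds = (System.currentTimeMillis() - getStartTime().getTime()) / 1000;
return runSeconds <= 0 ? 0 : (int) (getSuccessPageCount() / runSeconds);
}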
}
package us.codecraft.webmagic.monitor;
import java.util.Date;
import java.util.List;
/**
* @author code4crafter@gmail.com
* @since 0.5.0
*/
public interface SpiderStatusMXBean {
public String getName();
public String getStatus();
public int getThread();
public int getTotalPageCount();
public int getLeftPageCount();
public int getSuccessPageCount();
public int getErrorPageCount();
public List<String> getErrorPages();
public void start();
public void stop();
public Date getStartTime();
public int getPagePerSecond();
}
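For readers unfamiliar with how `SpiderMonitor.registerMBean` ties into this interface, here is a minimal, self-contained sketch of registering an MXBean on the platform MBean server and reading an attribute back. All class names (`JmxSketch`, `CounterMXBean`, `Counter`) and the `WebMagicSketch` domain are hypothetical; only the `javax.management` calls are real JDK API.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical names throughout; this is not the WebMagic monitor.
class JmxSketch {

    // JMX treats a public interface whose name ends in "MXBean" as an MXBean contract.
    public interface CounterMXBean {
        String getName();
        int getSuccessPageCount();
    }

    public static class Counter implements CounterMXBean {
        private final String name;
        private final AtomicInteger success = new AtomicInteger();

        public Counter(String name) {
            this.name = name;
        }

        public void onSuccess() {
            success.incrementAndGet();
        }

        @Override
        public String getName() {
            return name;
        }

        @Override
        public int getSuccessPageCount() {
            return success.get();
        }
    }

    // Registers the bean under "<domain>:name=<bean name>" and reads one attribute through the server.
    static int registerAndRead(String domain, Counter counter) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName objName = new ObjectName(domain + ":name=" + counter.getName());
            if (server.isRegistered(objName)) {
                server.unregisterMBean(objName); // keeps the sketch re-runnable
            }
            server.registerMBean(counter, objName);
            // The getter getSuccessPageCount() is exposed as the attribute "SuccessPageCount".
            return (Integer) server.getAttribute(objName, "SuccessPageCount");
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Once registered this way, the bean should also be visible in a JMX client such as JConsole under the chosen domain.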
......@@ -23,7 +23,7 @@ import java.util.concurrent.atomic.AtomicInteger;
* @author code4crafter@gmail.com <br>
* @since 0.2.0
*/
public class FileCacheQueueScheduler implements Scheduler {
public class FileCacheQueueScheduler extends LocalDuplicatedRemoveScheduler {
private Logger logger = LoggerFactory.getLogger(getClass());
......@@ -145,18 +145,12 @@ public class FileCacheQueueScheduler implements Scheduler {
}
@Override
public synchronized void push(Request request, Task task) {
protected void pushWhenNoDuplicate(Request request, Task task) {
if (!inited.get()) {
init(task);
}
if (logger.isDebugEnabled()) {
logger.debug("push to queue " + request.getUrl());
}
if (urls.add(request.getUrl())) {
queue.add(request);
fileUrlWriter.println(request.getUrl());
}
queue.add(request);
fileUrlWriter.println(request.getUrl());
}
@Override
......@@ -167,4 +161,9 @@ public class FileCacheQueueScheduler implements Scheduler {
fileCursorWriter.println(cursor.incrementAndGet());
return queue.poll();
}
@Override
public int getLeftRequestsCount(Task task) {
return queue.size();
}
}
......@@ -14,7 +14,7 @@ import us.codecraft.webmagic.Task;
* @author code4crafter@gmail.com <br>
* @since 0.2.0
*/
public class RedisScheduler implements Scheduler {
public class RedisScheduler extends DuplicatedRemoveScheduler implements MonitorableScheduler {
private JedisPool pool;
......@@ -33,21 +33,39 @@ public class RedisScheduler implements Scheduler {
}
@Override
public synchronized void push(Request request, Task task) {
public void resetDuplicateCheck(Task task) {
Jedis jedis = pool.getResource();
try {
// if cycleRetriedTimes is set, allow duplicated.
Object cycleRetriedTimes = request.getExtra(Request.CYCLE_TRIED_TIMES);
// use set to remove duplicate url
if (cycleRetriedTimes != null || !jedis.sismember(SET_PREFIX + task.getUUID(), request.getUrl())) {
// use list to store queue
jedis.rpush(QUEUE_PREFIX + task.getUUID(), request.getUrl());
jedis.sadd(SET_PREFIX + task.getUUID(), request.getUrl());
if (request.getExtras() != null) {
String field = DigestUtils.shaHex(request.getUrl());
String value = JSON.toJSONString(request);
jedis.hset((ITEM_PREFIX + task.getUUID()), field, value);
}
jedis.del(getSetKey(task));
} finally {
pool.returnResource(jedis);
}
}
@Override
protected boolean isDuplicate(Request request, Task task) {
Jedis jedis = pool.getResource();
try {
boolean isDuplicate = jedis.sismember(getSetKey(task), request.getUrl());
if (!isDuplicate) {
jedis.sadd(getSetKey(task), request.getUrl());
}
return isDuplicate;
} finally {
pool.returnResource(jedis);
}
}
@Override
protected void pushWhenNoDuplicate(Request request, Task task) {
Jedis jedis = pool.getResource();
try {
jedis.rpush(getQueueKey(task), request.getUrl());
if (request.getExtras() != null) {
String field = DigestUtils.shaHex(request.getUrl());
String value = JSON.toJSONString(request);
jedis.hset((ITEM_PREFIX + task.getUUID()), field, value);
}
} finally {
pool.returnResource(jedis);
......@@ -58,7 +76,7 @@ public class RedisScheduler implements Scheduler {
public synchronized Request poll(Task task) {
Jedis jedis = pool.getResource();
try {
String url = jedis.lpop(QUEUE_PREFIX + task.getUUID());
String url = jedis.lpop(getQueueKey(task));
if (url == null) {
return null;
}
......@@ -75,4 +93,34 @@ public class RedisScheduler implements Scheduler {
pool.returnResource(jedis);
}
}
protected String getSetKey(Task task) {
return SET_PREFIX + task.getUUID();
}
protected String getQueueKey(Task task) {
return QUEUE_PREFIX + task.getUUID();
}
@Override
public int getLeftRequestsCount(Task task) {
Jedis jedis = pool.getResource();
try {
Long size = jedis.llen(getQueueKey(task));
return size.intValue();
} finally {
pool.returnResource(jedis);
}
}
@Override
public int getTotalRequestsCount(Task task) {
Jedis jedis = pool.getResource();
try {
Long size = jedis.scard(getSetKey(task));
return size.intValue();
} finally {
pool.returnResource(jedis);
}
}
}
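The refactoring splits `push` into a duplicate check and an enqueue step. The base class `DuplicatedRemoveScheduler` is not shown in this diff, so the following is only a sketch of the presumed template-method shape, using an in-memory set and queue in place of Redis. Every name here is hypothetical and merely echoes the method names in the diff.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical illustration; assumes the real base class calls isDuplicate before pushWhenNoDuplicate.
abstract class DedupSchedulerSketch {
    // Template method: only URLs that have not been seen reach the queue.
    public final void push(String url) {
        if (!isDuplicate(url)) {
            pushWhenNoDuplicate(url);
        }
    }

    protected abstract boolean isDuplicate(String url);

    protected abstract void pushWhenNoDuplicate(String url);
}

class InMemorySchedulerSketch extends DedupSchedulerSketch {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final Queue<String> queue = new ConcurrentLinkedQueue<String>();

    @Override
    protected boolean isDuplicate(String url) {
        // Set.add returns false when the URL was already present, i.e. it is a duplicate.
        return !seen.add(url);
    }

    @Override
    protected void pushWhenNoDuplicate(String url) {
        queue.add(url);
    }

    public String poll() {
        return queue.poll();
    }

    // Requests still waiting in the queue.
    public int getLeftRequestsCount() {
        return queue.size();
    }

    // Every distinct URL ever pushed, so the count comes from the "seen" set, not the queue.
    public int getTotalRequestsCount() {
        return seen.size();
    }
}
```

Note that the total is read from the deduplication set while the remaining count is read from the queue; the two collections play the same roles as the Redis set and list keys.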
package us.codecraft.webmagic.utils;
import java.lang.reflect.Field;
import java.util.LinkedHashSet;
import java.util.Set;
/**
* @author code4crafter@gmail.com
* @since 0.5.0
*/
public abstract class ClassUtils {
public static Set<Field> getFieldsIncludeSuperClass(Class clazz) {
Set<Field> fields = new LinkedHashSet<Field>();
Class current = clazz;
while (current != null) {
Field[] currentFields = current.getDeclaredFields();
for (Field currentField : currentFields) {
fields.add(currentField);
}
current = current.getSuperclass();
}
return fields;
}
}
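`getFieldsIncludeSuperClass` exists so that annotated fields declared on a parent model (such as `star` on `BaseRepo`) are picked up alongside the subclass's own fields. A small sketch of the same walk, using hypothetical `Base` and `Repo` classes and returning field names for easy inspection:

```java
import java.lang.reflect.Field;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical demo classes; the loop mirrors the walk in ClassUtils.
class FieldWalkSketch {
    static class Base {
        protected int star;
    }

    static class Repo extends Base {
        private int fork;
    }

    // Collects declared field names from the class itself first, then from each superclass in turn.
    static Set<String> fieldNames(Class<?> clazz) {
        Set<String> names = new LinkedHashSet<String>();
        for (Class<?> current = clazz; current != null; current = current.getSuperclass()) {
            for (Field field : current.getDeclaredFields()) {
                if (!field.isSynthetic()) { // skip compiler- or tool-generated fields
                    names.add(field.getName());
                }
            }
        }
        return names;
    }
}
```

`Repo.class.getDeclaredFields()` alone would miss `star`; walking `getSuperclass()` until it returns `null` is what brings it in.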
......@@ -37,12 +37,7 @@ public class ExtractorUtils {
}
private static Selector getXpathSelector(String value) {
Selector selector;
if (EnvironmentUtil.useXsoup()) {
selector = new XsoupSelector(value);
} else {
selector = new XpathSelector(value);
}
Selector selector = new XpathSelector(value);
return selector;
}
......
package us.codecraft.webmagic.utils;
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Enumeration;
/**
* @author code4crafter@gmail.com
* @since 0.5.0
*/
public abstract class IPUtils {
public static String getFirstNoLoopbackIPAddresses() throws SocketException {
Enumeration<NetworkInterface> networkInterfaces = NetworkInterface.getNetworkInterfaces();
InetAddress localAddress = null;
while (networkInterfaces.hasMoreElements()) {
NetworkInterface networkInterface = networkInterfaces.nextElement();
Enumeration<InetAddress> inetAddresses = networkInterface.getInetAddresses();
while (inetAddresses.hasMoreElements()) {
InetAddress address = inetAddresses.nextElement();
if (!address.isLoopbackAddress() && !Inet6Address.class.isInstance(address)) {
return address.getHostAddress();
} else if (!address.isLoopbackAddress()) {
localAddress = address;
}
}
}
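// No non-loopback address was found at all: return null instead of throwing a NullPointerException.
return localAddress == null ? null : localAddress.getHostAddress();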
}
}
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
<appender name="stdout" class="org.apache.log4j.ConsoleAppender">
<layout class="org.apache.log4j.PatternLayout">
<param name="ConversionPattern" value="%d{yy-MM-dd HH:mm:ss,SSS} %-5p %c(%F:%L) ## %m%n" />
</layout>
</appender>
<logger name="org.apache" additivity="false">
<level value="warn" />
<appender-ref ref="stdout" />
</logger>
<root>
<level value="info" />
<appender-ref ref="stdout" />
</root>
</log4j:configuration>
<!--This is a draft of config file.
If you have any advice, go https://github.com/code4craft/webmagic/issues/106 and comment!-->
<spider>
<site>
<charset>utf-8</charset>
<user-agent></user-agent>
<cookies>
<cookie domain="" path="" name="" value="">
</cookie>
</cookies>
<heads>
<head name="" value=""/>
</heads>
</site>
<startUrls>
<url></url>
</startUrls>
<extraction targetUrl="" helpUrl="">
<field name="title">
<extractor type="xpath" value="//div[@class='title']"/>
</field>
<field name="content">
<extractor type="xpath" value="//div[@class='content']"/>
</field>
</extraction>
</spider>
\ No newline at end of file
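Since the config format above is still a draft, the following is only a sketch of how its `<field>`/`<extractor>` pairs could be read with the JDK's DOM parser. The class and method names (`SpiderConfigSketch`, `readFieldRules`) are hypothetical and nothing here is part of WebMagic; the element and attribute names are taken from the draft.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for the draft config; assumes one <extractor> per <field>.
class SpiderConfigSketch {

    // Returns field name -> extractor expression, in document order.
    static Map<String, String> readFieldRules(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> rules = new LinkedHashMap<String, String>();
            NodeList fields = doc.getElementsByTagName("field");
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                NodeList extractors = field.getElementsByTagName("extractor");
                if (extractors.getLength() > 0) {
                    Element extractor = (Element) extractors.item(0);
                    rules.put(field.getAttribute("name"), extractor.getAttribute("value"));
                }
            }
            return rules;
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid spider config", e);
        }
    }
}
```

The resulting map could then be turned into extraction rules, for example one `ExtractRule` per entry as in `ConfigurablePageProcessorTest`.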
package us.codecraft.webmagic.configurable;
import org.junit.Test;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.MockGithubDownloader;
import java.util.ArrayList;
import java.util.List;
import static org.assertj.core.api.Assertions.assertThat;
/**
* @author code4crafter@gmail.com
* @date 14-4-5
*/
public class ConfigurablePageProcessorTest {
@Test
public void test() throws Exception {
List<ExtractRule> extractRules = new ArrayList<ExtractRule>();
ExtractRule extractRule = new ExtractRule();
extractRule.setExpressionType(ExpressionType.XPath);
extractRule.setExpressionValue("//title");
extractRule.setFieldName("title");
extractRules.add(extractRule);
extractRule = new ExtractRule();
extractRule.setExpressionType(ExpressionType.XPath);
extractRule.setExpressionValue("//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()");
extractRule.setFieldName("star");
extractRules.add(extractRule);
ResultItems resultItems = Spider.create(new ConfigurablePageProcessor(Site.me(), extractRules))
.setDownloader(new MockGithubDownloader()).get("https://github.com/code4craft/webmagic");
assertThat(resultItems.getAll()).containsEntry("title", "<title>code4craft/webmagic &middot; GitHub</title>");
assertThat(resultItems.getAll()).containsEntry("star", " 86 ");
}
}
package us.codecraft.webmagic.model;
import us.codecraft.webmagic.model.annotation.ExtractBy;
/**
* @author code4crafter@gmail.com
*/
public class BaseRepo {
@ExtractBy("//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()")
protected int star;
}
package us.codecraft.webmagic.model;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;
/**
* @author code4crafter@gmail.com <br>
* @since 0.3.2
*/
@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl({"https://github.com/\\w+\\?tab=repositories", "https://github.com/\\w+", "https://github.com/explore/*"})
public class GithubRepo extends BaseRepo{
@ExtractBy("//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()")
private int fork;
public static void main(String[] args) {
OOSpider.create(Site.me().setSleepTime(100)
, new ConsolePageModelPipeline(), GithubRepo.class)
.addUrl("https://github.com/code4craft").thread(10).run();
}
public int getStar() {
return star;
}
public int getFork() {
return fork;
}
}
......@@ -5,7 +5,6 @@ import org.junit.Test;
import us.codecraft.webmagic.downloader.MockGithubDownloader;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.example.GithubRepo;
import us.codecraft.webmagic.pipeline.PageModelPipeline;
/**
......
package us.codecraft.webmagic.model;
import org.junit.Test;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.TargetUrl;
import us.codecraft.webmagic.selector.PlainText;
import static org.assertj.core.api.Assertions.assertThat;
/**
* @author code4crafter@gmail.com
* @date 14-4-4
*/
public class ModelPageProcessorTest {
@TargetUrl("http://codecraft.us/foo")
public static class ModelFoo {
@ExtractBy(value = "//div/@foo", notNull = true)
private String foo;
}
@TargetUrl("http://codecraft.us/bar")
public static class ModelBar {
@ExtractBy(value = "//div/@bar", notNull = true)
private String bar;
}
@Test
public void testMultiModel_should_not_skip_when_match() throws Exception {
Page page = new Page();
page.setRawText("<div foo='foo'></div>");
page.setRequest(new Request("http://codecraft.us/foo"));
page.setUrl(PlainText.create("http://codecraft.us/foo"));
ModelPageProcessor modelPageProcessor = ModelPageProcessor.create(null, ModelFoo.class, ModelBar.class);
modelPageProcessor.process(page);
assertThat(page.getResultItems().isSkip()).isFalse();
}
}
package us.codecraft.webmagic.monitor;
import us.codecraft.webmagic.Spider;
/**
* @author code4crafter@gmail.com
*/
public class CustomSpiderStatus extends SpiderStatus implements CustomSpiderStatusMXBean {
public CustomSpiderStatus(Spider spider, SpiderMonitor.MonitorSpiderListener monitorSpiderListener) {
super(spider, monitorSpiderListener);
}
@Override
public String getSchedulerName() {
return spider.getScheduler().getClass().getName();
}
}
package us.codecraft.webmagic.monitor;
/**
* @author code4crafter@gmail.com
*/
public interface CustomSpiderStatusMXBean extends SpiderStatusMXBean {
public String getSchedulerName();
}
package us.codecraft.webmagic.monitor;
import org.junit.Test;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.example.GithubRepoPageProcessor;
import us.codecraft.webmagic.processor.example.OschinaBlogPageProcessor;
/**
* @author code4crafter@gmail.com
* @since 0.5.0
*/
public class SpiderMonitorTest {
@Test
public void testInherit() throws Exception {
SpiderMonitor spiderMonitor = new SpiderMonitor(){
@Override
protected SpiderStatusMXBean getSpiderStatusMBean(Spider spider, MonitorSpiderListener monitorSpiderListener) {
return new CustomSpiderStatus(spider, monitorSpiderListener);
}
};
Spider oschinaSpider = Spider.create(new OschinaBlogPageProcessor())
.addUrl("http://my.oschina.net/flashsword/blog").thread(2);
Spider githubSpider = Spider.create(new GithubRepoPageProcessor())
.addUrl("https://github.com/code4craft");
spiderMonitor.register(oschinaSpider, githubSpider);
}
}
package us.codecraft.webmagic.utils;
import org.junit.Test;
/**
* @author code4crafter@gmail.com
*/
public class IPUtilsTest {
@Test
public void testGetFirstNoLoopbackIPAddresses() throws Exception {
System.out.println(IPUtils.getFirstNoLoopbackIPAddresses());
}
}
......@@ -8,23 +8,13 @@
</layout>
</appender>
<logger name="org.springframework" additivity="false">
<level value="warn" />
<appender-ref ref="stdout" />
</logger>
<logger name="org.apache" additivity="false">
<level value="warn" />
<appender-ref ref="stdout" />
</logger>
<logger name="net.sf.ehcache" additivity="false">
<level value="warn" />
<appender-ref ref="stdout" />
</logger>
<root>
<level value="debug" />
<level value="info" />
<appender-ref ref="stdout" />
</root>
......
webmagic-lucene
--------
An attempt to combine webmagic with Lucene to build a search engine. Under development; not a core webmagic module.
\ No newline at end of file
package us.codecraft.webmagic.pipeline;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
/**
* @author code4crafter@gmail.com <br>
* Date: 13-8-5 <br>
* Time: 2:11 PM <br>
*/
public class LucenePipeline implements Pipeline {
private Directory directory;
private Analyzer analyzer;
private IndexWriterConfig config;
private void init() throws IOException {
analyzer = new StandardAnalyzer(Version.LUCENE_44);
directory = new RAMDirectory();
config = new IndexWriterConfig(Version.LUCENE_44, analyzer);
}
public LucenePipeline() {
try {
init();
} catch (IOException e) {
e.printStackTrace();
}
}
public List<Document> search(String fieldName, String value) throws IOException, ParseException {
List<Document> documents = new ArrayList<Document>();
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser(Version.LUCENE_44, fieldName, analyzer);
Query query = parser.parse(value);
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {
Document hitDoc = isearcher.doc(hits[i].doc);
documents.add(hitDoc);
}
ireader.close();
return documents;
}
@Override
public void process(ResultItems resultItems, Task task) {
if (resultItems.isSkip()){
return;
}
Document doc = new Document();
Map<String,Object> all = resultItems.getAll();
if (all==null){
return;
}
for (Map.Entry<String, Object> objectEntry : all.entrySet()) {
doc.add(new Field(objectEntry.getKey(), objectEntry.getValue().toString(), TextField.TYPE_STORED));
}
try {
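// An IndexWriterConfig cannot be shared between IndexWriter instances, so build a fresh one for each writer.
IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_44, analyzer));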
indexWriter.addDocument(doc);
indexWriter.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
package us.codecraft.webmagic.lucene;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryparser.classic.ParseException;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.TargetUrl;
import us.codecraft.webmagic.pipeline.LucenePipeline;
import java.io.IOException;
import java.util.List;
/**
* @author code4crafter@gmail.com <br>
* Date: 13-8-2 <br>
* Time: 7:52 AM <br>
*/
@TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
public class OschinaBlog {
@ExtractBy("//title")
private String title;
@ExtractBy(value = "div.BlogContent", type = ExtractBy.Type.Css)
private String content;
@Override
public String toString() {
return "OschinaBlog{" +
"title='" + title + '\'' +
", content='" + content + '\'' +
'}';
}
public static void main(String[] args) {
LucenePipeline pipeline = new LucenePipeline();
OOSpider.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"), OschinaBlog.class).pipeline(pipeline).runAsync();
while (true) {
try {
List<Document> search = pipeline.search("title", "webmagic");
System.out.println(search);
Thread.sleep(3000);
} catch (IOException e) {
e.printStackTrace();
} catch (ParseException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
public String getTitle() {
return title;
}
public String getContent() {
return content;
}
}
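Worker:
Executes tasks; exposes an HTTP interface for monitoring run status and for stopping and starting jobs
Queue:
Still uses Redis
Panel:
Provides a web admin console for managing:
1. Creating a task
1. Via script
2. Configuration
3. Machine assignment
2. Existing tasks
3. Viewing tasks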
\ No newline at end of file
......@@ -3,7 +3,7 @@
<parent>
<artifactId>webmagic-parent</artifactId>
<groupId>us.codecraft</groupId>
<version>0.4.3</version>
<version>0.5.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
......@@ -34,6 +34,25 @@
<skip>true</skip>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<executions>
<execution>
<id>copy-dependencies</id>
<phase>package</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<outputDirectory>${project.build.directory}/lib</outputDirectory>
<overWriteReleases>false</overWriteReleases>
<overWriteSnapshots>false</overWriteSnapshots>
<overWriteIfNewer>true</overWriteIfNewer>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
......
package us.codecraft.webmagic.model.samples;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
/**
* @author code4crafter@gmail.com
* @date 14-4-9
*/
public class BaiduNews {
@ExtractBy("//h3[@class='c-title']/a/text()")
private String name;
@ExtractBy("//div[@class='c-summary']/text()")
private String description;
@Override
public String toString() {
return "BaiduNews{" +
"name='" + name + '\'' +
", description='" + description + '\'' +
'}';
}
public static void main(String[] args) {
OOSpider ooSpider = OOSpider.create(Site.me().setSleepTime(0), BaiduNews.class);
//single download
BaiduNews baike = ooSpider.<BaiduNews>get("http://news.baidu.com/ns?tn=news&cl=2&rn=20&ct=1&fr=bks0000&ie=utf-8&word=httpclient");
System.out.println(baike);
ooSpider.close();
}
public String getName() {
return name;
}
public String getDescription() {
return description;
}
}
\ No newline at end of file
package us.codecraft.webmagic.model.samples;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.monitor.SpiderMonitor;
import us.codecraft.webmagic.pipeline.PageModelPipeline;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;
import javax.management.JMException;
import java.io.IOException;
/**
* @author code4crafter@gmail.com <br>
*/
......@@ -25,14 +30,17 @@ public class Kr36NewsModel {
@ExtractByUrl
private String url;
public static void main(String[] args) {
public static void main(String[] args) throws IOException, JMException {
//Just for benchmark
OOSpider.create(Site.me().addStartUrl("http://www.36kr.com/").setSleepTime(0), new PageModelPipeline() {
Spider thread = OOSpider.create(Site.me().addStartUrl("http://www.36kr.com/").setSleepTime(0), new PageModelPipeline() {
@Override
public void process(Object o, Task task) {
}
},Kr36NewsModel.class).thread(20).run();
}, Kr36NewsModel.class).thread(20);
thread.start();
SpiderMonitor spiderMonitor = SpiderMonitor.instance();
spiderMonitor.register(thread);
}
public String getTitle() {
......
......@@ -3,7 +3,6 @@ package us.codecraft.webmagic.model.samples;
import us.codecraft.webmagic.MultiPageModel;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ComboExtract;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;
......@@ -26,9 +25,8 @@ public class News163 implements MultiPageModel {
@ExtractByUrl(value = "http://news\\.163\\.com/\\d+/\\d+/\\d+/\\w+_(\\d+)\\.html", notNull = false)
private String page;
@ComboExtract(value = {@ExtractBy("//div[@class=\"ep-pages\"]//a/@href"),
@ExtractBy(value = "http://news\\.163\\.com/\\d+/\\d+/\\d+/\\w+_(\\d+)\\.html", type = ExtractBy.Type.Regex)},
multi = true, notNull = false)
@ExtractBy(value = "//div[@class=\"ep-pages\"]//a/regex('http://news\\.163\\.com/\\d+/\\d+/\\d+/\\w+_(\\d+)\\.html',1)"
, multi = true, notNull = false)
private List<String> otherPage;
@ExtractBy("//h1[@id=\"h1title\"]/text()")
......@@ -74,8 +72,8 @@ public class News163 implements MultiPageModel {
}
public static void main(String[] args) {
OOSpider.create(Site.me().addStartUrl("http://news.163.com/13/0802/05/958I1E330001124J_2.html"), News163.class)
.scheduler(new RedisScheduler("localhost")).clearPipeline().pipeline(new MultiPagePipeline()).pipeline(new ConsolePipeline()).run();
OOSpider.create(Site.me(), News163.class).addUrl("http://news.163.com/13/0802/05/958I1E330001124J_2.html")
.scheduler(new RedisScheduler("localhost")).addPipeline(new MultiPagePipeline()).addPipeline(new ConsolePipeline()).run();
}
}
package us.codecraft.webmagic.model.samples;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.TargetUrl;
/**
* @author code4crafter@gmail.com
* @date 14-4-11
*/
@TargetUrl("http://meishi.qq.com/beijing/c/all[\\-p2]*")
@ExtractBy(value = "//ul[@id=\"promos_list2\"]/li",multi = true)
public class QQMeishi {
@ExtractBy("//div[@class='info']/a[@class='title']/h4/text()")
private String shopName;
@ExtractBy("//div[@class='info']/a[@class='title']/text()")
private String promo;
public static void main(String[] args) {
OOSpider.create(Site.me(), new ConsolePageModelPipeline(), QQMeishi.class).addUrl("http://meishi.qq.com/beijing/c/all").thread(4).run();
}
}
package us.codecraft.webmagic.samples;
import org.apache.commons.collections.CollectionUtils;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.JsonPathSelector;
import java.util.List;
/**
* @author code4crafter@gmail.com
* @since 0.5.0
*/
public class AngularJSProcessor implements PageProcessor {
private Site site = Site.me();
private static final String ARTICLE_URL = "http://angularjs\\.cn/api/article/\\w+";
private static final String LIST_URL = "http://angularjs\\.cn/api/article/latest.*";
@Override
public void process(Page page) {
if (page.getUrl().regex(LIST_URL).match()) {
List<String> ids = new JsonPathSelector("$.data[*]._id").selectList(page.getRawText());
if (CollectionUtils.isNotEmpty(ids)) {
for (String id : ids) {
page.addTargetRequest("http://angularjs.cn/api/article/" + id);
}
}
} else {
page.putField("title", new JsonPathSelector("$.data.title").select(page.getRawText()));
page.putField("content", new JsonPathSelector("$.data.content").select(page.getRawText()));
}
}
@Override
public Site getSite() {
return site;
}
public static void main(String[] args) {
Spider.create(new AngularJSProcessor()).addUrl("http://angularjs.cn/api/article/latest?p=1&s=20").run();
}
}
package us.codecraft.webmagic.samples;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
/**
* @author code4crafter@gmail.com <br>
*/
public class SinaBlogProcesser implements PageProcessor {
private Site site;
@Override
public void process(Page page) {
page.addTargetRequests(page.getHtml().xpath("//div[@class='articalfrontback SG_j_linedot1 clearfix']").links().all());
page.putField("title", page.getHtml().xpath("//div[@class='articalTitle']/h2"));
page.putField("content",page.getHtml().xpath("//div[@id='articlebody']//div[@class='articalContent']"));
page.putField("id",page.getUrl().regex("http://blog\\.sina\\.com\\.cn/s/blog_(\\w+)"));
page.putField("date",page.getHtml().xpath("//div[@id='articlebody']//span[@class='time SG_txtc']").regex("\\((.*)\\)"));
// page.putField("tags",page.getHtml().xpath("//td[@class='blog_tag']/h3/a"));
}
@Override
public Site getSite() {
if (site==null){
site = Site.me().setDomain("blog.sina.com.cn").addStartUrl("http://blog.sina.com.cn/s/blog_4701280b0102egl0.html").setSleepTime(3000).
setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");
}
return site;
}
public static void main(String[] args) {
Spider.create(new SinaBlogProcesser()).run();
}
}
package us.codecraft.webmagic.samples;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
/**
* @author code4crafter@gmail.com <br>
*/
public class SinaBlogProcessor implements PageProcessor {
public static final String URL_LIST = "http://blog\\.sina\\.com\\.cn/s/articlelist_1487828712_0_\\d+\\.html";
public static final String URL_POST = "http://blog\\.sina\\.com\\.cn/s/blog_\\w+\\.html";
private Site site = Site
.me()
.setDomain("blog.sina.com.cn")
.setSleepTime(3000)
.setUserAgent(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");
@Override
public void process(Page page) {
// List page
if (page.getUrl().regex(URL_LIST).match()) {
page.addTargetRequests(page.getHtml().xpath("//div[@class=\"articleList\"]").links().regex(URL_POST).all());
page.addTargetRequests(page.getHtml().links().regex(URL_LIST).all());
// Article page
} else {
page.putField("title", page.getHtml().xpath("//div[@class='articalTitle']/h2"));
page.putField("content", page.getHtml().xpath("//div[@id='articlebody']//div[@class='articalContent']"));
page.putField("date",
page.getHtml().xpath("//div[@id='articlebody']//span[@class='time SG_txtc']").regex("\\((.*)\\)"));
}
}
@Override
public Site getSite() {
return site;
}
public static void main(String[] args) {
Spider.create(new SinaBlogProcessor()).addUrl("http://blog.sina.com.cn/s/articlelist_1487828712_0_1.html")
.run();
}
}
package us.codecraft.webmagic.samples.formatter;
import us.codecraft.webmagic.model.formatter.ObjectFormatter;
/**
* @author yihua.huang@dianping.com
*/
public class StringTemplateFormatter implements ObjectFormatter<String> {
private String template;
@Override
public String format(String raw) throws Exception {
return String.format(template, raw);
}
@Override
public Class<String> clazz() {
return String.class;
}
@Override
public void initParam(String[] extra) {
template = extra[0];
}
}
......@@ -9,8 +9,9 @@ import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.PriorityScheduler;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import static us.codecraft.webmagic.selector.Selectors.regex;
import static us.codecraft.webmagic.selector.Selectors.xpath;
/**
......@@ -19,16 +20,16 @@ import static us.codecraft.webmagic.selector.Selectors.xpath;
public class ZipCodePageProcessor implements PageProcessor {
private Site site = Site.me().setCharset("gb2312")
.setSleepTime(100).addStartUrl("http://www.ip138.com/post/");
.setSleepTime(100);
@Override
public void process(Page page) {
if (page.getUrl().toString().equals("http://www.ip138.com/post/")) {
processCountry(page);
} else if (page.getUrl().regex("http://www\\.ip138\\.com/post/\\w+[/]?$").toString() != null) {
processProvince(page);
} else {
} else if (page.getUrl().regex("http://www\\.ip138\\.com/\\d{6}[/]?$").toString() != null) {
processDistrict(page);
} else {
processProvince(page);
}
}
......@@ -45,28 +46,26 @@ public class ZipCodePageProcessor implements PageProcessor {
private void processProvince(Page page) {
// XPath alone cannot locate the target precisely here, so a regex is used as a filter; anything that does not match the regex is dropped
List<String> districts = page.getHtml().xpath("//body/table/tbody/tr/td").regex(".*http://www\\.ip138\\.com/post/\\w+/\\w+.*").all();
List<String> districts = page.getHtml().xpath("//body/table/tbody/tr[@bgcolor=\"#ffffff\"]").all();
Pattern pattern = Pattern.compile("<td>([^<>]+)</td>.*?href=\"(.*?)\"",Pattern.DOTALL);
for (String district : districts) {
String link = xpath("//@href").select(district);
String title = xpath("/text()").select(district);
Request request = new Request(link).setPriority(1).putExtra("province", page.getRequest().getExtra("province")).putExtra("district", title);
page.addTargetRequest(request);
Matcher matcher = pattern.matcher(district);
while (matcher.find()) {
String title = matcher.group(1);
String link = matcher.group(2);
Request request = new Request(link).setPriority(1).putExtra("province", page.getRequest().getExtra("province")).putExtra("district", title);
page.addTargetRequest(request);
}
}
}
private void processDistrict(Page page) {
String province = page.getRequest().getExtra("province").toString();
String district = page.getRequest().getExtra("district").toString();
List<String> counties = page.getHtml().xpath("//body/table/tbody/tr").regex(".*<td>\\d+</td>.*").all();
String regex = "<td[^<>]*>([^<>]+)</td><td[^<>]*>([^<>]+)</td><td[^<>]*>([^<>]+)</td><td[^<>]*>([^<>]+)</td>";
for (String county : counties) {
String county0 = regex(regex, 1).select(county);
String county1 = regex(regex, 2).select(county);
String zipCode = regex(regex, 3).select(county);
page.putField("result", StringUtils.join(new String[]{province, district,
county0, county1, zipCode}, "\t"));
}
List<String> links = page.getHtml().links().regex("http://www\\.ip138\\.com/post/\\w+/\\w+").all();
String zipCode = page.getHtml().regex("<h2>邮编:(\\d+)</h2>").toString();
page.putField("result", StringUtils.join(new String[]{province, district,
zipCode}, "\t"));
List<String> links = page.getHtml().links().regex("http://www\\.ip138\\.com/\\d{6}[/]?$").all();
for (String link : links) {
page.addTargetRequest(new Request(link).setPriority(2).putExtra("province", province).putExtra("district", district));
}
......@@ -79,11 +78,8 @@ public class ZipCodePageProcessor implements PageProcessor {
}
public static void main(String[] args) {
Spider.create(new ZipCodePageProcessor()).scheduler(new PriorityScheduler()).run();
Spider spider = Spider.create(new ZipCodePageProcessor()).scheduler(new PriorityScheduler()).addUrl("http://www.ip138.com/post/");
PriorityScheduler scheduler = new PriorityScheduler();
Spider spider = Spider.create(new ZipCodePageProcessor()).scheduler(scheduler);
scheduler.push(new Request("http://www.baidu.com/s?wd=webmagic&f=12&rsp=0&oq=webmagix&tn=baiduhome_pg&ie=utf-8"),spider);
spider.run();
}
}
......@@ -5,7 +5,7 @@ import org.junit.Test;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.FilePipeline;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.samples.SinaBlogProcesser;
import us.codecraft.webmagic.samples.SinaBlogProcessor;
import us.codecraft.webmagic.scheduler.FileCacheQueueScheduler;
import java.io.IOException;
......@@ -20,7 +20,7 @@ public class SinablogProcessorTest {
@Ignore
@Test
public void test() throws IOException {
SinaBlogProcesser sinaBlogProcesser = new SinaBlogProcesser();
SinaBlogProcessor sinaBlogProcessor = new SinaBlogProcessor();
// A pipeline handles the results after a page has been crawled
// By default, output goes to the /data/webmagic/ftl/[domain] directory
JsonFilePipeline pipeline = new JsonFilePipeline("/data/webmagic/");
......@@ -29,7 +29,7 @@ public class SinablogProcessorTest {
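// ConsolePipeline prints results to the console
// FileCacheQueueScheduler persists URLs so a crawl can resume after an interruption; temporary files go to the /data/temp/webmagic/cache directory
// Spider.run() starts the crawl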
Spider.create(sinaBlogProcesser).pipeline(new FilePipeline()).pipeline(pipeline).scheduler(new FileCacheQueueScheduler("/data/temp/webmagic/cache/")).
Spider.create(sinaBlogProcessor).pipeline(new FilePipeline()).pipeline(pipeline).scheduler(new FileCacheQueueScheduler("/data/temp/webmagic/cache/")).
run();
}
}
......@@ -3,7 +3,7 @@
<parent>
<artifactId>webmagic-parent</artifactId>
<groupId>us.codecraft</groupId>
<version>0.4.3</version>
<version>0.5.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
......@@ -15,9 +15,15 @@
<artifactId>webmagic-core</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>net.sourceforge.htmlcleaner</groupId>
<artifactId>htmlcleaner</artifactId>
<version>2.5</version>
</dependency>
<dependency>
<groupId>net.sf.saxon</groupId>
<artifactId>Saxon-HE</artifactId>
<version>9.5.1-1</version>
</dependency>
<dependency>
<groupId>junit</groupId>
......@@ -36,4 +42,4 @@
</plugins>
</build>
</project>
\ No newline at end of file
</project>
......@@ -1350,7 +1350,7 @@ public class XpathSelectorTest {
+ "</script>\n" + "\n" + " \n" + " \n" + " </body>\n" + "</html>\n";
String text2 = "<div>aaa</div>";
XpathSelector xpathSelector = new XpathSelector(
"//div[@id='main']/div[@class='blog_main']/div[1][@class='blog_title']/h3/a");
"//div[@id='main']/div[@class='blog_main']/div[@class='blog_title']/h3/a/text()");
String select = xpathSelector.select(text);
Assert.assertEquals("jsoup 解析页面商品信息", select);
}
......
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
......@@ -3,7 +3,7 @@
<parent>
<artifactId>webmagic-parent</artifactId>
<groupId>us.codecraft</groupId>
<version>0.4.3</version>
<version>0.5.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
......@@ -16,6 +16,10 @@
<artifactId>jruby</artifactId>
<version>1.7.6</version>
</dependency>
<dependency>
<groupId>org.python</groupId>
<artifactId>jython</artifactId>
<version>2.5.3</version>
</dependency>
<dependency>
<groupId>commons-cli</groupId>
<artifactId>commons-cli</artifactId>
......@@ -40,25 +44,6 @@
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<executions>
<execution>
<id>copy-dependencies</id>
<phase>package</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<outputDirectory>${project.build.directory}/lib</outputDirectory>
<overWriteReleases>false</overWriteReleases>
<overWriteSnapshots>false</overWriteSnapshots>
<overWriteIfNewer>true</overWriteIfNewer>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
......
......@@ -7,7 +7,9 @@ public enum Language {
JavaScript("javascript","js/defines.js",""),
JRuby("jruby","ruby/defines.rb","");
JRuby("jruby","ruby/defines.rb",""),
Jython("jython","python/defines.py","");
private String engineName;
......
package us.codecraft.webmagic.scripts;
import org.apache.commons.io.IOUtils;
import org.jruby.RubyHash;
import org.python.core.PyDictionary;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
......@@ -10,6 +12,8 @@ import javax.script.ScriptEngine;
import javax.script.ScriptException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.Map;
/**
* @author code4crafter@gmail.com
......@@ -50,20 +54,35 @@ public class ScriptProcessor implements PageProcessor {
context.setAttribute("page", page, ScriptContext.ENGINE_SCOPE);
context.setAttribute("config", site, ScriptContext.ENGINE_SCOPE);
try {
engine.eval(defines + "\n" + script, context);
// switch (language) {
// case JavaScript:
// NativeObject o = (NativeObject) engine.get("result");
// if (o != null) {
// for (Map.Entry<Object, Object> objectObjectEntry : o.entrySet()) {
// page.getResultItems().put(objectObjectEntry.getKey().toString(), objectObjectEntry.getValue());
switch (language) {
case JavaScript:
engine.eval(defines + "\n" + script, context);
// NativeObject o = (NativeObject) engine.get("result");
// if (o != null) {
// for (Object o1 : o.getIds()) {
// String key = String.valueOf(o1);
// page.getResultItems().put(key, NativeObject.getProperty(o, key));
// }
// }
// }
// break;
// case JRuby:
// Object o1 = engine.get("result");
// break;
// }
break;
case JRuby:
RubyHash oRuby = (RubyHash) engine.eval(defines + "\n" + script, context);
Iterator itruby = oRuby.entrySet().iterator();
while (itruby.hasNext()) {
Map.Entry pairs = (Map.Entry) itruby.next();
page.getResultItems().put(pairs.getKey().toString(), pairs.getValue());
}
break;
case Jython:
engine.eval(defines + "\n" + script, context);
PyDictionary oJython = (PyDictionary) engine.get("result");
Iterator it = oJython.entrySet().iterator();
while (it.hasNext()) {
Map.Entry pairs = (Map.Entry) it.next();
page.getResultItems().put(pairs.getKey().toString(), pairs.getValue());
}
break;
}
} catch (ScriptException e) {
e.printStackTrace();
}
......@@ -72,6 +91,7 @@ public class ScriptProcessor implements PageProcessor {
}
}
@Override
public Site getSite() {
return site;
......
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
......@@ -9,3 +9,4 @@ var config = {
title = $("div.BlogTitle h1"),
content = $("div.BlogContent")
urls("http://my\\.oschina\\.net/flashsword/blog/\\d+")
config;
File mode changed from 100644 to 100755
def xpath(str):
return page.getHtml().xpath(str).toString()
def css(str):
return page.getHtml().css(str).toString()
def urls(str):
links=page.getHtml().links().regex(str).all()
page.addTargetRequests(links);
def tomap(key,value):
return "hello world"
title=xpath("div[@class=BlogTitle]")
urls="http://my\\.oschina\\.net/flashsword/blog/\\d+"
result={"title":title,"urls":urls}
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
urls "http://my\\.oschina\\.net/flashsword/blog/\\d+"
title = css "div.BlogTitle h1"
content = css "div.BlogContent"
urls "http://my\\.oschina\\.net/flashsword/blog/\\d+"
\ No newline at end of file
return {"title"=>title,"content"=>content}
package us.codecraft.webmagic.scripts;
import org.junit.Ignore;
import org.junit.Test;
import us.codecraft.webmagic.Spider;
......@@ -7,6 +8,7 @@ import us.codecraft.webmagic.Spider;
* @author code4crafter@gmail.com
* @since 0.4.1
*/
@Ignore
public class ScriptProcessorTest {
@Test
......@@ -22,4 +24,12 @@ public class ScriptProcessorTest {
pageProcessor.getSite().setSleepTime(0);
Spider.create(pageProcessor).addUrl("http://my.oschina.net/flashsword/blog").setSpawnUrl(false).run();
}
@Test
public void testPythonProcessor() {
ScriptProcessor pageProcessor = ScriptProcessorBuilder.custom().language(Language.Jython).scriptFromClassPathFile("python/oschina.py").build();
pageProcessor.getSite().setSleepTime(0);
Spider.create(pageProcessor).addUrl("http://my.oschina.net/flashsword/blog").setSpawnUrl(false).run();
}
}
File mode changed from 100644 to 100755
......@@ -3,7 +3,7 @@
<parent>
<artifactId>webmagic-parent</artifactId>
<groupId>us.codecraft</groupId>
<version>0.4.3</version>
<version>0.5.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
......@@ -37,4 +37,4 @@
</plugins>
</build>
</project>
\ No newline at end of file
</project>
......@@ -22,7 +22,7 @@ public class HuabanProcessor implements PageProcessor {
public void process(Page page) {
page.addTargetRequests(page.getHtml().links().regex("http://huaban\\.com/.*").all());
if (page.getUrl().toString().contains("pins")) {
page.putField("img", page.getHtml().xpath("//div[@id='pin_img']/img/@src").toString());
page.putField("img", page.getHtml().xpath("//div[@id='pin_img']/a/img/@src").toString());
} else {
page.getResultItems().setSkip(true);
}
......@@ -30,16 +30,17 @@ public class HuabanProcessor implements PageProcessor {
@Override
public Site getSite() {
if (site == null) {
site = Site.me().setDomain("huaban.com").addStartUrl("http://huaban.com/").setSleepTime(0);
if (null == site) {
site = Site.me().setDomain("huaban.com").setSleepTime(0);
}
return site;
}
public static void main(String[] args) {
Spider.create(new HuabanProcessor()).thread(5)
.pipeline(new FilePipeline("/data/webmagic/test/"))
.downloader(new SeleniumDownloader("/Users/yihua/Downloads/chromedriver"))
.addPipeline(new FilePipeline("/data/webmagic/test/"))
.setDownloader(new SeleniumDownloader("/Users/yihua/Downloads/chromedriver"))
.addUrl("http://huaban.com/")
.runAsync();
}
}
webmagic
---------
![logo](https://raw.github.com/code4craft/webmagic/master/asserts/logo.jpg)
[![Build Status](https://travis-ci.org/code4craft/webmagic.png?branch=master)](https://travis-ci.org/code4craft/webmagic)
[Readme in English](https://github.com/code4craft/webmagic/tree/master/en_docs)
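[User Manual](https://github.com/code4craft/webmagic/blob/master/user-manual.md)
>webmagic is an open-source vertical crawler framework for Java. Its goal is to simplify crawler development so that developers can focus on the logic. The webmagic core is very simple yet covers the entire crawling workflow, which also makes it good material for learning crawler development. The author spent a year building vertical crawlers at a previous company, and webmagic was created to eliminate some of the repetitive work in crawler development.
>Web crawling is a technology, and webmagic aims to lower the cost of implementing it. Out of respect for content providers, however, webmagic will not do anything to get around blocking, including CAPTCHA cracking, proxy switching, and automatic login.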
......@@ -25,22 +29,37 @@ python爬虫 **scrapy** [https://github.com/scrapy/scrapy](https://github.com/sc
Java crawler **Spiderman** [https://gitcafe.com/laiweiwei/Spiderman](https://gitcafe.com/laiweiwei/Spiderman)
webmagic on GitHub: [https://github.com/code4craft/webmagic](https://github.com/code4craft/webmagic)
## Quick Start
### Using Maven
webmagic manages its dependencies with Maven. Add the corresponding dependencies to your project to use webmagic:
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
<version>0.4.2</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.4.2</version>
</dependency>
```xml
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
<version>0.4.3</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.4.3</version>
</dependency>
```
WebMagic uses slf4j-log4j12 as its slf4j implementation. If you provide your own slf4j implementation, exclude this dependency from your project.
```xml
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
```
#### Project Structure
......@@ -68,11 +87,7 @@ webmagic还包含两个可用的扩展包,因为这两个包都依赖了比较
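### Without Maven
If you do not use Maven, you can download this binary bundle (thanks to [oschina](http://www.oschina.net/)):
git clone http://git.oschina.net/flashsword20/webmagic-bin.git
The **bin/lib** directory contains all the jars the project depends on; import them directly in your IDE.
The project's **lib** directory contains all the dependency jars; import them directly in your IDE.
### Your First Crawler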
......@@ -80,31 +95,33 @@ webmagic还包含两个可用的扩展包,因为这两个包都依赖了比较
PageProcessor is part of webmagic-core; customizing a PageProcessor is all it takes to implement your own crawler logic. The following code crawls osc blogs:
public class OschinaBlogPageProcesser implements PageProcessor {
```java
public class OschinaBlogPageProcesser implements PageProcessor {
private Site site = Site.me().setDomain("my.oschina.net")
.addStartUrl("http://my.oschina.net/flashsword/blog");
private Site site = Site.me().setDomain("my.oschina.net");
@Override
public void process(Page page) {
List<String> links = page.getHtml().links().regex("http://my\\.oschina\\.net/flashsword/blog/\\d+").all();
page.addTargetRequests(links);
page.putField("title", page.getHtml().xpath("//div[@class='BlogEntity']/div[@class='BlogTitle']/h1").toString());
page.putField("content", page.getHtml().$("div.content").toString());
page.putField("tags",page.getHtml().xpath("//div[@class='BlogTags']/a/text()").all());
}
@Override
public void process(Page page) {
List<String> links = page.getHtml().links().regex("http://my\\.oschina\\.net/flashsword/blog/\\d+").all();
page.addTargetRequests(links);
page.putField("title", page.getHtml().xpath("//div[@class='BlogEntity']/div[@class='BlogTitle']/h1").toString());
page.putField("content", page.getHtml().$("div.content").toString());
page.putField("tags",page.getHtml().xpath("//div[@class='BlogTags']/a/text()").all());
}
@Override
public Site getSite() {
return site;
@Override
public Site getSite() {
return site;
}
}
public static void main(String[] args) {
Spider.create(new OschinaBlogPageProcesser())
.pipeline(new ConsolePipeline()).run();
}
public static void main(String[] args) {
Spider.create(new OschinaBlogPageProcesser()).addUrl("http://my.oschina.net/flashsword/blog")
.addPipeline(new ConsolePipeline()).run();
}
}
```
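Here, page.addTargetRequests() adds URLs to crawl, and page.putField() stores extraction results. page.getHtml().xpath() extracts results according to a rule, and extraction supports chained calls. At the end of a chain, toString() converts the result to a single String, while all() converts it to a list of Strings.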
......@@ -116,24 +133,26 @@ Spider是爬虫的入口类。Pipeline是结果输出和持久化的接口,这
webmagic-extension lets you write a crawler with annotations: annotating a POJO is all it takes. The following code also crawls oschina blogs and does exactly what OschinaBlogPageProcesser does:
@TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
public class OschinaBlog {
```java
@TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
public class OschinaBlog {
@ExtractBy("//title")
private String title;
@ExtractBy("//title")
private String title;
@ExtractBy(value = "div.BlogContent",type = ExtractBy.Type.Css)
private String content;
@ExtractBy(value = "div.BlogContent",type = ExtractBy.Type.Css)
private String content;
@ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
private List<String> tags;
@ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
private List<String> tags;
public static void main(String[] args) {
OOSpider.create(
Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"),
new ConsolePageModelPipeline(), OschinaBlog.class).run();
}
}
public static void main(String[] args) {
OOSpider.create(
Site.me(),
new ConsolePageModelPipeline(), OschinaBlog.class).addUrl("http://my.oschina.net/flashsword/blog").run();
}
}
```
This example defines a Model class whose fields 'title', 'content', and 'tags' are the properties to extract. The class can be reused in a Pipeline.
......@@ -145,10 +164,43 @@ webmagic-extension包括了注解方式编写爬虫的方法,只需基于一
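The webmagic-samples directory contains examples of custom PageProcessors that extract data from various sites.
The author also has a project that uses webmagic for extraction and persists the results to a database: [JobHunter](http://git.oschina.net/flashsword20/jobhunter). It integrates Spring, defines a custom Pipeline, and uses mybatis for persistence.
For an example of webmagic in use, see: [oschina openapi application: blog migration](http://my.oschina.net/oscfox/blog/194507)
### License
webmagic is licensed under the [Apache 2.0 License](http://opensource.org/licenses/Apache-2.0)
### Contributors:
The following people have contributed code or issues to WebMagic: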
* [ccliangbo](https://github.com/ccliangbo)
* [yuany](https://github.com/yuany)
* [yxssfxwzy](https://github.com/yxssfxwzy)
* [linkerlin](https://github.com/linkerlin)
* [d0ngw](https://github.com/d0ngw)
* [xuchaoo](https://github.com/xuchaoo)
* [supermicah](https://github.com/supermicah)
* [SimpleExpress](https://github.com/SimpleExpress)
* [aruanruan](https://github.com/aruanruan)
* [l1z2g9](https://github.com/l1z2g9)
* [zhegexiaohuozi](https://github.com/zhegexiaohuozi)
* [ywooer](https://github.com/ywooer)
* [yyw258520](https://github.com/yyw258520)
* [perfecking](https://github.com/perfecking)
* [lidongyang](http://my.oschina.net/lidongyang)
* [seveniu](https://github.com/seveniu)
* [sebastian1118](https://github.com/sebastian1118)
### Mailing Lists:
Gmail:
[https://groups.google.com/forum/#!forum/webmagic-java](https://groups.google.com/forum/#!forum/webmagic-java)
QQ:
[http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988](http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988)
### QQ Group:
373225642
WebMagic in Action
========
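WebMagic is a simple, flexible crawler framework that is easy to extend. Besides making it convenient to write a crawler, WebMagic provides multithreading and basic distributed crawling.
You can use WebMagic directly for crawler development, or customize it to fit the needs of a complex project.
## 1. Using WebMagic in Your Project
WebMagic consists mainly of two jars: `webmagic-core-{version}.jar` and `webmagic-extension-{version}.jar`. Add both as dependencies to use WebMagic in your project.
### 1.1 Using Maven
WebMagic is built with Maven, and installing it through Maven is recommended. Add the following coordinates to your project: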
```xml
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.4.3</version>
</dependency>
```
WebMagic uses slf4j-log4j12 as its slf4j implementation. If you provide your own slf4j implementation, exclude this dependency from your project.
```xml
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.4.3</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
</dependency>
```
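### 1.2 Without Maven
If you do not use Maven, you can download a version bundled with binary jars (thanks to [oschina](http://www.oschina.net/)):
git clone http://git.oschina.net/flashsword20/webmagic.git
The **lib** directory contains all the jars the project depends on; add them to Libraries in your IDE.
![import jars](http://static.oschina.net/uploads/space/2014/0403/143318_gBQE_190591.jpeg)
### 1.3 Your First Project
Once the WebMagic dependencies are in your project, you can start writing your first crawler! Here is an example that crawls GitHub information: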
```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
public class GithubRepoPageProcessor implements PageProcessor {
private Site site = Site.me().setRetryTimes(3).setSleepTime(100);
@Override
public void process(Page page) {
page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
if (page.getResultItems().get("name")==null){
//skip this page
page.setSkip(true);
}
page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
}
@Override
public Site getSite() {
return site;
}
public static void main(String[] args) {
Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
}
}
```
Run the main method from your IDE and you will see the crawler working!
![runlog](http://static.oschina.net/uploads/space/2014/0403/103741_3Gf5_190591.png)
<div style="page-break-after:always"></div>
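## 2. Downloading and Building the Source
WebMagic is a pure Java project. If you are familiar with Maven, downloading and building the source is straightforward. If not, this section also explains how to import the project into Eclipse.
### 2.1 Downloading the Source
WebMagic currently has two repositories:
* [https://github.com/code4craft/webmagic](https://github.com/code4craft/webmagic)
The GitHub repository holds the latest version, and all issues and pull requests live there. If you like the project, don't forget to give it a star!
* [http://git.oschina.net/flashsword20/webmagic](http://git.oschina.net/flashsword20/webmagic)
This repository includes all compiled dependency jars and holds only stable versions of the project; the latest version is still updated on GitHub. oschina is more stable within China, so it serves mainly as a mirror.
For either repository, run
git clone https://github.com/code4craft/webmagic.git
or
git clone http://git.oschina.net/flashsword20/webmagic.git
to download the latest code.
If you are not familiar with git itself, see @黄勇's article [Downloading the Smart Source from Git@OSC](http://my.oschina.net/huangyong/blog/200075)
### 2.2 Importing the Project
IntelliJ IDEA ships with Maven support; choose a Maven project when importing.
#### 2.2.1 Using the m2e Plugin
Eclipse users are advised to install the m2e plugin, available at [https://www.eclipse.org/m2e/download/](https://www.eclipse.org/m2e/download/)
After installation, choose Maven->Existing Maven Projects under File->Import to import the project.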
![m2e-import](http://static.oschina.net/uploads/space/2014/0403/104427_eNuc_190591.png)
After importing, the project selection dialog appears; click Finish.
![m2e-import2](http://static.oschina.net/uploads/space/2014/0403/104735_6vwG_190591.png)
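#### 2.2.2 Using the Maven Eclipse Plugin
If m2e is not installed, things are still easy as long as Maven is installed. Run this command in the project root:
mvn eclipse:eclipse
This generates Eclipse configuration files for the Maven project structure. Then choose General->Existing Projects into Workspace under File->Import to import the project.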
![eclipse-import-1](http://static.oschina.net/uploads/space/2014/0403/100025_DAcy_190591.png)
After importing, the project selection dialog appears; click Finish.
![eclipse-import-2](http://static.oschina.net/uploads/space/2014/0403/100227_73DJ_190591.png)
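### 2.3 Building and Running the Source
After a successful import there should be no compile errors. You can now run the example bundled in the webmagic-core project: "us.codecraft.webmagic.processor.example.GithubRepoPageProcessor".
Likewise, if the console shows output like the following, the source has been built and run successfully!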
![runlog](http://static.oschina.net/uploads/space/2014/0403/103741_3Gf5_190591.png)
<div style="page-break-after:always"></div>
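## 3. A Basic Crawler
### 3.1 Implementing PageProcessor
In WebMagic, a basic crawler takes just one class that implements the `PageProcessor` interface. That class contains essentially all the code you need to write to crawl a site.
Taking the earlier `GithubRepoPageProcessor` as an example, customizing a PageProcessor breaks down into three parts: configuring the crawler, extracting page elements, and discovering links.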
```java
public class GithubRepoPageProcessor implements PageProcessor {
// Part 1: site-level crawl configuration, including charset, crawl interval, and retry count
private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);
@Override
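// process is the core hook for custom crawler logic; write the extraction logic here
public void process(Page page) {
// Part 2: define how to extract information from the page and store it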
page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
if (page.getResultItems().get("name") == null) {
//skip this page
page.setSkip(true);
}
page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
// Part 3: discover follow-up URLs on the page to crawl
page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
}
@Override
public Site getSite() {
return site;
}
public static void main(String[] args) {
Spider.create(new GithubRepoPageProcessor())
// start crawling from "https://github.com/code4craft"
.addUrl("https://github.com/code4craft")
// crawl with 5 threads
.thread(5)
// start the crawler
.run();
}
}
```
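#### 3.1.1 Crawler Configuration
Part 1 configures the crawler: charset, crawl interval, timeout, retry count, and so on, plus settings that imitate a browser, such as User Agent and cookies, and proxy settings. These are covered in Chapter 5, "Configuring the Crawler". For now we use a simple setup: 3 retries and a crawl interval of one second.
#### 3.1.2 Extracting Page Elements
Part 2 is the core of the crawler: given the downloaded HTML page, how do you extract the information you want? WebMagic mainly uses three extraction techniques: XPath, regular expressions, and CSS selectors.
1. XPath
XPath was originally a query language for selecting elements in XML, but it works well for HTML too. For example: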
```java
page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()")
```
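This code uses XPath. It means: "find every h1 element whose class attribute is 'entry-title public', take the a child of its strong child, and extract the text of that a node".
The corresponding HTML looks like this: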
![xpath-html](http://static.oschina.net/uploads/space/2014/0404/104607_Aqq8_190591.png)
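2. CSS Selectors
CSS selectors are a language similar to XPath. Anyone who has done front-end development will recognize what $('h1.entry-title') means. Simple rules are easier to write with CSS selectors than with XPath, but more complex extraction rules become somewhat more cumbersome.
3. Regular Expressions
Regular expressions are a general-purpose text extraction language.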
```java
page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
```
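This code uses a regular expression; it matches every link of the form "https://github.com/code4craft/webmagic".
Detailed usage of XPath, CSS selectors, and regular expressions is covered in Chapter 4, "Extraction Tools in Detail".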
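The XPath expression from item 1 can be tried without WebMagic, using the XPath engine bundled with the JDK. This is only a minimal sketch: the markup in `SAMPLE` is a hypothetical, well-formed stand-in for the real GitHub page, `XPathSketch` is an illustrative name, and WebMagic's own XPath support may behave differently, especially on HTML that is not well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathSketch {
    // Hypothetical, simplified stand-in for the repository header of a GitHub page.
    static final String SAMPLE =
            "<html><body><h1 class='entry-title public'>"
            + "<span><a href='/code4craft'>code4craft</a></span>"
            + "<strong><a href='/code4craft/webmagic'>webmagic</a></strong>"
            + "</h1></body></html>";

    /** Parses the markup as XML and returns the string value of the first node the expression selects. */
    static String select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Selects the text of the <a> inside <strong> inside the matching <h1>.
        System.out.println(select(SAMPLE, "//h1[@class='entry-title public']/strong/a/text()")); // prints "webmagic"
    }
}
```

The `span` link is ignored because the expression only descends through `strong`, which is how the rule picks the repository name rather than the owner name.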
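#### 3.1.3 Discovering Links
With the page-processing logic in place, the crawler is nearly done!
One problem remains: a site has many pages, and we cannot list them all up front, so discovering follow-up links is an indispensable part of a crawler.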
```java
page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
```
This code has two parts. `page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all()` gets every link that matches the regular expression "(https://github\\.com/\\w+/\\w+)", and `page.addTargetRequests()` adds those links to the queue of URLs waiting to be crawled.
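To make the two steps concrete, here is a conceptual sketch that uses only the JDK. It is not WebMagic's implementation: `LinkDiscoverySketch`, `filterLinks`, and `pendingCount` are hypothetical names, the sketch assumes the regex contains at least one capturing group, and the de-duplication shown is simply one common way for a crawl queue to avoid fetching a URL twice.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkDiscoverySketch {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    /** Step 1: keep capture group 1 of every link in which the regex finds a match. */
    static List<String> filterLinks(List<String> links, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matched = new ArrayList<>();
        for (String link : links) {
            Matcher matcher = pattern.matcher(link);
            if (matcher.find()) {
                matched.add(matcher.group(1));
            }
        }
        return matched;
    }

    /** Step 2: queue each URL once; URLs that were queued before are ignored. */
    void addTargetRequests(List<String> urls) {
        for (String url : urls) {
            if (seen.add(url)) {
                pending.add(url);
            }
        }
    }

    /** Returns the next URL to fetch, or null when nothing is waiting. */
    String poll() {
        return pending.poll();
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "https://github.com/code4craft/webmagic",
                "https://github.com/code4craft",            // no repository segment, filtered out
                "https://github.com/code4craft/webmagic",   // duplicate, queued only once
                "https://example.com/about/team");          // other host, filtered out
        LinkDiscoverySketch frontier = new LinkDiscoverySketch();
        frontier.addTargetRequests(filterLinks(links, "(https://github\\.com/\\w+/\\w+)"));
        System.out.println(frontier.pendingCount()); // prints 1
    }
}
```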
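### 3.2 Using the Chained Selectable API
The chained API around `Selectable` is a core feature of WebMagic. With the Selectable interface you can extract page elements in a chain without worrying about the extraction details.
As the earlier example shows, page.getHtml() returns an `Html` object, which implements the `Selectable` interface. The interface has some important methods, which fall into two groups: extraction and result retrieval.
#### 3.2.1 Extraction API: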
| Method | Description | Example |
| ------------ | ------------- | ------------ |
| xpath(String xpath) | Select with XPath | html.xpath("//div[@class='title']") |
| \$(String selector) | Select with a CSS selector | html.\$("div.title") |
| \$(String selector,String attr) | Select with a CSS selector | html.\$("div.title","text") |
| css(String selector) | Same as $(); select with a CSS selector | html.css("div.title") |
| links() | Select all links | html.links() |
| regex(String regex) | Extract with a regular expression | html.regex("\<div\>(.\*?)\</div>") |
| regex(String regex,int group) | Extract with a regular expression, specifying the capture group | html.regex("\<div\>(.\*?)\</div>",1) |
| replace(String regex, String replacement) | Replace content | html.replace("\<script>.\*\</script>","")|
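These extraction methods all return a `Selectable`, which means extraction supports chained calls. The following example walks through the chained API.
Suppose I want to crawl all Java projects on GitHub. These projects appear in the search results at [https://github.com/search?l=Java&p=1&q=stars%3A%3E1&s=stars&type=Repositories](https://github.com/search?l=Java&p=1&q=stars%3A%3E1&s=stars&type=Repositories).
To keep the crawl from spreading too widely, I only take links from the pagination section. This rule is fairly complex, so how should it be written?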
![selectable-chain-ui](http://static.oschina.net/uploads/space/2014/0404/151454_2T01_190591.png)
First, the HTML structure of the page looks like this:
![selectable-chain](http://static.oschina.net/uploads/space/2014/0404/151632_88Oq_190591.png)
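So I can first use a CSS selector to pick out this div, then take all the links. To be safe, I also use a regular expression to constrain the format of the extracted URLs. The final code looks like this:
```java
List<String> urls = page.getHtml().css("div.pagination").links().regex(".*/search/\\?l=java.*").all();
```
Then we can add these URLs to the crawl list:
```java
List<String> urls = page.getHtml().css("div.pagination").links().regex(".*/search/\\?l=java.*").all();
page.addTargetRequests(urls);
```
Simple enough? Beyond discovering links, chained extraction with Selectable can do much more, as the examples in Chapter 9 will show.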
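One detail is easy to get wrong when writing such patterns in Java source: a literal `?` in a regular expression must be written as `\\?` inside a string literal, because a single backslash before `?` is not a valid Java escape and an unescaped `?` only makes the preceding character optional. The sketch below shows the difference with `java.util.regex`; the URL is a hypothetical pagination link and `QuestionMarkEscapeSketch` is an illustrative name.

```java
import java.util.regex.Pattern;

class QuestionMarkEscapeSketch {
    /** True when the whole URL matches the regex. */
    static boolean matches(String regex, String url) {
        return Pattern.matches(regex, url);
    }

    public static void main(String[] args) {
        String url = "https://github.com/search/?l=java&p=2"; // hypothetical pagination link

        // Escaped: "\\?" in Java source is the regex \? and matches a literal question mark.
        System.out.println(matches(".*/search/\\?l=java.*", url)); // prints true

        // Unescaped: "/?" means "an optional slash", so the literal "?" in the URL is never matched.
        System.out.println(matches(".*/search/?l=java.*", url)); // prints false
    }
}
```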
#### 3.2.2 获取结果的API:
当链式调用结束时,我们一般都想要拿到一个字符串类型的结果。这时候就需要用到获取结果的API了。我们知道,一条抽取规则,无论是XPath、CSS选择器或者正则表达式,总有可能抽取到多条元素。WebMagic对这些进行了统一,你可以通过不同的API获取到一个或者多个元素。
| 方法 | 说明 | 示例 |
| ------------ | ------------- | ------------ |
| get() | 返回一条String类型的结果 | String link= html.links().get()|
| toString() | 功能同get(),返回一条String类型的结果 | String link= html.links().toString()|
| all() | 返回所有抽取结果 | List<String> links= html.links().all()|
| match() | 是否有匹配结果 | if (html.links().match()){ xxx; }|
例如,我们知道页面只会有一条结果,那么可以使用selectable.get()或者selectable.toString()拿到这条结果。
这里selectable.toString()采用了toString()这个接口,是为了在输出以及和一些框架结合的时候,更加方便。因为一般情况下,我们都只需要选择一个元素!
selectable.all()则会获取到所有元素。
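The difference between one result and many can be pictured with a toy model. The class `Matches` below is hypothetical and is not WebMagic's implementation; it wraps a list of strings and offers `get()`, `all()`, and `match()` with the meanings described in the table above. Returning `null` from `get()` when nothing matched is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model only: a hypothetical class, not WebMagic's Selectable.
public class Matches {
    private final List<String> results;

    private Matches(List<String> results) {
        this.results = results;
    }

    // Collect every match of the regex in the text (the whole match, group 0).
    public static Matches of(String text, String regex) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return new Matches(found);
    }

    // First result, or null when nothing matched (an assumption of this sketch).
    public String get() {
        return results.isEmpty() ? null : results.get(0);
    }

    // All results, in the order they were found.
    public List<String> all() {
        return Collections.unmodifiableList(results);
    }

    // Whether there is at least one result.
    public boolean match() {
        return !results.isEmpty();
    }

    public static void main(String[] args) {
        Matches m = Matches.of("a1 b22 c333", "\\d+");
        System.out.println(m.get());   // first match only
        System.out.println(m.all());   // every match
        System.out.println(m.match()); // at least one match?
    }
}
```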
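Now that we have come this far, look back at the GithubRepoPageProcessor in 3.1; it probably reads more clearly. Run its main method and the crawled results are printed to the console.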
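### 3.3 Saving results
The crawler is now written, but one question remains: how do I save the crawled results? In WebMagic, the component that saves results is called a `Pipeline`. Printing results to the console, for example, is also done by a built-in Pipeline, `ConsolePipeline`. So what if I want to save the results as JSON instead? I only need to switch the Pipeline implementation to `JsonFilePipeline`.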
```java
public static void main(String[] args) {
    Spider.create(new GithubRepoPageProcessor())
            // Start crawling from "https://github.com/code4craft"
            .addUrl("https://github.com/code4craft")
            // Backslashes in a Java string literal must be escaped
            .addPipeline(new JsonFilePipeline("D:\\webmagic\\"))
            // Crawl with 5 threads
            .thread(5)
            // Start the spider
            .run();
}
```
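The downloaded files are then saved in the webmagic directory on drive D.
By customizing the Pipeline, we can also save results to files, databases, and so on. This is covered in Chapter 7, "Processing extraction results".
At this point we have written a basic crawler and given it some customization.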
<div style="page-break-after:always"></div>
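## 4. Extraction tools in detail
### 4.1 XPath
### 4.2 CSS selectors
### 4.3 Regular expressions
### 4.4 JsonPath
## 5. Configuring the crawler
### 5.1 Crawl rate
### 5.2 Encoding
### 5.3 Proxies
### 5.4 Setting cookies, UA, and other HTTP headers
### 5.5 Retry mechanism
### 5.6 Multithreading
## 6. Starting and stopping the crawler
### 6.1 Starting the crawler
### 6.2 Stopping the crawler
### 6.3 Setting the run time
### 6.4 Scheduled crawling
## 7. Processing extraction results
### 7.1 Output to the console
### 7.2 Saving to files
### 7.3 JSON output
### 7.4 Custom persistence (mysql/mongodb…)
## 8. Managing URLs
### 8.1 Adding URLs manually
### 8.2 Storing information in URLs
### 8.3 Several ways to manage URLs
### 8.4 Managing the crawler's URLs yourself
## 9. Examples
### 9.1 Basic list + detail page crawling
### 9.2 Crawling dynamic pages
### 9.3 Crawling paginated content
### 9.4 Scheduled crawling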