Commit 866ab0a0
authored Aug 03, 2013 by yihua.huang

update email

parent 7c9e9ce8

Showing 17 changed files with 109 additions and 28 deletions (+109 -28)
ResultItems.java  ...core/src/main/java/us/codecraft/webmagic/ResultItems.java  +1 -1
Spider.java  ...agic-core/src/main/java/us/codecraft/webmagic/Spider.java  +0 -4
AfterExtractor.java  ...java/us/codecraft/webmagic/annotation/AfterExtractor.java  +15 -0
ExtractBy.java  ...main/java/us/codecraft/webmagic/annotation/ExtractBy.java  +1 -1
ExtractByUrl.java  ...n/java/us/codecraft/webmagic/annotation/ExtractByUrl.java  +1 -1
FieldExtractor.java  ...java/us/codecraft/webmagic/annotation/FieldExtractor.java  +1 -1
HelpUrl.java  ...c/main/java/us/codecraft/webmagic/annotation/HelpUrl.java  +1 -1
OOSpider.java  .../main/java/us/codecraft/webmagic/annotation/OOSpider.java  +29 -0
ObjectPageProcessor.java  ...us/codecraft/webmagic/annotation/ObjectPageProcessor.java  +1 -1
ObjectPipeline.java  ...java/us/codecraft/webmagic/annotation/ObjectPipeline.java  +23 -5
PageModelExtractor.java  .../us/codecraft/webmagic/annotation/PageModelExtractor.java  +13 -1
PageModelPipeline.java  ...a/us/codecraft/webmagic/annotation/PageModelPipeline.java  +14 -0
TargetUrl.java  ...main/java/us/codecraft/webmagic/annotation/TargetUrl.java  +2 -1
Destroyable.java  ...in/java/us/codecraft/webmagic/downloader/Destroyable.java  +1 -1
TestFetcher.java  ...st/java/us/codecraft/webmagic/annotation/TestFetcher.java  +2 -6
IteyeBlog.java  ...a/us/codecraft/webmagic/annotation/samples/IteyeBlog.java  +2 -2
OschinaBlog.java  ...us/codecraft/webmagic/annotation/samples/OschinaBlog.java  +2 -2
webmagic-core/src/main/java/us/codecraft/webmagic/ResultItems.java
@@ -5,7 +5,7 @@ import java.util.Map;
 /**
  * Holds the extraction results produced by a PageProcessor and passed to {@link us.codecraft.webmagic.pipeline.Pipeline} for persistence.<br>
- * @author yihua.huang@dianping.com <br>
+ * @author code4crafter@gmail.com <br>
  * @date: 13-7-25 <br>
  * Time: 12:20 PM <br>
  */
webmagic-core/src/main/java/us/codecraft/webmagic/Spider.java
@@ -90,10 +90,6 @@ public class Spider implements Runnable, Task {
         return new Spider(pageProcessor);
     }

-    public static Spider create(Site site, Class... pageModels) {
-        return new Spider(ObjectPageProcessor.create(site, pageModels));
-    }
-
     /**
      * Resets the startUrls, overriding the startUrls of the Site itself.
      *
webmagic-core/src/main/java/us/codecraft/webmagic/annotation/AfterExtractor.java
new file, 0 → 100644
+package us.codecraft.webmagic.annotation;
+
+import us.codecraft.webmagic.Page;
+
+/**
+ * Implement this interface to run post-processing after extraction.<br>
+ *
+ * @author code4crafter@gmail.com <br>
+ * @date: 13-8-3 <br>
+ * Time: 9:42 AM <br>
+ */
+public interface AfterExtractor<T> {
+
+    public void afterProcess(Page page, T t);
+}
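To make the intent of the new hook concrete, here is a minimal, self-contained sketch of how a page model might use it. The `Page` class below is only a stand-in for `us.codecraft.webmagic.Page`, and `Blog`, `extract`, and the field names are hypothetical; only the `afterProcess(Page, T)` signature comes from the diff.

```java
public class AfterExtractorSketch {

    /** Stand-in for us.codecraft.webmagic.Page; only a URL is modeled (assumption). */
    static class Page {
        final String url;

        Page(String url) {
            this.url = url;
        }
    }

    /** Same shape as the interface added in this commit. */
    interface AfterExtractor<T> {
        void afterProcess(Page page, T t);
    }

    /** Hypothetical page model that cleans itself up after extraction. */
    static class Blog implements AfterExtractor<Blog> {
        String title;
        String sourceUrl;

        @Override
        public void afterProcess(Page page, Blog blog) {
            // Normalize the extracted title and record where it came from.
            blog.title = blog.title == null ? "" : blog.title.trim();
            blog.sourceUrl = page.url;
        }
    }

    /** Stands in for annotation-driven extraction followed by the hook call. */
    static Blog extract(Page page, String rawTitle) {
        Blog blog = new Blog();
        blog.title = rawTitle;
        // As in the diff, the hook is a separate instance that receives the model.
        AfterExtractor<Blog> hook = new Blog();
        hook.afterProcess(page, blog);
        return blog;
    }

    public static void main(String[] args) {
        Blog blog = extract(new Page("http://example.com/post/1"), "  Hello  ");
        System.out.println(blog.title + " <- " + blog.sourceUrl);
    }
}
```

The hook receives both the page and the populated model, so it can derive values that field-level extraction rules cannot express.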
webmagic-core/src/main/java/us/codecraft/webmagic/annotation/ExtractBy.java
@@ -5,7 +5,7 @@ import java.lang.annotation.Retention;
 import java.lang.annotation.Target;

 /**
- * @author yihua.huang@dianping.com <br>
+ * @author code4crafter@gmail.com <br>
  * @date: 13-8-1 <br>
  * Time: 8:40 PM <br>
  */
webmagic-core/src/main/java/us/codecraft/webmagic/annotation/ExtractByUrl.java
@@ -5,7 +5,7 @@ import java.lang.annotation.Retention;
 import java.lang.annotation.Target;

 /**
- * @author yihua.huang@dianping.com <br>
+ * @author code4crafter@gmail.com <br>
  * @date: 13-8-1 <br>
  * Time: 8:40 PM <br>
  */
webmagic-core/src/main/java/us/codecraft/webmagic/annotation/FieldExtractor.java
@@ -6,7 +6,7 @@ import java.lang.reflect.Field;
 import java.lang.reflect.Method;

 /**
- * @author yihua.huang@dianping.com <br>
+ * @author code4crafter@gmail.com <br>
  * @date: 13-8-1 <br>
  * Time: 9:48 PM <br>
  */
webmagic-core/src/main/java/us/codecraft/webmagic/annotation/HelpUrl.java
@@ -5,7 +5,7 @@ import java.lang.annotation.Retention;
 import java.lang.annotation.Target;

 /**
- * @author yihua.huang@dianping.com <br>
+ * @author code4crafter@gmail.com <br>
  * @date: 13-8-1 <br>
  * Time: 8:40 PM <br>
  */
webmagic-core/src/main/java/us/codecraft/webmagic/annotation/OOSpider.java
new file, 0 → 100644
+package us.codecraft.webmagic.annotation;
+
+import us.codecraft.webmagic.Site;
+import us.codecraft.webmagic.Spider;
+import us.codecraft.webmagic.processor.PageProcessor;
+
+/**
+ * @author code4crafter@gmail.com <br>
+ * @date: 13-8-3 <br>
+ * Time: 9:51 AM <br>
+ */
+public class OOSpider extends Spider {
+
+    /**
+     * Creates a Spider from predefined extraction rules.
+     *
+     * @param pageProcessor the predefined extraction rules
+     */
+    public OOSpider(PageProcessor pageProcessor) {
+        super(pageProcessor);
+    }
+
+    public static OOSpider create(Site site, Class... pageModels) {
+        OOSpider ooSpider = new OOSpider(ObjectPageProcessor.create(site, pageModels));
+        ooSpider.pipeline(new ObjectPipeline());
+        return ooSpider;
+    }
+}
webmagic-core/src/main/java/us/codecraft/webmagic/annotation/ObjectPageProcessor.java
@@ -12,7 +12,7 @@ import java.util.Set;
 import java.util.regex.Pattern;

 /**
- * @author yihua.huang@dianping.com <br>
+ * @author code4crafter@gmail.com <br>
  * @date: 13-8-1 <br>
  * Time: 8:46 PM <br>
  */
webmagic-core/src/main/java/us/codecraft/webmagic/annotation/ObjectPipeline.java
@@ -4,18 +4,36 @@ import us.codecraft.webmagic.ResultItems;
 import us.codecraft.webmagic.Task;
 import us.codecraft.webmagic.pipeline.Pipeline;

+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+
 /**
- * @author yihua.huang@dianping.com <br>
+ * @author code4crafter@gmail.com <br>
  * @date: 13-8-2 <br>
  * Time: 10:47 AM <br>
  */
 public class ObjectPipeline implements Pipeline {

-    @Override
-    public void process(ResultItems resultItems, Task task) {
+    private Map<Class, PageModelPipeline> pageModelPipelines = new ConcurrentHashMap<Class, PageModelPipeline>();
+
+    public ObjectPipeline() {
+    }
+
+    public ObjectPipeline put(Class clazz, PageModelPipeline pageModelPipeline) {
+        pageModelPipelines.put(clazz, pageModelPipeline);
+        return this;
     }

-    public <T> T read() {
-        return null;
+    @Override
+    public void process(ResultItems resultItems, Task task) {
+        if (resultItems.isSkip()) {
+            return;
+        }
+        for (Map.Entry<Class, PageModelPipeline> classPageModelPipelineEntry : pageModelPipelines.entrySet()) {
+            Object o = resultItems.get(classPageModelPipelineEntry.getKey().getCanonicalName());
+            if (o != null) {
+                classPageModelPipelineEntry.getValue().process(o, task);
+            }
+        }
     }
 }
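The dispatch idea in the new `process` method (look up each registered class by its canonical name and hand any match to that class's handler) can be sketched without the WebMagic types. In the sketch below, `ModelHandler` stands in for `PageModelPipeline` with the `Task` argument omitted, a plain `Map<String, Object>` stands in for `ResultItems`, and the `isSkip()` check is left out; all names other than `put` and `process` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DispatchSketch {

    /** Stand-in for PageModelPipeline<T>; the Task argument is omitted (assumption). */
    interface ModelHandler<T> {
        void process(T model);
    }

    private final Map<Class<?>, ModelHandler<Object>> handlers = new ConcurrentHashMap<>();

    /** Registers a handler for one model class; returns this for chaining, as in the diff. */
    public <T> DispatchSketch put(Class<T> clazz, ModelHandler<? super T> handler) {
        handlers.put(clazz, model -> handler.process(clazz.cast(model)));
        return this;
    }

    /** results stands in for ResultItems: extracted objects keyed by canonical class name. */
    public void process(Map<String, Object> results) {
        for (Map.Entry<Class<?>, ModelHandler<Object>> entry : handlers.entrySet()) {
            Object model = results.get(entry.getKey().getCanonicalName());
            if (model != null) {
                entry.getValue().process(model);
            }
        }
    }

    /** Runs one dispatch round and returns what the String handler received. */
    static List<String> demo() {
        List<String> seen = new ArrayList<>();
        DispatchSketch pipeline = new DispatchSketch().put(String.class, seen::add);

        Map<String, Object> results = new HashMap<>();
        results.put(String.class.getCanonicalName(), "a blog title");
        results.put(Integer.class.getCanonicalName(), 42); // no handler registered, so ignored
        pipeline.process(results);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Keying by canonical class name keeps the result container free of model-specific types, at the cost of giving up compile-time checks between a class and its handler; the sketch restores some of that safety with `Class.cast`.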
webmagic-core/src/main/java/us/codecraft/webmagic/annotation/PageModelExtractor.java
@@ -16,7 +16,7 @@ import java.util.List;
 import java.util.regex.Pattern;

 /**
- * @author yihua.huang@dianping.com <br>
+ * @author code4crafter@gmail.com <br>
  * @date: 13-8-1 <br>
  * Time: 9:33 PM <br>
  */
@@ -30,6 +30,8 @@ class PageModelExtractor {
     private List<FieldExtractor> fieldExtractors;

+    private AfterExtractor afterExtractor;
+
     public static PageModelExtractor create(Class clazz) {
         PageModelExtractor pageModelExtractor = new PageModelExtractor();
         pageModelExtractor.init(clazz);
@@ -40,6 +42,13 @@ class PageModelExtractor {
         this.clazz = clazz;
         initTargetUrlPatterns();
         fieldExtractors = new ArrayList<FieldExtractor>();
+        if (clazz.isAssignableFrom(AfterExtractor.class)){
+            try {
+                afterExtractor=(AfterExtractor)clazz.newInstance();
+            } catch (Exception e) {
+                throw new IllegalArgumentException(e);
+            }
+        }
         for (Field field : clazz.getDeclaredFields()) {
             field.setAccessible(true);
             if (!field.getType().isAssignableFrom(String.class)){
@@ -147,6 +156,9 @@ class PageModelExtractor {
                 }
                 setField(o, fieldExtractor, value);
             }
+            if (afterExtractor!=null){
+                afterExtractor.afterProcess(page,o);
+            }
         } catch (InstantiationException e) {
             e.printStackTrace();
         } catch (IllegalAccessException e) {
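One detail in the `init` hunk may deserve a second look: `clazz.isAssignableFrom(AfterExtractor.class)` asks whether an `AfterExtractor` reference could be assigned to `clazz`, which is not the same as asking whether `clazz` implements `AfterExtractor`. The sketch below contrasts the two orientations using a stand-in interface and a hypothetical `BlogModel`; it is an observation about the snippet as shown, not a statement about later versions of the project.

```java
public class AssignableCheck {

    /** Stand-in for the AfterExtractor interface added in this commit. */
    interface AfterExtractor<T> {
        void afterProcess(Object page, T t);
    }

    /** Hypothetical page model that implements the hook. */
    static class BlogModel implements AfterExtractor<BlogModel> {
        @Override
        public void afterProcess(Object page, BlogModel model) {
            // No-op: only the type relationship matters here.
        }
    }

    /** Orientation used in the hunk: clazz.isAssignableFrom(AfterExtractor.class). */
    static boolean asWrittenInDiff(Class<?> clazz) {
        return clazz.isAssignableFrom(AfterExtractor.class);
    }

    /** Orientation that asks "does clazz implement AfterExtractor?". */
    static boolean implementsHook(Class<?> clazz) {
        return AfterExtractor.class.isAssignableFrom(clazz);
    }

    public static void main(String[] args) {
        System.out.println(asWrittenInDiff(BlogModel.class)); // false
        System.out.println(implementsHook(BlogModel.class));  // true
    }
}
```

With the orientation from the hunk, a model class that implements the interface would not be detected, so `afterExtractor` would stay `null` and the hook would be skipped.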
webmagic-core/src/main/java/us/codecraft/webmagic/annotation/PageModelPipeline.java
new file, 0 → 100644
+package us.codecraft.webmagic.annotation;
+
+import us.codecraft.webmagic.Task;
+
+/**
+ * @author code4crafter@gmail.com <br>
+ * @date: 13-8-3 <br>
+ * Time: 9:34 AM <br>
+ */
+public interface PageModelPipeline<T> {
+
+    public void process(T t, Task task);
+}
webmagic-core/src/main/java/us/codecraft/webmagic/annotation/TargetUrl.java
@@ -5,7 +5,7 @@ import java.lang.annotation.Retention;
 import java.lang.annotation.Target;

 /**
- * @author yihua.huang@dianping.com <br>
+ * @author code4crafter@gmail.com <br>
  * @date: 13-8-1 <br>
  * Time: 8:40 PM <br>
  */
@@ -14,4 +14,5 @@ import java.lang.annotation.Target;
 public @interface TargetUrl {
     String[] value();
 }
webmagic-core/src/main/java/us/codecraft/webmagic/downloader/Destroyable.java
@@ -2,7 +2,7 @@ package us.codecraft.webmagic.downloader;
 /**
  * Resource-heavy services can implement this interface; Spider calls destroy() when it finishes to release resources.<br>
- * @author yihua.huang@dianping.com <br>
+ * @author code4crafter@gmail.com <br>
  * @date: 13-7-26 <br>
  * Time: 3:10 PM <br>
  */
webmagic-core/src/test/java/us/codecraft/webmagic/annotation/TestFetcher.java
@@ -3,7 +3,6 @@ package us.codecraft.webmagic.annotation;
 import org.junit.Ignore;
 import org.junit.Test;
 import us.codecraft.webmagic.Site;
-import us.codecraft.webmagic.Spider;

 /**
  * @author yihua.huang@dianping.com <br>
@@ -16,12 +15,9 @@ public class TestFetcher {
     @Test
     public void test() {
         ObjectPipeline objectPipeline = new ObjectPipeline();
-        Spider.create(ObjectPageProcessor.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog/145796"), OschinaBlog.class))
-                .pipeline(objectPipeline).runAsync();
+        OOSpider.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog/145796"), OschinaBlog.class)
+                .pipeline(objectPipeline);
         OschinaBlog oschinaBlog = null;
-        while ((oschinaBlog = objectPipeline.read()) != null) {
-            System.out.println(oschinaBlog);
-        }
     }

 }
webmagic-samples/src/main/java/us/codecraft/webmagic/annotation/samples/IteyeBlog.java
 package us.codecraft.webmagic.annotation.samples;

 import us.codecraft.webmagic.Site;
-import us.codecraft.webmagic.Spider;
 import us.codecraft.webmagic.annotation.ExtractBy;
+import us.codecraft.webmagic.annotation.OOSpider;
 import us.codecraft.webmagic.annotation.TargetUrl;

 /**
@@ -28,7 +28,7 @@ public class IteyeBlog implements Blog{
     }

     public static void main(String[] args) {
-        Spider.create(Site.me().addStartUrl("http://dengminhui.iteye.com/blog"), IteyeBlog.class).run();
+        OOSpider.create(Site.me().addStartUrl("http://dengminhui.iteye.com/blog"), IteyeBlog.class).run();
     }

     public String getTitle() {
webmagic-samples/src/main/java/us/codecraft/webmagic/annotation/samples/OschinaBlog.java
 package us.codecraft.webmagic.annotation.samples;

 import us.codecraft.webmagic.Site;
-import us.codecraft.webmagic.Spider;
 import us.codecraft.webmagic.annotation.ExtractBy;
+import us.codecraft.webmagic.annotation.OOSpider;
 import us.codecraft.webmagic.annotation.TargetUrl;

 /**
@@ -28,7 +28,7 @@ public class OschinaBlog implements Blog{
     }

     public static void main(String[] args) {
-        Spider.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"), OschinaBlog.class).run();
+        OOSpider.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"), OschinaBlog.class).run();
     }

     public String getTitle() {