沈俊林 / webmagic / Commits / b13f1da0

Commit b13f1da0, authored Apr 01, 2014 by yihua.huang

reformat

Parent: 7038c00a

Showing 1 changed file with 48 additions and 48 deletions.

zh_docs/README.md (+48, -48), viewed at b13f1da0
...
...
@@ -38,27 +38,27 @@ webmagic's GitHub address: [https://github.com/code4craft/webmagic](https://githu
webmagic uses Maven to manage dependencies. Add the corresponding dependencies to your project to use webmagic:
```xml
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.4.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.4.3</version>
</dependency>
```
WebMagic uses slf4j-log4j12 as its slf4j implementation. If you provide your own slf4j implementation, remove this dependency from your project.
```xml
<exclusions>
    <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
</exclusions>
```
#### Project structure
...
...
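@@ -96,30 +96,30 @@ webmagic also includes two usable extension packages; because both of these packages depend on relatively
PageProcessor is part of webmagic-core. Writing a custom PageProcessor is all it takes to implement your own crawler logic. The following code crawls OSC blogs: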
```java
public class OschinaBlogPageProcesser implements PageProcessor {

    private Site site = Site.me().setDomain("my.oschina.net");

    @Override
    public void process(Page page) {
        List<String> links = page.getHtml().links().regex("http://my\\.oschina\\.net/flashsword/blog/\\d+").all();
        page.addTargetRequests(links);
        page.putField("title", page.getHtml().xpath("//div[@class='BlogEntity']/div[@class='BlogTitle']/h1").toString());
        page.putField("content", page.getHtml().$("div.content").toString());
        page.putField("tags", page.getHtml().xpath("//div[@class='BlogTags']/a/text()").all());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new OschinaBlogPageProcesser()).addUrl("http://my.oschina.net/flashsword/blog")
                .addPipeline(new ConsolePipeline()).run();
    }
}
```
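
The link filter in `process` hinges on a single regular expression. The following is a minimal sketch using only `java.util.regex`, not webmagic's own selector code. The class name `BlogLinkFilterSketch` and the sample URLs are made up for illustration, and the sketch assumes the whole URL must match the pattern, which may differ from how webmagic applies it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class BlogLinkFilterSketch {

    // Same pattern string as in the PageProcessor example.
    private static final Pattern BLOG_POST =
            Pattern.compile("http://my\\.oschina\\.net/flashsword/blog/\\d+");

    // Keeps only the URLs that match the pattern in full (an assumption of this sketch).
    static List<String> filter(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (BLOG_POST.matcher(url).matches()) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs, not taken from the real site.
        List<String> urls = Arrays.asList(
                "http://my.oschina.net/flashsword/blog/12345",
                "http://my.oschina.net/flashsword/blog",
                "http://my.oschina.net/someoneelse/blog/12345");
        System.out.println(filter(urls));
    }
}
```

Only the first sample URL survives: the second has no numeric post id, and the third belongs to a different user path.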
...
...
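@@ -134,24 +134,24 @@ Spider is the crawler's entry class. Pipeline is the interface for result output and persistence,
webmagic-extension supports writing crawlers with annotations: adding annotations to a POJO is enough to produce a complete crawler. The following code again crawls oschina blogs and does exactly the same thing as OschinaBlogPageProcesser: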
```java
@TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
public class OschinaBlog {

    @ExtractBy("//title")
    private String title;

    @ExtractBy(value = "div.BlogContent", type = ExtractBy.Type.Css)
    private String content;

    @ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
    private List<String> tags;

    public static void main(String[] args) {
        OOSpider.create(
                Site.me(),
                new ConsolePageModelPipeline(), OschinaBlog.class).addUrl("http://my.oschina.net/flashsword/blog").run();
    }
}
```
This example defines a Model class whose fields 'title', 'content', and 'tags' are the attributes to extract. The class can be reused in a Pipeline.
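
To make the idea of an annotated Model class more concrete, here is a rough sketch of how a framework could read extraction rules off a POJO's fields through reflection. It is not webmagic's implementation: the `Rule` annotation, `BlogModel`, and `rulesOf` are hypothetical names invented for this illustration, and the sketch only collects the rules without applying them to a page.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

final class AnnotationModelSketch {

    // Hypothetical stand-in for an extraction annotation; NOT webmagic's @ExtractBy.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Rule {
        String value();
    }

    // A POJO whose annotated fields declare what should be extracted.
    static class BlogModel {
        @Rule("//title")
        private String title;

        @Rule("div.BlogContent")
        private String content;

        private String notExtracted; // no annotation, so it is ignored
    }

    // Collects field name -> rule for every annotated field, sorted by field name.
    static Map<String, String> rulesOf(Class<?> modelClass) {
        Map<String, String> rules = new TreeMap<>();
        for (Field field : modelClass.getDeclaredFields()) {
            Rule rule = field.getAnnotation(Rule.class);
            if (rule != null) {
                rules.put(field.getName(), rule.value());
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(rulesOf(BlogModel.class));
    }
}
```

Because the rules live on the class itself, any component that receives the class can discover which fields to fill, which is one plausible reason such a Model class is easy to reuse.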
...
...