沈俊林 / webmagic
Commit 633e0fe8, authored Nov 28, 2013 by yihua.huang
document for avalon
parent 18a3af4a
Showing 5 changed files with 96 additions and 1 deletion
webmagic-avalon.md (+24 -0)
webmagic-scripts/README.md (+47 -0)
webmagic-scripts/src/main/resources/js/github.js (+14 -0)
webmagic-scripts/src/main/resources/js/oschina.js (+1 -1)
webmagic-scripts/src/main/resources/ruby/github.rb (+10 -0)
webmagic-avalon.md (0 → 100644)
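WebMagic-Avalon Project Manual
=======
The goal of the WebMagic-Avalon project is to build a configurable, manageable crawler, together with a platform for sharing configurations and scripts. This should reduce the development effort for experienced developers and let **people who are not familiar with Java** use a crawler easily.
## Part1: webmagic-scripts
Goal: make it possible to write crawlers as simple scripts, so that shareable scripts can be provided for common scenarios.
For example, to scrape GitHub repository data, you can write a script like this (JavaScript):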
[https://github.com/code4craft/webmagic/tree/master/webmagic-scripts](https://github.com/code4craft/webmagic/tree/master/webmagic-scripts)
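This feature is partly implemented, but the final result is still experimental. Everyone is welcome to take part and offer feedback.
## Part2: webmagic-pannel
An admin console that integrates script loading and crawler management. Planned.
## Part3: webmagic-market
A site where scripts can be shared, searched, and downloaded. Planned.
## How to participate
webmagic currently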
\ No newline at end of file
webmagic-scripts/README.md (0 → 100644)
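webmagic-scripts
======
## Goal:
Make it possible to write crawlers as simple scripts, so that shareable scripts can be provided for common scenarios.
## Example:
For example, to scrape GitHub repository data, you can write a script like this (JavaScript):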
```javascript
var name=xpath("//h1[@class='entry-title public']/strong/a/text()")
var readme=xpath("//div[@id='readme']/tidyText()")
var star=xpath("//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()")
var fork=xpath("//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()")
var url=page.getUrl().toString()
if (name!=null){
    println(name)
    println(readme)
    println(star)
    println(url)
}

urls("(https://github\\.com/\\w+/\\w+)")
urls("(https://github\\.com/\\w+)")
```
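Then load and start it with webmagic, without the usual steps of downloading dependencies, writing code, and running it.
If someone has already written the script, you can simply use it as is!
## Languages:
JavaScript was chosen because it has a broad user base. Ruby is currently supported as well; Ruby was chosen because its syntax makes for a more concise DSL: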
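To make the two `urls(...)` patterns in the script above more concrete, here is a minimal sketch that feeds the same pattern strings to plain JavaScript `RegExp`. It only illustrates what the regular expressions themselves match; how webmagic applies them internally is not shown here, and the `links` array and the `firstMatch` helper are hypothetical names introduced for this illustration.

```javascript
// Illustration only: plain JavaScript RegExp, not webmagic's own matching code.
// The pattern strings are copied from the script above. Inside a JS string
// literal, "\\." becomes the regex \. and "\\w+" becomes the regex \w+.
var repoPattern = new RegExp("(https://github\\.com/\\w+/\\w+)");
var userPattern = new RegExp("(https://github\\.com/\\w+)");

// Hypothetical sample links, as they might appear on a crawled page.
var links = [
    "https://github.com/code4craft/webmagic",
    "https://github.com/code4craft",
    "https://example.com/code4craft/webmagic"
];

// Hypothetical helper: returns the captured group of the first match,
// or null when the pattern does not match the link at all.
function firstMatch(pattern, link) {
    var m = pattern.exec(link);
    return m === null ? null : m[1];
}

var repoMatches = links.map(function (link) { return firstMatch(repoPattern, link); });
var userMatches = links.map(function (link) { return firstMatch(userPattern, link); });

console.log(repoMatches);
console.log(userMatches);
```

One thing worth noting about the patterns: `\w` covers only letters, digits, and underscores, so a user or repository name containing a hyphen would be matched only up to the hyphen.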
```ruby
name= xpath "//h1[@class='entry-title public']/strong/a/text()"
readme = xpath "//div[@id='readme']/tidyText()"
star = xpath "//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()"
fork = xpath "//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()"
url=$page.getUrl().toString()

puts name, readme, star, fork, url unless name==nil

urls "(https://github\\.com/\\w+/\\w+)"
urls "(https://github\\.com/\\w+)"
```
This feature is still experimental. Everyone is welcome to take part and offer feedback.
\ No newline at end of file
webmagic-scripts/src/main/resources/js/github.js (0 → 100644)
var name=xpath("//h1[@class='entry-title public']/strong/a/text()")
var readme=xpath("//div[@id='readme']/tidyText()")
var star=xpath("//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()")
var fork=xpath("//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()")
var url=page.getUrl().toString()
if (name!=null){
    println(name)
    println(readme)
    println(star)
    println(url)
}

urls("(https://github\\.com/\\w+/\\w+)")
urls("(https://github\\.com/\\w+)")
\ No newline at end of file
webmagic-scripts/src/main/resources/js/oschina.js
@@ -8,4 +8,4 @@ var config = {
 }
 title = $("div.BlogTitle h1"),
 content = $("div.BlogContent")
-urls("http://my\\.oschina\\.net/flashsword/blog/\\d+")
\ No newline at end of file
+urls("http://my\\.oschina\\.net/flashsword/blog/\\d+")
webmagic-scripts/src/main/resources/ruby/github.rb (0 → 100644)
name= xpath "//h1[@class='entry-title public']/strong/a/text()"
readme = xpath "//div[@id='readme']/tidyText()"
star = xpath "//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()"
fork = xpath "//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()"
url=$page.getUrl().toString()

puts name, readme, star, fork, url unless name==nil

urls "(https://github\\.com/\\w+/\\w+)"
urls "(https://github\\.com/\\w+)"
\ No newline at end of file