Commits · 20422f1b632d71c0e890d736860d3f7726465c50 · 沈俊林 / webmagic

29 Sep, 2014 1 commit
- Merge pull request #161 from zhugw/patch-4 · 20422f1b
  Yihua Huang authored Sep 29, 2014
```
Update FileCacheQueueScheduler.java
```
  20422f1b
14 Sep, 2014 1 commit

- Update FileCacheQueueScheduler.java · eb3c78b9
  zhugw authored Sep 14, 2014

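Isn't this more rigorous? Otherwise, when the crawl is restarted after an interruption, the (first) entry URL is still added to the queue and written to the file again.
However, another problem remains. Suppose the first pass has finished completely (detected via spider.getStatus==Stopped), the process sleeps for 24 hours, and then crawls again (by recursively calling the crawl method).
Unlike a restart after an interruption, lineReader==cursor in this case, so the queue is empty after initialization and the entry URL is already in the urls set. The crawl threads therefore finish immediately, which leaves no way to crawl content newly added to the site.
Option 1:
Once the crawl is determined to be complete, overwrite the cursor file right away. On the second crawl the cursor is 0, so every URL in urls.txt is put into the queue, and new URLs can be discovered through them.
Option 2:
An optimization of option 1. Option 1 meets the business requirement but does a lot of wasted work, such as downloading, extracting, and persisting every old target URL again. New content generally shows up under a HelpUrl, for example when a page gains a new post or a few more pages are added. So from the second pass onward, only the HelpUrls need to be put into the queue.

I would appreciate feedback: is my understanding above correct, are there cases I have not considered, or is there a simpler approach? Thanks!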
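
The two restart strategies described in this commit message can be sketched in plain Java. This is only an illustration under stated assumptions: `RecrawlSketch`, its method names, and the example URLs are hypothetical and are not part of webmagic; the stored URL list stands in for urls.txt and the integer cursor for the cursor file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model of the restart behaviour discussed above; not webmagic code.
public class RecrawlSketch {

    // Resume after an interruption: only lines at or after the cursor are pending.
    // When the previous pass finished, cursor == urls.size() and nothing is queued.
    static List<String> resume(List<String> urls, int cursor) {
        int from = Math.max(0, Math.min(cursor, urls.size()));
        return new ArrayList<>(urls.subList(from, urls.size()));
    }

    // Option 1: reset the cursor to 0 so every stored URL is queued again.
    static List<String> restartAll(List<String> urls) {
        return resume(urls, 0);
    }

    // Option 2: queue only the stored URLs that match the help-URL pattern.
    static List<String> restartHelpOnly(List<String> urls, Pattern helpUrl) {
        List<String> pending = new ArrayList<>();
        for (String url : urls) {
            if (helpUrl.matcher(url).matches()) {
                pending.add(url);
            }
        }
        return pending;
    }
}
```

With option 2, the re-queued help URLs would have to bypass the duplicate check while newly discovered links still go through it, so that old target pages are skipped; that detail is an assumption of this sketch rather than something the commit message settles.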

eb3c78b9

12 Sep, 2014 2 commits

- Merge pull request #159 from zhugw/patch-3 · 3a9c1d30
  Yihua Huang authored Sep 12, 2014
```
Update Site.java
```
3a9c1d30

- Update Site.java · bc666e92
  zhugw authored Sep 12, 2014

The javadoc for setCycleRetryTimes says: Set cycleRetryTimes times when download fail, 0 by default. Only work in RedisScheduler.
However, reading the source, there appears to be no such restriction, i.e. nothing limits it to RedisScheduler. Is this javadoc out of date?

bc666e92

11 Sep, 2014 3 commits

- update urls.contains to DuplicateRemover in FileCacheQueueScheduler #157 · 42a30074
  yihua.huang authored Sep 11, 2014

42a30074
- Merge pull request #157 from zhugw/patch-1 · 689e89a9
  Yihua Huang authored Sep 11, 2014
```
Update FileCacheQueueScheduler.java
```
689e89a9

- Update FileCacheQueueScheduler.java · 1db940a0
  zhugw authored Sep 11, 2014

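While using it I found duplicate URLs in urls.txt. Tracing the source showed that after the file is loaded at initialization, all URLs are read into a set, but when URLs to crawl are added later there is no check for whether they are already in that set (i.e. in the file), which is how duplicates end up in the file. I have modified the source accordingly; please review.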
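
The fix described in this commit message amounts to checking the set of already-stored URLs before queuing and persisting a new one. A minimal sketch follows, assuming an in-memory list in place of urls.txt; `DedupQueueSketch` and its methods are hypothetical names and do not reproduce the actual FileCacheQueueScheduler code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical stand-in for a file-backed scheduler; not webmagic's actual class.
public class DedupQueueSketch {
    private final Set<String> seen = new HashSet<>();
    private final Queue<String> queue = new ArrayDeque<>();
    // Stands in for the lines of urls.txt.
    private final List<String> storedLines = new ArrayList<>();

    // Load the URLs persisted by a previous run into both the store and the seen set.
    public DedupQueueSketch(List<String> previouslyStored) {
        storedLines.addAll(previouslyStored);
        seen.addAll(previouslyStored);
    }

    // Queue and persist a URL only if it has not been seen before.
    // Returns true when the URL was accepted, false when it was a duplicate.
    public boolean push(String url) {
        if (!seen.add(url)) {
            return false;
        }
        queue.add(url);
        storedLines.add(url);
        return true;
    }

    // Next URL to crawl, or null when the queue is empty.
    public String poll() {
        return queue.poll();
    }

    public List<String> storedLines() {
        return storedLines;
    }
}
```

Because `Set.add` reports whether the element was new, the check and the insertion happen in one step, so a URL can reach the stored list at most once.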

1db940a0

09 Sep, 2014 1 commit
- remove duplicate setPath in ProxyPool · 147401ce
  yihua.huang authored Sep 09, 2014
  
  147401ce
21 Aug, 2014 2 commits
- fix package name =.= · 3734865a
  yihua.huang authored Aug 21, 2014
  
  3734865a
- fix SourceRegion error and add some tests on it #144 · e7668e01
  yihua.huang authored Aug 21, 2014
  
  e7668e01
18 Aug, 2014 2 commits
- fix test cont' · 4e5ba020
  yihua.huang authored Aug 18, 2014
  
  4e5ba020
- fix test · 4446669c
  yihua.huang authored Aug 18, 2014
  
  4446669c
14 Aug, 2014 1 commit
- Disable jsoup entity escape by Default. Set Html.DISABLE_HTML_ENTITY_ESCAPE to... · 9866297e
  yihua.huang authored Aug 14, 2014
```
Disable jsoup entity escape by Default. Set Html.DISABLE_HTML_ENTITY_ESCAPE to false to enable it.  #149
```
  9866297e
13 Aug, 2014 1 commit
- more friendly exception message in PlainText #144 · 4e6e946d
  yihua.huang authored Aug 13, 2014
  
  4e6e946d
25 Jun, 2014 3 commits
- update assertj to test scope · ebb931e0
  yihua.huang authored Jun 25, 2014
  
  ebb931e0
- move thread package out of selector (because it is add by mistake at the beginning) · af993962
  yihua.huang authored Jun 25, 2014
  
  af993962
- change path seperator for varient OS #139 · 2fd8f05f
  yihua.huang authored Jun 25, 2014
  
  2fd8f05f
10 Jun, 2014 1 commit
- new sample · eae37c86
  yihua.huang authored Jun 10, 2014
  
  eae37c86
09 Jun, 2014 4 commits
- some fix for tests #130 · b3a282e5
  yihua.huang authored Jun 09, 2014
  
  b3a282e5
- t push origin masterMerge branch 'yxssfxwzy-proxy' · b75e64a6
  yihua.huang authored Jun 09, 2014
  
  b75e64a6
- Merge branch 'proxy' of github.com:yxssfxwzy/webmagic into yxssfxwzy-proxy · 074d767f
  yihua.huang authored Jun 09, 2014
  
  074d767f
- add test and fix bug of proxy module · 2f89cfc3
  zwf authored Jun 09, 2014
  
  2f89cfc3
04 Jun, 2014 8 commits
- fix test · eb89d665
  yihua.huang authored Jun 04, 2014
  
  eb89d665
- contributor · 2a15bc02
  yihua.huang authored Jun 04, 2014
  
  2a15bc02
- contributor · 5e8ca02e
  yihua.huang authored Jun 04, 2014
  
  5e8ca02e
- update version in docs · db0195ba
  yihua.huang authored Jun 04, 2014
  
  db0195ba
- update version · 5f8c3fd5
  yihua.huang authored Jun 04, 2014
  
  5f8c3fd5
- update pom · 0e9042ee
  yihua.huang authored Jun 04, 2014
  
  0e9042ee
- update pom · 03170178
  yihua.huang authored Jun 04, 2014
  
  03170178
- update pom for deploy · c83b74f0
  yihua.huang authored Jun 04, 2014
  
  c83b74f0
03 Jun, 2014 1 commit
- Bugfix: selector does not works well in element #113 · 7a64847a
  yihua.huang authored Jun 03, 2014
  
  7a64847a
28 May, 2014 1 commit
- change back return proxy from spider to httpclientdownloader #128 · 8d67fd03
  yihua.huang authored May 28, 2014
  
  8d67fd03
27 May, 2014 8 commits
- change return proxy from spider to httpclientdownloader #128 · 40bf8ca5
  yihua.huang authored May 27, 2014
  
  40bf8ca5
- spell mistake fix #128 · 1f21d9cc
  yihua.huang authored May 27, 2014
  
  1f21d9cc
- Merge pull request #128 from yxssfxwzy/proxy · e310139d
  Yihua Huang authored May 27, 2014
```
Management of multiple proxies
```
  e310139d
- Bugfix:Type convert error in JsonPathSelector #129 · b1650904
  yihua.huang authored May 27, 2014
  
  b1650904
- update xsoup version to release #113 · 95bdb302
  yihua.huang authored May 27, 2014
  
  95bdb302
- fix ut #113 · a5d1b56e
  yihua.huang authored May 27, 2014
  
  a5d1b56e
- Bugfix: nodes() only return the first element #113 · 3939074a
  yihua.huang authored May 27, 2014
  
  3939074a
- refactor of selectable cont' #113 · 41c2ea94
  yihua.huang authored May 27, 2014
```
1. remove lazy init of Html
2. rename strings to sourceTexts for better meaning
3. make getSourceTexts abstract and DO NOT always store strings
4. instead store parsed elements of document in HtmlNode
```
  41c2ea94