Commits · 05a1f3956936910d8d5d82001d8976cc4576c3aa · 沈俊林 / webmagic

18 Feb, 2015 1 commit
- Merge pull request #193 from EdwardsBean/fix-mppipeline · 05a1f395
  Yihua Huang authored Feb 18, 2015
```
Bug fix:MultiPagePipeline and DoubleKeyMap concurrent bug
```
  05a1f395
13 Feb, 2015 1 commit
- fix bug:MultiPagePipeline and DoubleKeyMap concurrent bug · 74962d69
  edwardsbean authored Feb 13, 2015
  
  74962d69
22 Jan, 2015 1 commit
- Merge pull request #188 from EdwardsBean/retry_time · 6b9d21fc
  Yihua Huang authored Jan 22, 2015
```
add retry sleep time
```
  6b9d21fc
21 Jan, 2015 1 commit
- add retry sleep time · 49786656
  edwardsbean authored Jan 21, 2015
  
  49786656
13 Jan, 2015 1 commit
- add NPE check for POST method · 8ffc1a70
  yihua.huang authored Jan 13, 2015
  
  8ffc1a70
29 Sep, 2014 2 commits
- remove commented code · 8551b668
  yihua.huang authored Sep 29, 2014
  
  8551b668
- Merge pull request #161 from zhugw/patch-4 · 20422f1b
  Yihua Huang authored Sep 29, 2014
```
Update FileCacheQueueScheduler.java
```
  20422f1b
14 Sep, 2014 1 commit

- Update FileCacheQueueScheduler.java · eb3c78b9
  zhugw authored Sep 14, 2014

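Isn't this more rigorous? Otherwise, when the spider is restarted after an interruption, the (first) entry URL is still added to the queue and written to the file again.
However, another problem remains. Suppose the first pass has crawled everything (detected via spider.getStatus==Stopped), the program sleeps for 24 hours, and then crawls again (by calling the crawl method recursively).
Unlike a restart after an interruption, lineReader==cursor in this case, so the queue is empty at initialization and the entry URL is already in the urls set. As a result the crawl threads finish immediately, and there is no way to pick up content newly added to the site.
Option 1:
Once the crawl is judged complete, immediately overwrite the cursor file. On the second crawl the cursor is 0, so every URL in urls.txt is put into the queue, and new URLs can be discovered through them.
Option 2:
An optimization of option 1. Option 1 meets the business requirement but does a lot of wasted work: it still downloads, extracts, and persists every old target URL. New content usually appears under a HelpUrl, for example when a page gains a new post or a few more pages of content are added. So on the second and later passes, only the HelpUrls need to be put into the queue.

I would appreciate feedback: is my understanding above correct, are there cases I have not considered, or is there a simpler solution? Thanks!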

eb3c78b9
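
The three start-up behaviours discussed in the commit message above can be compared in a minimal sketch. It is only an illustration under stated assumptions: `RecrawlSeeds`, its method names, and the `isHelpUrl` predicate are hypothetical and are not part of webmagic or of `FileCacheQueueScheduler`; the list `urls` stands in for the contents of urls.txt and `cursor` for the value stored in the cursor file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper (not webmagic code): which URLs get queued at start-up. */
class RecrawlSeeds {

    /** Resume after an interruption: only URLs recorded after the cursor are pending. */
    static List<String> resume(List<String> urls, int cursor) {
        int from = Math.max(0, Math.min(cursor, urls.size()));
        return new ArrayList<>(urls.subList(from, urls.size()));
    }

    /** Option 1: treat the cursor as 0, so every recorded URL is queued again. */
    static List<String> resetCursor(List<String> urls) {
        return resume(urls, 0);
    }

    /** Option 2: queue only the help (listing) URLs, skipping old target URLs. */
    static List<String> helpUrlsOnly(List<String> urls, Predicate<String> isHelpUrl) {
        List<String> seeds = new ArrayList<>();
        for (String url : urls) {
            if (isHelpUrl.test(url)) {
                seeds.add(url);
            }
        }
        return seeds;
    }
}
```

When the cursor equals the number of recorded URLs, `resume` returns an empty list, which corresponds to the "threads finish immediately" case described above. With option 2, the duplicate check would presumably also have to let the re-queued help URLs through; that part is not shown here.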

12 Sep, 2014 2 commits

- Merge pull request #159 from zhugw/patch-3 · 3a9c1d30
  Yihua Huang authored Sep 12, 2014
```
Update Site.java
```
3a9c1d30

- Update Site.java · bc666e92
  zhugw authored Sep 12, 2014

The javadoc for setCycleRetryTimes says: Set cycleRetryTimes times when download fail, 0 by default. Only work in RedisScheduler.
However, the source code does not seem to enforce any such restriction to RedisScheduler. Is this javadoc out of date?

bc666e92

11 Sep, 2014 3 commits

- update urls.contains to DuplicateRemover in FileCacheQueueScheduler #157 · 42a30074
  yihua.huang authored Sep 11, 2014

42a30074
- Merge pull request #157 from zhugw/patch-1 · 689e89a9
  Yihua Huang authored Sep 11, 2014
```
Update FileCacheQueueScheduler.java
```
689e89a9

- Update FileCacheQueueScheduler.java · 1db940a0
  zhugw authored Sep 11, 2014

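While using it I found duplicate URLs in urls.txt. Tracing the source code showed that, after the file is loaded at initialization, all of its URLs are read into a set, but URLs added for crawling afterwards are never checked against that set (that is, against the file), which is how duplicate URLs end up in the file. I have modified the source accordingly; please review.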

1db940a0
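
The duplicate check described in the commit message above can be sketched as follows. This is a simplified stand-in and not webmagic's actual scheduler: `DedupUrlQueue` and its methods are hypothetical names, and an in-memory list takes the place of urls.txt so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical file-backed URL queue that refuses URLs it has already recorded. */
class DedupUrlQueue {
    private final Set<String> seen = new LinkedHashSet<>();
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    // Stands in for the lines of urls.txt (assumption: one URL per line).
    private final List<String> fileLines = new ArrayList<>();

    /** Load previously recorded URLs so later pushes can be checked against them. */
    DedupUrlQueue(List<String> urlsLoadedFromFile) {
        for (String url : urlsLoadedFromFile) {
            if (seen.add(url)) {
                fileLines.add(url);
            }
        }
    }

    /** Queue and record a URL only if it has not been seen before. */
    synchronized boolean push(String url) {
        if (!seen.add(url)) {
            return false; // already in the set, hence already in the file
        }
        queue.add(url);
        fileLines.add(url);
        return true;
    }

    /** Next pending URL, or null if the queue is empty. */
    String poll() {
        return queue.poll();
    }

    /** Copy of the recorded lines, in insertion order. */
    synchronized List<String> fileLines() {
        return new ArrayList<>(fileLines);
    }
}
```

The key point is that the same set consulted at load time is also consulted on every `push`, so a URL is written to the backing file at most once.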

09 Sep, 2014 1 commit
- remove duplicate setPath in ProxyPool · 147401ce
  yihua.huang authored Sep 09, 2014
  
  147401ce
21 Aug, 2014 2 commits
- fix package name =.= · 3734865a
  yihua.huang authored Aug 21, 2014
  
  3734865a
- fix SourceRegion error and add some tests on it #144 · e7668e01
  yihua.huang authored Aug 21, 2014
  
  e7668e01
18 Aug, 2014 2 commits
- fix test cont' · 4e5ba020
  yihua.huang authored Aug 18, 2014
  
  4e5ba020
- fix test · 4446669c
  yihua.huang authored Aug 18, 2014
  
  4446669c
14 Aug, 2014 1 commit
- Disable jsoup entity escape by Default. Set Html.DISABLE_HTML_ENTITY_ESCAPE to... · 9866297e
  yihua.huang authored Aug 14, 2014
```
Disable jsoup entity escape by Default. Set Html.DISABLE_HTML_ENTITY_ESCAPE to false to enable it.  #149
```
  9866297e
13 Aug, 2014 1 commit
- more friendly exception message in PlainText #144 · 4e6e946d
  yihua.huang authored Aug 13, 2014
  
  4e6e946d
25 Jun, 2014 3 commits
- update assertj to test scope · ebb931e0
  yihua.huang authored Jun 25, 2014
  
  ebb931e0
- move thread package out of selector (because it was added by mistake at the beginning) · af993962
  yihua.huang authored Jun 25, 2014
  
  af993962
- change path separator for various OSes #139 · 2fd8f05f
  yihua.huang authored Jun 25, 2014
  
  2fd8f05f
10 Jun, 2014 1 commit
- new sample · eae37c86
  yihua.huang authored Jun 10, 2014
  
  eae37c86
09 Jun, 2014 4 commits
- some fix for tests #130 · b3a282e5
  yihua.huang authored Jun 09, 2014
  
  b3a282e5
- Merge branch 'yxssfxwzy-proxy' · b75e64a6
  yihua.huang authored Jun 09, 2014
  
  b75e64a6
- Merge branch 'proxy' of github.com:yxssfxwzy/webmagic into yxssfxwzy-proxy · 074d767f
  yihua.huang authored Jun 09, 2014
  
  074d767f
- add test and fix bug of proxy module · 2f89cfc3
  zwf authored Jun 09, 2014
  
  2f89cfc3
04 Jun, 2014 8 commits
- fix test · eb89d665
  yihua.huang authored Jun 04, 2014
  
  eb89d665
- contributor · 2a15bc02
  yihua.huang authored Jun 04, 2014
  
  2a15bc02
- contributor · 5e8ca02e
  yihua.huang authored Jun 04, 2014
  
  5e8ca02e
- update version in docs · db0195ba
  yihua.huang authored Jun 04, 2014
  
  db0195ba
- update version · 5f8c3fd5
  yihua.huang authored Jun 04, 2014
  
  5f8c3fd5
- update pom · 0e9042ee
  yihua.huang authored Jun 04, 2014
  
  0e9042ee
- update pom · 03170178
  yihua.huang authored Jun 04, 2014
  
  03170178
- update pom for deploy · c83b74f0
  yihua.huang authored Jun 04, 2014
  
  c83b74f0
03 Jun, 2014 1 commit
- Bugfix: selector does not work well in element #113 · 7a64847a
  yihua.huang authored Jun 03, 2014
  
  7a64847a
28 May, 2014 1 commit
- change back return proxy from spider to httpclientdownloader #128 · 8d67fd03
  yihua.huang authored May 28, 2014
  
  8d67fd03
27 May, 2014 2 commits
- change return proxy from spider to httpclientdownloader #128 · 40bf8ca5
  yihua.huang authored May 27, 2014
  
  40bf8ca5
- spell mistake fix #128 · 1f21d9cc
  yihua.huang authored May 27, 2014
  
  1f21d9cc