- 11 Jul, 2015 1 commit
-
-
bingoko authored
The configuration file is config.ini. The dependencies are updated in pom.xml. Updated SeleniumDownloader and WebDriverPool to support PhantomJS. NOTE: The versions of GhostDriver, Selenium, and PhantomJS used are stable and validated. A Google Play example is in the samples package: GooglePlayProcessor.java
-
- 09 Mar, 2015 2 commits
-
-
Yihua Huang authored
Fix the bug where site.setHttpProxy() has no effect
-
高军 authored
-
- 18 Feb, 2015 1 commit
-
-
Yihua Huang authored
Bug fix: concurrency bug in MultiPagePipeline and DoubleKeyMap
-
- 13 Feb, 2015 1 commit
-
-
edwardsbean authored
-
- 22 Jan, 2015 1 commit
-
-
Yihua Huang authored
add retry sleep time
-
- 21 Jan, 2015 1 commit
-
-
edwardsbean authored
-
- 13 Jan, 2015 1 commit
-
-
yihua.huang authored
-
- 29 Sep, 2014 2 commits
-
-
yihua.huang authored
-
Yihua Huang authored
Update FileCacheQueueScheduler.java
-
- 14 Sep, 2014 1 commit
-
-
zhugw authored
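Isn't this more rigorous? Otherwise, when restarting after an interruption, the (first) entry URL would still be added to the queue and written to the file. However, another problem now exists: if the first pass has finished crawling everything (determined by spider.getStatus==Stopped), we sleep for 24 hours and then crawl again (by recursively calling the crawl method). Unlike a restart after an interruption, at this point lineReader==cursor, so the queue is empty at initialization and the entry URL is already in the urls set, which causes the crawl threads to finish immediately. That leaves no way to crawl content newly added to the site. Solution 1: once crawling is determined to be complete, immediately overwrite the cursor file, so that on the second crawl the cursor is 0 and all URLs in urls.txt are put into the queue; new URLs can then be discovered through them. Solution 2: an optimization of Solution 1. Although Solution 1 meets the business requirement, it does a lot of wasted work, such as downloading, extracting, and persisting all the old target URLs again. New content generally appears under HelpUrl, for example a page that gains a new post or a listing that gains a few more pages. So on the second and later crawls, only the HelpUrl entries need to be put into the queue. I would appreciate feedback on whether my understanding above is correct, whether there are cases I have not considered, or whether there is a simpler solution. Thanks!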
-
- 12 Sep, 2014 2 commits
-
-
Yihua Huang authored
Update Site.java
-
zhugw authored
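The javadoc of setCycleRetryTimes says: "Set cycleRetryTimes times when download fail, 0 by default. Only work in RedisScheduler." However, from reading the source code there does not appear to be any such restriction, i.e. that it can only be used with RedisScheduler. So I would like to ask whether this javadoc is out of date.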
-
- 11 Sep, 2014 3 commits
-
-
yihua.huang authored
-
Yihua Huang authored
Update FileCacheQueueScheduler.java
-
zhugw authored
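While using it, I found duplicate URLs in urls.txt. Tracing the source code showed that after the file is loaded at initialization, all URLs are read into a set, but when URLs to be crawled are added later there is no check for whether they are already in that set (i.e. in the file), which results in duplicate URLs in the file. I have modified the source accordingly; I would appreciate the author's review.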
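The duplicate-URL fix described in this commit message can be illustrated with a minimal sketch. It is not the actual FileCacheQueueScheduler code: the class `DedupUrlLog`, its methods, and the in-memory list standing in for urls.txt are all hypothetical names chosen for illustration. The idea is only that a URL is appended to the file when the set of already-seen URLs did not contain it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for a file-backed URL queue.
// An in-memory list plays the role of urls.txt so the sketch stays self-contained.
class DedupUrlLog {
    // URLs already present in the "file", loaded at initialization.
    private final Set<String> seen = new HashSet<>();
    // Stand-in for the lines of urls.txt.
    private final List<String> fileLines = new ArrayList<>();

    // Load the existing file content into both the set and the line list.
    DedupUrlLog(List<String> existingLines) {
        seen.addAll(existingLines);
        fileLines.addAll(existingLines);
    }

    // Append the URL only if it has not been seen before.
    // Set.add returns false when the element was already present.
    boolean push(String url) {
        if (!seen.add(url)) {
            return false; // duplicate: do not write it to the file again
        }
        fileLines.add(url);
        return true;
    }

    // Return a copy of the current "file" content.
    List<String> lines() {
        return new ArrayList<>(fileLines);
    }
}
```

Without the `seen.add(url)` check in `push`, every re-discovered URL would be appended again, which is the duplication the commit message reports.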
-
- 09 Sep, 2014 1 commit
-
-
yihua.huang authored
-
- 21 Aug, 2014 2 commits
-
-
yihua.huang authored
-
yihua.huang authored
-
- 18 Aug, 2014 2 commits
-
-
yihua.huang authored
-
yihua.huang authored
-
- 14 Aug, 2014 1 commit
-
-
yihua.huang authored
Disable jsoup entity escaping by default. Set Html.DISABLE_HTML_ENTITY_ESCAPE to false to enable it. #149
-
- 13 Aug, 2014 1 commit
-
-
yihua.huang authored
-
- 25 Jun, 2014 3 commits
-
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
- 10 Jun, 2014 1 commit
-
-
yihua.huang authored
-
- 09 Jun, 2014 4 commits
-
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
zwf authored
-
- 04 Jun, 2014 8 commits
-
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
- 03 Jun, 2014 1 commit
-
-
yihua.huang authored
-