- 29 Sep, 2014 1 commit
-
-
Yihua Huang authored
Update FileCacheQueueScheduler.java
-
- 14 Sep, 2014 1 commit
-
-
zhugw authored
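Isn't this more rigorous? Otherwise, when the crawl is restarted after an interruption, the (first) entry URL is still added to the queue and written to the file. There is now another problem, though. Suppose the first pass has finished completely (detected via spider.getStatus==Stopped), the process sleeps for 24 hours, and then crawls again (via a recursive call to the crawl method). Unlike a restart after an interruption, lineReader==cursor here, so the queue is empty at initialization and the entry URL is already in the urls set, which makes the crawl thread finish immediately. As a result there is no way to pick up new content on the site. Solution 1: once the crawl is known to be complete, immediately overwrite the cursor file. On the second crawl the cursor is 0, so every URL in urls.txt is put into the queue, and new URLs can be discovered through them. Solution 2: an optimization of Solution 1. Solution 1 meets the business requirement but does a lot of wasted work, such as downloading, extracting, and persisting all the old target URLs again. New content usually shows up on HelpUrl pages, for example when a page gains a new post or a few more pages appear. So on the second and later passes, only the HelpUrl pages need to be put into the queue. I would appreciate feedback: is my understanding above correct, are there cases I have not considered, or is there a simpler solution? Thanks!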
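The two restart cases described in this message can be sketched as follows. This is a minimal illustration of Solution 2, not the code in FileCacheQueueScheduler: the class `RecrawlSeedSketch`, the method `initialQueue`, and the help-URL pattern are all hypothetical names, and the URL list and cursor are passed in directly instead of being read from urls.txt and the cursor file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper; it only decides which URLs seed the queue at startup. */
class RecrawlSeedSketch {

    /**
     * @param urls           every URL recorded so far (stand-in for urls.txt)
     * @param cursor         number of URLs already processed (stand-in for the cursor file)
     * @param helpUrlPattern assumed pattern that identifies list/index ("help") pages
     */
    static List<String> initialQueue(List<String> urls, int cursor, Pattern helpUrlPattern) {
        int start = Math.max(0, Math.min(cursor, urls.size()));
        if (start < urls.size()) {
            // Interrupted run: resume with the URLs that were not processed yet.
            return new ArrayList<>(urls.subList(start, urls.size()));
        }
        // Completed run: re-seed with help URLs only, so new posts and pages
        // can be discovered without re-downloading every old target URL.
        List<String> seeds = new ArrayList<>();
        for (String url : urls) {
            if (helpUrlPattern.matcher(url).matches()) {
                seeds.add(url);
            }
        }
        return seeds;
    }
}
```

In a real scheduler the re-seeded URLs would also have to bypass the duplicate check, since they are already in the urls set; the sketch leaves that part out.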
-
- 12 Sep, 2014 2 commits
-
-
Yihua Huang authored
Update Site.java
-
zhugw authored
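The javadoc of setCycleRetryTimes says: "Set cycleRetryTimes times when download fail, 0 by default. Only work in RedisScheduler." From reading the source, however, there does not seem to be any such restriction, i.e. that it can only be used with RedisScheduler. Is this javadoc out of date?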
-
- 11 Sep, 2014 3 commits
-
-
yihua.huang authored
-
Yihua Huang authored
Update FileCacheQueueScheduler.java
-
zhugw authored
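While using it I found that urls.txt contained duplicate URLs. Tracing the source showed that after the file is loaded at initialization, all of its URLs are read into a set, but when URLs to be crawled are added later, nothing checks whether they are already in that set (i.e. already in the file), which leads to duplicate URLs in the file. I have modified the source accordingly; please review.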
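The fix described in this message amounts to consulting the loaded set before enqueueing and persisting a URL. A minimal sketch, assuming an in-memory list in place of urls.txt; `DedupSchedulerSketch` and its methods are hypothetical and are not the FileCacheQueueScheduler API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

/** Hypothetical stand-in for a file-backed URL scheduler. */
class DedupSchedulerSketch {
    private final Set<String> seenUrls = new HashSet<>();
    private final Queue<String> queue = new ArrayDeque<>();
    // Stand-in for the lines of urls.txt; a real scheduler would append to the file.
    private final List<String> persistedLines = new ArrayList<>();

    /** Loads the URLs that were already written by a previous run. */
    DedupSchedulerSketch(List<String> linesLoadedFromFile) {
        for (String url : linesLoadedFromFile) {
            seenUrls.add(url);
            persistedLines.add(url);
        }
    }

    /** Enqueues and persists the URL only if it has not been seen before. */
    boolean push(String url) {
        if (!seenUrls.add(url)) {
            return false; // already in the set, so already in the file
        }
        queue.add(url);
        persistedLines.add(url);
        return true;
    }

    /** Returns the next URL to crawl, or null if the queue is empty. */
    String poll() {
        return queue.poll();
    }

    List<String> persistedLines() {
        return persistedLines;
    }
}
```

Using `Set.add` as the check keeps the test and the insertion in one step, so a URL loaded from the file and a URL pushed earlier in the same run are rejected by the same path.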
-
- 09 Sep, 2014 1 commit
-
-
yihua.huang authored
-
- 21 Aug, 2014 2 commits
-
-
yihua.huang authored
-
yihua.huang authored
-
- 18 Aug, 2014 2 commits
-
-
yihua.huang authored
-
yihua.huang authored
-
- 14 Aug, 2014 1 commit
-
-
yihua.huang authored
Disable jsoup entity escape by default. Set Html.DISABLE_HTML_ENTITY_ESCAPE to false to enable it. #149
-
- 13 Aug, 2014 1 commit
-
-
yihua.huang authored
-
- 25 Jun, 2014 3 commits
-
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
- 10 Jun, 2014 1 commit
-
-
yihua.huang authored
-
- 09 Jun, 2014 4 commits
-
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
zwf authored
-
- 04 Jun, 2014 8 commits
-
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
- 03 Jun, 2014 1 commit
-
-
yihua.huang authored
-
- 28 May, 2014 1 commit
-
-
yihua.huang authored
-
- 27 May, 2014 8 commits
-
-
yihua.huang authored
-
yihua.huang authored
-
Yihua Huang authored
Management of multiple proxies
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
-
yihua.huang authored
1. remove lazy init of Html 2. rename strings to sourceTexts for better meaning 3. make getSourceTexts abstract and DO NOT always store strings 4. instead store parsed elements of document in HtmlNode
-