Commit f946fcdf authored by yihua.huang

backup docs in Chinese

parent 5cb45af3
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.Page]]></key>
<data><![CDATA[ <pre class="zh">
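Page stores the result of the most recent fetch and lets you define the links to crawl next.
Main methods:
{@link #getUrl()} gets the URL of the page
{@link #getHtml()} gets the HTML content of the page
{@link #putField(String, Object)} saves an extracted result
{@link #getResultItems()} gets the extracted results; called in {@link us.codecraft.webmagic.pipeline.Pipeline}
{@link #addTargetRequests(java.util.List)} {@link #addTargetRequest(String)} add links to be crawled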
</pre>
<pre class="en">
Stores the extracted results and the URLs to be crawled.
Main methods:
{@link #getUrl()} gets the URL of the current page
{@link #getHtml()} gets the HTML content of the current page
{@link #putField(String, Object)} saves an extracted result
{@link #getResultItems()} gets the extracted results, for use in {@link us.codecraft.webmagic.pipeline.Pipeline}
{@link #addTargetRequests(java.util.List)} {@link #addTargetRequest(String)} add URLs to crawl
</pre>
@author code4crafter@gmail.com <br>
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Page.putField(java.lang.String, java.lang.Object)]]></key>
<data><![CDATA[
@param key the key of the result
@param field the value of the result
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Page.getHtml()]]></key>
<data><![CDATA[ Gets the HTML content of the page.
@return the HTML content of the page
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Page.addTargetRequests(java.util.List<java.lang.String>)]]></key>
<data><![CDATA[ Adds links to be crawled.
@param requests the links to be crawled
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Page.addTargetRequest(java.lang.String)]]></key>
<data><![CDATA[ Adds a link to be crawled.
@param requestString the link to be crawled
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Page.addTargetRequest(us.codecraft.webmagic.Request)]]></key>
<data><![CDATA[ Adds a page to be crawled; use this when extra information must be passed along.
@param request the page to be crawled
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Page.getUrl()]]></key>
<data><![CDATA[ Gets the URL of the page.
@return the URL of the current page, which can be used for extraction
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Page.setUrl(us.codecraft.webmagic.selector.Selectable)]]></key>
<data><![CDATA[ Sets the URL.
@param url
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Page.getRequest()]]></key>
<data><![CDATA[ Gets the fetch request.
@return the fetch request
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.PagedModel]]></key>
<data><![CDATA[ Implement this interface to support paged crawling in the spider.<br>
@author code4crafter@gmail.com <br>
Date: 13-8-4 <br>
Time: 5:18 PM <br>
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.Request]]></key>
<data><![CDATA[ <div class="zh">
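A Request object wraps the information about a URL to be crawled.<br/>
In a PageProcessor, the Request object can be obtained via {@link us.codecraft.webmagic.Page#getRequest()}.<br/>
<br/>
A Request carries an extra property that can hold whatever context is needed, which is useful in some situations.<br/>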
<pre>
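Example:
When crawling <a href="${link}">${linktext}</a>, you want to follow the link and keep the linktext.
On the previous page:
public void process(Page page){
    // attach linktext as extra context for the next page
    Request request = new Request(link,linktext);
    page.addTargetRequest(request);
}
On the next page:
public void process(Page page){
    // read back the extra context attached on the previous page
    String linktext = (String)page.getRequest().getExtra()[0];
}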
</pre>
</div>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 11:37 AM
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Request(java.lang.String)]]></key>
<data><![CDATA[ Constructs a Request object.
@param url required; the URL to be crawled
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Request.setPriority(double)]]></key>
<data><![CDATA[ Sets the priority, used to order the URL queue.<br>
Requires extending Scheduler.<br>
No Scheduler implementation supports priority yet.<br>
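As a rough illustration of what a priority-aware queue could look like, here is a minimal sketch built on java.util.concurrent.PriorityBlockingQueue. The names PriorityUrlQueueSketch and PrioritizedUrl are hypothetical and are not part of webmagic; the sketch only assumes the rule stated here, that a larger priority is served earlier.

```java
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical sketch, not part of webmagic: a URL queue that serves higher priorities first.
class PriorityUrlQueueSketch {

    // Illustrative stand-in for a request carrying a priority; not the webmagic Request class.
    static final class PrioritizedUrl implements Comparable<PrioritizedUrl> {
        final String url;
        final double priority;

        PrioritizedUrl(String url, double priority) {
            this.url = url;
            this.priority = priority;
        }

        @Override
        public int compareTo(PrioritizedUrl other) {
            // Reverse the natural order so that larger priorities come out first.
            return Double.compare(other.priority, this.priority);
        }
    }

    private final PriorityBlockingQueue<PrioritizedUrl> queue = new PriorityBlockingQueue<>();

    void push(String url, double priority) {
        queue.add(new PrioritizedUrl(url, priority));
    }

    // Returns the URL with the highest priority, or null when the queue is empty.
    String poll() {
        PrioritizedUrl next = queue.poll();
        return next == null ? null : next.url;
    }
}
```

A real Scheduler would also take a Task argument and would probably need deduplication; both are left out to keep the sketch short.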
@param priority the priority; larger values are placed earlier in the queue
@return this
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Request.getUrl()]]></key>
<data><![CDATA[ Gets the URL to be crawled.
@return the URL to be crawled
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.ResultItems]]></key>
<data><![CDATA[ Holds the extracted results. It is produced by the PageProcessor and passed to {@link us.codecraft.webmagic.pipeline.Pipeline} for persistence.<br>
@author code4crafter@gmail.com <br>
Date: 13-7-25 <br>
Time: 12:20 PM <br>
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.ResultItems.isSkip()]]></key>
<data><![CDATA[ Whether to skip this page; a pipeline uses this to decide whether to process the page.
@return whether to skip; true means skip
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.ResultItems.setSkip(boolean)]]></key>
<data><![CDATA[ Sets whether to skip this page; a pipeline uses this to decide whether to process the page.
@param skip
@return this
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site]]></key>
<data><![CDATA[ Site defines the settings for a site to be crawled.<br>
The getter methods of this class are generally called only by the crawler framework itself.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 12:13 PM
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.me()]]></key>
<data><![CDATA[ Creates a Site object; equivalent to new Site().
@return the newly created object
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.addCookie(java.lang.String, java.lang.String)]]></key>
<data><![CDATA[ Adds a cookie for this site, which can be used to crawl sites that require login. The cookie's domain is the same as {@link #getDomain()}.
@param name the cookie name
@param value the cookie value
@return this
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.setUserAgent(java.lang.String)]]></key>
<data><![CDATA[ Sets the user-agent for this site. Many sites restrict access by user-agent, so leaving this unset may produce unexpected results.
@param userAgent the user-agent string
@return this
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.getCookies()]]></key>
<data><![CDATA[ Gets all cookies that have been set.
@return all cookies that have been set
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.getUserAgent()]]></key>
<data><![CDATA[ Gets the user-agent that has been set.
@return the user-agent that has been set
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.getDomain()]]></key>
<data><![CDATA[ Gets the domain that has been set.
@return the domain that has been set
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.setDomain(java.lang.String)]]></key>
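<data><![CDATA[ Sets the domain of this site; required.<br>
Crawling multiple domains is not currently supported. To crawl multiple domains, create a new Spider for each.
@param domain the domain the spider will crawl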
@return this
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.setCharset(java.lang.String)]]></key>
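<data><![CDATA[ Sets the page encoding. If unset, it is detected automatically from the HTML meta information.<br>
Normally there is no need to set the encoding; set this option if the downloaded content is garbled.<br>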
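To make the automatic detection concrete, the following is a minimal sketch of reading a charset declaration from an HTML meta tag with a regular expression. It is not webmagic's actual detection code; the class MetaCharsetSketch, the method detect, and the fallback parameter are all hypothetical names introduced for illustration.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not webmagic's actual detection code.
class MetaCharsetSketch {

    // Matches charset=... inside a <meta ...> tag, with or without quotes around the value.
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta[^>]*charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset in lower case, or the fallback when none is declared.
    static String detect(String html, String fallback) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : fallback;
    }
}
```

A page that declares the wrong charset, or none at all, is the case where setting the encoding explicitly helps.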
@param charset the encoding, typically "utf-8" or "gbk"
@return this
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.getCharset()]]></key>
<data><![CDATA[ Gets the encoding that has been set.
@return the encoding that has been set
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.setAcceptStatCode(java.util.Set<java.lang.Integer>)]]></key>
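<data><![CDATA[ Sets the acceptable HTTP status codes. The page content is read only when the status code is in this set.<br>
The default is 200; normally there is no need to set this.<br>
Some sites return incorrect status codes, in which case this option can be adjusted.<br>
@param acceptStatCode the acceptable status codes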
@return this
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.getAcceptStatCode()]]></key>
<data><![CDATA[ Gets the acceptable status codes.
@return the acceptable status codes
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.getStartUrls()]]></key>
<data><![CDATA[ Gets the list of start page URLs.
@return the list of start page URLs
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.addStartUrl(java.lang.String)]]></key>
<data><![CDATA[ Adds a start page URL. Call this method repeatedly to add several start URLs.
@param startUrl the URL of a start page
@return this
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.setSleepTime(int)]]></key>
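<data><![CDATA[ Sets the interval between two fetches, to avoid putting too much load on the target site (or being blocked by a firewall).
@param sleepTime the interval in milliseconds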
@return this
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.getSleepTime()]]></key>
<data><![CDATA[ Gets the interval between two fetches.
@return the interval between two fetches, in milliseconds
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.getRetryTimes()]]></key>
<data><![CDATA[ Gets the number of download retries; the default is 0.
@return the number of download retries
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Site.setRetryTimes(int)]]></key>
<data><![CDATA[ Sets the number of download retries; the default is 0.
@return this
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.Spider]]></key>
<data><![CDATA[ <pre>
The entry class of the webmagic spider.
Examples:
Define the simplest possible spider:
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*")).run();
Use FilePipeline to save the results to files:
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*"))
.pipeline(new FilePipeline("/data/temp/webmagic/")).run();
Use FileCacheQueueScheduler to cache URLs, so that after the spider is stopped the next run resumes from the page where it left off:
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*"))
.scheduler(new FileCacheQueueScheduler("/data/temp/webmagic/cache/")).run();
</pre>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 6:53 AM
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Spider(us.codecraft.webmagic.processor.PageProcessor)]]></key>
<data><![CDATA[ Creates a new Spider from the given extraction rules.
@param pageProcessor the extraction rules to use
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Spider.create(us.codecraft.webmagic.processor.PageProcessor)]]></key>
<data><![CDATA[ Creates a new Spider from the given extraction rules.
@param pageProcessor the extraction rules to use
@return the newly created Spider
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Spider.startUrls(java.util.List<java.lang.String>)]]></key>
<data><![CDATA[ Resets startUrls, overriding the startUrls defined on the Site itself.
@param startUrls
@return this
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Spider.setUUID(java.lang.String)]]></key>
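<data><![CDATA[ Sets a unique ID for the spider, used to identify the task. By default the domain is used as the uuid; when running multiple tasks against a single domain, give each task a different ID.
@param uuid the unique ID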
@return this
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Spider.scheduler(us.codecraft.webmagic.scheduler.Scheduler)]]></key>
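<data><![CDATA[ Sets the scheduler. The scheduler stores the URLs to be crawled and can handle deduplication, synchronization, persistence, and so on. By default an in-memory blocking queue is used.
@param scheduler the scheduler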
@return this
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Spider.pipeline(us.codecraft.webmagic.pipeline.Pipeline)]]></key>
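<data><![CDATA[ Sets the pipeline. The pipeline post-processes the final extraction results, for example saving them to files or to a database. By default results are printed to the console.
@param pipeline the pipeline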
@return this
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Spider.test(java.lang.String...)]]></key>
<data><![CDATA[ Tests the spider with specific URLs.
@param urls the URLs to crawl
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Spider.thread(int)]]></key>
<data><![CDATA[ Downloads with multiple threads.
@param threadNum the number of threads
@return this
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.Task]]></key>
<data><![CDATA[ The abstract interface for a crawl task.<br>
@author code4crafter@gmail.com <br>
Date: 13-6-18
Time: 2:57 PM
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Task.getUUID()]]></key>
<data><![CDATA[ Returns a string that uniquely identifies this task, to distinguish it from other tasks.
@return uuid
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.Task.getSite()]]></key>
<data><![CDATA[ Returns the site information for the task.
@return site
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.downloader.Destroyable]]></key>
<data><![CDATA[ Services that hold significant resources can implement this interface; Spider calls destroy() when it finishes, to release them.<br>
@author code4crafter@gmail.com <br>
Date: 13-7-26 <br>
Time: 3:10 PM <br>
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.downloader.Downloader]]></key>
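<data><![CDATA[ Downloader is the interface webmagic uses to download pages. By default webmagic uses HttpComponent as the downloader, so in general you do not need to implement this interface yourself.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 12:14 PM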
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.downloader.Downloader.download(us.codecraft.webmagic.Request, us.codecraft.webmagic.Task)]]></key>
<data><![CDATA[ Downloads a page and stores the information in a Page object.
@param request
@param task
@return page
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.downloader.Downloader.setThread(int)]]></key>
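<data><![CDATA[ Sets the number of threads; multithreaded programs generally need support from the Downloader.<br>
If multithreading is not a concern, this method need not be implemented.<br>
@param thread the number of threads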
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.downloader.FileDownloader]]></key>
<data><![CDATA[ Simulates downloading by using files cached locally, so that the Spider framework can be used for extraction only.<br>
@author code4crafter@gmail.com <br>
Date: 13-6-24
Time: 7:24 AM
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.downloader.HttpClientDownloader]]></key>
<data><![CDATA[ A downloader that wraps HttpClient. It supports a configurable number of retries, gzip handling, and custom UA/cookies.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 12:15 PM
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.downloader.HttpClientDownloader.download(java.lang.String)]]></key>
<data><![CDATA[ A convenience method for downloading a page directly.
@param url
@return
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.downloader.HttpClientPool]]></key>
<data><![CDATA[ @author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 12:29 PM
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.downloader]]></key>
<data><![CDATA[
Contains the page-download interface Downloader and its implementation HttpClientDownloader, which wraps the HttpComponent library.
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.AfterExtractor]]></key>
<data><![CDATA[ Implement this interface to run post-processing after extraction.<br>
@author code4crafter@gmail.com <br>
Date: 13-8-3 <br>
Time: 9:42 AM <br>
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.ConsolePageModelPipeline]]></key>
<data><![CDATA[ @author code4crafter@gmail.com <br>
Date: 13-8-3 <br>
Time: 3:41 PM <br>
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.HasKey]]></key>
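<data><![CDATA[ Identifies the key of a Model.<br>
A Model that implements this interface uses key() as its identifier on output (for example, as the file name persisted by JsonFilePageModelPipeline).<br>
If the persisted file names are garbled, add LANG=zh_CN.UTF-8 to the environment variables of the runtime.<br>
@author code4crafter@gmail.com <br>
Date: 13-8-10 <br>
Time: 7:39 AM <br>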
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.HasKey.key()]]></key>
<data><![CDATA[ The key is used as the identifier on output (for example, as the file name persisted by JsonFilePageModelPipeline).
@return key
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.OOSpider]]></key>
<data><![CDATA[ The Model-based Spider; the wrapped entry class.<br>
@author code4crafter@gmail.com <br>
Date: 13-8-3 <br>
Time: 9:51 AM <br>
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.OOSpider(us.codecraft.webmagic.Site, us.codecraft.webmagic.model.PageModelPipeline, java.lang.Class...)]]></key>
<data><![CDATA[ Creates a spider.<br>
@param site
@param pageModelPipeline
@param pageModels
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.PageModelPipeline]]></key>
<data><![CDATA[ @author code4crafter@gmail.com <br>
Date: 13-8-3 <br>
Time: 9:34 AM <br>
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ComboExtract]]></key>
<data><![CDATA[ @author code4crafter@gmail.com <br>
Date: 13-8-16 <br>
Time: 11:09 PM <br>
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractBy]]></key>
<data><![CDATA[ Defines the extraction rule for a class or field.<br>
@author code4crafter@gmail.com <br>
Date: 13-8-1 <br>
Time: 8:40 PM <br>
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractBy.value]]></key>
<data><![CDATA[ The extraction rule.
@return the extraction rule
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractBy.type]]></key>
<data><![CDATA[ The type of the extraction rule. XPath, CSS selector, and regular expression are supported; XPath is the default.
@return the type of the extraction rule
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractBy.notNull]]></key>
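<data><![CDATA[ Whether this is a key field that must not be null. If notNull is true and the field cannot be extracted, the whole object is discarded. The default is false.
@return whether this is a key field that must not be null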
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractBy.multi]]></key>
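<data><![CDATA[ Whether to extract multiple results.<br>
On a field, a List<String> is needed to hold the results.<br>
On a class, it means multiple objects are extracted from a single page.<br>
@return whether to extract multiple results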
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractBy2]]></key>
<data><![CDATA[ Defines the extraction rule for a class or field; can only be used after Extract or ExtractByRaw.<br>
@author code4crafter@gmail.com <br>
Date: 13-8-1 <br>
Time: 8:40 PM <br>
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractBy3]]></key>
<data><![CDATA[ Defines the extraction rule for a class or field; can only be used after Extract or ExtractByRaw.<br>
@author code4crafter@gmail.com <br>
Date: 13-8-1 <br>
Time: 8:40 PM <br>
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractByRaw]]></key>
<data><![CDATA[ For a class that already uses ExtractBy at the class level, use this annotation on a field that needs to extract from the full page content.<br>
@author code4crafter@gmail.com <br>
Date: 13-8-1 <br>
Time: 8:40 PM <br>
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractByRaw.value]]></key>
<data><![CDATA[ The extraction rule.
@return the extraction rule
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractByRaw.type]]></key>
<data><![CDATA[ The type of the extraction rule. XPath, CSS selector, and regular expression are supported; XPath is the default.
@return the type of the extraction rule
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractByRaw.notNull]]></key>
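<data><![CDATA[ Whether this is a key field that must not be null. If notNull is true and the field cannot be extracted, the whole object is discarded. The default is false.
@return whether this is a key field that must not be null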
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractByRaw.multi]]></key>
<data><![CDATA[ Whether to extract multiple results.<br>
A List<String> is needed to hold the results.<br>
@return whether to extract multiple results
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractByUrl]]></key>
<data><![CDATA[ Defines the extraction rule for a class or field (extracts from the URL; only regular expressions are supported).<br>
@author code4crafter@gmail.com <br>
Date: 13-8-1 <br>
Time: 8:40 PM <br>
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractByUrl.value]]></key>
<data><![CDATA[ The extraction rule; regular expressions are supported.
@return the extraction rule
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractByUrl.notNull]]></key>
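<data><![CDATA[ Whether this is a key field that must not be null. If notNull is true and the field cannot be extracted, the whole object is discarded. The default is false.
@return whether this is a key field that must not be null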
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.ExtractByUrl.multi]]></key>
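<data><![CDATA[ Whether to extract multiple results.<br>
On a field, a List<String> is needed to hold the results.<br>
On a class, it means multiple objects are extracted from a single page.<br>
@return whether to extract multiple results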
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.HelpUrl]]></key>
<data><![CDATA[ Defines URLs that assist crawling.<br>
@author code4crafter@gmail.com <br>
Date: 13-8-1 <br>
Time: 8:40 PM <br>
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.HelpUrl.value]]></key>
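<data><![CDATA[ The list of URL rules for a class.<br>
webmagic modifies regular-expression syntax here: "." matches only the literal character "." rather than any character, and "*" stands for ".*". For example, "http://*.oschina.net/*" matches URLs under every second-level domain of oschina.<br>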
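The rule described above can be sketched as a small translation into a standard java.util.regex pattern. This is a minimal illustration under that stated rule only, not webmagic's actual implementation; the class WildcardUrlRuleSketch and its methods are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not webmagic's actual code: it follows the rule described above.
class WildcardUrlRuleSketch {

    // Converts a rule such as "http://*.oschina.net/*" into a java.util.regex pattern string.
    static String toRegex(String rule) {
        // Escape "." first so that the ".*" introduced for "*" is left untouched.
        return rule.replace(".", "\\.").replace("*", ".*");
    }

    // Returns true when the whole URL matches the translated rule.
    static boolean matches(String rule, String url) {
        return Pattern.matches(toRegex(rule), url);
    }
}
```

Other regex metacharacters in the rule, such as "?" or "+", are passed through unchanged in this sketch, so a rule containing them would need extra escaping.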
@return the URL rules
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.HelpUrl.sourceRegion]]></key>
<data><![CDATA[ Specifies the region from which URLs are extracted (XPath only).
@return the region from which URLs are extracted
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.TargetUrl]]></key>
<data><![CDATA[ Defines the scope and source of extraction for a class; sourceRegion can restrict the extraction region using XPath syntax.<br>
@author code4crafter@gmail.com <br>
Date: 13-8-1 <br>
Time: 8:40 PM <br>
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.TargetUrl.value]]></key>
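<data><![CDATA[ The list of URL rules for a class.<br>
webmagic modifies regular-expression syntax here: "." matches only the literal character "." rather than any character, and "*" stands for ".*". For example, "http://*.oschina.net/*" matches URLs under every second-level domain of oschina.<br>
@return the URL rules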
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation.TargetUrl.sourceRegion]]></key>
<data><![CDATA[ Specifies the region from which URLs are extracted (XPath only).
@return the region from which URLs are extracted
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.model.annotation]]></key>
<data><![CDATA[
The annotations defined for webmagic's annotation-based crawling.
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.model]]></key>
<data><![CDATA[
webmagic's model-oriented wrapper (called PageModel) for writing crawlers. A PageProcessor can be implemented with just a POJO and annotations.
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic]]></key>
<data><![CDATA[
<div class="en">
Main class "Spider" and models.
</div>
<div class="zh">
Contains the webmagic entry class Spider and several entity classes used to pass data.
</div>
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.pipeline.ConsolePipeline]]></key>
<data><![CDATA[ Prints the extraction results to the command line. Useful for testing.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 1:45 PM
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.pipeline.FilePipeline]]></key>
<data><![CDATA[ A pipeline that persists results to files.
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 6:28 PM
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.pipeline.FilePipeline()]]></key>
<data><![CDATA[ Creates a FilePipeline that uses the default save path "/data/webmagic/".
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.pipeline.FilePipeline(java.lang.String)]]></key>
<data><![CDATA[ Creates a FilePipeline.
@param path the path where files are saved
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.pipeline.JsonFilePageModelPipeline]]></key>
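<data><![CDATA[ A pipeline that persists results to files in JSON format.<br>
If the persisted file names are garbled, add LANG=zh_CN.UTF-8 to the environment variables of the runtime.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 6:28 PM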
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.pipeline.JsonFilePageModelPipeline()]]></key>
<data><![CDATA[ Creates a JsonFilePageModelPipeline that uses the default save path "/data/webmagic/".
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.pipeline.JsonFilePageModelPipeline(java.lang.String)]]></key>
<data><![CDATA[ Creates a JsonFilePageModelPipeline.
@param path the path where files are saved
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.pipeline.JsonFilePipeline]]></key>
<data><![CDATA[ A pipeline that persists results to files in JSON format.
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 6:28 PM
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.pipeline.JsonFilePipeline()]]></key>
<data><![CDATA[ Creates a JsonFilePipeline that uses the default save path "/data/webmagic/".
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.pipeline.JsonFilePipeline(java.lang.String)]]></key>
<data><![CDATA[ Creates a JsonFilePipeline.
@param path the path where files are saved
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.pipeline.PagedPipeline]]></key>
<data><![CDATA[ A Pipeline used to implement paging.<br>
Do not use this feature when running a distributed crawler with redis.<br>
@author code4crafter@gmail.com <br>
Date: 13-8-4 <br>
Time: 5:15 PM <br>
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.pipeline.Pipeline]]></key>
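<data><![CDATA[ Pipeline is the interface for offline processing and persistence of data. Implement Pipeline to support different persistence methods (for example, saving to a database).
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 1:39 PM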
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.pipeline]]></key>
<data><![CDATA[
Contains Pipeline, the interface for handling page extraction results, and several of its implementations.
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.processor.PageProcessor]]></key>
<data><![CDATA[ The core interface for customizing a spider. Implement PageProcessor to build a custom spider.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 11:42 AM
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.processor.PageProcessor.process(us.codecraft.webmagic.Page)]]></key>
<data><![CDATA[ Defines how a page is processed, including link extraction and content extraction.
@param page
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.processor.PageProcessor.getSite()]]></key>
<data><![CDATA[ Defines configuration for the task, such as start URLs, fetch interval, custom cookies, and custom UA.
@return site
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.processor.SimplePageProcessor]]></key>
<data><![CDATA[ A very simple extractor. Links are extracted using the given wildcard pattern, and the entire page content is saved to the content field.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-22
Time: 9:15 PM
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.processor]]></key>
<data><![CDATA[
Contains PageProcessor, the interface that encapsulates page-processing logic, and one implementation, SimplePageProcessor. Implement PageProcessor to build your own spider.
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.scheduler.FileCacheQueueScheduler]]></key>
<data><![CDATA[ A URL management module backed by disk files. When a long-running task is interrupted, the next start resumes from the point of interruption.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 1:13 PM
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.scheduler.QueueScheduler]]></key>
<data><![CDATA[ A thread-safe Scheduler backed by an in-memory queue.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 1:13 PM
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.scheduler.RedisScheduler]]></key>
<data><![CDATA[ Uses redis to manage URLs, for building a distributed crawler.<br>
@author code4crafter@gmail.com <br>
Date: 13-7-25 <br>
Time: 7:07 AM <br>
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.scheduler.Scheduler]]></key>
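<data><![CDATA[ The interface for URL management and scheduling, covering the crawl queue, URL deduplication, and so on.<br>
The Scheduler methods take a Task parameter, which is reserved for running multiple Tasks on a single Scheduler (a Spider is a Task).<br>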
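As an illustration of the queue-plus-deduplication idea, here is a minimal in-memory sketch. It is not one of webmagic's Scheduler implementations: the class DedupUrlQueueSketch is a hypothetical name, it works on plain URL strings rather than Request objects, and it leaves out the Task parameter for brevity.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch, not a webmagic class: an in-memory URL queue with deduplication.
class DedupUrlQueueSketch {

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Enqueues the URL only the first time it is seen; returns false for duplicates.
    boolean push(String url) {
        if (seen.add(url)) {
            queue.add(url);
            return true;
        }
        return false;
    }

    // Returns the next URL in insertion order, or null when the queue is empty.
    String poll() {
        return queue.poll();
    }
}
```

Keeping the seen set separate from the queue means a URL that has already been polled is still rejected if it is pushed again.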
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 1:12 PM
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.scheduler.Scheduler.push(us.codecraft.webmagic.Request, us.codecraft.webmagic.Task)]]></key>
<data><![CDATA[ Adds a link to be crawled.
@param request the link to be crawled
@param task the task, to support multiple Tasks on a single Scheduler
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.scheduler.Scheduler.poll(us.codecraft.webmagic.Task)]]></key>
<data><![CDATA[ Returns the next link to crawl.
@param task the task, to support multiple Tasks on a single Scheduler
@return the next link to crawl
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.scheduler]]></key>
<data><![CDATA[
Contains Scheduler, the interface for URL management and scheduling, and several of its implementations.
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.AndSelector]]></key>
<data><![CDATA[ @author code4crafter@gmail.com <br>
Date: 13-8-3 <br>
Time: 5:29 PM <br>
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.CssSelector]]></key>
<data><![CDATA[ CSS-style selector. Wraps Jsoup.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 9:39 AM
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.Html]]></key>
<data><![CDATA[ HTML text that supports extraction.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 7:54 AM
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.JsonPathSelector]]></key>
<data><![CDATA[ @author code4crafter@gmail.com <br>
Date: 13-8-12 <br>
Time: 12:54 PM <br>
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.OrSelector]]></key>
<data><![CDATA[ @author code4crafter@gmail.com <br>
Date: 13-8-3 <br>
Time: 5:29 PM <br>
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.PlainText]]></key>
<data><![CDATA[ Plain text that supports extraction; the xpath and css selector methods are not implemented.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 7:54 AM
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.RegexSelector]]></key>
<data><![CDATA[ Regular-expression extractor.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 7:09 AM
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.ReplaceSelector]]></key>
<data><![CDATA[ Performs replacement on text.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 7:09 AM
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.Selectable]]></key>
<data><![CDATA[ Text that supports extraction.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-20
Time: 7:51 PM
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.Selectable.xpath(java.lang.String)]]></key>
<data><![CDATA[ select list with xpath
@param xpath xpath expression
@return new Selectable after extract
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.Selectable.$(java.lang.String)]]></key>
<data><![CDATA[ select list with css selector
@param selector css selector expression
@return new Selectable after extract
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.Selectable.smartContent()]]></key>
<data><![CDATA[ select smart content with the Readability algorithm
@return content
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.Selectable.links()]]></key>
<data><![CDATA[ select all links
@return all links
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.Selectable.regex(java.lang.String)]]></key>
<data><![CDATA[ select list with regex
@param regex
@return new Selectable after extract
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.Selectable.replace(java.lang.String, java.lang.String)]]></key>
<data><![CDATA[ replace with regex
@param regex
@param replacement
@return new Selectable after extract
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.Selectable.toString()]]></key>
<data><![CDATA[ single string result
@return single string result
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.Selectable.all()]]></key>
<data><![CDATA[ multi string result
@return multi string result
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.Selector]]></key>
<data><![CDATA[ Extractor.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-20
Time: 8:02 PM
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.SelectorFactory]]></key>
<data><![CDATA[ Factory that creates selectors.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 7:56 AM
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.SmartContentSelector]]></key>
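<data><![CDATA[ Readability algorithm; the basic idea is to find the parent nodes of all p tags.
The code is rather messy, and the final result is still experimental.
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 4:42 PM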
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector.XpathSelector]]></key>
<data><![CDATA[ XPath selector. Wraps HtmlCleaner.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 9:39 AM
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.selector]]></key>
<data><![CDATA[
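Provides tools for conveniently extracting page content. The core public interface is Selectable; internal extraction is customized by implementing Selector.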
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.utils.DoubleKeyMap]]></key>
<data><![CDATA[ @author code4crafter@gmail.com
Date Dec 14, 2012
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.utils.DoubleKeyMap(java.util.Map<K1, java.util.Map<K2, V>>, java.lang.Class<? extends java.util.Map>)]]></key>
<data><![CDATA[ init map with protoMapClass
@param protoMapClass
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.utils.DoubleKeyMap.get(K1)]]></key>
<data><![CDATA[ @param key
@return map
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.utils.DoubleKeyMap.get(K1, K2)]]></key>
<data><![CDATA[ @param key1
@param key2
@return value
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.utils.DoubleKeyMap.put(K1, java.util.Map<K2, V>)]]></key>
<data><![CDATA[ @param key1
@param submap
@return
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.utils.DoubleKeyMap.put(K1, K2, V)]]></key>
<data><![CDATA[ @param key1
@param key2
@param value
@return
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.utils.DoubleKeyMap.remove(K1, K2)]]></key>
<data><![CDATA[ @param key1
@param key2
@return
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.utils.DoubleKeyMap.remove(K1)]]></key>
<data><![CDATA[ @param key1
@return
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.utils.FilePersistentBase]]></key>
<data><![CDATA[ Base class for file persistence.<br>
@author code4crafter@gmail.com <br>
Date: 13-8-11 <br>
Time: 4:21 PM <br>
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.utils.MultiKeyMapBase]]></key>
<data><![CDATA[ Multi-key map base class holding some basic objects.
@author yihua.huang
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.utils.ThreadUtils]]></key>
<data><![CDATA[ Thread utility class.<br>
@author code4crafer@gmail.com
Date: 13-6-23
Time: 7:11 PM
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.utils.UrlUtils]]></key>
<data><![CDATA[ Utility class for URL and HTML processing.<br>
@author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 1:52 PM
]]></data>
</comment>
<comment>
<key><![CDATA[us.codecraft.webmagic.utils.UrlUtils.canonicalizeUrl(java.lang.String, java.lang.String)]]></key>
<data><![CDATA[ Converts a relative URL to an absolute URL.
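As an illustration of relative-to-absolute conversion, here is a minimal sketch built on the JDK's java.net.URI. It is not necessarily how this method is implemented; the class name CanonicalizeUrlSketch and the method name toAbsolute are hypothetical.

```java
import java.net.URI;

// Hypothetical helper for illustration only: not the webmagic implementation.
final class CanonicalizeUrlSketch {
    // Resolve url against the page it was found on (refer).
    // An already-absolute url is returned unchanged.
    // URI.create throws IllegalArgumentException on malformed input.
    static String toAbsolute(String url, String refer) {
        return URI.create(refer).resolve(url).toString();
    }
}
```

For example, resolving "../c.html" against "http://example.com/a/b.html" gives "http://example.com/c.html".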
@param url the URL to convert
@param refer the page the URL was found on
@return the absolute URL
]]></data>
</comment>
</javadoc>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<javadoc>
<meta>
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
<key><![CDATA[us.codecraft.webmagic.utils]]></key>
<data><![CDATA[
Provides static utility classes for handling links.
]]></data>
</comment>
</javadoc>