A question about periodic extraction tasks

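Expected behavior:
Periodically crawl the data that the clues in a data list point to.
The following two themes are used:
1. DFamily_罗列影片排期与影评_Google
2. DFamily_影片排期与影评_Google
The relevant crontab.xml fragment is as follows: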

<?xml version="1.0" encoding="UTF-8"?>
 <crontab>
   <thread name="DFamily_罗列影片排期与影评_Google周期提取">
     <parameter>
       <auto>true</auto>
       <start>10</start>
       <period>20</period>
       <waitOnload>false</waitOnload>
       <minIdle>2</minIdle>
       <maxIdle>10</maxIdle>
     </parameter>
 
     <step name="renewClue">
       <theme>DFamily_罗列影片排期与影评_Google</theme>
     </step>

     <step name="crawl">
       <theme>DFamily_罗列影片排期与影评_Google</theme>
       <loadTimeout>30</loadTimeout>
       <lazyCycle>3</lazyCycle>
       <updateClue>true</updateClue>
       <dupRatio>80</dupRatio>
       <depth>-1</depth>
       <width>-1</width>
       <renew>false</renew>
       <period>0</period>
     </step>
     <step name="crawl">
       <theme>DFamily_影片排期与影评_Google</theme>
       <updateClue>false</updateClue>
       <dupRatio>80</dupRatio>
       <depth>-1</depth>
       <width>-1</width>
       <renew>false</renew>
       <period>0</period>
       <resumePageLoad>true</resumePageLoad>
       <resumeMaxCount>3</resumeMaxCount>
     </step>     
   </thread>
 </crontab>

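Crawl results:
1. The result files for the theme "DFamily_罗列影片排期与影评_Google" are crawled successfully.
2. The crawl log window shows the following message: Duplication ratio is over the threshold.The pipe line stops. (Processor name: ExtractSpiderClue_Simp)
3. The result files for the theme "DFamily_影片排期与影评_Google" are not crawled, contrary to what I expected.

Note: when I run the crawl manually, the result files for both themes are crawled normally.

Is something wrong with my crontab.xml configuration, or is there something else I need to pay attention to?

I would appreciate a reply with some guidance.

Thanks.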

Tue, 05/18/2010 - 17:40 — Fuller

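Turn off the duplication ratio check

In the crawl step of the theme "DFamily_罗列影片排期与影评_Google", set the dupRatio parameter to 100 to turn off the duplication ratio check. With a value of 80, the pipeline stops once 80% of the clues being extracted for the theme "DFamily_影片排期与影评_Google" are found to be duplicates. With a value of 100, no check is performed.

Only periodic extraction can turn off this check through this parameter. Manual extraction has no setting in the user interface for turning it off.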
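
As a rough sketch of the change described in this reply, the first crawl step from the crontab.xml fragment in the question might look like the following. Only dupRatio differs from the original post; every other value is copied unchanged. The reply does not say whether the second crawl step also needs a different dupRatio, so that step is left out here.

```xml
<step name="crawl">
  <theme>DFamily_罗列影片排期与影评_Google</theme>
  <loadTimeout>30</loadTimeout>
  <lazyCycle>3</lazyCycle>
  <updateClue>true</updateClue>
  <!-- Changed from 80 to 100: according to the reply, 100 turns off the duplication ratio check -->
  <dupRatio>100</dupRatio>
  <depth>-1</depth>
  <width>-1</width>
  <renew>false</renew>
  <period>0</period>
</step>
```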

Wed, 05/19/2010 - 15:03 — lijj2010

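Thanks for the reply

It works now. Thank you for your help.

One more question: is there anything else I should be aware of when setting up periodic crawling?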

Wed, 05/19/2010 - 16:09 — Fuller

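Parameters for periodic crawling

All parameters for periodic crawling are documented in this article: http://www.gooseeker.com/cn/node/technology/files/pss. The article is updated whenever a new feature is added.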
