GooSeeker Web Crawler

Title: How to batch-extract the Peer Review File PDF link from a list of URLs

Author: 2574586329 Time: 2021-8-5 19:07
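I want to know whether each of the URLs I have contains a Peer Review File entry. If it does, give me the PDF link of that Peer Review File; if it does not, return "No" (or anything else, since I can edit it afterwards in Excel).
URL used to build the rule: https://www.nature.com/articles/s41467-018-02825-9#Sec20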

Other URLs that can be used for testing:
https://www.nature.com/articles/s41467-021-22035-0
https://www.nature.com/articles/s41467-021-22702-2
https://www.nature.com/articles/s41467-021-23070-7
https://www.nature.com/articles/s41467-021-22860-3
https://www.nature.com/articles/s41467-021-23010-5
https://www.nature.com/articles/s41467-021-22703-1
https://www.nature.com/articles/s41467-021-22840-7
https://www.nature.com/articles/s41467-021-22806-9
https://www.nature.com/articles/s41467-021-22837-2
https://www.nature.com/articles/s41467-021-22826-5
https://www.nature.com/articles/s41467-021-22853-2
https://www.nature.com/articles/s41467-021-22825-6
https://www.nature.com/articles/s41467-021-22748-2
https://www.nature.com/articles/s41467-021-22805-w
https://www.nature.com/articles/s41467-021-22747-3
https://www.nature.com/articles/s41467-021-21551-3
https://www.nature.com/articles/s41467-021-22765-1
https://www.nature.com/articles/s41467-021-22315-9
https://www.nature.com/articles/s41467-021-22423-6
https://www.nature.com/articles/s41467-021-22739-3

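Author: Fuller Time: 2021-8-5 19:13
To collect this PDF URL you do not need any actions; you only need to annotate the content to capture correctly. Use the following settings:
1. Because some pages do not have this URL, do not tick "key content" for it.
2. Use a custom XPath that specifically captures the node whose #text is "Peer Review File".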

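Author: 2574586329 Time: 2021-8-5 19:17

Fuller posted on 2021-8-5 19:13
To collect this PDF URL you do not need any actions; you only need to annotate the content to capture correctly. Use the following settings: 1. If some pages do not have this ...

How do I set that up, i.e. the custom XPath that specifically captures the node whose #text is "Peer Review File"?
I have already written my locating node: //div[contains(.//text(),'Peer Review File')]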

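Author: Fuller Time: 2021-8-5 19:21
Step 1: Annotate the content as usual, using a page that contains the PDF as the sample page.
1. Use the article title as the first captured content. You need one captured content that is present on every page, so that it can be set as the "key content".
2. Use the @href of that link as the second captured content.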

[attach]14660[/attach]

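Author: Fuller Time: 2021-8-5 19:27
Step 2: View the scraping rule and turn the auto-generated rule into a custom XPath.
As shown below, click the "Test" button, then click "Scraping rule", and copy the XPath shown in the red box.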
[attach]14661[/attach]

The XPath following-sibling::div[position()=1]//*[@class='print-link']/@href matches not only Peer Review File but other links as well, so it needs to be changed to
following-sibling::div[position()=1]//*[@class='print-link' and text()='Peer Review File']/@href

As shown below, double-click that captured content and enter the custom XPath on the settings page: following-sibling::div[position()=1]//*[@class='print-link' and text()='Peer Review File']/@href

[attach]14662[/attach]
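
For anyone who wants to see the effect of the extra text predicate outside the GooSeeker tool, here is a minimal Python sketch. It assumes a simplified, hypothetical HTML fragment (the real nature.com markup is not reproduced in this thread and may differ), and the href values are invented. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath: the following-sibling axis is not available, so that part of the rule is left out, and the exact-text test is written [.='...'] instead of text()='...'.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the supplementary-material block.
# The real page markup may differ; the href values are invented.
SAMPLE = """
<section>
  <h3>Electronic supplementary material</h3>
  <div>
    <a class="print-link" href="/esm/supplementary.pdf">Supplementary Information</a>
    <a class="print-link" href="/esm/peer-review.pdf">Peer Review File</a>
  </div>
</section>
"""

root = ET.fromstring(SAMPLE)

# Class-only predicate: matches every link in the block.
all_links = [el.get("href") for el in root.findall(".//*[@class='print-link']")]

# Class plus exact text: only the element whose text is exactly
# "Peer Review File" is kept.
peer_links = [
    el.get("href")
    for el in root.findall(".//*[@class='print-link'][.='Peer Review File']")
]

print(all_links)
print(peer_links)
```

The first list holds both links and the second only the Peer Review File link, which is the narrowing the modified XPath is meant to achieve.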

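Author: Fuller Time: 2021-8-5 19:31
Step 3: Save the rule, and remember to delete the actions you created earlier. Then go to Rule Management in the Member Center and add the other URLs.

As shown below, click the button in the left sidebar to enter the Member Center.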
[attach]14663[/attach]

Add the URLs, then run the rule to try it out.
[attach]14664[/attach]

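Author: Fuller Time: 2021-8-5 19:32
Step 4: Export the data. Because the captured content for the link does not have "key content" ticked, this field is simply empty in the exported data for every page that has no PDF file.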
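
The original question asked for "No" when a page has no Peer Review File. One possible way to get that after export is a small post-processing step, sketched here in Python. The row layout, the field names title and peer_review_pdf, and the example.org link are all assumptions for illustration, not GooSeeker's actual export format.

```python
import csv
import io


def fill_missing(rows, field="peer_review_pdf", placeholder="No"):
    """Return copies of the rows with an empty `field` replaced by `placeholder`."""
    return [{**row, field: row.get(field) or placeholder} for row in rows]


# Hypothetical exported rows; titles and links are invented.
rows = [
    {"title": "Article A", "peer_review_pdf": "https://example.org/a-peer-review.pdf"},
    {"title": "Article B", "peer_review_pdf": ""},
]

filled = fill_missing(rows)

# Write the result as CSV text so it can be opened in Excel.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "peer_review_pdf"])
writer.writeheader()
writer.writerows(filled)
print(buffer.getvalue())
```

The same replacement can of course be done by hand in Excel, as the question already suggests.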

Author: Fuller Time: 2021-8-5 19:36
In the end you get a result like this:
[attach]14665[/attach]

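Author: 2574586329 Time: 2021-8-6 19:24

Fuller posted on 2021-8-5 19:36
In the end you get a result like this

I just noticed that on some pages the text of the Peer Review File link is not exactly "Peer Review File"; there may be some extra content after it.
For example: https://www.nature.com/articles/s41467-018-03565-6#Sec23
How should the XPath be modified in that case?
Could you please take a look? Thanks!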

Author: 2574586329 Time: 2021-8-6 19:40
While scraping this PDF link I found that on some pages the text for the PDF differs. How should the XPath be changed in that case?
Original XPath: following-sibling::div[position()=1]//*[@class='print-link' and text()='Peer Review File']/@href
URL: https://www.nature.com/articles/s41467-018-03565-6#Sec23

Author: Fuller Time: 2021-8-6 20:16

2574586329 posted on 2021-8-6 19:40
While scraping this PDF link I found that on some pages the text for the PDF differs. How should the XPath be changed in that case?
Original XPath ...

Change it to: following-sibling::div[position()=1]//*[@class='print-link' and contains(text(),'Peer Review File')]/@href
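
To illustrate what the switch from an exact match to contains() changes, here is a rough Python sketch using only the standard library's html.parser. It is not how GooSeeker evaluates the rule; it emulates the predicate by collecting the href of every element with class="print-link" whose text contains the search string. The sample HTML, the link text with a suffix, and the href values are hypothetical. One difference worth knowing: in XPath 1.0, contains(text(), ...) tests only the first text node of the element, while this sketch joins all of the element's text.

```python
from html.parser import HTMLParser


class PrintLinkFinder(HTMLParser):
    """Collect href values of class='print-link' elements whose text contains `needle`."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.hrefs = []
        self._tag = None   # tag name of the matching element being read, if any
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._tag is None and attrs.get("class") == "print-link":
            self._tag, self._href, self._text = tag, attrs.get("href"), []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Substring test instead of equality, mirroring contains().
            if self.needle in "".join(self._text) and self._href:
                self.hrefs.append(self._href)
            self._tag = None


def find_links(html, needle="Peer Review File"):
    finder = PrintLinkFinder(needle)
    finder.feed(html)
    finder.close()
    return finder.hrefs


# Hypothetical fragment: the second link text carries extra content.
SAMPLE = """
<div>
  <a class="print-link" href="/esm/si.pdf">Supplementary Information</a>
  <a class="print-link" href="/esm/prf.pdf">Peer Review File (PDF)</a>
</div>
"""

found = find_links(SAMPLE)
missing = find_links("<div><a class='print-link' href='/x.pdf'>Other</a></div>")
print(found)
print(missing)
```

An exact comparison against "Peer Review File" would skip the second link in this sample, whereas the substring test keeps it; a page without such a link yields an empty list.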

Welcome to GooSeeker Web Crawler (http://120.55.75.51/doc/)