http://www.qikan.com.cn/articleinfo/dinb20222801.html
http://www.qikan.com.cn/articleinfo/dinb20222801-1.html
These are the default detail page and its pagination page.

$configs = array(
    'name' => 'diannaobao',
    'log_show' => true,
    'max_fields' => 1, // collect at most 2 items per run
    'domains' => array(
        'www.qikan.com.cn'
    ),
    // entry point
    'scan_urls' => array(
        "http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/{$year}/{$week}.html"
        // e.g. http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/2022/27.html
    ),
    // content pages (this part also works)
    'content_url_regexes' => array(
        "http://www.qikan.com.cn/article/\S+",
        // e.g. http://www.qikan.com.cn/article/dinb20222701.html
        // "http://www.qikan.com.cn/articleinfo/\s+"
    ),
    'fields' => array(
        array(
            'name' => "contents",
            'selector' => "//div[contains(@class,'art-pre')]//a//@href",
            // //div[contains(@class,'art-pre')]//a//@href
            // //*[@id="form1"]/div[6]/div/div[2]/div[1]/div[2]/a[5]
            // //div[contains(@class,'art-pre')]//a//@href
            'repeated' => true,
            'required' => true, // required
            'children' => array(
                array(
                    // extract the URLs of the other pages for later use
                    'name' => 'content_page_url',
                    'selector' => "//text()"
                ),
                array(
                    // extract the content of the other pages
                    'name' => 'page_content',
                    // send an attached_url request to fetch the other pages' data
                    // attached_url uses the content_page_url captured above
                    'source_type' => 'attached_url',
                    'attached_url' => 'content_page_url',
                    // 'attached_url'=>"https://www.zhihu.com/r/answers/{comment_id}/comments",
                    'selector' => "//div[contains(@class,'textWrap')]"
                ),
            ),
        ),
    ),
);
The pagination pages are collected, but their content is all duplicated. I don't understand what content_page_url is actually supposed to mean.
@owner888
Did you get this working? I can't figure it out either; my content is duplicated too.