About paginated scraping: nothing I try works #166

Open
kavt opened this issue Jul 31, 2022 · 2 comments

kavt commented Jul 31, 2022

http://www.qikan.com.cn/articleinfo/dinb20222801.html

http://www.qikan.com.cn/articleinfo/dinb20222801-1.html

These are the default detail page and its paginated page.
$configs = array(
'name' => 'diannaobao',
'log_show' => true,
'max_fields' => 1, // maximum number of items to collect per run
'domains' => array(
'www.qikan.com.cn'
),

// Entry URLs



'scan_urls' => array(
    "http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/{$year}/{$week}.html"   //  http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/2022/27.html
),





// Content pages: this part also matches correctly
 'content_url_regexes' => array(
        "http://www.qikan.com.cn/article/\S+",  //http://www.qikan.com.cn/article/dinb20222701.html
       // "http://www.qikan.com.cn/articleinfo/\s+"
    ),


'fields' => array(



    array(
        'name' => "contents",
        'selector' => "//div[contains(@class,'art-pre')]//a//@href",

        // Alternative XPath tried: //*[@id="form1"]/div[6]/div/div[2]/div[1]/div[2]/a[5]
        
        'repeated' => true,
        'required' => true, // required field

        'children' => array(

          
            array(
                // Extract the URLs of the other pages for later use
                'name' => 'content_page_url',
               
                'selector' => "//text()"
            ),

        
            array(
                // Extract the content of the other pages
                'name' => 'page_content',
               
                // Send a request to attached_url to fetch the data of the other pages
                // attached_url uses the content_page_url extracted above
                'source_type' => 'attached_url',
                'attached_url' => 'content_page_url',   // 'attached_url'=>"https://www.zhihu.com/r/answers/{comment_id}/comments",
                'selector' => "//div[contains(@class,'textWrap')]"
            ),
        ),
    ),
),
);

The paginated pages do get fetched, but the content is always duplicated. I just don't understand what content_page_url is supposed to mean.
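
One way to narrow this down might be to check, outside of phpspider, what the pagination selector actually returns. Going by the comments in the config, content_page_url is meant to hold each pagination link, which page_content then fetches via attached_url. If those links all resolve to the same URL, duplicated content would be expected. The sketch below uses only PHP's built-in DOM extension; the HTML fragment and the helper name pagination_urls are made up for illustration, and the real page markup may differ.

```php
<?php
// Hypothetical HTML fragment standing in for the pagination block.
// Only the 'art-pre' class name is taken from the config above.
$html = <<<HTML
<div class="art-pre">
  <a href="/articleinfo/dinb20222801.html">1</a>
  <a href="/articleinfo/dinb20222801-1.html">2</a>
  <a href="/articleinfo/dinb20222801-1.html">Next</a>
</div>
HTML;

/**
 * Return the unique absolute URLs of the pagination links.
 * Assumption of this sketch: hrefs are either absolute or root-relative.
 */
function pagination_urls(string $html, string $base): array
{
    $doc = new DOMDocument();
    // Suppress parser warnings caused by imperfect real-world HTML.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $urls = array();
    foreach ($xpath->query("//div[contains(@class,'art-pre')]//a/@href") as $attr) {
        $href = trim($attr->nodeValue);
        if ($href === '') {
            continue;
        }
        // Turn root-relative links into absolute ones.
        if (!preg_match('#^https?://#i', $href)) {
            $href = rtrim($base, '/') . '/' . ltrim($href, '/');
        }
        $urls[$href] = true; // array keys deduplicate repeated links
    }
    return array_keys($urls);
}

$pages = pagination_urls($html, 'http://www.qikan.com.cn');
print_r($pages);
```

If the list printed for the real page contains the same URL several times, or only the URL of the page already being scraped, the duplication probably comes from the selector rather than from the attached_url request.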


kavt commented Jul 31, 2022

@owner888


ishwy commented Aug 11, 2022

Did you ever get this working? I can't figure it out either; my content comes back duplicated too.
