Add to links active/inactive flag #9

Open
3 tasks
chaoSefat opened this issue Oct 23, 2024 · 2 comments

Comments

@chaoSefat

  • Check whether get_text(link) is None and link is in seed_urls.
  • If so, flag the link as inactive.
  • Otherwise, flag it as active.
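
The checklist above could be sketched roughly as follows. This is only an illustration: flag_links is a hypothetical helper, and get_text is passed in as a parameter because its real signature is not shown in this issue. It is assumed to return None when no text could be extracted.

```python
def flag_links(links, seed_urls, get_text):
    """Attach an active flag (1 or 0) to each link.

    Assumption: get_text(link) returns None when nothing could be extracted.
    """
    seed_set = set(seed_urls)  # set membership checks are O(1)
    flagged = []
    for link in links:
        # Inactive only when both conditions from the checklist hold.
        inactive = get_text(link) is None and link in seed_set
        flagged.append({"link": link, "active": 0 if inactive else 1})
    return flagged
```

Note that, read literally, the checklist leaves a link active when it has no text but is not a seed URL; the sketch keeps that behavior.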

chaoSefat commented Oct 23, 2024

  1. Remove self links from the links to scrape (i.e. all_links) and build a dictionary of scraped links. Add active: 1 to each entry and also save the engines.
  2. Convert all_data into a dictionary keyed by 'link' for fast lookup.
    Then use it to check each link:
    if 'common-crawl' in engines:
        save 'cc-text'
        save engines
        record whether the link is active or not:
        if it is in the active set then active: 1, else active: 0
    else:
        use trafilatura to scrape and filter

Then get the info for that entry and check whether its category is commoncrawl and whether it has a snippet.
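
A minimal sketch of step 2, under several assumptions that this issue does not confirm: each entry in all_data is a dict with 'link', 'engines', and optionally 'cc-text' keys; active_links is a set of links already known to be active; and scrape is a stand-in callable for the trafilatura scrape-and-filter step. The helper names index_by_link and build_record are hypothetical.

```python
def index_by_link(all_data):
    """Turn a list of entries into a dict keyed by each entry's 'link'."""
    return {entry["link"]: entry for entry in all_data}


def build_record(link, by_link, active_links, scrape):
    """Build the saved record for one link.

    scrape is a stand-in for the trafilatura scrape-and-filter step.
    """
    entry = by_link.get(link, {})
    engines = entry.get("engines", [])
    if "common-crawl" in engines:
        # Reuse the Common Crawl text and record the active flag.
        return {
            "link": link,
            "engines": engines,
            "cc-text": entry.get("cc-text"),
            "active": 1 if link in active_links else 0,
        }
    # No Common Crawl text: fall back to scraping.
    # The 'text' key is an assumption made for this illustration.
    return {"link": link, "engines": engines, "text": scrape(link)}
```

Keying the data by link makes each lookup a single dictionary access instead of a scan over the whole list.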


chaoSefat commented Oct 23, 2024

import requests


def check_urls(url_list):
    """Return one {"link": url, "active": 0 or 1} dict per URL."""
    result = []

    for url in url_list:
        try:
            # A link counts as active only if it answers 200 within 5 seconds.
            response = requests.get(url, timeout=5)
            active = 1 if response.status_code == 200 else 0
        except requests.RequestException:
            # Connection errors, timeouts, and invalid URLs count as inactive.
            active = 0
        result.append({"link": url, "active": active})

    return result
