Check robot.txt and ai.txt #13

Open
wants to merge 1 commit into
base: master


@GrayHat12 commented Nov 11, 2020

Hello.
I'm new to open source contribution. I saw your issue #6 and created a robots.py file that might help.
read_disallows(url) takes a URL and returns a list of compiled pattern objects, one for each disallowed entry in the robots.txt of that URL's base URL.
I tested it by passing "https://github.com/GrayHat12" to the function.
It extracted the base URL "https://github.com" and read robots.txt with a GET request to "https://github.com/robots.txt".
Then I used a regex to extract all the disallowed paths.
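
The first two steps can be sketched roughly as follows. This is only an illustration, not the code in robots.py: the names `robots_url`, `extract_disallows`, and `DISALLOW_RE` are hypothetical, the sample text is made up, and the sketch ignores User-agent groups by collecting every Disallow line. The GET request itself is left out so the example runs offline.

```python
import re
from urllib.parse import urlsplit

def robots_url(url):
    """Return the robots.txt URL for the site that serves `url` (hypothetical helper)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

# Matches lines such as "Disallow: /*/stargazers" (case-insensitive).
# [ \t]* is used instead of \s* so an empty "Disallow:" line does not
# swallow the first token of the following line.
DISALLOW_RE = re.compile(r"^[ \t]*Disallow:[ \t]*(\S+)", re.IGNORECASE | re.MULTILINE)

def extract_disallows(robots_text):
    """Return the path rules listed on Disallow lines, in file order."""
    return DISALLOW_RE.findall(robots_text)

# Made-up robots.txt content, used here in place of a real GET response.
sample = """User-agent: *
Disallow: /*/stargazers
Disallow: /search
Allow: /
"""

print(robots_url("https://github.com/GrayHat12"))
print(extract_disallows(sample))
```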
Next I converted those paths to regex strings that can be compared against any URL with the same base URL (github.com).
For example:
One disallowed path is "/*/stargazers".
I converted it to "/[^/]*/stargazers", compiled it to a pattern object, and added it to the list of disallowed patterns that the function returns.
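
A minimal sketch of that conversion, assuming the only wildcard to handle is "*". The function name `disallow_to_regex` is hypothetical and not taken from robots.py. Note that in robots.txt a "*" conventionally matches any sequence of characters, including "/", so "[^/]*" is a narrower reading that follows the conversion described here.

```python
import re

def disallow_to_regex(rule):
    """Compile a robots.txt path rule, mapping each '*' to '[^/]*' (hypothetical helper)."""
    # Escape every literal piece so characters like '.' or '?' lose their
    # regex meaning, then join the pieces with the single-segment wildcard.
    pieces = [re.escape(piece) for piece in rule.split("*")]
    return re.compile("[^/]*".join(pieces))

pattern = disallow_to_regex("/*/stargazers")
print(pattern.pattern)
```

Escaping the literal pieces matters for rules that contain regex metacharacters, for example a rule ending in ".json".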

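Now, when you compare the URL "https://github.com/chiphuyen/lazynlp/stargazers" with the pattern "/[^/]*/stargazers", a match is found using re.search, and you can choose not to crawl it. (re.match would not work here, because it only matches at the start of the string and the URL starts with "https://".)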
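
A crawler could run this check before fetching a page. The sketch below is an assumption about how the returned list might be used: `is_disallowed` is a hypothetical helper, and the list is built by hand here rather than by calling read_disallows.

```python
import re
from urllib.parse import urlsplit

# Stand-in for the list that read_disallows(url) is described as returning.
disallowed = [re.compile(r"/[^/]*/stargazers")]

def is_disallowed(url, patterns):
    """Return True if the path of `url` matches any disallowed pattern."""
    path = urlsplit(url).path
    # search() scans the whole path; match() would only try position 0.
    return any(p.search(path) for p in patterns)

url = "https://github.com/chiphuyen/lazynlp/stargazers"
print(is_disallowed(url, disallowed))
print(is_disallowed("https://github.com/GrayHat12", disallowed))
```

Matching against the path rather than the full URL keeps the scheme and host out of the comparison, so the same patterns work for every page on the site.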

I hope this explanation is clear enough. I didn't understand the ai.txt part of the issue, though. It would be great if someone could elaborate on that. 🐰

Sorry for any issues with my pull request. I'm new to this and am hoping someone will guide me through it.
