Hello.
I'm new to open source contribution. I saw your issue #6 and created a `robots.py` file that might help you.
`read_disallows(url)`: takes a URL and returns a list of pattern objects covering all disallowed items from the robots.txt of that URL's base URL.
I tested it by passing `"https://github.com/GrayHat12"` as input to the function. It extracted the base URL `"https://github.com"` and then read robots.txt with a `GET` request to `"https://github.com/robots.txt"`.
Then I used a regex to extract all disallowed URLs.
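The first steps described above (deriving the base URL, fetching robots.txt, and pulling out the `Disallow` entries) could look roughly like the sketch below. It is not the code in `robots.py`: the helper names `base_url`, `fetch_robots_txt`, and `extract_disallows`, the exact regex, and the sample robots.txt text are all assumptions made for illustration, and the sketch ignores `User-agent` groups for simplicity.

```python
import re
from urllib.parse import urlsplit
from urllib.request import urlopen


def base_url(url):
    """Return scheme and host of a URL, e.g. 'https://github.com'."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def fetch_robots_txt(url):
    """GET <base url>/robots.txt and return its text (hypothetical helper)."""
    with urlopen(base_url(url) + "/robots.txt") as response:
        return response.read().decode("utf-8", errors="replace")


# One "Disallow: <path>" entry per line; [ \t]* avoids matching across newlines,
# so an empty "Disallow:" line does not capture the following line.
DISALLOW_RE = re.compile(r"^[ \t]*Disallow:[ \t]*(\S+)", re.IGNORECASE | re.MULTILINE)


def extract_disallows(robots_txt):
    """Return every disallowed path found in the robots.txt text."""
    return DISALLOW_RE.findall(robots_txt)


# Made-up robots.txt content, so the example runs without network access.
sample = """User-agent: *
Disallow: /*/stargazers
Disallow: /search
Allow: /
"""

print(base_url("https://github.com/GrayHat12"))
print(extract_disallows(sample))
```

Parsing a local string keeps the example runnable offline; in real use the text would come from `fetch_robots_txt(url)`.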
Next, I converted those URLs to regex strings that can be compared against any URL with the same base URL (github.com).
For example, one disallowed URL is `"/*/stargazers"`. I converted it to `"/[^/]*/stargazers"`, compiled it to a pattern object, and added it to a disallowed list, which the function returns.
Now, when you compare a URL such as `"https://github.com/chiphuyen/lazynlp/stargazers"` with the pattern `"/[^/]*/stargazers"`, a match is found using `re.match` and you can choose not to crawl it.
Hope this was explanatory enough. I didn't understand the `ai.txt` part of the issue, though. It would be great if someone could elaborate on that. 🐰
Sorry for any issues with my pull request. I'm new to this and am hoping someone will guide me through
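For reference, the conversion and comparison steps described in this pull request could be sketched as follows. The helper names `disallow_to_pattern` and `is_disallowed` are hypothetical and not taken from `robots.py`. One deliberate difference, stated as an assumption: the sketch uses `re.search` on the URL's path rather than `re.match` on the full URL, because `re.match` only matches at the start of the string. It assumes Python 3.7 or later, where `re.escape` leaves `/` unescaped.

```python
import re
from urllib.parse import urlsplit


def disallow_to_pattern(rule):
    """Turn a robots.txt path such as '/*/stargazers' into a compiled regex.

    Regex metacharacters are escaped first, then each escaped '*' becomes
    '[^/]*' (any run of characters within one path segment), as described
    in the pull request.
    """
    escaped = re.escape(rule).replace(r"\*", "[^/]*")
    return re.compile(escaped)


def is_disallowed(url, patterns):
    """Return True if any pattern is found in the URL's path.

    Assumption: uses search() on the path, not match() on the full URL.
    """
    path = urlsplit(url).path
    return any(p.search(path) for p in patterns)


patterns = [disallow_to_pattern("/*/stargazers")]

print(patterns[0].pattern)
print(is_disallowed("https://github.com/chiphuyen/lazynlp/stargazers", patterns))
print(is_disallowed("https://github.com/chiphuyen/lazynlp", patterns))
```

A crawler could call `is_disallowed(url, patterns)` before each request and skip the URL when it returns `True`.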