Intermediate URL redirect #140
Comments
Try this:
After getting the news URLs, you have to resolve them one by one. Here is an example:
@talhaanwarch, thank you. As I said, Selenium is overkill for my use case so I dropped this lib and solved what I needed with Bing Search API. Anyway, thank you and maybe somebody finds your snippet useful. Regards.
This is much simpler than it seems; you don't even need to BeautifulSoup it. Use this:
You still have to GET the intermediate URL, but if you do:
It relies on a bit of markup in the intermediate page, which you're supposed to see if the redirect isn't fast enough, that tells you it's "Opening" the target. You just use normal Python
Be aware that sending too many requests to Google may get you 429 (Too Many Requests) errors. Each link costs a request to Google first before you get the actual link.
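One way to soften the 429 problem is to retry with a growing delay. The following is only a sketch under stated assumptions: `fetch_with_backoff` is a hypothetical helper, not part of this library, and `fetch` is any callable returning an object with `status_code` and `headers` (for example `requests.get`). The retry count and delays are arbitrary illustrative values.

```python
import time
from typing import Any, Callable


def fetch_with_backoff(
    fetch: Callable[[str], Any],
    url: str,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch(url), retrying on HTTP 429 with exponential backoff.

    Hypothetical helper: `fetch` must return an object exposing
    `status_code` and `headers`, as a requests.Response does.
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code != 429:
            return response
        if attempt == max_attempts - 1:
            break  # out of attempts, hand back the last 429 response
        # Honor a numeric Retry-After header if the server sent one,
        # otherwise wait base_delay * 2**attempt seconds.
        retry_after = str(response.headers.get("Retry-After", ""))
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt
        sleep(delay)
    return response
```

With `requests` installed, `fetch_with_backoff(requests.get, intermediate_url)` would be the typical call. Resolving links sequentially rather than in parallel also keeps the request rate down.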
I ended up really wanting async support, so I wrote my own, which skips the intermediate URL altogether. Having done this myself, I'm not actually sure where the intermediate URL comes from, as the real URL is right there. It's not very pretty or typed, so it's not ready to be its own repo, but if somebody wants to clean it up and incorporate it here or publish it elsewhere, then please do:
It assumes you have a file headers.py with a dict of headers in a variable called HEADERS. Google doesn't actually seem to mind if you don't use browser headers, so it's probably superfluous.
This is a demo I wrote by analyzing the URL redirection process:
I discovered that this repository includes a script for decoding Google News article URLs. Here's the link for reference:
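For article IDs like the one quoted in this issue, the target URL appears to be embedded in the ID itself: the last path segment is URL-safe base64, and the decoded bytes start with `08 13 22`, followed by a length and the plain URL. The sketch below decodes that layout offline, with no request to Google. It is not necessarily what the linked script does: `decode_google_news_url` is a hypothetical name, the byte layout is inferred from this one example, and IDs in other formats simply return `None`.

```python
import base64
import binascii
from typing import Optional
from urllib.parse import urlparse

# Leading bytes observed in the decoded example ID (an assumption based on
# that single sample, not a documented format).
PREFIX = b"\x08\x13\x22"


def decode_google_news_url(source_url: str) -> Optional[str]:
    """Return the article URL embedded in a Google News link, or None."""
    article_id = urlparse(source_url).path.rstrip("/").split("/")[-1]
    padded = article_id + "=" * (-len(article_id) % 4)
    try:
        raw = base64.urlsafe_b64decode(padded)
    except (binascii.Error, ValueError):
        return None
    if not raw.startswith(PREFIX):
        return None

    # Read the length of the URL as a protobuf-style varint.
    pos, length, shift = len(PREFIX), 0, 0
    while pos < len(raw):
        byte = raw[pos]
        pos += 1
        length |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    if length == 0 or pos + length > len(raw):
        return None

    try:
        url = raw[pos:pos + length].decode("utf-8")
    except UnicodeDecodeError:
        return None
    return url if url.startswith(("http://", "https://")) else None


if __name__ == "__main__":
    link = (
        "news.google.com/articles/CBMiU2h0dHBzOi8vd3d3LnRoZXZlcmdlLmNvbS8y"
        "MDI0LzIvMTQvMjQwNzI3OTIvYXBwbGUtdmlzaW9uLXByby1lYXJseS1hZG9wdGVy"
        "cy1yZXR1cm5z0gEA?hl=en-US&gl=US&ceid=US%3Aen"
    )
    print(decode_google_news_url(link))
    # https://www.theverge.com/2024/2/14/24072792/apple-vision-pro-early-adopters-returns
```

If the function returns `None`, the ID probably uses a different encoding, and fetching the intermediate page is still needed.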
It looks like Google now only provides an intermediate URL that redirects to the real news site URL:
'news.google.com/articles/CBMiU2h0dHBzOi8vd3d3LnRoZXZlcmdlLmNvbS8yMDI0LzIvMTQvMjQwNzI3OTIvYXBwbGUtdmlzaW9uLXByby1lYXJseS1hZG9wdGVycy1yZXR1cm5z0gEA?hl=en-US&gl=US&ceid=US%3Aen'
I tried to get the redirected URL with requests, but it seems Google uses JavaScript for the redirect, so this won't do; I end up on the consent page. I don't know how to tackle it without Selenium or similar, and that is overhead I don't want for my project.
If someone has a solution or a pointer in the right direction, I will be grateful.