google scholar search result has very limited metadata #3
Comments
I want to first describe how Academic Tracker uses Google Scholar. Google Scholar has at least two endpoints: one to search for authors and one to search for publications. The publication search endpoint is very sensitive (Google will block you) and hard to use programmatically; using the scholarly package right out of the box, I was not able to get it to work. This is the endpoint most people know from the website, and it gives more information about a specific publication. Academic Tracker does not use this endpoint.

The author endpoint lets you search for authors that Google Scholar has identified and see which publications are associated with them. You can get some publication information from this endpoint, but not as much as from the publication search endpoint. Specifically, when using scholarly you have to search for the author (one query), then fill their publications (another query), and then you can fill each individual publication to get slightly more information (one query per publication). The information added by a fill is not consistent (for example, you may or may not get a "journal" field even after a fill), and Google Scholar is likely to ban your IP if you make too many requests. I therefore chose not to fill every publication for every author, and to fill only publications that do not have a DOI on Crossref. Our original use case is also simply to help keep track of whether PIs have properly reported their papers so they don't lose funding, so a title is often enough to ask them whether a paper needs to be reported properly. Below is an example of the publication information provided by Google Scholar through the author endpoint and a filled version of the same publication:
As you can see, even when filled, the information is pretty sparse. What information specifically are you after, and is it available in the above examples? It might be possible to provide an option to fill every publication, or to pass the whole dictionary from Google Scholar through to the final output, but it would be helpful to know what information you are actually seeking and whether it is even available from this source.
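For readers unfamiliar with the query pattern described above (search the author, fill their publication list, then fill only selected publications), here is a minimal sketch. It is not Academic Tracker's actual code. The `client` argument is assumed to be the `scholarly` object from the scholarly package (`from scholarly import scholarly`), and `needs_fill` is a hypothetical callback standing in for the "no DOI found on Crossref" check.

```python
from typing import Any, Callable, Dict, List

Publication = Dict[str, Any]


def collect_publications(
    client: Any,
    author_name: str,
    needs_fill: Callable[[Publication], bool],
) -> List[Publication]:
    """Return an author's publications, filling only the ones that need it.

    `client` is assumed to expose `search_author` and `fill` the way the
    scholarly package does. `needs_fill` is a hypothetical predicate, for
    example "no DOI was found for this title on Crossref".
    """
    # Query 1: search for the author and take the first match, if any.
    author = next(client.search_author(author_name), None)
    if author is None:
        return []

    # Query 2: fill only the author's publication list.
    author = client.fill(author, sections=["publications"])

    publications = []
    for pub in author.get("publications", []):
        # One extra query per publication, so only do it when required.
        if needs_fill(pub):
            pub = client.fill(pub)
        publications.append(pub)
    return publications
```

Passing the client in as a parameter keeps the number of requests easy to reason about and lets the logic be exercised with a stand-in object instead of live Google Scholar queries.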
The metadata can be used in many different ways. For example, if you check the OpenAlex data at https://api.openalex.org/works/W4223491415, you can see fields like "is_oa", "is_corresponding" (corresponding authors), "source", "best_oa_location", etc.
use cases here:
Same example: https://api.openalex.org/works/W4223491415. See fields like "referenced_works": you can use this field to figure out who cited whom, in other words, data mining to find related researchers. A field like "related_works" can be used for further-reading suggestions. With a field like "counts_by_year", you can calculate the total number of citations by author by year.
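As a rough illustration of the uses described above, the sketch below works on OpenAlex work records that have already been downloaded and parsed into dictionaries. The function names are hypothetical, and it assumes each "counts_by_year" entry carries "year" and "cited_by_count" keys and that "referenced_works" is a list of work IDs.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Set

Work = Dict[str, Any]


def citations_by_year(works: Iterable[Work]) -> Dict[int, int]:
    """Sum the per-year citation counts over a collection of works.

    Passing all works of one author gives citations by author by year,
    limited to the years each record actually lists.
    """
    totals: Counter = Counter()
    for work in works:
        for entry in work.get("counts_by_year") or []:
            totals[entry["year"]] += entry.get("cited_by_count", 0)
    return dict(sorted(totals.items()))


def shared_references(work_a: Work, work_b: Work) -> Set[str]:
    """Return the work IDs that both records list in "referenced_works"."""
    refs_a = set(work_a.get("referenced_works") or [])
    refs_b = set(work_b.get("referenced_works") or [])
    return refs_a & refs_b
```

Overlap in references is only one possible signal for finding related researchers; it is shown here because it needs nothing beyond the fields named in this thread.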
Thank you for your explanation of how this Google Scholar module works. I understand that Google Scholar is intentionally closed.
We might be able to add OpenAlex as a source for Academic Tracker, but this original thread was about metadata available via Google Scholar.
Also, OpenAlex has limits of 100,000 requests per day and 10 requests per second.
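If OpenAlex were added as a source, a client would need to stay under the per-second limit mentioned above. The following is a minimal sketch of one way to do that by spacing requests evenly. The `Throttle` class is hypothetical and not part of Academic Tracker or any OpenAlex library, and it does not track the daily limit.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space calls at least 1 / max_per_second seconds apart."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each HTTP request, with `throttle = Throttle(10)`, would keep a single-threaded client at or below 10 requests per second.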
OpenAlex has developed new features in the past 6 months. Grant information has not been its focus (it was added later, in 2023), so check https://docs.openalex.org/
With Google Scholar's scholar ID setup, the results from Google Scholar have very limited metadata. Is it possible to get all the metadata from Google Scholar (e.g. data, journal title, citations, etc.)?
For example, this article https://scholar.google.com/citations?view_op=view_citation&hl=en&user=eo4KWGcAAAAJ&sortby=pubdate&citation_for_view=eo4KWGcAAAAJ:An6A6Jpfc1oC
The result:
"https://doi.org/10.6017/ital.v40i1.12553": {
    "PMCID": null,
    "abstract": null,
    "authors": [
        {
            "affiliation": "University of Arizona",
            "author_id": "Yan Han",
            "firstname": "Yan",
            "initials": null,
            "lastname": "Han"
        }
    ],
    "conclusions": null,
    "copyrights": null,
    "doi": "10.6017/ital.v40i1.12553",
    "grants": null,
    "journal": null,
    "keywords": null,
    "methods": null,
    "publication_date": {
        "day": 11,
        "month": 3,
        "year": 2021
    },
    "pubmed_id": null,
    "results": null,
    "title": "Development of a Gold-standard Pashto Dataset and a Segmentation App"