google scholar search result has very limited metadata #3

yhan818 · 2023-07-19T13:40:41Z

With the Google Scholar scholar ID set up, the results from Google Scholar have very limited metadata. Is it possible to get all the metadata from Google Scholar (e.g. data, journal title, citations, etc.)?

For example, this article https://scholar.google.com/citations?view_op=view_citation&hl=en&user=eo4KWGcAAAAJ&sortby=pubdate&citation_for_view=eo4KWGcAAAAJ:An6A6Jpfc1oC

The result:
"https://doi.org/10.6017/ital.v40i1.12553": {
  "PMCID": null,
  "abstract": null,
  "authors": [
    {
      "affiliation": "University of Arizona",
      "author_id": "Yan Han",
      "firstname": "Yan",
      "initials": null,
      "lastname": "Han"
    }
  ],
  "conclusions": null,
  "copyrights": null,
  "doi": "10.6017/ital.v40i1.12553",
  "grants": null,
  "journal": null,
  "keywords": null,
  "methods": null,
  "publication_date": {
    "day": 11,
    "month": 3,
    "year": 2021
  },
  "pubmed_id": null,
  "results": null,
  "title": "Development of a Gold-standard Pashto Dataset and a Segmentation App"
}

ptth222 · 2023-07-23T17:55:38Z

I want to first describe how Academic Tracker uses Google Scholar. Google Scholar has at least two endpoints: one to search for authors and one to search for publications. The publication search endpoint is very sensitive (Google will block you) and hard to use programmatically; using the scholarly package right out of the box, I was not able to get it to work. This is the endpoint most people are used to from the site, and it gives more information about a specific publication. Academic Tracker does not use this endpoint.

The author endpoint lets you search for authors that Google Scholar has identified and see which publications are associated with them. You can get some publication information from this endpoint, but not as much as from the publication search endpoint. Specifically, when using scholarly you have to search for the author (one query), then fill their publications (another query), and then you can fill each individual publication to get slightly more information (one query per publication). The information added by a fill is not consistent (for example, you may or may not get a "journal" field even after a fill), and Google Scholar is likely to ban your IP if you make too many requests, so I chose not to fill every publication for every author and to fill only publications that do not have a DOI on Crossref. Our original use case is also simply to help keep track of whether PIs have properly reported their papers so they don't lose funding, so a title is often enough to ask them whether it needs to be reported properly.
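The query pattern described above could be sketched roughly as follows. This is only an illustration, not Academic Tracker's actual code: `pubs_needing_fill`, `fetch_author_publications`, and the `has_crossref_doi` callback are hypothetical names, and the Crossref lookup itself is left to the caller. The `scholarly.search_author` and `scholarly.fill` calls are from the scholarly package.

```python
from typing import Callable, Iterable


def pubs_needing_fill(
    pubs: Iterable[dict], has_crossref_doi: Callable[[str], bool]
) -> list[dict]:
    """Return the publications whose title has no DOI match on Crossref.

    `has_crossref_doi` is a hypothetical callback standing in for
    whatever Crossref lookup the caller uses.
    """
    return [p for p in pubs if not has_crossref_doi(p["bib"]["title"])]


def fetch_author_publications(
    name: str, has_crossref_doi: Callable[[str], bool]
) -> list[dict]:
    """Query Google Scholar's author endpoint through scholarly."""
    # Imported lazily so the helper above works without scholarly installed.
    from scholarly import scholarly

    # Query 1: search for the author and take the first match.
    author = next(scholarly.search_author(name))
    # Query 2: fill only the author's publication list.
    author = scholarly.fill(author, sections=["publications"])
    pubs = author["publications"]
    # One extra query per publication, but only for those without a DOI match.
    to_fill = {p["author_pub_id"] for p in pubs_needing_fill(pubs, has_crossref_doi)}
    return [scholarly.fill(p) if p["author_pub_id"] in to_fill else p for p in pubs]
```

Keeping the selection step separate from the network calls makes it easy to see, and to change, how many per-publication queries a run will make before sending any of them.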

Below is an example of the publication information provided by Google Scholar through the author endpoint and a filled version of the same publication:

{
  "container_type": "Publication",
  "source": "AUTHOR_PUBLICATION_ENTRY",
  "bib": {
    "title": "A FAIR approach for detecting and sharing PFAS hot-spot areas and water systems",
    "pub_year": "2022",
    "citation": ""
  },
  "filled": false,
  "author_pub_id": "ctE_FZMAAAAJ:LPZeul_q3PIC",
  "num_citations": 1,
  "citedby_url": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=3468568981640329512",
  "cites_id": [
    "3468568981640329512"
  ]
}

# Filled
{
  "container_type": "Publication",
  "source": "AUTHOR_PUBLICATION_ENTRY",
  "bib": {
    "title": "A FAIR approach for detecting and sharing PFAS hot-spot areas and water systems",
    "pub_year": 2022,
    "citation": "",
    "author": "Sweta Ojha and P Travis Thompson and Christian D Powell and Hunter NB Moseley and Kelly G Pennell",
    "abstract": "Per- and polyfluoroalkyl substances (PFAS) contamination in water sources near potential PFAS users is well known. Therefore, it is useful for PFAS stakeholders to visualize hot-spot areas and bring attention to the water systems that are near to those areas. Towards this end, we extracted information about PFAS sources, drinking water information, sewer water information, and Source Water Assessment Protection Program (SWAPP) information from publicly available sources to create five different maps in ArcGIS Online that highlight PFAS contamination in relation to potential PFAS users. Following the FAIR (Findable, Accessible, Interoperable and Reusable) principles, we created a Figshare repository that includes all data and associated metadata with these five ArcGIS maps. Moreover, the Figshare repository includes a metadata description of the maps in JSON format that adheres to a draft Minimum Information about Geospatial Information System (MIAGIS) standard we have developed.  We hope this MIAGIS draft will assist in establishing a GIS standards group that will develop the draft into a full standard for the wider GIS community. We have also developed a miagis Python package that facilitates the generation of a MIAGIS-compliant JSON metadata file."
  },
  "filled": true,
  "author_pub_id": "ctE_FZMAAAAJ:LPZeul_q3PIC",
  "num_citations": 1,
  "citedby_url": "/scholar?hl=en&cites=3468568981640329512",
  "cites_id": [
    "3468568981640329512"
  ],
  "pub_url": "https://chemrxiv.org/engage/chemrxiv/article-details/62da093f13e3659590e0d5eb",
  "url_related_articles": "/scholar?oi=bibs&hl=en&q=related:KDmcjRzVIjAJ:scholar.google.com/",
  "cites_per_year": {
    "2023": 1
  }
}

As you can see, even when filled, the information is pretty sparse. What information specifically are you after, and is it available in the above examples? It might be possible to provide an option to fill every publication, or to pass the whole dictionary from Google Scholar through to the final output, but it would be helpful to know what information you are actually seeking and whether it is even available from this source.

yhan818 · 2023-08-23T01:00:48Z

The metadata can be used in many different ways.

For example, if you check the OpenAlex data at https://api.openalex.org/works/W4223491415, you can see fields like "is_oa", "is_corresponding" (corresponding authors), "source", "best_oa_location", etc.:


locations: [
  {
    is_oa: true,
    landing_page_url: "https://doi.org/10.1093/ofid/ofac186",
    pdf_url: null,
    source: {
      id: "https://openalex.org/S2735126445",
      display_name: "Open Forum Infectious Diseases",
      issn_l: "2328-8957",
      issn: [
        "2328-8957"
      ],
      is_oa: true,
      is_in_doaj: true,
      host_organization: "https://openalex.org/P4310311648",
      host_organization_name: "Oxford University Press",
      host_organization_lineage: [
        "https://openalex.org/P4310311647",
        "https://openalex.org/P4310311648"
      ],
      host_organization_lineage_names: [
        "University of Oxford",
        "Oxford University Press"
      ],
      type: "journal"
    },
    license: "cc-by-nc-nd",
    version: "publishedVersion",
    is_accepted: true,
    is_published: true
  },
  {
    is_oa: true,
    landing_page_url: "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9047202",
    pdf_url: null,
    ...

best_oa_location: {
  is_oa: true,
  landing_page_url: "https://doi.org/10.1093/ofid/ofac186",
  pdf_url: null,
  source: {
    id: "https://openalex.org/S2735126445",
    display_name: "Open Forum Infectious Diseases",
    issn_l: "2328-8957",
    issn: [
      "2328-8957"
    ],
    is_oa: true,
    is_in_doaj: true,
    host_organization: "https://openalex.org/P4310311648",
    host_organization_name: "Oxford University Press",
    host_organization_lineage: [
      "https://openalex.org/P4310311647",
      "https://openalex.org/P4310311648"
    ],
    host_organization_lineage_names: [
      "University of Oxford",
      "Oxford University Press"
    ],
    type: "journal"
  },
  license: "cc-by-nc-nd",
  version: "publishedVersion",
  is_accepted: true,
  is_published: true
}

Use cases:

APC info: openAlex APC info ropensci/openalexR#148
Open access location: openAlex location info ropensci/openalexR#149. You can use this info to look at your institution's repository and evaluate its success. (Note: I am sure Google Scholar probably has more detailed info than OpenAlex.)
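As a rough illustration of the repository use case, the sketch below computes what fraction of a set of works has an open-access location on a given host. It assumes each work is a dict shaped like the OpenAlex output above (a "locations" list whose entries carry "is_oa" and "landing_page_url"); the function name `repository_share` and the idea of matching on hostname are my own assumptions, not something OpenAlex prescribes.

```python
from urllib.parse import urlparse


def repository_share(works: list[dict], repo_host: str) -> float:
    """Fraction of works with an open-access location hosted on repo_host."""
    if not works:
        return 0.0
    hits = 0
    for work in works:
        for loc in work.get("locations") or []:
            # Compare on hostname so that "www." and other subdomains match.
            host = urlparse(loc.get("landing_page_url") or "").hostname or ""
            on_repo = host == repo_host or host.endswith("." + repo_host)
            if loc.get("is_oa") and on_repo:
                hits += 1
                break  # count each work at most once
    return hits / len(works)
```

For an institutional repository, `repo_host` would be the repository's own domain; matching on hostname is a simplification and will miss copies served from other domains.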

yhan818 · 2023-08-23T01:04:58Z

Same example: https://api.openalex.org/works/W4223491415

See fields like "referenced_works": you can use this field to figure out who cited whom, in other words, mine the data to find related researchers.

The "related_works" field can be used for further-reading suggestions.

With the "counts_by_year" field, you can calculate the total number of citations per author per year.
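A minimal sketch of those two calculations, assuming you have already downloaded one author's works as a list of dicts. It relies on "counts_by_year" being a list of entries with "year" and "cited_by_count" keys and on "referenced_works" being a list of OpenAlex work IDs; the function names are hypothetical.

```python
from collections import Counter


def citations_by_year(works: list[dict]) -> dict[int, int]:
    """Sum the per-year citation counts over all of an author's works."""
    totals: Counter = Counter()
    for work in works:
        for entry in work.get("counts_by_year") or []:
            totals[entry["year"]] += entry["cited_by_count"]
    return dict(sorted(totals.items()))


def citation_edges(works: list[dict]) -> list[tuple[str, str]]:
    """List (citing work ID, cited work ID) pairs from "referenced_works"."""
    return [
        (work["id"], ref)
        for work in works
        for ref in work.get("referenced_works") or []
    ]
```

The edge list can be fed into any graph tool to look for related researchers; resolving work IDs to authors would need further lookups that are not shown here.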

yhan818 · 2023-08-23T01:05:53Z

Thank you for your explanation of how this Google Scholar module works. I understand that Google Scholar is intentionally closed.

hunter-moseley · 2023-08-29T18:55:53Z

We might be able to add OpenAlex as a source for Academic Tracker, but this original thread was about metadata available via Google Scholar.

hunter-moseley · 2023-08-29T18:59:54Z

Also, OpenAlex has rate limits of 100k requests/day and 10 requests/second.
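Any client would need to stay under those limits. Below is a minimal sketch of a client-side throttle, assuming the two numbers quoted above; the `Throttle` class is hypothetical, and the rolling 24-hour window is my own simplification, since I have not checked how OpenAlex actually resets its daily count.

```python
import time


class Throttle:
    """Blocks so that calls stay under a per-second and a per-day limit."""

    def __init__(self, per_second=10, per_day=100_000,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / per_second
        self.per_day = per_day
        self.clock = clock
        self.sleep = sleep
        self.day_start = clock()
        self.used_today = 0
        self.last_call = None

    def wait(self):
        """Call once before each request."""
        now = self.clock()
        # Reset the daily budget once 24 hours have elapsed (assumed window).
        if now - self.day_start >= 86_400:
            self.day_start = now
            self.used_today = 0
        if self.used_today >= self.per_day:
            raise RuntimeError("daily request budget exhausted")
        # Space consecutive requests at least min_interval apart.
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
        self.used_today += 1
```

Calling `throttle.wait()` immediately before each HTTP request is enough; the `clock` and `sleep` parameters exist only so the class can be exercised without real delays.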

yhan818 · 2024-03-11T18:42:18Z

> Also, OpenAlex has a 100k/day and 10 request / second limits.

OpenAlex has added new things in the past 6 months. Grant info has not been its focus (but was added later in 2023), so check https://docs.openalex.org/

Webinars: https://help.openalex.org/events/webinars
