as a patron I want the pdf endpoint to extract all urls from Global Connectivity Report so I can check their status #844

Open
mojomonger opened this issue May 31, 2023 · 5 comments

Comments

@mojomonger
Collaborator

IARE:

https://internetarchive.github.io/iare/?url=https://www.itu.int/dms_pub/itu-d/opb/ind/d-ind-global.01-2022-pdf-e.pdf

produces only 1 URL link.

There are hundreds in the document, as you can see by looking at the document directly:

https://www.itu.int/dms_pub/itu-d/opb/ind/d-ind-global.01-2022-pdf-e.pdf

@dpriskorn
Collaborator

Would it be useful to have a debug=true parameter that dumps all the text and annotations?

@mojomonger
Collaborator Author

If that is the best way to dump the text, then yes!

@dpriskorn
Collaborator

dpriskorn commented Jun 3, 2023

Investigation:
[image]
The document producer is unknown.
Causes of not finding any links:

  • the PDF contains no link annotations
  • all the URLs in the extracted text have spaces in them, so the regex does not find them

You cannot see this when the PDF is rendered, but it is a broken/non-standard PDF, i.e. an edge case.

See https://archive.org/services/context/iari/v2/statistics/pdf?url=https://www.itu.int/dms_pub/itu-d/opb/ind/d-ind-global.01-2022-pdf-e.pdf&debug=true&refresh=true

This works (no spaces):
[image]
The spaces here cause the regex to miss the links:
[image]
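
The failure mode described here can be reproduced with a small sketch. The pattern `URL_PATTERN`, the helper `find_urls`, and the sample strings are all hypothetical, made up for illustration; the regex actually used by the pdf endpoint may differ.

```python
import re

# Hypothetical URL pattern for illustration only; not the endpoint's real regex.
URL_PATTERN = re.compile(r"https?://[^\s<>\"]+")


def find_urls(text: str) -> list[str]:
    """Return every substring of text that matches URL_PATTERN."""
    return URL_PATTERN.findall(text)


# Made-up sample text: the same URL, once intact and once with stray spaces
# like those seen in the text extracted from the broken PDF.
clean = "See https://example.org/reports/2022 for details."
broken = "See https://example. org/reports/ 2022 for details."

print(find_urls(clean))   # ['https://example.org/reports/2022']
print(find_urls(broken))  # ['https://example.']
```

Because the pattern stops at the first whitespace character, the URL with spaces is cut off after `https://example.` and the rest is lost.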

@dpriskorn
Collaborator

Possible solution: #852

@dpriskorn
Collaborator

In this edge case, it would work to keep the line breaks and remove all spaces instead.
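
A minimal sketch of that idea, assuming the extracted text is available as a plain string. The function name `extract_urls_ignoring_spaces`, the pattern, and the sample text are hypothetical and not taken from the project or from #852.

```python
import re

# Hypothetical pattern: a URL runs until the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls_ignoring_spaces(text: str) -> list[str]:
    """Find URLs after deleting spaces, treating line breaks as boundaries."""
    urls: list[str] = []
    for line in text.splitlines():
        # Remove only spaces; each line is processed separately, so a URL
        # never runs on into the following line.
        compact = line.replace(" ", "")
        urls.extend(URL_PATTERN.findall(compact))
    return urls


# Made-up sample with stray spaces inside the URL.
sample = "Source: https://example. org/reports/ 2022\nNext line of text"
print(extract_urls_ignoring_spaces(sample))
# ['https://example.org/reports/2022']
```

The trade-off is that any words following a URL on the same line get glued onto it, so this is only suitable for documents where each URL ends at a line break.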

@mojomonger mojomonger added the bug Something isn't working label Jun 3, 2023
@dpriskorn dpriskorn moved this from New to Save for future sprint in Internet Archive Reference Inventory Jun 7, 2023
@dpriskorn dpriskorn changed the title from "PDF Link Parsing Error: Only 1 Link found, where there should be hundreds" to "as a patron I want the pdf endpoint to extract all urls from Global Connectivity Report so I can check their status" Jun 7, 2023
@dpriskorn dpriskorn added link extraction and removed bug Something isn't working labels Jun 7, 2023