as a patron I want the pdf endpoint to extract all urls from Global Connectivity Report so I can check their status #844

mojomonger · 2023-05-31T16:19:59Z

IARE:

https://internetarchive.github.io/iare/?url=https://www.itu.int/dms_pub/itu-d/opb/ind/d-ind-global.01-2022-pdf-e.pdf

produces only 1 URL.

There are hundreds in the document, as you can see by looking at the document directly:

https://www.itu.int/dms_pub/itu-d/opb/ind/d-ind-global.01-2022-pdf-e.pdf

dpriskorn · 2023-06-01T10:47:53Z

Would it be useful to have a debug=true parameter that dumps all the text and annotations?

mojomonger · 2023-06-01T19:13:32Z

If that is the best way to dump the text, then yes!

dpriskorn · 2023-06-03T06:23:53Z

Investigation:

- The document producer is unknown.
- Cause of not finding any links:
  - The PDF has no link annotations.
  - All the URLs have spaces in them, so the regex does not find them.

You cannot see this in the rendered document, but the PDF is broken/non-standard. This is an edge case.

see https://archive.org/services/context/iari/v2/statistics/pdf?url=https://www.itu.int/dms_pub/itu-d/opb/ind/d-ind-global.01-2022-pdf-e.pdf&debug=true&refresh=true
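To illustrate the second cause, here is a minimal sketch of how a typical whitespace-delimited URL regex behaves on text with and without embedded spaces. The pattern, the `find_urls` helper, and the sample strings are assumptions for illustration only; they are not the regex used by the pdf endpoint nor text taken from the report.

```python
import re

# Hypothetical pattern: a scheme followed by a run of non-whitespace characters.
# This is NOT the endpoint's actual regex, only a common shape for one.
URL_PATTERN = re.compile(r"https?://[^\s<>\"]+")


def find_urls(text):
    """Return every substring of text that the pattern matches."""
    return URL_PATTERN.findall(text)


# Made-up samples: the same URL as clean text and as text with spaces
# injected into it (an assumed form of the breakage).
clean = "Source: https://example.org/reports/2022.pdf (accessed 2022)"
broken = "Source: https:// example. org/ reports/ 2022.pdf (accessed 2022)"

print(find_urls(clean))
print(find_urls(broken))
```

With the clean string the pattern returns the full URL. With the broken string nothing matches, because the character right after `https://` is a space and the pattern needs at least one non-whitespace character there.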

this works (no spaces):

the spaces here cause the regex to not find the links:

dpriskorn · 2023-06-03T09:34:07Z

possible solution #852

dpriskorn · 2023-06-03T09:50:24Z

In this edge case, it would work to keep the line breaks and remove all spaces instead.
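A minimal sketch of that idea, assuming the extracted text keeps one reference per line: line breaks are preserved as delimiters, spaces and tabs are stripped within each line, and the URL regex is then applied. The pattern and the `extract_urls_ignoring_spaces` helper are hypothetical and are not the code from #852.

```python
import re
from typing import List

# Hypothetical pattern, not the endpoint's actual regex.
URL_PATTERN = re.compile(r"https?://[^\s<>\"]+")


def extract_urls_ignoring_spaces(text: str) -> List[str]:
    """Keep line breaks as delimiters, drop spaces and tabs, then match URLs."""
    urls = []
    for line in text.splitlines():
        # Remove horizontal whitespace only; the line break still ends the URL.
        compact = re.sub(r"[ \t]+", "", line)
        urls.extend(URL_PATTERN.findall(compact))
    return urls


# Made-up sample with spaces injected into two URLs.
sample = "See https:// example. org/ a/ report. pdf\nand http:// example. com/ b"
print(extract_urls_ignoring_spaces(sample))
```

The trade-off is that any text following a URL on the same line is glued onto it once the spaces are gone, so this approach only suits documents where each URL ends at a line break.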

mojomonger added the pdf endpoint label May 31, 2023

mojomonger assigned dpriskorn May 31, 2023

dpriskorn added this to Internet Archive Reference Inventory Jun 3, 2023

github-project-automation bot moved this to New in Internet Archive Reference Inventory Jun 3, 2023

dpriskorn moved this from New to Blocked in Internet Archive Reference Inventory Jun 3, 2023

dpriskorn added the edge case label Jun 3, 2023

dpriskorn moved this from Blocked to New in Internet Archive Reference Inventory Jun 3, 2023

mojomonger added the bug label Jun 3, 2023

dpriskorn moved this from New to Save for future sprint in Internet Archive Reference Inventory Jun 7, 2023

dpriskorn changed the title ~~PDF Link Parsing Error: Only 1 Link found, where there should be hundreds~~ as a patron I want the pdf endpoint to extract all urls from Global Connectivity Report so I can check their status Jun 7, 2023

dpriskorn added the link extraction label and removed the bug label Jun 7, 2023
