Fix issue #12000: Parsing arXiv Id when importing a PDF with arXiv Id #12079

XYZ567AB · 2024-10-25T01:41:11Z


Add 'ARXIVID' field to 'StandardField' enum class
Add 'getArxivId' method to 'PdfContentImporter' class
Add logic to handle arxivId in 'getEntryFromPDFContent' method
Add 'extractArxivIdFromPage1' test method to 'PdfContentImporterTest' class

Closes #12000

Mandatory checks

I own the copyright of the code submitted and I licence it under the MIT license
Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
Tests created for changes (if applicable)
Manually tested changed features in running JabRef (always required)
Screenshots added in PR description (for UI changes)
Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

Siedlerchr

Arxiv is not a Standard field

Siedlerchr · 2024-10-25T09:57:30Z

src/main/java/org/jabref/model/entry/field/StandardField.java

@@ -25,6 +25,7 @@ public enum StandardField implements Field {
    ARCHIVEPREFIX("archiveprefix"),
    ASSIGNEE("assignee", FieldProperty.PERSON_NAMES),
    AUTHOR("author", FieldProperty.PERSON_NAMES),
+    ARXIVID("arXivId", FieldProperty.VERBATIM),


arXiv is not a StandardField; you need to use eprint for this.
See chapter 3.14.7, "Electronic Publishing Information", of the biblatex spec:
http://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf

Siedlerchr · 2024-10-25T09:57:51Z

src/test/java/org/jabref/logic/importer/fileformat/PdfContentImporterTest.java

+                .withField(StandardField.AUTHOR, "Review Article")
+                .withField(StandardField.TITLE, "British Journal of Nutrition (2008), 99, 1–11 doi: 10.1017/S0007114507795296 arXiv:2408.06224v1 q The Authors")
+                .withField(StandardField.YEAR, "2024")
+                .withField(StandardField.ARXIVID, "2408.06224v1");


Needs to be eprint

Thanks for your feedback! I see, I will change it.

You should then also set the corresponding eprinttype field to arxiv.
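A minimal sketch of what the two comments above ask for, assuming a plain map stands in for the entry (the class and method names here are hypothetical, not JabRef API). The arXiv ID goes into eprint, and eprinttype is set to arxiv, so no dedicated arXiv field is needed:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a plain map stands in for a bibliography entry.
final class EprintFieldsSketch {

    /** Returns a copy of the fields with the arXiv ID stored in eprint and eprinttype set to arxiv. */
    static Map<String, String> withArxivId(Map<String, String> fields, String arxivId) {
        Map<String, String> result = new LinkedHashMap<>(fields);
        result.put("eprint", arxivId);
        result.put("eprinttype", "arxiv");
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> entry = withArxivId(Map.of("year", "2024"), "2408.06224v1");
        System.out.println(entry.get("eprinttype") + ":" + entry.get("eprint"));
    }
}
```

In the PR itself this would presumably be expressed through the entry's withField calls, as in the test hunk above, with the eprint field in place of ARXIVID.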

koppor · 2024-10-25T10:25:29Z

@XYZ567AB You can add a code comment to StandardField.java to guide others working here. It should say how the arXiv ID is stored and link to the Java class handling arXiv IDs and to the functionality introduced in #11627.

koppor · 2024-10-25T13:07:01Z

CHANGELOG.md

@@ -11,6 +11,7 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv

 ### Added

+- We added functionality to handle arXiv ID in `PdfContentImporter` and implemented related test case. [#12000](https://github.com/JabRef/jabref/issues/12000)


CHANGELOG entries are user-facing. A user does not see PdfContentImporter in the UI. We are updating the docs at JabRef/user-documentation#537, and the updated section should be linked.

koppor · 2024-10-25T13:07:37Z

src/main/java/org/jabref/logic/importer/fileformat/PdfContentImporter.java

+    private String getArxivId(String arxivId) {
+        int pos;
+        if (arxivId == null) {
+            pos = curString.indexOf("arxiv");
+            if (pos < 0) {
+                pos = curString.indexOf("arXiv");
+            }
+            if (pos >= 0) {
+                String arxivText = curString.substring(pos);
+                return ArXivIdentifier.parse(arxivText).map(ArXivIdentifier::asString).orElse(null);
+            }
+        }
+        return arxivId;
+    }


No! We have the class org.jabref.model.entry.identifier.ArXivIdentifier. Use this.

This is still unfixed.
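One way to read this request: isolate the identifier token from the page text first, then hand only that token to the identifier class. The sketch below covers just the first step and is an assumption, not the project's code. The class name, method name, and pattern are hypothetical, and it only recognises new-style IDs such as 2408.06224v1.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: finds an arXiv-looking token inside free text.
final class ArxivIdScanSketch {

    // "arXiv" in any casing, an optional colon and spaces, then a new-style ID.
    private static final Pattern ARXIV_IN_TEXT =
            Pattern.compile("arxiv\\s?:?\\s?(\\d{4}\\.\\d{4,5}(v\\d+)?)", Pattern.CASE_INSENSITIVE);

    /** Returns the first new-style arXiv ID found in the text, without the "arXiv:" prefix. */
    static Optional<String> findArxivId(String pageText) {
        if (pageText == null) {
            return Optional.empty();
        }
        Matcher matcher = ARXIV_IN_TEXT.matcher(pageText);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(findArxivId("doi: 10.1017/S0007114507795296 arXiv:2408.06224v1 q The Authors"));
    }
}
```

A case-insensitive match replaces the two indexOf calls, and the trailing text after the ID is not passed on. The extracted token could then go to ArXivIdentifier.parse, as the diff above already does with the longer substring.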

koppor · 2024-10-25T13:08:16Z

src/main/java/org/jabref/model/entry/identifier/ArXivIdentifier.java

-        Pattern oldIdentifierPattern = Pattern.compile("(" + ARXIV_PREFIX + ")?\\s?:?\\s?(?<id>(?<classification>[a-z\\-]+(\\.[A-Z]{2})?)/\\d{7})(v(?<version>\\d+))?");
-        Matcher oldIdentifierMatcher = oldIdentifierPattern.matcher(identifier);
-        if (oldIdentifierMatcher.matches()) {
-            return getArXivIdentifier(oldIdentifierMatcher);
-        }
+       Pattern oldIdentifierPattern = Pattern.compile("(" + ARXIV_PREFIX + ")?\\s?:?\\s?(?<id>(?<classification>[a-z\\-]+(\\.[A-Z]{2})?)/\\d{7})(v(?<version>\\d+))?");
+       Matcher oldIdentifierMatcher = oldIdentifierPattern.matcher(identifier);
+       if (oldIdentifierMatcher.matches()) {
+           return getArXivIdentifier(oldIdentifierMatcher);
+       }

-        return Optional.empty();
+       return Optional.empty();


Wrong indent. Please revert.

koppor · 2024-10-25T13:08:43Z

src/test/java/org/jabref/logic/importer/fileformat/PdfContentImporterTest.java

@@ -123,4 +123,38 @@ British Journal of Nutrition (2008), 99, 1–11 doi: 10.1017/S0007114507795296

        assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContent, "\n"));
    }
+
+    @Test
+    void extractArxivIdFromPage1() {


Please also use a real arXiv PDF. I think there is a link to one in the issue?

koppor · 2024-10-27T00:22:28Z

src/main/java/org/jabref/logic/importer/fileformat/PdfContentImporter.java

+            if (arxivId != null) {
+                year = "20" + arxivId.substring(0, 2);
+            }


This should go into the else branch: if there is no year, then one can check for arxivId and guess a year from it.
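A sketch of the suggested fallback, under the assumption that the year parsed from the PDF and the arXiv ID are both available as plain strings (the names below are hypothetical). The year found in the PDF wins, and the ID is consulted only when no year was found. New-style arXiv IDs begin with YYMM, so the guess applies only to those. An old-style ID such as hep-th/9901001 yields no guess rather than a wrong one.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: year fallback derived from a new-style arXiv ID.
final class ArxivYearGuessSketch {

    // New-style IDs look like YYMM.NNNNN with an optional version suffix, e.g. 2408.06224v1.
    private static final Pattern NEW_STYLE = Pattern.compile("^(\\d{2})(\\d{2})\\.\\d{4,5}(v\\d+)?$");

    /** Guesses a four-digit year from a new-style arXiv ID; empty if the ID has another shape. */
    static Optional<String> guessYear(String arxivId) {
        if (arxivId == null) {
            return Optional.empty();
        }
        Matcher matcher = NEW_STYLE.matcher(arxivId.trim());
        if (!matcher.matches()) {
            return Optional.empty();
        }
        int month = Integer.parseInt(matcher.group(2));
        if (month < 1 || month > 12) {
            return Optional.empty();
        }
        return Optional.of("20" + matcher.group(1));
    }

    /** Keeps the year found in the PDF; falls back to the arXiv ID only when there is none. */
    static String resolveYear(String yearFromPdf, String arxivId) {
        if (yearFromPdf != null && !yearFromPdf.isBlank()) {
            return yearFromPdf;
        } else {
            return guessYear(arxivId).orElse(null);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveYear(null, "2408.06224v1"));
        System.out.println(resolveYear("2008", "2408.06224v1"));
    }
}
```

Unlike an unconditional "20" + arxivId.substring(0, 2), this never overwrites a year that was already extracted.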

Simon7878912 and others added 3 commits October 25, 2024 12:08

1. Add 'ARXIVID' field to 'StandardField' enum class

19dcb44

2. Add 'getArxivId' method to 'PdfContentImporter' class 3. Add logic to handle arxivId in 'getEntryFromPDFContent' method 4. Add 'extractArxivIdFromPage1' test method to 'PdfContentImporterTest' class

Merge branch 'JabRef:main' into 12000-fetch-arxiv-info-from-pdf

36107f6

Merge branch 'main' into 12000-fetch-arxiv-info-from-pdf

5d04b96

Siedlerchr requested changes Oct 25, 2024

koppor added the "status: changes required" label Oct 25, 2024

Simon7878912 added 2 commits October 25, 2024 21:23

Updated the implementation to use the 'EPRINT' StandardField for arXiv IDs.

31e2619

Merge remote-tracking branch 'origin/12000-fetch-arxiv-info-from-pdf' into 12000-fetch-arxiv-info-from-pdf

d32ee7a

XYZ567AB and others added 2 commits October 25, 2024 21:30

Merge branch 'main' into 12000-fetch-arxiv-info-from-pdf

353f44e

Since ArxivId is stored in the eprint field, getArxivId has changed accordingly

478d8be

koppor requested changes Oct 25, 2024

Simon7878912 added 2 commits October 26, 2024 00:18

Fix wrong indentation.

19e62f6

Remove the log in CHANGELOG

1565cc2

koppor requested changes Oct 27, 2024
