Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue #12000: Parsing arXiv Id when importing a PDF with arXiv Id #12079

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

XYZ567AB
Copy link

Describe the changes you have made here: what, why, ...
Link the issue that will be closed, e.g., "Closes #333". If your PR closes a koppor issue, link it using its URL, e.g., "Closes koppor#47".

Add 'ARXIVID' field to 'StandardField' enum class
Add 'getArxivId' method to 'PdfContentImporter' class
Add logic to handle arxivId in 'getEntryFromPDFContent' method
Add 'extractArxivIdFromPage1' test method to 'PdfContentImporterTest' class

Closes #12000
Closes #12000

Mandatory checks

  • I own the copyright of the code submitted and I licence it under the MIT license
  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

Simon7878912 and others added 3 commits October 25, 2024 12:08
2. Add 'getArxivId' method to 'PdfContentImporter' class
3. Add logic to handle arxivId in 'getEntryFromPDFContent' method
4. Add 'extractArxivIdFromPage1' test method to 'PdfContentImporterTest' class
Copy link
Member

@Siedlerchr Siedlerchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arxiv is not a Standard field

@@ -25,6 +25,7 @@ public enum StandardField implements Field {
ARCHIVEPREFIX("archiveprefix"),
ASSIGNEE("assignee", FieldProperty.PERSON_NAMES),
AUTHOR("author", FieldProperty.PERSON_NAMES),
ARXIVID("arXivId", FieldProperty.VERBATIM),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arxiv is no StandardField, you need to use eprint for this
see chapter 3.14.7 Electronic Publishing Information of the biblatex spec
http://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf

.withField(StandardField.AUTHOR, "Review Article")
.withField(StandardField.TITLE, "British Journal of Nutrition (2008), 99, 1–11 doi: 10.1017/S0007114507795296 arXiv:2408.06224v1 q The Authors")
.withField(StandardField.YEAR, "2024")
.withField(StandardField.ARXIVID, "2408.06224v1");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be eprint

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your feedback! I see, I will change it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You then should also add the correpsonding eprinttype field to arxiv

@koppor koppor added the status: changes required Pull requests that are not yet complete label Oct 25, 2024
@koppor
Copy link
Member

koppor commented Oct 25, 2024

@XYZ567AB You can add a code comment into StandardField.java to guide someone also working here. Saying how the arXiv ID is stored and also add a link to the Java class handling arXiv Ids and the functionality introduced at #11627.

CHANGELOG.md Outdated
@@ -11,6 +11,7 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv

### Added

- We added functionality to handle arXiv ID in `PdfContentImporter` and implemented related test case. [#12000](https://github.com/JabRef/jabref/issues/12000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments are user facing. A user does not see PdfContentImporter in the UI. We are updating the docs at JabRef/user-documentation#537 - and the updated section should be linked.

Comment on lines +493 to +506
private String getArxivId(String arxivId) {
int pos;
if (arxivId == null) {
pos = curString.indexOf("arxiv");
if (pos < 0) {
pos = curString.indexOf("arXiv");
}
if (pos >= 0) {
String arxivText = curString.substring(pos);
return ArXivIdentifier.parse(arxivText).map(ArXivIdentifier::asString).orElse(null);
}
}
return arxivId;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No! We have the class org.jabref.model.entry.identifier.ArXivIdentifier use this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still unfixed.

Comment on lines 48 to 54
Pattern oldIdentifierPattern = Pattern.compile("(" + ARXIV_PREFIX + ")?\\s?:?\\s?(?<id>(?<classification>[a-z\\-]+(\\.[A-Z]{2})?)/\\d{7})(v(?<version>\\d+))?");
Matcher oldIdentifierMatcher = oldIdentifierPattern.matcher(identifier);
if (oldIdentifierMatcher.matches()) {
return getArXivIdentifier(oldIdentifierMatcher);
}
Pattern oldIdentifierPattern = Pattern.compile("(" + ARXIV_PREFIX + ")?\\s?:?\\s?(?<id>(?<classification>[a-z\\-]+(\\.[A-Z]{2})?)/\\d{7})(v(?<version>\\d+))?");
Matcher oldIdentifierMatcher = oldIdentifierPattern.matcher(identifier);
if (oldIdentifierMatcher.matches()) {
return getArXivIdentifier(oldIdentifierMatcher);
}

return Optional.empty();
return Optional.empty();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong indent. Please revert.

@@ -123,4 +123,38 @@ British Journal of Nutrition (2008), 99, 1–11 doi: 10.1017/S0007114507795296

assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContent, "\n"));
}

@Test
void extractArxivIdFromPage1() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also use a real arXiv PDF. I think, there is a link to one in the issue?

Comment on lines +468 to +470
if (arxivId != null) {
year = "20" + arxivId.substring(0, 2);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go to the else branch. - If there is no year then one can check for arxivId and guess a year out of it.

Comment on lines +493 to +506
private String getArxivId(String arxivId) {
int pos;
if (arxivId == null) {
pos = curString.indexOf("arxiv");
if (pos < 0) {
pos = curString.indexOf("arXiv");
}
if (pos >= 0) {
String arxivText = curString.substring(pos);
return ArXivIdentifier.parse(arxivText).map(ArXivIdentifier::asString).orElse(null);
}
}
return arxivId;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still unfixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: changes required Pull requests that are not yet complete
Projects
None yet
4 participants