-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a title guess method to get "better" title #12018
Conversation
Add title guess method
fix unit test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun
, check the results, commit, and push.
You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".
update unit test to JDK 21 style
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to have a test case.
But I think, there should be some earlier switch. See the hint stripper.setSortByPosition(true);
Please try again longer how to handle two-column PDFs. -- The title should be parsed corectly by org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent
What I meant is: Do two passes: One with sorted by coordinates and one keeping text blocks together. The abstract is probably better prased with keeping text blocks together; the title better with position.
One can tweak org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent to see if the paper is a single-column or two column paper. One could even go further and detect if it is Springer, IEEE or another format.
@@ -123,4 +124,11 @@ British Journal of Nutrition (2008), 99, 1–11 doi: 10.1017/S0007114507795296 | |||
|
|||
assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContent, "\n")); | |||
} | |||
|
|||
@Test | |||
void se2Pdfs() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void se2Pdfs() throws Exception { | |
void se2Pdf() throws Exception { |
src/test/java/org/jabref/logic/importer/fileformat/PdfContentImporterTest.java
Outdated
Show resolved
Hide resolved
update unit test
update get title by area
Wow, I thought I wrote something cool and AI-like method, but it looks like it’s not quite what we need! 😅 I’ve now added a method to grab the title by position. This is just a demo code to see if the approach works for us—there’s definitely room for improvement, especially in formatting and commented code. If we’re happy with it, I’ll clean things up and refactor the org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent input! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun
, check the results, commit, and push.
You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".
remove StringUtils.isBlank and add @AllowedToUseAwt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun
, check the results, commit, and push.
You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".
Please list the files available for others to be able to review. I think, JabRef should have an example file for the first four templates listed at https://latextemplates.github.io/. If not, pleae add. I see following dirctories containing pre-formatted files
And some more at https://github.com/JabRef/jabref/tree/main/src/test/resources/pdfs - where I think, most of them do not follow a publisher format. I thought that first dowing a parsing regarding position sorted for title and author, then reparsing with onsorted, but with blocks for abstract is the most working approach. First, the format of the PDF should be guessed. If IEEE, then set parameters for IEEE. If LNCS, set format for LNCS. Guessing the title by a rectangle works if you know the publisher format. Thus, that guessing should go first. For 563 TB of test data - outside Europe and US and maybe many other countries, go to https://annas-archive.org/. -- You can also start with 32 TB of test data. |
Sorry, so far I have only tested it with the single file mentioned in the issue, se2paper.pdf. I haven’t written academic papers in years, so I’m not very familiar with the IEEE and LNCS formats. I’ll do my best to learn and research them from the test data tomorrow |
ToDo:find a minimal pdf for test
change to get title by font size add more unittest
I added 4 more unit test case with "IEEE", "LNCS", "scientificThesis", and a "thesis-example" I found in the pdf folder, new code can pass them all so far, if we need more test I can add it. |
RemoveTestPrefix
… into close-issue-11999
I should change the pdf used in importTwiceWorksAsExpected, or my code need to deal with the paper with same font size in AUTHOR and TITLE?
fix the unit test and open rewrite issue
Remove commented code. You have it in your git history if you ever need it. |
sorry it is a dirty code now, I want to set up a demo code to check if we like this solution or not, if the solution confirmed I'll refactor it. |
remove commented code
Please propose a PR to the user documentation at https://docs.jabref.org/collect/findunlinkedfiles#pdfs-for-which-it-works - explaining the larger titles. |
I am currently very busy, thus quick replies only. Sorry for that.
Sounds good. The idea using the largest object is smart! For future work, maybe, the second page needs to be checked, too. Sometimes, there is a cover page in front. I currently do not have an example at hand, but I don't have an example at hand - and I think, the heuristics will get harder then.
Can you add follwing papers as test cases:
When adding, add a In all these papers, you can also test for the abstract. In all thesse papers, you can test for the authors (which is probably a brain-teaser and left for future work). |
1. revert the temp change of unit test to original one. 2. all of title's character should stay together, add a `isFarAway` method to it to pass the unit test for hello world case. 3. change guess title variable name form old version `titleByPosition` to `titleByFontSize` 4. remove all commented code. 5. Add a `@VisibleForTesting` to `getEntryFromPDFContent`. 6. rewrite the javaDoc for `getEntryFromPDFContent` 7. fix the logic issue for setting title.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code currently does not meet JabRef's code guidelines.
We use Checkstyle to identify issues.
Please carefully follow the setup guide for the codestyle.
Afterwards, please run checkstyle locally and fix the issues.
remove Blank line at start of block
rename and replace unit test file
add bib and readme.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need this functionality, therefore merging.
Please try to adress the comments in a follow-up PR.
@@ -0,0 +1,110 @@ | |||
@inproceedings{se2paper, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The bib file should be in THE SAME directory as the .pdf files
- Set
.
as library-specific directory (in the library properties) - Have a
file
attached to each entry in thebib
file. - Try to improve BibTeX data in the
bib
file:- Add DOI
- Fix authors
- Add Year
- Generate citation key
- Rename PDF files to nicer name.
I added a CHANGELOG.md entry |
|
||
@ParameterizedTest | ||
@MethodSource("providePdfData") | ||
void pdfTitleExtraction(String filePath, String expectedTitle) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up pull request: swap the arguments.
at assert
in JUnit, first the expected value appears, then the input data. - This should also be done when wrapping inside @ParameterizedTests
@leaf-soba I think, you need to do some cleanup of the filenames before a merge can go through. See test output on Windows test:
Sorry für that. Maybe, you can adress all comments by me then. |
rename the file to pass CI
Head branch was pushed to by a user without write access
address all comments
Sorry I know it should be in another follow-up PR as you said before we should keep PR small but this one is too big, but I need to change the file name to pass the CI workflow, so it is a little hard to separate them.
Do you mean:
But I don't know how to change it in code or project file as a PR, could you explain it in the follow-up PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.
fix the file name in unit test
[...]]
Yes. Then save the file. The Since this is not a show-stopper, I let this PR an and we can do that in a follow-up PR. |
PdfMergeMetadataImporterTest
now, since I'm not sure if we like this kind of method or not, so I want to set up the PR with a demo code to discuss it first. I'll add it toPdfMergeMetadataImporter
with a better clear code if we like it.Mandatory checks
CHANGELOG.md
described in a way that is understandable for the average user (if applicable)