Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a title guess method to get "better" title #12018

Merged
merged 24 commits into from
Oct 30, 2024

Conversation

leaf-soba
Copy link
Contributor

  • Closes:Improve BibTeX-from-PDF import #11999
  • As the issue said I should "traditional" (non-AI) code, so I write a AI-liked code in a simple traditional way. Since I don't have enough data, so the rules is limited now.
  • I didn't add it PdfMergeMetadataImporterTest now, since I'm not sure if we like this kind of method or not, so I want to set up the PR with a demo code to discuss it first. I'll add it to PdfMergeMetadataImporter with a better clear code if we like it.

Mandatory checks

  • I own the copyright of the code submitted and I licence it under the MIT license
  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

Add title guess method
fix unit test
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

update unit test to JDK 21 style
Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to have a test case.

But I think, there should be some earlier switch. See the hint stripper.setSortByPosition(true);

Please try again longer how to handle two-column PDFs. -- The title should be parsed corectly by org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent

What I meant is: Do two passes: One with sorted by coordinates and one keeping text blocks together. The abstract is probably better prased with keeping text blocks together; the title better with position.

One can tweak org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent to see if the paper is a single-column or two column paper. One could even go further and detect if it is Springer, IEEE or another format.

@@ -123,4 +124,11 @@ British Journal of Nutrition (2008), 99, 1–11 doi: 10.1017/S0007114507795296

assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContent, "\n"));
}

@Test
void se2Pdfs() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void se2Pdfs() throws Exception {
void se2Pdf() throws Exception {

update unit test
update get title by area
@leaf-soba
Copy link
Contributor Author

Good to have a test case.

But I think, there should be some earlier switch. See the hint stripper.setSortByPosition(true);

Please try again longer how to handle two-column PDFs. -- The title should be parsed corectly by org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent

What I meant is: Do two passes: One with sorted by coordinates and one keeping text blocks together. The abstract is probably better prased with keeping text blocks together; the title better with position.

One can tweak org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent to see if the paper is a single-column or two column paper. One could even go further and detect if it is Springer, IEEE or another format.

Wow, I thought I wrote something cool and AI-like method, but it looks like it’s not quite what we need! 😅 I’ve now added a method to grab the title by position. This is just a demo code to see if the approach works for us—there’s definitely room for improvement, especially in formatting and commented code. If we’re happy with it, I’ll clean things up and refactor the org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent input!

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

remove StringUtils.isBlank and add @AllowedToUseAwt
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

@leaf-soba leaf-soba requested a review from koppor October 21, 2024 12:26
@koppor
Copy link
Member

koppor commented Oct 21, 2024

  • Since I don't have enough data, so the rules is limited now.

Please list the files available for others to be able to review.

I think, JabRef should have an example file for the first four templates listed at https://latextemplates.github.io/.

If not, pleae add.

I see following dirctories containing pre-formatted files

And some more at https://github.com/JabRef/jabref/tree/main/src/test/resources/pdfs - where I think, most of them do not follow a publisher format.


I thought that first dowing a parsing regarding position sorted for title and author, then reparsing with onsorted, but with blocks for abstract is the most working approach.

First, the format of the PDF should be guessed. If IEEE, then set parameters for IEEE. If LNCS, set format for LNCS.

Guessing the title by a rectangle works if you know the publisher format. Thus, that guessing should go first.

For 563 TB of test data - outside Europe and US and maybe many other countries, go to https://annas-archive.org/. -- You can also start with 32 TB of test data.

image

@leaf-soba
Copy link
Contributor Author

  • Since I don't have enough data, so the rules is limited now.

Please list the files available for others to be able to review.

I think, JabRef should have an example file for the first four templates listed at https://latextemplates.github.io/.

If not, pleae add.

I see following dirctories containing pre-formatted files

And some more at https://github.com/JabRef/jabref/tree/main/src/test/resources/pdfs - where I think, most of them do not follow a publisher format.

I thought that first dowing a parsing regarding position sorted for title and author, then reparsing with onsorted, but with blocks for abstract is the most working approach.

First, the format of the PDF should be guessed. If IEEE, then set parameters for IEEE. If LNCS, set format for LNCS.

Guessing the title by a rectangle works if you know the publisher format. Thus, that guessing should go first.

For 563 TB of test data - outside Europe and US and maybe many other countries, go to https://annas-archive.org/. -- You can also start with 32 TB of test data.

image

Sorry, so far I have only tested it with the single file mentioned in the issue, se2paper.pdf. I haven’t written academic papers in years, so I’m not very familiar with the IEEE and LNCS formats. I’ll do my best to learn and research them from the test data tomorrow

ToDo:find a minimal pdf for test
@koppor koppor added the status: changes required Pull requests that are not yet complete label Oct 23, 2024
@koppor koppor marked this pull request as draft October 23, 2024 19:10
@leaf-soba
Copy link
Contributor Author

I added 4 more unit test case with "IEEE", "LNCS", "scientificThesis", and a "thesis-example" I found in the pdf folder, new code can pass them all so far, if we need more test I can add it.
The new idea is the title should be the largest font size in most of papers which I checked in the test data.

@leaf-soba leaf-soba marked this pull request as ready for review October 25, 2024 03:19
RemoveTestPrefix
I should change the pdf used in importTwiceWorksAsExpected, or my code need to deal with the paper with same font size in AUTHOR and TITLE?
fix the unit test and open rewrite issue
@koppor
Copy link
Member

koppor commented Oct 25, 2024

Remove commented code. You have it in your git history if you ever need it.

@leaf-soba
Copy link
Contributor Author

Remove commented code. You have it in your git history if you ever need it.

sorry it is a dirty code now, I want to set up a demo code to check if we like this solution or not, if the solution confirmed I'll refactor it.

remove commented code
@koppor
Copy link
Member

koppor commented Oct 25, 2024

Please propose a PR to the user documentation at https://docs.jabref.org/collect/findunlinkedfiles#pdfs-for-which-it-works - explaining the larger titles.

@koppor
Copy link
Member

koppor commented Oct 25, 2024

I am currently very busy, thus quick replies only. Sorry for that.

I added 4 more unit test case with "IEEE", "LNCS", "scientificThesis", and a "thesis-example" I found in the pdf folder,

Sounds good.

The idea using the largest object is smart! For future work, maybe, the second page needs to be checked, too. Sometimes, there is a cover page in front. I currently do not have an example at hand, but I don't have an example at hand - and I think, the heuristics will get harder then.

if we need more test I can add it.

Can you add follwing papers as test cases:

  1. https://link.springer.com/article/10.1007/s10664-023-10367-y
  2. https://link.springer.com/article/10.1007/s10664-020-09875-y --> note that it would be great if the sub title was parsed
  3. https://onlinelibrary.wiley.com/doi/10.1002/spe.3169
  4. https://peerj.com/articles/cs-213/#
  5. https://dl.acm.org/doi/10.1145/3597503.3639130 (ACM format)

When adding, add a README.md to the folder where there are stored with the source of the PDF. I mean, a table with filename as first column and link as second column.


In all these papers, you can also test for the abstract.

In all thesse papers, you can test for the authors (which is probably a brain-teaser and left for future work).

1. revert the temp change of unit test to original one.
2. all of title's character should stay together, add a `isFarAway` method to it to pass the unit test for hello world case.
3. change guess title variable name form old version `titleByPosition` to `titleByFontSize`
4. remove all commented code.
5. Add a `@VisibleForTesting` to `getEntryFromPDFContent`.
6. rewrite the javaDoc for `getEntryFromPDFContent`
7. fix the logic issue for setting title.
@leaf-soba
Copy link
Contributor Author

  1. revert the temp change of unit test to original one.
  2. all of title's character should stay together, add a isFarAway method to it to pass the unit test for hello world case.
  3. change guess title variable name form old version titleByPosition to titleByFontSize
  4. remove all commented code.
  5. Add a @VisibleForTesting to getEntryFromPDFContent.
  6. rewrite the javaDoc for getEntryFromPDFContent
  7. fix the logic issue for setting title.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use Checkstyle to identify issues.
Please carefully follow the setup guide for the codestyle.
Afterwards, please run checkstyle locally and fix the issues.

remove Blank line at start of block
rename and replace unit test file
add bib and readme.md
koppor
koppor previously approved these changes Oct 30, 2024
Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need this functionality, therefore merging.

Please try to adress the comments in a follow-up PR.

@@ -0,0 +1,110 @@
@inproceedings{se2paper,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The bib file should be in THE SAME directory as the .pdf files
  • Set . as library-specific directory (in the library properties)
  • Have a file attached to each entry in the bib file.
  • Try to improve BibTeX data in the bib file:
    • Add DOI
    • Fix authors
    • Add Year
    • Generate citation key
    • Rename PDF files to nicer name.

@koppor koppor enabled auto-merge October 30, 2024 00:01
@koppor koppor disabled auto-merge October 30, 2024 00:01
@koppor
Copy link
Member

koppor commented Oct 30, 2024

I added a CHANGELOG.md entry

@koppor koppor enabled auto-merge October 30, 2024 00:03

@ParameterizedTest
@MethodSource("providePdfData")
void pdfTitleExtraction(String filePath, String expectedTitle) throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up pull request: swap the arguments.

at assert in JUnit, first the expected value appears, then the input data. - This should also be done when wrapping inside @ParameterizedTests

@koppor
Copy link
Member

koppor commented Oct 30, 2024

@leaf-soba I think, you need to do some cleanup of the filenames before a merge can go through. See test output on Windows test:

 Error: error: invalid path 'src/test/resources/pdfs/PdfContentImporter/peerj-cs-213 - On the impact of service-oriented patterns on software evolvability: a controlled experiment and metric-based analysis.pdf'

Sorry für that.

Maybe, you can adress all comments by me then.

rename the file to pass CI
auto-merge was automatically disabled October 30, 2024 03:04

Head branch was pushed to by a user without write access

address all comments
@leaf-soba
Copy link
Contributor Author

leaf-soba commented Oct 30, 2024

Sorry I know it should be in another follow-up PR as you said before we should keep PR small but this one is too big, but I need to change the file name to pass the CI workflow, so it is a little hard to separate them.

Set . as library-specific directory (in the library properties)
This one is only comment I didn't finish, since I don't understand it very clearly.

Do you mean:

  1. Open JabRef, load my BibTeX file.
  2. Go to Library → Library Properties.
  3. In File Directory, enter a .
  4. save my settings

But I don't know how to change it in code or project file as a PR, could you explain it in the follow-up PR.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.

You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

fix the file name in unit test
@leaf-soba leaf-soba requested a review from koppor October 30, 2024 03:42
@koppor koppor added this pull request to the merge queue Oct 30, 2024
@koppor
Copy link
Member

koppor commented Oct 30, 2024

Do you mean:

[...]]

4. save my settings

Yes.

Then save the file. The .bib file is modified then. Then you can normally usie "git" to commit ato a new branch.

Since this is not a show-stopper, I let this PR an and we can do that in a follow-up PR.

@koppor koppor removed the status: changes required Pull requests that are not yet complete label Oct 30, 2024
Merged via the queue into JabRef:main with commit 7232836 Oct 30, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants