Add a title guess method to get "better" title #12018

leaf-soba · 2024-10-18T06:43:14Z

Closes:Improve BibTeX-from-PDF import #11999
As the issue said I should "traditional" (non-AI) code, so I write a AI-liked code in a simple traditional way. Since I don't have enough data, so the rules is limited now.
I didn't add it PdfMergeMetadataImporterTest now, since I'm not sure if we like this kind of method or not, so I want to set up the PR with a demo code to discuss it first. I'll add it to PdfMergeMetadataImporter with a better clear code if we like it.

Mandatory checks

I own the copyright of the code submitted and I licence it under the MIT license
Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
Tests created for changes (if applicable)
Manually tested changed features in running JabRef (always required)
Screenshots added in PR description (for UI changes)
Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

Add title guess method

fix unit test

github-actions

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

update unit test to JDK 21 style

koppor

Good to have a test case.

But I think, there should be some earlier switch. See the hint stripper.setSortByPosition(true);

Please try again longer how to handle two-column PDFs. -- The title should be parsed corectly by org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent

What I meant is: Do two passes: One with sorted by coordinates and one keeping text blocks together. The abstract is probably better prased with keeping text blocks together; the title better with position.

One can tweak org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent to see if the paper is a single-column or two column paper. One could even go further and detect if it is Springer, IEEE or another format.

koppor · 2024-10-18T09:16:11Z

src/test/java/org/jabref/logic/importer/fileformat/PdfContentImporterTest.java

@@ -123,4 +124,11 @@ British Journal of Nutrition (2008), 99, 1–11 doi: 10.1017/S0007114507795296

        assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContent, "\n"));
    }
+
+    @Test
+    void se2Pdfs() throws Exception {


Suggested change

void se2Pdfs() throws Exception {

void se2Pdf() throws Exception {

src/test/java/org/jabref/logic/importer/fileformat/PdfContentImporterTest.java

update unit test

update get title by area

leaf-soba · 2024-10-21T02:03:30Z

Good to have a test case.

But I think, there should be some earlier switch. See the hint stripper.setSortByPosition(true);

Please try again longer how to handle two-column PDFs. -- The title should be parsed corectly by org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent

What I meant is: Do two passes: One with sorted by coordinates and one keeping text blocks together. The abstract is probably better prased with keeping text blocks together; the title better with position.

One can tweak org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent to see if the paper is a single-column or two column paper. One could even go further and detect if it is Springer, IEEE or another format.

Wow, I thought I wrote something cool and AI-like method, but it looks like it’s not quite what we need! 😅 I’ve now added a method to grab the title by position. This is just a demo code to see if the approach works for us—there’s definitely room for improvement, especially in formatting and commented code. If we’re happy with it, I’ll clean things up and refactor the org.jabref.logic.importer.fileformat.PdfContentImporter#getEntryFromPDFContent input!

github-actions

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

remove StringUtils.isBlank and add @AllowedToUseAwt

github-actions

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

koppor · 2024-10-21T12:46:22Z

Since I don't have enough data, so the rules is limited now.

Please list the files available for others to be able to review.

I think, JabRef should have an example file for the first four templates listed at https://latextemplates.github.io/.

If not, pleae add.

I see following dirctories containing pre-formatted files

And some more at https://github.com/JabRef/jabref/tree/main/src/test/resources/pdfs - where I think, most of them do not follow a publisher format.

I thought that first dowing a parsing regarding position sorted for title and author, then reparsing with onsorted, but with blocks for abstract is the most working approach.

First, the format of the PDF should be guessed. If IEEE, then set parameters for IEEE. If LNCS, set format for LNCS.

Guessing the title by a rectangle works if you know the publisher format. Thus, that guessing should go first.

For 563 TB of test data - outside Europe and US and maybe many other countries, go to https://annas-archive.org/. -- You can also start with 32 TB of test data.

leaf-soba · 2024-10-21T13:01:37Z

Since I don't have enough data, so the rules is limited now.

Please list the files available for others to be able to review.

I think, JabRef should have an example file for the first four templates listed at https://latextemplates.github.io/.

If not, pleae add.

I see following dirctories containing pre-formatted files

https://github.com/JabRef/jabref/tree/main/src/test/resources/org/jabref/logic/importer/util

https://github.com/JabRef/jabref/tree/main/src/test/resources/pdfs/IEEE

And some more at https://github.com/JabRef/jabref/tree/main/src/test/resources/pdfs - where I think, most of them do not follow a publisher format.

I thought that first dowing a parsing regarding position sorted for title and author, then reparsing with onsorted, but with blocks for abstract is the most working approach.

First, the format of the PDF should be guessed. If IEEE, then set parameters for IEEE. If LNCS, set format for LNCS.

Guessing the title by a rectangle works if you know the publisher format. Thus, that guessing should go first.

For 563 TB of test data - outside Europe and US and maybe many other countries, go to https://annas-archive.org/. -- You can also start with 32 TB of test data.

Sorry, so far I have only tested it with the single file mentioned in the issue, se2paper.pdf. I haven’t written academic papers in years, so I’m not very familiar with the IEEE and LNCS formats. I’ll do my best to learn and research them from the test data tomorrow

ToDo:find a minimal pdf for test

change to get title by font size add more unittest

leaf-soba · 2024-10-25T03:19:40Z

I added 4 more unit test case with "IEEE", "LNCS", "scientificThesis", and a "thesis-example" I found in the pdf folder, new code can pass them all so far, if we need more test I can add it.
The new idea is the title should be the largest font size in most of papers which I checked in the test data.

RemoveTestPrefix

… into close-issue-11999

I should change the pdf used in importTwiceWorksAsExpected, or my code need to deal with the paper with same font size in AUTHOR and TITLE?

fix the unit test and open rewrite issue

koppor · 2024-10-25T07:38:21Z

Remove commented code. You have it in your git history if you ever need it.

leaf-soba · 2024-10-25T08:07:33Z

Remove commented code. You have it in your git history if you ever need it.

sorry it is a dirty code now, I want to set up a demo code to check if we like this solution or not, if the solution confirmed I'll refactor it.

remove commented code

koppor · 2024-10-25T11:20:59Z

Please propose a PR to the user documentation at https://docs.jabref.org/collect/findunlinkedfiles#pdfs-for-which-it-works - explaining the larger titles.

koppor · 2024-10-25T11:44:10Z

I am currently very busy, thus quick replies only. Sorry for that.

I added 4 more unit test case with "IEEE", "LNCS", "scientificThesis", and a "thesis-example" I found in the pdf folder,

Sounds good.

The idea using the largest object is smart! For future work, maybe, the second page needs to be checked, too. Sometimes, there is a cover page in front. I currently do not have an example at hand, but I don't have an example at hand - and I think, the heuristics will get harder then.

if we need more test I can add it.

Can you add follwing papers as test cases:

https://link.springer.com/article/10.1007/s10664-023-10367-y
https://link.springer.com/article/10.1007/s10664-020-09875-y --> note that it would be great if the sub title was parsed
https://onlinelibrary.wiley.com/doi/10.1002/spe.3169
https://peerj.com/articles/cs-213/#
https://dl.acm.org/doi/10.1145/3597503.3639130 (ACM format)

When adding, add a README.md to the folder where there are stored with the source of the PDF. I mean, a table with filename as first column and link as second column.

In all these papers, you can also test for the abstract.

In all thesse papers, you can test for the authors (which is probably a brain-teaser and left for future work).

1. revert the temp change of unit test to original one. 2. all of title's character should stay together, add a `isFarAway` method to it to pass the unit test for hello world case. 3. change guess title variable name form old version `titleByPosition` to `titleByFontSize` 4. remove all commented code. 5. Add a `@VisibleForTesting` to `getEntryFromPDFContent`. 6. rewrite the javaDoc for `getEntryFromPDFContent` 7. fix the logic issue for setting title.

leaf-soba · 2024-10-29T02:59:10Z

revert the temp change of unit test to original one.
all of title's character should stay together, add a isFarAway method to it to pass the unit test for hello world case.
change guess title variable name form old version titleByPosition to titleByFontSize
remove all commented code.
Add a @VisibleForTesting to getEntryFromPDFContent.
rewrite the javaDoc for getEntryFromPDFContent
fix the logic issue for setting title.

github-actions

Your code currently does not meet JabRef's code guidelines.
We use Checkstyle to identify issues.
Please carefully follow the setup guide for the codestyle.
Afterwards, please run checkstyle locally and fix the issues.

remove Blank line at start of block

rename and replace unit test file

add bib and readme.md

koppor

I need this functionality, therefore merging.

Please try to adress the comments in a follow-up PR.

koppor · 2024-10-30T00:00:52Z

src/test/resources/org/jabref/logic/importer/pdfContentImporterTest-pdfs.bib

@@ -0,0 +1,110 @@
+@inproceedings{se2paper,


The bib file should be in THE SAME directory as the .pdf files

Set . as library-specific directory (in the library properties)

Have a file attached to each entry in the bib file.

Try to improve BibTeX data in the bib file:

Add DOI

Fix authors

Add Year

Generate citation key

Rename PDF files to nicer name.

koppor · 2024-10-30T00:03:10Z

I added a CHANGELOG.md entry

koppor · 2024-10-30T00:04:17Z

src/test/java/org/jabref/logic/importer/fileformat/PdfContentImporterTest.java

+
+    @ParameterizedTest
+    @MethodSource("providePdfData")
+    void pdfTitleExtraction(String filePath, String expectedTitle) throws Exception {


Follow-up pull request: swap the arguments.

at assert in JUnit, first the expected value appears, then the input data. - This should also be done when wrapping inside @ParameterizedTests

koppor · 2024-10-30T00:06:17Z

@leaf-soba I think, you need to do some cleanup of the filenames before a merge can go through. See test output on Windows test:

 Error: error: invalid path 'src/test/resources/pdfs/PdfContentImporter/peerj-cs-213 - On the impact of service-oriented patterns on software evolvability: a controlled experiment and metric-based analysis.pdf'

Sorry für that.

Maybe, you can adress all comments by me then.

rename the file to pass CI

address all comments

leaf-soba · 2024-10-30T03:15:24Z

Sorry I know it should be in another follow-up PR as you said before we should keep PR small but this one is too big, but I need to change the file name to pass the CI workflow, so it is a little hard to separate them.

Set . as library-specific directory (in the library properties)
This one is only comment I didn't finish, since I don't understand it very clearly.

Do you mean:

Open JabRef, load my BibTeX file.
Go to Library → Library Properties.
In File Directory, enter a .
save my settings

But I don't know how to change it in code or project file as a PR, could you explain it in the follow-up PR.

github-actions

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.

You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

fix the file name in unit test

koppor · 2024-10-30T09:16:36Z

Do you mean:

[...]]

4. save my settings

Yes.

Then save the file. The .bib file is modified then. Then you can normally usie "git" to commit ato a new branch.

Since this is not a show-stopper, I let this PR an and we can do that in a follow-up PR.

leaf-soba added 2 commits October 18, 2024 14:32

Add title guess method

7e86e9e

Add title guess method

fix unit test

2993315

fix unit test

github-actions bot reviewed Oct 18, 2024

View reviewed changes

update unit test to JDK 21 style

f25f43b

update unit test to JDK 21 style

koppor requested changes Oct 18, 2024

View reviewed changes

leaf-soba added 2 commits October 21, 2024 09:18

update unit test

ce3619f

update unit test

update get title by area

9acd52d

update get title by area

github-actions bot reviewed Oct 21, 2024

View reviewed changes

remove StringUtils.isBlank and add @AllowedToUseAwt

f7a3f84

remove StringUtils.isBlank and add @AllowedToUseAwt

github-actions bot reviewed Oct 21, 2024

View reviewed changes

leaf-soba requested a review from koppor October 21, 2024 12:26

add unit test

d88418e

ToDo:find a minimal pdf for test

koppor added the status: changes required Pull requests that are not yet complete label Oct 23, 2024

koppor marked this pull request as draft October 23, 2024 19:10

leaf-soba added 2 commits October 25, 2024 11:16

change to get title by font size

16d94c7

change to get title by font size add more unittest

Merge branch 'main' into close-issue-11999

3cbaec2

leaf-soba marked this pull request as ready for review October 25, 2024 03:19

leaf-soba added 4 commits October 25, 2024 11:28

RemoveTestPrefix

da8472b

RemoveTestPrefix

Merge branch 'close-issue-11999' of https://github.com/leaf-soba/jabref…

f531f4e

… into close-issue-11999

temp fix the unit test

3a7018e

I should change the pdf used in importTwiceWorksAsExpected, or my code need to deal with the paper with same font size in AUTHOR and TITLE?

fix the unit test and open rewrite issue

7edb9b7

fix the unit test and open rewrite issue

remove commented code

b386f85

remove commented code

github-actions bot reviewed Oct 29, 2024

View reviewed changes

leaf-soba added 3 commits October 29, 2024 14:18

remove Blank line at start of block

65965dd

remove Blank line at start of block

rename and replace unit test file

68eb97c

rename and replace unit test file

add bib and readme.md

a530d40

add bib and readme.md

koppor previously approved these changes Oct 30, 2024

View reviewed changes

koppor enabled auto-merge October 30, 2024 00:01

koppor disabled auto-merge October 30, 2024 00:01

koppor added 2 commits October 30, 2024 01:01

Merge branch 'main' into close-issue-11999

e782799

Update CHANGELOG.md

c8e8d5f

koppor dismissed their stale review via c8e8d5f October 30, 2024 00:02

koppor enabled auto-merge October 30, 2024 00:03

koppor reviewed Oct 30, 2024

View reviewed changes

rename the file to pass CI

a056481

rename the file to pass CI

auto-merge was automatically disabled October 30, 2024 03:04
Head branch was pushed to by a user without write access

address all comments

b11717a

address all comments

github-actions bot reviewed Oct 30, 2024

View reviewed changes

fix the file name in unit test

ae22b25

fix the file name in unit test

leaf-soba requested a review from koppor October 30, 2024 03:42

koppor approved these changes Oct 30, 2024

View reviewed changes

koppor added this pull request to the merge queue Oct 30, 2024

koppor removed the status: changes required Pull requests that are not yet complete label Oct 30, 2024

Merged via the queue into JabRef:main with commit 7232836 Oct 30, 2024
23 checks passed

koppor mentioned this pull request Oct 30, 2024

Parsing of "distributed systems online" paper does not work #12135

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a title guess method to get "better" title #12018

Add a title guess method to get "better" title #12018

leaf-soba commented Oct 18, 2024

github-actions bot left a comment

koppor left a comment

koppor Oct 18, 2024

leaf-soba commented Oct 21, 2024

github-actions bot left a comment

github-actions bot left a comment

koppor commented Oct 21, 2024 •

edited

Loading

leaf-soba commented Oct 21, 2024

leaf-soba commented Oct 25, 2024

koppor commented Oct 25, 2024

leaf-soba commented Oct 25, 2024

koppor commented Oct 25, 2024

koppor commented Oct 25, 2024

leaf-soba commented Oct 29, 2024

github-actions bot left a comment

koppor left a comment

koppor Oct 30, 2024

koppor commented Oct 30, 2024

koppor Oct 30, 2024

koppor commented Oct 30, 2024

leaf-soba commented Oct 30, 2024 •

edited

Loading

github-actions bot left a comment

koppor commented Oct 30, 2024

	void se2Pdfs() throws Exception {
	void se2Pdf() throws Exception {

Add a title guess method to get "better" title #12018

Add a title guess method to get "better" title #12018

Conversation

leaf-soba commented Oct 18, 2024

Mandatory checks

github-actions bot left a comment

Choose a reason for hiding this comment

koppor left a comment

Choose a reason for hiding this comment

koppor Oct 18, 2024

Choose a reason for hiding this comment

leaf-soba commented Oct 21, 2024

github-actions bot left a comment

Choose a reason for hiding this comment

github-actions bot left a comment

Choose a reason for hiding this comment

koppor commented Oct 21, 2024 • edited Loading

leaf-soba commented Oct 21, 2024

leaf-soba commented Oct 25, 2024

koppor commented Oct 25, 2024

leaf-soba commented Oct 25, 2024

koppor commented Oct 25, 2024

koppor commented Oct 25, 2024

leaf-soba commented Oct 29, 2024

github-actions bot left a comment

Choose a reason for hiding this comment

koppor left a comment

Choose a reason for hiding this comment

koppor Oct 30, 2024

Choose a reason for hiding this comment

koppor commented Oct 30, 2024

koppor Oct 30, 2024

Choose a reason for hiding this comment

koppor commented Oct 30, 2024

leaf-soba commented Oct 30, 2024 • edited Loading

github-actions bot left a comment

Choose a reason for hiding this comment

koppor commented Oct 30, 2024

koppor commented Oct 21, 2024 •

edited

Loading

leaf-soba commented Oct 30, 2024 •

edited

Loading