Import entry from Web: Fetch as many fields as possible #8414

ThiloteE · 2022-01-10T15:40:56Z

Emerged in #8372

via:

Entry preview not rendering the citation properly #8372 (comment)

The Problem

Major providers of bibliographic data (e.g. Reed-Elsevier, Taylor & Francis, Wiley-Blackwell, Springer, Sage, Crossref) usually offer it in a BibTeX-conformant format. When JabRef users fetch bibliographic metadata from the web, they miss fields that exist in RIS* but not in BibTeX, even though those fields could be fetched for Biblatex-conformant datasets.

* substitute RIS with your standard of choice

How to reproduce

https://journals.plos.org/plosone/article/citation?id=10.1371/journal.pone.0193972

RIS data provides the article-number (e0193972) and the issue (3):

TY  - JOUR
T1  - Teaching medicine with the help of “Dr. House”
A1  - Jerrentrup, Andreas
A1  - Mueller, Tobias
A1  - Glowalla, Ulrich
A1  - Herder, Meike
A1  - Henrichs, Nadine
A1  - Neubauer, Andreas
A1  - Schaefer, Juergen R.
Y1  - 2018/03/13
JF  - PLOS ONE
JA  - PLOS ONE
VL  - 13
IS  - 3
UR  - https://doi.org/10.1371/journal.pone.0193972
SP  - e0193972
EP  - 
PB  - Public Library of Science
M3  - doi:10.1371/journal.pone.0193972
ER  -

Whereas Bibtex only provides number (3):

@article{10.1371/journal.pone.0193972,
    doi = {10.1371/journal.pone.0193972},
    author = {Jerrentrup, Andreas AND Mueller, Tobias AND Glowalla, Ulrich AND Herder, Meike AND Henrichs, Nadine AND Neubauer, Andreas AND Schaefer, Juergen R.},
    journal = {PLOS ONE},
    publisher = {Public Library of Science},
    title = {Teaching medicine with the help of “Dr. House”},
    year = {2018},
    month = {03},
    volume = {13},
    url = {https://doi.org/10.1371/journal.pone.0193972},
    pages = {1-11},
    number = {3},
}

Edit:
Here is the mapping, to avoid confusion about what relates to what:

| Bibtex | Biblatex | RIS | CSL |
|---|---|---|---|
| number | number | IS (Issue number) | issue |
| number | issue | IS | issue |
| pages * | eid | SP (Start Page) * | number |
| pages | pages | SP | page |
* Some providers of bibliographic metadata put the article number (= Biblatex eid) into the Bibtex pages field or the RIS SP field, because article numbers do not exist in these standards. This is probably because, prior to the digital age, there was no need for article numbers: page and issue number were enough to identify an article. Nowadays web pages may lack proper page numbers but still contain multiple articles.

It is important to note that this was just an example.

Desired solution

When JabRef users fetch bibliographic metadata from the web, somehow fetch as many fields for the entry as possible. Take other standards apart from BibTeX/Biblatex into account.

Example A)
Fetch Bibtex/Biblatex data. Also fetch the RIS IS field and move its content to the Bibtex/Biblatex number field if that field is empty.
Example B)
We assume RIS provides more data than BibTeX/Biblatex --> Always fetch RIS data and convert it to BibTeX/Biblatex
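
Example A could be sketched roughly as follows. This is only an illustration, not JabRef code: the class name `RisGapFiller`, its methods, and the small tag-to-field mapping are hypothetical, and entries are modelled as plain string maps rather than JabRef's entry type.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RisGapFiller {
    // RIS lines look like "IS  - 3": a two-character tag, two spaces, a hyphen, then the value.
    private static final Pattern RIS_LINE = Pattern.compile("([A-Z][A-Z0-9])  -\\s?(.*)");

    // Hypothetical tag-to-field mapping, limited to single-valued tags from the example above.
    private static final Map<String, String> RIS_TO_BIB =
            Map.of("IS", "number", "VL", "volume", "PB", "publisher");

    /** Parses RIS text into tag -> value; for repeated tags (e.g. A1) only the first value is kept. */
    static Map<String, String> parseRis(String ris) {
        Map<String, String> tags = new LinkedHashMap<>();
        for (String line : ris.split("\\R")) {
            Matcher m = RIS_LINE.matcher(line);
            if (m.matches()) {
                tags.putIfAbsent(m.group(1), m.group(2).trim());
            }
        }
        return tags;
    }

    /** Returns a copy of the entry in which empty or missing fields are filled from the RIS tags. */
    static Map<String, String> fillEmptyFields(Map<String, String> bibEntry, Map<String, String> risTags) {
        Map<String, String> merged = new LinkedHashMap<>(bibEntry);
        RIS_TO_BIB.forEach((tag, field) -> {
            String risValue = risTags.getOrDefault(tag, "");
            String current = merged.getOrDefault(field, "");
            // Existing BibTeX/Biblatex content always wins; RIS only fills gaps.
            if (current.isBlank() && !risValue.isBlank()) {
                merged.put(field, risValue);
            }
        });
        return merged;
    }
}
```

Because existing field content is never overwritten, this variant is conservative: it cannot fix a value that sits in the wrong field, it only adds what is missing.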

Additional context

JabRef offers both Biblatex and Bibtex Library Modes under library > library properties > library mode
Bibtex is no longer maintained and was last changed in 2010. The package information on CTAN recommends using Biblatex instead (https://ctan.org/pkg/bibtex).
Biblatex, on the other hand, has been maintained regularly up to this day (https://github.com/plk/biblatex/)
Biblatex offers more fine-grained fields, including fields that do not exist in Bibtex.

Conformity with Biblatex

Entry preview not rendering the citation properly #8372 (comment)
As a general reminder about the difference between number and issue, the biblatex documentation (https://ctan.kako-dev.de/macros/latex/contrib/biblatex/doc/biblatex.pdf) on page 22 shows this:
issue field (literal) The issue of a journal. This field is intended for journals whose individual issues are identified by a designation such as ‘Spring’ or ‘Summer’ rather than the month or a number. The placement of issue is similar to month and number. Integer ranges and short designators are better written to the number field. See also month, number and §§ 2.3.10 and 2.3.11.
and on page 23:
number field (literal) The number of a journal or the volume/number of a book in a series. See also issue as well as §§ 2.3.7, 2.3.10, 2.3.11. With @patent entries, this is the number or record token of a patent or patent request. Normally this field will be an integer or an integer range, but it may also be a short designator that is not entirely numeric such as “S1”, “Suppl. 2”, “3es”. In these cases the output should be scrutinised carefully. Since number is—maybe counterintuitively given its name—a literal field, sorting templates will not treat its contents as integers, but as literal strings, which means that “11” may sort between “1” and “2”. If integer sorting is desired, the field can be declared an integer field in a custom data model (see § 4.5.4). But then the sorting of non-integer values is not well defined.
And here in the biblatex documentation p. 40 (https://ctan.kako-dev.de/macros/latex/contrib/biblatex/doc/biblatex.pdf#subsubsection.2.3.10):
2.3.11 Journal Numbers and Issues The words ‘number’ and ‘issue’ are often used synonymously by journals to refer to the subdivision of a volume. The fact that biblatex’s data model has fields of both names can sometimes lead to confusion about which field should be used. First and foremost the word that the journal uses for the subdivision of a volume should be of minor importance, what matters is the role in the data model. As a rule of thumb number is the right field in most circumstances. In the standard styles number modifies volume, whereas issue modifies the date (year) of the entry. Numeric identifiers and short designators that are not necessarily (entirely) numeric such as ‘A’, ‘S1’, ‘C2’, ‘Suppl. 3’, ‘4es’ would go into the number field, because they usually modify the volume. The output of—especially longer—non-numeric input for number should be checked since it could potentially look odd with some styles. The field issue can be used for designations such as ‘Spring’, ‘Winter’ or ‘Michaelmas term’ if that is commonly used to refer to the journal.
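
To illustrate the sorting remark in the number field description quoted above, here is a small sketch of literal versus integer ordering. It uses plain Java string comparison, which is only an approximation of how a bibliography backend sorts; the class name `NumberFieldSorting` is made up for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class NumberFieldSorting {
    /** Sorts the values as literal strings, character by character. */
    static List<String> sortLiteral(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    /** Sorts the values as integers; throws NumberFormatException for values such as "S1". */
    static List<String> sortAsIntegers(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(Comparator.comparingInt(Integer::parseInt));
        return copy;
    }

    public static void main(String[] args) {
        List<String> numbers = List.of("2", "11", "1");
        System.out.println(sortLiteral(numbers));    // prints [1, 11, 2]
        System.out.println(sortAsIntegers(numbers)); // prints [1, 2, 11]
    }
}
```

The exception in `sortAsIntegers` mirrors the caveat in the quoted text: once the field is treated as an integer, non-integer designators no longer have a well-defined position.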

Siedlerchr · 2022-01-10T15:44:25Z

Refs #1018 (comment)

ThiloteE · 2022-01-11T16:52:50Z

JabRef 5.4--2021-12-20--ab44182
Windows 10 10.0 amd64
Java 16.0.2
JavaFX 17.0.1+1

When fetching the entry via JabRef's Import by DOI dialogue, the article number is fetched, but it does not end up in the number field; instead it replaces the page range in the pages field.

@Article{Jerrentrup_2018,
  author       = {Andreas Jerrentrup and Tobias Mueller and Ulrich Glowalla and Meike Herder and Nadine Henrichs and Andreas Neubauer and Juergen R. Schaefer},
  date         = {2018-03},
  journaltitle = {{PLOS} {ONE}},
  title        = {Teaching medicine with the help of {\textquotedblleft}Dr. House{\textquotedblright}},
  doi          = {10.1371/journal.pone.0193972},
  editor       = {Thanh G Phan},
  number       = {3},
  pages        = {e0193972},
  volume       = {13},
  publisher    = {Public Library of Science ({PLoS})},
}

Siedlerchr · 2022-01-11T17:04:15Z

@ThiloteE The DOI importer gets BibTeX back from the DOI. Some publishers provide weird BibTeX data. This is known. We are not living in a perfect world where we have accurate data.

curl --location --request GET 'https://dx.doi.org/10.1371/journal.pone.0193972' \
--header 'Accept: application/x-bibtex'

powershell:
$headers = New-Object "System.Collections.Generic.Dictionary[[String],[String]]"
$headers.Add("Accept", "application/x-bibtex")

$response = Invoke-RestMethod 'https://dx.doi.org/10.1371/journal.pone.0193972' -Method 'GET' -Headers $headers
$response | ConvertTo-Json

results in

@article{Jerrentrup_2018,
	doi = {10.1371/journal.pone.0193972},
	url = {https://doi.org/10.1371%2Fjournal.pone.0193972},
	year = 2018,
	month = {mar},
	publisher = {Public Library of Science ({PLoS})},
	volume = {13},
	number = {3},
	pages = {e0193972},
	author = {Andreas Jerrentrup and Tobias Mueller and Ulrich Glowalla and Meike Herder and Nadine Henrichs and Andreas Neubauer and Juergen R. Schaefer},
	editor = {Thanh G Phan},
	title = {Teaching medicine with the help of {\textquotedblleft}Dr. House{\textquotedblright}},
	journal = {{PLOS} {ONE}}
}
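
For reference, the same content-negotiation request can be issued from Java with the standard `java.net.http` client (Java 11 or later). This is a minimal sketch and not how JabRef's DOI fetcher is implemented; the class name `DoiBibtexRequest` and its methods are invented for the example, and DOIs containing characters that are not valid in a URI would need encoding first.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class DoiBibtexRequest {
    /** Builds a GET request that asks the DOI resolver for BibTeX via the Accept header. */
    static HttpRequest buildRequest(String doi) {
        return HttpRequest.newBuilder(URI.create("https://dx.doi.org/" + doi))
                .header("Accept", "application/x-bibtex")
                .GET()
                .build();
    }

    /** Sends the request and returns the response body; performs a network call. */
    static String fetch(String doi) throws IOException, InterruptedException {
        // Following redirects corresponds to curl's --location flag.
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(doi), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetch("10.1371/journal.pone.0193972"));
    }
}
```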

ThiloteE · 2022-01-11T21:43:53Z

moewew clarified: "Moving the issue number to issue and article number to number would not be my preference, because the issue number is traditionally number and the article number is eid in biblatex"

The long answer: plk/biblatex#726 (comment)

ThiloteE · 2022-01-11T22:17:08Z

Therefore:

Fetching from BibTeX-formatted data, we could write a RegEx that detects non-page-ranges in the pages field and moves them to eid, if eid is empty.
Fetching from RIS, something similar could be done.
- SP denotes "Start Page" according to https://en.wikipedia.org/w/index.php?title=RIS_(file_format)&oldid=1017778965

And of course add options to the cleanup actions to trigger the move back and forth manually.
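
A rough sketch of the RegEx idea, under stated assumptions: the pattern below treats only digits, hyphens, en dashes and commas as page ranges, entries are plain string maps, and the class name `PagesToEid` is hypothetical. It is not an existing JabRef cleanup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class PagesToEid {
    // One page or page range: "12", "12-34", "12--34", or two numbers joined by an en dash.
    private static final String RANGE = "\\d+(?:\\s*(?:--?|\\u2013)\\s*\\d+)?";
    // A comma-separated list of such ranges. Assumption: anything else is not a page range.
    private static final Pattern PAGE_RANGES =
            Pattern.compile(RANGE + "(?:\\s*,\\s*" + RANGE + ")*");

    static boolean looksLikePageRange(String pages) {
        return PAGE_RANGES.matcher(pages.trim()).matches();
    }

    /** Returns a copy of the entry; a non-page-range in "pages" is moved to "eid" if "eid" is empty. */
    static Map<String, String> movePagesToEid(Map<String, String> entry) {
        Map<String, String> result = new LinkedHashMap<>(entry);
        String pages = result.getOrDefault("pages", "").trim();
        String eid = result.getOrDefault("eid", "").trim();
        if (!pages.isEmpty() && eid.isEmpty() && !looksLikePageRange(pages)) {
            result.put("eid", pages);
            result.remove("pages");
        }
        return result;
    }
}
```

With the PLOS ONE entry above, `pages = {e0193972}` would become `eid = {e0193972}`, while `pages = {1-11}` would stay untouched. A known limitation of this simple pattern is that legitimate non-numeric pages such as roman numerals ("xii-xv") or prefixed pages ("S12-S15") would also be moved, so a real cleanup would need a more careful rule or manual confirmation.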

ryan-carpenter · 2022-05-01T07:46:32Z

> Some fields exist that are not present in Bibtex, but that could be fetched for Biblatex

For me, this is a huge problem with bibtex data. I usually conduct searches at the source, and then export results to capture the metadata I need. Usually, the choice is between RIS (or a variant such as Natbib), Bibtex, CSV and sometimes a plain text report (with or without field names).

Most of the formats are impoverished compared to the source data, and bibtex is particularly anemic, so I usually end up with RIS or Natbib. I start with the richest format and transform it using regex to create importable data. For example, the RN, MH, and OT fields in PubMed records all import as keywords in JabRef (or in other reference managers), so I modify all the values in advance by adding designators to differentiate the merged keywords from each other.
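
A pre-processing step of this kind might look like the sketch below. The designator format (the tag name followed by a colon) and the class name `KeywordDesignators` are assumptions made for illustration; the exact transformation described above is not specified in this thread.

```java
import java.util.regex.Pattern;

final class KeywordDesignators {
    // Matches lines such as "MH  - Humans" for the three tags that end up merged into keywords.
    private static final Pattern KEYWORD_LINE =
            Pattern.compile("(?m)^(RN|MH|OT)( *- )(.+)$");

    /** Prefixes each RN/MH/OT value with its tag so the origin survives the import. */
    static String addDesignators(String record) {
        return KEYWORD_LINE.matcher(record).replaceAll("$1$2$1:$3");
    }

    public static void main(String[] args) {
        String record = "MH  - Humans\nOT  - bibliometrics\nTI  - A title";
        System.out.println(addDesignators(record));
    }
}
```

After the import, a keyword such as `MH:Humans` can still be told apart from `OT:bibliometrics`, even though both land in the same keywords field.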

> Somehow fetch the (article-) number, move it to the number field and move the issue-number from the number field into the issue field.

Data in the 'wrong' field is inconvenient, for sure, but not as bad as missing data. Being able to acquire the data easily (even retaining nonstandard names from the source) would be a big improvement.

Also, in my experience with this specific example, providers use these nonstandard 'page numbers' for electronic articles that have (or will have) another value as their unique identifier. Moving the ePage to an identifier field can create a confusing mess when other records from the same source contain a different data type (the 'real' identifier) in the target field. Plus, this is only one of many field-mismatch scenarios. The problem also applies to reversal of full and abbreviated journal names, original versus translated titles, and whether "supplement" resides with volume/issue/number or stands alone.

Siedlerchr · 2022-05-01T11:39:15Z

Actually, after an import from a fetcher, a conversion to biblatex/bibtex is performed automatically. In #8361 this was also implemented for ID fetching (e.g. DOI).

The only thing that is probably still missing is the case where you import/open from a file. This refs #8298

jabref/src/main/java/org/jabref/logic/importer/ImportCleanup.java

Line 10 in 6dfc2e0

public class ImportCleanup {

ThiloteE · 2022-05-01T12:04:23Z

Hey, there is only a little JabRef can do and somebody would need to take an interest and start doing it.

Options would be:

1. JabRef fetches as much data as possible --> e.g. If we ASSUME RIS provides more data, JabRef should prioritize fetching from RIS.
   - positive: Probably more data is fetched
   - negative:
     - JabRef's codebase would need to be changed. Every single fetcher for the web search, for import by DOI and so on would probably need to be touched --> Takes a lot of work
     - Some bibliographic data might be missing (depends on providers of bibliographic data).
2. JabRef is Bib(La)TeX native --> JabRef continues to fetch BibTeX data
   - positive: JabRef maintainers can spend their time implementing other features
   - negative:
     - Some bibliographic data might be missing (depends on providers of bibliographic data).
     - Changing this would take quite an overhaul of the codebase, and that work needs time and people to do it.
3. Providers of Bibliographic data provide more data --> e.g. switching from BibTeX to BibLaTeX standard.
   - positive: BibLaTeX actually knows Article-Numbers. BibTeX does not
   - negative: Providers unfortunately continue to use BibTeX, which is just an outdated standard. JabRef is dependent on these providers. Even if JabRef is ABLE to fetch from BibLaTeX, if providers fail to provide BibLaTeX data, JabRef will have to continue fetching from BibTeX or RIS or some other standard.

Of course, the best would be option 3.

ThiloteE · 2022-05-01T12:08:36Z

What users can do meanwhile:

Manually import from RIS (or from some other standard) into JabRef
Manually add missing bibliographic data by hand

ThiloteE · 2022-05-01T12:10:54Z

> Moving the ePage to an identifier field can create a confusing mess when other records from the same source contain a different data type (the 'real' identifier) in the target field.

Can you provide an example? I fail to understand. This sentence is too complicated for me xD

ryan-carpenter · 2022-05-02T17:16:56Z

> Hey, there is only a little JabRef can do and somebody would need to take an interest and start doing it.

This is definitely a systemic problem, not a JabRef issue per se.

> JabRef fetches as much data as possible --> e.g. If we ASSUME RIS provides more data, JabRef should prioritize fetching from RIS.

Sounds like a lot of work for little gain. Compared to BibTeX, RIS does have the advantage of more data fields, but RIS records have limitations of their own. For instance, RIS records often use author initials when BibTeX records from the same source often include full author names. BibTeX and BibLaTeX are also far more consistent than the RIS pseudostandard.

> JabRef is Bib(La)TeX native --> JabRef continues to fetch BibTeX data

Easy conversion as supported already is a very reasonable status quo.

> Providers of Bibliographic data provide more data --> e.g. switching from BibTeX to BibLaTeX standard. …
> Of course, the best would be option 3.

If only.

ryan-carpenter · 2022-05-02T19:43:37Z

> > Moving the ePage to an identifier field can create a confusing mess when other records from the same source contain a different data type (the 'real' identifier) in the target field.
>
> Can you provide an example? I fail to understand. This sentence is too complicated for me xD

Not too complicated; just nonsensical. I failed to notice that the example in the original post was about identifiable data that belonged in another field. This is obviously a great reason to move the data.

ThiloteE · 2023-07-03T04:23:10Z

Another idea:

Fetch both RIS and BibTeX, then convert RIS to BibTeX. Let the duplicate detection compare the two and in case they differ, let the user decide which fields and field content to keep.
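
The comparison step of this idea could be sketched as follows, assuming both results have already been converted to the same field names. The class `EntryFieldDiff` and the map-based entry model are hypothetical and unrelated to JabRef's actual duplicate detection or merge dialog.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class EntryFieldDiff {
    /**
     * Returns, per field name, the pair [value from BibTeX, value from RIS]
     * for every field whose values differ; a missing field counts as "".
     */
    static Map<String, List<String>> conflicts(Map<String, String> fromBibtex,
                                               Map<String, String> fromRis) {
        Set<String> fields = new TreeSet<>(fromBibtex.keySet());
        fields.addAll(fromRis.keySet());

        Map<String, List<String>> differing = new TreeMap<>();
        for (String field : fields) {
            String bibtexValue = fromBibtex.getOrDefault(field, "").trim();
            String risValue = fromRis.getOrDefault(field, "").trim();
            if (!bibtexValue.equals(risValue)) {
                differing.put(field, List.of(bibtexValue, risValue));
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        Map<String, String> bibtex = Map.of("volume", "13", "number", "3", "pages", "1-11");
        Map<String, String> ris = Map.of("volume", "13", "number", "3", "pages", "e0193972");
        System.out.println(conflicts(bibtex, ris)); // prints {pages=[1-11, e0193972]}
    }
}
```

Only the differing fields would then need to be shown to the user, who decides which value to keep.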

koppor · 2024-12-13T10:49:43Z

The documentation is available at https://github.com/JabRef/jabref/blob/main/docs/code-howtos/fetchers.md. As a first step, the linked comment (#8414 (comment)) has to be put into the documentation.

ThiloteE mentioned this issue Jan 11, 2022

number vs issue plk/biblatex#726

Closed

ThiloteE added bib(la)tex cleanup-ops fetcher import labels Jan 13, 2022

ThiloteE changed the title ~~Fetchers for Biblatex library mode. E.g. (article-) number field~~ Import entry from Web: Fetch as many fields as possible Sep 10, 2022

ThiloteE added the status: stale label Jul 3, 2023

ThiloteE removed the status: stale label Jul 3, 2023

ThiloteE added this to Features & Enhancements Jul 3, 2023

github-project-automation bot moved this to Normal priority in Features & Enhancements Jul 3, 2023

ThiloteE moved this from Normal priority to Low priority in Features & Enhancements Jul 3, 2023

calixtus added this to Prioritization Nov 13, 2024

github-project-automation bot moved this to Normal priority in Prioritization Nov 13, 2024

calixtus moved this from Normal priority to Low priority in Prioritization Nov 13, 2024

calixtus removed this from Features & Enhancements Nov 13, 2024
