Use periods consistently #79

Open
junoslukan opened this issue Feb 23, 2021 · 10 comments

Comments

@junoslukan

As it stands, some abbreviations include periods, while others don't. Compare for example "Quality Assurance in Health Care" (Qual. Assur. Health Care) and "Quality Assurance in the Fish Industry" (Dev Food Sci).

Is there any reason for this inconsistency? I think it would be best to either always or never use the period, and possibly to leave the choice up to the user.

@koppor
Member

koppor commented Feb 25, 2021

This somehow relates to #54.

I think we did not properly document how the journal abbreviation lists are created, combined, etc.

We have some initial documentation at https://docs.jabref.org/advanced/journalabbreviations. And we are automatically importing the journal lists from here via https://github.com/JabRef/jabref/blob/master/.github/workflows/refresh-journal-lists.yml.

I currently have no time to dive into this topic further. Maybe you can? I know this is hard stuff and will take a lot of time.

@junoslukan
Author

Hi koppor, thanks for looking into this.

I am not sure how the lists are created. It says in the workflow you linked to:

      # remove all lists without dot in them
      # we use abbrevatiation lists containing dots in them only (to be consistent)

But on the other hand there is a Python script in the merge you linked to that uses those lists.

Anyway, one idea for adding the dots would be to not add them to non-abbreviated words only. For that, you could look for exact matches in the abbreviated and non-abbreviated columns and assume that the ones which are the same are non-abbreviated words which require no dot.

I am not sure if that makes sense for all cases, as it does not apply for the example I listed ("Quality Assurance in the Fish Industry" -> "Dev Food Sci"), but to be honest, I am not sure if this is not an error anyway.

I also don't know if you would want to go ahead with modifying the lists at all or rather keep them intact and separate them into "with dots" and "dotless" only.

Let me know what needs to be done and I can judge better if I would be able to do that.

@koppor
Member

koppor commented Sep 17, 2021

We have no documentation on how the lists are created. One has to check the log of each file at https://github.com/JabRef/abbrv.jabref.org/tree/main/journals. I hope someone refines the README.md to state the source of each list.

One has to note that

@Krzmbrzl

Anyway, one idea for adding the dots would be to not add them to non-abbreviated words only. For that, you could look for exact matches in the abbreviated and non-abbreviated columns and assume that the ones which are the same are non-abbreviated words which require no dot.

I second this approach. It seems easy to automate, and it appears to cover most cases correctly.
Are there objections to adding a script to this repo that applies this rule to all CSV files and is then also run as a PR check to ensure that punctuation stays consistent?

@Siedlerchr
Member

@Krzmbrzl I think this sounds like a valid idea! You are welcome to provide a script!

@Krzmbrzl

So I gave this a try and came up with

#!/usr/bin/env python3


import argparse
import os
import csv

# A list of CSV filenames that shall be excluded from being processed
blacklist = [
        "journal_abbreviations_annee-philologique.csv"
]

def main():
    parser = argparse.ArgumentParser("This script will make the use of periods in journal name abbreviations consistent (make sure they are used)")
    parser.add_argument("--journal-dir", help="The path to the directory containing the journal CSVs", metavar="PATH", default="journals")

    args = parser.parse_args()

    for currentFileName in os.listdir(args.journal_dir):
        if currentFileName in blacklist or not currentFileName.endswith(".csv"):
            continue

        changedEntries = 0
        changedRows = []

        with open(os.path.join(args.journal_dir, currentFileName), "r", newline="") as currentFile:
            # Assume files are small enough to easily fit in memory
            reader = csv.reader(currentFile, delimiter=";")

            for row in reader:
                if len(row) == 0:
                    # Skip empty lines
                    continue

                # columns are separated by semicolon

                assert 2 <= len(row) <= 4, "Invalid column count in CSV file"

                fullName = row[0]
                abbreviation = row[1]
                # shortestUniqueAbbreviation = row[2]
                # frequency = row[3]

                specialChars = [",", ":", ";", "(", ")", "[", "]", "{", "}", "\"", "'"]

                fullWords = [x.strip().lower() for x in fullName.split(" ")]
                abbrWords = [x.strip() for x in abbreviation.split(" ")]

                # Replace special chars in full word list
                for currentChar in specialChars:
                    for i in range(len(fullWords)):
                        fullWords[i] = fullWords[i].replace(currentChar, "")

                # Remove empty entries
                fullWords = list(filter(None, fullWords))
                abbrWords = list(filter(None, abbrWords))

                changed = False

                for i in range(len(abbrWords)):
                    if any(char in specialChars for char in abbrWords[i]):
                        # Word contains a special character -> rather leave it alone
                        continue
                    if "-" in abbrWords[i]:
                        # Dashes in words are suspicious as well -> let's rather not touch these
                        continue

                    if abbrWords[i].endswith("."):
                        # Is already using a period
                        if abbrWords[i][:-1].lower() in fullWords:
                            # The word was used as an abbreviation, but it appears that it wasn't really abbreviated -> remove period
                            abbrWords[i] = abbrWords[i][:-1]
                            changed = True
                    else:
                        if abbrWords[i].lower() in fullWords:
                            # Assume that every word that appears in the full journal name as-is, is not
                            # abbreviated and therefore also should not get a period attached to it
                            continue
                        else:
                            # Since the current word is not part of the full journal name, we assume that it
                            # was abbreviated and thus, we add a period to it
                            abbrWords[i] += "."
                            changed = True

                if changed:
                    changedEntries += 1
                    row[1] = " ".join(abbrWords)
                    # print(" ".join(fullWords), "->", " ".join(abbrWords), "(", abbreviation, ")")

                changedRows.append(row)

        if changedEntries > 0:
            # Write out new content
            with open(os.path.join(args.journal_dir, currentFileName), "w", newline="") as currentFile:
                writer = csv.writer(currentFile, delimiter=";", lineterminator="\n")
                writer.writerows(changedRows)

        print("======== Changed %d entries for %s" % (changedEntries, currentFile.name))




if __name__ == "__main__":
    main()

However, as it turns out, there appear to be too many exceptions that can't be handled properly with this simple approach. From what I have seen so far, I would even go as far as to say that an automated approach is probably not feasible and that any changes have to be made manually by someone who knows what they are doing 🤷

@Siedlerchr
Member

So we need to have separate lists with dots and without dots? Maybe your script can generate a basis...

@Krzmbrzl

Nah, I think there just is no good way of automating this. There exist several journal abbreviations that add e.g. a Sect. A to the abbreviation, even though the full name does not contain any kind of reference to a section A. Therefore, it is not possible to deduce whether the "A" is an abbreviation for something or whether it is meant literally.
The same issue occurs where journal abbreviations use city names that don't appear in the full name.

However, baking all of these in as special cases in the code would probably amount to about the same work as fixing this by hand.

Finally, there also seem to exist groups of journals that use no punctuation at all. They just combine the starting letters of the full name parts into a word (fictional example: Science Magazine -> SM).
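
To make these failure cases concrete, here is a rough sketch of a check that flags abbreviation words the simple matching rule cannot classify. The function name `words_needing_review` and both heuristics (all-caps initialisms, and words whose first letter matches no word of the full name) are assumptions for illustration only. "Journal of Examples" is a made-up title, just like the fictional "Science Magazine" above:

```python
def words_needing_review(full_name: str, abbreviation: str) -> list:
    """Return the abbreviation words that a human should look at."""
    full_words = [w.strip(",:;()").lower() for w in full_name.split()]
    full_words = [w for w in full_words if w]
    initials = {w[0] for w in full_words}

    flagged = []
    for word in abbreviation.split():
        bare = word.rstrip(".")
        if not bare or bare.lower() in full_words:
            # Unabbreviated word (or stray period): nothing to decide
            continue
        if len(bare) > 1 and bare.isupper():
            # Looks like an initialism built from starting letters, e.g. "SM"
            flagged.append(word)
        elif bare[0].lower() not in initials:
            # No word of the full name starts with this letter, so the word
            # was probably added (section label, city name, ...)
            flagged.append(word)
    return flagged


print(words_needing_review("Science Magazine", "SM"))
print(words_needing_review("Journal of Examples", "J. Examples Sect. A"))
```

The first call flags "SM" and the second flags "Sect." and "A", while "J." passes because "Journal" starts with the same letter. A check like this only narrows down what has to be reviewed by hand. It does not decide whether a period belongs there.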

@northword
Contributor

I think we should use ISO 4 journal abbreviations with periods, because removing periods is much more reliable than adding them, as suggested by a similar program, Zotero (see footnote 1).

JabRef could perhaps have a built-in feature for removing periods, or this repository could store only abbreviations with periods and output a copy of the abbreviations without periods (in a combine script).

Footnotes

  1. https://www.zotero.org/support/adding_items_to_zotero#journal_abbreviations
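
The "store with periods, derive the dotless copy" idea could be sketched as follows. This assumes the semicolon-separated layout used by the script posted earlier in this thread (full name in the first column, abbreviation in the second). The names `without_periods` and `dotless_copy` are hypothetical, and only trailing periods of each word are removed:

```python
import csv
import io


def without_periods(abbreviation: str) -> str:
    """Drop the trailing period(s) from every word of an abbreviation."""
    words = (word.rstrip(".") for word in abbreviation.split())
    return " ".join(word for word in words if word)


def dotless_copy(csv_text: str) -> str:
    """Return the CSV content with periods removed from the abbreviation column."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=";")
    output = io.StringIO()
    writer = csv.writer(output, delimiter=";", lineterminator="\n")
    for row in reader:
        if len(row) >= 2:
            # Column 1 holds the abbreviation; other columns are left untouched
            row[1] = without_periods(row[1])
        writer.writerow(row)
    return output.getvalue()


sample = "Quality Assurance in Health Care;Qual. Assur. Health Care\n"
print(dotless_copy(sample), end="")
```

Unlike adding periods, this direction needs no knowledge of the full journal name. Periods inside a word are kept in this sketch, and whether they should also be dropped is a separate decision.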

@koppor
Member

koppor commented Oct 1, 2024

Our documentation at https://docs.jabref.org/advanced/journalabbreviations has journal lists with dots. We have documentation on our lists at https://github.com/JabRef/abbrv.jabref.org/tree/main/journals#readme.

"Entrez" and "Index Medicus" provide dotless abbreviations only. How should we handle these?

I think JabRef/jabref#10557 needs to be fixed first. Then the combined journal list becomes obsolete, and we can provide lists per area (e.g., medicine, computer science, ...).

5 participants