Use periods consistently #79
Comments
This is somewhat related to #54. I think we did not properly document how the journal abbreviation lists are created, combined, etc. We have some initial documentation at https://docs.jabref.org/advanced/journalabbreviations. And we are automatically importing the journal lists from here via https://github.com/JabRef/jabref/blob/master/.github/workflows/refresh-journal-lists.yml. I currently have no time to dive into this topic further. Maybe you can? I know this is hard stuff and will take much time. |
Hi koppor, thanks for looking into this. I am not sure how the lists are created. It says in the workflow you linked to:
But on the other hand, there is a Python script in the merge you linked to that uses those lists. Anyway, one idea for adding the dots would be to add them to every word except the non-abbreviated ones. For that, you could look for exact matches between the abbreviated and non-abbreviated columns and assume that words which are identical in both are non-abbreviated and need no dot. I am not sure this works in all cases, as it does not apply to the example I listed ("Quality Assurance in the Fish Industry" -> "Dev Food Sci"), but to be honest, I am not sure whether that entry is an error anyway. I also don't know whether you would want to modify the lists at all or rather keep them intact and only separate them into "with dots" and "dotless". Let me know what needs to be done and I can better judge whether I would be able to do it. |
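For illustration, here is a minimal sketch of the exact-match idea described above. It assumes that a word of the abbreviation which also appears verbatim (case-insensitively) in the full name is not abbreviated and needs no period. The function name `add_periods` is hypothetical and not part of the repository.

```python
def add_periods(full_name: str, abbreviation: str) -> str:
    """Add a period to every word of `abbreviation` that does not
    appear as-is (case-insensitively) in `full_name`."""
    full_words = {word.lower() for word in full_name.split()}
    result = []
    for word in abbreviation.split():
        if word.endswith(".") or word.lower() in full_words:
            # Already dotted, or an unabbreviated word: keep unchanged
            result.append(word)
        else:
            # Not found in the full name: assume it was abbreviated
            result.append(word + ".")
    return " ".join(result)


print(add_periods("Quality Assurance in Health Care", "Qual Assur Health Care"))
# prints: Qual. Assur. Health Care
print(add_periods("Quality Assurance in the Fish Industry", "Dev Food Sci"))
# prints: Dev. Food. Sci.
```

The second call shows the limitation mentioned in the comment: "Food" is a complete word, but because it does not occur in the full name, the heuristic treats it as abbreviated and appends a period.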
We have no documentation on how the lists are created. One has to check the log of each file at https://github.com/JabRef/abbrv.jabref.org/tree/main/journals. I hope someone refines the README.md to state the source of the lists. One has to note that
|
I second this approach. It seems like it would be easy to automate, and it also appears to cover most cases correctly. |
@Krzmbrzl I think this sounds like a valid idea! You are welcome to provide a script! |
So I gave this a try and came up with

#!/usr/bin/env python3
import argparse
import os
import csv

# A list of CSV filenames that shall be excluded from being processed
blacklist = [
    "journal_abbreviations_annee-philologique.csv"
]


def main():
    parser = argparse.ArgumentParser("This script will make the use of periods in journal name abbreviations consistent (make sure they are used)")
    parser.add_argument("--journal-dir", help="The path to the directory containing the journal CSVs", metavar="PATH", default="journals")
    args = parser.parse_args()

    for currentFileName in os.listdir(args.journal_dir):
        if currentFileName in blacklist or not currentFileName.endswith(".csv"):
            continue

        changedEntries = 0
        changedRows = []
        with open(os.path.join(args.journal_dir, currentFileName), "r", newline="") as currentFile:
            # Assume files are small enough to easily fit in memory
            reader = csv.reader(currentFile, delimiter=";")
            for row in reader:
                if len(row) == 0:
                    # Skip empty lines
                    continue

                # columns are separated by semicolon
                assert len(row) >= 2 and len(row) <= 4, "Invalid column count in CSV file"
                fullName = row[0]
                abbreviation = row[1]
                # shortestUniqueAbbreviation = elements[2]
                # frequency = elements[3]

                specialChars = [",", ":", ";", "(", ")", "[", "]", "{", "}", "\"", "'"]
                fullWords = [x.strip().lower() for x in fullName.split(" ")]
                abbrWords = [x.strip() for x in abbreviation.split(" ")]

                # Replace special chars in full word list
                for currentChar in specialChars:
                    for i in range(len(fullWords)):
                        fullWords[i] = fullWords[i].replace(currentChar, "")

                # Remove empty entries
                fullWords = list(filter(None, fullWords))
                abbrWords = list(filter(None, abbrWords))

                changed = False
                for i in range(len(abbrWords)):
                    if any(char in specialChars for char in abbrWords[i]):
                        # Word contains a special character -> rather leave it alone
                        continue
                    if "-" in abbrWords[i]:
                        # Dashes in words are suspicious as well -> let's rather not touch these
                        continue

                    if abbrWords[i].endswith("."):
                        # Is already using a period
                        if abbrWords[i][:-1].lower() in fullWords:
                            # The word was used as an abbreviation, but it appears that it wasn't really abbreviated -> remove period
                            abbrWords[i] = abbrWords[i][:-1]
                            changed = True
                    else:
                        if abbrWords[i].lower() in fullWords:
                            # Assume that every word that appears in the full journal name as-is, is not
                            # abbreviated and therefore also should not get a period attached to it
                            continue
                        else:
                            # Since the current word is not part of the full journal name, we assume that it
                            # was abbreviated and thus, we add a period to it
                            abbrWords[i] += "."
                            changed = True

                if changed:
                    changedEntries += 1
                    row[1] = " ".join(abbrWords)
                    # print(" ".join(fullWords), "->", " ".join(abbrWords), "(", abbreviation, ")")

                # Keep every row (changed or not) so the rewritten file stays complete
                changedRows.append(row)

        if changedEntries > 0:
            # Write out new content
            with open(os.path.join(args.journal_dir, currentFileName), "w", newline="") as currentFile:
                writer = csv.writer(currentFile, delimiter=";", lineterminator="\n")
                writer.writerows(changedRows)

            print("======== Changed %d entries for %s" % (changedEntries, currentFile.name))


if __name__ == "__main__":
    main()

However, as it turns out, there appear to be too many exceptions that can't be handled properly using this simple approach. From what I have seen so far, I would even go as far as to say that an automated approach is probably not really feasible and any changes have to be performed manually by someone who knows what they are doing 🤷 |
So we need to have separate lists with dots and without dots? Maybe your script can generate a basis... |
Nah, I think there just is no good way of automating this. There exist several journal abbreviations that add e.g. a However, baking all these cases as special cases into the code, would probably result in approximately the same work as fixing this by hand. Finally, there also seem to exist groups of journals that use no punctuation at all. They just combine the starting letters of the full name parts into a word (fictional example: Science Magazine -> SM). |
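As a rough illustration of the last case, the sketch below flags abbreviations that are just the initial letters of the full name, so they could be set aside for manual review instead of receiving periods automatically. The helper name `is_initialism` is hypothetical, and the check assumes every word of the full name contributes one letter. Abbreviations that skip words such as "of" or "in" would not be detected.

```python
import re


def is_initialism(full_name: str, abbreviation: str) -> bool:
    """Return True if `abbreviation` consists only of the initial letters
    of the words in `full_name` (e.g. "Science Magazine" -> "SM")."""
    words = re.findall(r"[A-Za-z]+", full_name)
    initials = "".join(word[0] for word in words)
    # Ignore periods and case when comparing
    return abbreviation.replace(".", "").upper() == initials.upper()


print(is_initialism("Science Magazine", "SM"))
# prints: True
print(is_initialism("Quality Assurance in Health Care", "Qual. Assur. Health Care"))
# prints: False
```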
I think we should use ISO 4 journal abbreviations with periods, because removing periods is much more reliable than adding them, as suggested by another similar program, Zotero. JabRef could perhaps have a built-in feature for removing periods, or this repository could store only abbreviations with periods and output a copy of the abbreviations without periods (in a combine script). |
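A minimal sketch of that second option, deriving a dotless copy from a dotted list, might look as follows. It assumes the semicolon-separated layout used by the script earlier in this thread, with the abbreviation in the second column. The function names `strip_periods` and `dotless_rows` are hypothetical, and simply deleting every period is itself an assumption that may not suit all entries.

```python
import csv
import io


def strip_periods(abbreviation: str) -> str:
    """Remove every period from an abbreviation."""
    return abbreviation.replace(".", "")


def dotless_rows(dotted_csv: str) -> str:
    """Return a copy of the CSV text with periods removed from column 2."""
    reader = csv.reader(io.StringIO(dotted_csv), delimiter=";")
    out = io.StringIO()
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    for row in reader:
        if len(row) >= 2:
            # Only the abbreviation column is touched; the full name stays as-is
            row[1] = strip_periods(row[1])
        writer.writerow(row)
    return out.getvalue()


dotted = "Quality Assurance in Health Care;Qual. Assur. Health Care\n"
print(dotless_rows(dotted), end="")
# prints: Quality Assurance in Health Care;Qual Assur Health Care
```

Going in this direction needs no knowledge of which words were abbreviated, which is why it is more reliable than adding periods.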
Our documentation at https://docs.jabref.org/advanced/journalabbreviations has journal lists with dots. We have documentation on our lists at https://github.com/JabRef/abbrv.jabref.org/tree/main/journals#readme. "Entrez" and "Index Medicus" provide dotless abbreviations only. How to handle these? I think JabRef/jabref#10557 needs to be fixed. Then the combined journal list becomes obsolete, and we can do lists per area (e.g., medicine, computer science, ...)
As it stands, some abbreviations include periods, while others don't. Compare for example "Quality Assurance in Health Care" (Qual. Assur. Health Care) and "Quality Assurance in the Fish Industry" (Dev Food Sci).
Is there any reason for this inconsistency? I think it would be best to either always or never use periods, and possibly leave the choice up to the user.