Akashic101/Paderborner-Volksblatt-1849


cover of Paderborner Volksblatt



This repository is a work in progress. It contains a scan of every page of the "Paderborner Volksblatt" newspaper from 1849, taken from a facsimile printed in 1979 by the Junfermannsche publishing house in Paderborn. It also contains a text-file for each page processed so far (see Progress), with the text extracted using Tesseract OCR. The model used was provided by the Mannheim University Library, which used OCR to digitize newspaper editions of the Reichsanzeiger. A model trained on top of theirs, intended to fix small issues in recognizing certain characters, is currently in the works.

Special thanks to the municipal library of Aachen, which provided the book scanner to start this project, and to Stefan Weil of the Mannheim University Library, who helped me find the right Tesseract model to use. This project could not have been done without them.

Progress

Scanned: 365/365 ✔️
Cropped: 90/365
Sorted: 90/365
OCR: 31/365
Reviewed: 17/365

Notes

Currently missing page: First page of 11.1.1849

Navigate this repository

Scans

If you are looking for the scans in PDF-format, navigate to the input-folder. There you will find a folder for each month, which in turn contains a folder with the raw unedited files and a folder with the cropped scans.

input
└── month
    ├── edited
    │   └── page_*.pdf        # Cropped PDFs with the border around the pages removed
    └── raw
        └── page_*.pdf        # Raw unedited scans with border

OCR

If you are looking for the OCR-results, head to the OCR-folder, where you will find a folder for every month done so far. Each contains a folder with every PDF converted to a PNG and the associated text-file. This text has not yet been reviewed and contains errors. Reviewed files can be found in the done-folder, sorted by date.

OCR
└── month
    ├── done
    │   └── page_*.pdf
    │       └── date
    │           ├── page_*.png        # Converted edited PDFs used for OCR
    │           └── page_*.txt        # Reviewed text-files
    └── png_and_text
        ├── page_*.png                # Converted edited PDFs used for OCR
        └── page_*.txt                # Unedited text-files with mistakes

Scripts

I use several small scripts and commands to aid the process. These can be found in the scripts-folder. The following explains what each file does:

command.sh

The main command to convert images to text. Change path\to\images\*.png to the path of your images. For each PNG, the script writes a txt-file with the same name to the same location as the image. The `-l frak_de` flag selects the language model Tesseract uses to identify characters. This model is not included in Tesseract by default; it was provided by the Mannheim University Library and can be found here (frak2021_1.069 was used for the Paderborner newspaper).
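
As a rough illustration of the step described above, the following is a minimal sketch and not the actual contents of command.sh. It assumes Tesseract is installed and that a model named `frak_de` is available in its tessdata directory. The function name `ocr_all` and the folder argument are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run Tesseract on every PNG in a folder.
# Assumes a model file named "frak_de" exists in Tesseract's tessdata directory.

ocr_all() {
  local dir="$1" png
  for png in "$dir"/*.png; do
    [ -e "$png" ] || continue   # skip when the glob matched nothing
    # Tesseract appends ".txt" to the output base, so strip ".png" first.
    tesseract "$png" "${png%.png}" -l frak_de
  done
}

# Only run when a folder is passed, e.g.: ./ocr_sketch.sh path/to/images
if [ "$#" -gt 0 ]; then
  ocr_all "$1"
fi
```

Each `page_*.png` should then get a `page_*.txt` next to it, matching the layout of the png_and_text folder.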

fix_s.ps1

This script fixes a very common issue of the above model, which writes ſ instead of s. While this is factually correct, since ſ (the long s) is a historical letterform and one of the components from which the German ß developed (more about this letter can be read here), it makes the text-files harder to read, which is why I chose to replace it. The script goes through each text-file and replaces the character. To use it, place it in the folder with the text-files you want to modify and run it.
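
For readers not on PowerShell, the same replacement can be sketched in shell. This is an assumed equivalent, not a port of fix_s.ps1, and the function name `fix_long_s` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: replace every long s (ſ, U+017F) with a plain "s"
# in all .txt files of a folder. Other characters, including ß, are untouched.

fix_long_s() {
  local dir="${1:-.}" txt
  for txt in "$dir"/*.txt; do
    [ -e "$txt" ] || continue   # skip when the glob matched nothing
    # Write to a temporary file first to avoid GNU/BSD differences in "sed -i".
    sed 's/ſ/s/g' "$txt" > "$txt.tmp" && mv "$txt.tmp" "$txt"
  done
}

# Only run when a folder is passed, e.g.: ./fix_s_sketch.sh path/to/text-files
if [ "$#" -gt 0 ]; then
  fix_long_s "$1"
fi
```

Because the files are rewritten in place, it may be worth running this on a copy of the text-files first.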
