HSBC Credit Card Statement PDF Parser

Libraries

  • pdf2image
    • PDF-to-image converter
    • Requires poppler to be installed (see the pdf2image README for instructions)
  • pytesseract
    • Python wrapper for Tesseract OCR

Process

  1. Convert the PDF file to images using pdf2image
  2. Adjust the images to improve the OCR result
  3. Run Tesseract OCR on the images
  4. Tokenize the OCR result
  5. Export a CSV file
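
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: the function names (`ocr_pdf`, `tokenize`, `export_csv`), the threshold value, the DPI, and above all the line pattern in `LINE_RE` are assumptions made for the example. The real statement layout may differ, so the pattern would need adjusting against actual OCR output.

```python
import csv
import io
import re

# Assumed (hypothetical) transaction layout: "12 Jan COFFEE SHOP HK 1,234.56".
# This is NOT taken from a real statement; adapt it to the actual OCR text.
LINE_RE = re.compile(
    r"^(?P<date>\d{1,2} [A-Za-z]{3})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>-?[\d,]+\.\d{2})$"
)


def ocr_pdf(pdf_path, dpi=300):
    """Steps 1-3: render each page, clean it up, and run Tesseract on it."""
    # Imported lazily so the parsing helpers below work without the OCR stack.
    from pdf2image import convert_from_path  # needs poppler installed
    import pytesseract  # needs the tesseract binary installed

    texts = []
    for page in convert_from_path(pdf_path, dpi=dpi):
        gray = page.convert("L")  # grayscale
        # Simple fixed threshold; 160 is an arbitrary illustrative value.
        binary = gray.point(lambda p: 255 if p > 160 else 0)
        texts.append(pytesseract.image_to_string(binary))
    return "\n".join(texts)


def tokenize(ocr_text):
    """Step 4: keep only lines that look like transactions."""
    rows = []
    for line in ocr_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            amount = match["amount"].replace(",", "")  # drop thousands separators
            rows.append((match["date"], match["description"], amount))
    return rows


def export_csv(rows, out):
    """Step 5: write the rows to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(["date", "description", "amount"])
    writer.writerows(rows)
```

A possible usage would be `rows = tokenize(ocr_pdf("statement.pdf"))` followed by `export_csv(rows, f)` with `f` opened via `open(path, "w", newline="")`. Lines that do not match the pattern (headers, totals, OCR noise) are silently skipped, which keeps the sketch short but may hide misread transactions.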

How to run

> python main.py [-o output_file_path] pdf_file_path
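
For reference, a command line of this shape could be parsed with `argparse` along the following lines. This is only a sketch: the helper name `build_parser` and the help texts are made up for illustration, and since the behavior when `-o` is omitted is not documented here, the example simply leaves the value as `None`.

```python
import argparse


def build_parser():
    """Build a parser for: python main.py [-o output_file_path] pdf_file_path"""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Parse a credit card statement PDF and export a CSV file.",
    )
    # Optional output path; the default of None is an assumption for this sketch.
    parser.add_argument(
        "-o",
        dest="output_file_path",
        default=None,
        help="path of the CSV file to write",
    )
    # Required positional argument: the statement PDF to parse.
    parser.add_argument("pdf_file_path", help="path of the statement PDF")
    return parser
```

Calling `build_parser().parse_args()` in main.py would then expose `args.pdf_file_path` and `args.output_file_path`.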