Text extraction issue with extract_text_to_fp - Uncleaned CID characters #1056

BaillySylvain · 2024-10-25T07:39:21Z

Hello,

While using the extract_text_to_fp function with the latest version of pdfminer.six, I've encountered an issue where CID markers (e.g., (cid:123)) appear in the extracted text. These markers seem to be associated with the fonts used in the PDF, but there is no straightforward way to clean them or make them readable.

This makes it difficult to obtain clean, readable text from the PDF, as these CID markers are not converted into standard Unicode characters or intelligible text. I would like to know if there is a recommended way to handle this issue with pdfminer.six, or if the library could be improved to manage such cases more effectively.

Additional Information:

Version of pdfminer.six used: 20240706
Operating system: Linux

Example code used for extraction:

from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

extract_text_to_fp(srcFile, xmlFile, output_type='xml', codec='utf-8', laparams=LAParams(
    detect_vertical=False,
    word_margin=0.2,
    char_margin=3,
    boxes_flow=1,
    all_texts=True,
))

Thank you in advance for your answer and your time.
page10.pdf


Some1Somewhere · 2024-10-29T17:09:15Z

This has been discussed in euske/pdfminer#122 and on Stack Overflow: https://stackoverflow.com/questions/74416930/how-to-solve-cidx-pdfplumber-python-text-extraction

This is the code I've used to replace the (cid:x) markers. It might not work for all use cases, but it works for mine!

import re


def prune_text(text):
    """
    Replace (cid:x) patterns in the text with corresponding characters.

    Args:
        text (str): The input text containing (cid:x) patterns.

    Returns:
        str: The processed text with (cid:x) replaced.
    """

    def replace_cid(match):
        cid_num = int(match.group(1))
        # Define specific CID to character mappings
        cid_mapping = {
            0: "- ",  # Example: (cid:0) rendered as a list dash
            # Add more mappings as needed
            # e.g., 66: 'B', etc.
        }
        try:
            # Return the mapped string; otherwise fall back to chr(cid_num).
            # The fallback is only a guess, because a CID is not necessarily
            # a Unicode code point.
            return cid_mapping.get(cid_num, chr(cid_num))
        except (ValueError, OverflowError):
            # cid_num is outside the valid Unicode range: drop the marker.
            return ""

    # Regular expression to find all (cid:x) patterns
    cid_pattern = re.compile(r"\(cid:(\d+)\)")
    pruned_text = cid_pattern.sub(replace_cid, text)
    return pruned_text
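
Before filling in a CID-to-character mapping by hand, it can help to see which CIDs actually occur in the extracted text and how often. The following is a minimal sketch, assuming the text is already available as a string; `count_cids` and `CID_PATTERN` are hypothetical names chosen for illustration, not part of pdfminer.six.

```python
import re
from collections import Counter

# Matches the "(cid:N)" markers left in the text for unmapped glyphs.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def count_cids(text):
    """Return a Counter mapping each CID number to its number of occurrences."""
    return Counter(int(m.group(1)) for m in CID_PATTERN.finditer(text))


# Made-up sample text; in practice this would be the extracted page text.
sample = "Total(cid:0)42 (cid:31)items(cid:0)"
for cid, n in count_cids(sample).most_common():
    print(cid, n)
```

The most frequent CIDs are usually the ones worth mapping first, by comparing the extracted text with the rendered page.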

dhdaines · 2024-11-27T16:51:39Z

Unmapped glyphs are handled by a method on PDFLayoutAnalyzer, handle_undefined_char, which can be overridden in a subclass or patched at runtime. For instance, you can do this in your code:

from pdfminer.converter import PDFLayoutAnalyzer
PDFLayoutAnalyzer.handle_undefined_char = lambda *args: ""
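
A variation on the same patch, sketched here under assumptions: instead of silently dropping unmapped glyphs, the replacement records which CIDs were undefined and emits U+FFFD so the gaps stay visible. The names `record_undefined_char` and `undefined_cids` are hypothetical, and the `(self, font, cid)` signature is assumed to match `handle_undefined_char` in the installed pdfminer.six version. The import is guarded only so the snippet still loads where pdfminer.six is not installed.

```python
from collections import Counter

# Tally of CIDs that had no Unicode mapping (hypothetical helper state).
undefined_cids = Counter()


def record_undefined_char(self, font, cid):
    """Record the unmapped CID and return a visible placeholder character."""
    undefined_cids[cid] += 1
    return "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


try:
    from pdfminer.converter import PDFLayoutAnalyzer

    # Patch at class level so converters derived from it pick up the handler.
    PDFLayoutAnalyzer.handle_undefined_char = record_undefined_char
except ImportError:
    # pdfminer.six is not installed; the handler can still be tested alone.
    pass
```

After running an extraction, `undefined_cids.most_common()` lists the CIDs that had no mapping, which can guide a manual lookup table.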

dhdaines linked a pull request Nov 27, 2024 that will close this issue

Allow suppression of (cid:N) in pdf2txt #1070
