ERROR: CSV parse error #14

Open
bernd-wechner opened this issue Mar 18, 2021 · 3 comments · May be fixed by #19

Comments

@bernd-wechner

bernd-wechner commented Mar 18, 2021

I'm trying to diff two CSV files and csv-diff just responds with:

ERROR: CSV parse error on line 2

So I do the same thing using it as a Python package (that is, I write a Python script that loads my two files and runs csv-diff on them as per the README) and I get a different error:

KeyError: 'my_key'

I double-checked the key and it is there, as column 1 in the files, which load fine in LibreOffice Calc and in Excel and look fine in a text editor.

So I look at the file encoding and Python's magic library tells me:

'UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators'

so if I open the file with an encoding of "utf-8-sig" all works fine.
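A likely explanation for the KeyError, sketched with made-up sample data (only the column name 'my_key' comes from the error above): when a BOM-prefixed file is read with plain "utf-8", the BOM stays attached to the first column name, so the key lookup fails even though the column looks correct in an editor. If the file is opened with no encoding at all, the platform default applies and the stray characters may look different, but the effect on the first header would be the same.

```python
import csv
import os
import tempfile

# Hypothetical sample data; only the column name 'my_key' is taken from the report.
data = "my_key,name\r\n1,Alice\r\n"

# Encoding with "utf-8-sig" prepends the UTF-8 BOM bytes (EF BB BF).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(data.encode("utf-8-sig"))
    path = tmp.name


def header(path, encoding):
    """Return the column names csv.DictReader sees for the given encoding."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return csv.DictReader(f).fieldnames


plain = header(path, "utf-8")    # BOM survives as '\ufeff' on the first name
sig = header(path, "utf-8-sig")  # BOM is stripped while decoding
os.remove(path)

print(plain)  # ['\ufeffmy_key', 'name']
print(sig)    # ['my_key', 'name']
```

With the plain "utf-8" header, a lookup of 'my_key' fails because the actual name is '\ufeffmy_key'.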

Seems to me to be a file encoding issue, and one I have encountered in Python a lot, so I wrote this:

import magic  # the "magic" library (python-magic)


def file_encoding(filepath):
    '''
    Text encoding is a bit of a schmozzle in Python. Alas.

    A quick summary:

    1. I come across CSV files with a UTF-8 or UTF-16 encoding regularly enough.
    2. Python wants to know the encoding when we open the file
    3. UTF-16 is fine, but UTF-8 comes in two flavours, with and without a BOM
    4. The BOM (byte order mark) is optional in UTF-8 and serves no purpose there
    5. In fact the Unicode standard neither requires nor recommends a BOM for UTF-8
        https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
    6. Python's "utf-8" codec assumes it's not there
    7. Some CSV sources though write with a BOM
    8. The encoding must therefore be specified as:
        utf-16     for UTF-16 files
        utf-8      for UTF-8 files with no BOM
        utf-8-sig  for UTF-8 files with a BOM
    9. The "magic" library reliably determines the encoding efficiently by looking
       at the magic numbers at the start of a file
    10. Alas it returns a rich string describing the encoding.
    11. It contains either UTF-16 or UTF-8
    12. It contains "(with BOM)" if a BOM is detected
    13. Because of this schmozzle a quick function to translate "magic" output
        to standard encoding names is here.

    :param filepath: The path to a file
    :return: An encoding name for open(), or None if neither UTF-16 nor UTF-8 is reported
    '''
    m = magic.from_file(filepath)
    utf16 = "UTF-16" in m
    utf8 = "UTF-8" in m
    bom = "(with BOM)" in m

    if utf16:
        return "utf-16"
    elif utf8:
        return "utf-8-sig" if bom else "utf-8"
    # Neither reported: return None so open() falls back to its default encoding
    return None

and then if I run:

from csv_diff import load_csv, compare

# File1, File2 and key are set earlier in my script
with open(File1, "r", encoding=file_encoding(File1), newline='') as f1:
    csv1 = load_csv(f1, key=key)

with open(File2, "r", encoding=file_encoding(File2), newline='') as f2:
    csv2 = load_csv(f2, key=key)

diff = compare(csv1, csv2)

all is good and I get a reliable diff.

I can't work out how to debug the CLI interface in PyDev, alas. I'm a tad green in this space, it seems. Running setup.py build just creates a build folder with a lib folder that has __init__.py and cli.py in it. Yet my Windows box (man, I hate Windows, but I'm stuck there right now) runs a csvdiff.exe, which was presumably installed by pip when I installed csv-diff (pip install csv-diff). But I can't see how to run the CLI interface from the source. I guess I could do some reading on click and setuptools, but hey, for the moment I have it working via its Python package interface and can run with that.

If the CLI error is in fact related to this encoding issue (hard to know for sure), then it could of course be fixed by including an encoding check as above and opening the files with their appropriate encoding. Frankly, it'd be nice if Python's open() could better guess the encoding (the way magic can).
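For anyone who would rather avoid the extra dependency, the same idea can be sketched with the standard library alone by checking the first bytes of the file for a BOM. The function name sniff_bom_encoding and the "utf-8" default are assumptions for illustration, not part of csv-diff or magic. Unlike magic, this sketch only recognises files that actually start with a BOM and treats everything else as the default.

```python
import codecs
import os
import tempfile

# BOM byte sequences mapped to the codec that consumes them on read.
# UTF-32 comes first because the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom_encoding(filepath, default="utf-8"):
    """Guess an encoding name for open() from a leading BOM, if there is one."""
    with open(filepath, "rb") as f:
        head = f.read(4)  # the longest BOM checked here is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default  # no BOM found: assumption, not detection


# Small demonstration with a hypothetical BOM-prefixed CSV file.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sample.csv")
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write("my_key,name\r\n1,Alice\r\n")

    detected = sniff_bom_encoding(path)
    with open(path, "r", encoding=detected, newline="") as f:
        first_line = f.readline()

print(detected)          # utf-8-sig
print(repr(first_line))  # 'my_key,name\r\n'
```

The returned name can be passed as the encoding argument to open() in the same way as the result of file_encoding() above.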

@patric-r

Having this feature would be awesome.

@mikecoop83

Having this feature would be awesome.

If you get a chance, could you try out my PR to see if it solves your problem?

@rene-schwabe

Any chance the PR from @mikecoop83 gets merged?
