Character Encoding Error on UTF-8 Encoded Source File with U+0441 #68
Comments
This looks like the text isn't actually UTF-8 in the file being analyzed. Have you verified that the file being examined actually complies with UTF-8? If it doesn't comply with UTF-8 (seems likely), see the documentation on various options. Sadly, Python3 doesn't provide good tools for handling non-UTF-8 text files. |
Please run "iconv" or some other tool that does byte-by-byte checking. I think the editors just look at a few lines, and they may accept badly formatted data anyway. Python3 is extremely picky and immediately fails any time the text isn't perfect. Workarounds are documented. |
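For anyone without iconv handy, the same byte-by-byte check can be sketched in a few lines of Python. This is only an illustration: the function name `first_invalid_utf8_offset` is made up here and is not part of flawfinder. It reads the raw bytes, so the platform's default text encoding plays no part in the result.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8, or None if all is valid."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first offending byte.
        return exc.start
    return None


def check_file(path):
    """Read a file in binary mode and report where strict UTF-8 decoding fails."""
    with open(path, "rb") as handle:
        return first_invalid_utf8_offset(handle.read())
```

Calling `check_file("cmCTestBuildHandler.cxx")` on a local copy would return `None` for a file that is valid UTF-8 throughout, or the offset of the first bad byte otherwise.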
Also: if the character is just a literal 0x81 byte, that is not valid UTF-8. |
Huh. That doesn't make sense to me at all. The sequence 0xd1 0x81 seems like perfectly fine UTF-8 to me, it shouldn't give you that error message. So we agree it shouldn't happen. But clearly it's happening anyway :-). Can you send me a URL for a mishandled file so I can just use curl/wget to get it? Ideally make the test file as small as possible while still causing the problem. I want to reproduce the problem with the smallest possible failing test. If I can reproduce it, I should be able to fix it, or at least explain it & suggest a workaround. |
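A quick way to confirm the two claims above (0xd1 0x81 is valid UTF-8, a lone 0x81 is not) is to decode the bytes directly. The helper name `describe_utf8` is hypothetical and only serves this sketch.

```python
import unicodedata


def describe_utf8(raw: bytes):
    """Decode bytes as strict UTF-8 and list each character as (code point, name)."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in text]


# The two-byte sequence from the report decodes to a single character.
print(describe_utf8(bytes([0xD1, 0x81])))

# A lone continuation byte is rejected by the strict UTF-8 decoder.
try:
    describe_utf8(bytes([0x81]))
except UnicodeDecodeError as exc:
    print("invalid UTF-8 at offset", exc.start, "-", exc.reason)
```

If the same bytes fail in flawfinder, the decoder in use is probably not UTF-8 at all, which points at the encoding Python picked rather than at the file.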
Oh, weird thought: This is on Windows. Is it possible it's actually being stored as UTF-16? I doubt that's what is going on, but I'm grasping at straws and maybe this is the straw I needed :-). |
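One rough way to check the UTF-16 theory is to look for a byte order mark at the start of the file. The sketch below uses the BOM constants from Python's `codecs` module; the function name `sniff_bom` is invented for illustration. A file without a BOM can still be UTF-16, so a `None` result is a hint, not proof.

```python
import codecs

# UTF-32 is checked first because its little-endian BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading byte order mark, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def sniff_file(path):
    """Inspect only the first four bytes of a file for a byte order mark."""
    with open(path, "rb") as handle:
        return sniff_bom(handle.read(4))
```

In a hex viewer the UTF-16 marks show up as `FF FE` or `FE FF` in the first two bytes.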
Here is the link to that file: https://github.com/Kitware/CMake/blob/master/Source/CTest/cmCTestBuildHandler.cxx I am not sure how to check if it's stored as UTF-16, I mean it's just a plain text file, I don't see any header when viewing it in hex. |
You posted an image showing the file. However, I need the file contents itself. Can you post it somewhere (ideally shortened) & share the URL to it? A small snippet would be best for my purposes. |
Oh, whups, you did provide a link. Thank you. I ran it on macOS and it worked just fine. Below is the output. Ugh, it seems to be a Windows 10-specific thing. I don't have any of those platforms.
|
Yeah, I couldn't repro it using WSL Ubuntu. It looks like this issue is not easy to tackle. I wonder whether detecting the encoding with chardet before opening the file would be an acceptable solution? https://stackoverflow.com/questions/36303919/python-3-0-open-default-encoding https://peps.python.org/pep-0597/ https://chardet.readthedocs.io/en/latest/usage.html#example-using-the-detect-function |
Hmm, it appears that on Windows the default encoding isn't what files use. That seems like a bug in the Windows implementation. I'd prefer flawfinder to NOT always assume UTF-8, because some systems don't use UTF-8. See: https://peps.python.org/pep-0597/ - it seems the "solution" is that people writing code are supposed to magically know what the file encoding is from users. That's ridiculous. I have no magic available. I need users to tell me what the encoding is, and use the default if they don't specify something. |
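The policy described above (use the encoding the user names, otherwise fall back to the platform default) could look roughly like the sketch below. This is not flawfinder's actual code; `resolve_encoding`, `read_source`, and the `user_encoding` parameter are hypothetical names used only for illustration.

```python
import locale


def resolve_encoding(user_encoding=None):
    """Prefer the encoding the user supplied; otherwise use the platform default."""
    if user_encoding:
        return user_encoding
    # The default Python uses for text files when no encoding is given.
    return locale.getpreferredencoding(False)


def read_source(path, user_encoding=None):
    """Read a source file as text using the resolved encoding."""
    with open(path, "r", encoding=resolve_encoding(user_encoding)) as handle:
        return handle.read()
```

With this shape, `read_source(path, "utf-8")` behaves the same on every platform, while `read_source(path)` keeps following whatever the system reports as its default.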
Hmm, it appears you're trying to process UTF-8 files, but the Windows default is NOT UTF-8, and that's the mismatch. Try this: python3 -X utf8 flawfinder.py .... or set
... then run flawfinder. |
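To see the mismatch concretely, the sketch below reports which defaults the running interpreter uses and decodes the reported bytes two ways. The `-X utf8` option and the `PYTHONUTF8=1` environment variable both turn on Python's UTF-8 mode, which `sys.flags.utf8_mode` reflects. Using cp1252 as the stand-in for a Windows default is an assumption on my part; the actual code page depends on the system's locale settings. The helper names are made up for this example.

```python
import locale
import sys


def text_io_defaults():
    """Report whether UTF-8 mode is on and which encoding open() would default to."""
    return {
        "utf8_mode": bool(sys.flags.utf8_mode),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


def try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


raw = bytes([0xD1, 0x81])  # the UTF-8 encoding of U+0441
print(text_io_defaults())
print("utf-8 :", ascii(try_decode(raw, "utf-8")))
# cp1252 is an assumed example of a non-UTF-8 Windows default; it has no mapping for 0x81.
print("cp1252:", ascii(try_decode(raw, "cp1252")))
```

Running this once normally and once with `python3 -X utf8` should show `utf8_mode` and `preferred_encoding` change, which is the same switch flawfinder benefits from.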
That CMake repo has Tests/RunCMake/CommandLine/cmake_depends/test_UTF-16LE.h in UTF-16 encoding, if I force it by Is there a way to exclude certain folders that contain non-product code, i.e., in this case |
There isn't an However: can you tell me if PYTHONUTF8=1 resolves the problem with cmCTestBuildHandler.cxx ? If it does, then we're at least making progress. Flawfinder doesn't have a way of scanning different files with different encodings. Most software developers wouldn't want to do that. If you have to do that, I suggest making a copy, changing all the source files to some consistent encoding, and then analyzing them. |
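The "make a copy in one consistent encoding" suggestion could be sketched like this. It is an illustration under stated assumptions, not a flawfinder feature: `copy_tree_as_utf8` and its `encodings` mapping (relative path to source encoding) are hypothetical, and the caller is assumed to know which files are not UTF-8. The destination should be outside the source tree.

```python
from pathlib import Path


def copy_tree_as_utf8(src_root, dst_root, encodings=None, default="utf-8"):
    """Copy every file under src_root to dst_root, re-encoded as UTF-8.

    encodings maps POSIX-style relative paths to their source encoding;
    files not listed are decoded with the default.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    encodings = encodings or {}
    converted = []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(src_root)
        source_encoding = encodings.get(rel.as_posix(), default)
        # Work on bytes so newline characters are carried over unchanged.
        text = src.read_bytes().decode(source_encoding)
        dst = dst_root / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_bytes(text.encode("utf-8"))
        converted.append(rel.as_posix())
    return converted
```

For the tree discussed here, the mapping might contain a single entry for the UTF-16 test header, after which the copy could be scanned with UTF-8 mode enabled.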
Yes, I guess I can work around it by analyzing only the |
Okay, glad that PYTHONUTF8=1 solves the immediate problem. I think I need to modify flawfinder to note this as an option - so don't close this yet. Take care! |
Encoding Error on Windows 10
No Encoding Error on WSL
File Content