Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Character Encoding Error on UTF-8 Encoded Source File with U+0441 #68

Open
kuchungmsft opened this issue May 19, 2022 · 18 comments
Open

Comments

@kuchungmsft
Copy link

Encoding Error on Windows 10

git clone https://github.com/Kitware/CMake.git

git clone https://github.com/david-a-wheeler/flawfinder.git

python3 flawfinder\flawfinder.py CMake\Source\CTest\cmCTestBuildHandler.cxx

image

image

No Encoding Error on WSL

image

File Content

image

@david-a-wheeler
Copy link
Owner

This looks like the text isn't actually UTF-8 in the file being analyzed. Have you verified that the file being examined actually complies with UTF-8?

If it doesn't comply with UTF-8 (seems likely), see the documentation on various options. Sadly, Python3 doesn't provide good tools for handling non-UTF-8 text files.

@kuchungmsft
Copy link
Author

Notepad++ thinks the file is UTF-8.
image

VS Code thinks the file is UTF-8.
image

Notepad thinks the file is UTF-8.
image

I think the file does comply with UTF-8.

@david-a-wheeler
Copy link
Owner

Please run "iconv" or some other tool that does byte-by-byte checking.

I think the editors just look at a few lines, and they may accept badly formatted data anyway. Python3 is extremely picky and immediately fails any time the text isn't perfect. Workarounds are documented.

@david-a-wheeler
Copy link
Owner

Also: if the character is just a literal 0x81 byte, that is not valid UTF-8.

@kuchungmsft kuchungmsft changed the title Character Encoding Error on UTF-8 Encoded Source File with Russian Character 0x81 Character Encoding Error on UTF-8 Encoded Source File with U+0441 May 19, 2022
@kuchungmsft
Copy link
Author

My bad, it's character U+0441, I have updated the title.
image

Tried "iconv", no difference between original file and the converted one.
image

@david-a-wheeler
Copy link
Owner

Huh. That doesn't make sense to me at all. The sequence 0xd1 0x81 seems like perfectly fine UTF-8 to me, it shouldn't give you that error message.

So we agree it shouldn't happen. But clearly it's happening anyway :-).

Can you send me a URL for a mishandled file so I can just use curl/wget to get it? Ideally make the test file as small as possible while still causing the problem. I want to reproduce the problem with the smallest possible failing test. If I can reproduce it, I should be able to fix it, or at least explain it & suggest a workaround.

@david-a-wheeler
Copy link
Owner

Oh, weird thought: This is on Windows. Is it possible it's actually being stored as UTF-16? I doubt that's what is going on, but I'm grasping at straws and maybe this is the straw I needed :-).

@kuchungmsft
Copy link
Author

Here is the link to that file: https://github.com/Kitware/CMake/blob/master/Source/CTest/cmCTestBuildHandler.cxx

I am not sure how to check if it's stored as UTF-16, I mean it's just a plain text file, I don't see any header when viewing it in hex.
image

@kuchungmsft
Copy link
Author

This character is what you need to reproduce the issue.

image

@david-a-wheeler
Copy link
Owner

You posted an image showing the file. However, I need the file contents itself. Can you post it somewhere (ideally shortened) & share the URL to it? A small snippet would be best for my purposes.

@david-a-wheeler
Copy link
Owner

Oh, whups, you did provide a link. Thank you.

I ran it on MacOS and it worked just fine. Below is the output.

Ugh, it seems to be a Windows 10 specific thing. I don't have any of those platforms.
I want to fix it, but I have to be able to reproduce it. Any ideas?

python3 ./flawfinder.py cmCTestBuildHandler.cxx 
Flawfinder version 2.0.19, (C) 2001-2019 David A. Wheeler.
Number of rules (primarily dangerous function names) in C/C++ ruleset: 222
Examining cmCTestBuildHandler.cxx

FINAL RESULTS:

cmCTestBuildHandler.cxx:6204:  [2] (misc) open:
  Check when opening files - can an attacker redirect it (via symlinks),
  force the opening of special file type (e.g., device files), move things
  around to create a race condition, control its ancestors, or change its
  contents? (CWE-362).

ANALYSIS SUMMARY:

Hits = 1
Lines analyzed = 6240 in approximately 0.28 seconds (22118 lines/second)
Physical Source Lines of Code (SLOC) = 364
Hits@level = [0]   0 [1]   0 [2]   1 [3]   0 [4]   0 [5]   0
Hits@level+ = [0+]   1 [1+]   1 [2+]   1 [3+]   0 [4+]   0 [5+]   0
Hits/KSLOC@level+ = [0+] 2.74725 [1+] 2.74725 [2+] 2.74725 [3+]   0 [4+]   0 [5+]   0
Minimum risk level = 1

Not every hit is necessarily a security vulnerability.
You can inhibit a report by adding a comment in this form:
// flawfinder: ignore
Make *sure* it's a false positive!
You can use the option --neverignore to show these.

There may be other security vulnerabilities; review your code!
See 'Secure Programming HOWTO'
(https://dwheeler.com/secure-programs) for more information.

@kuchungmsft
Copy link
Author

Yeah, I couldn't repro it using WSL Ubuntu. Looks like this issue is not easy to tackle, I wonder if detecting encoding using chardet before opening the file an acceptable solution?

https://stackoverflow.com/questions/36303919/python-3-0-open-default-encoding

https://peps.python.org/pep-0597/

https://chardet.readthedocs.io/en/latest/usage.html#example-using-the-detect-function

@david-a-wheeler
Copy link
Owner

Hmm, it appears that on Windows the default encoding isn't what files use. That seems like a bug in the Windows implementation. I'd prefer flawfinder to NOT always assume UTF-8, because some systems don't use UTF-8. See: https://peps.python.org/pep-0597/ - it seems the "solution" is that people writing code are supposed to magically know what the file encoding is from users. That's rediculous. I have no magic available. I need users to tell me what the encoding is, and use the default if they don't specify something.

@david-a-wheeler
Copy link
Owner

Hmm, it appears you're trying to process UTF-8 files, but the Windows default is NOT UTF-8, and that's the mismatch.

Try this:

python3 -X utf8 flawfinder.py ....

or set PYTHONUTF8 to 1. In a shell do this:

export PYTHONUTF8=1  # linux / macOS
set PYTHONUTF8=1  # windows

.. .then run flawfinder.

@kuchungmsft
Copy link
Author

That CMake repo has Tests/RunCMake/CommandLine/cmake_depends/test_UTF-16LE.h in UTF-16 encoding, if I force it by set PYTHONUTF8=1, I get encoding error on that file.
image

Is there a way to exclude certain folders that contain non-product code, i.e., in this case Tests folder.

@david-a-wheeler
Copy link
Owner

There isn't an --exclude option though that's not a bad idea. However, you can expressly list just the files and/or directories to scan, so just be more explicit about it.

However: can you tell me if PYTHONUTF8=1 resolves the problem with cmCTestBuildHandler.cxx ? If it does, then we're at least making progress.

Flawfinder doesn't have a way of scanning different files with different encodings. Most software developers wouldn't want to do that. If you have to do that, I suggest making a copy, changing all the source files to some consistent encoding, and then analyzing them.

@kuchungmsft
Copy link
Author

Yes, PYTHONUTF8=1 resolves problem with cmCTestBuildHandler.cxx, thanks. The suggestion to make a copy and have a consistent encoding would not work for me because test_UTF-16LE.h is meant to validate that CMake can handle UTF-16, just like compilers can handle inconsistent encoding of source files.

I guess I can workaround it by analyzing only the Source folder instead of entire repo. Thanks a lot for your help.

@david-a-wheeler
Copy link
Owner

Okay, glad that PYTHONUTF8=1 solves the immediate problem.

I think I need to modify flawfinder to note this as an option - so don't close this yet.

Take care!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants