Character Encoding Error on UTF-8 Encoded Source File with U+0441 #68

kuchungmsft · 2022-05-19T17:12:09Z

Encoding Error on Windows 10

git clone https://github.com/Kitware/CMake.git

git clone https://github.com/david-a-wheeler/flawfinder.git

python3 flawfinder\flawfinder.py CMake\Source\CTest\cmCTestBuildHandler.cxx

No Encoding Error on WSL

File Content


david-a-wheeler · 2022-05-19T17:25:21Z

This looks like the text isn't actually UTF-8 in the file being analyzed. Have you verified that the file being examined actually complies with UTF-8?

If it doesn't comply with UTF-8 (seems likely), see the documentation on various options. Sadly, Python3 doesn't provide good tools for handling non-UTF-8 text files.

kuchungmsft · 2022-05-19T18:44:02Z

Notepad++ thinks the file is UTF-8.

VS Code thinks the file is UTF-8.

Notepad thinks the file is UTF-8.

I think the file does comply with UTF-8.

david-a-wheeler · 2022-05-19T20:00:01Z

Please run "iconv" or some other tool that does byte-by-byte checking.

I think the editors just look at a few lines, and they may accept badly formatted data anyway. Python3 is extremely picky and immediately fails any time the text isn't perfect. Workarounds are documented.
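For readers without iconv, a byte-by-byte check can be sketched in Python. This is only an illustration, not part of flawfinder; the helper names `first_invalid_utf8_offset` and `check_file` are made up here. It reads the whole file as raw bytes and asks the strict UTF-8 decoder where it first fails.

```python
from pathlib import Path


def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8, or None if all is valid."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def check_file(path):
    """Validate every byte of the file, not just the first few lines."""
    return first_invalid_utf8_offset(Path(path).read_bytes())
```

Calling `check_file("cmCTestBuildHandler.cxx")` would return None for a file that is valid UTF-8 throughout, and a byte offset otherwise.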

david-a-wheeler · 2022-05-19T20:03:30Z

Also: if the character is just a literal 0x81 byte, that is not valid UTF-8.

kuchungmsft · 2022-05-19T21:03:13Z

My bad, it's character U+0441, I have updated the title.

Tried "iconv", no difference between original file and the converted one.

david-a-wheeler · 2022-05-19T22:09:46Z

Huh. That doesn't make sense to me at all. The sequence 0xd1 0x81 seems like perfectly fine UTF-8 to me, it shouldn't give you that error message.

So we agree it shouldn't happen. But clearly it's happening anyway :-).

Can you send me a URL for a mishandled file so I can just use curl/wget to get it? Ideally make the test file as small as possible while still causing the problem. I want to reproduce the problem with the smallest possible failing test. If I can reproduce it, I should be able to fix it, or at least explain it & suggest a workaround.

david-a-wheeler · 2022-05-19T22:11:03Z

Oh, weird thought: This is on Windows. Is it possible it's actually being stored as UTF-16? I doubt that's what is going on, but I'm grasping at straws and maybe this is the straw I needed :-).

kuchungmsft · 2022-05-19T23:19:42Z

Here is the link to that file: https://github.com/Kitware/CMake/blob/master/Source/CTest/cmCTestBuildHandler.cxx

I am not sure how to check if it's stored as UTF-16, I mean it's just a plain text file, I don't see any header when viewing it in hex.
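One way to check for UTF-16 is to look for a byte order mark (BOM) at the very start of the file. The sketch below is a hedged illustration using only the standard `codecs` constants; `sniff_bom` is a hypothetical helper name. A file without a BOM returns None, which does not prove it is UTF-8, only that no marker is present.

```python
import codecs


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    # UTF-32 BOMs are checked first because the UTF-32 LE BOM
    # begins with the same two bytes as the UTF-16 LE BOM.
    boms = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None
```

Usage would be something like `sniff_bom(open(path, "rb").read(4))`. UTF-16 text that is mostly ASCII also shows a NUL byte next to almost every character in a hex view, which is another quick hint.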

kuchungmsft · 2022-05-19T23:21:10Z

This character is what you need to reproduce the issue.

david-a-wheeler · 2022-05-20T15:10:10Z

You posted an image showing the file. However, I need the file contents itself. Can you post it somewhere (ideally shortened) & share the URL to it? A small snippet would be best for my purposes.

david-a-wheeler · 2022-05-20T16:09:02Z

Oh, whups, you did provide a link. Thank you.

I ran it on MacOS and it worked just fine. Below is the output.

Ugh, it seems to be a Windows 10 specific thing. I don't have any of those platforms.
I want to fix it, but I have to be able to reproduce it. Any ideas?

python3 ./flawfinder.py cmCTestBuildHandler.cxx 
Flawfinder version 2.0.19, (C) 2001-2019 David A. Wheeler.
Number of rules (primarily dangerous function names) in C/C++ ruleset: 222
Examining cmCTestBuildHandler.cxx

FINAL RESULTS:

cmCTestBuildHandler.cxx:6204:  [2] (misc) open:
  Check when opening files - can an attacker redirect it (via symlinks),
  force the opening of special file type (e.g., device files), move things
  around to create a race condition, control its ancestors, or change its
  contents? (CWE-362).

ANALYSIS SUMMARY:

Hits = 1
Lines analyzed = 6240 in approximately 0.28 seconds (22118 lines/second)
Physical Source Lines of Code (SLOC) = 364
Hits@level = [0]   0 [1]   0 [2]   1 [3]   0 [4]   0 [5]   0
Hits@level+ = [0+]   1 [1+]   1 [2+]   1 [3+]   0 [4+]   0 [5+]   0
Hits/KSLOC@level+ = [0+] 2.74725 [1+] 2.74725 [2+] 2.74725 [3+]   0 [4+]   0 [5+]   0
Minimum risk level = 1

Not every hit is necessarily a security vulnerability.
You can inhibit a report by adding a comment in this form:
// flawfinder: ignore
Make *sure* it's a false positive!
You can use the option --neverignore to show these.

There may be other security vulnerabilities; review your code!
See 'Secure Programming HOWTO'
(https://dwheeler.com/secure-programs) for more information.

kuchungmsft · 2022-05-20T18:01:18Z

Yeah, I couldn't repro it using WSL Ubuntu. It looks like this issue is not easy to tackle. I wonder whether detecting the encoding with chardet before opening the file would be an acceptable solution?

https://stackoverflow.com/questions/36303919/python-3-0-open-default-encoding

https://peps.python.org/pep-0597/

https://chardet.readthedocs.io/en/latest/usage.html#example-using-the-detect-function

david-a-wheeler · 2022-05-20T21:08:42Z

Hmm, it appears that on Windows the default encoding isn't what files use. That seems like a bug in the Windows implementation. I'd prefer flawfinder to NOT always assume UTF-8, because some systems don't use UTF-8. See: https://peps.python.org/pep-0597/ - it seems the "solution" is that people writing code are supposed to magically know what the file encoding is from users. That's ridiculous. I have no magic available. I need users to tell me what the encoding is, and use the default if they don't specify something.

david-a-wheeler · 2022-05-20T21:14:50Z

Hmm, it appears you're trying to process UTF-8 files, but the Windows default is NOT UTF-8, and that's the mismatch.
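The mismatch can be reproduced on any platform with two bytes. The sketch below assumes the Windows default in question is cp1252; the thread does not state the actual code page, so treat that as an assumption. `try_decode` is a hypothetical helper used only for this demonstration.

```python
def try_decode(data: bytes, encoding: str):
    """Return (text, None) on success or (None, error) if decoding fails."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


raw = b"\xd1\x81"  # the UTF-8 encoding of U+0441

utf8_text, utf8_error = try_decode(raw, "utf-8")
# Assumption: cp1252 stands in for the Windows default encoding here.
cp1252_text, cp1252_error = try_decode(raw, "cp1252")

print(repr(utf8_text), utf8_error)
print(repr(cp1252_text), cp1252_error)
```

Under UTF-8 the pair decodes to a single character. Under cp1252 the byte 0x81 has no mapping, so decoding fails on exactly that byte, which is consistent with the original report mentioning 0x81.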

Try this:

python3 -X utf8 flawfinder.py ....

or set PYTHONUTF8 to 1. In a shell do this:

Linux / macOS:
export PYTHONUTF8=1

Windows (cmd.exe):
set PYTHONUTF8=1

... then run flawfinder.
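To confirm whether the setting took effect, one can ask Python what it will use for text files. This is a small sketch using standard-library calls only; `describe_text_defaults` is a hypothetical name, and the exact spelling of the reported encoding can differ between Python versions.

```python
import locale
import sys


def describe_text_defaults():
    """Report whether UTF-8 mode is on and which encoding open() defaults to."""
    return {
        "utf8_mode": bool(sys.flags.utf8_mode),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


print(describe_text_defaults())
```

Running it once plainly and once with `-X utf8` (or with PYTHONUTF8 set to 1) should show `utf8_mode` switching from False to True.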

kuchungmsft · 2022-05-20T21:44:00Z

That CMake repo has Tests/RunCMake/CommandLine/cmake_depends/test_UTF-16LE.h in UTF-16 encoding. If I force UTF-8 by setting PYTHONUTF8=1, I get an encoding error on that file.

Is there a way to exclude certain folders that contain non-product code, in this case the Tests folder?

david-a-wheeler · 2022-05-21T19:51:34Z

There isn't an --exclude option, though that's not a bad idea. However, you can expressly list just the files and/or directories to scan, so just be more explicit about it.

However: can you tell me if PYTHONUTF8=1 resolves the problem with cmCTestBuildHandler.cxx ? If it does, then we're at least making progress.

Flawfinder doesn't have a way of scanning different files with different encodings. Most software developers wouldn't want to do that. If you have to do that, I suggest making a copy, changing all the source files to some consistent encoding, and then analyzing them.
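The "make a copy in one consistent encoding" suggestion could look roughly like the sketch below. It is not a flawfinder feature; `transcode_to_utf8` is a hypothetical helper, and the caller has to supply the source encoding because nothing here detects it. It works on raw bytes so that line endings are copied unchanged.

```python
from pathlib import Path


def transcode_to_utf8(src, dst, source_encoding):
    """Write a UTF-8 copy of src at dst; source_encoding must be known by the caller."""
    text = Path(src).read_bytes().decode(source_encoding)
    # Codecs such as "utf-16-le" leave the BOM in the text; drop it if present.
    if text.startswith("\ufeff"):
        text = text[1:]
    Path(dst).write_bytes(text.encode("utf-8"))
```

For example, `transcode_to_utf8("test_UTF-16LE.h", "copy/test_UTF-16LE.h", "utf-16")` would produce a UTF-8 copy for analysis, provided the `copy` directory already exists.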

kuchungmsft · 2022-05-23T16:16:32Z

Yes, PYTHONUTF8=1 resolves the problem with cmCTestBuildHandler.cxx, thanks. The suggestion to make a copy and have a consistent encoding would not work for me because test_UTF-16LE.h is meant to validate that CMake can handle UTF-16, just like compilers can handle inconsistent encoding of source files.

I guess I can work around it by analyzing only the Source folder instead of the entire repo. Thanks a lot for your help.
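When more than one folder needs scanning, the same workaround can be scripted by building the explicit file list and skipping unwanted directories. The sketch below is an illustration only: `collect_sources`, the `Tests` default, and the extension set are assumptions made for this example, not flawfinder behavior.

```python
import os

# Illustrative subset of C/C++ extensions; adjust to the project at hand.
C_EXTENSIONS = {".c", ".h", ".cc", ".cpp", ".cxx", ".hpp", ".hxx"}


def collect_sources(root, excluded_dirs=("Tests",)):
    """Return C/C++ files under root, never descending into excluded_dirs."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from entering those folders.
        dirnames[:] = sorted(d for d in dirnames if d not in excluded_dirs)
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in C_EXTENSIONS:
                found.append(os.path.join(dirpath, name))
    return found
```

The resulting paths can then be passed to flawfinder as explicit file arguments, which is the approach suggested earlier in the thread.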

david-a-wheeler · 2022-05-23T16:19:25Z

Okay, glad that PYTHONUTF8=1 solves the immediate problem.

I think I need to modify flawfinder to note this as an option - so don't close this yet.

Take care!

kuchungmsft changed the title ~~Character Encoding Error on UTF-8 Encoded Source File with Russian Character 0x81~~ Character Encoding Error on UTF-8 Encoded Source File with U+0441 May 19, 2022
