Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IncludesPass and CommentsPass have issues with files using "non-standard" encodings #59

Open
aytey opened this issue Jul 23, 2021 · 11 comments
Assignees

Comments

@aytey
Copy link
Contributor

aytey commented Jul 23, 2021

For:

cat unit.c
/*
* für
*/ 
file unit.c
unit.c: ISO-8859 text

running cvise --start-with-pass IncludesPass gives:

00:00:00 INFO ===< 118085 >===
00:00:00 INFO running 24 interestingness tests in parallel
00:00:00 INFO INITIAL PASSES
00:00:00 INFO ===< IncludesPass >===
00:00:00 WARNING IncludesPass has encountered a non fatal bug: pass got stuck
00:00:00 INFO ===< UnIfDefPass >===
00:00:00 INFO ===< CommentsPass >===
00:00:00 WARNING CommentsPass has encountered a non fatal bug: pass got stuck
00:00:00 INFO ===< IfPass >===
Traceback (most recent call last):
  File "/usr/bin/cvise", line 286, in <module>
    reducer.reduce(pass_group, skip_initial=args.skip_initial_passes)
  File "/usr/share/cvise/cvise.py", line 143, in reduce
    self._run_additional_passes(pass_group['first'])
  File "/usr/share/cvise/cvise.py", line 166, in _run_additional_passes
    self.test_manager.run_pass(p)
  File "/usr/share/cvise/utils/testing.py", line 503, in run_pass
    self.state = self.current_pass.new(self.current_test_case, self.check_sanity)
  File "/usr/share/cvise/passes/ifs.py", line 36, in new
    bs = BinaryState.create(self.__count_instances(test_case))
  File "/usr/share/cvise/passes/ifs.py", line 22, in __count_instances
    for line in in_file.readlines():
  File "/usr/lib64/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 6: invalid start byte

(this was after editing cvise/utils/testing.py to set GIVEUP_CONSTANT to 1)

and creates:

Package: cvise 2.3.0
Git version: unknown
LLVM version: 12.0.0
System: uname_result(system='Linux', node='vistrrdslin0001', release='5.12.13-1-default', version='#1 SMP Mon Jun 28 06:37:23 UTC 2021 (74bd8c0)', machine='x86_64', processor='x86_64')
***************************************************

IncludesPass has encountered a bug:
pass got stuck
state: 2

Please consider tarring up cvise_bug_0
and creating an issue at https://github.com/marxin/cvise/issues and we will try to fix the bug.

***************************************************

and

Package: cvise 2.3.0
Git version: unknown
LLVM version: 12.0.0
System: uname_result(system='Linux', node='vistrrdslin0001', release='5.12.13-1-default', version='#1 SMP Mon Jun 28 06:37:23 UTC 2021 (74bd8c0)', machine='x86_64', processor='x86_64')
***************************************************

CommentsPass has encountered a bug:
pass got stuck
state: -1

Please consider tarring up cvise_bug_1
and creating an issue at https://github.com/marxin/cvise/issues and we will try to fix the bug.

***************************************************
@marxin
Copy link
Owner

marxin commented Jul 28, 2021

Well, it's technically a bug, but I don't see why would somebody use a non-utf8 encoding for files?
Doctor it hurts. Then don't do it :D

@aytey
Copy link
Contributor Author

aytey commented Jul 29, 2021

I don't see why would somebody use a non-utf8 encoding for files?

The tool my company builds (and one of the reasons I use cvise) automatically builds data-driven unit test "wrappers" for C/C++ -- as part of this, we just "gobble up" whatever code our customers have.

Our customers 100% have code that isn't utf8/ascii (the above comes from a support case I was helping with).

I think that adding something like:

(code I wrote) would make cvise more robust here.

Would you be happy if I made some utility function like safe_open that wraps open along with a call to get_file_encoding)? I'm thinking that something like:

could then use safe_open vs. "regular" open.

Thoughts?

Alternatively maybe cvise should check if a file is openable as utf8 before starting, and "fail early" (e.g., as part of the initial sanity check) if it isn't?

@marxin
Copy link
Owner

marxin commented Aug 2, 2021

Thank you for your ideas. So you're using a Python package chardet which seems reasonable to me.
A question: if you have a non-utf8 test-case, what about converting it first to utf8 and then let cvise run? At the end you can convert it to a given encoding.
The problem I see with safe_open is that we'll need to use it at various places. Similarly for the writing of the output.

@aytey
Copy link
Contributor Author

aytey commented Aug 2, 2021

How about if cvise (as part of the "initial sanity check") validates if the file can be opened as utf8? If it can't, we have three options:

  1. Bail-out and tell the user "sorry, unsupported"
  2. Add a flag that says "convert to utf8 and try as normal, I understand if the initial tests now might fail and cvise will give up"
  3. Add a flag that says "automatically skip any transformations that need utf8"

We could actually do all three of these; without 2 or 3, 1 happens. If you have 2, then 3 is redundant. If you have 3, 2 is redundant (and it won't run things like IncludesPass).

Thoughts?

@marxin
Copy link
Owner

marxin commented Aug 2, 2021

Well, I like doing 2) using chardet. That's going to help in most cases (hopefully)..

@marxin marxin self-assigned this Aug 2, 2021
marxin added a commit that referenced this issue Aug 13, 2021
@marxin
Copy link
Owner

marxin commented Aug 13, 2021

Well, I like doing 2) using chardet. That's going to help in most cases (hopefully)..

@andrewvaughanj I've just implemented the first version, can you please take a look?

@aytey
Copy link
Contributor Author

aytey commented Aug 13, 2021

I can try this next week; should args.to_utf8 have a "sanity check" afterwards?

@marxin
Copy link
Owner

marxin commented Aug 13, 2021

Well, the conversion does happen before the normal sanity check.
Thanks for testing.

@marxin
Copy link
Owner

marxin commented Oct 22, 2021

Any update about the testing, please?

@marxin
Copy link
Owner

marxin commented Jan 5, 2022

Any update about the testing, please?

PING :)

@marxin
Copy link
Owner

marxin commented Mar 22, 2022

Any news about this @andrewvaughanj ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants